...

View Full Version : Parsing HTML



Apothem
02-23-2011, 10:43 PM
I know I can download and install modules for parsing HTML, but I do not have root privileges to install. Therefore I would like to ask the following questions:

If I made an HTML parsing module file (1 py file), will it be inefficient? i.e. will it be slow/use more than needed memory?
Would splitting up the string into multiple parts be faster (like by using BNF notation), or is it faster to use str.find() and then substring it all?

oesxyl
02-25-2011, 12:13 AM
I know I can download and install modules for parsing HTML, but I do not have root privileges to install.
HTMLParser is standard module since python 2.2, is istalled with python:

http://docs.python.org/release/2.6.6/library/htmlparser.html


Therefore I would like to ask the following questions:

If I made an HTML parsing module file (1 py file), will it be inefficient? i.e. will it be slow/use more than needed memory?
Would splitting up the string into multiple parts be faster (like by using BNF notation), or is it faster to use str.find() and then substring it all?

from second question seems that you don't know to much about parser so answer to the first question is yes, will be inefficient. Probably you can learn if you want but will take some time. There are some tools for this:

http://nedbatchelder.com/text/python-parsers.html
http://onlamp.com/pub/a/python/2006/01/26/pyparsing.html

best regards

Apothem
02-25-2011, 06:36 AM
The funny thing is that just today I found that (HTMLParser) page; I wasn't searching the right terms before.

I'm not too savvy when it comes to performance and when I search up, but the way I've set my class is as such:
Method #1
1) Every HTML tag and tag attribute is stored into a node.
2) Each node is within a nodelist's list (not dict) as a reference
3) Every node is also a nodelist, but the cardinality of the nodelist can be 0
4) Two nodes are adjacent iff they are nested within the same tag block (i.e. "title" and "script" are adjacent if they are inside a "head" tag).
5) Every adjacent node can getPrev or getNext to get the tag behind/in front of it, respectively, if there is none it will return None
6) A nodelist has the methods getElementById, getElementsByTagName, getElementsByClassName
7) Each of the getElement(s) method iterates through all of the nodes in nodelists, including the nodes within each of the node's nodelist (recursively), to find matching ids, tag name, or class name.

As such the list looks similar to this:

body
div ('id'='container')
h2 ('class': 'title')
div ('class': 'meta')
div ('class': 'content')
p
div ('id'='footer')
ul
li
li

So here's another method I originally did, but thought it... used extra memory?
Method #2
1) Has members tagnames, classes, and ids in nodelist, each of which are dicts that contains a list (not dict) of node references.
2) Each node is within the (single) nodelist's tagnames (is a list) as a reference
3) Each node has a predecessor and successor member. The only difference is that the parent of a node can be obtained through getPrev (i.e. being in the same tag block does not matter).
4) Because there are members for getElement(s)By(Id/ClassName/TagName)(), it only needs to either: a) return the self.ids['myid'], b) return self.classes['classname'], or c) return self.tagnames['tagname']

Also, for each "run" I get about 8 class/tag names from a page that has a total of about 4000 tags total

Based on my descriptions, would it have been better if I stuck with Method #2, or is Method #1 fine as is? Or would it be better to use regular expressions?

oesxyl
02-25-2011, 05:15 PM
The funny thing is that just today I found that (HTMLParser) page; I wasn't searching the right terms before.

I'm not too savvy when it comes to performance and when I search up, but the way I've set my class is as such:
Method #1
1) Every HTML tag and tag attribute is stored into a node.
2) Each node is within a nodelist's list (not dict) as a reference
3) Every node is also a nodelist, but the cardinality of the nodelist can be 0
4) Two nodes are adjacent iff they are nested within the same tag block (i.e. "title" and "script" are adjacent if they are inside a "head" tag).
5) Every adjacent node can getPrev or getNext to get the tag behind/in front of it, respectively, if there is none it will return None
6) A nodelist has the methods getElementById, getElementsByTagName, getElementsByClassName
7) Each of the getElement(s) method iterates through all of the nodes in nodelists, including the nodes within each of the node's nodelist (recursively), to find matching ids, tag name, or class name.

As such the list looks similar to this:

body
div ('id'='container')
h2 ('class': 'title')
div ('class': 'meta')
div ('class': 'content')
p
div ('id'='footer')
ul
li
li

So here's another method I originally did, but thought it... used extra memory?
Method #2
1) Has members tagnames, classes, and ids in nodelist, each of which are dicts that contains a list (not dict) of node references.
2) Each node is within the (single) nodelist's tagnames (is a list) as a reference
3) Each node has a predecessor and successor member. The only difference is that the parent of a node can be obtained through getPrev (i.e. being in the same tag block does not matter).
4) Because there are members for getElement(s)By(Id/ClassName/TagName)(), it only needs to either: a) return the self.ids['myid'], b) return self.classes['classname'], or c) return self.tagnames['tagname']

Also, for each "run" I get about 8 class/tag names from a page that has a total of about 4000 tags total
imo method 2 is better. Some things you can think about:
- relative to 2#1 ( notation method#item ), id's and classes are attributes so you need to have only elements( tagnames ) and attributes
- a page is a tree where the root of the tree is 'html' with two leafs 'head' and 'body', and you can continue this way to describe the whole page( something like what you described in method 1#). That is why i think the best way is to implement a class tree and extend it after.


Based on my descriptions, would it have been better if I stuck with Method #2, or is Method #1 fine as is? Or would it be better to use regular expressions?
there three things in a parser:
1. lexical analyzer, code to identifiy each token in the input, you can use here regex, string functions or whatever you find that is easy and fast to do

http://en.wikipedia.org/wiki/Lexical_analysis

2. syntactic analyzer, code to assemble tokens into a AST, abstract syntax tree

http://en.wikipedia.org/wiki/Parser
http://en.wikipedia.org/wiki/Abstract_syntax_tree

3. code to fill the data structure( what you described in methods 1# and 2#) with the results ( the AST).

usualy 2 and 3 are combined in a single step, sometimes all three are combined.

best regards

Samhain13
02-28-2011, 11:54 PM
Easy way out? Use BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ :D

Apothem
03-01-2011, 10:23 AM
Yes well, parsing HTML just wasn't a good idea in my end. I'm going to just use plain old regular expressions to search for things. If you are able to prove that BeautifulSoup is much faster than regular expressions in matching a) for an element of x ID and b) elements of a certain class, and c) can elements before an element + its parent.

Samhain13
03-01-2011, 12:03 PM
Oooh, lots of machismo going around. "If you can prove that...."

Here's the thing. If you can get the League Results table from this page:
http://www.footballalliance.ph/league/Results.php

and return it as JSON without using any sort of SGML/HTML/XML parser and using only regular expressions, for updating in real-time the League Table in this page:
http://www.pinoyfootball.com/News

perhaps, we'll talk more.

And while you're at it, let's see you get the links and titles of the latest active threads from this page:
http://usapangfootball.proboards.com/index.cgi?action=newestthreads

so that they can be rendered as a navigation list as what is done in this page:
http://www.pinoyfootball.com

Your challenge assumes that you're getting well-formed, even valid HTML source code. But the real world is full of soup. What happens if the source code you're evaluating has two, three elements that have the same ID? What happens when you're parsing elements that have multiple classes that are defined in varying order, like:


<a href="#" class="nav-link no-underline primary-link">Some String</a>
<a href="#" class="no-underline secondary-link nav-link">Another String</a>
<a href="#" class="secondary-link underline-this"><b>Yet Another String</b></a>

Apothem
03-01-2011, 01:16 PM
It seems you are making a lot of incorrect assumptions about what I know and what I want. Your response is a fallacy. Just because you know that this module makes scraping easier does not mean its performance is better than using RegEx. Therefore your retort is invalid.

Regardless, this thread is done. My questions have "long" been (more than needed to) answered by oesxyl.

Samhain13
03-01-2011, 02:00 PM
Riiiight... :thumbsup:

oesxyl
03-01-2011, 02:46 PM
It seems you are making a lot of incorrect assumptions about what I know and what I want. Your response is a fallacy. Just because you know that this module makes scraping easier does not mean its performance is better than using RegEx. Therefore your retort is invalid.

Regardless, this thread is done. My questions have "long" been (more than needed to) answered by oesxyl.
thank you, :)
i agree that using regex in a specific case for a given task, is many time faster then a general parser. But are only few parser who implement error recovery and dealing with errors and invalid markup, Samhain13 is right about BeautifulSoup. I just notice that i didn't even mention about this problem, my bad, :)

best regards



EZ Archive Ads Plugin for vBulletin Copyright 2006 Computer Help Forum