Thread: Parsing HTML
02-25-2011, 06:36 AM | #3
Apothem
The funny thing is that just today I found that (HTMLParser) page; I wasn't searching with the right terms before.

I'm not too savvy when it comes to performance, or at searching for information about it, but this is how I've set up my class:
Method #1
1) Every HTML tag, along with its attributes, is stored in a node.
2) Each node is held by reference in a nodelist's list (a list, not a dict).
3) Every node is also a nodelist, but the cardinality of that nodelist can be 0.
4) Two nodes are adjacent iff they are nested within the same tag block (i.e. "title" and "script" are adjacent if they are both inside a "head" tag).
5) Every node has getPrev and getNext, which return the adjacent tag behind or in front of it, respectively, or None if there is none.
6) A nodelist has the methods getElementById, getElementsByTagName, and getElementsByClassName.
7) Each of the getElement(s) methods iterates through all of the nodes in the nodelist, including the nodes within each node's own nodelist (recursively), to find matching ids, tag names, or class names.

As such the list looks similar to this:
Code:
body
	div ('id': 'container')
		h2 ('class': 'title')
		div ('class': 'meta')
		div ('class': 'content')
			p
	div ('id': 'footer')
		ul
			li
			li
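
A minimal sketch of Method #1 as described above, written in modern Python. The names Node, append, children, parent, _sibling, and _walk are assumptions made for illustration and are not taken from the original class; only getPrev, getNext, and the three getElement(s) methods come from the description. It builds the same tree as the listing above.

```python
class Node:
    """One HTML tag. Every node is also a nodelist of its child tags."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.parent = None      # hypothetical back-reference, used for siblings
        self.children = []      # plain list (not a dict); may be empty

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def _sibling(self, offset):
        # Adjacent nodes are those sharing the same parent tag block.
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = next(k for k, n in enumerate(siblings) if n is self) + offset
        return siblings[i] if 0 <= i < len(siblings) else None

    def getPrev(self):
        return self._sibling(-1)

    def getNext(self):
        return self._sibling(1)

    def _walk(self):
        # Depth-first walk over every descendant, recursively.
        for child in self.children:
            yield child
            yield from child._walk()

    def getElementById(self, id_):
        return next((n for n in self._walk() if n.attrs.get("id") == id_), None)

    def getElementsByTagName(self, tag):
        return [n for n in self._walk() if n.tag == tag]

    def getElementsByClassName(self, name):
        return [n for n in self._walk()
                if name in n.attrs.get("class", "").split()]


# Build the example tree from the listing above.
body = Node("body")
container = body.append(Node("div", {"id": "container"}))
title = container.append(Node("h2", {"class": "title"}))
meta = container.append(Node("div", {"class": "meta"}))
content = container.append(Node("div", {"class": "content"}))
content.append(Node("p"))
footer = body.append(Node("div", {"id": "footer"}))
ul = footer.append(Node("ul"))
ul.append(Node("li"))
ul.append(Node("li"))
```

In this sketch every lookup walks the whole subtree, so each call costs time proportional to the number of nodes, while the only extra memory per node is the parent reference.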
Here's another method I originally tried, but I thought it used extra memory:
Method #2
1) The nodelist has the members tagnames, classes, and ids, each of which is a dict whose values are lists (not dicts) of node references.
2) Each node is held by reference in the (single) nodelist's tagnames, in the list for its tag name.
3) Each node has a predecessor and a successor member. The only difference from Method #1 is that the parent of a node can be obtained through getPrev (i.e. being in the same tag block does not matter).
4) Because the nodelist has members backing getElement(s)By(Id/ClassName/TagName)(), each method only needs to: a) return self.ids['myid'], b) return self.classes['classname'], or c) return self.tagnames['tagname'].
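
A minimal sketch of Method #2 under the same caveats. The class names IndexedNode and IndexedNodeList, the add method, and the _last member are hypothetical; tagnames, classes, ids, getPrev, and getNext follow the description above. One assumption differs slightly from the description: getElementById here returns the first match or None rather than the whole list.

```python
from collections import defaultdict


class IndexedNode:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.prev = None    # predecessor in document order (may be the parent)
        self.next = None    # successor in document order

    def getPrev(self):
        return self.prev

    def getNext(self):
        return self.next


class IndexedNodeList:
    """Single nodelist holding three dicts of lists of node references."""

    def __init__(self):
        self.tagnames = defaultdict(list)
        self.classes = defaultdict(list)
        self.ids = defaultdict(list)
        self._last = None   # hypothetical helper: last node added so far

    def add(self, node):
        # Nodes are expected to be added in document order.
        if self._last is not None:
            self._last.next = node
            node.prev = self._last
        self._last = node
        self.tagnames[node.tag].append(node)
        for name in node.attrs.get("class", "").split():
            self.classes[name].append(node)
        if "id" in node.attrs:
            self.ids[node.attrs["id"]].append(node)
        return node

    def getElementById(self, id_):
        matches = self.ids.get(id_)
        return matches[0] if matches else None

    def getElementsByTagName(self, tag):
        # .get avoids creating empty entries in the defaultdict.
        return self.tagnames.get(tag, [])

    def getElementsByClassName(self, name):
        return self.classes.get(name, [])


# Same example page, flattened into document order.
nodes = IndexedNodeList()
body_node = nodes.add(IndexedNode("body"))
container_node = nodes.add(IndexedNode("div", {"id": "container"}))
nodes.add(IndexedNode("h2", {"class": "title"}))
nodes.add(IndexedNode("div", {"class": "meta"}))
nodes.add(IndexedNode("div", {"class": "content"}))
nodes.add(IndexedNode("p"))
footer_node = nodes.add(IndexedNode("div", {"id": "footer"}))
nodes.add(IndexedNode("ul"))
nodes.add(IndexedNode("li"))
nodes.add(IndexedNode("li"))
```

The extra memory is one list slot per node in tagnames, plus one per class name and per id, since the dicts store references rather than copies of the nodes.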

Also, for each "run" I look up about 8 class/tag names on a page that has about 4000 tags in total.
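
As a rough back-of-the-envelope comparison using those figures, and assuming that every Method #1 lookup visits every node while Method #2 touches each node once to build its dicts and then does one dict access per lookup, the node visits per run work out as follows. These are counts under that assumption, not measurements, and both methods still pay the cost of parsing the page.

```python
TOTAL_TAGS = 4000        # approximate tags per page, from the post
LOOKUPS_PER_RUN = 8      # approximate class/tag name lookups per run

# Method #1: each lookup walks the whole tree.
method1_visits = TOTAL_TAGS * LOOKUPS_PER_RUN

# Method #2: index every node once, then one dict access per lookup.
method2_visits = TOTAL_TAGS + LOOKUPS_PER_RUN

print(method1_visits)    # 32000
print(method2_visits)    # 4008
```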

Based on my descriptions, would I have been better off sticking with Method #2, or is Method #1 fine as is? Or would it be better to use regular expressions?

Last edited by Apothem; 02-25-2011 at 06:42 AM..