Go Back   CodingForums.com > :: Server side development > Other server side languages/ issues > Python

Before you post, read our: Rules & Posting Guidelines

Reply
 
Thread Tools Rate Thread
Enjoy an ad free experience by logging in. Not a member yet? Register.
Old 02-23-2011, 10:43 PM   PM User | #1
Apothem
Regular Coder

 
Apothem's Avatar
 
Join Date: Mar 2008
Posts: 380
Thanks: 36
Thanked 25 Times in 25 Posts
Apothem is an unknown quantity at this point
Parsing HTML

I know I can download and install modules for parsing HTML, but I do not have root privileges to install. Therefore I would like to ask the following questions:
  • If I made an HTML parsing module file (1 py file), will it be inefficient? i.e. will it be slow/use more than needed memory?
  • Would splitting up the string into multiple parts be faster (like by using BNF notation), or is it faster to use str.find() and then substring it all?
Apothem is offline   Reply With Quote
Old 02-25-2011, 12:13 AM   PM User | #2
oesxyl
Master Coder


 
Join Date: Dec 2007
Posts: 6,682
Thanks: 436
Thanked 890 Times in 879 Posts
oesxyl is a jewel in the roughoesxyl is a jewel in the roughoesxyl is a jewel in the rough
Quote:
Originally Posted by Apothem View Post
I know I can download and install modules for parsing HTML, but I do not have root privileges to install.
HTMLParser is standard module since python 2.2, is istalled with python:

http://docs.python.org/release/2.6.6...tmlparser.html

Quote:
Therefore I would like to ask the following questions:
  • If I made an HTML parsing module file (1 py file), will it be inefficient? i.e. will it be slow/use more than needed memory?
  • Would splitting up the string into multiple parts be faster (like by using BNF notation), or is it faster to use str.find() and then substring it all?
from second question seems that you don't know to much about parser so answer to the first question is yes, will be inefficient. Probably you can learn if you want but will take some time. There are some tools for this:

http://nedbatchelder.com/text/python-parsers.html
http://onlamp.com/pub/a/python/2006/...pyparsing.html

best regards
oesxyl is offline   Reply With Quote
Users who have thanked oesxyl for this post:
Apothem (02-25-2011)
Old 02-25-2011, 06:36 AM   PM User | #3
Apothem
Regular Coder

 
Apothem's Avatar
 
Join Date: Mar 2008
Posts: 380
Thanks: 36
Thanked 25 Times in 25 Posts
Apothem is an unknown quantity at this point
The funny thing is that just today I found that (HTMLParser) page; I wasn't searching the right terms before.

I'm not too savvy when it comes to performance and when I search up, but the way I've set my class is as such:
Method #1
1) Every HTML tag and tag attribute is stored into a node.
2) Each node is within a nodelist's list (not dict) as a reference
3) Every node is also a nodelist, but the cardinality of the nodelist can be 0
4) Two nodes are adjacent iff they are nested within the same tag block (i.e. "title" and "script" are adjacent if they are inside a "head" tag).
5) Every adjacent node can getPrev or getNext to get the tag behind/in front of it, respectively, if there is none it will return None
6) A nodelist has the methods getElementById, getElementsByTagName, getElementsByClassName
7) Each of the getElement(s) method iterates through all of the nodes in nodelists, including the nodes within each of the node's nodelist (recursively), to find matching ids, tag name, or class name.

As such the list looks similar to this:
Code:
body
	div ('id'='container')
		h2 ('class': 'title')
		div ('class': 'meta')
		div ('class': 'content')
			p
	div ('id'='footer')
		ul
			li
			li
So here's another method I originally did, but thought it... used extra memory?
Method #2
1) Has members tagnames, classes, and ids in nodelist, each of which are dicts that contains a list (not dict) of node references.
2) Each node is within the (single) nodelist's tagnames (is a list) as a reference
3) Each node has a predecessor and successor member. The only difference is that the parent of a node can be obtained through getPrev (i.e. being in the same tag block does not matter).
4) Because there are members for getElement(s)By(Id/ClassName/TagName)(), it only needs to either: a) return the self.ids['myid'], b) return self.classes['classname'], or c) return self.tagnames['tagname']

Also, for each "run" I get about 8 class/tag names from a page that has a total of about 4000 tags total

Based on my descriptions, would it have been better if I stuck with Method #2, or is Method #1 fine as is? Or would it be better to use regular expressions?

Last edited by Apothem; 02-25-2011 at 06:42 AM..
Apothem is offline   Reply With Quote
Old 02-25-2011, 05:15 PM   PM User | #4
oesxyl
Master Coder


 
Join Date: Dec 2007
Posts: 6,682
Thanks: 436
Thanked 890 Times in 879 Posts
oesxyl is a jewel in the roughoesxyl is a jewel in the roughoesxyl is a jewel in the rough
Quote:
Originally Posted by Apothem View Post
The funny thing is that just today I found that (HTMLParser) page; I wasn't searching the right terms before.

I'm not too savvy when it comes to performance and when I search up, but the way I've set my class is as such:
Method #1
1) Every HTML tag and tag attribute is stored into a node.
2) Each node is within a nodelist's list (not dict) as a reference
3) Every node is also a nodelist, but the cardinality of the nodelist can be 0
4) Two nodes are adjacent iff they are nested within the same tag block (i.e. "title" and "script" are adjacent if they are inside a "head" tag).
5) Every adjacent node can getPrev or getNext to get the tag behind/in front of it, respectively, if there is none it will return None
6) A nodelist has the methods getElementById, getElementsByTagName, getElementsByClassName
7) Each of the getElement(s) method iterates through all of the nodes in nodelists, including the nodes within each of the node's nodelist (recursively), to find matching ids, tag name, or class name.

As such the list looks similar to this:
Code:
body
	div ('id'='container')
		h2 ('class': 'title')
		div ('class': 'meta')
		div ('class': 'content')
			p
	div ('id'='footer')
		ul
			li
			li
So here's another method I originally did, but thought it... used extra memory?
Method #2
1) Has members tagnames, classes, and ids in nodelist, each of which are dicts that contains a list (not dict) of node references.
2) Each node is within the (single) nodelist's tagnames (is a list) as a reference
3) Each node has a predecessor and successor member. The only difference is that the parent of a node can be obtained through getPrev (i.e. being in the same tag block does not matter).
4) Because there are members for getElement(s)By(Id/ClassName/TagName)(), it only needs to either: a) return the self.ids['myid'], b) return self.classes['classname'], or c) return self.tagnames['tagname']

Also, for each "run" I get about 8 class/tag names from a page that has a total of about 4000 tags total
imo method 2 is better. Some things you can think about:
- relative to 2#1 ( notation method#item ), id's and classes are attributes so you need to have only elements( tagnames ) and attributes
- a page is a tree where the root of the tree is 'html' with two leafs 'head' and 'body', and you can continue this way to describe the whole page( something like what you described in method 1#). That is why i think the best way is to implement a class tree and extend it after.

Quote:
Based on my descriptions, would it have been better if I stuck with Method #2, or is Method #1 fine as is? Or would it be better to use regular expressions?
there three things in a parser:
1. lexical analyzer, code to identifiy each token in the input, you can use here regex, string functions or whatever you find that is easy and fast to do

http://en.wikipedia.org/wiki/Lexical_analysis

2. syntactic analyzer, code to assemble tokens into a AST, abstract syntax tree

http://en.wikipedia.org/wiki/Parser
http://en.wikipedia.org/wiki/Abstract_syntax_tree

3. code to fill the data structure( what you described in methods 1# and 2#) with the results ( the AST).

usualy 2 and 3 are combined in a single step, sometimes all three are combined.

best regards
oesxyl is offline   Reply With Quote
Users who have thanked oesxyl for this post:
Apothem (02-26-2011)
Old 02-28-2011, 11:54 PM   PM User | #5
Samhain13
Regular Coder

 
Samhain13's Avatar
 
Join Date: Aug 2008
Location: Pilipinas
Posts: 165
Thanks: 4
Thanked 18 Times in 18 Posts
Samhain13 is on a distinguished road
Easy way out? Use BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
__________________
I am a Man of Truth. I am a Free Human Person. I am a Peacemaker.
** Independent Multimedia Artist in Pasig **
Samhain13 is offline   Reply With Quote
Old 03-01-2011, 10:23 AM   PM User | #6
Apothem
Regular Coder

 
Apothem's Avatar
 
Join Date: Mar 2008
Posts: 380
Thanks: 36
Thanked 25 Times in 25 Posts
Apothem is an unknown quantity at this point
Yes well, parsing HTML just wasn't a good idea in my end. I'm going to just use plain old regular expressions to search for things. If you are able to prove that BeautifulSoup is much faster than regular expressions in matching a) for an element of x ID and b) elements of a certain class, and c) can elements before an element + its parent.
Apothem is offline   Reply With Quote
Old 03-01-2011, 12:03 PM   PM User | #7
Samhain13
Regular Coder

 
Samhain13's Avatar
 
Join Date: Aug 2008
Location: Pilipinas
Posts: 165
Thanks: 4
Thanked 18 Times in 18 Posts
Samhain13 is on a distinguished road
Oooh, lots of machismo going around. "If you can prove that...."

Here's the thing. If you can get the League Results table from this page:
http://www.footballalliance.ph/league/Results.php

and return it as JSON without using any sort of SGML/HTML/XML parser and using only regular expressions, for updating in real-time the League Table in this page:
http://www.pinoyfootball.com/News

perhaps, we'll talk more.

And while you're at it, let's see you get the links and titles of the latest active threads from this page:
http://usapangfootball.proboards.com...=newestthreads

so that they can be rendered as a navigation list as what is done in this page:
http://www.pinoyfootball.com

Your challenge assumes that you're getting well-formed, even valid HTML source code. But the real world is full of soup. What happens if the source code you're evaluating has two, three elements that have the same ID? What happens when you're parsing elements that have multiple classes that are defined in varying order, like:

Code:
<a href="#" class="nav-link no-underline primary-link">Some String</a>
<a href="#" class="no-underline secondary-link nav-link">Another String</a>
<a href="#" class="secondary-link underline-this"><b>Yet Another String</b></a>
__________________
I am a Man of Truth. I am a Free Human Person. I am a Peacemaker.
** Independent Multimedia Artist in Pasig **

Last edited by Samhain13; 03-01-2011 at 12:09 PM..
Samhain13 is offline   Reply With Quote
Old 03-01-2011, 01:16 PM   PM User | #8
Apothem
Regular Coder

 
Apothem's Avatar
 
Join Date: Mar 2008
Posts: 380
Thanks: 36
Thanked 25 Times in 25 Posts
Apothem is an unknown quantity at this point
It seems you are making a lot of incorrect assumptions about what I know and what I want. Your response is a fallacy. Just because you know that this module makes scraping easier does not mean its performance is better than using RegEx. Therefore your retort is invalid.

Regardless, this thread is done. My questions have "long" been (more than needed to) answered by oesxyl.
Apothem is offline   Reply With Quote
Old 03-01-2011, 02:00 PM   PM User | #9
Samhain13
Regular Coder

 
Samhain13's Avatar
 
Join Date: Aug 2008
Location: Pilipinas
Posts: 165
Thanks: 4
Thanked 18 Times in 18 Posts
Samhain13 is on a distinguished road
Riiiight...
__________________
I am a Man of Truth. I am a Free Human Person. I am a Peacemaker.
** Independent Multimedia Artist in Pasig **
Samhain13 is offline   Reply With Quote
Old 03-01-2011, 02:46 PM   PM User | #10
oesxyl
Master Coder


 
Join Date: Dec 2007
Posts: 6,682
Thanks: 436
Thanked 890 Times in 879 Posts
oesxyl is a jewel in the roughoesxyl is a jewel in the roughoesxyl is a jewel in the rough
Quote:
Originally Posted by Apothem View Post
It seems you are making a lot of incorrect assumptions about what I know and what I want. Your response is a fallacy. Just because you know that this module makes scraping easier does not mean its performance is better than using RegEx. Therefore your retort is invalid.

Regardless, this thread is done. My questions have "long" been (more than needed to) answered by oesxyl.
thank you,
i agree that using regex in a specific case for a given task, is many time faster then a general parser. But are only few parser who implement error recovery and dealing with errors and invalid markup, Samhain13 is right about BeautifulSoup. I just notice that i didn't even mention about this problem, my bad,

best regards
oesxyl is offline   Reply With Quote
Reply

Bookmarks

Jump To Top of Thread


Thread Tools
Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +1. The time now is 10:30 AM.


Advertisement
Log in to turn off these ads.