Hello and welcome to our community! Is this your first visit?
Register
Enjoy an ad free experience by logging in. Not a member yet? Register.
Results 1 to 10 of 10

Thread: Parsing HTML

  1. #1
    Regular Coder Apothem's Avatar
    Join Date
    Mar 2008
    Posts
    380
    Thanks
    36
    Thanked 25 Times in 25 Posts

    Parsing HTML

    I know I can download and install modules for parsing HTML, but I do not have root privileges to install. Therefore I would like to ask the following questions:
    • If I made an HTML parsing module file (1 py file), will it be inefficient? i.e. will it be slow/use more than needed memory?
    • Would splitting up the string into multiple parts be faster (like by using BNF notation), or is it faster to use str.find() and then substring it all?

  • #2
    Master Coder
    Join Date
    Dec 2007
    Posts
    6,682
    Thanks
    436
    Thanked 890 Times in 879 Posts
    Quote Originally Posted by Apothem View Post
    I know I can download and install modules for parsing HTML, but I do not have root privileges to install.
    HTMLParser is standard module since python 2.2, is istalled with python:

    http://docs.python.org/release/2.6.6...tmlparser.html

    Therefore I would like to ask the following questions:
    • If I made an HTML parsing module file (1 py file), will it be inefficient? i.e. will it be slow/use more than needed memory?
    • Would splitting up the string into multiple parts be faster (like by using BNF notation), or is it faster to use str.find() and then substring it all?
    from second question seems that you don't know to much about parser so answer to the first question is yes, will be inefficient. Probably you can learn if you want but will take some time. There are some tools for this:

    http://nedbatchelder.com/text/python-parsers.html
    http://onlamp.com/pub/a/python/2006/...pyparsing.html

    best regards

  • Users who have thanked oesxyl for this post:

    Apothem (02-25-2011)

  • #3
    Regular Coder Apothem's Avatar
    Join Date
    Mar 2008
    Posts
    380
    Thanks
    36
    Thanked 25 Times in 25 Posts
    The funny thing is that just today I found that (HTMLParser) page; I wasn't searching the right terms before.

    I'm not too savvy when it comes to performance and when I search up, but the way I've set my class is as such:
    Method #1
    1) Every HTML tag and tag attribute is stored into a node.
    2) Each node is within a nodelist's list (not dict) as a reference
    3) Every node is also a nodelist, but the cardinality of the nodelist can be 0
    4) Two nodes are adjacent iff they are nested within the same tag block (i.e. "title" and "script" are adjacent if they are inside a "head" tag).
    5) Every adjacent node can getPrev or getNext to get the tag behind/in front of it, respectively, if there is none it will return None
    6) A nodelist has the methods getElementById, getElementsByTagName, getElementsByClassName
    7) Each of the getElement(s) method iterates through all of the nodes in nodelists, including the nodes within each of the node's nodelist (recursively), to find matching ids, tag name, or class name.

    As such the list looks similar to this:
    Code:
    body
    	div ('id'='container')
    		h2 ('class': 'title')
    		div ('class': 'meta')
    		div ('class': 'content')
    			p
    	div ('id'='footer')
    		ul
    			li
    			li
    So here's another method I originally did, but thought it... used extra memory?
    Method #2
    1) Has members tagnames, classes, and ids in nodelist, each of which are dicts that contains a list (not dict) of node references.
    2) Each node is within the (single) nodelist's tagnames (is a list) as a reference
    3) Each node has a predecessor and successor member. The only difference is that the parent of a node can be obtained through getPrev (i.e. being in the same tag block does not matter).
    4) Because there are members for getElement(s)By(Id/ClassName/TagName)(), it only needs to either: a) return the self.ids['myid'], b) return self.classes['classname'], or c) return self.tagnames['tagname']

    Also, for each "run" I get about 8 class/tag names from a page that has a total of about 4000 tags total

    Based on my descriptions, would it have been better if I stuck with Method #2, or is Method #1 fine as is? Or would it be better to use regular expressions?
    Last edited by Apothem; 02-25-2011 at 06:42 AM.

  • #4
    Master Coder
    Join Date
    Dec 2007
    Posts
    6,682
    Thanks
    436
    Thanked 890 Times in 879 Posts
    Quote Originally Posted by Apothem View Post
    The funny thing is that just today I found that (HTMLParser) page; I wasn't searching the right terms before.

    I'm not too savvy when it comes to performance and when I search up, but the way I've set my class is as such:
    Method #1
    1) Every HTML tag and tag attribute is stored into a node.
    2) Each node is within a nodelist's list (not dict) as a reference
    3) Every node is also a nodelist, but the cardinality of the nodelist can be 0
    4) Two nodes are adjacent iff they are nested within the same tag block (i.e. "title" and "script" are adjacent if they are inside a "head" tag).
    5) Every adjacent node can getPrev or getNext to get the tag behind/in front of it, respectively, if there is none it will return None
    6) A nodelist has the methods getElementById, getElementsByTagName, getElementsByClassName
    7) Each of the getElement(s) method iterates through all of the nodes in nodelists, including the nodes within each of the node's nodelist (recursively), to find matching ids, tag name, or class name.

    As such the list looks similar to this:
    Code:
    body
    	div ('id'='container')
    		h2 ('class': 'title')
    		div ('class': 'meta')
    		div ('class': 'content')
    			p
    	div ('id'='footer')
    		ul
    			li
    			li
    So here's another method I originally did, but thought it... used extra memory?
    Method #2
    1) Has members tagnames, classes, and ids in nodelist, each of which are dicts that contains a list (not dict) of node references.
    2) Each node is within the (single) nodelist's tagnames (is a list) as a reference
    3) Each node has a predecessor and successor member. The only difference is that the parent of a node can be obtained through getPrev (i.e. being in the same tag block does not matter).
    4) Because there are members for getElement(s)By(Id/ClassName/TagName)(), it only needs to either: a) return the self.ids['myid'], b) return self.classes['classname'], or c) return self.tagnames['tagname']

    Also, for each "run" I get about 8 class/tag names from a page that has a total of about 4000 tags total
    imo method 2 is better. Some things you can think about:
    - relative to 2#1 ( notation method#item ), id's and classes are attributes so you need to have only elements( tagnames ) and attributes
    - a page is a tree where the root of the tree is 'html' with two leafs 'head' and 'body', and you can continue this way to describe the whole page( something like what you described in method 1#). That is why i think the best way is to implement a class tree and extend it after.

    Based on my descriptions, would it have been better if I stuck with Method #2, or is Method #1 fine as is? Or would it be better to use regular expressions?
    there three things in a parser:
    1. lexical analyzer, code to identifiy each token in the input, you can use here regex, string functions or whatever you find that is easy and fast to do

    http://en.wikipedia.org/wiki/Lexical_analysis

    2. syntactic analyzer, code to assemble tokens into a AST, abstract syntax tree

    http://en.wikipedia.org/wiki/Parser
    http://en.wikipedia.org/wiki/Abstract_syntax_tree

    3. code to fill the data structure( what you described in methods 1# and 2#) with the results ( the AST).

    usualy 2 and 3 are combined in a single step, sometimes all three are combined.

    best regards

  • Users who have thanked oesxyl for this post:

    Apothem (02-26-2011)

  • #5
    Regular Coder Samhain13's Avatar
    Join Date
    Aug 2008
    Location
    Pilipinas
    Posts
    169
    Thanks
    4
    Thanked 18 Times in 18 Posts
    Easy way out? Use BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
    I am a Man of Truth. I am a Free Human Person. I am a Peacemaker.
    ** Independent Multimedia Artist in Pasig **

  • #6
    Regular Coder Apothem's Avatar
    Join Date
    Mar 2008
    Posts
    380
    Thanks
    36
    Thanked 25 Times in 25 Posts
    Yes well, parsing HTML just wasn't a good idea in my end. I'm going to just use plain old regular expressions to search for things. If you are able to prove that BeautifulSoup is much faster than regular expressions in matching a) for an element of x ID and b) elements of a certain class, and c) can elements before an element + its parent.

  • #7
    Regular Coder Samhain13's Avatar
    Join Date
    Aug 2008
    Location
    Pilipinas
    Posts
    169
    Thanks
    4
    Thanked 18 Times in 18 Posts
    Oooh, lots of machismo going around. "If you can prove that...."

    Here's the thing. If you can get the League Results table from this page:
    http://www.footballalliance.ph/league/Results.php

    and return it as JSON without using any sort of SGML/HTML/XML parser and using only regular expressions, for updating in real-time the League Table in this page:
    http://www.pinoyfootball.com/News

    perhaps, we'll talk more.

    And while you're at it, let's see you get the links and titles of the latest active threads from this page:
    http://usapangfootball.proboards.com...=newestthreads

    so that they can be rendered as a navigation list as what is done in this page:
    http://www.pinoyfootball.com

    Your challenge assumes that you're getting well-formed, even valid HTML source code. But the real world is full of soup. What happens if the source code you're evaluating has two, three elements that have the same ID? What happens when you're parsing elements that have multiple classes that are defined in varying order, like:

    Code:
    <a href="#" class="nav-link no-underline primary-link">Some String</a>
    <a href="#" class="no-underline secondary-link nav-link">Another String</a>
    <a href="#" class="secondary-link underline-this"><b>Yet Another String</b></a>
    Last edited by Samhain13; 03-01-2011 at 12:09 PM.
    I am a Man of Truth. I am a Free Human Person. I am a Peacemaker.
    ** Independent Multimedia Artist in Pasig **

  • #8
    Regular Coder Apothem's Avatar
    Join Date
    Mar 2008
    Posts
    380
    Thanks
    36
    Thanked 25 Times in 25 Posts
    It seems you are making a lot of incorrect assumptions about what I know and what I want. Your response is a fallacy. Just because you know that this module makes scraping easier does not mean its performance is better than using RegEx. Therefore your retort is invalid.

    Regardless, this thread is done. My questions have "long" been (more than needed to) answered by oesxyl.

  • #9
    Regular Coder Samhain13's Avatar
    Join Date
    Aug 2008
    Location
    Pilipinas
    Posts
    169
    Thanks
    4
    Thanked 18 Times in 18 Posts
    Riiiight...
    I am a Man of Truth. I am a Free Human Person. I am a Peacemaker.
    ** Independent Multimedia Artist in Pasig **

  • #10
    Master Coder
    Join Date
    Dec 2007
    Posts
    6,682
    Thanks
    436
    Thanked 890 Times in 879 Posts
    Quote Originally Posted by Apothem View Post
    It seems you are making a lot of incorrect assumptions about what I know and what I want. Your response is a fallacy. Just because you know that this module makes scraping easier does not mean its performance is better than using RegEx. Therefore your retort is invalid.

    Regardless, this thread is done. My questions have "long" been (more than needed to) answered by oesxyl.
    thank you,
    i agree that using regex in a specific case for a given task, is many time faster then a general parser. But are only few parser who implement error recovery and dealing with errors and invalid markup, Samhain13 is right about BeautifulSoup. I just notice that i didn't even mention about this problem, my bad,

    best regards


  •  

    Posting Permissions

    • You may not post new threads
    • You may not post replies
    • You may not post attachments
    • You may not edit your posts
    •