  1. #1
    New Coder
    Join Date
    Dec 2009
    Location
    US
    Posts
    53
    Thanks
    9
    Thanked 0 Times in 0 Posts

    HTML Parsing with C#, HTML Variation problem

    I'm trying to parse some HTML with C# but have run into a problem with the tags around listing URLs being inconsistent.

    Code:
    <ul class="contact clearfix">
                                            
                                                <h4>Visit Site:</h4>
                                                <strong><span style="font-size: 12px;font-weight: bold;color:#228fbb;">www.blauel.com/</span></strong>
    </ul>
    Code:
    <h4>
                                                    Visit Site:</h4>
                                                <strong><a href="http://ps-architects.co.uk/" target="_blank" >
                                                    ps-architects.co.uk/
                                                </a></strong></li>
                                                
                                            <li>
                                                <h4>
    Code:
    <ul class="contact clearfix">
                                            
                                                <h4>Visit Site:</h4>
                                                <strong><span style="font-size: 12px;font-weight: bold;color:#228fbb;">www.bbm-architects.co.uk/</span></strong>
                                                
                                        </ul>
    For other items pulled from the HTML, I'm doing something like finding the index of "Visit Site:" (or of the tag marking the element, like <Name>listing name</name>) and then taking everything from the end of that, the >, until hitting a <. But what comes between "Visit Site:" and the actual URL varies, and I can't do anything about that, so I'm not sure how to get just the URL correctly. Is there a good way to do this?
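
    The index-based approach described above can be extended to cope with the varying markup by skipping every tag after "Visit Site:" and taking the first run of real text. What follows is only a minimal sketch under that assumption, not a full HTML parser: the names ListingParser and ExtractSiteText are made up for illustration, and it would break on attribute values that contain a > character.

```csharp
using System;

public static class ListingParser
{
    // Hypothetical helper: returns the first non-whitespace text after the
    // "Visit Site:" label, skipping any tags in between (<strong>, <span>,
    // <a>, ...). Returns an empty string if the label or text is not found.
    public static string ExtractSiteText(string html)
    {
        const string label = "Visit Site:";
        int pos = html.IndexOf(label, StringComparison.OrdinalIgnoreCase);
        if (pos < 0) return string.Empty;
        pos += label.Length;

        while (pos < html.Length)
        {
            if (html[pos] == '<')
            {
                // Skip the whole tag, whatever it is.
                int close = html.IndexOf('>', pos);
                if (close < 0) return string.Empty;
                pos = close + 1;
            }
            else if (char.IsWhiteSpace(html[pos]))
            {
                pos++;
            }
            else
            {
                // First real text: read up to the next tag and trim it.
                int end = html.IndexOf('<', pos);
                if (end < 0) end = html.Length;
                return html.Substring(pos, end - pos).Trim();
            }
        }
        return string.Empty;
    }
}
```

    Because the tags themselves are never inspected, the same call should handle both the <span> variant and the <a href> variant shown above. If the href value is wanted rather than the visible text, that would need a separate check for an <a> tag.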

  2. #2
    Senior Coder deathshadow
    Join Date
    Feb 2016
    Location
    Keene, NH
    Posts
    2,927
    Thanks
    3
    Thanked 416 Times in 405 Posts
    Yeah, fix the HTML, it's invalid. An H4 directly as a child of a UL is gibberish. I very much doubt that structurally those should be all the way down at H4. Slopping style="" with pixel metric fonts is accessibility rubbish and developer ignorance. I doubt even more that a domain as text would be receiving "more emphasis", the grammatical meaning of <strong> (since strong does NOT mean "make this text bold", that's just the default appearance on screen media).

    Realistically a browser would/should be having as much trouble deciphering that train wreck of how NOT to write HTML as your own parser likely is. That's why broken/incomprehensible/inconsistent/invalid HTML is basically useless trash.

    Though I'm REALLY not sure what it is you are asking, as only ONE of your code blocks actually links anywhere. The others aren't links, so why are you trying to pull their text content?

    Also, given how badly broken the markup is, this is NOT something I'd be trying to hard-code from any C dialect. I'd probably pre-process it off a built DOM in the browser with a userscript (via something like Greasemonkey or Tampermonkey) to create something C# is more cleanly able to digest for far less effort.

    But that hinges on how well formed the DOM is after it struggles its way through that incompetent laundry list of how NOT to write HTML.
    Last edited by deathshadow; Dec 26th, 2017 at 08:39 AM.
    “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.” – C.A.R. Hoare, The 1980 ACM Turing Award Lecture
    http://www.cutcodedown.com


 
