  1. #1
    New to the CF scene
    Join Date
    Apr 2008
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    handling bad HTML in Javascript

    This is about JS, trust me (I think).

    I have an XSLT that converts RSS to a neat and readable HTML document. The resulting HTML document uses javascript to do some fancy tricks. My HTML is table-free HTML using a CSS style sheet to display elements. My problem is that some of the RSS feeds display HTML in their content and their HTML is sometimes bad. Their bad HTML breaks my pretty HTML.

    I need to know if there is a way to properly contain bad HTML so it won't affect mine or to test their HTML for errors before I try to display it.

    I know Google Reader does something like this because the feeds that break my reader don't break Google. If I look at the Google source code, the offending elements are left out. How are they detecting and trapping that?

  • #2
    Regular Coder mjlorbet's Avatar
    Join Date
    Jan 2008
    Location
    Milwaukee, WI
    Posts
    724
    Thanks
    8
    Thanked 96 Times in 95 Posts
    Code:
     
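    // Escape HTML metacharacters so feed markup is shown as literal text instead of rendering.
    function doFixBadHTML(rssFeedContent){
        // Escape & first, so entities already in the feed are shown literally too.
        return rssFeedContent.replace(/&/g, "&amp;").replace(/>/g, "&gt;").replace(/</g, "&lt;");
    }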
    this will at least prevent the bad html from rendering, just showing the actual tags instead. also, i'm not 100% sure that my regex is good, so don't kill me if it doesn't work; it's the principle i'm trying to convey.

    short of writing a validation engine otherwise, i'm afraid there is no simple answer that i'm aware of.
    -Mike
    "Want me to precludify him, like some kind of dispatcherator?... Can do!" -Bender

  • #3
    New to the CF scene
    Thanks. My problem is that I don't know how to detect the bad HTML.

  • #4
    Regular Coder mjlorbet's Avatar
    that function would just handle the content of each feed, converting any html content inside into visible text, not just bad content.

    meaning, let's say they embedded an image in the body portion of the post (not in the dedicated image section of the rss): that code would stop the picture from displaying and would write out the markup for the image on the page instead.

    detecting bad html requires the ability to validate the code: matching opening/closing tags, knowing which elements require a closing tag and which make it optional, checking proper nesting, etc.

    inline styles that are being imported on those objects may present even more fun

    there is something else to consider here too, how does one programmatically fix html? we can infer how to do it i suppose, but take this for example

    Code:
     
    <div>
    <div style="color:red; background-color:#000; width:100px; height:100px;">
    Some text
    <img src = "hello.jpg" />
    </div>
    All done
    which div is supposed to close, and for that matter where? fixing the html won't necessarily fix the presentation, not to mention that this case is fairly ambiguous.
    Code:
     
    <div>
    <div style="color:red; background-color:#000; width:100px; height:100px;">
    Some text
    <img src = "hello.jpg" />
    </div>
    All done
    </div>
    would be an obvious answer, just tack on the closing tag to the end
    Code:
     
    <div>
    <div style="color:red; background-color:#000; width:100px; height:100px;"></div>
    Some text
    <img src = "hello.jpg" />
    </div>
    All done
    would be another obvious answer: if you parsed this code and determined that there was a div without a closing tag, closing it immediately would keep the parent presentation from breaking but possibly kill the child presentation
    Code:
     
    <div></div>
    <div style="color:red; background-color:#000; width:100px; height:100px;">
    Some text
    <img src = "hello.jpg" />
    </div>
    All done
    is along the lines of the one above

    Code:
     
    <div>
    <div style="color:red; background-color:#000; width:100px; height:100px;">
    Some text
    <img src = "hello.jpg" />
    </div>
    All done
    notice the img tag.... there is no closing tag for it, there doesn't need to be, but simple validation based off opening < and </ tokens would likely try closing it (which may or may not cause problems, either way you look at it, changing from this syntax is undesirable), additionally we could check for the xhtml(xml) convention of /> on a childless node, but that's not always proper either.
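as a rough illustration of the matching described above, here is a minimal sketch of a tag balance check that skips void elements such as img. the names `checkTagBalance` and `VOID_ELEMENTS` are made up for this example, and the regex tokenizer is an assumption that ignores comments, scripts, and attribute values containing ">", so treat it as a starting point rather than a validator.

```javascript
// Void elements never take a closing tag in HTML, so they are never pushed on the stack.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr"
]);

// Hypothetical helper: reports tags that were opened but never closed,
// and closing tags that had no matching opener.
function checkTagBalance(html) {
  // Rough tokenizer: captures an optional leading "/", the tag name,
  // and an optional trailing "/" (the XHTML-style self-closing form).
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  const open = [];     // stack of currently open tag names
  const unclosed = []; // tags skipped over when an outer tag closed
  const stray = [];    // closing tags with no opener
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    const isSelfClosed = match[3] === "/";
    if (VOID_ELEMENTS.has(name) || isSelfClosed) continue;
    if (!isClosing) {
      open.push(name);
      continue;
    }
    const index = open.lastIndexOf(name);
    if (index === -1) {
      stray.push(name);
    } else {
      // Anything opened after the matching opener was left unclosed.
      unclosed.push(...open.splice(index + 1));
      open.pop();
    }
  }
  // Whatever is still on the stack at the end was never closed either.
  return { unclosed: unclosed.concat(open), stray };
}

const sample = '<div>\n<div style="color:red;">\nSome text\n<img src = "hello.jpg" />\n</div>\nAll done';
console.log(checkTagBalance(sample)); // { unclosed: [ 'div' ], stray: [] }
```

note that this only says a div is unclosed. it cannot say which of the two divs was meant to stay open or where the missing tag belongs, which is the ambiguity shown in the examples above.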

    things to consider
    Last edited by mjlorbet; 04-28-2008 at 09:29 PM.

  • #5
    New to the CF scene
    Join Date
    Apr 2008
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I have no way of knowing what bad HTML elements I will encounter. One feed (Coding Horror) is missing a closing div and one (Boing Boing) has some improperly encoded HTML tags. Not knowing what errors I might find is the biggest problem. I am currently chasing another idea that I got on a different forum to solve the problem. The idea is to inject the HTML into an iframe which will isolate the bad HTML from my HTML. So far that is working but I have to now worry about iframe heights.
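For the iframe height issue, one possible approach is to measure the document inside the iframe after it loads and copy that height onto the iframe element. The sketch below assumes the injected content is same-origin (for example written into the iframe by the page itself), since `contentDocument` is not accessible otherwise. The function name `fitIframeToContent` and the element id in the comment are hypothetical.

```javascript
// Sets the iframe's height to the height of the document it contains.
// Returns the height used, or null if the inner document is not reachable
// (not loaded yet, or blocked because it is cross-origin or sandboxed).
function fitIframeToContent(iframe) {
  const doc = iframe.contentDocument;
  if (!doc || !doc.documentElement) return null;
  // Take the larger of the two measurements; broken markup can make them differ.
  const height = Math.max(
    doc.documentElement.scrollHeight,
    doc.body ? doc.body.scrollHeight : 0
  );
  iframe.style.height = height + "px";
  return height;
}

// In a browser this could be wired to the load event, e.g. (hypothetical id):
// const frame = document.getElementById("feed-frame");
// frame.addEventListener("load", () => fitIframeToContent(frame));
```

Images inside the feed that finish loading later can change the content height, so the function may need to be called again after they load.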

  • #6
    Regular Coder mjlorbet's Avatar
    yeah, that can get pretty tricky, best of luck

