...

handling bad HTML in Javascript



zkent
04-28-2008, 08:58 PM
This is about JS, trust me (I think).

I have an XSLT that converts RSS to a neat and readable HTML document. The resulting HTML document uses javascript to do some fancy tricks. My HTML is table-free HTML using a CSS style sheet to display elements. My problem is that some of the RSS feeds display HTML in their content and their HTML is sometimes bad. Their bad HTML breaks my pretty HTML.

I need to know if there is a way to properly contain bad HTML so it won't affect mine or to test their HTML for errors before I try to display it.

I know Google Reader does something like this because the feeds that break my reader don't break Google. If I look at the Google source code, the offending elements are left out. How are they detecting and trapping that?

mjlorbet
04-28-2008, 09:11 PM
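// Escape markup so feed content is shown as literal text instead of being rendered.
function doFixBadHTML(rssFeedContent){
// "&" goes first so the entities produced below are not escaped a second time.
return rssFeedContent.replace(/&/g, "&amp;").replace(/>/g, "&gt;").replace(/</g, "&lt;");
}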


this will at least prevent the bad html from rendering, just showing the actual tags instead. also, i'm not 100% sure that my regex is good, so don't kill me if it doesn't work, it's the principle i'm trying to convey on that.

short of writing a validation engine otherwise, i'm afraid there is no simple answer that i'm aware of.

zkent
04-28-2008, 09:19 PM
Thanks. My problem is that I don't know how to detect the bad HTML.

mjlorbet
04-28-2008, 09:22 PM
that function would just handle the content of each feed, converting any html content inside into visible text, not just bad content.

meaning, let's say they embedded an image in the body portion of the post (not in the dedicated image section of the rss), that code would stop it from displaying the picture and actually write out the code for the image on the page.

detecting bad html requires the ability to validate the code, matching opening/closing tags, optional closing tags on the elements that need them, proper nesting, etc.
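
A rough sketch of the tag-matching part of that check, in plain JavaScript. The function name findTagProblems and the VOID_TAGS list are assumptions made for illustration, not something taken from the feeds or from a library. It tokenizes with a regex, so it can be fooled by a > inside an attribute value, by comments, or by script blocks, and it knows nothing about optional closing tags such as </li> or </p>. Treat it as a way to flag suspicious content, not as a validator.

```javascript
// Assumed list of void elements (the ones that never take a closing tag).
var VOID_TAGS = { area: 1, base: 1, br: 1, col: 1, embed: 1, hr: 1, img: 1,
                  input: 1, link: 1, meta: 1, param: 1, source: 1, track: 1, wbr: 1 };

// Hypothetical checker: returns a list of problem descriptions, empty if none were found.
function findTagProblems(html) {
  // Groups: 1 = "/" on a closing tag, 2 = tag name, 3 = "/" on a self-closed tag.
  var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  var stack = [];      // names of elements that are currently open
  var problems = [];
  var match;

  while ((match = tagPattern.exec(html)) !== null) {
    var name = match[2].toLowerCase();

    if (match[1] === "/") {
      if (VOID_TAGS.hasOwnProperty(name)) {
        problems.push("closing tag on void element </" + name + ">");
      } else if (stack.lastIndexOf(name) === -1) {
        problems.push("stray </" + name + ">");
      } else {
        // Anything opened after the matching element was never closed.
        while (stack[stack.length - 1] !== name) {
          problems.push("unclosed <" + stack.pop() + ">");
        }
        stack.pop();
      }
    } else if (!VOID_TAGS.hasOwnProperty(name) && match[3] !== "/") {
      stack.push(name);
    }
  }

  // Whatever is still open at the end of the content was never closed.
  while (stack.length) {
    problems.push("unclosed <" + stack.pop() + ">");
  }
  return problems;
}
```

One way to use it: render a feed item as HTML only when findTagProblems returns an empty list, and fall back to the escaped text otherwise.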

inline styles that are being imported on those objects may present even more fun

there is something else to consider here too, how does one programmatically fix html? we can infer how to do it i suppose, but take this for example



<div>
<div style="color:red; background-color:#000; width:100px; height:100px;">
Some text
<img src = "hello.jpg" />
</div>
All done


which div is supposed to close, and for that matter where? fixing the html won't necessarily fix the presentation, not to mention that this case is fairly ambiguous.


<div>
<div style="color:red; background-color:#000; width:100px; height:100px;">
Some text
<img src = "hello.jpg" />
</div>
All done
</div>

would be an obvious answer, just tack on the closing tag to the end
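
The "tack it on the end" repair can be sketched roughly like this. appendMissingClosers and VOID_ELEMENTS are hypothetical names, and the regex tokenizer is an assumption that only holds for simple markup (no > inside attribute values, no comments or scripts). It keeps a stack of open elements and appends a closing tag for whatever is still open when the content ends; stray closing tags are left alone.

```javascript
// Assumed list of void elements, which must not get a closing tag appended.
var VOID_ELEMENTS = ["area", "base", "br", "col", "embed", "hr", "img",
                     "input", "link", "meta", "param", "source", "track", "wbr"];

// Hypothetical helper: append closing tags for every element still open at the end.
function appendMissingClosers(html) {
  // Groups: 1 = "/" on a closing tag, 2 = tag name, 3 = "/" on a self-closed tag.
  var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  var open = [];
  var match;

  while ((match = tagPattern.exec(html)) !== null) {
    var name = match[2].toLowerCase();
    // Void and self-closed tags never stay open, so skip them.
    if (VOID_ELEMENTS.indexOf(name) !== -1 || match[3] === "/") continue;

    if (match[1] !== "/") {
      open.push(name);
    } else {
      var idx = open.lastIndexOf(name);
      // Close the element and anything nested inside it; ignore stray closers.
      if (idx !== -1) open.length = idx;
    }
  }

  // Close the innermost open element first.
  var closers = "";
  for (var i = open.length - 1; i >= 0; i--) {
    closers += "</" + open[i] + ">";
  }
  return html + closers;
}
```

For the snippet above this appends a single </div> after "All done", which keeps the markup balanced but, as noted, says nothing about whether the result looks the way the feed author intended.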


<div>
<div style="color:red; background-color:#000; width:100px; height:100px;"></div>
Some text
<img src = "hello.jpg" />
</div>
All done

would be another obvious answer, if you parsed this code and determined that there was a div without a closing tag, closing it immediately would keep the parent presentation from breaking but possibly kill the child presentation


<div></div>
<div style="color:red; background-color:#000; width:100px; height:100px;">
Some text
<img src = "hello.jpg" />
</div>
All done

is along the lines of the one above



<div>
<div style="color:red; background-color:#000; width:100px; height:100px;">
Some text
<img src = "hello.jpg" />
</div>
All done

notice the img tag.... there is no closing tag for it, and there doesn't need to be, but simple validation based on matching < and </ tokens would likely try to close it (which may or may not cause problems; either way, changing this syntax is undesirable). additionally we could check for the xhtml (xml) convention of /> on a childless node, but that's not always proper either.

things to consider

zkent
04-29-2008, 02:28 AM
I have no way of knowing what bad HTML elements I will encounter. One feed (Coding Horror) is missing a closing div and one (Boing Boing) has some improperly encoded HTML tags. Not knowing what errors I might find is the biggest problem. I am currently chasing another idea that I got on a different forum to solve the problem. The idea is to inject the HTML into an iframe which will isolate the bad HTML from my HTML. So far that is working but I have to now worry about iframe heights.
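
For anyone trying the same iframe approach, here is a minimal sketch of writing the feed HTML into the iframe's own document and then sizing the iframe to its content. renderIsolated and fitFrameToContent are hypothetical names, and the sketch assumes an iframe with no src (so its document is same-origin and scriptable). Note that this only isolates layout: any script in the feed content would still run inside the frame.

```javascript
// Hypothetical helper: set the iframe's height to the height of its rendered content.
function fitFrameToContent(iframe) {
  var doc = iframe.contentWindow.document;
  // Browsers differ on which element reports the full height, so take the larger.
  var height = Math.max(doc.body.scrollHeight, doc.documentElement.scrollHeight);
  iframe.style.height = height + "px";
  return height;
}

// Hypothetical helper: write feed HTML into the iframe's document, then resize it.
// Unclosed tags in feedHtml end at the iframe's document boundary, not in the host page.
function renderIsolated(iframe, feedHtml) {
  var doc = iframe.contentWindow.document;
  doc.open();
  doc.write("<!DOCTYPE html><html><body>" + feedHtml + "</body></html>");
  doc.close();
  return fitFrameToContent(iframe);
}
```

Images that finish loading later will change the content height, so it is probably worth calling fitFrameToContent again from the iframe's load event.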

mjlorbet
04-29-2008, 02:33 AM
yeah, that can get pretty tricky, best of luck


