...

View Full Version : RegExp assistance



awarberg
05-08-2009, 04:53 PM
Hi

I am writing a small javascript application which reads an rss feed. The description tag of each item contains a cdata section.

I want to parse the cdata section and extract bits of information for display on google maps.

Below is an example of the contents of this cdata section.


<div><b>Projekt:</b> Clausholmvej</div>
<div><b>Længdegrad:</b> 55.642802</div>
<div><b>Breddegrad:</b> 12.338333</div>
<div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>
<div>Kryds 1</div>
<div> </div>
<div><em>Mere kursiv tekst</em></div>
<div> </div>
<div><strong>Fed tekst</strong></div>
<div> </div>
<div><a href="http://code.google.com/support/bin/answer.py?answer=65622&amp;topic=11364">Et link</a></div></div></div>

I am able to extract "Projekt", "Længdegrad" and "Breddegrad" using a regular expression.

I am, however, not able to extract "Indhold" (everything after the "Indhold:" label in the sample above). Indhold is an HTML snippet that is supposed to go into the Google Maps window that pops up when a marker is clicked.

Could you help me and suggest a regexp that extracts Indhold for me? Please note that the cdata example is copied verbatim as it arrives from the rss generator (Sharepoint...). I have no control over the generated rss. Also note that the ExternalClass... value is expected to vary.

NB: I have tried putting the contents of the cdata into a jQuery object for processing via jQuery. While this approach works in Google Chrome, it fails in Internet Explorer, which is a mandatory supported browser for this application. I am forced to extract these data bits with plain old regexp.

Thank you so much for your help.

Best regards
Andreas

thecaligarmo
05-08-2009, 09:37 PM
Can you post what RegExp you are using and also what the results are? The more code you provide, the easier it is to help.

awarberg
05-11-2009, 09:28 AM
Below is the code I use to extract the three primitive data items.



var projectDetails = {};

jQuery.each(["Projekt", "Længdegrad", "Breddegrad"], function(){
var pattern = new RegExp("<div><b>" + this + ":<\/b>(.+)<\/div>");
var match = pattern.exec(cdataText);
projectDetails[this] = jQuery.trim(match[1]);
});

var project = projectDetails["Projekt"];
var lat = projectDetails["Længdegrad"];
var lng = projectDetails["Breddegrad"];


where cdataText is the sample text I gave you in my first post.

For that sample I would get:


project == "Clausholmvej"
lat == "55.642802"
lng == "12.338333"
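
A slightly more defensive variant of the loop above, as a sketch only: it uses no jQuery, returns null instead of throwing when a label is missing, and uses a lazy (.+?) so it would also work if the whole item arrived on one line. The helper name extractField is made up for this example, and it assumes the labels contain no regexp special characters.

```javascript
// Hypothetical helper (not from the original code): returns the trimmed text
// of one "<div><b>Label:</b> value</div>" field, or null if the label is absent.
function extractField(cdataText, label) {
  // Lazy (.+?) stops at the first </div> after the label, so later fields
  // on the same line are not swallowed.
  var pattern = new RegExp("<div><b>" + label + ":</b>(.+?)</div>");
  var match = pattern.exec(cdataText);
  // Trim by hand so the sketch does not depend on jQuery.trim.
  return match ? match[1].replace(/^\s+|\s+$/g, "") : null;
}

var sample =
  "<div><b>Projekt:</b> Clausholmvej</div>" +
  "<div><b>Længdegrad:</b> 55.642802</div>" +
  "<div><b>Breddegrad:</b> 12.338333</div>";

var project = extractField(sample, "Projekt");
var missing = extractField(sample, "Indhold"); // this sample has no Indhold field
```

Checking the return value for null before using it avoids the "match is null" error that match[1] would raise on an item with a missing field.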


Now I just need a regular expression to extract the html snippet "Indhold".

Best regards
Andreas

Old Pedant
05-11-2009, 10:06 PM
Not sure you can do this with a single regular expression, at all.

Not even easy to do with multiple reg exps.

I'd just opt to do it in ordinary string manipulation code.

Possible implementation:


<script>
test =
"<div><b>Projekt:</b> Clausholmvej</div>"
+ "<div><b>Længdegrad:</b> 55.642802</div>"
+ "<div><b>Breddegrad:</b> 12.338333</div>"
+ "<div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>"
+ "<div>Kryds 1</div>"
+ "<div> </div>"
+ "<div><em>Mere kursiv tekst</em></div>"
+ "<div> </div>"
+ "<div><strong>Fed tekst</strong></div>"
+ "<div> </div>"
+ "<div><a href=\"http://code.google.com/support/bin/answer.py?answer=65622&amp;topic=11364\">Et link</a></div></div></div>"
+ "<div>MORE STUFF HERE</div>";

function findIndHold( findIn )
{
var str = findIn.toLowerCase();
// find start:
var indholdAt = str.indexOf("indhold"); // we are past the <div>
var extAt = str.indexOf("<div", indholdAt ); // the div with class=Extern...
var startAt = str.indexOf("<div",extAt+4); // start of part we care about!
// find matching end
var cur = startAt + 4;
var count = 2;
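while ( count > 0 )
{
var find1 = str.indexOf("<div",cur);
var find2 = str.indexOf("</div",cur);
if ( find2 < 0 ) return null; // unbalanced markup: no closing </div left
if ( find1 >= 0 && find1 < find2 ) // find1 is -1 when no further <div exists
{
++count;
cur = find1 + 4;
} else {
--count;
cur = find2 + 5;
}
}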
var endAt = cur - 5;
return findIn.substring( startAt, endAt );
}
alert(test);
alert( findIndHold(test) );
</script>


Seems to work. Test it with other examples.

I did the toLowerCase() so it works if you might have <DIV> in place of <div> in some places. If you are sure you don't need that, you can omit it.

venegal
05-11-2009, 11:50 PM
It's actually quite similar to what you did before:


var cdataText=(<r><![CDATA[
<div><b>Projekt:</b> Clausholmvej</div>
<div><b>Længdegrad:</b> 55.642802</div>
<div><b>Breddegrad:</b> 12.338333</div>
<div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>
<div>Kryds 1</div>
<div> </div>
<div><em>Mere kursiv tekst</em></div>
<div> </div>
<div><strong>Fed tekst</strong></div>
<div> </div>
<div><a href="http://code.google.com/support/bin/answer.py?answer=65622&amp;topic=11364">Et link</a></div></div></div>
]]></r>).toString();

cdataText = cdataText.replace(/\n/g, "\\n");

var pattern = /<div><b>Indhold:<\/b> <div class=ExternalClass.*?>(.*?<\/div>)<\/div><\/div>/;
var match = pattern.exec(cdataText);

var indhold = (match[1]).replace(/\\n/g, "\n");

alert(indhold);


The only strange thing I did there is replace the line feeds and put them back in afterwards, because "." in a JavaScript regexp does not match line feeds.
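
An alternative to swapping the line feeds out and back in, as a rough sketch: the character class [\s\S] matches any character including line feeds, so the pattern can run on the text as it is. The sample string below is shortened from the original example, and [^>]* is used for the rest of the class attribute; both are choices made for this sketch only.

```javascript
// Sample input with real line feeds, shortened from the thread's example.
var cdataText = [
  "<div><b>Projekt:</b> Clausholmvej</div>",
  "<div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>",
  "<div>Kryds 1</div>",
  "<div><strong>Fed tekst</strong></div></div></div>"
].join("\n");

// [\s\S]*? is a lazy "any character, line feeds included" match, so the
// text does not have to be rewritten before matching.
var pattern = /<div><b>Indhold:<\/b> <div class=ExternalClass[^>]*>([\s\S]*?<\/div>)<\/div><\/div>/;
var match = pattern.exec(cdataText);
var indhold = match ? match[1] : null;
```

This keeps the same assumption as the pattern above: the snippet ends at the first run of three consecutive closing div tags.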

Old Pedant
05-12-2009, 01:06 AM
No, I don't think that works, Venegal.

Suppose that the input data looks like this, with the nested divs (embed1, embed2) that I have added after "Mere kursiv tekst":


<div><b>Projekt:</b> Clausholmvej</div>
<div><b>Længdegrad:</b> 55.642802</div>
<div><b>Breddegrad:</b> 12.338333</div>
<div><b>Indhold:</b> <div class=ExternalClass5039DAD80923490DBD90804763287407>
<div>Kryds 1</div>
<div> </div>
<div><em>Mere kursiv tekst</em><div>embed1<div>embed2</div></div></div>
<div> </div>
<div><strong>Fed tekst</strong></div>
<div> </div>
<div><a href="http://code.google.com/support/bin/answer.py?answer=65622&amp;topic=11364">Et link</a>
</div></div></div>
Your regexp will terminate at the end of the </div> after my added text.

You have assumed that his HTML will always have exactly that same format.

If you happen to be right, then yes, your solution works. But if the Indhold html is indeed just arbitrary HTML coming from some database, then you can't possibly predict that a string of three </div>'s in a row will be the real end of the search string. And even as simple a thing as the HTML ending in

</div> </div> </div>

(that is, with lots of spaces) would cause your regex to fail.
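
Both failure cases can be reproduced with a short test. This is only a sketch: the two input strings are made up for the demonstration, and the pattern uses [\s\S] and [^>]* in place of "." so that no line-feed replacement is needed.

```javascript
// Lazy pattern in the style of the regexp solution above; it ends the
// capture at the first run of three consecutive </div> tags.
var pattern = /<div><b>Indhold:<\/b> <div class=ExternalClass[^>]*>([\s\S]*?<\/div>)<\/div><\/div>/;

// Made-up input with nested divs in the middle of the snippet.
var nested =
  "<div><b>Indhold:</b> <div class=ExternalClass1>" +
  "<div><em>a</em><div>embed1<div>embed2</div></div></div>" +
  "<div>tail</div></div></div>";

// Made-up input with spaces between the closing tags.
var spaced =
  "<div><b>Indhold:</b> <div class=ExternalClass1>" +
  "<div>a</div> </div> </div>";

var nestedMatch = pattern.exec(nested);
var spacedMatch = pattern.exec(spaced);
// nestedMatch[1] ends right after "embed2</div>", so "<div>tail</div>" is lost.
// spacedMatch is null, because no three closing tags are directly adjacent.
```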

venegal
05-12-2009, 01:57 AM
You're right, that would fail indeed. Regex is not the right tool for parsing complex markup, and I didn't read your answer before posting myself, sorry.

But when helping out with a regex problem I like to assume the simplest case until told otherwise, and then adapt, if necessary, in order to keep things from getting needlessly convoluted.

Maybe awarberg knows that the html is always of the same form and there will never be nested divs?

Or, if not, maybe the end of that snippet is always the end of the cdata?

Or, if not, maybe after the "Indhold" part there will always be another part starting with "<div><b>Something:" like the ones before?

In my opinion there are too many possible scenarios, in which a simple regexp will work perfectly, to just reject a regexp solution completely in favor of a more flexible and complex one.

Old Pedant
05-12-2009, 02:14 AM
Yes, I admit to trying to always find the most general answers. That's why I even used toLowerCase( ) to make sure that looking for <div and </div would work. And that's why I didn't search for <div>, since I assumed it would be possible to encounter something like <div style="color: red;">

But yeah, my solution is quite possibly overkill. Maybe the original poster will eventually reply and we'll find out.

awarberg
05-20-2009, 04:09 PM
Hi guys

Thank you very much for your help!

To answer the questions:

- After Indhold, there will not be another "<div><b>Something:" part.
- The cdata section only contains the html snippet, so the end of the snippet immediately precedes the end of the cdata section.

I have opted for the regexp solution by venegal since it is easier for me to understand. I can see I was somewhat close, but the newline part must have tripped me up.

As I mentioned the list comes from Sharepoint and the html snippets are created using Sharepoint Content Editor. I have made some trial runs and it seems unlikely that html snippets will be generated containing nested divs. I think the regexp solution will be robust. But, obviously, I would like to have a generic regexp which didn't make this assumption. Is it possible to use back references (eg. \0) in the regexp to solve this?

Anyway, the solution works now - in IE7 as well - so thank you very much for your help.

Best regards
Andreas
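
On the back-reference question: as far as I know, a back reference such as \1 only re-matches text that a group has already captured (and \0 in a JavaScript regexp matches a NUL character), so it cannot count how deeply divs are nested. Since the snippet is stated to run to the end of the cdata section, one possible way to drop the no-nested-divs assumption is a greedy match anchored at the end of the text. The sketch below assumes exactly that, plus an unquoted class attribute as in the sample; the function name extractIndhold and the test input are made up.

```javascript
// Hypothetical helper: assumes the Indhold block is the last thing in the
// cdata text and is wrapped in exactly two divs (Indhold and ExternalClass).
function extractIndhold(cdataText) {
  // Greedy [\s\S]* runs to the end of the text and then backtracks only far
  // enough to leave the last two closing tags, with optional whitespace.
  var pattern = /<div><b>Indhold:<\/b>\s*<div class=ExternalClass[^>]*>([\s\S]*)<\/div>\s*<\/div>\s*$/i;
  var match = pattern.exec(cdataText);
  return match ? match[1] : null;
}

// Made-up input with nested divs and spaces between the closing tags.
var nested =
  "<div><b>Projekt:</b> Clausholmvej</div>\n" +
  "<div><b>Indhold:</b> <div class=ExternalClass1>\n" +
  "<div>a<div>embed1<div>embed2</div></div></div>\n" +
  "<div>tail</div> </div> </div>\n";

var indhold = extractIndhold(nested);
```

If anything other than whitespace ever follows the Indhold block, this pattern returns null, and a depth-counting loop like findIndHold would be the safer choice.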


