Need help with regexp



PassiveSmoking
Apr 16th, 2007, 03:13 PM
I am writing a JavaScript script that is intended to help users look up jargon terms. It pulls the terms in from an XML file, searches for them in the HTML and replaces them with hyperlinks to a page that defines the term. (This is done in a loop that iterates over all the terms defined in the XML file, so that terms can be easily edited.)

At the moment I'm using a simple regexp consisting of just the keyword, or set of keywords, that I want to turn into jargon links. It works pretty well, except there is a danger that the script will replace terms that are already inside hyperlinks or other tags where replacement wouldn't be appropriate. This could potentially break the page.

What I need is a regular expression that matches the keyword, as long as it is not contained within a hyperlink.

For example, one regexp operation might look like this:



searchRegExp.compile ('keystone correction|keystoning', 'ig');
testData.innerHTML = testData.innerHTML.replace (searchRegExp, '<a class="jargonbuster" href="#" onclick="return (jargonDiv (\'keystone_correction\'));">$&</a>');


but I'd really need something like this:



searchRegExp.compile ('Not an <a> tag' + 'keystone correction|keystoning' + 'Not an </a> tag', 'ig');
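
One possible way to get "the keyword, but not inside a link or tag" from a single replace call is to let the pattern also match the things that should be skipped, and only rewrite the match when the keyword branch was the one that fired. The following is a minimal sketch, not the original script: the helper name linkJargon, its parameters and the sample href are assumptions made for illustration, and it assumes attribute values never contain a ">" character.

```javascript
// Sketch only: linkJargon and its parameters are hypothetical names.
// The pattern tries three alternatives at each position, in this order:
//   1. a whole <a ...>...</a> element  -> returned unchanged
//   2. any other single tag <...>       -> returned unchanged (protects attributes)
//   3. the keyword itself               -> captured in group 1 and wrapped in a link
function linkJargon(html, keywordPattern, href) {
  var re = new RegExp(
    '<a\\b[^>]*>[\\s\\S]*?<\\/a>|<[^>]*>|(' + keywordPattern + ')', 'ig');
  return html.replace(re, function (match, keyword) {
    // keyword is undefined (or empty in some old engines) when branch 1 or 2 matched
    return keyword
      ? '<a class="jargonbuster" href="' + href + '">' + keyword + '</a>'
      : match;
  });
}

var sample = '1600 ANSI Lumens <a href="ANSI Lumens">ANSI Lumens</a>';
var linked = linkJargon(sample, 'ansi lumens', '#');
```

Because the existing link is consumed as one match, the keyword inside its href and its text is never seen by the third alternative.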

PassiveSmoking
Apr 17th, 2007, 10:39 AM
Here's my attempt at a solution. It is supposed to match a keyword if it's not enclosed in an <a> ... </a> pair or in a tag attribute (which I define as ="..."). For the most part it seems to work in tests with the JavaScript and with The Regex Coach, but it also ignores any keywords it finds inside quotes, whether or not they are part of an attribute. It seems not to take the = into account.



searchRegExp.compile ('(?!<a.*?>|=")' + xmlTerms[thisTerm].getAttribute ('regex') + '(?!</a>|")', 'ig');

PassiveSmoking
Apr 18th, 2007, 10:22 AM
The solution from my last post doesn't seem to work 100% reliably. Using the following example regex with the following example input text results in matches being made that I don't want.

Example regex (with case ignore and global matching turned on):


(?!<a.*?>|=\")ansi lumens(?!</a>|\")

Example input:


* 1600 ANSI Lumens
* 1600 <a href="ANSI Lumens">ANSI Lumens</a>
* 1600 <a href="Test ANSI Lumens">Test ANSI Lumens</a>
* 1600 <a href="ANSI Lumens Test">ANSI Lumens Test</a>
* 1600 <a href="Test ANSI Lumens Test">Test ANSI Lumens Test</a>


Desired result:


* 1600 ANSI Lumens
* 1600 <a href="ANSI Lumens">ANSI Lumens</a>
* 1600 <a href="Test ANSI Lumens">Test ANSI Lumens</a>
* 1600 <a href="ANSI Lumens Test">ANSI Lumens Test</a>
* 1600 <a href="Test ANSI Lumens Test">Test ANSI Lumens Test</a>


Actual result:


* 1600 ANSI Lumens
* 1600 <a href="ANSI Lumens">ANSI Lumens</a>
* 1600 <a href="Test ANSI Lumens">Test ANSI Lumens</a>
* 1600 <a href="ANSI Lumens Test">ANSI Lumens Test</a>
* 1600 <a href="Test ANSI Lumens Test">Test ANSI Lumens Test</a>
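
A likely reason the prefix has no effect: (?!...) is a lookahead, so it inspects the text starting at the current position, which is the keyword itself, and that text can never begin with <a or =". Checking what comes before the keyword would need a lookbehind, which JavaScript regular expressions did not have at the time. The sketch below, using the example input from this post, compares the regex with and without the prefix; the variable names are illustrative only.

```javascript
// Example input from the post, one string per line.
var input = [
  '* 1600 ANSI Lumens',
  '* 1600 <a href="ANSI Lumens">ANSI Lumens</a>',
  '* 1600 <a href="Test ANSI Lumens">Test ANSI Lumens</a>',
  '* 1600 <a href="ANSI Lumens Test">ANSI Lumens Test</a>',
  '* 1600 <a href="Test ANSI Lumens Test">Test ANSI Lumens Test</a>'
].join('\n');

// The regex from the post, and the same regex with the prefix removed.
var withPrefix = new RegExp('(?!<a.*?>|=")ansi lumens(?!</a>|")', 'ig');
var withoutPrefix = new RegExp('ansi lumens(?!</a>|")', 'ig');

var matchesWithPrefix = input.match(withPrefix) || [];
var matchesWithoutPrefix = input.match(withoutPrefix) || [];

// Both arrays have the same length: the prefix filters nothing out.
var prefixChangesNothing =
  matchesWithPrefix.length === matchesWithoutPrefix.length;
```

The unwanted matches are the ones where the keyword is followed by " Test", since the postfix only looks at the characters immediately after the keyword.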

rwedge
Apr 19th, 2007, 01:56 AM
One way to do it:
<script type="text/javascript">
// Term to look for (case-insensitive)
var term = /ANSI Lumens/i;
var at = new Array('1600 ANSI Lumens',
'1600 <a href="ANSI Lumens">ANSI Lumens</a>',
'1600 <a href="Test ANSI Lumens">Test ANSI Lumens</a>',
'1600 <a href="ANSI Lumens Test">ANSI Lumens Test</a>',
'1600 <a href="Test ANSI Lumens Test">Test ANSI Lumens Test</a>');
for (var i = 0; i < at.length; i++) {
// Strip every <a>...</a> element (non-greedy, so two links on one line
// are removed separately), then test whatever text is left
var n = at[i].replace(/<a\b[\s\S]*?<\/a>/ig, '');
if (term.test(n)) {
alert(n);
}
}
</script>

PassiveSmoking
Apr 19th, 2007, 03:27 PM
Thanks but I can't quite figure out your code well enough to adapt it for my needs.

After experimenting I tried the following as the regex to use, and found that it apparently worked perfectly, except in Internet Explorer 6, where it seems to trigger a bug in the regexp engine. I don't have access to IE7 so can't tell if it works there or not.



searchRegExp.compile ('(?!<a.*?>|=")' + xmlTerms[thisTerm].getAttribute ('regex') + '(?!.*?</a>|")', 'ig');


This code gives the desired results in Firefox and in Opera. The .*? in the second term (the one that's supposed to look for the </a> tag or the " character and prevent the regexp from matching if found) causes some of the text that should match to fail to match.
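
The failure to match described above can be reproduced outside the browser. With .*? inside the negative lookahead, the keyword is rejected whenever a </a> appears anywhere later on the same line, even if the keyword itself is not inside that link. The sample strings below are made up for illustration and are not taken from the real page.

```javascript
// Keyword followed by the postfix from the post above.
var postfixed = /ansi lumens(?!.*?<\/a>|")/i;

// The keyword is plain text, but a link follows later on the same line.
var sameLine = '1600 ANSI Lumens, see the <a href="/specs">full specs</a>';
// Same content, but the link starts on the next line.
var nextLine = '1600 ANSI Lumens,\nsee the <a href="/specs">full specs</a>';

var matchesSameLine = postfixed.test(sameLine); // false: a later </a> blocks it
var matchesNextLine = postfixed.test(nextLine); // true: "." does not cross the newline
```

So the result depends on where line breaks happen to fall in the innerHTML string, which may be one reason the behaviour differs between browsers.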

PassiveSmoking
Apr 23rd, 2007, 01:31 PM
I'm really struggling to find a solution to this problem, so I'm going to divulge as much as I am allowed to about the script and the data it is working with.

An HTML file loads the script, which in turn loads an XML file and parses the HTML file in which it is embedded.

The HTML (minus style information and some other parts):



<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>
Hitachi CP-RX60Z XGA LCD Projector CP-RX60Z from PC World Business
</title>
<!-- Begin jargon buster -->
<link rel="stylesheet" href="styles/jargonbuster.css" type="text/css" media="all" />
<script type="text/javascript">var dataSource = 'xml/projection.xml';</script>
<script type="text/javascript" src="javascript/jargonbuster.js"></script>
<!-- End jargon buster -->
</head>
<body>
<div id="content">
<!-- Page Content Goes Here -->
<div id="bread">
&nbsp; <a href="/catalogue/Displays">Displays</a> <a
href="/catalogue/Displays/Projection">Projection</a> <a
href="/catalogue/Displays/Projection/Projectors">Projectors</a>
</div>
<div class="item">
<h1>
Hitachi CP-RX60Z XGA LCD Projector
</h1>
<p>
Hitachi CP-RX60Z XGA LCD Projector
</p>
<h2>
Product Features
</h2>
<p>
<strong>Ideal solution for those looking for performance at a
feel-good price</strong>
</p>
<ul class="features">
<li>
3LCD Technology
</li>
<li>
<a id="3LCD" href="http://www.example.com/">3LCD</a> Technology
</li>
<li>
<a id ="3LCD test" href="http://www.example.com/">3LCD test</a> Technology
</li>
<li>
<a id="test 3LCD" href="http://www.example.com/">test 3LCD</a> Technology
</li>
<li>
<a id="test 3LCD test" href="http://www.example.com/">test 3LCD test</a> Technology
</li>
<li><abbr>3LCD</abbr></li>
<li>
XGA Resolution
</li>
<li>
1600 ANSI Lumens
</li>
<li>
1.2x optical zoom
</li>
<li>
Multiple Connectivity to accommodate installation
</li>
<li>
Digital Keystone Correction
</li>
<li>
500:1 Contrast Ratio
</li>
<li>
2.3kg
</li>
</ul>
<h2>
Product description
</h2>
<div class="readable">
<p>
This versatile digital projector is the ideal solution for those
looking for performance at a feel-good price. The Hitachi CP-RX60Z
can be used for a number of applications including education,
mobile business, home use and even home theatre. With a powerful
1600 ANSI Lumens and XGA resolution and weighing only 2.3Kg, the
CP-RX60Z offers you more. The CP-RX60Z uses Hitachi's proprietary
video-processing circuits and 3 LCD technology, which gives better
colour reproduction and higher brightness than other technologies.
</p>
</div>
</div>
<!-- Page Content Ends Here -->
</div>
</body>
</html>



The script:



/*
Script for highlighting jargon terms that open a div that loads a definition via AJAX
*/

var regexPrefix = '(?!<a.*?>|\=\")';
var regexPostfix = '(?!.*?</a>|\")';
// var regexPostfix = '(?!\</a\>|\")';
var regexOptions = 'ig';

var mouseX = 0;
var mouseY = 0;

function createDynamicLink (newId, newHref, newContent)

// Generate a new DOM hyperlink
{
newLink = document.createElement ('a');
newLink.id = newId;
newLink.href = newHref;
newLink.appendChild (newContent);
return (newLink);
}

function getMouse (e)

// Get mouse position
{
if (e.pageX)
{
// Standards browsers
mouseX = e.pageX;
mouseY = e.pageY;
}
else
if (window.event)
{
// Internet Explorer
mouseX = window.event.clientX + document.body.scrollLeft;
mouseY = window.event.clientY + document.body.scrollTop;
}
else
{
mouseX = null;
mouseY = null;
}
}

function initializeRequest ()

// Set up XMLHttp
{
var requestObject = null;

// Attempt to create an XMLHttpRequest object (Mozilla, Safari, etc)
if (window.XMLHttpRequest)
{
requestObject = new XMLHttpRequest ();
if (requestObject.overrideMimeType)
{
requestObject.overrideMimeType ('application/xml');
}
}
else
// Attempt to create an ActiveX XMLHTTP object (Internet Explorer)
if (window.ActiveXObject)
{
try
{
requestObject = new ActiveXObject ("Msxml2.XMLHTTP");
}
catch (e)
{
// Try alternative method of creating an ActiveX XMLHTTP object
try
{
requestObject = new ActiveXObject ("Microsoft.XMLHTTP");
}
catch (e) {}
}
}
// Return the created XMLHTTP object on success or NULL on failure
return (requestObject);
}

function makeRequest ()

// Get product data from XML server
{
// If we got a valid XMLHttp object then use it to make a request
if (http_request = initializeRequest ())
{
http_request.onreadystatechange = function () { getJargon (http_request); };
http_request.open ("GET", dataSource, true);
http_request.send (null);
}
}

function getJargon (xmlObject)

// Get jargon terms
{
// Check that we actually have a valid response to process
if (xmlObject.readyState == 4)
{
// Check for successful completion of HTTP session
if (xmlObject.status == 200)
{
// Parse the returned XML
xmlItems = xmlObject.responseXML;
if (xmlItems.getElementsByTagName('jargon')[0])
{
xmlRoot = xmlItems.documentElement;
// Get the jargon terms
if (xmlTerms = xmlRoot.getElementsByTagName ('jargonterm'))
{
// Grab the content div for processing
contentDiv = document.getElementById ('content');
// Set up Regular Expression
searchRegExp = new RegExp ('', '');
// Iterate over jargon term nodes
for (thisTerm = 0; thisTerm < xmlTerms.length; thisTerm++)
{
if (thisDesc = xmlTerms[thisTerm].getElementsByTagName ('description')[0])
// Create an entry for this item in the jargonTerms array
{
jargonTerms [xmlTerms[thisTerm].getAttribute ('id')] = new Object ();
jargonTerms [xmlTerms[thisTerm].getAttribute ('id')]['term'] = xmlTerms[thisTerm].getAttribute ('term');
jargonTerms [xmlTerms[thisTerm].getAttribute ('id')]['description'] = thisDesc.firstChild.data;
jargonTerms [xmlTerms[thisTerm].getAttribute ('id')]['href'] = xmlTerms[thisTerm].getAttribute ('href');
}
// Build a regex string that will look for the term as long as it isn't contained within a hyperlink or double quotes
searchRegExp.compile (regexPrefix + xmlTerms[thisTerm].getAttribute ('regex') + regexPostfix, regexOptions);
// Find the term in the content div and replace it with a link
contentDiv.innerHTML = contentDiv.innerHTML.replace (searchRegExp, '<a class="jargonbuster" href="' + xmlTerms[thisTerm].getAttribute ('href') + '" onclick="window.open (\'' + xmlTerms[thisTerm].getAttribute ('href') + '\', \'Jargon\', \'width=647,height=680,scrollbars=yes,resizable=yes\').focus (); return (false);" onmouseover="return (jargonDiv (\'' + xmlTerms[thisTerm].getAttribute ('id') + '\'));">$&</a>');
}
//document.write (contentDiv.innerHTML);
//document.close ();
}
else
{
// No jargon terms
}
}
else
{
// Malformed response
}
}
else
{
// Server error of some sort
}
}
else
{
// Still waiting for operation to complete
}
}

function jargonDiv (selectedTerm)

// Manage the jargon buster div
{
if (!document.getElementById ('ajaxNode'))
{
// If the div is already visible, then hide it
try
{
document.body.removeChild (this);
}
catch (e) {}
// Update div contents
ajaxHeadTitle.innerHTML = jargonTerms [selectedTerm]['term'];
ajaxNodeText.innerHTML = jargonTerms [selectedTerm]['description'];
// Update More Info link
ajaxLinkMoreInfo.href = jargonTerms [selectedTerm]['href'];
ajaxLinkMoreInfo.onclick = function ()
{
window.open (jargonTerms [selectedTerm]['href'], 'Jargon', 'width=647,height=680,scrollbars=yes,resizable=yes').focus ();
return (false);
}
// Update div position
ajaxNode.style.left = (mouseX - 100).toString () + 'px';
ajaxNode.style.top = (mouseY + 2).toString () + 'px';
// Show the div
document.body.appendChild (ajaxNode);
}
return (false);
}

// ------------------------------------------------------------------------------------------------

// Initialize mouse position monitoring
try
{
// Standards compliant method of capturing mouse
document.addEventListener ('mousemove', getMouse, false);
domMode = 'standard';
}
catch (e)
{
try
{
// Internet Explorer method of capturing mouse
document.attachEvent ('onmousemove', getMouse);
domMode = 'ie';
}
catch (e)
{
try
{
// Last resort method of capturing mouse
document.onmousemove = getMouse;
domMode = 'fallback';
}
catch (e)
{
// All options exhausted
domMode = null;
}
}
}

if (domMode)
{
// Variable initialization
var jargonTerms = new Object ();

// Node used for displaying help
var ajaxNode = document.createElement ('div');
ajaxNode.id = 'ajaxNode';

// Child divs for jargon node
ajaxNodeTitle = document.createElement ('div');
ajaxNodeText = document.createElement ('div');
ajaxNodeFooter = document.createElement ('div');

ajaxNodeTitle.id = 'ajaxNodeTitle';
ajaxNodeText.id = 'ajaxNodeText';
ajaxNodeFooter.id = 'ajaxNodeFooter';

// Help title
ajaxHeadTitle = document.createElement ('h3');
ajaxHeadTitle.id = 'ajaxHeadTitle';
ajaxHeadTitle.innerHTML = 'Test';

// Close div hyperlink
ajaxLinkCloseContent = document.createElement ('img');
ajaxLinkCloseContent.src = 'images/close.gif';
ajaxLinkCloseContent.alt = 'Close';
ajaxLinkClose = createDynamicLink ('ajaxLinkClose', '#', ajaxLinkCloseContent);
ajaxLinkClose.onclick = function ()
{
try
{
document.body.removeChild (ajaxNode);
}
catch (e) {}
return (false);
}

// More Info hyperlink
ajaxLinkMoreInfo = createDynamicLink ('ajaxLinkClose', 'javascript:return (false);', document.createTextNode ('More info'));

// Associate child nodes with the main div
ajaxNode.appendChild (ajaxNodeTitle);
ajaxNode.appendChild (ajaxNodeText);
ajaxNode.appendChild (ajaxNodeFooter);
ajaxNodeTitle.appendChild (ajaxLinkClose);
ajaxNodeTitle.appendChild (ajaxHeadTitle);
ajaxNodeFooter.appendChild (ajaxLinkMoreInfo);
switch (domMode)
{
case 'standard' :
window.addEventListener ('load', makeRequest, false);
break;
case 'ie' :
window.attachEvent ('onload', makeRequest);
break;
case 'fallback' :
window.onload = makeRequest;
break;
default :
break;
}
}


The XML



<?xml version="1.0" encoding="UTF-8"?>
<jargon>
<jargonterm id="3lcd" term="3 LCD" regex="3\s?lcd" href="/glossary/projectors/lcd.jtp">
<description>
&lt;p&gt;
3LCD is the most widely used projection technology system in the world. Light from the projector's lamp is split into red green and blue, and each individual colour is passed through a dedicated LCD before the three colours are recombined to form a full colour image.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="ansi_Lumens" term="ANSI Lumens" regex="ansi\slumens" href="/glossary/projectors/ansi.jtp">
<description>
&lt;p&gt;
ANSI Lumens are a measure of the overall brightness of a projector. In nearly all projectors the centre of the image is brighter than the edges, so ANSI Lumens are calculated as an average brightness across the entire surface of the display.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="aspect_ratio" term="Aspect Ratio" regex="aspect\sratio" href="/glossary/projectors/ratio.jtp">
<description>
&lt;p&gt;
An aspect ratio is a measure of the ratio between the width of a display and its height. The two most common aspect ratios are 4:3 (where a display has 3 units of height for every 4 units of width) and 16:9 (where a display has 9 units of height for every 16 units of width). The 16:9 aspect ratio corresponds to widescreen.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="contrast_ratio" term="Contrast Ratio" regex="contrast\sratio" href="/glossary/projectors/contrastratio.jtp">
<description>
&lt;p&gt;
The contrast ratio of a display is a measure of the difference between the darkest shade it can produce (black) and the brightest shade (white). Generally speaking, the higher the contrast ratio, the better the quality of the projected image will be.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="dlp" term="DLP" regex="dlp" href="/glossary/projectors/lcd.jtp">
<description>
&lt;p&gt;
DLP is short for Digital Light Processing. DLP technology is a revolutionary display solution that uses an optical semiconductor to manipulate light digitally. It is a highly reliable, all-digital display chip that delivers the best picture across a broad range of products, including large screen digital TVs, and projectors for business, home, professional venue and digital cinema.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="lens_shift" term="Lens Shift" regex="lens\sshift" href="/glossary/projectors/lensshift.jtp">
<description>
&lt;p&gt;
Lens shift is a feature of high-end projectors. It is similar to Keystone correction, in that it attempts to compensate for the keystoning effect, except that the effect is achieved by physically adjusting the angle of the lens to square up the image. As lens shifting works optically rather than digitally the quality of the resulting image is generally better than could be achieved with keystone correction.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="keystone_correction" term="Keystone Correction" regex="keystone\scorrection" href="/glossary/projectors/keystone.jtp">
<description>
&lt;p&gt;
'Keystoning' is the name given to the effect on the projected image when the projector sits below or above the centre of the screen. Keystone correction counteracts this effect by digitally compressing the image at the bottom or top, resulting in a squarer, more professional image.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="lamp_hours" term="Lamp Hours" regex="lamp\shours" href="/glossary/projectors/lamphours.jtp">
<description>
&lt;p&gt;
Lamp hours is simply a measure of how long the light source in a projector is expected to last before you will need to replace it. The higher the lamp hours, the longer the bulb is expected to last. Note that this figure is only a guideline.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="resolution" term="Resolution" regex="resolution" href="/glossary/projectors/resolution.jtp">
<description>
&lt;p&gt;
The resolution of a display is a measure of its image quality. Images generated by projectors are made up of a grid of dots, called pixels. The more pixels the projector supports, the more detailed the image it can output will be.
&lt;/p&gt;
</description>
</jargonterm>
<jargonterm id="weight" term="Weight" regex="weight|kg" href="/glossary/projectors/weight.jtp">
<description>
&lt;p&gt;
The weight of a projector will determine how portable it is. Lighter projectors are more easily moved around, but larger desktop projectors still offer more features and are better value for money.
&lt;/p&gt;
</description>
</jargonterm>
</jargon>

PassiveSmoking
Apr 24th, 2007, 01:13 PM
After experimenting I found a regexp that would produce the effects I need, except Internet Explorer's regexp implementation seems to be different enough from other browsers for it to fail under IE. I think I'm going to need to take an entirely different approach.

The regexp I came up with is:



keyword_for_replacement_goes_here(?!.*?(</a>|/>|\"))


I know what I want to do can be done because I've seen sites using similar systems, like Tom's Hardware Guide. I just can't get it to work in IE in a way that doesn't cause unwanted replacements to occur (such as in hyperlinks and HTML attributes)

I really need some help. If it can't be done with a simple regexp, then are there any other options?

PassiveSmoking
Apr 25th, 2007, 01:59 PM
I still haven't solved this but I know what's causing the problem now.

On one test page there are hidden form fields that contain a value that includes the letters "usb". USB is one of the terms that I want to treat as a keyword. I access the contents of the div I want to run my script on via the innerHTML property.

The HTML you get out of an innerHTML property is not what is written in the source. For example, the input field that is causing me problems is written thus in the code:



<input type="hidden" name="code" value="BELUSB01" />


When I document.write out the contents of the innerHTML property of the div in Firefox this changes to:



<input type="hidden" value="BELUSB01" name="code">


but when I do the same thing in IE the result is:



<INPUT TYPE=hidden VALUE=BELUSB01 NAME=code>


I can write a regexp that works properly in the first two cases, but in the third case the HTML has been so badly mangled that it seems impossible to write a regexp that works in all three cases.

Is there a way in IE to correct the contents of innerHTML so that it hasn't been mangled to the extent it appears to be?
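
The three serializations differ in case, attribute order and quoting, but each is still a single <...> tag. A pattern that skips whole tags, rather than looking for quotes, therefore treats all three the same way. The following is a rough sketch under that assumption; markOutsideTags is a hypothetical helper that just brackets the keyword so the effect is easy to see, and it would still fail on attribute values that contain ">".

```javascript
// The same hidden field as written in the source, and as reported
// by innerHTML in Firefox and in IE (copied from the post above).
var serializations = [
  '<input type="hidden" name="code" value="BELUSB01" />',
  '<input type="hidden" value="BELUSB01" name="code">',
  '<INPUT TYPE=hidden VALUE=BELUSB01 NAME=code>'
];

// Hypothetical helper: mark the keyword only where it occurs outside a tag.
function markOutsideTags(html, keywordPattern) {
  var re = new RegExp('<[^>]*>|(' + keywordPattern + ')', 'ig');
  return html.replace(re, function (match, keyword) {
    // A whole tag matched: leave it alone. Otherwise bracket the keyword.
    return keyword ? '[' + keyword + ']' : match;
  });
}

// Append some plain text to each serialization and process it.
var results = [];
for (var i = 0; i < serializations.length; i++) {
  results.push(markOutsideTags(serializations[i] + ' USB cable', 'usb'));
}
```

In each case only the "USB" in the trailing text is marked, and the BELUSB01 value inside the tag is left alone.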

liorean
Apr 25th, 2007, 02:06 PM
Sounds like it would be a smarter idea to use the DOM for this... the innerHTML string isn't guaranteed to follow any special form, and attribute order may be different in other browsers.

PassiveSmoking
Apr 25th, 2007, 04:05 PM
Have you got any advice or know of any tutorials on doing that? As far as I can tell it would require iterating through all the elements inside the content div, finding the text in each fragment, removing the original text, attaching a link and appending it in the location where the original text was. This sounds like a pretty heavyweight piece of code, unless I've missed something. It also sounds like it would be painfully slow, especially on IE which has poor DOM performance.
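
The steps described above (walk the elements, find the text, swap matching text for a link) can be sketched fairly compactly. This is an untested-in-browser illustration, not a drop-in replacement for the script: splitByTerm and linkTerms are hypothetical names, the list of skipped elements is a guess, and performance on large pages in IE has not been measured.

```javascript
// Split a string into plain segments and segments that match the term.
function splitByTerm(text, termRegex) {
  var segments = [];
  var last = 0;
  var re = new RegExp(termRegex.source, 'ig');
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) { re.lastIndex++; continue; } // guard against empty matches
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), isTerm: false });
    }
    segments.push({ text: m[0], isTerm: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), isTerm: false });
  }
  return segments;
}

// Recursively walk the children of node; text inside links, scripts and
// form controls is left alone, other text nodes have the term wrapped.
function linkTerms(node, termRegex, href) {
  var child = node.firstChild;
  while (child) {
    var next = child.nextSibling; // remember it before the tree changes
    if (child.nodeType === 3) {   // 3 = text node
      var segments = splitByTerm(child.nodeValue, termRegex);
      var hasTerm = false;
      for (var i = 0; i < segments.length; i++) {
        if (segments[i].isTerm) { hasTerm = true; }
      }
      if (hasTerm) {
        var fragment = document.createDocumentFragment();
        for (var j = 0; j < segments.length; j++) {
          var textNode = document.createTextNode(segments[j].text);
          if (segments[j].isTerm) {
            var link = document.createElement('a');
            link.className = 'jargonbuster';
            link.href = href;
            link.appendChild(textNode);
            fragment.appendChild(link);
          } else {
            fragment.appendChild(textNode);
          }
        }
        node.replaceChild(fragment, child);
      }
    } else if (child.nodeType === 1 &&  // 1 = element node
               !/^(a|script|style|textarea|select)$/i.test(child.nodeName)) {
      linkTerms(child, termRegex, href);
    }
    child = next;
  }
}

// Pure part of the sketch, usable without a browser:
var parts = splitByTerm('1600 ANSI Lumens and XGA', /ansi\slumens/);
```

In the page it would be called once per term, for example linkTerms(document.getElementById('content'), /ansi\slumens/, '/glossary/projectors/ansi.jtp'). Since attributes are never treated as text, the hidden input values and existing hrefs cannot be touched.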

PassiveSmoking
Apr 25th, 2007, 05:46 PM
While I would love to create a fully standards compliant (or as standards compliant as any code that uses XHR can be) solution there is a deadline to meet.

After a hard day's slaving over The Regex Coach I have come up with a regular expression that seems to do the job adequately in my 3 main testing browsers (IE6, Opera 9, Firefox 2) and thought I'd post it here in case anyone could find a good use for it.

It is a bit of a monster though, and could definitely stand to be improved. If any of you guys can think of improvements then please feel free to reply.



var regexPostfix = '(?!([a-zA-Z0-9 _-]*?"?>|.*? [a-zA-Z0-9_-]*?="?[a-zA-Z0-9 _-]"?|.*?</a>|.*?</option>|.*?" />))';

/*
What does the postfix regex mean?

(?!([a-zA-Z0-9 _-]*?"?>|.*? [a-zA-Z0-9_-]*?="?[a-zA-Z0-9 _-]"?|.*?</a>|.*?</option>|.*?" />)) is added as a regular
expression in the search and replace operation to restrict keyword replacement to prevent page breakage. The
keyword regular expressions retrieved from the XML are appended with it.

The regex means:
(?!( - The keyword must not be followed by:
[a-zA-Z0-9 _-]*?"?> - The end of the containing HTML attribute and tag (A sequence of
characters followed by an optional " and a > character)
|.*? [a-zA-Z0-9_-]*?="?[a-zA-Z0-9 _-]"? - OR the end of the containing HTML attribute followed by another attribute
(A sequence of characters followed by whitespace followed by a sequence of
characters that are valid for an attribute name followed by an = character
followed by an optional " character followed by a sequence of characters
followed by an optional " character)
|.*?</a> - OR an anchor closing tag
|.*?</option> - OR a select box option closing tag
|.*?" /> - OR an XHTML empty tag closing string
))

This regex is not perfect but it does at least seem to ensure unmangled HTML tags in the 3 browsers used for testing
(IE6, Firefox 2, Opera 9)
*/

Ancora
Apr 25th, 2007, 05:52 PM
See:

http://www.freewarefiles.com/program_4_44_26075.html