RegExp to match any tag + contents


ahosang
05-16-2003, 01:25 AM
Let's say I have a string of HTML (as a variable). The string can contain newlines (linefeeds or carriage returns), as HTML often does.
I want to match any <script> tag together with its contents, then replace the match with an empty string.

<html>
<head>
<title></title>
<script>
function stripHtml(strHTML) {
var strOutput="";
scriptPattern=/<script>.*<\/script>/gi;
// Replace all script with an empty string
strOutput = strHTML.replace(scriptPattern, "");
alert(strOutput);
}
</script>

</head>

<body onload="stripHtml(document.body.innerHTML)">
Test Page<br>
<script>
var hello="world";
var second="good";
</script>
<b><u>test block of <i>text</i> to see the script</u></b>
</body>
</html>

This doesn't work as is because of the newlines. It works if you put the script tag and its content on one line, but that's not the workaround I want. I'm trying to make it work with linefeeds.

Say you involve the linefeeds:
scriptPattern=/<script>\r*\n*.*\n*\r*<\/script>/gi;// my other attempt

A script could contain any number of lines. I want the RegExp to match any pair of script tags and whatever lies between them (possibly lots of characters and linefeeds).
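
For reference, a minimal sketch of why the first pattern fails on multi-line scripts: in JavaScript the dot does not match line terminators, so .* stops at the first newline. The sample string and variable names here are made up for illustration, and [\s\S] is just one common way to say "any character, newlines included".

```javascript
// Hypothetical sample input: one script element spanning several lines.
var sample = "a<script>\nvar x = 1;\nvar y = 2;\n</script>b";

// "." never matches a line terminator, so .* cannot cross the newline
// right after <script> and the pattern finds nothing.
var dotPattern = /<script>.*<\/script>/gi;

// [\s\S] (whitespace or non-whitespace) matches any character at all.
var anyCharPattern = /<script>[\s\S]*<\/script>/gi;

var dotResult = sample.replace(dotPattern, "");         // string is unchanged
var anyCharResult = sample.replace(anyCharPattern, ""); // script is removed
```

Note that this version is still greedy, so it is only safe when the string contains a single script element.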

ahosang
05-16-2003, 01:33 AM
Found it!

ahosang
05-16-2003, 01:45 AM
Nope false alarm :-(

cheesebagpipe
05-16-2003, 03:49 AM
scriptPattern=/(\s)?<script(.| )*>(.|\s)*<\/script>(\s)?/gi;

Could use:

scriptPattern=/(\s)*<script(.| )*>(.|\s)*<\/script>(\s)*/gi;

..to lose all whitespace on either end.

ahosang
05-16-2003, 04:01 AM
Actually I did find it!
It's:
scriptPattern=/<script>(.|\s)*?<\/script>/gi;
This makes sure that each <script> is grabbed individually, rather than everything from the 1st <script> to the 2nd </script> (if there are two scripts, for example).
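
A small sketch of the difference, using a made-up string with two script elements and some markup in between. The only change between the two patterns is the ? after the *; the variable names are hypothetical.

```javascript
// Hypothetical input: two scripts with content between them that should survive.
var twoScripts = "<script>\na();\n</script><b>keep</b><script>\nb();\n</script>";

// Greedy: * grabs as much as possible, so the match runs from the
// first <script> all the way to the last </script>.
var greedyPattern = /<script>(.|\s)*<\/script>/gi;

// Non-greedy: *? grabs as little as possible, so each script element
// is matched on its own.
var lazyPattern = /<script>(.|\s)*?<\/script>/gi;

var greedyResult = twoScripts.replace(greedyPattern, ""); // everything is gone
var lazyResult = twoScripts.replace(lazyPattern, "");     // "<b>keep</b>" remains
```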

ahosang
05-16-2003, 04:02 AM
Yeah, I realise I left off the part for attributes inside the <script> tag, like type="text/javascript" etc.
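
One possible way to allow for attributes, as a sketch only: accept anything up to the closing > of the opening tag. The function name stripScripts is hypothetical, and this assumes no attribute value contains a literal > character. A regular expression is not a full HTML parser, so unusual markup can still slip through.

```javascript
// Hypothetical helper: remove every script element, with or without attributes.
function stripScripts(html) {
  // <script\b   opening tag name, followed by a word boundary
  // [^>]*>      any attributes, up to the end of the opening tag
  // [\s\S]*?    the script body, newlines included, as short as possible
  // <\/script\s*>  the closing tag, allowing whitespace before ">"
  var scriptPattern = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;
  return html.replace(scriptPattern, "");
}

var typed = "<p>x</p><script type=\"text/javascript\">\nvar a = 1;\n</script><p>y</p>";
var typedResult = stripScripts(typed);

// The i flag also covers upper-case tags.
var upperResult = stripScripts("1<SCRIPT>\nvar b;\n</SCRIPT>2");
```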

cheesebagpipe
05-16-2003, 04:35 AM
...aargh! Hadn't thought of that. Non-greedy indeed.

scriptPattern = /\s*<script(.| )*>(.|\s)*?<\/script>\s*/gi;
strOutput = strHTML.replace(scriptPattern, '\n');

Personal taste. ;)