Spudhead
09-05-2008, 01:22 PM
This is doing my nut in. I want to go through a chunk of HTML and pull out all the hrefs. I need to do a submatch on each href, to pull out (1) the URL, and (2) the link text.
This is what I've got:
dim oRE,oMatches
set oRE = New RegExp
oRE.IgnoreCase = True
oRE.Global = True
oRE.Pattern = "<a\s+href=""(http://.*?)"">(\s*\n|.+?\s*)</a>"
set oMatches = oRE.Execute(sHTML)
if not isEmpty(oMatches) then
    for c = 0 to oMatches.Count-1
        strURL = oMatches(c).SubMatches(0)
        strContent = oMatches(c).SubMatches(1)
        response.write strURL
    next
end if
It almost works, but seems to fall over on the closing double-quote around the URL. If it comes across the following code, for example:
<a href="http://www.site.com" target="_blank" style="color:#FFFFFF;text-decoration:none;">Linky</a>
Then response.write strURL produces:
http://www.site.com" target="_blank" style="color:#FFFFFF;text-decoration:none;
How can I stop this?
I've tried taking the quotes out of the regexp:
oRE.Pattern = "<a\s+href=" & chr(34) & "(http://.*?)" & chr(34) & ">(\s*\n|.+?\s*)</a>"
but that did nothing.
Also.... I'll readily admit I copied and pasted this regexp, and then tinkered a bit. The bit in bold is confusing me, though:
<a\s+href=""(http://.*?)"">(\s*\n|.+?\s*)</a>
If I mess around with it, it doesn't work :o But what's it doing? It looks like... match zero or more of any single character, then match one or more of that. Huh??
Err... help :D
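For reference, a possible explanation and a minimal sketch of a fix. The pattern above only accepts a closing quote that is immediately followed by ">", so when the tag has more attributes the lazy .*? keeps expanding until it reaches the first "> it can find, which is at the end of the style attribute. Swapping the literal quotes for chr(34) builds exactly the same pattern string, which is why it changed nothing. The sketch below assumes href is the first attribute and is double-quoted, as in the sample link; it uses WScript.Echo so it can be run with cscript, and in an ASP page that would be Response.Write instead.

```vbscript
Option Explicit

Dim oRE, oMatches, oMatch, sHTML

' Sample input taken from the post above
sHTML = "<a href=""http://www.site.com"" target=""_blank"" style=""color:#FFFFFF;text-decoration:none;"">Linky</a>"

Set oRE = New RegExp
oRE.IgnoreCase = True
oRE.Global = True

' [^""]*   : the URL is everything up to (not including) the next double quote
' [^>]*    : skip any further attributes up to the end of the opening tag
' [\s\S]*? : lazily capture the link text, line breaks included
oRE.Pattern = "<a\s+href=""(http://[^""]*)""[^>]*>([\s\S]*?)</a>"

Set oMatches = oRE.Execute(sHTML)

' Execute always returns a collection object, so looping over it is enough;
' the loop body simply never runs when there are no matches.
For Each oMatch In oMatches
    WScript.Echo oMatch.SubMatches(0)   ' the URL
    WScript.Echo oMatch.SubMatches(1)   ' the link text
Next
```

On the confusing part: in .*? the dot means any single character except a line break, the * means zero or more of them, and the trailing ? does not mean "optional" here. After a quantifier it makes the match lazy, so it takes as few characters as possible while still letting the rest of the pattern match. The | inside (\s*\n|.+?\s*) is an either/or: whitespace ending in a line break, or at least one character followed by optional whitespace.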