Ignoring boundaries of text
eightysix
07-31-2007, 02:14 AM
Okay, so I'm parsing an HTML file and I need to ignore the text within certain boundaries. For example:
<!-- START
-->
Blah, blah, blah.
<!-- END
-->
How would I do that? Thanks in advance.
KevinADC
07-31-2007, 05:25 AM
How are you currently parsing the html document?
eightysix
07-31-2007, 05:55 AM
Basically, I'm running the perl script through CLI with the file as an argument like this:
perl script.pl file.html fixed.html
I want to append everything to another file EXCEPT what's in between certain HTML comment tags:
<!-- KEEP THIS
Bar
-->
<!-- DELETE THIS BEGIN
-->
Foo
<!-- DELETE THIS END
-->
<!-- KEEP THIS
Bar
-->
So in my new file, I just want to keep what's in the KEEP THIS comment. I was thinking of doing regex matches, but I want everything between a certain range, not just what's on the line where the regex picks up a match.
nkrgupta
07-31-2007, 07:46 AM
You might be better off using the module HTML::Parser (http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm). If you still want to use a regex, look at the m, s and x switches described in http://www.perl.com/doc/manual/html/pod/perlre.html#DESCRIPTION
Naveen
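As a rough illustration of the regex route mentioned above, the sketch below reads the whole document into one string and uses the s switch so that "." also matches newlines, plus the x switch so the pattern can be laid out with comments. The marker names come from the example earlier in the thread; the sub name strip_delete_blocks and the sample text are made up for illustration, so treat this as a starting point rather than a finished solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: removes everything from a "DELETE THIS BEGIN" comment
# through the end of the following "DELETE THIS END" comment.
sub strip_delete_blocks {
    my ($html) = @_;
    # /s lets "." match newlines, /x ignores whitespace in the pattern,
    # /g repeats the substitution for every block in the string.
    # ".*?" is non-greedy, so each block stops at the first END marker.
    $html =~ s{
        <!--\s*DELETE\ THIS\ BEGIN .*? -->   # opening marker comment
        .*?                                  # the text to discard
        <!--\s*DELETE\ THIS\ END .*? -->     # closing marker comment
        \n?                                  # the newline after it, if any
    }{}gsx;
    return $html;
}

# Sample input modelled on the example in the thread.
my $sample = join "\n",
    '<!-- KEEP THIS', 'Bar', '-->',
    '<!-- DELETE THIS BEGIN', '-->', 'Foo', '<!-- DELETE THIS END', '-->',
    '<!-- KEEP THIS', 'Bar', '-->', '';

# Prints only the two KEEP THIS comments.
print strip_delete_blocks($sample);
```

This only works when the whole file is held in a single string; a pattern applied to one line at a time never sees the text on the following lines.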
KevinADC
07-31-2007, 08:20 PM
Basically, I'm running the perl script through CLI with the file as an argument like this:
perl script.pl file.html fixed.html
OK, so what is script.pl?
eightysix
07-31-2007, 09:24 PM
Here's a code snippet of what I've got so far:
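#!/usr/bin/perl
use warnings;
use strict;
my $in = shift;
my $out = shift;
open(IN, '<', $in) or die "Error: $!";
open(OUT, '>', $out) or die "Error: $!";
# One read loop only: a second, nested while(<IN>) would swallow the first line.
while(<IN>) {
    if(m/<title>/ or m/<link rel.*>/) {
        s/<title>.*/<title>NEW TITLE<\/title>/;    # replace the title
        s/<link rel.*>/\n/;                         # drop the stylesheet link
    }
    s/<!--.*/\n/;    # blank out the line that opens a comment
    s/.*-->/\n/;     # blank out the line that closes a comment
    print OUT $_, "\n";
}
close(IN);
close(OUT);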
And here's a snippet of the format of the HTML file:
<!-- HEADER START
-->
<title>OLD TITLE</title>
<link rel="stylesheet" type="text/css" href="stylesheet.css">
<!-- HEADER END
-->
However, there are other HTML comments with START and END in the file and I would like to get rid of both the comments and what's in between them.
FishMonger
07-31-2007, 10:31 PM
It would be best to use a parser as nkrgupta suggests, but see if this does what you want.
while (<IN>) {
    next if /^<!-- / .. /^-->/;
    if(/<title>/ or /<link rel.*>/) {
        s/<title>.*/<title>NEW TITLE<\/title>/;
        s/<link rel.*>/\n/;
    }
    print OUT;
}
eightysix
07-31-2007, 10:56 PM
Excellent! Works like a charm. However, I still have a trailing "-->". Is there a way to regex match to the next line? I tried this but it didn't work:
next if /^<!--\s*HEADER START/ .. /^<!--\s*HEADER END\s?^-->/;
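A pattern like the one above cannot match, because each read from the file puts a single line in $_, so the "-->" on the following line is never part of the string being tested. One possible way around this, sketched under assumptions, is to track the state explicitly and keep skipping until the "-->" that closes the END comment has gone by. The sub name drop_header_block and the state names are hypothetical, and the sample lines are modelled on the HTML snippet earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: drops everything from the "HEADER START" comment
# through the "-->" that closes the "HEADER END" comment.
sub drop_header_block {
    my @kept;
    my $state = 'keep';    # one of 'keep', 'in_block', 'closing'
    for my $line (@_) {
        if ($state eq 'keep') {
            if ($line =~ /^<!--\s*HEADER START/) {
                $state = 'in_block';    # start skipping with this line
                next;
            }
            push @kept, $line;
        }
        elsif ($state eq 'in_block') {
            # Once the END marker shows up, wait for its closing "-->".
            $state = 'closing' if $line =~ /^<!--\s*HEADER END/;
            # Covers an END comment that opens and closes on one line.
            $state = 'keep' if $state eq 'closing' && $line =~ /-->/;
        }
        else {
            # 'closing': skip until the "-->" line, then resume keeping.
            $state = 'keep' if $line =~ /-->/;
        }
    }
    return @kept;
}

my @lines = (
    "<!-- HEADER START\n", "-->\n",
    "<title>OLD TITLE</title>\n",
    "<!-- HEADER END\n", "-->\n",
    "<p>body</p>\n",
);

# Prints only the <p>body</p> line.
print drop_header_block(@lines);
```

In the real script the same logic would sit inside the while (<IN>) loop, with print OUT in place of the push.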
FishMonger
07-31-2007, 11:15 PM
Try removing the anchors:
next if /<!--/ .. /-->/;
eightysix
07-31-2007, 11:23 PM
The thing is that there are parts of the code enclosed between certain HTML comments that I want to keep. So doing the above would nuke pretty much the whole page. :(
FishMonger
07-31-2007, 11:39 PM
Without seeing a more complete example of your file, with an explanation of which comment sections you want to keep and which you want deleted, it is difficult to give an exact solution.
Try this version:
next if /<!--\s*\w+ START/ .. /-->/ or /<!--\s*\w+ END/ .. /-->/;
You should also do a Google search on 'perl flip-flop operator', which will help you understand what the expression is doing and how to modify it to suit your needs.
EDIT:
next if /<!--\s*\w+ (START|END)/ .. /-->/;
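To see what the flip-flop in that last line does, here is a small self-contained sketch that runs it over a list of lines held in memory instead of a file. The sub name drop_marker_comments and the sample lines are made up for illustration. Note that this removes only the START and END marker comments themselves; the lines between them, and any other comments, are kept.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that survive the flip-flop filter.
sub drop_marker_comments {
    my @kept;
    for (@_) {
        # In scalar context ".." turns true on a line matching the left
        # pattern and stays true up to and including the next line that
        # matches the right pattern; then it resets.
        next if /<!--\s*\w+ (START|END)/ .. /-->/;
        push @kept, $_;
    }
    return @kept;
}

my @lines = (
    "<!-- HEADER START\n", "-->\n",
    "<title>OLD TITLE</title>\n",
    "<!-- HEADER END\n", "-->\n",
    "<!-- KEEP THIS\n", "Bar\n", "-->\n",
);

# Prints the <title> line and the complete KEEP THIS comment.
print drop_marker_comments(@lines);
```

The "\w+ " part of the left pattern expects exactly one word before START or END, so a marker such as "DELETE THIS BEGIN" would need a different pattern.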
eightysix
08-01-2007, 06:33 AM
@FishMonger: Thanks for the research material. I was aware of the "flip-flop operator", although I didn't know it was called that. I also didn't know that it could be used that way, since I was under the impression it only worked with number ranges.
I pretty much get the gist of how to parse the data now. Thanks for the swift replies, everyone. They were really helpful. :thumbsup:
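On the point about number ranges: in scalar context the same operator accepts constant numbers as well as patterns, and a constant operand is compared against $. (the current input line number). The sketch below is only an illustration; the sub name lines_2_to_3 and the sample text are made up, and an in-memory filehandle stands in for a real file so that $. is set.

```perl
use strict;
use warnings;

# Hypothetical helper: returns input lines 2 through 3 of the given text.
sub lines_2_to_3 {
    my ($text) = @_;
    # Reading from a reference to a scalar gives an in-memory filehandle.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my @picked;
    while (my $line = <$fh>) {
        # Constant operands are compared with $., the line number of the
        # last filehandle read, so this is true for lines 2 and 3 only.
        push @picked, $line if 2 .. 3;
    }
    close $fh;
    return @picked;
}

# Prints "b" and "c", each on its own line.
print lines_2_to_3("a\nb\nc\nd\n");
```

Replacing the two constants with patterns gives the line-range behaviour used in the replies above.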