parsing difficulty using regex


shadkeene
09-19-2007, 03:36 AM
Hi,
I'm parsing multiple lines of data and will eventually be comparing values. However, I've run into some difficulty looping through two sets of data. Each set should be an array of three weather observations, but I'm only able to extract the wind direction from the first line instead of from all three. As a result, I end up with two wind directions instead of six.
I printed $linelen (the element count of @lines) to check whether my attempt to split the string into three elements (three different weather observations) was failing, and it seems to be. I'm splitting on the newline character, so I'm wondering what's wrong.

Another issue I just ran into is using the "|" operator to match one pattern or another. Sometimes winds are in the format 27015KT and other times 27015G25KT, so I tried to match both patterns, but it only works for the first, not the second. Thanks for any help.
Here's my code:

#!/perl/bin/perl -w
use CGI qw(:standard);
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);
use strict;
use LWP::Simple;
use HTML::TokeParser::Simple;

print header;
print start_html("WindshftObs");

my $sjc="sjc";
my $sfo="sfo";
my $sql="sql";




my @data = Winds($sfo, $sjc, $sql);
print "@data";


sub Winds
{
    return "Error: No argument sent to Winds" unless @_;
    my @apt = @_;
    my @data;

    foreach my $icao (@apt) {
        my $url = "http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=&sid=K$icao&num=3&raw=3&dbn=m&banner=off";
        my $content= get($url) or die "Error getting file: $!";
        my @lines = split(/'\n'/, $content);
        my $linelen = scalar(@lines);
        print "$linelen<br>";
        foreach my $line (@lines) {
            print "$line<br><br>";
            if ($line =~ (/(\d{3})+\d{2}KT|(\d{3})+\d{2}G\d{2}KT/)) {
                push @data, $1;
            }
        }
    }
    return @data;
}

print end_html;

And my results are:
1
KSFO 190056Z 26018G26KT 10SM FEW014 16/09 A2993 AO2 PK WND 27029/0027 SLP135 T01610094
KSFO 182356Z 26017G28KT 10SM FEW014 17/09 A2994 AO2 PK WND 26028/2349 SLP137 T01670094 10200 20167 56008
KSFO 182256Z 26014KT 10SM FEW014 17/10 A2995 AO2 SLP141 T01720100


1
KSJC 190053Z 34016KT 10SM FEW026 19/10 A2991 AO2 PK WND 34027/0024 SLP126 T01940100
KSJC 182353Z 33016G21KT 10SM FEW026 22/11 A2990 AO2 SLP124 T02170106 10250 20189 56014
KSJC 182253Z 32011KT 10SM CLR 23/09 A2991 AO2 SLP127 T02330094


1
KSQL 190047Z 25014G22KT 10SM SKC 17/10 A2993
KSQL 182347Z 27020G25KT 10SM SKC 18/10 A2992
KSQL 182247Z 29014KT 10SM SKC 21/09 A2992


340

I only get one value because the regex is not capturing the 25014G22KT pattern, and the three lines appear to be one string rather than an array of three strings. Thanks again for any help,
Shad

FishMonger
09-19-2007, 04:42 AM
Take out the single quotes in the regex. Inside the slashes they are matched as literal characters.

my @lines = split(/'\n'/, $content);

should be
my @lines = split(/\n/, $content);
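
To see the difference in isolation, here is a minimal sketch. The sample string and the count_pieces helper are made up for illustration and are not part of the script above; the sample assumes plain newlines between observations.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: three observations separated by plain newlines.
my $sample = "KSFO 26018G26KT\nKSFO 26017G28KT\nKSFO 26014KT";

# Illustrative helper: how many pieces does split produce for a pattern?
sub count_pieces {
    my ($pattern, $text) = @_;
    my @pieces = split $pattern, $text;
    return scalar @pieces;
}

# Quotes are literal inside a regex, so this looks for quote-newline-quote
# and never finds it: the whole string comes back as one piece.
print "with quotes:    ", count_pieces(qr/'\n'/, $sample), "\n";

# Plain newline: one piece per line.
print "without quotes: ", count_pieces(qr/\n/, $sample), "\n";
```

With this sample the first count is 1 and the second is 3.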

shadkeene
09-19-2007, 05:36 AM
FishMonger,
Thanks for the help, but when I take out the single quotes, I'm still getting the same output. I'm stumped, but will look at more options.
I'm also confused as to why I'm getting "11" for the size of the array...
Shad

FishMonger
09-19-2007, 06:46 AM
What output are you expecting?
What did you expect the array size to be?

Let's strip down your script and run a test.

#!/perl/bin/perl

use warnings;
use strict;
use LWP::Simple;
use Data::Dumper;

my $sjc="sjc";
my $sfo="sfo";
my $sql="sql";

Winds($sjc, $sfo);

sub Winds
{
    return "Error: No argument sent to Winds" unless @_;
    my @apt = @_;

    foreach my $icao (@apt) {
        my $url = "http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=&sid=K$icao&num=3&raw=3&dbn=m&banner=off";
        my $content= get($url) or die "Error getting file: $!";
        my @lines = split(/\n/, $content);
        print "$url\n";
        print Dumper @lines;
        print "\n\n";
    }
}

Outputs:
http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=&sid=Ksjc&num=3&raw=3&dbn=m&banner=off
$VAR1 = '<html><head><style type=\'text/css\'>';
$VAR2 = '<!--';
$VAR3 = 'A:link { color: #0000FF; text-decoration: none; font-family: Arial, Helvetica, San Serif}';
$VAR4 = 'A:Visited { color: #0000FF; text-decoration: none; font-family: Arial, Helvetica, San Serif}';
$VAR5 = 'A:hover { color : #FF0000;text-decoration: underline; font-family: Arial, Helvetica, San Serif}';
$VAR6 = 'table { font-size: 9pt; font-family: Arial, Helvetica, San Serif}';
$VAR7 = 'A { font-size: 9pt; font-family: Arial, Helvetica, San Serif}';
$VAR8 = '.formbox { margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px}';
$VAR9 = '-->';
$VAR10 = '</style></head>';
$VAR11 = '<html><head><meta http-equiv=\'refresh\' content=\'300\'><title>OBS:San Jose, San Jose International Airport</title>KSJC 190353Z 35007KT 10SM CLR 16/08 A2993 AO2 SLP134 T01610083<br/>KSJC 190253Z 34007KT 10SM FEW016 17/09 A2992 AO2 SLP130 T01670089 51006<br/>KSJC 190153Z 34014KT 10SM FEW022 18/10 A2991 AO2 SLP129 T01780100<br/></body></html>';


http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=&sid=Ksfo&num=3&raw=3&dbn=m&banner=off
$VAR1 = '<html><head><style type=\'text/css\'>';
$VAR2 = '<!--';
$VAR3 = 'A:link { color: #0000FF; text-decoration: none; font-family: Arial, Helvetica, San Serif}';
$VAR4 = 'A:Visited { color: #0000FF; text-decoration: none; font-family: Arial, Helvetica, San Serif}';
$VAR5 = 'A:hover { color : #FF0000;text-decoration: underline; font-family: Arial, Helvetica, San Serif}';
$VAR6 = 'table { font-size: 9pt; font-family: Arial, Helvetica, San Serif}';
$VAR7 = 'A { font-size: 9pt; font-family: Arial, Helvetica, San Serif}';
$VAR8 = '.formbox { margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px}';
$VAR9 = '-->';
$VAR10 = '</style></head>';
$VAR11 = '<html><head><meta http-equiv=\'refresh\' content=\'300\'><title>OBS:San Francisco, San Francisco International Airport</title>KSFO 190356Z 25013KT 10SM FEW015 14/09 A2992 AO2 SLP132 T01440094<br/>KSFO 190256Z 25011KT 10SM FEW015 14/09 A2992 AO2 SLP130 T01440094 58006<br/>KSFO 190156Z 26014G21KT 10SM FEW014 15/09 A2992 AO2 PK WND 27026/0113 SLP133 T01500094<br/></body></html>';


Which is exactly what I'd expect. What output would you expect?
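
For reference, the dump also explains the "11": the page has ten lines of markup, and all three observations sit on the eleventh line, separated by <br/> tags rather than newlines. The sketch below only illustrates that diagnosis on a cut-down, made-up copy of the page (the $content string and the count_br helper are assumptions for the example, not something to parse the real page with).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative helper: count the <br> / <br/> tags in a string.
sub count_br {
    my ($text) = @_;
    my $n = () = $text =~ /<br\s*\/?>/gi;
    return $n;
}

# Cut-down, hypothetical copy of the page: markup on separate lines, then
# every observation on one final line separated by <br/> tags.
my $content = join "\n",
    '<html><head>',
    '</style></head>',
    '<title>OBS</title>KSJC 190353Z 35007KT<br/>KSJC 190253Z 34007KT<br/>KSJC 190153Z 34014KT<br/></body></html>';

my @lines = split /\n/, $content;

print "lines after split on newline: ", scalar(@lines), "\n";
print "br tags on the last line:     ", count_br($lines[-1]), "\n";
```

So splitting on newlines can never give one element per observation for this page, whatever the regex looks like.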

shadkeene
09-20-2007, 05:31 PM
FishMonger,
My goal is to end up with 2 arrays of size 3 each, which would be 3 lines of observations per loop (SFO and SJC airports), disregarding the SQL airport. It seems that my code is a bit too simple, as the URL returns not just text but some HTML as well. The page returns the latest three observations from any given airport. Would you suggest HTML::TokeParser::Simple to glean the 3 lines of text for each ICAO? I'm looking to get the following, with each observation separated by a newline, so I can parse the elements I need from the six lines and place them in variables:


KSFO 190356Z 25013KT 10SM FEW015 14/09 A2992 AO2 SLP132 T01440094
KSFO 190256Z 25011KT 10SM FEW015 14/09 A2992 AO2 SLP130 T01440094 58006
KSFO 190156Z 26014G21KT 10SM FEW014 15/09 A2992 AO2 PK WND 27026/0113 SLP133 T01500094

KSJC 190353Z 35007KT 10SM CLR 16/08 A2993 AO2 SLP134 T01610083
KSJC 190253Z 34007KT 10SM FEW016 17/09 A2992 AO2 SLP130 T01670089 51006
KSJC 190153Z 34014KT 10SM FEW022 18/10 A2991 AO2 SLP129 T01780100


Thanks for any guidance. I guess I'm used to working with more text-oriented files, and the HTML responses are throwing me off.
Shad

FishMonger
09-20-2007, 10:19 PM
Parsing HTML with a regex is almost never a good approach. You should be using an HTML parser such as the HTML::TokeParser::Simple module that you're thinking of using. I'm not sure that building 2 arrays is the best approach. I'd probably use a hash of arrays.

Here's a working test script (it uses plain HTML::TokeParser); however, I'm just printing the data instead of building the arrays.

#!/usr/bin/perl

use warnings;
use strict;
use CGI qw(:standard);
use CGI::Carp qw(fatalsToBrowser);
use LWP::Simple qw(!head);
use HTML::TokeParser;

print header;
print start_html("WindshftObs");

my $sjc="sjc";
my $sfo="sfo";
my $sql="sql";

Winds($sfo, $sjc, $sql);


sub Winds
{
    return "Error: No argument sent to Winds" unless @_;
    my @apt = @_;

    foreach my $icao (@apt) {
        my $url = "http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=&sid=K$icao&num=3&raw=3&dbn=m&banner=off";
        my $content= get($url) or die "Error getting file: $!";
        my $p = HTML::TokeParser->new(\$content) || die "Can't open: $!";

        $p->empty_element_tags(1); # configure its behaviour

        # Text tokens carry their text in element 1; print the ones that
        # start with the station id (e.g. KSFO), one observation per line.
        while (my $token = $p->get_token) {
            print $token->[1] . "<br>\n" if $token->[1] =~ /^k$icao/i;
        }
        print "<br>\n";
    }
}

print end_html;
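
As for the hash of arrays and the "|" question, here is a rough sketch of one way to go, working on observation lines that have already been pulled out of the page. The parse_wind helper and the %winds layout are assumptions for illustration, not the only way to do it. In the original pattern the second alternative has its own capture group, so when the gust form matches, the direction lands in $2 and $1 is undefined; a single pattern with an optional gust part keeps the direction in $1 either way. The pattern only covers the two formats mentioned in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse the wind group of one raw observation,
# e.g. 27015KT or 27015G25KT. Returns (direction, speed, gust);
# gust is undef when there is no G group. Returns an empty list on no match.
sub parse_wind {
    my ($obs) = @_;
    if ($obs =~ /\b(\d{3})(\d{2,3})(?:G(\d{2,3}))?KT\b/) {
        return ($1, $2, $3);
    }
    return;
}

# Observation lines copied from earlier in this thread.
my @obs = (
    'KSFO 190356Z 25013KT 10SM FEW015 14/09 A2992 AO2 SLP132 T01440094',
    'KSFO 190256Z 25011KT 10SM FEW015 14/09 A2992 AO2 SLP130 T01440094 58006',
    'KSFO 190156Z 26014G21KT 10SM FEW014 15/09 A2992 AO2 PK WND 27026/0113 SLP133 T01500094',
    'KSJC 190353Z 35007KT 10SM CLR 16/08 A2993 AO2 SLP134 T01610083',
    'KSJC 190253Z 34007KT 10SM FEW016 17/09 A2992 AO2 SLP130 T01670089 51006',
    'KSJC 190153Z 34014KT 10SM FEW022 18/10 A2991 AO2 SLP129 T01780100',
);

my %winds;   # hash of arrays: station id => list of wind directions
for my $line (@obs) {
    my ($station) = $line =~ /^(K\w{3})\b/ or next;   # skip lines without a station id
    my ($dir)     = parse_wind($line)      or next;   # skip lines without a wind group
    push @{ $winds{$station} }, $dir;
}

for my $station (sort keys %winds) {
    print "$station: @{ $winds{$station} }\n";
}
```

For the six lines above this prints "KSFO: 250 250 260" and "KSJC: 350 340 340". Adding another airport needs no new variables, which is the main reason to prefer the hash over separate arrays.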

shadkeene
09-20-2007, 10:38 PM
FishMonger,
Thanks for the working script. I've built the arrays now and am on my way.
Shad