View Full Version : Newbie converting from PHP - Regular expression



Phil Jackson
12-06-2009, 04:40 PM
Hi all, I'm a PHP man, but a certain script of mine is not performing well in PHP, so I'm trying to learn Perl.

Could someone point me in the right direction as to why this is not working?



#!/usr/bin/perl
my $url = 'http://www.actwebdesigns.co.uk/';

use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content; #die unless content is found

if(@links = ($content =~ m#<a[^>]+)>#isg)){
foreach (@links) {
print $_ . "\n";
}
}else{
print "Could not find links.";
}


any help much appreciated

FishMonger
12-06-2009, 05:04 PM
You're missing the opening capturing paren in the regex and the closing paren is in the wrong place.

All Perl scripts should include the strict and warnings pragmas. Those pragmas will point out lots of problems that can be difficult to track down.
In this case the error is
Unmatched ) in regex; marked by <-- HERE in m/<a[^>]+) <-- HERE / at ..
Here's the corrected version.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

my $url = 'http://www.actwebdesigns.co.uk/';
my $content = get $url;
die "Couldn't get $url" unless defined $content; #die unless content is found

if(my @links = ($content =~ m#(<a[^>]+>)#isg)){

print "$_\n" for @links;

}else{
print "Could not find links.";
}

FishMonger
12-06-2009, 05:14 PM
You may want to look at:

HTML::LinkExtor - Extract links from an HTML document
http://search.cpan.org/~gaas/HTML-Parser-3.64/lib/HTML/LinkExtor.pm

Phil Jackson
12-06-2009, 05:23 PM
#!/usr/bin/perl
use strict;
use warnings;

print "\n";
print "Enter a website: http://www.";
my $url = 'http://www.'.<>.'/';

use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content; #die unless content is found

#@pagesArray = (url);
#@foundPagesArray = ();
#@mainPagesArray = ();

if(@links = ($content =~ m#<(?:(?:a)|(?:area))[^>]*href=\"([^\"\#\?]+(?:(?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/)))\"#isg)){
foreach (@links) {
print $_ . "\n";
}
}else{
print "Could not find links.";
}

lovely, yet again!

fixed that and added those strict things, but now getting error messages.

one being

Global symbol "@links" requires explicit package name at C:\perlscripts\hello.pl line 17.

The other thing being i changed


my $url = 'http://www.website.co.uk/';
to

my $url = 'http://www.'.<>.'/';

but keeps saying that it couldn't get the content when it could before... Any ideas?

oesxyl
12-06-2009, 05:41 PM
if(my @links = ($content =~ m#<(?:(?:a)|(?:area))[^>]*href=\"([^\"\#\?]+(?:(?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/)))\"#isg)){
foreach (@links) {
print $_ . "\n";
}
}else{
print "Could not find links.";
}

lovely, yet again!

fixed that and added those strict things, but now getting error messages.

one being

Global symbol "@links" requires explicit package name at C:\perlscripts\hello.pl line 17.

In Perl with the strict pragma you must declare a variable with my or our the first time you use it. If you look at FishMonger's code you will see a my before @links.
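A minimal sketch of that rule, assuming nothing beyond core Perl. The helper name compiles_ok is made up for illustration; it compiles a snippet under strict and reports whether the compile succeeded.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet under strict and report
# whether it compiled (1) or was rejected (0).
sub compiles_ok {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? 1 : 0;
}

# Declared with my: accepted.
print 'declared:   ', compiles_ok('my @links = (1, 2, 3)'), "\n";

# Not declared: strict rejects it with
# 'Global symbol "@links" requires explicit package name'.
print 'undeclared: ', compiles_ok('@links = (1, 2, 3)'), "\n";
```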




my $url = 'http://www.'.<>.'/';

but keeps saying that it couldn't get the content when it could before... Any ideas?
<> returns the line together with the newline that ends it, so the URL you build has a line break in the middle of it.
You can use chomp to remove it.
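A small sketch of what chomp does, using a string in place of keyboard input so it runs without typing anything. The helper name build_url and the domain example.com are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a raw input line into the URL to fetch.
sub build_url {
    my ($line) = @_;
    chomp $line;                 # edits $line in place, removing the newline
    return "http://www.$line/";
}

my $typed   = "example.com\n";   # what <> would return for a typed line
my $copy    = $typed;
my $removed = chomp $copy;       # the return value is a count, not the string

print "chomp removed $removed character(s)\n";
print build_url($typed), "\n";
```

Because chomp returns a count, it is called on its own line (or wrapped around the assignment) rather than concatenated into the URL.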

A URL can be mailto:..., absolute ('http://....') or relative (simply 'index.php'). It will be hard to reimplement all of that in a regex. As a side note, for and foreach are synonyms in Perl, so the one-line print ... for @links in FishMonger's code does the same job as your foreach loop, just more compactly.
Therefore it is better to use the HTML::LinkExtor module; see FishMonger's post for the link.

best regards

Phil Jackson
12-06-2009, 05:49 PM
Thanks, i will look into that.. is this how chomp would be used?


my $content = get "http://www" . chomp($url) . "/";

oesxyl
12-06-2009, 05:55 PM
Thanks, i will look into that.. is this how chomp would be used?


my $content = get "http://www" . chomp($url) . "/";
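Not quite. chomp changes its argument in place and returns the number of characters it removed, not the cleaned string, so that line would put a number into the URL. Call chomp($url); on its own line first and then build the string from $url. There is also a dot missing after www.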

A suggestion: you can use Data::Dumper the way you use var_dump in PHP to see the content of one or more variables. Pass arrays and hashes by reference so that each one is dumped as a single structure. For example:



use Data::Dumper;
print Dumper(\@links, $content, $url);


best regards

FishMonger
12-06-2009, 06:02 PM
#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

print "\nEnter a website: http://www.";
chomp(my $domain = <>);

my $url = "http://www.$domain";
my $content = get $url;

die "Couldn't get $url" unless defined $content; #die unless content is found

#@pagesArray = (url);
#@foundPagesArray = ();
#@mainPagesArray = ();

# complex regex's like this are rarely needed
# and in most cases they're the wrong approach
# this one could be reduced, but the module I pointed to would be better.
if(my @links = ($content =~ m#<(?:(?:a)|(?:area))[^>]*href=\"([^\"\#\?]+(?:(?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/)))\"#isg)){
foreach (@links) {
print $_ . "\n";
}
}else{
print "Could not find links.";
}

FishMonger
12-06-2009, 06:13 PM
use warnings;
use YAPE::Regex::Explain;

print YAPE::Regex::Explain->new('m#<(?:(?:a)|(?:area))[^>]*href=\"([^\"\#\?]+(?:(?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/)))\"#isg')->explain;

Outputs:

The regular expression:

(?-imsx:m#<(?:(?:a)|(?:area))[^>]*href=\"([^\"\#\?]+(?:(?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/)))\"#isg)

matches as follows:

NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
m#< 'm#<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
a 'a'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
area 'area'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
[^>]* any character except: '>' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
href= 'href='
----------------------------------------------------------------------
\" '"'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^\"\#\?]+ any character except: '\"', '\#', '\?'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
html 'html'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
php 'php'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
aspx 'aspx'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
htm 'htm'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
asp 'asp'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
shtml 'shtml'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\" '"'
----------------------------------------------------------------------
#isg '#isg'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------

Phil Jackson
12-06-2009, 06:20 PM
Hmm, very helpful material guys. Shall keep on chuggin through!

Phil Jackson
12-06-2009, 06:35 PM
Sorry again but does anybody know if there is a function to count the number of elements in an array?

oesxyl
12-06-2009, 06:38 PM
Sorry again but does anybody know if there is a function to count the number of elements in an array?


print scalar @links;


You will often see $#links in scripts as well; that is the index of the last element, so it is one less than the count.
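A short sketch of both forms side by side. The helper names count_of and last_index_of are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two expressions.
sub count_of      { my @list = @_; return scalar @list; }
sub last_index_of { my @list = @_; return $#list; }

my @links = ('a.html', 'b.html', 'c.html');

print 'count:      ', count_of(@links), "\n";
print 'last index: ', last_index_of(@links), "\n";

# An array used in numeric or boolean context also gives its length.
print "there are links\n" if @links > 0;
```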

best regards

Phil Jackson
12-06-2009, 07:06 PM
Was wondering if you could help. I wanted to be able to find the value and key from an array. So i did some googling and found this;



# initialize a hash structure (name, age)
my %persons = (John => 25, Anne =>32, Paul =>22, Smith => 29);
foreach my $name ( %persons ) {
print $name . $persons{$name} . "\n";
}


I'm hoping someone could explain it slightly.
1) how come the array uses % and not @
2) im struggling to work out how this has come about being the key: $persons{$name}

any help much appreciated...again

Phil Jackson
12-06-2009, 07:08 PM
does $persons{$name} mean "get key from %persons where value = $name value" ?

oesxyl
12-06-2009, 07:47 PM
does $persons{$name} mean "get key from %persons where value = $name value" ?
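No, it is the other way round: $persons{$name} returns the value stored in %persons under the key $name. foreach takes each element of the list in turn and assigns it to $name.

The code you found loops over %persons itself, which flattens into key, value, key, value, so $name is a value half of the time. keys and values return the keys or the values of a hash as a list, and foreach expects a list inside the round brackets, so loop over keys %persons: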


# initialize a hash structure (name, age)
my %persons = (John => 25, Anne =>32, Paul =>22, Smith => 29);
foreach my $name (keys %persons ) {
print $name . $persons{$name} . "\n";
}


In Perl you can also give print a comma-separated list instead of joining the parts with . like this:


print $name, $persons{$name}, "\n";


best regards

FishMonger
12-06-2009, 09:33 PM
Unlike PHP, Perl uses a different sigil for each data type.

$var is a scalar
@var is an array
%var is a hash (also called an associative array)
&something is a subroutine
*something is a typeglob


print $name . $persons{$name} . "\n";
# or
print $name, $persons{$name}, "\n";
is often written as

print "$name$persons{$name}\n";

bazz
12-06-2009, 10:56 PM
Was wondering if you could help. I wanted to be able to find the value and key from an array. So i did some googling and found this;



# initialize a hash structure (name, age)
my %persons = (John => 25, Anne =>32, Paul =>22, Smith => 29);
foreach my $name ( %persons ) {
print $name . $persons{$name} . "\n";
}


I'm hoping someone could explain it slightly.
1) how come the array uses % and not @

Because it is a hash, or an associative array, where the key is associated to the value. An array is just a flat list.

Add this line to your code (with use Data::Dumper; at the top) to see what the hash looks like in a simpler-to-understand way (hopefully).

print Dumper \%persons;
If the script runs as a web page, view the page source (in FF at least) so that you see the output with its line breaks.



2) im struggling to work out how this has come about being the key: $persons{$name}


To read in from a hash, you can use a while loop.



while (my ($key, $value) = each %persons){
print qq( k=$key : v=$value <br /> );
}


Instead of looping, you can take a shortcut when you want the value for one specific key.

Following the same naming as in the while loop, $persons{$name} might be more easily understood like this:



$persons{$key};

See?

persons is the name of the hash and $name holds the key, so the expression returns the value that is associated with that key. For the key John the code would be


$persons{'John'};

With the hash above, that returns 25.

In case the single quotes add to the confusion, they shouldn't. They mark a literal string key as opposed to a $variable; for a simple word like John they are optional.
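The lookup forms can be sketched as follows, using the same %persons data as the thread. The helper age_of is hypothetical and only shows how exists guards against a missing key.

```perl
use strict;
use warnings;

my %persons = (John => 25, Anne => 32, Paul => 22, Smith => 29);

# Hypothetical helper: the age stored under a name, or 'unknown'.
sub age_of {
    my ($hash, $name) = @_;
    return exists $hash->{$name} ? $hash->{$name} : 'unknown';
}

my $name = 'Anne';
print $persons{'John'}, "\n";            # literal key in quotes
print $persons{John}, "\n";              # a simple word is quoted for you
print $persons{$name}, "\n";             # key held in a variable
print age_of(\%persons, 'Bob'), "\n";    # no such key
```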

I really struggled with hashes and am still only reasonably proficient. I empathise. :)

bazz

Phil Jackson
12-07-2009, 07:28 AM
Thx!

Is there a preg replace equivalent tut you could please point me to?

Phil Jackson
12-07-2009, 07:34 AM
is $page = $page =~ s/^\///g; right for removing a "/" from beginning of a string if is there?

oesxyl
12-07-2009, 07:40 AM
Thx!

Is there a preg replace equivalent tut you could please point me to?

http://perldoc.perl.org/perlrequick.html
http://perldoc.perl.org/perlretut.html

best regards

oesxyl
12-07-2009, 07:50 AM
is $page = $page =~ s/^\///g; right for removing a "/" from beginning of a string if is there?
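The pattern is right, but the statement is not. s/// changes $page in place and returns the number of substitutions it made, so $page = $page =~ s/^\///g; overwrites $page with that count. Drop the assignment, and drop the g modifier too, since ^ can only match once here. For the difference between = and =~ see: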

http://perldoc.perl.org/perlop.html



$page = '/something'; # assign '/something' to $page
$page =~ s/^\///; # remove a leading "/"; s/// edits $page in place


PHP's PCRE functions use Perl-compatible regex syntax, so most of your patterns carry over as they are.
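A short sketch of the substitution and of what it returns. The helper name strip_leading_slash is made up, and s{...}{} is used instead of s/.../../ only so that the slash needs no escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: strip one leading slash, if there is one.
sub strip_leading_slash {
    my ($page) = @_;
    $page =~ s{^/}{};            # edits the local copy in place
    return $page;
}

my $page  = '/something';
my $count = ($page =~ s{^/}{});  # return value: number of substitutions

print "page: $page, substitutions: $count\n";
print strip_leading_slash('index.php'), "\n";   # nothing to strip
```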

best regards

Phil Jackson
12-07-2009, 06:53 PM
Thank you yet again. Another problem that I am finding hard to get info on: how do I remove a value and hash from an array? Any tuts would be great!

oesxyl
12-07-2009, 07:49 PM
Thank you yet again. Another problem that I am finding hard to get info on: how do I remove a value and hash from an array? Any tuts would be great!
For a hash, delete $hash{$key} removes the key together with its value, and that is the normal way to do it. For an array, use splice (or shift and pop at the ends); undef on an element only empties the value and leaves the slot in place.
can you give an example of where you want to use this?
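A minimal sketch of removing entries, assuming a hash of page names and an array used as a queue (both made up for illustration). The helper without is hypothetical.

```perl
use strict;
use warnings;

my %pages = ('index.html' => 1, 'about.html' => 1, 'old.html' => 1);
delete $pages{'old.html'};       # removes the key and its value

my @queue = ('a.html', 'b.html', 'c.html');
splice @queue, 1, 1;             # removes one element at index 1

# Hypothetical helper: a copy of the list without a given value.
sub without {
    my ($unwanted, @list) = @_;
    return grep { $_ ne $unwanted } @list;
}

print join(',', sort keys %pages), "\n";
print join(',', @queue), "\n";
print join(',', without('b.html', 'a.html', 'b.html')), "\n";
```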

best regards

Phil Jackson
12-07-2009, 09:01 PM
This is what i have so far... it maybe completely wrong but its me converting PHP from reading tuts...



#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Data::Dumper;
use DBI;

print "\nEnter a website: http://www.";
chomp(my $url = <>);
my %pagesArray = ($url);
my @foundPagesArray = ('');
my @mainPagesArray = ('');
print "\n\n";

while ( true ) {
if( scalar %pagesArray != 0 ) {
foreach my $page (%pagesArray) {
my $key = $pagesAray{$page};
$page =~ s/^\///;
if( $page =~ m#((?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/))$#is ){
my $pregQuote = "http://www.".$url->quote;
if( $page =~ m#^$pregQuote#is ){
if( my $content = get "http://www." . $url . "/" ){

}
}else{
delete $pagesArray{$key};
}
}else{
delete $pagesArray{$key};
}
}
}else{
break;
}
}

bazz
12-07-2009, 09:27 PM
I hope I haven't missed something earlier in the thread but, it seems to me that the hash %pagesArray doesn't have any values associated with the key ($url). That means it would be a simple array.

On that basis I would try something like this:



#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Data::Dumper;
use DBI;

print "\nEnter a website: http://www.";
chomp(my $url = <>);
my @pagesArray = ($url);
my @new_array;
print "\n\n";

foreach my $page (@pagesArray) {
$page =~ s/^\///;

if( $page =~ m{((?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/))$}is ){
$page = "http://www.$page";
push (@new_array, $page);
}
}


If you want to build a new array with only those which have the chosen / desired file extension, then keep the push (@new_array, $page); line.

Alternatively, if you need to prevent duplicates, you could use a hash



#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Data::Dumper;
use DBI;

print "\nEnter a website: http://www.";
chomp(my $url = <>);
my %pages_hash;
$pages_hash{$url} = 1;
my @new_array;
print "\n\n";

while (my ($key, $value) = each %pages_hash) {
$key =~ s/^\///;

if( $key =~ m{((?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/))$}is ){
$key = "http://www.$key";
push (@new_array, $key);
}
}


I don't get the difference between a 'page' and the 'url' so I may not be providing the answer you need.

hth

bazz

FishMonger
12-07-2009, 10:01 PM
Was that the actual script you're using or a ballpark representation of the script?

Why are you using an infinite loop?


C:\test>perl -c PhilJackson.pl
Global symbol "%pagesAray" requires explicit package name at PhilJackson.pl line 18.
Bareword "true" not allowed while "strict subs" in use at PhilJackson.pl line 36.
Bareword "break" not allowed while "strict subs" in use at PhilJackson.pl line 34.
PhilJackson.pl had compilation errors.

FishMonger
12-07-2009, 10:06 PM
This is not going to do what you expect.
$url->quote
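$url holds a plain string, so ->quote tries to call a method on it and fails at run time. A sketch of one alternative, assuming the goal was PHP's preg_quote: Perl's built-in quotemeta (or \Q...\E inside the pattern) escapes the regex metacharacters. The helper name is_on_site and the example.com URLs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $page starts with http://www.<domain>,
# with the domain matched literally rather than as a pattern.
sub is_on_site {
    my ($page, $domain) = @_;
    my $prefix = quotemeta("http://www.$domain");
    return $page =~ m{^$prefix}i ? 1 : 0;
}

print is_on_site('http://www.example.com/index.html', 'example.com'), "\n";

# Without quoting, the dots would match any character and let this through.
print is_on_site('http://wwwXexampleYcom/index.html', 'example.com'), "\n";
```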

FishMonger
12-07-2009, 10:10 PM
This is what I get after fixing the typo.


C:\test>PhilJackson.pl

Enter a website: http://www.google.com
Odd number of elements in hash assignment at C:\test\PhilJackson.pl line 10, <> line 1.


Argument "1/8" isn't numeric in numeric ne (!=) at C:\test\PhilJackson.pl line 16, <> line 1.
Use of uninitialized value $key in delete at C:\test\PhilJackson.pl line 30, <> line 1.
Use of uninitialized value $page in hash element at C:\test\PhilJackson.pl line 18, <> line 1.
Use of uninitialized value $page in substitution (s///) at C:\test\PhilJackson.pl line 19, <> line 1.
Use of uninitialized value $page in pattern match (m//) at C:\test\PhilJackson.pl line 20, <> line 1.
Use of uninitialized value $key in delete at C:\test\PhilJackson.pl line 30, <> line 1.

oesxyl
12-08-2009, 02:43 AM
This is what i have so far... it maybe completely wrong but its me converting PHP from reading tuts...
It doesn't matter if it is wrong as long as you learn, and in my opinion you did a good job, :)



#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Data::Dumper;
use DBI;

print "\nEnter a website: http://www.";
chomp(my $url = <>);
print "\n\n";

my @urlstack = ($url); # push here url each time you find and shift one by one from head to get each page
my %allurl = (); # hash to check used url

while(@urlstack){
my $page = shift @urlstack;
$page =~ s/^\///;
if( $page =~ m#((?:\.html)|(?:\.php)|(?:\.aspx)|(?:\.htm)|(?:\.asp)|(?:\.shtml)|(?:/))$#is ){
my $pregQuote = quotemeta("http://www.$url");
if( $page =~ m#^$pregQuote#is ){
if(!exists($allurl{$page})){
if( my $content = get "http://www." . $url . "/" ){
my @links = .... ; # get links from current page somehow
push @urlstack, @links; # push links
$allurl{$page}++;
}
}
}
}
}

I didn't test this and I didn't change the lines where you use a regex; in my opinion you must change those lines.
There are a few things you must take care of:
- in %allurl you need to store each URL in some standard format so that you can later decide, using exists, whether two URLs represent the same page
- you need to test whether a URL is internal to the domain you are interested in or external, else you will start by fetching pages from one site and soon discover that you are trying to fetch the whole internet, :)
- you must test the status after the LWP get and do something if it fails.
- read about how to write and use subs, the equivalent of PHP functions, or better, how to write modules. Things get complex quickly and you will lose control very fast.

Now about another problem. Keep in mind that site owners don't always like having their pages fetched by scripts, and you must respect that. I suggest you look over the LWP::RobotUA module, and always try to fetch robots.txt and obey what is inside.
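The queue-plus-seen-hash idea can be sketched without touching the network. Everything here is an assumption for illustration: %fake_site stands in for fetched pages and their links, crawl is a hypothetical name, and a real crawler would fetch with LWP, check robots.txt, and normalize URLs before comparing them.

```perl
use strict;
use warnings;

# Stand-in for the web: page URL => links found on that page (made-up data).
my %fake_site = (
    'http://www.example.com/'       => ['http://www.example.com/a.html',
                                        'http://other.example.org/'],
    'http://www.example.com/a.html' => ['http://www.example.com/',
                                        'http://www.example.com/b.html'],
    'http://www.example.com/b.html' => [],
);

# Hypothetical crawler: visit every internal page once, starting at $start.
sub crawl {
    my ($start, $site) = @_;
    my $prefix = quotemeta($start);
    my @queue  = ($start);
    my (%seen, @visited);

    while (@queue) {
        my $page = shift @queue;
        next if $seen{$page}++;              # already handled
        next unless $page =~ m{^$prefix};    # external link, skip it
        next unless exists $site->{$page};   # stands in for a failed fetch
        push @visited, $page;
        push @queue, @{ $site->{$page} };    # queue the links found there
    }
    return @visited;
}

print "$_\n" for crawl('http://www.example.com/', \%fake_site);
```

The %seen hash is what stops the loop from running forever when pages link back to each other.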

best regards

FishMonger
12-08-2009, 03:04 AM
I'd need to run a test, but it looks like the regex could be reduced to this:

/\.(s?html?|php|aspx?)$/i
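A quick way to run that test, as a sketch. The helper name looks_like_page and the sample paths are made up for illustration. Two differences from the long pattern show up: the reduced one no longer accepts a trailing "/", and it would also accept ".shtm".

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the reduced pattern.
sub looks_like_page {
    my ($path) = @_;
    return $path =~ /\.(s?html?|php|aspx?)$/i ? 1 : 0;
}

for my $path (qw(index.html a.HTM b.shtml c.php d.asp e.aspx img.png dir/)) {
    printf "%-10s %s\n", $path, looks_like_page($path) ? 'match' : 'no match';
}
```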

oesxyl
12-08-2009, 03:55 AM
I'd need to run a test, but it looks like the regex could be reduced to this:

/\.(s?html?|php|aspx?)$/i
In my opinion it is a better idea to use the URI module:

http://search.cpan.org/dist/URI/

It has a lot of advantages: you can use a base for the domain, use the canonical form to check whether two URLs are equivalent, avoid external URLs and URIs with a protocol other than http, convert from relative to absolute, and so on.

best regards


