...

View Full Version : Getting information from URL's



Lee Stevens
02-20-2008, 04:11 PM
Hey guys, I managed to get cURL working with POST, but now that it retrieves the content I would only like to extract the information from the URLs.

Example:
<a href="profile.php?id=226345">WildestThing</a>

There would normally be 20+ links like this, each with a different ID and user name.

I would like it to list like this:

WildestThing - 226345
WildestThing1 - 226346

So on and on.

I've thought of eregi, but I'm not really sure what regex to use or how to get it from the URL/HTML.

Any help would be great!

Thank You!!

kbluhm
02-20-2008, 06:43 PM
First off, don't use eregi(), or any function from PHP's ereg library for that matter. They are slow and have been superseded by the preg (PCRE) library: the ereg functions were deprecated in PHP 5.3 and removed altogether in PHP 7.

You'll want to have a look at preg_match_all().

Try running this function and see what you come up with:


/**
 * parse_links()
 * Returns the number of matches on success, or boolean FALSE if no matches.
 * Assigns matches to second parameter's variable name
 */
function parse_links( $input, & $matches = NULL )
{
    $regexp = '/\<a.+href\="profile\.php\?id\=(\d+)".*\>(.+)\<\/a\>/Usi';
    $count = preg_match_all( $regexp, $input, $m, PREG_SET_ORDER );
    if ( $count )
    {
        $matches = array();
        for ( $i = 0; $i < $count; $i++ )
        {
            $matches[] = array
            (
                'id' => $m[$i][1],
                'name' => $m[$i][2],
            );
        }
        return $count;
    }
    return FALSE;
}


Usage:


$source = file_get_contents( $url ); // however you get the HTML source

if ( parse_links( $source, $matches ) )
{
    print_r( $matches );
}
else
{
    echo 'No matches';
}


It should give you something like so:


Array
(
    [0] => Array
        (
            [id] => 226345
            [name] => WildestThing
        )

    [1] => Array
        (
            [id] => 226346
            [name] => WildestThing1
        )

    [2] => Array
        (
            [id] => 226347
            [name] => WildestThing2
        )

    [3] => Array
        (
            [id] => 226348
            [name] => WildestThing3
        )

    [4] => Array
        (
            [id] => 226349
            [name] => WildestThing4
        )

)
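
To get from an array like that to the "name - id" list asked for in the first post, a loop along these lines should do. This is only a minimal sketch: the sample $matches array is made up to mirror the output above, and strip_tags() is there on the assumption that a link's text might contain markup.

```php
<?php
// Hypothetical sample data, shaped like what parse_links() assigns to $matches.
$matches = array(
    array('id' => '226345', 'name' => 'WildestThing'),
    array('id' => '226346', 'name' => 'WildestThing1'),
);

$lines = array();
foreach ($matches as $match) {
    // strip_tags() drops any markup that was captured inside the link text.
    $lines[] = strip_tags($match['name']) . ' - ' . $match['id'];
}

// One "name - id" entry per line.
echo implode("\n", $lines), "\n";
```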

Lee Stevens
02-20-2008, 09:25 PM
Thank you very much!

But I managed to sort something out:


$userinfo = array(); // initialise so it is defined even when nothing matches
if (preg_match_all('/<a href="profile\.php\?id=(\d+)">(.*?)<\/a>/i', $content, $matches, PREG_SET_ORDER))
{
    foreach ($matches as $line_num => $val)
    {
        $userinfo[$line_num]['userid'] = $val[1];
        $userinfo[$line_num]['username'] = $val[2];
    }
}



I was using cURL to get the information.

kbluhm
02-20-2008, 09:30 PM
What was the something that you managed to sort out?

Also, this bit of the regexp that you modified...
(.*?)... is redundant. The asterisk says zero or more. The question mark says optional (zero or one). There is no reason for the question mark when using the asterisk.

oesxyl
02-20-2008, 09:48 PM
Also, this bit of the regexp that you modified...
(.*?)... is redundant. The asterisk says zero or more. The question mark says optional (zero or one). There is no reason for the question mark when using the asterisk.

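There is a reason: a question mark placed after a quantifier such as * does not mean "optional", it makes that quantifier ungreedy (lazy). See PCRE_UNGREEDY or the U modifier :)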
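
A quick way to see the difference is to run the greedy, lazy and U-modifier variants against the same input. The sketch below assumes a made-up string with two profile links on one line; the variable names are only for illustration.

```php
<?php
// Hypothetical input: two profile links on a single line.
$html = '<a href="profile.php?id=1">Ann</a> <a href="profile.php?id=2">Bob</a>';

// Greedy: (.*) runs on to the LAST </a>, so both links collapse into one match.
preg_match_all('/<a href="profile\.php\?id=(\d+)">(.*)<\/a>/i', $html, $greedy, PREG_SET_ORDER);

// Lazy: (.*?) stops at the FIRST </a>, giving one match per link.
preg_match_all('/<a href="profile\.php\?id=(\d+)">(.*?)<\/a>/i', $html, $lazy, PREG_SET_ORDER);

// U modifier: inverts the default greediness, so a plain (.*) behaves like (.*?).
preg_match_all('/<a href="profile\.php\?id=(\d+)">(.*)<\/a>/Ui', $html, $ungreedy, PREG_SET_ORDER);

echo 'greedy:   ', count($greedy), ' match, name = ', $greedy[0][2], "\n";
echo 'lazy:     ', count($lazy), ' matches, names = ', $lazy[0][2], ', ', $lazy[1][2], "\n";
echo 'ungreedy: ', count($ungreedy), " matches\n";
```

Note that combining the U modifier with .*? would flip the quantifier back to greedy, so it is best to use one or the other, not both.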


In my opinion it is better to use:

[^<]*
instead of that, if you expect to have only text and no other HTML elements between the a tags.
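
As a rough illustration of that trade-off (the input string is invented for the example): the negated class can never run past a tag, so no greediness modifier is needed, but a link whose text contains another element is skipped entirely.

```php
<?php
// Hypothetical input: one plain-text link and one whose text is wrapped in <b>.
$html = '<a href="profile.php?id=1">Ann</a> <a href="profile.php?id=2"><b>Bob</b></a>';

// [^<]* matches any run of characters that are not "<", so it cannot cross a tag.
preg_match_all('/<a href="profile\.php\?id=(\d+)">([^<]*)<\/a>/i', $html, $m, PREG_SET_ORDER);

foreach ($m as $match) {
    echo $match[2], ' - ', $match[1], "\n";
}
// Only the first link is listed; the second fails because <b> follows the opening tag.
```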


best regards

kbluhm
02-20-2008, 09:56 PM
Oh, he also ripped out the ungreedy modifier, as well as some other changes. How nice.


