
PHP link scraping



MattClark
09-13-2011, 07:36 AM
Hi guys! I'm trying to turn a page filled with a giant table of links into an array that I can use to check the links for validity. I realize there are better ways to do this; it's more of a learning exercise than anything. However, the code below, which I've been trying to adapt, gives no results. Are there any noticeable reasons why it isn't producing the desired result?

Thanks in advance!!

Matt

Here is a sample of a row from the table I am trying to scrape.


<tr>
<td>1</td><td>The Hangover 2</td><td>http://www.novamov.com/video/kcyzc7aoduw12</td><td>http://www.putlocker.com/file/F72561F9414120CA</td><td>http://www.putlocker.com/file/24D2A737D555C0D9</td><td>http://www.putlocker.com/file/98592CE881B32D29</td><td>http://www.sockshare.com/file/1BE3ED2D67C9918E</td></tr>



And here is my code that is returning 0 results:



<?php
// get the HTML
$html = file_get_contents("choosing to hide url here");

preg_match_all(
'/<tr>
<td>(.*?)<\/td>
<td>(.*?)<\/td>
<td>(.*?)<\/td>
<td>(.*?)<\/td>
<td>(.*?)<\/td>
<td>(.*?)<\/td>
<td>(.*?)<\/td>
<\/tr>/s',
$html,
$posts,
PREG_SET_ORDER // formats data into an array of posts
);
$num_records = @mysql_num_rows($posts);

foreach ($posts as $post) {
    $movie_id = $post[1];
    $title = $post[2];
    $version1 = $post[3];
    $version2 = $post[4];
    $version3 = $post[5];
    $version4 = $post[6];
    $version5 = $post[7];
}

if ($num_records < 1) {
    print "No results";
} else {
    echo $posts;
}
?>

gvre
09-13-2011, 10:39 AM
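You should use count, not mysql_num_rows, when you need to count array elements. mysql_num_rows only works on a MySQL result, so $num_records never holds a valid count and the script always prints "No results". Your pattern also expects a line break after every </td>, while the cells in your sample row all sit on one line.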
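As a minimal sketch of the counting part only: preg_match_all fills an ordinary PHP array, so count() (or the function's own return value) gives the number of matched rows. The two-cell rows and the title "Another Title" are shortened stand-ins made up for illustration, not the real page.

```php
<?php
// Shortened stand-in for the real page: two rows with two cells each.
$html = '<tr><td>1</td><td>The Hangover 2</td></tr>'
      . '<tr><td>2</td><td>Another Title</td></tr>';

// preg_match_all() returns the number of full matches and fills $posts.
$pattern = '#<tr>\s*<td>(.*?)</td><td>(.*?)</td></tr>#si';
$found = preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

// $posts is a plain array, so count() works on it;
// mysql_num_rows() only accepts a MySQL result.
$num_records = count($posts);

if ($num_records < 1) {
    print "No results";
} else {
    print "$num_records rows matched";
}
```

Either $found or $num_records can drive the "No results" check, since both hold the same number here.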

Try the following code


$html = '<tr>
<td>1</td><td>The Hangover 2</td><td>http://www.novamov.com/video/kcyzc7aoduw12</td><td>http://www.putlocker.com/file/F72561F9414120CA</td><td>http://www.putlocker.com/file/24D2A737D555C0D9</td><td>http://www.putlocker.com/file/98592CE881B32D29</td><td>http://www.sockshare.com/file/1BE3ED2D67C9918E</td></tr>';

$pattern = '#<tr>\s*<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>#si';
if (preg_match_all($pattern, $html, $posts, PREG_SET_ORDER))
{
    foreach ($posts as $post)
    {
        print_r($post); // full match plus the seven captured cells
        $movie_id = $post[1];
        $title = $post[2];
        $version1 = $post[3];
        $version2 = $post[4];
        $version3 = $post[5];
        $version4 = $post[6];
        $version5 = $post[7];
    }
}
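
Since the original post mentions that there are better ways to do this, here is a rough sketch of one alternative: reading the same row with PHP's DOMDocument instead of a regular expression. This is a swapped-in technique, not the code from the thread. It assumes the dom extension is enabled and PHP 5.4 or newer (for the short array syntax); the wrapping <table> tag and the array keys movie_id, title and links are choices made for this example.

```php
<?php
// The sample row from the thread, wrapped in <table> so the parser sees valid nesting.
$html = '<table><tr>
<td>1</td><td>The Hangover 2</td><td>http://www.novamov.com/video/kcyzc7aoduw12</td><td>http://www.putlocker.com/file/F72561F9414120CA</td><td>http://www.putlocker.com/file/24D2A737D555C0D9</td><td>http://www.putlocker.com/file/98592CE881B32D29</td><td>http://www.sockshare.com/file/1BE3ED2D67C9918E</td></tr></table>';

// Real pages are rarely valid HTML, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    // Gather the text of every cell in this row.
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    // Skip rows that do not have an id, a title and at least one link.
    if (count($cells) >= 3) {
        $rows[] = [
            'movie_id' => $cells[0],
            'title'    => $cells[1],
            'links'    => array_slice($cells, 2),
        ];
    }
}

print_r($rows);
```

Unlike the fixed seven-cell pattern, this version does not care about whitespace between cells or about how many link columns a row has.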

MattClark
09-14-2011, 04:05 PM
I'm slightly confused. Both versions of the code do the same thing: they pull every link from a text file that contains every link on my site. What I'm ultimately trying to do is fetch the page content of each individual link and make sure that the movie player is still embedded on the page that link points to.

When I do it, it fetches the page content of every link, but it puts it all onto the same page, so I can't scrape the HTML of each one individually. I'm guessing I'm supposed to remove them from the array, but I'm not entirely sure how.

MattClark
09-14-2011, 10:55 PM
Bump. If anyone knows how I could fix this problem, I would appreciate it greatly!
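
One possible way to keep the pages apart, sketched under some assumptions: fetch each link inside the loop, inspect that one page on its own, and store only a status keyed by URL instead of echoing or concatenating the HTML. The function name check_links, the '<embed' marker, the status strings and the example.com URLs are all hypothetical; the real marker depends on how the player is embedded on the target pages. The fetcher is passed in as a callable so the sketch runs without network access; by default it falls back to file_get_contents.

```php
<?php
/**
 * Fetch each URL separately and record whether its HTML contains $marker.
 * $fetch is any callable that takes a URL and returns the page HTML,
 * or false on failure (file_get_contents behaves this way).
 */
function check_links(array $links, $marker, $fetch = 'file_get_contents')
{
    $results = array();
    foreach ($links as $url) {
        $page = call_user_func($fetch, $url); // HTML of this one link only
        if ($page === false) {
            $results[$url] = 'unreachable';
        } elseif (stripos($page, $marker) !== false) {
            $results[$url] = 'ok';
        } else {
            $results[$url] = 'player missing';
        }
        // $page is overwritten on the next pass, so pages never get mixed together.
    }
    return $results;
}

// Stand-in pages and fetcher (made up for this example).
$fake_pages = array(
    'http://example.com/a' => '<html><embed src="player.swf"></html>',
    'http://example.com/b' => '<html>File removed</html>',
);
$fake_fetch = function ($url) use ($fake_pages) {
    return isset($fake_pages[$url]) ? $fake_pages[$url] : false;
};

$links = array('http://example.com/a', 'http://example.com/b', 'http://example.com/c');
$results = check_links($links, '<embed', $fake_fetch);
print_r($results);
```

With the real site, the links pulled out of each row would be passed as $links and the third argument left out, so that file_get_contents does the fetching.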


