
PHP HTML scraper



MattClark
10-18-2011, 05:22 AM
1) Project Details: I have a text file stored that has every link in my database. There are 5 versions of each link. I need an html scraper that goes through each link and checks to see if the video is still embedded.

2) Payment method/details: PayPal

PM me with any questions, Thanks!

_Aerospace_Eng_
10-18-2011, 05:48 AM
So you have a list of URLs where the videos might be embedded, and you need to check each one for the embed code? Did you give your users a certain code to use for embedding? Can you post a sample of your text file? Feel free to remove the URLs and replace them with domain.com or something.

MattClark
10-18-2011, 06:13 AM
I use this code to print all links from my database to a text file.



<?php
require_once('mysql_connect.php'); // connect to the database

$movie_list = 'List of all movie links<hr /><br /><table>';
$sql = mysql_query("SELECT movie_id, title, version1, version1_source, version2, version2_source, version3, version3_source, version4, version4_source, version5, version5_source FROM movies ORDER BY movie_id");
while($row = mysql_fetch_array($sql)){
$movie_id = $row['movie_id'];
$title = $row['title'];
$version1 = $row['version1'];
$version1_source = $row['version1_source'];
$version2 = $row['version2'];
$version2_source = $row['version2_source'];
$version3 = $row['version3'];
$version3_source = $row['version3_source'];
$version4 = $row['version4'];
$version4_source = $row['version4_source'];
$version5 = $row['version5'];
$version5_source = $row['version5_source'];
$movie_list .= '
<tr>
<td>' . $movie_id . '</td><td>' . $title . '</td><td>' . $version1 . '</td><td>' . $version1_source . '</td><td>' . $version2 . '</td><td>' . $version2_source . '</td><td>' . $version3 . '</td><td>' . $version3_source . '</td><td>' . $version4 . '</td><td>' . $version4_source . '</td><td>' . $version5 . '</td><td>' . $version5_source . '</td></tr>';
}

$movie_list .= '</table>';

?>

<html>
<?php echo $movie_list; ?>

</html>


Here is the start of what I have so far to check each of the links for the video player:



<html>
<table>
<?php
$page = "http://xxxxxxxxxxx.php";

# INITIATE CURL.
$curl = curl_init();

# CURL SETTINGS.
curl_setopt($curl, CURLOPT_URL, $page);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10); // 0 would wait indefinitely

# GRAB THE FILE.
$html = curl_exec($curl);

# CLOSE CURL.
curl_close($curl);

# STOP ONLY IF THE DOWNLOAD FAILED.
if ($html === false) {
exit('Could not fetch ' . $page);
}

# ONE ROW PER MOVIE: 12 <td> CELLS, MATCHING THE TABLE PRINTED BY THE FIRST SCRIPT.
$pattern = '/<tr>\s*' . str_repeat('<td>(.*?)<\/td>', 12) . '\s*<\/tr>/s';

preg_match_all(
$pattern,
$html, // search the downloaded HTML, not the URL string
$rows, // will contain one entry per table row
PREG_SET_ORDER // formats data into an array of rows
);

foreach ($rows as $row) {
$movie_id = $row[1];
$title = $row[2];
$version1 = $row[3];
$version1_source = $row[4];
$version2 = $row[5];
$version2_source = $row[6];
$version3 = $row[7];
$version3_source = $row[8];
$version4 = $row[9];
$version4_source = $row[10];
$version5 = $row[11];
$version5_source = $row[12];

// do something with data
}
?>
</table>
</html>

I want it to check each link in database to see if the embedded video player is still on the other side of the link, and if it is not, echo out the link, version, id, etc. Sorry if I'm doing a bad job explaining what I need. I don't understand cURL well at all.
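
A minimal sketch of the check described above, not a finished scraper. It assumes that a page "still has the player" when its HTML contains an embed-style tag; the marker list, the function names (fetch_html, has_player, find_dead_links) and the returned array shape are all hypothetical and would need adjusting to whatever embed code the video hosts actually use. It also assumes the version1 to version5 columns hold the page URLs.

```php
<?php
// Hypothetical markers: tags that usually indicate an embedded player.
const PLAYER_MARKERS = ['<embed', '<object', '<iframe', '<video'];

/** Fetch a page with cURL; returns null on failure or a non-2xx status. */
function fetch_html(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($curl, CURLOPT_TIMEOUT, 20);
    $html = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    // The handle is released when $curl goes out of scope.
    return ($html !== false && $status >= 200 && $status < 300) ? $html : null;
}

/** True if the HTML contains any of the markers (case-insensitive). */
function has_player(string $html, array $markers = PLAYER_MARKERS): bool
{
    foreach ($markers as $marker) {
        if (stripos($html, $marker) !== false) {
            return true;
        }
    }
    return false;
}

/**
 * Check version1..version5 of every movie row and collect the links
 * whose page could not be fetched or shows no player marker.
 * $fetch is any callable that takes a URL and returns HTML or null.
 */
function find_dead_links(array $movies, callable $fetch): array
{
    $dead = [];
    foreach ($movies as $movie) {
        for ($n = 1; $n <= 5; $n++) {
            $url = $movie["version$n"] ?? '';
            if ($url === '') {
                continue; // this version slot is empty
            }
            $html = $fetch($url);
            if ($html === null || !has_player($html)) {
                $dead[] = [
                    'movie_id' => $movie['movie_id'],
                    'title'    => $movie['title'],
                    'version'  => $n,
                    'url'      => $url,
                ];
            }
        }
    }
    return $dead;
}
```

Passing the rows from the movies table together with 'fetch_html' as the second argument would return the links to echo out with their id, title and version. Taking the fetcher as a parameter means the logic can be tried against canned HTML before it is pointed at real sites. A plain substring test will also match unrelated iframes such as ads, so a marker specific to the player is likely to give fewer false "still embedded" results.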


