
Text Selector



markman641
11-17-2011, 12:37 AM
1) Project Details: (be as specific as possible): This seems like a project that is outside my knowledge.

I would like to create a program, or some code I can put in my page, that will visit a URL that I input, select certain text from the page, and insert the data into an SQL table called "offers".

I want it to visit a site and select the text after "Campaign Name", "Requirements", "Country", "rate", "category" and "URL".

All of the above are in a new <td> after the <td> that holds the label (Campaign Name and so on), EXCEPT the URL, which is in a <div> with class "table".

Here is a screenshot of one of the pages I would like it to extract info from: http://snpr.cm/BOUlNk.png

COPY OF ONE OF THE PAGES SOURCE CODE: http://pastebin.com/raw.php?i=9QgCHMk8
If you need any more info please don't hesitate to ask. This is not in my field of knowledge which is why I've come here to ask you guys.

2) Payment method/ details (Paypal, check? Timeline?): Free? :( I have no money.

Rowsdower!
11-17-2011, 08:54 PM
Here is a really sloppy and convoluted script:


<?php
$string=file_get_contents('http://www.example.com/path/to/page.html');
$start='<table';
$end='</table>';

$string=substr($string,strpos($string,$start),strrpos($string,$end)-strpos($string,$start));
$string='<table><tr>'.substr($string,strpos($string,'<td width="30%" align="right"><b>ID</b></td>'));
$string=substr($string,0,strpos($string,$end))."</table>";
$string=str_replace(' ',' ',str_replace(' ',' ',str_replace("\n\n","\n",$string)));


print $string;


libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($string);
$xml = simplexml_import_dom($dom);
libxml_use_internal_errors(FALSE);

$result = $xml->xpath("//td");
//print_r(each($result[5]));
$temp=each($result[5]);
print "<p>Campaign Name: ".$temp[1][0]."</p>\n";
$temp=each($result[9]);
print "<p>Description: ".$temp[1]."</p>\n";
$temp=each($result[11]);
print "<p>Requirements: ".$temp[1]."</p>\n";
$temp=each($result[13]);
print "<p>Category: ".$temp[1]."</p>\n";
$temp=each($result[15]);
print "<p>Country: ".$temp[1]."</p>\n";
$temp=each($result[17]);
print "<p>Rate: ".$temp[1]."</p>\n";
?>

It might not work on anything other than your specified source code (other offers may be structured differently, I wouldn't know) but it's a start... And you get what you pay for!

markman641
11-19-2011, 12:18 AM
I just thought of something... I need to be logged into the site to access a page and idk if the script would be able to go to the URL... DX

So im not too sure if that would quite work.. but i will see

Edit: it doesn't work. This is the error I got: http://snpr.cm/gyF8yE.png

markman641
11-21-2011, 04:58 AM
anyone?

Rowsdower!
11-23-2011, 12:54 PM
You may not have the ability to use file_get_contents() on a remote address with your host. What happens if you just try this:


<?php
$string=file_get_contents('http://www.google.com/');
$start='<table';
$end='</table>';

$string=substr($string,strpos($string,$start),strrpos($string,$end)-strpos($string,$start));
$string='<table><tr>'.substr($string,strpos($string,'<td width="30%" align="right"><b>ID</b></td>'));
$string=substr($string,0,strpos($string,$end))."</table>";
$string=str_replace(' ',' ',str_replace(' ',' ',str_replace("\n\n","\n",$string)));


print $string;
?>

Does anything show up in the page or is it blank?

markman641
11-27-2011, 01:00 AM
It shows up blank

markman641
11-29-2011, 11:48 PM
anyone?

Rowsdower!
11-30-2011, 12:51 PM
One more try... If this most basic example returns a blank result in your browser then your host simply doesn't allow remote use of file_get_contents() in which case you can't do what you are wanting to do...


<?php
$string=file_get_contents('http://www.google.com/');
print $string;
?>

Try that and if you get a blank result then you know you're hosed. If not, then the script I provided earlier needs some work or else you need to find another script to do the job.

But whatever you do, you will need either the cURL library or the ability to use file_get_contents() on a remote address (which requires allow_url_fopen to be enabled on your host). Those are the usual ways to pull another website's content onto your server on the fly.
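
A quick way to see which of those two options a host offers is to ask PHP directly. This is only a minimal sketch: the function name remote_fetch_options() is made up for illustration, and it assumes that file_get_contents() on an http:// address depends on the allow_url_fopen ini setting and that curl_init() exists only when the cURL extension is loaded.

```php
<?php
// Report which remote-fetch mechanisms this PHP install offers.
// remote_fetch_options() is a hypothetical helper name, not part of PHP.
function remote_fetch_options(): array
{
    return [
        // file_get_contents() can only open http:// URLs when this ini flag is on.
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // curl_init() is only defined when the cURL extension is loaded.
        'curl' => function_exists('curl_init'),
    ];
}

foreach (remote_fetch_options() as $name => $available) {
    print $name . ': ' . ($available ? 'available' : 'not available') . "\n";
}
```

If both lines say "not available", the host probably has to change its configuration before any scraping script can run.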

markman641
12-01-2011, 12:34 AM
That worked! Which means your other script didn't work.

Rowsdower!
12-01-2011, 05:00 PM
That worked! Which means your other script didn't work.

OK then. At least you're past the first hurdle.

I know that when I plugged in your sampled source code from your target page my script ran just fine. So if your source code was representative of the actual source code you encounter then my script should work. (If not, then you need to try adjusting the substring and string replacement functions until you narrow things down to the result you want.)
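
For the substring adjustments, it may help to wrap the strpos()/substr() pattern in one small function and test it on a short string first. The sketch below is an illustration only: text_between() is a hypothetical helper and the sample markup is invented, not taken from the real page. Note the strict comparison against false, since strpos() returns 0 for a match at the very start of the string.

```php
<?php
// Return the text between $start and $end (markers excluded),
// or null when either marker is missing. Hypothetical helper for illustration.
function text_between(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {   // strict check: a match at offset 0 is still a match
        return null;
    }
    $from += strlen($start); // skip past the start marker itself
    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to - $from);
}

// Invented sample markup standing in for one label/value row.
$html = '<td><b>Rate</b></td><td>$1.50</td>';
print text_between($html, '<b>Rate</b></td><td>', '</td>') . "\n";
```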

Not to ask an insulting question, but you did update this line to use the actual URL you want to scrape, didn't you?


<?php
$string=file_get_contents('http://www.example.com/path/to/page.html');
$start='<table';
$end='</table>';

...

And my script assumes that the page you want to scrape is not behind a login or anything requiring a cookie. Because if you have to log in to see the screen that you want to scrape (or if you have to have a certain value set in a cookie) then this method won't be able to actually see the data you are trying to collect. You would need to use the cURL library instead. Do you need a cookie or a login to see the page you are trying to scrape?

markman641
12-02-2011, 01:56 AM
yes i changed it. and yes it does need a login but i figured if i logged in then used the script it would work.. maybe not.

Rowsdower!
12-02-2011, 12:55 PM
yes i changed it. and yes it does need a login but i figured if i logged in then used the script it would work.. maybe not.

Yeah, definitely not. Your host's server (not your logged-in browser on your own computer) is visiting the page and your server does not share your session/cookies. It's like any other random user trying to visit the page from another computer while you're logged in. It's going to hit a login wall.

You need to look into PHP's cURL library (which has the ability to make the server visit the page, simulating a real user, and log in/navigate pages). Then you capture contents from the logged-in state and log back out when you're finished with the capture.

In order to do the cURL method one would have to have access to a valid account on the target site so they could see the way the log-in works and what things need to be "clicked" on and submitted in order to get around. Bottom line: I don't think you have much hope of getting a cURL script with login done for you for free.

My advice would be to save up for a few weeks and post a paid work offer for someone to do this for you or else spend that same amount of time (or less) learning to use cURL on your own. If you can at least get the script logged in and grab the page data you want then you can script-bash that with what I have provided already to get a working model. The cURL part may or may not be messy (depending on how your target site's login system is set up). The cURL library in and of itself is not difficult to use, but navigating a website with cURL can get very tricky (and can break when the target site updates their code if they make changes to the login system's URL or variable names). If the login uses javascript then that can be another, possibly significant layer of trouble to work out.

Anyway, this link may be of some help to get you started with cURL log-ins:
http://stackoverflow.com/questions/1082302/file-get-contents-from-url-that-is-only-accessible-after-log-in-to-website

markman641
12-02-2011, 11:12 PM
This is what I found by searching the internet, but it's not working. I get the error:
Couldn't connect to http://proleadsmedia.com/publishers/campaigns/view.php?wid=592&cid=4811.

Also, am I supposed to have a Cookie.txt file? Here is the code:


<?php
// INIT CURL
$ch = curl_init();

// SET URL FOR THE POST FORM LOGIN
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/login.php');

// ENABLE HTTP POST
curl_setopt ($ch, CURLOPT_POST, 1);

// SET POST PARAMETERS : FORM VALUES FOR EACH FIELD
curl_setopt ($ch, CURLOPT_POSTFIELDS, 'Username=********&Password=******');

// IMITATE CLASSIC BROWSER'S BEHAVIOUR : HANDLE COOKIES
curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt');

# Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
# not to print out the results of its query.
# Instead, it will return the results as a string return value
# from curl_exec() instead of the usual true/false.
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);

// EXECUTE 1st REQUEST (FORM LOGIN)
$store = curl_exec ($ch);

// SET FILE TO DOWNLOAD
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/campaigns/view.php?wid=592&cid=4811');

// EXECUTE 2nd REQUEST (FILE DOWNLOAD)
$content = curl_exec ($ch);





$url = "http://proleadsmedia.com/publishers/campaigns/view.php?wid=592&cid=4811";

//unique text to determine start goes here

$start = "start.txt";

//insert end text here

$end = "end.txt";

$ch = curl_init();

curl_setopt ($ch, CURLOPT_URL, $url );

curl_setopt ($ch, CURLOPT_HEADER, 0);

curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);

$result = curl_exec ($ch) or die ("Couldn't connect to $url.");

curl_close ($ch);

$startposition = strpos($result,$start);

if($startposition !== false){

$endposition = strpos($result,$end, $startposition);

//add enough chars to include the tag

$endposition += strlen($end);

$length = $endposition-$startposition;

$result = substr($result,$startposition,$length);

echo $result;

}else

echo "<center><h3>Not found - try again later.</h3></center>";








// CLOSE CURL
curl_close ($ch);

?>



BUT THEN I also just tried:


<?php

$loginUrl = 'http://proleadsmedia.com/publishers/login.php'; //action from the login form
$loginFields = array('username'=>'m********', 'password'=>'********'); //login form field names and values
$remotePageUrl = 'http://proleadsmedia.com/publishers/campaigns/view.php?wid=&cid=12462'; //url of the page you want to save

$login = getUrl($loginUrl, 'post', $loginFields); //login to the site

$remotePage = getUrl($remotePageUrl); //get the remote page

function getUrl($url, $method='', $vars='') {
$ch = curl_init();
if ($method == 'post') {
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $vars);
}
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies/cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies/cookies.txt');
$buffer = curl_exec($ch);
curl_close($ch);
return $buffer;
}

?>

and it came up as a blank page

Rowsdower!
12-05-2011, 07:49 PM
Two things:

1) Change your proleads password immediately. You forgot to delete it in one instance in your posted code (I know, I accidentally accessed it once and had to log out when I was testing).

2) You missed one piece of the proper URL that the login form goes to. Try updating your initial setup with this:


<?php
// INIT CURL
$ch = curl_init();

// SET URL FOR THE POST FORM LOGIN
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/login.php?next');

// ENABLE HTTP POST
curl_setopt ($ch, CURLOPT_POST, true);

// SET POST PARAMETERS : FORM VALUES FOR EACH FIELD
curl_setopt ($ch, CURLOPT_POSTFIELDS, 'username=markman641&password=************');

// IMITATE CLASSIC BROWSER'S BEHAVIOUR : HANDLE COOKIES
curl_setopt ($ch, CURLOPT_COOKIEJAR, './cookie.txt');

# Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
# not to print out the results of its query.
# Instead, it will return the results as a string return value
# from curl_exec() instead of the usual true/false.
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);

// EXECUTE 1st REQUEST (FORM LOGIN)
$store = curl_exec ($ch);

// SET FILE TO DOWNLOAD
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/campaigns/view.php?wid=592&cid=4811');

// EXECUTE 2nd REQUEST (FILE DOWNLOAD)
$content = curl_exec ($ch);





$url = "http://proleadsmedia.com/publishers/campaigns/view.php?wid=592&cid=4811";

//unique text to determine start goes here

$start = "start.txt";

//insert end text here

$end = "end.txt";

$ch = curl_init();

curl_setopt ($ch, CURLOPT_URL, $url );

curl_setopt ($ch, CURLOPT_HEADER, 0);

curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);

$result = curl_exec ($ch) or die ("Couldn't connect to $url.");

curl_close ($ch);

$startposition = strpos($result,$start);

if($startposition !== false){

$endposition = strpos($result,$end, $startposition);

//add enough chars to include the tag

$endposition += strlen($end);

$length = $endposition-$startposition;

$result = substr($result,$startposition,$length);

echo $result;

}else

echo "<center><h3>Not found - try again later.</h3></center>";








// CLOSE CURL
curl_close ($ch);

?>

That worked for me in a quick test (or at least, it logged me in and got me to the first detail page in the script and printed the contents after which I used exit(0) to prevent any further processing).

After that, you have at least got a working login and the ability to navigate. You should be able to patch things up from that point.

Rowsdower!
12-05-2011, 07:55 PM
Short version, printing the page in question rather than processing it:


<?php
// INIT CURL
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/login.php?next');
curl_setopt ($ch, CURLOPT_POST, true);
curl_setopt ($ch, CURLOPT_POSTFIELDS, 'username=markman641&password=************');
curl_setopt ($ch, CURLOPT_COOKIEJAR, './cookie.txt');
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);

// EXECUTE 1st REQUEST (FORM LOGIN)
$store = curl_exec ($ch);

// SET FILE TO DOWNLOAD
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/campaigns/view.php?wid=592&cid=4811');

// EXECUTE 2nd REQUEST (FILE DOWNLOAD)
$content = curl_exec ($ch);

// LOG BACK OUT
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/logout.php');
$logged_out = curl_exec ($ch);

// CLOSE CURL
curl_close ($ch);

print $content; //instead of printing in real application you would search the contents string for the data you need...
?>

Fumigator
12-05-2011, 10:26 PM
This has turned into a PHP help thread so I'm moving it to the PHP forum.

markman641
12-05-2011, 11:46 PM
Oh my god thank you for telling me about my password!! I owe you one man! and THANK YOU it works!!!

markman641
12-06-2011, 12:24 AM
Okay, well, it works.. except for the specific text selector part. I keep getting these errors:


Warning: Variable passed to each() is not an array or object in /home/moneyeve/public_html/textselector.php on line 49
Campaign Name:


Warning: Variable passed to each() is not an array or object in /home/moneyeve/public_html/textselector.php on line 51
Description:


Warning: Variable passed to each() is not an array or object in /home/moneyeve/public_html/textselector.php on line 53
Requirements:


Warning: Variable passed to each() is not an array or object in /home/moneyeve/public_html/textselector.php on line 55
Category:


Warning: Variable passed to each() is not an array or object in /home/moneyeve/public_html/textselector.php on line 57
Country:


Warning: Variable passed to each() is not an array or object in /home/moneyeve/public_html/textselector.php on line 59
Rate:

What I did was change all the $string's to $content.


this is my code:


<?php
// INIT CURL
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/login.php?next');
curl_setopt ($ch, CURLOPT_POST, true);
curl_setopt ($ch, CURLOPT_POSTFIELDS, 'username=markman641&password=********');
curl_setopt ($ch, CURLOPT_COOKIEJAR, './cookie.txt');
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);

// EXECUTE 1st REQUEST (FORM LOGIN)
$store = curl_exec ($ch);

// SET FILE TO DOWNLOAD
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/campaigns/view.php?wid=592&cid=4811');

// EXECUTE 2nd REQUEST (FILE DOWNLOAD)
$content = curl_exec ($ch);

// LOG BACK OUT
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/logout.php');
$logged_out = curl_exec ($ch);

// CLOSE CURL
curl_close ($ch);

print $content; //instead of printing in real application you would search the contents string for the data you need...

$start='<table';
$end='</table>';
$content=substr(content,strpos(content,$start),strrpos(content,$end)-strpos(content,$start));
$content='<table><tr>'.substr(content,strpos(content,'<td width="30%" align="right"><b>ID</b></td>'));
$content=substr(content,0,strpos(content,$end))."</table>";
$content=str_replace(' ',' ',str_replace(' ',' ',str_replace("\n\n","\n",content)));


print $content;


libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML(content);
$xml = simplexml_import_dom($dom);
libxml_use_internal_errors(FALSE);

$result = $xml->xpath("//td");
//print_r(each($result[5]));
$temp=each($result[5]);
print "<p>Campaign Name: ".$temp[1][0]."</p>\n";
$temp=each($result[9]);
print "<p>Description: ".$temp[1]."</p>\n";
$temp=each($result[11]);
print "<p>Requirements: ".$temp[1]."</p>\n";
$temp=each($result[13]);
print "<p>Category: ".$temp[1]."</p>\n";
$temp=each($result[15]);
print "<p>Country: ".$temp[1]."</p>\n";
$temp=each($result[17]);
print "<p>Rate: ".$temp[1]."</p>\n";
?>

markman641
12-07-2011, 01:09 AM
anyone?

Rowsdower!
12-07-2011, 12:54 PM
Hint: Does something stand out to you as being incorrect in this line?


$dom->loadHTML(content);

markman641
12-08-2011, 10:21 PM
Haha thanks.. Well, that wasn't the only line where I forgot the $ in front of content! Well, it works.. ALMOST!!!!!!!!!!!!!!!!!!!!!! The only thing that is messed up is the campaign name.

This is what it output:


Campaign Name: $

It SHOULD say:

Campaign Name: $1MM Easter Egg Sweepstakes *Incent*

Code:


<?php
// INIT CURL
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/login.php?next');
curl_setopt ($ch, CURLOPT_POST, true);
curl_setopt ($ch, CURLOPT_POSTFIELDS, 'username=markman641&password=************');
curl_setopt ($ch, CURLOPT_COOKIEJAR, './cookie.txt');
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);

// EXECUTE 1st REQUEST (FORM LOGIN)
$store = curl_exec ($ch);

// SET FILE TO DOWNLOAD
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/campaigns/view.php?wid=592&cid=4811');

// EXECUTE 2nd REQUEST (FILE DOWNLOAD)
$content = curl_exec ($ch);

// LOG BACK OUT
curl_setopt($ch, CURLOPT_URL, 'http://proleadsmedia.com/publishers/logout.php');
$logged_out = curl_exec ($ch);

// CLOSE CURL
curl_close ($ch);

print $content; //instead of printing in real application you would search the contents string for the data you need...

$start='<table';
$end='</table>';
$content=substr($content,strpos($content,$start),strrpos($content,$end)-strpos($content,$start));
$content='<table><tr>'.substr($content,strpos($content,'<td width="30%" align="right"><b>ID</b></td>'));
$content=substr($content,0,strpos($content,$end))."</table>";
$content=str_replace(' ',' ',str_replace(' ',' ',str_replace("\n\n","\n",$content)));


print $content;


libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($content);
$xml = simplexml_import_dom($dom);
libxml_use_internal_errors(FALSE);

$result = $xml->xpath("//td");
//print_r(each($result[5]));
$temp=each($result[5]);
print "<p>Campaign Name: ".$temp[1][0]."</p>\n";
$temp=each($result[9]);
print "<p>Description: ".$temp[1]."</p>\n";
$temp=each($result[11]);
print "<p>Requirements: ".$temp[1]."</p>\n";
$temp=each($result[13]);
print "<p>Category: ".$temp[1]."</p>\n";
$temp=each($result[15]);
print "<p>Country: ".$temp[1]."</p>\n";
$temp=each($result[17]);
print "<p>Rate: ".$temp[1]."</p>\n";
?>

Also, one last thing. It's supposed to get the "default url", which is lower down in the page. What line would I add, and what would I put if I wanted to extract more? Here is the source code for it:


<table cellpadding="2" cellspacing="0" style="border:0px;">
<tbody><tr>
<td style="width:50px" align="right"><b>URL</b></td>
<td style="width:540px"><div style="overflow:auto;max-height:100px;padding:2px" onclick="autoSelect(this);" class="table">http://proleadsmedia.com/m/1e13250t89231e4t16a/</div></td>
</tr>
<tr>
<td align="right"><b>HTML</b></td>
<td><div style="overflow:auto;max-height:100px;padding:2px" onclick="autoSelect(this);" class="table">&lt;a href="http://proleadsmedia.com/m/1e13250t89231e4t16a/"&gt;Android - Be Naughty (AUS)&lt;/a&gt;</div></td>
</tr>
</tbody></table>

Rowsdower!
12-09-2011, 02:37 PM
This is the part where I want you to fly from the nest. Read up on how to use simplexml and $xml->xpath and such to navigate an imported DOM and get the data you want. You have a working example there that just needs some tweaking.

On this part:

$temp=each($result[5]);
print "<p>Campaign Name: ".$temp[1][0]."</p>\n";

Look inside the array $temp to see what's there and see how you might need to proceed (use print_r($temp); in there once and see what's available to you).

I have brought you very close. It's up to you to get the rest of the way. If you get stuck with a specific piece of code that you can't get to work I can help unstick you, but I'm not going to suggest anything new for now. :thumbsup:

markman641
12-10-2011, 05:36 AM
Okay, well, I figured out the Name problem.. All I did was delete that [0] that was in there.
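
Why deleting the [0] helps: the value at $temp[1] was presumably a plain string by that point, and square brackets on a string pick out a single character by offset. A tiny illustration, using a hard-coded stand-in value rather than the real each() result:

```php
<?php
// Hypothetical stand-in for the string that each() returned at index 1.
$value = '$1MM Easter Egg Sweepstakes *Incent*';

// Square brackets on a string select one character by offset.
print $value[0] . "\n";   // first character only
print $value . "\n";      // the whole string
```

That would explain why the earlier output stopped at the dollar sign.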

Now I have to try and figure out this code so I can get it to extract more info... never seen anything like it and it's confusing..

markman641
12-11-2011, 11:21 PM
I am completely lost...

Rowsdower!
12-12-2011, 01:12 PM
This is the key to your future:

http://php.net/manual/en/simplexmlelement.xpath.php

Learn what xpath is and how to use it (there are examples and sample functions on that page as well as a description of the function). That is how you will first begin to get your information from the DOM.
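
As a rough illustration of the idea (not the only way to do it), an XPath query can look a value up by its label instead of by a fixed position such as $result[5]. The sketch below uses DOMXPath rather than each(), which was removed in PHP 8. The URL table is trimmed from the markup posted earlier in the thread; the Campaign Name row and the helper name value_after_label() are assumptions made up for this example.

```php
<?php
// URL table trimmed from the thread; the Campaign Name row is a hypothetical stand-in.
$html = <<<'HTML'
<table>
<tr><td width="30%" align="right"><b>Campaign Name</b></td><td>$1MM Easter Egg Sweepstakes *Incent*</td></tr>
</table>
<table cellpadding="2" cellspacing="0" style="border:0px;">
<tr>
<td style="width:50px" align="right"><b>URL</b></td>
<td style="width:540px"><div class="table">http://proleadsmedia.com/m/1e13250t89231e4t16a/</div></td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Find the <td> whose <b> holds the label, then take the next <td> in the same row.
// Hypothetical helper; $label must not contain double quotes.
function value_after_label(DOMXPath $xpath, string $label): ?string
{
    $nodes = $xpath->query('//td[normalize-space(b)="' . $label . '"]/following-sibling::td[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // textContent also covers text nested in a <div> inside the cell.
    return trim($nodes->item(0)->textContent);
}

print 'Campaign Name: ' . value_after_label($xpath, 'Campaign Name') . "\n";
print 'URL: ' . value_after_label($xpath, 'URL') . "\n";
```

Because the lookup keys on the label text, the same call should work for "Requirements", "Country" and the other fields, provided the real page uses the same label-cell/value-cell layout.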

markman641
12-15-2011, 11:29 PM
i looked and i still dont understand..

Rowsdower!
12-16-2011, 01:45 PM
i looked and i still dont understand..

Sometimes that happens...

markman641
12-18-2011, 08:02 PM
why cant you just add it? i mean its just one more line i need extracted.

Rowsdower!
12-19-2011, 02:48 PM
why cant you just add it? i mean its just one more line i need extracted.

can != should

Show me anything (or everything) that you have tried in an effort to solve this yourself. Explain why you thought your solution should have worked, what the script then did that was wrong, and how you think it needs to be adjusted in order to work properly.

I am perfectly willing to discuss those topics with you. I simply will not provide any further explicit code for you on this one. I hope you can appreciate why.

markman641
01-09-2012, 12:01 AM
i tried googling everything and i cant figure out anything in that script..
ugh
this script is necessary for my site

tangoforce
01-09-2012, 03:14 AM
I just thought of something... I need to be logged into the site to access a page and idk if the script would be able to go to the URL... DX

So im not too sure if that would quite work.. but i will see


You posted your last comment back on page one when Rows originally started helping you. Seeing as you've repeated it line for line, here is something Rows also said on page 2:



I am perfectly willing to discuss those topics with you. I simply will not provide any further explicit code for you on this one. I hope you can appreciate why.

markman641
01-16-2012, 04:18 AM
uhmm what?


