cfructose
03-26-2010, 03:38 AM
Based on code found elsewhere, I'm trying to return the number of hits Google shows for any given word.
Here's the complete script, in case you're interested, and it works perfectly. You won't need to wade through this in order to understand my point later on:
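function getWordCount($word) {
    // Connect to Google. If we can't connect, return false.
    $h = fsockopen('www.google.com', 80);
    if ($h === false) {
        return false;
    }
    // Send the request for the quoted search phrase.
    fwrite($h, "GET /search?hl=en&q=%22" . urlencode($word) . "%22 HTTP/1.1\r\n");
    fwrite($h, "Connection: close\r\n");
    fwrite($h, "Host: www.google.com\r\n\r\n");
    // Read the whole response.
    $response = '';
    while (!feof($h)) {
        $response .= fread($h, 8192);
    }
    fclose($h);
    // Locate the result statistics block in the HTML.
    $needle = 'resultStats';
    $pos = strpos($response, $needle);
    if ($pos === false) {
        // Return false rather than the raw HTML when the marker is missing.
        echo $needle . " was not found";
        return false;
    }
    // Drop the tags and keep a short window of text after the marker.
    $response = strip_tags(substr($response, $pos));
    $response = substr($response, (strlen($needle) + 4), (strlen($needle) + 30));
    // Keep what lies between ' of' and 'for', then strip everything but digits and spaces.
    $needle2 = ' of';
    $pos = strpos($response, $needle2);
    $response = substr($response, $pos);
    $expl = explode('for', $response);
    $response = preg_replace('/[^\d\s]/', '', $expl[0]);
    // Trim the leftover whitespace so only the digits are returned.
    return trim($response);
}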
The reason for this post is that I'm very conscious that this code depends entirely on the precise details of the HTML that Google outputs.
If $word == (for example) 'hamster', then the code searches the HTML output for the part that contains the phrase:
"Results 1 - 10 of about 16,900,000 for hamster "
I finally managed to code something that succeeds, however messily, in grabbing the number of hits... Taken from the code above:
$needle2 = ' of';
$pos = strpos($response, $needle2);
$response = substr($response, $pos);
$expl = explode('for', $response);
This cuts the text at ' of' and then splits it at 'for', leaving the number we want (the number of Google hits for 'hamster') in the first element of the array, i.e. in $expl[0]. I then strip away all the unwanted characters with:
$response = preg_replace ('/[^\d\s]/', '', $expl[0]);
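For comparison, the same extraction could probably be done in one step with a regular expression. This is only a minimal sketch: extractHitCount() is a hypothetical helper name, and it assumes the text still has the "of about N for" shape shown in the example phrase above, so it is just as dependent on Google's wording as the original.

```php
<?php
// Hypothetical helper (not part of the original script): pull the hit count
// out of text shaped like "Results 1 - 10 of about 16,900,000 for hamster".
function extractHitCount($text) {
    // Capture the digits and commas between "of" (optionally "of about") and "for".
    if (preg_match('/of (?:about )?([\d,]+) for/', $text, $m)) {
        // Remove the thousands separators and return an integer.
        return (int) str_replace(',', '', $m[1]);
    }
    // No match: the wording has changed or the text is not a results line.
    return null;
}
```

Called on the example phrase, extractHitCount() should give the count as an integer, and null when the expected wording is absent.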
This all seems a very clumsy and brittle way of going about this.
Can anyone suggest an alternative method of using PHP to query Google for this data? Something that doesn't depend on the html layout?
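To illustrate what I mean by not depending on the layout: if the count came back as structured data instead of HTML, reading it would be a field lookup rather than string surgery. The sketch below is purely illustrative. hitCountFromJson() is a hypothetical helper, and the 'searchInformation' and 'totalResults' field names are an assumption about what a JSON search response might look like, not something taken from the script above.

```php
<?php
// Hypothetical helper: read a total-hit count from a JSON response body.
// The field names used here are assumed for illustration only.
function hitCountFromJson($json) {
    $data = json_decode($json, true);
    // Bail out if the body is not valid JSON or lacks the expected field.
    if (!is_array($data) || !isset($data['searchInformation']['totalResults'])) {
        return null;
    }
    return (int) $data['searchInformation']['totalResults'];
}
```

With a body such as {"searchInformation": {"totalResults": "16900000"}}, this would return the count as an integer, and null for anything that doesn't parse.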
As a side question, what's the legal status of performing such a request? Am I breaking any laws by grabbing statistics from Google in this manner?
Thanks for any insights.