Get number of Google hits

03-26-2010, 04:38 AM
Based on code found elsewhere, I'm trying to return the number of hits Google shows for any given word.

Here's the complete script, in case you're interested, and it works perfectly. You won't need to wade through this in order to understand my point later on:

function getWordCount($word) {
    // Connect and send request. If can't connect, return false
    if (($h = fsockopen('www.google.com', 80)) !== false) {
        fwrite($h, "GET /search?hl=en&q=%22" . urlencode($word) . "%22 HTTP/1.1\r\n");
        fwrite($h, "Connection: close\r\n");
        fwrite($h, "Host: www.google.com\r\n\r\n");

        // Read response
        $response = '';
        while (!feof($h)) $response .= fread($h, 8096);
        fclose($h);

        // The hit count sits just after this marker in the HTML
        $needle = 'resultStats';
        $pos = strpos($response, $needle);

        if ($pos !== false) {
            $response = strip_tags(substr($response, $pos));
            $response = substr($response, (strlen($needle)+4), (strlen($needle)+30));

            // Keep the text between ' of' and 'for'
            $needle2 = ' of';
            $pos = strpos($response, $needle2);
            $response = substr($response, $pos);
            $expl = explode('for', $response);

            // Strip everything except digits and whitespace
            $response = preg_replace('/[^\d\s]/', '', $expl[0]);
        } else {
            echo $needle . " was not found";
        }

        return $response;
    }

    return false;
}

The reason for this post is that I'm very conscious that this code depends entirely on the precise details of the html that Google outputs.

If $word == (for example) 'hamster', then the code searches the HTML output for the part that contains the phrase:

"Results 1 - 10 of about 16,900,000 for hamster "

I finally managed to code something that succeeds, however messily, in grabbing the number of hits... Taken from the code above:

$needle2 = ' of';
$pos = strpos($response, $needle2);
$response = substr($response, $pos);
$expl = explode('for', $response);

This cuts the text at ' of' and then explodes it at 'for', leaving the number we want (the number of Google hits for 'hamster') in the first element of the array, i.e. in $expl[0]. I then strip away all the unwanted stuff with:

$response = preg_replace ('/[^\d\s]/', '', $expl[0]);

This all seems a very clumsy and insecure way of going about this.

Can anyone suggest an alternative method of using PHP to query Google for this data? Something that doesn't depend on the html layout?

As a side-question, what's the status of the legality of performing such a request? Am I breaking any laws by grabbing statistics from Google in this manner?

Thanks for any insights.

03-26-2010, 05:52 AM
This is interesting.

My first thought is that you can parse out the count more reliably if you change your code to this:

$response = "Results 1 - 10 of about 16,900,000 for hamster ";

$needle2 = ' of about ';
$pos = strpos($response, $needle2);
$response = substr($response, $pos+10);
$expl = explode(' for ', $response);

echo $expl[0]."<br>"; // returns count
echo trim($expl[1]); // returns search phrase

This should better isolate the count. Testing for the position of ' of about ' and ' for ' (with outer spaces) avoids problems when the search words themselves contain the 'of' or 'for' character combinations. Also, the substring should start at the end of the matched string ($pos+10, since ' of about ' is 10 characters long), not at its beginning.

This makes the regex unnecessary.

03-26-2010, 07:09 AM
Thanks for these ideas.

Try Googling:

"purple rodent-like"
(a random weird phrase I just improvised in the hope that it would have very few hits!)

You'll notice that Google outputs:

Results 1 - 6 of 6 for "purple rodent-like". (0.09 seconds)

...which doesn't contain the word 'about'!

I had already realised this, hence not having handled my 'explodes' in quite the way you suggest. Your revised code is, however, clearly an improvement, and it's easily amended to reflect the fact that sometimes the text reads 'of', and sometimes 'of about'.
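
For what it's worth, here is a minimal sketch of that amendment: one pattern with an optional 'about' covers both phrasings. The function name parseResultCount is made up for illustration, and it assumes the stats text still reads 'of [about] N for', which is exactly the fragile part.

```php
<?php
// Hypothetical helper: pull the hit count out of Google's stats line.
// Assumes the text looks like "... of [about] 16,900,000 for ...".
function parseResultCount(string $text)
{
    // "about" is optional; the count is digits with comma separators.
    if (preg_match('/ of (?:about )?([\d,]+) for /', $text, $m) !== 1) {
        return false; // layout changed, or no stats line present
    }
    return (int) str_replace(',', '', $m[1]);
}

// Usage with the two sample lines quoted in this thread.
var_dump(parseResultCount('Results 1 - 10 of about 16,900,000 for hamster '));
var_dump(parseResultCount('Results 1 - 6 of 6 for "purple rodent-like". (0.09 seconds)'));
```

Returning false when nothing matches lets the caller tell "layout changed" apart from a genuine count of zero.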

Without meaning to appear ungrateful though, the primary purpose of my posting was to see if perhaps there's some way of getting at this data without recourse to such methods. I can't think of anything that could possibly work, so I may be stuck with php code that will break in the event that Google ever changes the formatting of their search results output.

If no-one has any alternative ideas, then I'll certainly go with your regex-less, neater version.

This is just a shot in the dark, but might there be some Google-provided API that could help with getting these figures?

What about the legal question I raised at the bottom of my initial posting?

P.S. How bizarre that "purple rodent-like" has actually been used by people!

P.P.S. Welcome to this forum!

03-26-2010, 03:41 PM
If you do the search with the quotes, then you are doing an exact match search and it changes the results. An exact match returns only pages that use the exact phrase somewhere on the page, plus pages that are identified with that keyword theme due to text links.

I thought you were just doing a typical broad match (without quotes) which returns any page that uses any of the search words somewhere on the page.

If you are trying to see how many competing pages there are for a particular search phrase, you should use an exact match.

That still becomes a little tricky because it looks like Google adds the word "about" whenever there are more than 1,000 results and when you do a broad match search, but it isn't even consistent about that. Major search engines only display the first 1,000 results. You cannot get to the search results pages beyond that.

Getting back to your original question, I don't know of an easier way to pull the search results count from Google. Be aware that you should be careful about using a script like this. Don't set it up to check 500 phrases in a few minutes. Google started blocking IPs for search results tools a few years ago. They do not like scrapers. If they detect too many search requests in a short period of time, they will likely block your IP. If you have any Google accounts, they know who you are through Google cookies. Big Brother is watching!
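
To make the "don't hammer them" point concrete, here is a rough sketch of spacing lookups out. The name throttledMap and the default delay are invented for illustration; the delay is an arbitrary figure, not a limit Google has published.

```php
<?php
// Hypothetical helper: run $lookup for each phrase, pausing between calls.
// $delayMicros is an arbitrary illustrative value, not a documented limit.
function throttledMap(array $phrases, callable $lookup, int $delayMicros = 2000000): array
{
    $results = [];
    $first = true;
    foreach ($phrases as $phrase) {
        if (!$first) {
            usleep($delayMicros); // wait before every call except the first
        }
        $first = false;
        $results[$phrase] = $lookup($phrase);
    }
    return $results;
}

// Usage with a stand-in lookup (strlen) instead of a real request.
print_r(throttledMap(['hamster', 'purple rodent-like'], 'strlen', 1000));
```

Passing the lookup in as a callable means the same loop works whether the count comes from scraping or from some other source.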

Thanks for the welcome. I will probably be visiting regularly. :D

03-30-2010, 09:06 PM
Yeah, I'll sometimes be doing a single word, and sometimes a phrase.

Anyhow, your cautions about Big Brother are enough to convince me that I ought to abandon this idea entirely! If the project I'm working on is to be successful at all, then there would indeed be hundreds of 'scrapes' per minute.

Ho hum. Back to the drawing board.

I don't suppose you can think of any clever ways to work out whether any given two-word phrase is a common collocation in the English language (i.e. whether they occur together with notable frequency), short of querying Google?

My initial plan was to set a minimum number for Google hits (say, 1000), and if the query string ended up returning more than that number, then I would know that those two words have been used together often enough that they're worth considering as an example for my purposes. (My purposes require knowing that these two words aren't utter nonsense or ludicrously rare when juxtaposed.)
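
In code, that threshold idea amounts to something like the sketch below. The function name isCommonCollocation is made up, and 1000 is just the example figure mentioned above, not a tested cut-off.

```php
<?php
// Hypothetical check: treat a phrase as a usable collocation when its
// hit count is above a minimum (1000 is only the example figure).
function isCommonCollocation(int $hits, int $minHits = 1000): bool
{
    return $hits > $minHits; // "more than that number", so strictly greater
}

// Usage with the two hit counts quoted earlier in the thread.
var_dump(isCommonCollocation(16900000));
var_dump(isCommonCollocation(6));
```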

An open source database of words alongside the frequency of occurrence of other words before and after them seems like a highly unlikely thing to exist! (I've searched, of course, and can find nothing usable!)

Well, even though I would be delighted to hear any suggestions about how this apparently intractable problem could be solved, I shan't hold my breath!

Thanks so much for your input. It was invaluable.

03-30-2010, 09:38 PM
Forgive me because I did not read the entire thread, but it seems like you are trying to page-scrape Google... why are you not using the Google API (http://code.google.com/apis/ajaxsearch/)?

03-30-2010, 10:20 PM
There is such a thing?!

To answer your rhetorical question "Why aren't you using the Google API", well, because I'm so ill-informed that I had no idea it existed!

You've rescued me from my despair!

Well, I went to the site, and I already have an AJAX Search API Key.
Gosh, that was quick and simple!

Thanks for the heads-up.

Am I right in presuming that I can now legitimately perform thousands of requests for such data without Google blacklisting me?

03-30-2010, 10:29 PM
heh, glad to be of help and avoid you re-coding the wheel :)

I believe the free API key allows you up to 1,000 requests a day or something. Google certainly won't blacklist you for using it because it is a published feature, as long as you stay within their fair usage policies. Check out the FAQ (http://code.google.com/apis/ajaxsearch/faq/#violatingTOS).
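
A rough sketch of how the count could be read through the API instead of from the HTML. Treat the details as assumptions from memory, not gospel: the endpoint URL, the optional key parameter, and the JSON path responseData.cursor.estimatedResultCount should all be checked against the API docs, and both function names are made up for illustration.

```php
<?php
// Assumption: the AJAX Search API's web endpoint and JSON layout
// (responseData.cursor.estimatedResultCount) as I remember them;
// check the API documentation before relying on either.
function buildSearchUrl(string $phrase, string $key = ''): string
{
    // Quote the phrase for an exact match, as in the original script.
    $url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q='
         . urlencode('"' . $phrase . '"');
    return $key === '' ? $url : $url . '&key=' . urlencode($key);
}

function extractEstimatedCount(string $json)
{
    $data = json_decode($json, true);
    if (!isset($data['responseData']['cursor']['estimatedResultCount'])) {
        return false; // unexpected shape or an error response
    }
    return (int) $data['responseData']['cursor']['estimatedResultCount'];
}

// Offline usage with a hand-written sample response (not real data).
$sample = '{"responseData":{"cursor":{"estimatedResultCount":"6"}},"responseStatus":200}';
var_dump(extractEstimatedCount($sample));
```

Fetching buildSearchUrl(...) with file_get_contents or cURL and passing the body to extractEstimatedCount would replace all of the strpos/explode work, since the count arrives as a named field and not as text buried in the page.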