This explodes the html at the words 'of' and 'for', resulting in the number we want (the number of Google hits for 'hamster'). The number ends up in the first index of the array, i.e. in $expl, and I then strip away all the unwanted stuff with:
My first thought is that you can parse out the count more reliably if you change your code to this:
$response = "Results 1 - 10 of about 16,900,000 for hamster ";
$needle2 = ' of about '; // 10 characters, including both outer spaces
$pos = strpos($response, $needle2);
$response = substr($response, $pos+10); // skip past the needle itself
$expl = explode(' for ', $response, 2); // split once: count, then phrase
echo $expl[0]."<br>"; // returns count
echo trim($expl[1]); // returns search phrase
This should better isolate the count. Testing for the position of ' of about ' and ' for ' (with outer spaces) avoids problems when the search phrase itself contains the character combinations 'of' or 'for' inside its words. Also, the substring should start at the end of the matched string ($pos+10), not at its beginning.
This makes the regex unnecessary.
Last edited by TopDogger; 03-26-2010 at 04:55 AM..
(a random weird phrase I just improvised in the hope that it would have very few hits!)
You'll notice that Google outputs:
Results 1 - 6 of 6 for "purple rodent-like". (0.09 seconds)
...which doesn't contain the word 'about'!
I had already realised this, hence not having handled my 'explodes' in quite the way you suggest. Your revised code is, however, clearly an improvement, and it's easily amended to reflect the fact that sometimes the text reads 'of', and sometimes 'of about'.
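A minimal sketch of that amendment, staying regex-less: it looks for ' of ', skips an optional 'about ', and reads the number up to ' for '. The function name parse_result_count is hypothetical (it appears nowhere else in this thread), and the sketch assumes the results line still has the 2010-era wording quoted above.

```php
<?php
// Hypothetical helper, not part of the code posted above.
// Returns the hit count as an integer, or null if the line does not
// look like "Results X - Y of [about ]N for ...".
function parse_result_count(string $line): ?int
{
    $pos = strpos($line, ' of ');
    if ($pos === false) {
        return null; // wording changed or no results line present
    }

    // Everything after ' of '.
    $rest = substr($line, $pos + strlen(' of '));

    // Skip the optional 'about ' that Google only sometimes adds.
    if (strpos($rest, 'about ') === 0) {
        $rest = substr($rest, strlen('about '));
    }

    // Split once, so a ' for ' inside the search phrase is left alone.
    $parts = explode(' for ', $rest, 2);
    if (count($parts) < 2) {
        return null;
    }

    // Drop thousands separators and make sure only digits remain.
    $digits = str_replace(',', '', $parts[0]);
    if (!ctype_digit($digits)) {
        return null;
    }

    return (int) $digits;
}
```

Returning null rather than 0 on a mismatch makes it easier to tell "no hits" apart from "Google changed the format".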
Without meaning to appear ungrateful though, the primary purpose of my posting was to see if perhaps there's some way of getting at this data without recourse to such methods. I can't think of anything that could possibly work, so I may be stuck with php code that will break in the event that Google ever changes the formatting of their search results output.
If no-one has any alternative ideas, then I'll certainly go with your regex-less, neater version.
This is just a shot in the dark, but might there be some Google-provided API that could help with getting these figures?
What about the legal question I raised at the bottom of my initial posting?
P.S. How bizarre that "purple rodent-like" has actually been used by people!
P.P.S. Welcome to this forum!
Last edited by cfructose; 03-26-2010 at 06:10 AM..
If you do the search with the quotes, then you are doing an exact match search and it changes the results. An exact match returns only pages that use the exact phrase somewhere on the page, plus pages that are identified with that keyword theme due to text links.
I thought you were just doing a typical broad match (without quotes), which returns any page that uses any of the search words somewhere on the page.
If you are trying to see how many competing pages there are for a particular search phrase, you should use an exact match.
That still becomes a little tricky because it looks like Google adds the word "about" whenever there are more than 1,000 results and when you do a broad match search, but it isn't even consistent about that. Major search engines only display the first 1,000 results. You cannot get to the search results pages beyond that.
Getting back to your original question, I don't know of an easier way to pull the search results count from Google. Be aware that you should be careful about using a script like this. Don't set it up to check 500 phrases in a few minutes. Google started blocking IPs for search results tools a few years ago. They do not like scrapers. If they detect too many search requests in a short period of time, they will likely block your IP. If you have any Google accounts, they know who you are through Google cookies. Big Brother is watching!
Thanks for the welcome. I will probably be visiting regularly.
Last edited by TopDogger; 03-26-2010 at 02:49 PM..
Yeah, I'll sometimes be doing a single word, and sometimes a phrase.
Anyhow, your cautions about Big Brother are enough to convince me that I ought to abandon this idea entirely! If the project I'm working on is to be successful at all, then there would indeed be hundreds of 'scrapes' per minute.
Ho hum. Back to the drawing board.
I don't suppose you can think of any clever ways to work out whether any given two-word phrase is a common collocation in the English language (i.e. whether they occur together with notable frequency), short of querying Google?
My initial plan was to set a minimum number of Google hits (say, 1000), and if the query string ended up returning more than that number, then I would know that those two words have been used together often enough that they're worth considering as an example for my purposes. (My purposes require knowing that these two words aren't utter nonsense or ludicrously rare when juxtaposed.)
An open source database of words alongside the frequency of occurrence of other words before and after them seems like a highly unlikely thing to exist! (I've searched, of course, and can find nothing usable!)
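For what it's worth, if a large enough body of English text were available locally, the same minimum-count test could be run against pair counts from that text instead of Google hits. What follows is only a rough sketch under that assumption; count_bigrams and is_common_pair are hypothetical names, the sample sentence is made up, and a real threshold would depend entirely on the size of the corpus.

```php
<?php
// Hypothetical sketch: count how often each pair of adjacent words
// occurs in a text you supply. Sentence boundaries are ignored.
function count_bigrams(string $text): array
{
    // Lowercase, then split on anything that is not a letter or apostrophe.
    $words = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    for ($i = 0; $i + 1 < count($words); $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
    }
    return $counts;
}

// Hypothetical helper: true if the pair occurs more than $minCount times.
function is_common_pair(array $counts, string $pair, int $minCount): bool
{
    return ($counts[strtolower($pair)] ?? 0) > $minCount;
}

// Tiny made-up sample, just to show the calls.
$counts = count_bigrams('The purple hamster met another purple hamster.');
var_dump(is_common_pair($counts, 'purple hamster', 1));
```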
Well, even though I would be delighted to hear any suggestions about how this apparently intractable problem could be solved, I shan't hold my breath!
Heh, glad to be of help and to save you from re-coding the wheel.
I believe the free API key allows you up to 1,000 requests a day or something. Google certainly won't blacklist you for using it because it is a published feature, as long as you stay within their fair usage policies. Check out the FAQ.