09-07-2009, 01:43 PM
I'm trying to construct a robots.txt that only allows crawling by certain bots. I know how to disallow specific bots by name, but not how to allow only specific ones by name.

One bot in particular is causing problems (5x the crawl traffic of Googlebot!). Well, AWStats says it's a bot 'identified by empty user agent string'. Is it even possible to block a bot that doesn't have a user-agent? I'm also thinking that if it doesn't identify itself, it probably won't obey robots.txt. I could block it in PHP, but can anyone think of anything legit that might not provide a user agent string?

09-07-2009, 02:12 PM
I tried blocking the blank user agents with

Options +FollowSymlinks
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteRule .* - [F,L]

but that blocked me and, judging by the load, it blocked everyone else too.

09-09-2009, 05:43 PM
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: MSNBot
Allow: /

User-agent: Slurp
Allow: /

Those are the 3 main ones.

It's debatable whether the Allow has to come before Disallow. Allow might also only work on Googlebot. Experiment with different variations and see.
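
One variation worth trying avoids Allow altogether: the original robots.txt convention treats an empty Disallow as "nothing is off limits", so each named bot can get its own record with an empty Disallow while the catch-all record blocks everyone else. Below is a minimal sketch for checking that idea with Python's standard urllib.robotparser. The rule set, the allowed() helper, and the "SomeOtherBot" name are made up for illustration, and Python's parser is only one interpretation of the rules, so a real crawler may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list: each named bot gets an empty Disallow
# (nothing blocked); every other bot falls through to the catch-all.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent, path="/", robots_txt=ROBOTS_TXT):
    """Return True if this parser would let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for bot in ("Googlebot", "MSNBot", "Slurp", "SomeOtherBot"):
        print(bot, allowed(bot, "/index.html"))
```

With this parser the three named bots should come back as allowed and anything else as blocked. It says nothing about bots that never read robots.txt in the first place.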

09-16-2009, 06:47 PM
Neither MSN nor Slurp supports the Allow directive, so the above will block them.

Robots.txt is not a very flexible tool, Nancy; you will need to disallow on an individual basis. Plus, there are many more traffic-sending search engines than those 3, so blanket blocking everything else, even if you could do it, would be a poor idea. And, even if you could, as you say, why would they listen to your robots.txt file? Most aggressive bots are malicious and won't take any notice at all!

If they are genuine bots that will listen, then they would be listed with a user agent and a web address.

09-16-2009, 07:56 PM
You may be correct, but according to their respective sites, both support the Allow directive:
Yahoo (http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html)
Bing (http://www.bing.com/community/blogs/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx)

I agree that a malicious bot will ignore it anyway. I like this guy's approach, even if it is a little heavy-handed: Bot Trap (http://www.kloth.net/internet/bottrap.php)

09-17-2009, 12:14 PM
I stand corrected. Historically it was a Google-only innovation, but evidently it has (at last) been adopted by the others. Thanks for correcting me.