09-07-2009, 12:43 PM | #1
NancyJ
Senior Coder

 
Join Date: Feb 2005
Location: Bradford, UK
Posts: 3,162
Thanks: 19
Thanked 65 Times in 64 Posts
NancyJ will become famous soon enough
Robots.txt blocking everything except Google, MSN, Yahoo

I'm trying to construct a robots.txt that only allows crawling by certain bots. I know how to disallow specific bots by name, but not how to allow only specific ones by name.

One bot in particular is causing problems (5x the crawl traffic of Googlebot!). Well, awstats says it's a bot 'identified by empty user agent string'. Is it even possible to block a bot that doesn't have a user-agent? I'm also thinking that if it doesn't identify itself, it probably won't obey robots.txt. I could block it in PHP, but can anyone think of anything legit that might not provide a user agent string?
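For anyone weighing the application-level route mentioned above, here is a minimal sketch of the idea. It is written in Python as a WSGI wrapper rather than PHP, purely for illustration. The names `block_empty_user_agent`, `hello_app` and `status_for` are made up for this example, and it assumes that refusing every request with a blank User-Agent is acceptable for the site, which is exactly the open question here.

```python
def block_empty_user_agent(app):
    """Wrap a WSGI app so requests with a missing or blank User-Agent get a 403."""
    def middleware(environ, start_response):
        # WSGI exposes the User-Agent header as HTTP_USER_AGENT; the key is
        # absent from environ when the client sent no such header.
        user_agent = environ.get("HTTP_USER_AGENT", "").strip()
        if not user_agent:
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    # Hypothetical stand-in for the real site.
    body = b"Hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(environ):
    """Run the wrapped app once and return the status line it produced."""
    seen = []

    def start_response(status, headers):
        seen.append(status)

    list(block_empty_user_agent(hello_app)(environ, start_response))
    return seen[0]
```

Calling `status_for({})` exercises the blank-agent path, and `status_for({"HTTP_USER_AGENT": "Googlebot/2.1"})` exercises the normal path.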
__________________
http://www.hazelryan.co.uk
09-07-2009, 01:12 PM | #2
NancyJ
I tried blocking the blank user agents with
Code:
Options +FollowSymlinks
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteRule .* - [F,L]
but that blocked me and, judging by the server load, it blocked everyone else too.
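A likely explanation for the block-everyone behaviour is the `[OR]` flag on the only RewriteCond. `[OR]` links a condition to the one after it, and here there is no next condition, so a failed match has nothing left to fail against and the rule fires anyway. Dropping the flag would probably restore the intended behaviour. The sketch below is a simplified Python model of that evaluation order, not Apache's actual code; `rule_fires` and `empty_ua_rule` are names invented for this example.

```python
def rule_fires(conditions):
    """Decide whether a RewriteRule's conditions let the rule apply.

    conditions: list of (matched, or_next) pairs in file order.
    Simplified model: conditions are ANDed together unless a condition
    carries [OR], which links it to the condition that follows it.
    """
    i = 0
    while i < len(conditions):
        matched, or_next = conditions[i]
        if or_next:
            if matched:
                # One member of the OR chain matched: skip the rest of the chain.
                while i < len(conditions) and conditions[i][1]:
                    i += 1
            # If it did not match, the next condition may still rescue the chain.
            # With a trailing [OR] there is no next condition, so nothing fails.
        elif not matched:
            return False
        i += 1
    return True


def empty_ua_rule(user_agent, trailing_or):
    """Model of: RewriteCond %{HTTP_USER_AGENT} ^$  (with or without [OR])."""
    matched = (user_agent == "")
    return rule_fires([(matched, trailing_or)])
```

In this model, `empty_ua_rule("Mozilla/5.0", trailing_or=True)` reports that the rule fires for an ordinary browser, while the same call with `trailing_or=False` does not.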
09-09-2009, 04:43 PM | #3
Fisher
Regular Coder

 
Join Date: Jan 2009
Posts: 316
Thanks: 7
Thanked 92 Times in 91 Posts
Fisher is on a distinguished road
Code:
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: MSNBot
Allow: /

User-agent: Slurp
Allow: /
Those are the 3 main ones.

It's debatable whether the Allow has to come before Disallow. Allow might also only work on Googlebot. Experiment with different variations and see.
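One low-risk way to experiment with variations is to run them through a robots.txt parser before deploying. The sketch below uses Python's standard `urllib.robotparser` on the rules above. Note that this only shows how that one parser reads the file; real crawlers may interpret it differently. `SomeOtherBot` and the helper `allowed` are hypothetical names for this example.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post, with a blank line between each group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: MSNBot
Allow: /

User-agent: Slurp
Allow: /
"""


def allowed(agent, path="/index.html", rules=ROBOTS_TXT):
    """Return True if this parser thinks `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Compare the three named crawlers against a made-up, unlisted one.
for agent in ("Googlebot", "MSNBot", "Slurp", "SomeOtherBot"):
    print(agent, allowed(agent))
```

Passing a different string as `rules` makes it easy to try reordered or reworded variants.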

Last edited by Fisher; 09-09-2009 at 04:49 PM.
09-16-2009, 05:47 PM | #4
Pennimus
Senior Coder

 
Join Date: Jul 2005
Location: UK
Posts: 1,051
Thanks: 6
Thanked 13 Times in 13 Posts
Pennimus is on a distinguished road
Neither MSN nor Slurp supports the Allow directive, so the above will block them.

Robots.txt is not a very flexible tool, Nancy; you will need to disallow on an individual basis. Plus, there are many more traffic-sending search engines than those 3, so blanket-blocking everything else, even if you could do it, would be a poor idea. And even if you could, as you say, why would they listen to your robots.txt file? Most aggressive bots are malicious and won't take any notice at all!

If they are genuine bots that will listen, then they would be listed with a user agent and a web address.
09-16-2009, 06:56 PM | #5
Fisher
Quote:
Originally Posted by Pennimus View Post
Neither MSN nor Slurp supports the Allow directive, so the above will block them.
You may be correct, but according to their respective sites, both support the Allow directive:
Yahoo
Bing

I agree that a malicious bot will ignore it anyway. I like this guy's approach, even if it is a little heavy-handed: Bot Trap
09-17-2009, 11:14 AM | #6
Pennimus
I stand corrected. Historically it was a Google-only innovation, but it has evidently (at last) been adopted by the others. Thanks for correcting me.