
View Full Version : Tags to stop search engines from spidering a page?



WA
07-01-2002, 12:32 PM
I'm sure I knew this before, but what are the tag(s) for preventing search engines from spidering a page? If you know any other techniques to further ensure this, please do share as well.

The old WA Forum has long been a spider magnet with its static pages, and I believe it's time to put a stop to it.

Thanks,

joh6nn
07-01-2002, 12:38 PM
i've never actually used any of this, George, but this is Google's suggested reading on the subject:
http://www.robotstxt.org/wc/norobots.html

boxer_1
07-01-2002, 12:51 PM
In addition to the robots.txt file mentioned in the page Joh6nn linked to, I've also been using the following meta tag (all the values I know of are listed below for example purposes):

<meta name="robots" content="All | NoIndex | NoFollow | None">
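
In practice you would just pick the values you actually need and separate them with commas; for instance, if I wanted to keep a page out of the index and stop spiders from following its links, I believe it would look like this:

<meta name="robots" content="noindex,nofollow">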

Edit: Here's another useful link on the subject from WDVL:

http://www.wdvl.com/Location/Search/Robots.html :thumbsup:

justame
07-01-2002, 02:53 PM
wa...
/me has this from feyd from just a longgg® time back...

"quote...
Now, you need to create a new file, named ROBOTS.TXT (case does not matter). Within that new file, it should contain this:

#/robots.txt file for http://www.domain.com/

User-agent: *
Disallow: /consultants
Disallow: /images
Disallow: /bin
Disallow: /cgi-bin/
Disallow: /test

The first line (the one starting with #) is just a comment identifying the file, and should be kept to ONE line only.

User-agent: * means all agents must conform to this robots.txt file

Disallow tells it not to go into the following directories. So, in short, if you don't want him in your image directory, disallow it...

The robots.txt simply acts as a rulebook for the current site, the spider must do as it says.

You also might want to duplicate your keyword list and place it in the ALT tag of one of the first or second images that loads on your index page...some spiders, specifically AltaVista, like this...

All the major search engine robots, or spiders, recognize and conform to the robots.txt standard. Some search engines also recognize the META ROBOTS tag, but for your purposes, the robots.txt exclusions would work fine.

Just do:
Disallow: /anyother.html
Disallow: /contact.html
Disallow: /images
Disallow: /about
Disallow everything but the files you do want listed, and remember that you can just Disallow an entire folder rather than disallow each individual page (unless a folder contains both pages you want listed and pages you don't; in that case you would have to disallow each unwanted page in that folder rather than the entire folder)...endquote"
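
so just to spell out that last bit with a made-up example (the folder and file names here are purely for illustration):

User-agent: *
# a whole folder can be blocked in one line...
Disallow: /private/
# ...but if a folder mixes pages you want listed with pages you don't,
# you disallow only the unwanted pages
Disallow: /docs/draft.html
Disallow: /docs/old.html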

Jeepers
07-01-2002, 08:20 PM
Originally posted by boxer_1
Edit: Here's another useful link on the subject from WDVL:

http://www.wdvl.com/Location/Search/Robots.html

I had always believed that there was only a Disallow instruction; now it seems there is also an Allow instruction.

I'm wondering, if you disallow everything on the first line and then on following lines allow certain files or directories, would it work??

boxer_1
07-01-2002, 09:04 PM
Originally posted by Jeepers


I had always believed that there was only a Disallow instruction; now it seems there is also an Allow instruction.

I'm wondering, if you disallow everything on the first line and then on following lines allow certain files or directories, would it work??

Hmm...I believe that if you want to 'Allow' access to certain files and Disallow others, you should specify what you wish to Allow first, though I'm not sure whether it makes a difference.

Here's a quote from the page I provided the link to:

The record starts with one or more User-agent lines, specifying which robots the record applies to, followed by "Disallow" and "Allow" instructions to that robot. To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

For example:
User-agent: webcrawler
User-agent: infoseek
Allow: /tmp/ok.html
Disallow: /tmp

WebCrawler and InfoSeek are not allowed access to the /tmp/ directory, except to access ok.html. All other robots are allowed unrestricted access.

There are also quite a few other links on the WDVL page I linked to with examples and more in-depth information ;).

Feyd
07-01-2002, 09:27 PM
Sorry, there is no such thing as an allow command. Period. Forget it.

You can only Disallow; anything NOT DISALLOWED is obviously ALLOWED. Remember, boys and girls...anyone who says you can use an Allow command in robots.txt is flat out wrong, end of conversation. And that includes numerous horribly misinformed sites. (The last poll found nearly 65% of the 3,000,000 robots.txt files surveyed were badly formatted and non-functional because of mistakes like these being spread around.)
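
So a record like this (the path is made up purely for illustration) needs no Allow line at all; everything other than /cgi-bin/ gets crawled simply because it is not disallowed:

User-agent: *
Disallow: /cgi-bin/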

You can Disallow the specific file you want never to be listed, or the entire directory that file resides in, though that will cause all other files in that directory not to be listed as well. Note, however, that we're talking about true spiders, ones that follow the rules. I can write my own spider that sucks email addresses and images and teach it to disregard robots.txt, so you may want to set up a trap (in fact, I've seen robots that will open robots.txt, look specifically for disallowed directories, and crawl those first...the logic should be obvious as to why they would do that). Anyway, disallow a page, don't link to it from anywhere, and set up logging on that page that captures the IP and user-agent string; any hit on this page is a bad spider/agent and you should probably consider blocking them from your site...
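
For what it's worth, here is a rough sketch of what such a trap page could look like as a CGI script (the file and log names are made up, and any server-side language would do just as well):

#!/usr/bin/env python3
# trap.py -- rough sketch of the "trap page" idea (hypothetical names/paths).
# The page is disallowed in robots.txt and linked from nowhere, so any
# request that reaches it comes from a robot that ignored the rules.
import os
import datetime

LOGFILE = "/var/log/badbots.log"  # hypothetical log location

def log_hit():
    # CGI exposes the visitor's address and user-agent via the environment.
    ip = os.environ.get("REMOTE_ADDR", "unknown")
    agent = os.environ.get("HTTP_USER_AGENT", "unknown")
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    with open(LOGFILE, "a") as log:
        log.write("%s %s %s\n" % (stamp, ip, agent))

if __name__ == "__main__":
    log_hit()
    # Serve a bland page; the hit itself is the evidence.
    print("Content-Type: text/plain")
    print()
    print("Nothing to see here.")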

Also, the META tag for robot control is not recognized by a good portion of robots, so robots.txt is your best bet for total control (in fact, the only reason to use the meta tag is for things like Google's cache control and no-archiving; otherwise everyone has to listen to robots.txt).
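
For example, if all you wanted was to let Google index a page but not keep a cached copy of it, that is exactly the kind of job for the meta tag rather than robots.txt:

<meta name="robots" content="noarchive">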

And WA, I can send you a robots.txt that will block every single non-browser spider (at least so far) and only allow true surfers in. (also would have been nice when the old forums were getting harvested so much by spammers).

joh6nn
07-02-2002, 12:56 PM
Feyd, i'd love to have a copy of that file as well, if you could. joh6nn@hotmail.com is my email. thanks.


