07-01-2002, 11:32 AM | #1
WA
Administrator
Join Date: Mar 2002
Posts: 2,596
Tags to stop search engines from spidering a page?

I'm sure I knew this before, but what are the tag(s) for preventing search engines from spidering a page? If you know any other techniques to further ensure this, please do share as well.

The old WA Forum has long been a spider magnet with its static pages, and I believe it's time to put a stop to it.

Thanks,
__________________
- George
- JavaScript Kit- JavaScript tutorials and 400+ scripts!
- JavaScript Reference- JavaScript reference you can relate to.
07-01-2002, 11:38 AM | #2
joh6nn
wei wu wei
Join Date: Jun 2002
Location: 72° W. 48' 57" , 41° N. 32' 04"
Posts: 1,887
i've never actually used any of this, George, but this is Google's suggested reading on the subject:
http://www.robotstxt.org/wc/norobots.html
__________________
bluemood | devedge | devmo | MS Dev Library | WebMonkey | the Guide

i am a loser geek, crazy with an evil streak,
yes i do believe there is a violent thing inside of me.
07-01-2002, 11:51 AM | #3
boxer_1
Regular Coder
Join Date: May 2002
Location: Maine, USA
Posts: 574
In addition to the robots.txt file mentioned on the page joh6nn linked to, I've also been using the following meta tag (all the values I know of are listed, separated by |, for example purposes; in a real tag, pick the ones you need):

<meta name="robots" content="All | NoIndex | NoFollow | None">
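To make the meta tag concrete, here is a minimal sketch of how a crawler might read it, using Python's standard html.parser module. The class and function names are made up for illustration, and the handling of "none" as shorthand for "noindex, nofollow" follows the usual convention rather than any particular crawler's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # Directives are comma-separated and treated as case-insensitive.
        for part in attributes.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_index(directives):
    # "none" is conventionally shorthand for "noindex, nofollow".
    return not ({"noindex", "none"} & directives)


def may_follow(directives):
    return not ({"nofollow", "none"} & directives)


page = '<html><head><meta name="robots" content="NoIndex, NoFollow"></head></html>'
found = robots_directives(page)
print(sorted(found), may_index(found), may_follow(found))
```

A page with no robots meta tag yields an empty set, which this sketch treats as "index and follow".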

Edit: Here's another useful link on the subject from WDVL:

http://www.wdvl.com/Location/Search/Robots.html
__________________
boxer_1
CodingForums Moderator
"How did a fool and his money get together in the first place?"

Last edited by boxer_1; 07-01-2002 at 11:56 AM.
07-01-2002, 01:53 PM | #4
justame
Regular Coder
Join Date: Jun 2002
Posts: 676
wa...
/me has this from feyd from just a longgg® time back...

"quote...
Now, you need to create a new file named ROBOTS.TXT (case does not matter). That file should contain this:

#/robots.txt file for http://www.domain.com/

User-agent: *
Disallow: /consultants
Disallow: /images
Disallow: /bin
Disallow: /cgi-bin/
Disallow: /test

The first line should be on ONE line only; it is a comment that says what the file is for.

User-agent: * means all agents must conform to this robots.txt file.

Disallow tells it not to go into the following directories. So, in short, if you don't want him in your image directory, disallow it...

The robots.txt simply acts as a rulebook for the current site; the spider must do as it says.

You also might want to duplicate your keyword list and place it in one of the ALT tags of one of your first or second images which loads on your index page...some spiders, specifically Alta Vista, like this...

All search engine robots, or spiders, recognize and conform to the robots.txt standard. Some search engines also recognize the META ROBOTS tag, but for your purposes, the robots.txt exclusions would work fine.

Just do:
Disallow: /anyother.html
Disallow: /contact.html
Disallow: /images
Disallow: /about
Disallow everything but the files you do want listed, and remember that you can just disallow an entire folder rather than each individual page (unless that folder holds both pages you want listed and pages you don't, in which case you have to disallow the unwanted pages one by one rather than the entire folder)...endquote"
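For anyone who wants to check a file like the one quoted above before uploading it, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name "ExampleBot" and the test paths are made up; the rules are the ones from the quote.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
#/robots.txt file for http://www.domain.com/

User-agent: *
Disallow: /consultants
Disallow: /images
Disallow: /bin
Disallow: /cgi-bin/
Disallow: /test
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
for path in ("/index.html", "/images/logo.gif", "/cgi-bin/form.pl", "/testing.html"):
    verdict = "allowed" if rules.can_fetch("ExampleBot", path) else "blocked"
    print(path, verdict)
```

Note that Disallow values are matched as path prefixes, so "Disallow: /test" also blocks /testing.html. Add a trailing slash (as in /cgi-bin/) if you only mean the directory.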
__________________
The New JustaBuster Version 2.0 OR JustaBusta Lite V2.0
...just a special® thanx kinda hugs to jkd n' nex ...:O)))

CommemorateWTC.com --Please lend your support
07-01-2002, 07:20 PM | #5
Jeepers
Regular Coder
Join Date: Jun 2002
Location: Conwy. UK
Posts: 122
Quote:
Originally posted by boxer_1
Edit: Here's another useful link on the subject from WDVL:

http://www.wdvl.com/Location/Search/Robots.html
I had always believed that there was only a disallow instruction; now it seems as if there is also an allow instruction.

I'm wondering, if you disallow everything on the first line and then on following lines allow certain files or directories, would it work??
__________________
An answer needs a question just as much as a question needs an answer. Deep eh!
07-01-2002, 08:04 PM | #6
boxer_1
Regular Coder
Quote:
Originally posted by Jeepers


I had always believed that there was only a disallow instruction; now it seems as if there is also an allow instruction.

I'm wondering, if you disallow everything on the first line and then on following lines allow certain files or directories, would it work??
Hmm...I believe that if you want to 'Allow' access to certain files and Disallow others you should specify what you wish to Allow first, though I'm not sure it makes a difference.

Here's a quote from the page I provided the link to:
Quote:
The record starts with one or more User-agent lines, specifying which robots the record applies to, followed by "Disallow" and "Allow" instructions to that robot. To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

For example:
User-agent: webcrawler
User-agent: infoseek
Allow: /tmp/ok.html
Disallow: /tmp

WebCrawler and InfoSeek are not allowed access to the /tmp/ directory, except to access ok.html. All other robots are allowed unrestricted access.
There are also quite a few other links on the WDVL page I linked to with examples and more in-depth information.
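The first-match rule in that quote can be sketched in a few lines of Python. This is only an illustration of the rule as quoted, assuming a robot that honors Allow lines at all; the function name and the /public/ path are made up, and real crawlers may resolve conflicts differently.

```python
def is_allowed(rules, path):
    """Apply the first-match rule: rules are (directive, prefix) pairs in file order."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            # The first matching line decides, whatever comes after it.
            return directive.lower() == "allow"
    # No line matched: the default assumption is that the URL is allowed.
    return True


# The record from the quote, as it applies to WebCrawler and InfoSeek.
quoted_record = [("Allow", "/tmp/ok.html"), ("Disallow", "/tmp")]

# Disallow everything on the first line, then allow a directory.
disallow_first = [("Disallow", "/"), ("Allow", "/public/")]
# The same two lines with the Allow line first.
allow_first = [("Allow", "/public/"), ("Disallow", "/")]

print(is_allowed(quoted_record, "/tmp/ok.html"))
print(is_allowed(quoted_record, "/tmp/other.html"))
print(is_allowed(disallow_first, "/public/page.html"))
print(is_allowed(allow_first, "/public/page.html"))
```

Under this rule, a leading "Disallow: /" matches every URL first, so the Allow line after it never gets a chance. Putting the Allow line first is what lets /public/ through.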
07-01-2002, 08:27 PM | #7
Feyd
Regular Coder
Join Date: May 2002
Location: Los Angeles, CA Maxim: Subvert Society
Posts: 404
Sorry, there is no such thing as an allow command. Period. Forget it.

You can only Disallow; anything NOT DISALLOWED is obviously ALLOWED. Remember, boys and girls...anyone who says you can use an Allow command in robots.txt is flat out wrong, end of conversation. And that includes numerous horribly misinformed sites. (The last poll found nearly 65% of the 3,000,000 robots.txt files searched to be badly formatted and non-functional because of mistakes like these that sites are spreading around.)

You can Disallow the specific file you never want listed, or the entire directory that file resides in, though that will keep all the other files in that directory from being listed as well. Note, however, that we're talking about true spiders, the ones that follow the rules. I can write my own spider that sucks up email addresses and images and teach it to disregard robots.txt, so you may want to set up a trap. (In fact, I've seen robots that open robots.txt, look specifically for disallowed directories, and crawl those first...the logic behind that should be obvious.) Anyway, disallow a page and don't link to it from anywhere, then set up logging on that page to capture the IP and user agent string. Any hit on that page comes from a bad spider/agent, and you should probably consider blocking it from your site...
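One way to act on the trap idea is to scan the server's access log for requests to the trap page. The sketch below is in Python and assumes the common "combined" log layout; the trap path, the sample log lines, and the function name are all hypothetical, so adjust them to your own server.

```python
import re
from collections import defaultdict

# Hypothetical trap page: disallowed in robots.txt and linked from nowhere.
TRAP_PATH = "/trap/do-not-enter.html"

# Assumes the "combined" access log layout; other formats need another pattern.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def find_trap_hits(log_lines, trap_path=TRAP_PATH):
    """Return {ip: set of user agent strings} for every request to the trap."""
    offenders = defaultdict(set)
    for line in log_lines:
        match = LOG_LINE.match(line)
        # Ignore any query string when comparing against the trap path.
        if match and match.group("path").split("?", 1)[0] == trap_path:
            offenders[match.group("ip")].add(match.group("agent"))
    return dict(offenders)


# Made-up log lines using documentation-only IP addresses.
sample_log = [
    '203.0.113.9 - - [01/Jul/2002:20:30:01 +0100] '
    '"GET /trap/do-not-enter.html HTTP/1.0" 200 512 "-" "ExampleHarvester/1.0"',
    '198.51.100.4 - - [01/Jul/2002:20:31:12 +0100] '
    '"GET /index.html HTTP/1.0" 200 2048 "-" "Mozilla/4.0"',
]
for ip, agents in find_trap_hits(sample_log).items():
    print(ip, sorted(agents))
```

The resulting IPs and user agent strings are only candidates for blocking; how you block them depends on your server.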

Also, the META tags for robot control are not recognized by a good portion of robots, and robots.txt is your best bet for total control. (In fact, the only reason to use the meta tag is for such things as Google's cache control and no archiving; otherwise everyone has to listen to robots.txt.)

And WA, I can send you a robots.txt that will block every single non-browser spider (at least so far) and only allow true surfers in. (It also would have been nice to have when the old forums were getting harvested so much by spammers.)
__________________
Moderator, Perl/CGI Forum
shadowstorm.net - subvert society
07-02-2002, 11:56 AM | #8
joh6nn
wei wu wei
Feyd, i'd love to have a copy of that file as well, if you could. joh6nn@hotmail.com is my email. thanks.