#1 - WA (Administrator)
Join Date: Mar 2002 | Posts: 2,596 | Thanks: 2 | Thanked 19 Times in 18 Posts

    Tags to stop search engines from spidering a page?

    I'm sure I knew this before, but what are the tag(s) for preventing search engines from spidering a page? If you know any other techniques to further ensure this, please do share as well.

    The old WA Forum has long been a spider magnet with its static pages, and I believe it's time to put a stop to it.

    Thanks,
    - George
- JavaScript Kit - JavaScript tutorials and 400+ scripts!
- JavaScript Reference - JavaScript reference you can relate to.

#2 - joh6nn (Senior Coder)
Join Date: Jun 2002 | Location: 72° W. 48' 57", 41° N. 32' 04" | Posts: 1,887 | Thanks: 0 | Thanked 1 Time in 1 Post
    i've never actually used any of this, George, but this is what Google's suggested reading on the subject:
    http://www.robotstxt.org/wc/norobots.html
    bluemood | devedge | devmo | MS Dev Library | WebMonkey | the Guide

    i am a loser geek, crazy with an evil streak,
    yes i do believe there is a violent thing inside of me.

#3 - boxer_1 (Regular Coder)
Join Date: May 2002 | Location: Maine, USA | Posts: 574 | Thanks: 0 | Thanked 0 Times in 0 Posts
In addition to the robots.txt file mentioned in the page Joh6nn linked to, I've also been using the following meta tag (all the values I know of are listed for example purposes):

    <meta name="robots" content="All | NoIndex | NoFollow | None">

    Edit: Here's another useful link on the subject from WDVL:

    http://www.wdvl.com/Location/Search/Robots.html
    Last edited by boxer_1; 07-01-2002 at 11:56 AM.
    boxer_1
    CodingForums Moderator
    "How did a fool and his money get together in the first place?"

#4 - Regular Coder
Join Date: Jun 2002 | Posts: 676 | Thanks: 1 | Thanked 0 Times in 0 Posts
    wa...
    /me has this from feyd from just a longgg® time back...

    "quote...
Now, you need to create a new file named ROBOTS.TXT (case does not matter). That new file should contain this:

    #/robots.txt file for http://www.domain.com/

    User-agent: *
    Disallow: /consultants
    Disallow: /images
    Disallow: /bin
    Disallow: /cgi-bin/
    Disallow: /test

    The first line should be on ONE line only, and designates to the spider what it is doing.

User-agent: * means all agents must conform to this robots.txt file.

    Disallow tells it not to go into the following directories. So, in short, if you don't want him in your image directory, disallow it...

    The robots.txt simply acts as a rulebook for the current site, the spider must do as it says.

    You also might want to duplicate your keyword list and place it in one of the ALT tags of one of your first or second images which loads on your index page...some spiders, specifically Alta Vista, like this...

    All search engine robots, or spiders, recognize and conform to the robots.txt standard. Some search engines also recognize the META ROBOTS tag, but for your purposes, the robots.txt exclusions would work fine.

    Just do:
    Disallow: /anyother.html
    Disallow: /contact.html
    Disallow: /images
    Disallow: /about
Disallow everything but the files you do want listed, and remember that you can just Disallow an entire folder rather than disallowing each individual page (unless that folder contains both pages you want listed and pages you don't, in which case you would have to disallow each unwanted page in that folder rather than the entire folder)...endquote"
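
For George's actual case, keeping spiders off the old forum pages entirely, a minimal robots.txt along these lines should do it (the /oldforum/ path is just an assumed example; Disallow: / would shut compliant robots out of the whole site):

# robots.txt sketch for http://www.domain.com/
# keep all compliant robots out of the old forum's static pages
User-agent: *
Disallow: /oldforum/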
    The New JustaBuster Version 2.0 OR JustaBusta Lite V2.0
    ...just a special® thanx kinda hugs to jkd n' nex ...:O)))

    CommemorateWTC.com --Please lend your support

#5 - Jeepers (Regular Coder)
Join Date: Jun 2002 | Location: Conwy, UK | Posts: 122 | Thanks: 0 | Thanked 0 Times in 0 Posts
    Originally posted by boxer_1
    Edit: Here's another useful link on the subject from WDVL:

    http://www.wdvl.com/Location/Search/Robots.html
I had always believed that there was only a disallow instruction; now it seems there is also an allow instruction.

I'm wondering: if you disallow everything on the first line and then allow certain files or directories on following lines, would it work?
    An answer needs a question just as much as a question needs an answer. Deep eh!

#6 - boxer_1 (Regular Coder)
Join Date: May 2002 | Location: Maine, USA | Posts: 574 | Thanks: 0 | Thanked 0 Times in 0 Posts
Originally posted by Jeepers
I had always believed that there was only a disallow instruction; now it seems there is also an allow instruction.

I'm wondering: if you disallow everything on the first line and then allow certain files or directories on following lines, would it work?
    Hmm...I believe that if you want to 'Allow' access to certain files and Disallow others you should specify what you wish to Allow first, though I'm not sure it makes a difference.

    Here's a quote from the page I provided the link to:
    The record starts with one or more User-agent lines, specifying which robots the record applies to, followed by "Disallow" and "Allow" instructions to that robot. To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

    For example:
    User-agent: webcrawler
    User-agent: infoseek
    Allow: /tmp/ok.html
    Disallow: /tmp

    WebCrawler and InfoSeek are not allowed access to the /tmp/ directory, except to access ok.html. All other robots are allowed unrestricted access.
There are also quite a few other links on the WDVL page I linked to, with examples and more in-depth information.
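
To make the ordering point concrete (assuming the first-match rule from the draft spec quoted above), here are the two lines written as two alternative files; only the second keeps ok.html indexable:

# Ordering A: Disallow matches /tmp/ok.html first, so it is blocked too
User-agent: *
Disallow: /tmp
Allow: /tmp/ok.html

# Ordering B: Allow matches first, so /tmp/ok.html stays accessible
User-agent: *
Allow: /tmp/ok.html
Disallow: /tmp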
    boxer_1
    CodingForums Moderator
    "How did a fool and his money get together in the first place?"

#7 - Feyd (Regular Coder)
Join Date: May 2002 | Location: Los Angeles, CA | Maxim: Subvert Society | Posts: 404 | Thanks: 0 | Thanked 0 Times in 0 Posts
    Sorry, there is no such thing as an allow command. Period. Forget it.

You can only Disallow; anything NOT DISALLOWED is obviously ALLOWED. Remember, boys and girls: anyone who says you can use an Allow command in robots.txt is flat out wrong, end of conversation. And that includes numerous horribly misinformed sites. (The last poll found nearly 65% of the 3,000,000 robots.txt files searched were badly formatted and non-functional because of mistakes like these that sites are spreading around.)

You can Disallow the specific file you never want listed, or the entire directory that file resides in, though that will cause all other files in that directory to go unlisted as well. Note, however, that we're talking true spiders, ones that follow the rules. I can write my own spider that sucks up email addresses and images and teach it to disregard robots.txt, so you may want to set up a trap. (In fact, I've seen robots that open robots.txt, look specifically for disallowed directories, and crawl those first...the logic behind why they would do that should be obvious.) Anyway, disallow a page and don't link to it from anywhere, then set up logging on that page that captures the IP and user agent string...any hit on that page is a bad spider/agent, and you should probably consider blocking them from your site...
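
A minimal sketch of such a trap, written here as a Python CGI script purely for illustration (the URL and log path are made up): list the page as Disallowed in robots.txt, never link to it, and treat any hit as a misbehaving robot.

#!/usr/bin/env python3
# trap.py - hypothetical bot-trap sketch; serve it at a Disallowed, unlinked URL
import os, datetime

LOG_FILE = "/var/log/bad-bots.log"  # assumption: adjust to a writable path

ip = os.environ.get("REMOTE_ADDR", "unknown")
ua = os.environ.get("HTTP_USER_AGENT", "unknown")

# append one line per hit: timestamp, IP, user agent string
with open(LOG_FILE, "a") as log:
    log.write("%s %s %s\n" % (datetime.datetime.utcnow().isoformat(), ip, ua))

# minimal CGI response
print("Content-Type: text/html")
print()
print("<html><body>Nothing to see here.</body></html>")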

Also, the META tags for robot control are not recognized by a good portion of robots, and robots.txt is your best bet for total control (in fact, the only reason to use the meta tag is for such things as Google's cache control and no-archiving; otherwise everyone has to listen to robots.txt).
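
For reference, the cache control he mentions is done with that same meta tag; a sketch (noarchive is the value Google documents for keeping a page out of its cache):

<meta name="robots" content="noarchive">
<!-- or target Google's spider specifically: -->
<meta name="googlebot" content="noarchive">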

And WA, I can send you a robots.txt that will block every single non-browser spider (at least so far) and only allow true surfers in. (It also would have been nice back when the old forums were getting harvested so much by spammers.)
    Moderator, Perl/CGI Forum
    shadowstorm.net - subvert society

#8 - joh6nn (Senior Coder)
Join Date: Jun 2002 | Location: 72° W. 48' 57", 41° N. 32' 04" | Posts: 1,887 | Thanks: 0 | Thanked 1 Time in 1 Post
    Feyd, i'd love to have a copy of that file as well, if you could. joh6nn@hotmail.com is my email. thanks.
    bluemood | devedge | devmo | MS Dev Library | WebMonkey | the Guide

    i am a loser geek, crazy with an evil streak,
    yes i do believe there is a violent thing inside of me.

