The first line should be on ONE line only, and tells the spider which rules apply to it.
User-agent: * means all agents must conform to this robots.txt file.
Disallow tells it not to go into the listed directories. So, in short, if you don't want a spider in your image directory, disallow it...
The robots.txt simply acts as a rulebook for the current site; a compliant spider must do as it says.
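As a minimal sketch (the directory name here is just a placeholder), a robots.txt that keeps all compliant spiders out of an image directory looks like this:

```
User-agent: *
Disallow: /images/
```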
You also might want to duplicate your keyword list and place it in the ALT tag of one of the first or second images that loads on your index page...some spiders, AltaVista in particular, like this...
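If you try that, the ALT text just rides along on an ordinary image tag; the filename and keywords below are made up for illustration:

```html
<img src="banner.gif" alt="widgets, blue widgets, discount widget repair">
```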
All search engine robots, or spiders, recognize and conform to the robots.txt standard. Some search engines also recognize the META ROBOTS tag, but for your purposes, the robots.txt exclusions would work fine.
Disallow everything but the files you do want listed, and remember that you can just
Disallow an entire folder rather than disallow each individual page (unless a folder contains some pages you want listed and some you don't; in that case you would have to disallow each unwanted page in that folder rather than the entire folder)...endquote"
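A sketch of that approach, with invented folder and file names: block whole folders where you can, and list individual pages only where a folder is mixed:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /articles/draft-page.html
```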
I had always believed that there was only a disallow instruction, now it seems as if there is also an allow instruction.
I'm wondering: if you disallow everything on the first line and then allow certain files or directories on the following lines, would it work??
Hmm...I believe that if you want to 'Allow' access to certain files and Disallow others you should specify what you wish to Allow first, though I'm not sure it makes a difference.
Here's a quote from the page I provided the link to:
The record starts with one or more User-agent lines, specifying which robots the record applies to, followed by "Disallow" and "Allow" instructions to that robot. To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.
WebCrawler and InfoSeek are not allowed access to the /tmp/ directory, except to access ok.html. All other robots are allowed unrestricted access.
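Under the rules quoted above (paths matched in the order they occur, first match wins), a record expressing that policy would look something like this:

```
User-agent: webcrawler
User-agent: infoseek
Allow: /tmp/ok.html
Disallow: /tmp
```

Note the Allow line comes before the Disallow line, so a request for /tmp/ok.html matches it first and gets through.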
There are also quite a few other links on the WDVL page I linked to, with examples and more in-depth information.
"How did a fool and his money get together in the first place?"
Sorry, there is no such thing as an allow command. Period. Forget it.
You can only Disallow; anything NOT DISALLOWED is obviously ALLOWED. Remember, boys and girls...anyone who says you can use an Allow command in robots.txt is flat out wrong, end of conversation. And that includes numerous horribly misinformed sites. (The last poll found nearly 65% of the 3,000,000 robots.txt files surveyed were badly formatted and non-functional because of mistakes like these being spread around.)
You can Disallow the specific file you never want listed, or the entire directory that file resides in, though that will keep all the other files in that directory from being listed as well. Note, however, that we're talking about true spiders, ones that follow the rules. I can write my own spider that sucks up email addresses and images and teach it to disregard robots.txt, so you may want to set up a trap. (In fact, I've seen robots that open robots.txt, look specifically for disallowed directories, and crawl those first...the logic behind why they would do that should be obvious.) Anyway: disallow a page, don't link to it from anywhere, and set up logging on that page that captures the IP and user-agent string. Any hit on that page is a bad spider/agent, and you should probably consider blocking it from your site...
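A minimal sketch of that trap idea in Python; the trap path, log format, and function name are all my own invention, and you would wire this into whatever actually serves the page:

```python
# Hypothetical robots.txt trap: the page below is listed under a
# Disallow rule and linked from nowhere, so any request that reaches
# it came from a spider that ignored robots.txt.
DISALLOWED_TRAP = "/trap/bait.html"  # assumed path, covered by "Disallow: /trap/"

def trap_log_line(path, ip, user_agent):
    """Return a 'IP User-Agent' log entry for hits on the trap page, else None."""
    if path != DISALLOWED_TRAP:
        return None  # ordinary request, nothing to log
    return "%s %s" % (ip, user_agent or "-")
```

Each non-None line is a candidate for your block list: a normal surfer following links never lands on the trap page, so anything that does fetch it got there by crawling a disallowed directory.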
Also, the META tags for robot control are not recognized by a good portion of robots, and robots.txt is your best bet for total control. (In fact, the only reason to use the META tag is for things like Google's cache control and no-archiving; otherwise everyone has to listen to robots.txt.)
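For those cache-control cases, the META tag is a one-liner in the page's head; Google recognizes noarchive as a request to suppress its cached copy:

```html
<meta name="robots" content="noarchive">
```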
And WA, I can send you a robots.txt that will block every single non-browser spider (at least so far) and only let true surfers in. (It also would have been nice back when the old forums were being harvested so heavily by spammers.)