29 APR 2011
Most sites have several pages you want to keep out of the reach of search engines. For example, there is no need to clutter Google’s results pages with your login pages or other private pages. You can easily “tell” spiders which pages to stay away from with a robots.txt file.
When a crawler visits your site, it first looks for a robots.txt file in the root of your domain that tells it which pages to ignore. Such a file is made of one or more records, and each record must start with a User-agent line naming the crawler it addresses, followed by one or more Disallow lines. The syntax is therefore trivial: you don’t need to learn more than these two directives. For example, a robots.txt file made of
User-agent: googlebot
Disallow: /login.php
Disallow: /admin
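would tell Googlebot to crawl neither http://yourdomain.com/login.php nor http://yourdomain.com/admin. Disallow values are matched as path prefixes, so the second line also covers anything beneath /admin, such as http://yourdomain.com/admin/users.php.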
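If you want to check how a parser reads these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name `allowed` and the /index.html and /admin/users.php paths are made up for illustration, and real crawlers may differ in small details from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The example record from above, fed to the parser as a list of lines.
RULES = """\
User-agent: googlebot
Disallow: /login.php
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Hypothetical helper: may `agent` fetch `path` on yourdomain.com?"""
    return parser.can_fetch(agent, "http://yourdomain.com" + path)


print(allowed("googlebot", "/login.php"))        # False
print(allowed("googlebot", "/admin/users.php"))  # False (prefix match on /admin)
print(allowed("googlebot", "/index.html"))       # True
print(allowed("someotherbot", "/login.php"))     # True (record only addresses googlebot)
```

Note that a crawler not named in any record is allowed everywhere, which is why the last check comes back True.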
You can address every crawler at once with a wildcard. Instead of having a User-agent: spidername section for each crawler, you can instruct them all to follow the subsequent Disallow lines by using User-agent: *. Note that the star is a special value here rather than a general pattern: the robots.txt standard defines no partial patterns for user agents, so don’t rely on “spider*” or “spider?” to match “spiderA” or “spiderFromSomeSite”. To target one particular crawler, give its name on its own User-agent line; crawlers compare names without regard to case.
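The catch-all record can be tried out the same way. This is a small sketch with Python's standard urllib.robotparser; the spider names are taken from the text above and /index.html is an assumed page, not anything a real crawler is known to send.

```python
from urllib.robotparser import RobotFileParser

# One record addressed to every crawler via the * value.
CATCH_ALL_RULES = """\
User-agent: *
Disallow: /login.php
Disallow: /admin
"""

catch_all = RobotFileParser()
catch_all.parse(CATCH_ALL_RULES.splitlines())

AGENTS = ("googlebot", "spiderFromSomeSite", "spiderFromThatOtherSite")

for agent in AGENTS:
    # Every crawler falls under the * record, so /admin is off limits for all.
    print(agent, catch_all.can_fetch(agent, "http://yourdomain.com/admin"))

# Pages not listed in the record stay crawlable.
print(catch_all.can_fetch("googlebot", "http://yourdomain.com/index.html"))
```

Each of the three loop lines should report False, and the final check True.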
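Robots.txt parsers can be picky about syntax, so make sure you follow the structure described above: each record starts with a User-agent line, followed by one or more Disallow lines, with one directive per line and a blank line between records.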
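A rough way to catch structural slips before uploading the file is a small checker like the one below. It is only a sketch: the function name `check_robots_txt` is made up, it looks only at the two directives covered in this post, it silently ignores any other field, and it assumes a blank line ends a record.

```python
def check_robots_txt(text):
    """Return a list of (line_number, message) for structural problems."""
    problems = []
    in_record = False  # True once the current record has a User-agent line
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_record = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # comment-only line
        field, sep, value = line.partition(":")
        if not sep:
            problems.append((number, "expected 'Field: value'"))
            continue
        field = field.strip().lower()
        if field == "user-agent":
            if not value.strip():
                problems.append((number, "User-agent needs a name or *"))
            in_record = True
        elif field == "disallow" and not in_record:
            problems.append((number, "Disallow without a preceding User-agent"))
    return problems


GOOD = "User-agent: googlebot\nDisallow: /login.php\nDisallow: /admin\n"
BAD = "Disallow: /admin\nUser-agent googlebot\n"

print(check_robots_txt(GOOD))  # []
for number, message in check_robots_txt(BAD):
    print(number, message)
```

The second file is flagged twice: line 1 has a Disallow before any User-agent line, and line 2 is missing its colon.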
As a final note, remember that you shouldn't use robots.txt to keep spiders away from sensitive information. Just because a standards-compliant web spider won’t access it doesn’t mean a malicious one (or even a human user) can’t or won’t. Such a setting is a no-no:
User-agent: *
Disallow: /admin/passwords.txt
Sensitive information such as usernames and passwords should be placed outside your htdocs path. Just because Google won’t crawl your passwords.txt file doesn’t mean a malicious user won’t open /admin/passwords.txt in a browser and read your “hidden” content. Worse, robots.txt is itself publicly readable, so listing a path there tells everyone exactly where to look.
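To see why this backfires, consider how little effort it takes to pull the "hidden" paths out of a robots.txt file that anyone can download. The sketch below is only an illustration, and the function name `disallowed_paths` is made up for it.

```python
def disallowed_paths(robots_txt):
    """Collect every path named on a Disallow line, for any user agent."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# The "no-no" file from above, exactly as any visitor could fetch it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/passwords.txt
"""

print(disallowed_paths(ROBOTS_TXT))  # ['/admin/passwords.txt']
```

A few lines of code turn the file into a list of places worth probing, which is the opposite of what the Disallow line was meant to achieve.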
Copyright 2024. All Rights Reserved. Lifeline Design