Sunday, December 04, 2005

Search Engine Robots and Robots.txt

Many search engines use programs called robots to gather web pages for indexing. These programs are not limited to a pre-defined list of web pages; they can follow links on pages they find, which makes them a form of intelligent agent. The process of following links is called spidering, wandering, or gathering.

Controlling Robot Indexing
Robot spiders cannot index unlinked files, so they will ignore all the miscellaneous files you may have in your web server directory. Webmasters can control which directories robots may visit by editing the robots.txt file, and web page creators can control robot indexing of individual pages using the Robots META tag.
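
As a rough illustration (not taken from any particular search engine), here is a Python sketch of how an indexing robot might honor the Robots META tag before storing a page. The sample HTML and the RobotsMetaParser class are made up for this example.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the directives of any <meta name="robots" ...> tag on a page.
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            for token in (attrs.get("content") or "").lower().split(","):
                self.directives.add(token.strip())

# A made-up page that asks robots not to index it but to follow its links.
sample_html = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
parser = RobotsMetaParser()
parser.feed(sample_html)

print("may index this page:", "noindex" not in parser.directives)
print("may follow its links:", "nofollow" not in parser.directives)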

Following Links
Local search robot spider indexers locate files to index by following links, just like webwide search engine spiders. You specify the starting page, and the indexer will request it from the server and receive it just like a browser does. The indexer stores every word on the page and then follows each link on that page, indexing the linked pages and following each link from those pages in turn.
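
Here is a rough Python sketch of that process: fetch a starting page, store its words, and queue every link found on it. The starting URL is just a placeholder, and real spiders add politeness delays, robots.txt checks and error handling that this sketch leaves out.

import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Records the target of every <a href="..."> on the page,
    # resolved against the page's own URL.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

start_url = "http://www.example.com/"      # placeholder starting page
to_visit, seen, index = [start_url], set(), {}

while to_visit and len(seen) < 25:         # small cap, since this is only a sketch
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    page = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    index[url] = page.split()              # crude "store every word" step
    extractor = LinkExtractor(url)
    extractor.feed(page)
    to_visit.extend(extractor.links)

print("indexed", len(index), "pages")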

Link Problems
Spiders will miss pages which have been accidentally unlinked from any of your starting points, and they will have problems with JavaScript links, just like webwide search engine robots.

Dynamic Elements
Robot spider indexers receive each page exactly as a browser would, with all dynamic data from CGI scripts, SSI (server-side includes), ASP (Active Server Pages) and so on. This is vital to some sites, but other sites may find that the presence of these dynamic elements triggers re-indexing even though none of the actual text of the page has changed.

Most site search engines can handle dynamic URLs (including question marks and other punctuation). However, most webwide search engines will not index these pages: for help building plain URLs, see our page on Generating Simple URLs.

Server Load
Because they use HTTP and must request each page from the server, robot spider indexers can be slower than local file indexers and can put more load on your web server.

Updating Indexes
To update the index, some robot spiders query the web server about the status of each linked page by asking for the HTTP header with a "HEAD" request (the usual request for an HTML page is a "GET"). For HEAD requests, the server may be able to send the page header information from an internal cache, without opening and reading the entire file, so the interaction can be much more efficient. The indexer then compares the modified date from the header with its own date for the last time the index was updated. If the page has not changed, it doesn't have to update the index. If the page has changed, or if it is new and has not yet been indexed, the robot spider sends a GET request for the entire page and stores every word. An alternative is for robot spiders to send an "If-Modified-Since" request: this HTTP/1.1 header option lets the web server send back a short "not modified" code if the page has not changed, and the entire page if it has.
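
A rough Python sketch of both approaches, with a placeholder URL and a made-up last-indexed date:

import urllib.request
from urllib.error import HTTPError

url = "http://www.example.com/page.html"          # placeholder page
last_indexed = "Sat, 03 Dec 2005 12:00:00 GMT"    # date from the robot's own index

# 1. HEAD request: fetch only the headers and compare Last-Modified ourselves.
head_req = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(head_req) as resp:
    last_modified = resp.headers.get("Last-Modified")
print("Last-Modified reported by server:", last_modified)

# 2. Conditional GET: let the server decide. It answers 304 Not Modified
#    if the page is unchanged, or sends the whole page if it changed.
get_req = urllib.request.Request(url, headers={"If-Modified-Since": last_indexed})
try:
    with urllib.request.urlopen(get_req) as resp:
        body = resp.read()                        # page changed: re-index it
        print("changed, re-indexing", len(body), "bytes")
except HTTPError as err:
    if err.code == 304:
        print("not modified, keeping the existing index entry")
    else:
        raise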

Duplicate Files
Robots must contain special code to check for duplicate pages, due to server mirroring, alternate default page names, mistakes in relative file naming (./ instead of ../, for example), and so on. Some search indexers have powerful algorithms to identify these duplicates and only store and search one copy.
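
One simple way to do this is to fingerprint each page's text and keep only the first copy of each fingerprint; real engines use much cleverer near-duplicate detection than this. A rough Python sketch, with made-up page texts:

import hashlib

pages = {
    "http://www.example.com/index.html": "Welcome to our site",
    "http://www.example.com/":           "Welcome to our site",   # same page, default name
    "http://www.example.com/about.html": "About this site",
}

stored = {}                      # fingerprint -> URL of the copy we keep
for url, text in pages.items():
    fingerprint = hashlib.sha1(" ".join(text.lower().split()).encode()).hexdigest()
    if fingerprint in stored:
        print(url, "is a duplicate of", stored[fingerprint])
    else:
        stored[fingerprint] = url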

Search engine robots will check a special file in the root of each server called robots.txt, which is, as you may guess, a plain text file (not HTML). Robots.txt implements the Robots Exclusion Protocol, which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can disallow access to cgi, private and temporary directories, for example, because they do not want pages in those areas indexed.

The syntax of this file is obscure to most of us: it tells robots not to look at pages which have certain paths in their URLs. Each section includes the name of the user agent (robot) and the paths it may not follow. There is no way to allow a specific directory, or to specify a kind of file. You should remember that robots may access any directory path in a URL which is not explicitly disallowed in this file: everything not forbidden is OK.

This is all documented in the Standard for Robot Exclusion, and all robots should recognize and honor the rules in the robots.txt file.
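
For example, here is a rough Python sketch of how a well-behaved robot could check robots.txt before fetching a URL, using the standard library's urllib.robotparser; the site and robot name are placeholders.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                     # fetch and parse the file

for path in ("/index.html", "/private/notes.html", "/cgi-bin/search"):
    url = "http://www.example.com" + path
    if rp.can_fetch("MyBot", url):
        print("allowed:   ", url)
    else:
        print("disallowed:", url)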

Example Entries and What They Mean
User-agent: *
Disallow:

The asterisk (*) in the User-agent field is shorthand for "all robots". Because nothing is disallowed, everything is allowed.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

In this example, all robots can visit every directory except the three mentioned.

User-agent: BadBot
Disallow: /

In this case, the BadBot robot is not allowed to see anything. The slash is shorthand for "all directories".

The User-agent value can be any unique substring of the robot's name, and robots are not supposed to care about capitalization.

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/

The blank line indicates a new "record" - a new set of rules for another user agent.

BadBot should just go away. All other robots can see everything except the "private" folder.

User-agent: WeirdBot
Disallow: /tmp/
Disallow: /private/
Disallow: /links/listing.html

User-agent: *
Disallow: /tmp/
Disallow: /private/

This keeps the WeirdBot from visiting the listing page in the links directory, the tmp directory and the private directory.

All other robots can see everything except the tmp and private directories.

If you think this is inefficient, you're right!

Bad Examples - Common Wrong Entries
Use one of the robots.txt checkers to see if your file is malformed.
User-agent: *
Disallow /

NO! This entry is missing the colon after Disallow.
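
If you want a feel for what a checker does, here is a very rough Python sketch that only catches missing colons and unknown field names; a real checker does far more. The sample text is the broken entry above.

valid_fields = {"user-agent", "disallow"}

sample = """User-agent: *
Disallow /
"""                                        # the broken entry from the example above

for number, line in enumerate(sample.splitlines(), start=1):
    line = line.strip()
    if not line or line.startswith("#"):
        continue
    field, colon, value = line.partition(":")
    if not colon:
        print("line", number, "is missing its colon:", line)
    elif field.strip().lower() not in valid_fields:
        print("line", number, "uses an unknown field:", field.strip())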

User-agent: *
Disallow: *

NO! If you want to disallow everything, use a slash (indicating the root directory).

User-agent: sidewiner
Disallow: /tmp/

NO! Robots will ignore misspelled User Agent names. Check your server logs and the listings of User Agent names.

User-agent: *
Disallow: /tmp/

User-agent: Weirdbot
Disallow: /links/listing.html

Disallow: /tmp/

NO! Robots read from top to bottom and stop when they reach something that applies to them. So Weirdbot would stop at the first record, *, instead of seeing its special entry.

Thanks to Enrico Altavilla for pointing out this problem in my own robots.txt file!
