robots.txt

The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with Web Crawlers and other web robots - wikipedia

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

For a database of Web Crawlers see: robotstxt.org

# How it works

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for:

http://www.example.com/robots.txt

and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

* robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.
* the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

So don't try to use /robots.txt to hide information.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under https://example.com:8080/ or https://example.com/.
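A small sketch of what "per origin" means in practice, using Python's urllib.parse: the robots.txt location is derived from the scheme, host, and port of the page being crawled, and nothing else (the helper name robots_url is just for illustration):

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # keep only scheme and host:port; every origin gets its own /robots.txt
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.com/welcome.html"))        # http://example.com/robots.txt
print(robots_url("https://example.com:8080/welcome.html"))  # https://example.com:8080/robots.txt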

The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
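For example, the Sitemap directive, an extension supported by major search engines rather than part of the original standard, lets a robots.txt file point crawlers at a sitemap; the URL here is illustrative:

Sitemap: http://example.com/sitemap.xml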

# Examples

This example tells all robots that they can visit all files: the wildcard * applies the rule to all robots, and Allow: / permits every path:

User-agent: *
Allow: /

The same result can be accomplished with an empty or missing robots.txt file. This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html
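The effect of this rule can be checked with urllib.robotparser as well; here the file is parsed from a list of lines rather than fetched, and the crawler name is again illustrative:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /directory/file.html",
])
print(rp.can_fetch("MyCrawler", "/directory/file.html"))   # False: the named file is blocked
print(rp.can_fetch("MyCrawler", "/directory/other.html"))  # True: everything else is allowed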

Note that all other files in the specified directory will be processed. This example tells a specific robot to stay out of a website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /

This example tells two specific robots not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out

It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few crawler operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings. Example demonstrating multiple user-agents:

User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory

User-agent: googlebot-news # only the news service
Disallow: / # disallow everything

User-agent: * # any robot
Disallow: /something/ # disallow this directory
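A crawler that honours such a file first has to decide which group applies to it. The sketch below implements one plausible rule, the "most specific matching user-agent wins" behaviour that Google documents for its own crawlers; the parsing is deliberately simplified and the function names are made up for illustration:

ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /private/

User-agent: googlebot-news
Disallow: /

User-agent: *
Disallow: /something/
"""

def parse_groups(text):
    """Split robots.txt into (user-agents, rules) groups."""
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:  # a new group starts once rules have been seen
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups

def rules_for(agent, groups):
    """Pick the most specific matching group, falling back to '*'."""
    agent = agent.lower()
    best, best_len = None, -1
    for agents, rules in groups:
        for name in agents:
            if name != "*" and name in agent and len(name) > best_len:
                best, best_len = rules, len(name)
    if best is not None:
        return best
    return next((r for a, r in groups if "*" in a), [])

groups = parse_groups(ROBOTS_TXT)
print(rules_for("Googlebot-News", groups))  # [('disallow', '/')]
print(rules_for("Googlebot", groups))       # [('disallow', '/private/')]
print(rules_for("SomeOtherBot", groups))    # [('disallow', '/something/')]

The exact matching rule varies between implementations, which is why the actual robot string, and how specific to be, is ultimately defined by each crawler.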