Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
# How it works
It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
```
User-agent: *
Disallow: /
```
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
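This check is what Python's standard `urllib.robotparser` module implements. The sketch below supplies the rules inline rather than fetching them over the network; the bot name `MyBot` is illustrative.

```python
# Minimal sketch of the check a well-behaved crawler performs before
# fetching a page, using Python's standard urllib.robotparser.
from urllib.robotparser import RobotFileParser

# The same rules shown above, supplied inline instead of downloaded.
rules = """\
User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# With "Disallow: /" in place, no URL on the site may be visited.
print(parser.can_fetch("MyBot", "http://www.example.com/welcome.html"))
```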
There are two important considerations when using /robots.txt:

* Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.
* The /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
So don't try to use /robots.txt to hide information.
A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under https://example.com:8080/ or https://example.com/.
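The origin rule means a crawler derives the robots.txt location from the scheme, host, and port of the page it is about to fetch. A minimal sketch, assuming the `robots_url` helper name (not part of any standard library):

```python
# Sketch: derive the robots.txt URL for a page's origin.
# A different scheme, host, or port yields a different robots.txt file.
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    parts = urlsplit(page_url)
    # Keep scheme and netloc (host:port); replace the path with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.com/welcome.html"))       # http://example.com/robots.txt
print(robots_url("https://example.com:8080/welcome.html")) # https://example.com:8080/robots.txt
```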
The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
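The Sitemaps protocol defines a `Sitemap:` directive that may be placed inside robots.txt to point crawlers at a sitemap; the URL below is illustrative:

```
Sitemap: https://www.example.com/sitemap.xml
```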
This example tells all robots that they can visit all files because the wildcard * specifies all robots:
```
User-agent: *
Allow: /
```
The same result can be accomplished with an empty or missing robots.txt file. This example tells all robots to stay out of a website:
```
User-agent: *
Disallow: /
```
This example tells all robots not to enter three directories:
```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
```
This example tells all robots to stay away from one specific file:
```
User-agent: *
Disallow: /directory/file.html
```
Note that all other files in the specified directory will be processed. This example tells a specific robot to stay out of a website:
```
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
```
This example tells two specific robots not to enter one specific directory:
```
User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
```
Example demonstrating how comments can be used:
```
# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out
```
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few sites, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings. Example demonstrating multiple user-agents:
```
User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
```
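The per-agent rules above can be exercised with Python's standard `urllib.robotparser`; the bot name `SomeOtherBot` is illustrative. One caveat: Python's parser applies the first group whose user-agent matches, which can differ from crawlers that pick the most specific matching group, so the checks below stick to unambiguous cases.

```python
# Sketch: exercising the multi-user-agent rules with urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("googlebot", "/private/page.html"))      # False: its own section
print(parser.can_fetch("googlebot", "/public/page.html"))       # True: not disallowed for it
print(parser.can_fetch("SomeOtherBot", "/something/page.html")) # False: falls under *
print(parser.can_fetch("SomeOtherBot", "/index.html"))          # True
```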