WHAT IS ROBOTS.TXT?
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. It specifies how to inform a web robot about which areas of the website should not be processed or scanned.
The robots.txt file is a simple text file that is placed in your website’s root directory in order to tell the search engines which pages to index and which to skip.
Many webmasters utilize this file to help the search engines index the content of their websites.
This document details how Google handles the robots.txt file, which allows you to control how Google's website crawlers crawl and index publicly accessible websites.
Crawler: A crawler is a service or agent that crawls websites. Generally speaking, a crawler automatically and recursively accesses known URLs of a host that exposes content which can be accessed with standard web-browsers. As new URLs are found (through various means, such as from links on existing, crawled pages or from Sitemap files), these are also crawled in the same way.
User-agent: a means of identifying a specific crawler or set of crawlers.
Directives: the list of applicable guidelines for a crawler or group of crawlers set forth in the robots.txt file.
URL: Uniform Resource Locators as defined in RFC 1738.
Google-specific: These elements are specific to Google’s implementation of robots.txt and may not be relevant for other parties.
If webmasters can tell the search engine spiders to skip pages that they do not consider important enough to be crawled (e.g., printable versions of pages, .pdf files, etc.), then they have a better opportunity to have their most valuable pages featured in the search engine results pages.
There is a simple instruction that restricts all search engine spiders from crawling the entire site:
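In its standard form, that instruction is a two-line robots.txt file placed at the site root; the wildcard user-agent matches all compliant crawlers:

```
User-agent: *
Disallow: /
```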
Without the forward slash in the Disallow directive, search engines are granted access to the entire site, because an empty Disallow value blocks nothing. So the inclusion of this one character in the robots.txt can prevent a website from showing in the search engines.
This also makes the site easier for search engine spiders to crawl. Of course, even though this is a small aspect of the search engine optimization process, a correctly configured robots.txt file can be a significant benefit.
The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, malware analysis, etc.), these guidelines need not apply.
File location & range of validity
The robots.txt file must be in the top-level directory of the host, accessible through the appropriate protocol and port number. Generally accepted protocols for robots.txt (and crawling of websites) are "http" and "https". On http and https, the robots.txt file is fetched using an unconditional HTTP GET request.
Google-specific: Google also accepts and follows robots.txt files for FTP sites. FTP-based robots.txt files are accessed via the FTP protocol, using an anonymous login.
The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.
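As an illustration (using the placeholder domain example.com), a robots.txt file fetched from one origin does not cover any other protocol, host, or port:

```
http://example.com/robots.txt
  valid for:     http://example.com/any/path
  not valid for: https://example.com/      (different protocol)
                 http://sub.example.com/   (different host)
                 http://example.com:8181/  (different port)
```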
There are generally three different outcomes when robots.txt files are fetched:
- full allow: All content may be crawled.
- full disallow: No content may be crawled.
- conditional allow: The directives in the robots.txt determine the ability to crawl certain content.
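These outcomes can be checked programmatically. As an illustrative sketch (not part of the standard itself), Python's standard-library urllib.robotparser evaluates directives the way a well-behaved crawler would; the inline rules below form a hypothetical "conditional allow" that blocks only one directory:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse an inline example instead of fetching over the network.
# This is a "conditional allow": only /private/ is blocked for all crawlers.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "/public/page.html"))   # allowed -> True
print(rp.can_fetch("*", "/private/data.html"))  # blocked -> False
```

A "full allow" corresponds to an empty Disallow value, and a "full disallow" to `Disallow: /`, which the same parser would report accordingly.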
File format
Each record in the file consists of a <field>, a colon, and a <value>. The <field> element is case-insensitive. The <value> element may be case-sensitive, depending on the <field> element.
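For example (paths shown are hypothetical), the field name may be written in any case, but a path value is matched case-sensitively:

```
user-agent: Googlebot   # <field> is case-insensitive: same as "User-Agent:"
Disallow: /Private/     # <value> is a path here, matched case-sensitively:
                        # /Private/ does not block /private/
```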
Handling of <field> elements with simple errors or typos (e.g., "useragent" instead of "user-agent") is undefined and may be interpreted as correct directives by some user-agents.
A maximum file size may be enforced per crawler, and content beyond that limit may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).