Google has introduced a robots.txt generator tool on its Google Webmaster Tools page. This tool automatically generates a robots.txt file, which tells search engine robots which parts of a web site should not be crawled. By default, crawlers (also called spiders) will crawl any area of a website that is not explicitly disallowed in the robots.txt file.
The basic format of the file's content looks like this:
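Each record is a User-agent line paired with one or more Disallow lines; the bracketed values below are placeholders, not literal text:

```
User-agent: [robot name]
Disallow: [path not to crawl]
```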
where User-agent is the name of the robot and Disallow is the path of the file or folder that should not be crawled. By specifying,
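for example, the standard catch-all rule:

```
User-agent: *
Disallow: /
```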
we tell all search engines (“*” is the wildcard matching every robot) to skip all content (“/” with no folder name after it means every folder and file). On the other hand, the following content
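shown here with an assumed “/print/” path at the site root:

```
User-agent: Googlebot
Disallow: /print/
```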
specifically tells Googlebot, Google’s spider, to skip the entire “print” folder when crawling for page content.
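You can check how such rules are evaluated with Python's standard `urllib.robotparser` module. This is a small sketch; the `/print/` path and `example.com` URLs are illustrative, not from the original article:

```python
from urllib.robotparser import RobotFileParser

# Parse the Googlebot rule discussed above directly from memory,
# rather than fetching a live robots.txt over HTTP.
rules = [
    "User-agent: Googlebot",
    "Disallow: /print/",
]

parser = RobotFileParser()
parser.parse(rules)

# Googlebot may not fetch anything under /print/ ...
print(parser.can_fetch("Googlebot", "http://example.com/print/page.html"))  # False
# ... but pages outside that folder are allowed.
print(parser.can_fetch("Googlebot", "http://example.com/index.html"))  # True
```

The same `can_fetch` call with a different user agent string would return True for both URLs, since the rule above only applies to Googlebot.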
While these directives are widely understood, there are two things to bear in mind (as noted on the Google Webmasters Blog):
* Not every search engine will support every extension to robots.txt files
The Robots.txt Generator creates files that Googlebot will understand, and most other major robots will understand them too. But it’s possible that some robots won’t understand all of the robots.txt features that the generator uses.
* Robots.txt is simply a request
Although it’s highly unlikely from a major search engine, there are some unscrupulous robots that may ignore the contents of robots.txt and crawl blocked areas anyway. If you have sensitive content that you need to protect completely, you should put it behind password protection rather than relying on robots.txt.