Robots Exclusion Standard: robots.txt - Close off irrelevant pages

What is robots.txt?

The robots.txt file on the web server (Robots Exclusion Standard) tells web crawlers such as Googlebot which parts of the website they should not access. Before crawling a website, search engine crawlers check whether a robots.txt file is present.
You need a robots.txt file only if you want to keep some content away from search engines. If you want everything on your site to be indexed, you don't need a robots.txt file (not even an empty one).
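
The check a crawler performs before fetching a page can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the rules, the /private/ folder and the ExampleBot name are assumptions made up for the example, not something a particular search engine is known to use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real crawler would download them
# from the site's /robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler name.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/a.html"))
```

A well-behaved crawler would skip the second URL, because its path starts with the disallowed /private/ prefix.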

According to robotstxt.org there are more than 300 search engine robots on the net. See the whole list: http://www.robotstxt.org/db.html

Where should I put robots.txt?

The robots.txt file is placed in the root web directory of the site, which is where crawlers look for it. It is accessed as: http://example.com/robots.txt
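
Because the file sits at the root, its address can be derived from any page URL on the same site. A minimal sketch in Python, where robots_url is a hypothetical helper name chosen for this example:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://example.com/blog/post.html?id=1"))
```

Note that each host has its own file: a page on a subdomain would resolve to the robots.txt of that subdomain, not of the main domain.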

Similar function

If you don't have access to the root of your domain, you can restrict indexing of a single page with the robots meta tag. The robots and googlebot meta tags are placed in the head section of the page:
  • <meta name="robots" content="..., ...">
  • <meta name="googlebot" content="..., ...">
These meta tags tell search engine indexers how to treat the page, using values such as index, follow, nosnippet and noimageindex. (The googlebot meta tag is read only by Googlebot.) See Meta tags used for SEO.
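
To see which directives a page actually sends, the meta tags can be read back with Python's built-in HTML parser. The sketch below is an assumption-laden illustration: RobotsMetaParser and may_index are hypothetical names, and only the noindex value is evaluated.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in robots and googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lower-cased directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            values = {d.strip().lower() for d in attr.get("content", "").split(",")}
            values.discard("")
            self.directives.setdefault(name, set()).update(values)


def may_index(html: str, crawler: str = "robots") -> bool:
    """Return False if a noindex directive applies to the given crawler."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # The generic robots tag applies to every crawler; a crawler-specific
    # tag (for example googlebot) applies only to that crawler.
    found = parser.directives.get("robots", set()) | parser.directives.get(crawler, set())
    return "noindex" not in found


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(may_index(page))
```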

Exception

Google can still index URLs blocked by robots.txt if they are linked from other pages on the web. See: https://support.google.com/webmasters/answer/156449?rd=1 The only reliable way to keep such a URL out of the index is the robots noindex meta tag. The page must then remain crawlable, because a crawler that is blocked by robots.txt never sees the tag.

Check your robots.txt with analyzemysites.com

Allow indexing of any page in robots.txt

User-agent: *
Disallow:

Allow indexing of any page in robots.txt (alternative form using Allow)

User-agent: *
Allow: /
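
Both forms should behave the same way. A quick way to convince yourself is to feed each one to Python's urllib.robotparser; the allows helper and the ExampleBot name below are assumptions made for this sketch.

```python
from urllib.robotparser import RobotFileParser

# The two "allow everything" variants shown above.
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"
ALLOW_ROOT = "User-agent: *\nAllow: /\n"


def allows(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/any/page.html"
    print(allows(EMPTY_DISALLOW, url), allows(ALLOW_ROOT, url))
```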

robots.txt - Special rules for different crawlers

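# Rules for Google's crawler. Disallow matches by path prefix, so a
# trailing * is not needed. Googlebot ignores Crawl-delay.
User-agent: Googlebot
Disallow: /specificfolder/
Crawl-delay: 1

# Rules for Yahoo's crawler (Slurp).
User-agent: Slurp
Disallow: /otherfolder/
Crawl-delay: 1

# The Sitemap line is independent of the user-agent groups above.
Sitemap: http://sourceforge.net/sitemap.xml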
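
The effect of per-crawler groups can be checked with Python's urllib.robotparser. This is a sketch under a few assumptions: the rules are written with plain prefix paths, since the standard-library parser matches path prefixes and is not expected to expand * wildcards; OtherBot is a made-up crawler name; and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Same rules as above, written with plain prefix paths (assumption: the
# standard-library parser does not interpret * inside a path).
RULES = """\
User-agent: Googlebot
Disallow: /specificfolder/
Crawl-delay: 1

User-agent: Slurp
Disallow: /otherfolder/
Crawl-delay: 1

Sitemap: http://sourceforge.net/sitemap.xml
"""


def load(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = load(RULES)
    page = "http://example.com/specificfolder/a.html"
    for agent in ("Googlebot", "Slurp", "OtherBot"):
        print(agent, rules.can_fetch(agent, page), rules.crawl_delay(agent))
    print(rules.site_maps())
```

Each crawler only obeys the group that names it, so a crawler with no matching group and no `User-agent: *` fallback is unrestricted.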

Analyze meta information on your site