Robots.txt
For web pages to be indexed, search engines rely on programs called robots, which scour the net for new or changed pages to add to the engine's index. When a robot crawls your site, the first thing it looks for is a text file named robots.txt.
This file lets you make polite requests to search engine robots: well-behaved robots follow them, but nothing forces a robot to obey. You can tell them whether they may access your site and whether they may index all of it or only specific pages. For this, there are two commands: User-agent and Disallow.
---
User-agent specifies which robots the rules that follow apply to. It can take several forms:
User-agent: * – The rules apply to all robots.
User-agent: robot – The rules apply only to the named robot.
Disallow declares the pages that you do not want the engine to index. It can be used like this:
Disallow: /dir/ – A directory will not be indexed.
Disallow: /page.html – Only page.html will not be indexed.
You can use several Disallow commands one after another. Each must be placed on its own line. The robots.txt file must then be placed at the root of your site.
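To see how a robot might interpret these rules, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser module and asks which paths may be fetched. The robot name "examplebot", the host www.yoursite.com and the tested paths are placeholders chosen for this example, and real search engines may apply extra matching rules of their own.

```python
from urllib.robotparser import RobotFileParser

# Example rules: every robot is asked to stay out of /dir/ and /page.html.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dir/
Disallow: /page.html
"""

# Parse the rules from memory instead of downloading them from a site.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, agent="examplebot"):
    """Return True if the (hypothetical) robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.yoursite.com" + path)


if __name__ == "__main__":
    for path in ("/", "/dir/file.html", "/page.html", "/other.html"):
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

Each Disallow value is treated as a path prefix here, which is why everything under /dir/ is refused while /other.html stays reachable.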
Google provides a tool in its Webmaster Tools that generates a robots.txt file automatically.
Define a Sitemap
You can also use the Sitemap command to give the absolute address of an XML sitemap file on your site. For example:
Sitemap: http://www.yoursite.com/sitemap.xml
The sitemap is a simple text file in XML format that lists all the links on your site so that Google can reach them more easily.
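A minimal sitemap of this kind can be produced with a few lines of Python. The sketch below assumes the standard sitemaps.org namespace and writes only the required loc element for each page. The function name build_sitemap and the page URLs are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder pages; replace them with the real pages of your site.
    pages = [
        "http://www.yoursite.com/",
        "http://www.yoursite.com/page.html",
    ]
    with open("sitemap.xml", "w", encoding="utf-8") as handle:
        handle.write(build_sitemap(pages))
```

The resulting sitemap.xml would then be uploaded to the address given in the Sitemap line of robots.txt.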
In Webmaster Tools, you can also use the feature that checks that your robots.txt is valid and that the sitemap is detected. It also shows the number of URLs added to the engine.
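A rough local approximation of the "sitemap is detected" check can be done before uploading the file. The sketch below is not how Webmaster Tools works internally; it only uses the site_maps() method of Python's urllib.robotparser (available from Python 3.8) to list the Sitemap lines found in a robots.txt text. The helper name declared_sitemaps and the sample content are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the list of sitemap URLs declared in a robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /dir/\n"
        "Sitemap: http://www.yoursite.com/sitemap.xml\n"
    )
    found = declared_sitemaps(sample)
    print("Sitemap detected:" if found else "No sitemap declared", found)
```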
It is better to add a human-readable sitemap of the site too, preferably in the footer. The reason is simple: as a human, you cannot easily read the XML sitemap, but you can easily read and search a human-readable version of it.
To summarize
- The robots.txt file is the first thing a robot reads when it tries to browse your web pages.
- User-agent specifies the robot concerned, and Disallow lists the directories and pages that should not be indexed.
- The Sitemap command specifies the location of the file that lists all the links on your site.
