
Robots.txt file


The robots.txt file is a plain text document encoded in UTF-8, valid for the HTTP, HTTPS, and FTP protocols. The file gives recommendations to search engines about which pages and files are worth crawling. If the file contains characters in an encoding other than UTF-8, search robots may process them incorrectly. The rules listed in robots.txt apply only to the host, protocol, and port number where the file is located.

The file should be located in the root directory as a plain text document and be available at https://site.com/robots.txt. The robots.txt file provides important information for the search robots that crawl the Internet: search robots always check this file before they go through the pages of your site.

This allows them to crawl the site more efficiently, since you help the robots start indexing the really important information on your site right away (provided that you have configured robots.txt correctly).

However, both the directives in robots.txt and the noindex instruction in the robots meta tag are only recommendations for robots, so they do not guarantee that the closed pages will stay out of the index.

If you really need to keep part of the site out of the index, you can additionally protect those directories with a password, for example.

 

Basic syntax

User-agent: the robot to which the following rules apply (for example, “Googlebot”)

Disallow: the pages to which you want to close access (you can specify a long list of such directives, each on a new line)

Each User-agent / Disallow group should be separated from the next by a blank line, but no blank lines should appear inside a group (between the User-agent line and the last Disallow directive).
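For illustration, a minimal sketch of two correctly separated groups (the paths here are placeholders):

User-agent: Googlebot
Disallow: /tmp/

User-agent: *
Disallow: /private/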

The hash symbol (#) can be used for comments in the robots.txt file: on any line, everything after # is ignored. A comment can occupy an entire line or follow a directive at the end of a line.
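For example (a hypothetical snippet, where /search/ is just a placeholder path):

# block the internal search results
User-agent: *
Disallow: /search/ # everything after the hash is ignored here as well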

Directories and file names are case-sensitive: “catalog”, “Catalog” and “CATALOG” are all different directories for search engines.
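For instance, in the sketch below only the lowercase directory is closed; /Catalog/ and /CATALOG/ remain open:

User-agent: *
Disallow: /catalog/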

Host: used to indicate the main site mirror to the search engine. Therefore, if you want to merge two sites and set up a page-by-page 301 redirect, you DO NOT need to redirect the robots.txt file itself (on the duplicate site), so that the search engine can still see this directive on the site that is being merged.
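As an illustration only (example.com stands in for the main mirror, and not every search engine supports the Host directive), the file on the duplicate site could look like this:

User-agent: *
Disallow:

Host: example.com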

Crawl-delay: used to limit the crawl rate of your site; if your site has very high traffic, the additional load from various search robots can cause problems for the server.
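A sketch of such a limit, assuming a 10-second delay is acceptable for your server (note that some search engines ignore Crawl-delay):

User-agent: *
Crawl-delay: 10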

Regular expressions: for more flexible customization of your directives, you can use two special characters (see the sketch after this list):

1. * (asterisk) – matches any sequence of characters

2. $ (dollar sign) – marks the end of the URL.
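For illustration, a hypothetical set of rules combining both characters (the paths are placeholders):

User-agent: *
Disallow: /private*/ # any directory whose name begins with “private”
Disallow: /*.xls$ # any URL that ends in .xls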

 

Basic examples of using robots.txt

Prevent indexing of the entire site

User-agent: *

Disallow: /

This instruction is important to use when you are developing a new site and sharing access to it, for example, through a subdomain.

Very often, developers forget to close the site from indexing in this way and immediately end up with a full copy of the site in the search engines’ index. If this does happen, you need to set up a page-by-page 301 redirect to your main domain.

And this construction, on the contrary, ALLOWS the entire site to be indexed:

User-agent: *

Disallow:

Prevent indexing of a specific folder

User-agent: Googlebot

Disallow: /no-index/

Prevent a particular robot from visiting a specific page

User-agent: Googlebot

Disallow: /no-index/this-page.html

Prevent indexing of certain file types

User-agent: *

Disallow: /*.pdf$

Allow a specific search bot to visit a specific page

User-agent: *

Disallow: /no-bots/block-all-bots-except-google-page.html

User-agent: Googlebot

Allow: /no-bots/block-all-bots-except-google-page.html

Sitemap link

User-agent: *

Disallow:

Sitemap: http://www.example.com/none-standard-location/sitemap.xml

A nuance of using this directive: if your site constantly adds unique content, then it is better to

1. NOT add a link to your sitemap to robots.txt,

2. give the sitemap itself a CUSTOM name instead of sitemap.xml (for example, my-new-sitemap.xml) and then add this link through the search engines’ webmaster tools,

since there are a lot of unscrupulous webmasters who parse content from other sites and use it for their own projects.

Which is better to use: robots.txt or noindex?

If you want the page not to get into the index, it is better to use noindex in the robots meta tag. To do this, add the following meta tag to the <head> section of the page:

<meta name="robots" content="noindex, follow">

This will allow you to

1. remove the page from the index the next time the search robot visits it (and you will not need to manually delete this page via the webmaster tools),

2. pass on the page’s link weight.

With robots.txt, it is best to close from indexing (see the sketch after this list):

1. the site admin panel

2. the site search results

3. the login / registration / password recovery pages
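A minimal sketch of such rules, assuming typical paths like /admin/, /search/ and /login/ (your CMS may use different ones):

User-agent: *
Disallow: /admin/ # site admin panel
Disallow: /search/ # site search results
Disallow: /login/ # login / registration / password recovery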

How and with what should you check the robots.txt file?

After you have finally created the robots.txt file, you need to check it for errors. To do this, you can use the tools from the search engines:

Google Webmasters: log in to an account with the current site confirmed, then go to Crawl -> robots.txt testing tool.

In this tool you can:

1. immediately see all your mistakes and potential problems,

2. make all the edits right in this tool and immediately recheck for errors, so that you can then transfer the finished file to your site,

3. check whether you have closed all the pages that should not be indexed and whether all the necessary pages are open.

Finally

1. Creating and configuring robots.txt is among the first steps of the internal optimization of a site and the start of search promotion.

2. It is important to set it up correctly, so that the necessary pages and sections are accessible for indexing by search engines and the unneeded ones are closed.