Delving deeper into the usefulness of robots.txt

Intro

The concept of the robots.txt file has been around ever since search engines started crawling the web. The basic idea is to give crawlers directives about what they can and cannot do on your site.

Continue below to see the common implementations and some more advanced techniques. 

Some points to note

  • robots.txt files are not required.
  • The two access instructions in a robots.txt file are Allow and Disallow.
  • In 2019, Google stopped supporting the noindex directive in robots.txt because it was never an officially documented rule.
  • Rules are case sensitive (see the example after this list).
  • A Disallow rule does not guarantee a page will stay out of search results; a blocked URL can still be indexed if other sites link to it.
  • Check whether a page has backlinks before adding a Disallow rule.
  • If a path rule does not start with a leading slash or a wildcard (*), it will be ignored.
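
To illustrate the case-sensitivity point: in the hypothetical rules below, domain.com/adminarea/ is blocked, but domain.com/AdminArea/ is treated as a different path and remains crawlable.

User-agent:  *
# Blocks /adminarea/ but NOT /AdminArea/
Disallow:  /adminarea/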

robots.txt can assist with the following

  • Allow specific bots to access your content or site
  • Block specific directories
  • Block specific wildcard directories or content structures
  • Block content which cannot be moderated (user generated)
  • Block private areas of a website (login area or checkout process)
  • Include sitemap.xml

Implementation

Typically the robots.txt file lives in the root of the project and must be accessible to crawlers/bots at the root of the domain (a subdomain needs its own file):

domain.com/robots.txt
subdomain.domain.com/robots.txt

The file must be a plain text file.

An example of a robots.txt

Below is a very basic setup that allows all user agents to crawl your website, except for the listed directories, which are set to Disallow and will therefore not be crawled. At the end we place a link to the sitemap to inform crawlers of its location.

User-agent:  *

Disallow:  /cgi-bin/
Disallow:  /adminarea/
Disallow:  /feed/
Disallow:  /index.php/
Disallow:  /test/

Sitemap:  https://domain.com/sitemap.xml
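
To sanity-check a file like the one above, Python's built-in urllib.robotparser module can fetch it and answer whether a given URL may be crawled. Below is a minimal sketch; the domain.com URLs are placeholders, and note that this parser follows the original robots.txt specification, so it may not honour the wildcard rules described later in this article.

from urllib.robotparser import RobotFileParser

# Point the parser at the live robots.txt (placeholder domain)
rp = RobotFileParser()
rp.set_url("https://domain.com/robots.txt")
rp.read()  # fetches and parses the file

# Ask whether a given user agent may fetch specific URLs
print(rp.can_fetch("*", "https://domain.com/adminarea/settings"))  # expected: False
print(rp.can_fetch("*", "https://domain.com/blog/some-post"))      # expected: True

# The Sitemap entries are also exposed (Python 3.8+)
print(rp.site_maps())  # e.g. ['https://domain.com/sitemap.xml']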

robots.txt directives

The following are the directives you can use in the robots.txt file. There is generally no particular order required when adding rules; however, Google publishes a list of example rules that do and do not work, so take a look at their documentation on valid rulesets if anything does not behave as expected.

  1. User-agent
  2. Disallow
  3. Allow
  4. Crawl delay
  5. Sitemap
  6. Comments

1. User-agent

All web crawlers identify themselves with a User-agent name, and each crawler has its own unique name. When first setting up a robots.txt file you will most likely set this value to a wildcard (*), which applies the rules to every crawler.

In the example below, we allow all user-agents to access the site, but disallow two folders from being crawled.

User-agent:  *
Disallow:  /cgi-bin/
Disallow:  /adminarea/

In the next example, we specifically tell a Google bot (AdsBot-Google) to ignore a certain folder and disallow SemrushBot from crawling the site entirely. The directives would look like this.

User-agent:  AdsBot-Google 
Disallow:  /checkout

User-agent:  SemrushBot
Disallow:  /

2. Disallow

This rule tells crawlers which paths they are not allowed to crawl. This example blocks crawling of two specific directories.

User-agent:  *
# Disallow access to these two directories
Disallow:  /cgi-bin/
Disallow:  /adminarea/

The next example blocks all crawling of the website.

# Used for development purposes so the website is not crawled during dev
User-agent:  *
Disallow:  /

3. Allow

The Allow rule tells bots which URLs they are allowed to access. These can be folders and/or files, and the rule is most useful for making an exception inside a disallowed directory, as below. When Allow and Disallow rules conflict, Google applies the most specific (longest) matching rule.

# Allow a specific file to be crawled
User-agent:  *
Disallow: /adminarea/
Allow: /adminarea/myfile.php

4. Crawl delay

Crawl-delay asks bots to slow down the rate at which they crawl your website. A few bots ignore this rule; Googlebot, for example, does not support Crawl-delay.

# Crawl Delay in seconds
User-agent: BingBot
Disallow: /adminarea/
Crawl-delay: 5

5. Sitemap

The Sitemap rule tells search engines where to find your sitemap.xml, as shown in the first example above. Add this directive at the end of the robots.txt file.
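
For reference, a sitemap declaration on its own looks like the snippet below. Multiple Sitemap lines are allowed if you have more than one sitemap file (the second file name here is purely illustrative).

Sitemap:  https://domain.com/sitemap.xml
Sitemap:  https://domain.com/sitemap-news.xml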

6. Comments

Comments in a robots.txt file are for information purposes only, for example so you can remember why you added a certain rule. Simply start the line with a hash (#) and write your comment; crawlers ignore these lines.

Example: Comment

# Used for development purposes so the website is not crawled during dev
User-agent:  *
Disallow:  /

Complex rules for robots.txt

In the robots.txt file you can also use two pattern-matching characters (a limited form of regular expressions) to disallow or allow multiple URLs or files at once.

  • Dollar sign ($), which matches the end of a URL
  • Asterisk (*), a wildcard that matches any sequence of characters

Below are a few examples. (Each Disallow line would sit under a User-agent line, as in the earlier examples.)

# Block any URL which contains /content/ as a path segment
Disallow: */content/*

# Block URLs where /tasks/ is immediately followed by /content/
Disallow: /tasks/content/

# Block URLs with at least two path segments between /tasks/ and /content/
Disallow: /tasks/*/*/content/

# Block URLs with at least three path segments between /tasks/ and /content/, where the URL ends in content
# eg: domain.com/tasks/d/bb/849/ideaabc/content
Disallow: */tasks/*/*/*/content$

# Prevent bots from crawling Order info or Search info
Disallow: *?ordernumber=*
Disallow: *?searchquery=*
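
If you want to check how these wildcard patterns behave before deploying them, the short Python sketch below translates a Disallow rule into a regular expression using the matching behaviour described above: an asterisk matches any sequence of characters, a trailing dollar sign anchors the rule to the end of the URL, and rules are matched from the start of the path. The rule list and test paths are modelled on the examples in this article; this is an illustrative sketch, not a full robots.txt parser.

import re

# Disallow patterns modelled on the examples above
DISALLOW_RULES = [
    "*/content/*",
    "/tasks/content/",
    "*/tasks/*/*/*/content$",
    "*?ordernumber=*",
]

def rule_to_regex(rule):
    """Translate a robots.txt path rule into a compiled regex.
    '*' matches any sequence of characters and a trailing '$' anchors
    the rule to the end of the URL; everything else is literal."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.compile(pattern + ("$" if anchored else ""))

def is_disallowed(path, rules=DISALLOW_RULES):
    """Return True if any Disallow rule matches the given URL path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

print(is_disallowed("/tasks/d/bb/849/ideaabc/content"))  # True  (matches the $-anchored rule)
print(is_disallowed("/results?ordernumber=123"))         # True  (matches the query string rule)
print(is_disallowed("/blog/latest-post"))                # False (no rule matches)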

There is a lot more one can do with robots.txt. If you find you require assistance, do get in touch.