What are bots?
Bots are applications that perform automated tasks. You may also hear them called robots, spiders, or crawlers. When we speak about bots in the context of your site’s traffic, we’re referring to automated behavior such as indexing your web pages. Much bot traffic is normal and healthy, but some bots are programmed to perform malicious actions, such as scanning your site for vulnerabilities or trying to brute-force login pages. Bots exist in many aspects of modern life, both good and bad. When it comes to your website, the robots.txt file is a way to help set boundaries for these bots.
In this article, we will explain what the robots.txt file is, what it does and, more importantly, what it doesn’t do.
What is the robots.txt File?
The robots.txt file is a text file that tells web crawling software which pages on your site you want indexed and which you don’t. It contains a list of “Allow” and “Disallow” directives along with the URLs that you want found and those that you want crawlers to skip.
The robots.txt file is often used as a layer of security, but in reality bots have no obligation to respect its rules. Reputable crawlers (such as those run by search engines) will respect the rules, but others (including spammers) will not.
Please note: By default, WP Engine restricts search engine traffic to any site using the installname.wpengine.com domain. This means search engines will not be able to visit sites that are not currently in production on a custom domain.
Google and the robots.txt File
Google respects the “Allow” and “Disallow” directives in the robots.txt file, but it does not respect certain other settings, such as crawl rate (how fast Google requests pages during the crawl process). Google may also pick up disallowed URLs when they are linked from other pages that it is able to crawl. Google has an excellent article on how to use robots.txt with their crawlers in this Support page.
How to create a robots.txt file
Using the text editor of your choice, create a file named ‘robots.txt’. Make sure the name is lowercase and that the extension is ‘.txt’ and not ‘.html’. Then, just place it in the root directory of your install using SFTP. That’s it.
Once you have created the file, list the user agents that you want to block and the pages that you want blocked.
This requires two directives:
1. User-agent: indicates which search engine robots the rule applies to. Most are listed in the Robots Database, the most comprehensive list as of this writing.
2. Disallow: indicates which page, file, or directory should not be crawled.
If you want to restrict all robot access to your site:
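A minimal sketch of such a file: the wildcard user agent “*” matches every crawler, and a Disallow value of “/” covers the whole site.

```text
# Applies to every crawler that honors robots.txt
User-agent: *
# "/" matches every path on the site
Disallow: /
```

As noted earlier, this only affects bots that choose to follow the rules.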
If you want to restrict robot access to certain directories and files, list them like this:
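A minimal sketch, assuming hypothetical directory and file names (/cgi-bin/, /tmp/, /junk/, and /private/report.html are placeholders, not paths from your site). Each path gets its own Disallow line.

```text
# Applies to every crawler that honors robots.txt
User-agent: *
# Hypothetical directories; a trailing slash blocks everything beneath them
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
# Hypothetical single file
Disallow: /private/report.html
```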
If you want to restrict robot access to all files of a specific type (we’re using .pdf in this example):
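A minimal sketch, assuming the crawler supports the “*” and “$” pattern characters. Google and Bing do, but not every crawler does, so treat this as a best-effort rule.

```text
# Applies to every crawler that honors robots.txt
User-agent: *
# "*" matches any sequence of characters; "$" anchors the match to the end of the URL
Disallow: /*.pdf$
```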
If you want to restrict a specific search engine (we’re using Googlebot-Image as an example):
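A minimal sketch: name the crawler in the User-agent line, then disallow the whole site for it. Crawlers that do not match this user agent are unaffected by the rule.

```text
# Applies only to Google's image crawler
User-agent: Googlebot-Image
# Block the entire site for this crawler
Disallow: /
```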
Add a crawl delay (the Crawl-delay directive) to define how long the bot must wait before visiting again (in this case bingbot with a 10-second delay):
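A minimal sketch using the Crawl-delay directive. This directive is not honored by every crawler; as mentioned above, Google does not respect crawl rate settings in robots.txt.

```text
# Applies only to Bing's crawler
User-agent: bingbot
# Ask the crawler to wait 10 seconds between visits
Crawl-delay: 10
```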
Adding the right combinations of directives can be complicated. Luckily, there are plugins that will also create (and test) the robots.txt file for you. Plugin examples include:
All in One SEO plugin
If you need more help configuring rules in your robots.txt file, we recommend visiting The Web Robots Pages for further guidance.
Best practices
The first best practice to keep in mind is that non-production sites should disallow all user agents. WP Engine does this automatically for any site using the installname.wpengine.com domain. Add a robots.txt file of your own only when you are ready to “go live” with your site.
Second, if you want to block a specific user agent, remember that robots do not have to follow the rules set in your robots.txt file. The best practice is to use a firewall such as Sucuri WAF or Cloudflare, which lets you block bad actors before they reach your site. Alternatively, you can contact support for more help.
Last, if you have a very large library of posts and pages on your site, crawling by Google and other search engines can cause performance issues. Increasing your cache expiration time or limiting the crawl rate can help offset this impact.