The internet is full of robots, and their ranks are growing fast. In 2015, roughly half of all internet traffic came from bots, so it’s important to pay attention to how robots interact with your site. These robots are software programs that automatically browse the internet. The most common kind crawls the web and indexes content for search engines, but there are also bad bots, which harvest email addresses or probe for sites that are vulnerable to mass hacking attempts. Luckily, you can use a robots.txt file to communicate with all of these non-human netizens, and in some cases even give them instructions on how to use your site.

What is Robots.txt?

A robots.txt file is a plain text file that tells robots how to interact with your website. For the most part, webmasters use this file to tell search engines where to find sitemaps and to discourage bad bots from easily accessing sensitive directories. If you want to learn more, I highly recommend checking out Robotstxt.org. It even has a Bots Database if you want to learn about specific web robots. Unfortunately, the site hasn’t been updated much since 2007, so you may not find the bot you’re looking for.

Where to Find the Robots.txt File

Finding your robots.txt file is usually a simple task. The file lives in your site’s top-level directory, which is also where your main index.html typically resides. When bots come to your site, they’re programmed to look for robots.txt by taking just the scheme and host of your URL and requesting /robots.txt from the root, e.g. https://www.example.com/robots.txt.
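To make that lookup concrete, here is a minimal sketch of how a crawler might derive the robots.txt location from any page URL, using Python’s standard urllib.parse module (the example URL is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Derive the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://www.example.com/blog/post?id=42"))
# https://www.example.com/robots.txt
```

No matter how deep the page, the bot always ends up asking for the same file at the root of the host.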

How to Use Robots.txt

Using the robots.txt file is fairly straightforward. Most files consist of a single record composed of two directives: “User-agent” and “Disallow.” The User-agent line declares which bots the record applies to. The Disallow line declares which directories to block. You can also use an Allow directive, which the major crawlers support: use Allow when you Disallow a parent directory but still want robots to crawl one of its sub-directories.
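For example, assuming hypothetical directory names, a record like this blocks the /private/ directory while still letting robots into its /private/reports/ sub-directory:

```
User-agent: *
Disallow: /private/
Allow: /private/reports/
```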

There are a lot of ways to use this file, but only a few recommended applications. Most websites just use the file to help robots locate their sitemaps. I do not recommend using robots.txt to block URLs from being indexed by search engines. If you disallow Googlebot from directories you don’t want indexed, you’re going to be disappointed. In many cases Googlebot will still discover the URL and record it without any metadata. As a result, the URL will still show up in the index, but without a title or description.

The Right Way to Block URLs From Being Indexed

Instead, you should use the robots noindex meta tag on a page by page basis. This will tell robots not to index the page, and you won’t end up with a bunch of empty records indexed on your site. If you want to block URLs from being indexed, try placing one of the following meta tags in your HTML head:

Meta Robots Noindex

<meta name="robots" content="noindex">

Place this tag in the HTML head of any page you do not want indexed. Bots will still crawl this URL and follow links, but they will not index this page.

Meta Robots Noindex, Nofollow

<meta name="robots" content="noindex,nofollow">

Place this tag in the HTML head of any page you do not want indexed and whose on-page links you do not want followed. Bots will still crawl this URL, but they will not follow any links on it or index the page.

How to Create a Robots.txt File

Creating this file is fairly easy. All you need to do is create a plain text file and save it as “robots.txt.” If you’re on a Windows machine, I recommend using notepad.exe, wordpad.exe, or my favorite, Notepad++. When you save your file, make sure the filename is all lowercase: URL paths are case sensitive on most servers, so if you name the file “Robots.txt” bots may fail to find it at all. Once you’ve saved your file, just place it in the proper directory (usually your top-level directory). That’s it.

Declare Your Sitemap in Robots.txt

Pointing robots to your XML sitemap is the most common use for robots.txt. The Sitemap directive is independent of any User-agent record, so it can appear anywhere in the file, but placing it first keeps it easy to find. To tell robots where to find your sitemap, use the following line (just replace the example URL with the full URL of your sitemap):

Sitemap: http://www.example.com/sitemap.xml

Allow All Robots to Crawl Every Directory on Your Site

If you wanted to create a robots.txt file that allows all bots to crawl every directory, it would look like this:

User-agent: *
Disallow:

Notice there’s an asterisk after User-agent. This may look like RegEx, but it’s not. Robots.txt files use a special syntax which means RegEx and Globbing are not supported. The asterisk in this case means “all robots.”

Disallow All Robots from Crawling Every Directory on Your Site

In the previous example, Disallow was left blank, so every directory was fair game for every robot to crawl. If you want to block every robot from accessing your site entirely, your robots.txt file would look like this:

User-agent: *
Disallow: /

In this example, you just have to add a trailing / to disallow all directories on your site. This works because Disallow matches paths by prefix: blocking a directory blocks everything beneath it. This will come in handy later when you’re picking which directories to block to keep sensitive areas of your site out of casual view.
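If you want to sanity-check how a parser reads these rules, Python’s standard urllib.robotparser module can evaluate them locally, without any network fetch (the bot name and URLs below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parse the "block everything" record without fetching it over the network.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /",
])

# With a trailing slash on Disallow, nothing on the site is fetchable.
print(rules.can_fetch("SomeBot", "https://www.example.com/"))        # False
print(rules.can_fetch("SomeBot", "https://www.example.com/blog/x"))  # False
```

Swap the Disallow line for an empty `Disallow:` and both calls return True instead.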

Block Sensitive WordPress Directories

Robots.txt is not a security tool, but you can use it to discourage bots from sensitive directories on your site. While this can make it harder to target your site for an attack, it does not mean disallowed directories are protected. Malicious robots can (and often do) ignore robots.txt entirely, even for directories that are disallowed by name. As a result, all you can do is make it harder to probe your site for vulnerabilities.

So which directories should you block? That depends on your site, but there are a few directories on WordPress sites that hackers commonly target. First, plug-ins are one of the most common components hackers try to exploit, because a single plug-in vulnerability lets them target a large volume of sites quickly and easily. Therefore, you’ll want to stop bots from easily detecting which plug-ins are installed on your site. You could disallow the plug-ins individually by name, but that defeats the purpose because it broadcasts which plug-ins you’re using. Instead, just disallow the entire plug-ins directory. Your record should look like the following snippet:

User-agent: *
Disallow: /wp-content/plugins/

Next, there are other sensitive directories on WordPress sites, like admin, trackback, comments, and cgi-bin. If you want to disallow all of these directories, try this snippet:

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /trackback/
Disallow: /comments/
Disallow: */comments/

Disallow Bad Robots

Now it’s time to discuss the bad bots. I’m not a fan of what they do, but I do find these rogue robots interesting. Many of them will ignore robots.txt entirely, but I feel it’s still worth trying to keep them off your site as much as possible. At the least, it helps save some bandwidth.

Backdoor.Bot (aka Trojan.Win32.Midgare) is exactly what it sounds like: a robot that creates a backdoor and gives a hacker unauthorized access to a computer or server. In addition, a backdoor bot can be used to inject malicious code into a site and use it to target other people. You will, of course, want to disallow this bot (and any others like it) from your site. To do this in robots.txt, you would use the following code:

User-agent: BackDoorBot/1.0
Disallow: /

There are a lot of bad bots to block. If you want to learn more, I recommend checking out Botreports.com’s bad bots list.

Wildcard Matching

Next, there’s wildcard matching. Currently, Google and Bing are the only search engines that support wildcard matching, but other search engines may catch up eventually.

You can use an asterisk for wildcard matching against any part of a URL path. It works a lot like the asterisk does in the User-agent field. If you want to use wildcard matching to disallow every URL that includes the word “tacocat,” then your robots.txt record would look like this:

User-agent: *
Disallow: /*tacocat

Combine the asterisk with a dollar sign to disallow URLs that end with a specific file extension. For instance, if you wanted to disallow every URL that ends in .xlsx, then your record would look like this:

User-agent: *
Disallow: /*.xlsx$
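Python’s built-in robots.txt parser ignores these wildcards, so here is a minimal sketch of how a wildcard-aware crawler might translate such patterns into regular expressions. This is an illustration under the rules described above, not any search engine’s actual implementation:

```python
import re

def wildcard_match(pattern: str, path: str) -> bool:
    """Match a robots.txt Disallow pattern against a URL path.

    '*' matches any run of characters; a trailing '$' anchors the
    match to the end of the path. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then turn each '*' into '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match already anchors at the start of the path.
    return re.match(regex, path) is not None

print(wildcard_match("/*tacocat", "/recipes/tacocat-stew"))   # True
print(wildcard_match("/*.xlsx$", "/files/budget.xlsx"))       # True
print(wildcard_match("/*.xlsx$", "/files/budget.xlsx.html"))  # False
```

The last case shows why the dollar sign matters: without it, the .xlsx pattern would also block URLs that merely contain the extension somewhere in the middle.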

Sample Robots.txt File

That covers the basics for using robots.txt. So, if you put together all of the snippets discussed above, then your robots.txt file should look something like this:

#
# robots.txt
#
Sitemap: http://www.example.com/sitemap.xml
#
# BEGIN block WP directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /trackback/
Disallow: /comments/
Disallow: */comments/
# END block WP directories
#
# BEGIN wildcard examples
Disallow: /*tacocat
Disallow: /*.xlsx$
# END wildcard examples
#
# BEGIN block bad bots
User-agent: BackDoorBot/1.0
Disallow: /
# END block bad bots

Finally, if you want to use this entire snippet just replace the example sitemap URL with the full URL of your sitemap. Then, remove any sections or lines you don’t need. If you save this file using the instructions in this article you’ll have a working robots.txt file. In the future you can add to this file as your site grows.
