In this technical article we’ll show how to block SEO bots on your network sites using the .htaccess file. We’ll cover two ways to do this; each has its own pros and cons, so it’s best to combine both for maximum reliability.

The following search engines / SEO crawlers are in our example blacklist. You may wish to add more to your custom list, but this is an excellent start:

https://moz.com/ – well-known SEO research tool

https://ahrefs.com/ – SEO research tool

https://majestic.com/ – SEO research tool

http://gigablast.com/ – open-source search engine that provides backlink information

All these services were either designed for SEO / backlink checking purposes or simply provide backlink information to their visitors. We want to block them because they don’t send any traffic to our websites, and at the same time they can help competitors discover the links pointing from our PBN sites to our client sites.

Requirements

This article assumes that you use the Apache web server and can edit the .htaccess file. The mod_rewrite and mod_authz_host Apache modules should be installed. This is pretty standard on most web hosting packages, and most hosts will be happy to enable them if they aren’t currently active.
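
If you have shell access, you can quickly check whether both modules are loaded. This is just a sketch – depending on your distribution the binary may be called apachectl, apache2ctl or httpd:

apachectl -M | grep -E 'rewrite|authz_host'
# If both modules are active you should see something like:
#  rewrite_module (shared)
#  authz_host_module (shared)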

Identify and block bots by user-agent string

The first way to identify SEO bots is by their user-agent string. Each web client (web browser or bot) sends a special string with every request. This string should be unique to the client, so we can identify which bot is crawling our website and block those that are not welcome. To block the four bots listed above, put the following at the top of your .htaccess file:

RewriteEngine On
# moz.com
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC,OR]
# majestic.com
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
# moz.com (dotbot is also run by Moz)
RewriteCond %{HTTP_USER_AGENT} dotbot [NC,OR]
# gigablast.com
RewriteCond %{HTTP_USER_AGENT} gigabot [NC,OR]
# ahrefs.com
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC]
# Return "403 Forbidden" to any matching client ([NC] makes the
# matches case-insensitive, since user-agent casing varies)
RewriteRule .* - [F]

Basically this tells Apache to send a “403 Forbidden” response to every client whose user-agent string contains one of the specified substrings.

This method has disadvantages, though. User-agent strings are trivial to spoof, so anyone with access to curl, or to a “User-Agent Switcher” browser extension, can make requests to your websites using Ahrefs’ user-agent string. In addition, SEO bots might use “unofficial” user-agent strings to bypass this blocking method.
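
The upside of easy spoofing is that it also gives you a simple way to test your rules. Here is a sketch using curl, with example.com standing in for your actual site:

# Pretend to be AhrefsBot; -I fetches response headers only
curl -I -A "AhrefsBot" https://example.com/
# Expected: HTTP/1.1 403 Forbidden

# A regular request should still go through
curl -I https://example.com/
# Expected: HTTP/1.1 200 OK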

Identify and block bots by hostname

Another method is to block bots by their hostname. This is similar to blocking by IP address, but more reliable and easier to implement, because you don’t have to maintain a long (and constantly changing) list of IP addresses used by the crawlers you want to block. This method has several disadvantages too, though. First, we cannot block all bots this way: Majestic SEO, for example, claims to be a distributed search engine, which means it crawls from a wide range of users’ IP addresses. Some bots don’t have known hostname patterns, or change the datacenters they use quite frequently. However, we can block some of them using this technique. Below is an example .htaccess snippet (again, this should go at the top of the .htaccess file, before WordPress rewrite rules, for example).

Deny from .ahrefs.com
Deny from .dotnetdotcom.org

As you can see, here we block Ahrefs crawlers and dotnetdotcom.org crawlers (that company was purchased by moz.com, and Moz now uses their data too).
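
One caveat: Deny from is the old Apache 2.2 syntax, and on Apache 2.4 it only works if the mod_access_compat module is enabled. If you are on Apache 2.4 and prefer the native syntax, a rough equivalent using mod_authz_host looks like this:

<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except clients whose hostname ends with these domains
    Require not host ahrefs.com
    Require not host dotnetdotcom.org
</RequireAll>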

Moz.com’s bot, rogerbot, currently uses the Amazon Web Services (https://aws.amazon.com/) infrastructure for crawling, so blocking it by hostname is challenging. The issue is that if you block AWS servers, you may accidentally block some legitimate clients, such as RSS feed readers and some minor search engines. However, it won’t affect your regular visitors or major search engines like Google and Bing, so it’s probably safe to block requests from AWS too. If that trade-off is acceptable to you, add “.amazonaws.com” to the list of blocked hosts, so the final version looks like:

Deny from .ahrefs.com
Deny from .dotnetdotcom.org
Deny from .amazonaws.com
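
If you want to check whether a particular crawler IP would be caught by these rules, do a reverse DNS lookup on it. A sketch with a placeholder address (192.0.2.10 is a reserved documentation IP):

host 192.0.2.10
# Apache performs a similar double-reverse lookup on each request that
# hits a host-based Deny rule, so these rules add a small DNS overhead.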

Conclusion

For maximum reliability, we recommend using both blocking techniques. The final version of the .htaccess file looks like this:

RewriteEngine On
# moz.com
RewriteCond %{HTTP_USER_AGENT} rogerbot [NC,OR]
# majestic.com
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
# moz.com (dotbot is also run by Moz)
RewriteCond %{HTTP_USER_AGENT} dotbot [NC,OR]
# gigablast.com
RewriteCond %{HTTP_USER_AGENT} gigabot [NC,OR]
# ahrefs.com
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC]
# Return "403 Forbidden" to any matching client
RewriteRule .* - [F]
# Block by hostname as a second line of defense
Deny from .ahrefs.com
Deny from .dotnetdotcom.org
# Please remove the line below if you don't want to block requests from AWS
Deny from .amazonaws.com
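
Once the rules are live, you can confirm they are working by looking for 403 responses in your access log. A sketch assuming the default Debian/Ubuntu log location – adjust the path for your server:

grep ' 403 ' /var/log/apache2/access.log | grep -iE 'rogerbot|mj12bot|dotbot|gigabot|ahrefsbot'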

Let us know if you have any questions in the comments!
