Your website’s robots.txt file probably contains some rules that tell compliant search engines and other bots which pages they may visit and which are off-limits. In most of the robots.txt files that I’ve looked at, all of the Disallow rules are applied to all user agents. This is done with the wildcard operator, which is written as an asterisk (*), like this:
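Here is a minimal illustration (the disallowed path is hypothetical, not taken from any actual site):

```
User-agent: *
Disallow: /private/
```

The User-agent: * line means the Disallow rule that follows applies to every bot that reads the file.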
This site’s robots.txt file provides a typical example. All of the allow/disallow rules apply to any bot that visits, regardless of its reported user-agent.
Your robots.txt file provides a simple way to block TONS of bad bots, spiders and scrapers.
Targeting specific user agents
So the universal wildcard * is sort of the default for most websites: simply apply all robots rules to all bots. That’s fine, very simple and easy. But if your site is getting hammered by tons of spam and other unwanted bot behavior and creepy crawling, you can immediately reduce the spam traffic by targeting specific user agents via robots.txt. Let’s look at an example:
```
User-agent: *
Allow: /

User-agent: 360Spider
Disallow: /

User-agent: A6-Indexer
Disallow: /

User-agent: Abonti
Disallow: /

User-agent: AdIdxBot
Disallow: /

User-agent: adscanner
Disallow: /
```
First and most important, we allow all user agents access to everything. That happens in the first two lines:
```
User-agent: *
Allow: /
```
So that means the default rule for any visiting bot is, in human-speak, “go ahead and crawl anywhere you want, no restrictions.” Then from there, the remaining robots rules target specific user-agents: 360Spider, A6-Indexer, Abonti, and so forth. For each of those bots, we disallow access to everything with a Disallow: / rule.
So if any of these other bots come along, and they happen to be compliant with robots.txt (many are, surprisingly), they will obey the disallow rule and not crawl anything at your site. In robots language, the slash / matches any/all URLs.
Together, the above set of robots rules effectively says to bots, “360Spider, A6-Indexer, Abonti, AdIdxBot, and adscanner are not allowed here, but all other bots can crawl whatever they’d like.”
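If you want to sanity-check how a compliant parser interprets rules like these, Python’s standard urllib.robotparser module works well. Here is a minimal sketch using a trimmed-down version of the rules above (example.com and the bot names here are just placeholders for testing):

```python
from urllib.robotparser import RobotFileParser

# A trimmed-down version of the rules above: allow everyone,
# then fully disallow one specific user-agent.
rules = """\
User-agent: *
Allow: /

User-agent: 360Spider
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# An unlisted bot falls through to the wildcard record (allowed).
print(rp.can_fetch("Googlebot", "https://example.com/some-page"))   # True

# The targeted bot matches its own record (blocked).
print(rp.can_fetch("360Spider", "https://example.com/some-page"))   # False
```

Keep in mind that robots.txt is purely advisory: compliant bots honor it, but nothing in the protocol forces a scraper to obey.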
Depending on the site, this user-agent technique can greatly reduce the amount of bandwidth-wasting, resource-draining spammy bot traffic. The more spammy bots you block via robots.txt, the less spammy traffic is gonna hit your site. So where can you get an awesome list of spammy user-agents? Glad you asked…
Block 650+ spammy user-agents
To get you started crafting your own anti-spam robots rules, here is a collection of over 650 spammy user-agents. This is a ready-to-go, copy-&-paste set of user-agent rules for your site’s robots.txt file. By downloading, you agree to the Terms.
Terms / Disclaimer
This collection of user-agent rules for robots.txt is provided “as-is”, with the intention of helping people protect their sites against bad bots and other malicious activity. By downloading this code, you agree to accept full responsibility for its use. So use wisely, test thoroughly, and enjoy.
Also: Before applying this set of robots rules, make sure to read through the rules and remove any user-agent(s) that you don’t want to block. And to be extra safe, make sure to validate your robots.txt file using an online validator.
Other ways to block bad bots
If you’re asking whether or not robots.txt is the best way to block spammy user-agents, the answer is probably “no”. Much of my work with Apache/.htaccess (like nG Firewall) focuses on strong ways to block bad bots. If you want more powerful antispam/bad-bot protection, check out these tools from yours truly:
- Blackhole for Bad Bots — Trap bad bots in a virtual black hole (free WordPress plugin)
- Blackhole for Bad Bots — Trap bad bots in a virtual black hole (PHP/standalone script)
- 7G Firewall — Super strong firewall (and bad-bot protection) for sites running on Apache or Nginx
So check ’em out if you want stronger protection against online threats. For casual admins and site owners, the robots.txt user-agent rules provide a simple, effective way to reduce spam.