Save 25% on Wizard’s SQL for WP w/ code: WIZARDSQL
Web Dev + WordPress + Security

Target User Agents and Reduce Spam via robots.txt

Bad robots

Your website’s robots.txt file probably contains some rules that tell compliant search engines and other bots which pages they can visit, and which are not allowed, etc. In most of the robots.txt files that I’ve looked at, all of the Allow and Disallow rules are applied to all user agents. This is done with the wildcard operator, which is written as an asterisk *, like this:

User-agent: *

This site’s robots.txt file provides a typical example. All of the allow/disallow rules apply to anything/bot that visits, regardless of their reported user-agent.

Your robots.txt file provides a simple way to block TONS of bad bots, spiders and scrapers.

Targeting specific user agents

So the universal wildcard * is sort of the default for most/many websites. Simply apply all robots rules to all bots. That’s fine, very simple and easy. But if your site is getting hammered by tons of spam and other unwanted bot behavior and creepy crawling, you can immediately reduce the spam traffic by targeting specific user agents via robots.txt. Let’s look at an example:

User-agent: *
Allow: /

User-agent: 360Spider
Disallow: /

User-agent: A6-Indexer
Disallow: /

User-agent: Abonti
Disallow: /

User-agent: AdIdxBot
Disallow: /

User-agent: adscanner
Disallow: /

First and most important, we allow all user agents access to everything. That happens in the first two lines:

User-agent: *
Allow: /

So that means the default rule for any visiting bots is, in human-speak, “go ahead and crawl anywhere you want, no restrictions.” Then from there, the remaining robots rules target specific user-agents, 360Spider, A6-Indexer, Abonti, and so forth. For each of those bots, we disallow access to everything with a Disallow rule:

Disallow: /

So if any of these other bots come along, and they happen to be compliant with robots.txt (many are, surprisingly), they will obey the disallow rule and not crawl anything at your site. In robots language, the slash / matches any/all URLs.

Together, the above set of robots rules effectively says to bots, “360Spider, A6-Indexer, Abonti, AdIdxBot, and adscanner are not allowed here, but all other bots can crawl whatever they’d like.”

Depending on the site, this user-agent technique can greatly reduce the amount of bandwidth-wasting, resource-draining spammy bot traffic. The more spammy bots you block via robots.txt, the less spammy traffic is gonna hit your site. So where to get an awesome list of spammy user-agents? Glad you asked..

Tip: An informative look at how Google interprets the robots.txt specification. Lots of juicy details in that article.

Block over 650+ spammy user-agents

To get you started crafting your own anti-spam robots rules, here is a collection of over 650 spammy user-agents. This is a ready-to-go, copy-&-paste set of user-agent rules for your site’s robots.txt file. By downloading, you agree to the Terms.

650+ user-agents for robots.txtVersion 1.0 ( 21 bytes TXT )
Note: I can’t take credit for this set of rules. It was sent in by a reader some time ago. If you happen to know the source of these robots.txt user-agent rules, let me know so I can add a credit link to the article.

Terms / Disclaimer

This collection of user-agent rules for robots.txt is provided “as-is”, with the intention of helping people protect their sites against bad bots and other malicious activity. By downloading this code, you agree to accept full responsibility for its use. So use wisely, test thoroughly, and enjoy.

Also: Before applying this set of robots rules, make sure to read through the rules and remove any user-agent(s) that you don’t want to block. And to be extra safe, make sure to validate your robots.txt file using an online validator.

Other ways to block bad bots

If you’re asking whether or not robots.txt is the best way to block spammy user-agents, the answer probably is “no”. Much of my work with Apache/.htaccess (like nG Firewall) focuses on strong ways to block bad bots. If you want more powerful antispam/bad-bot protection, check out these tools from yours truly:

  • Blackhole for Bad Bots — Trap bad bots in a virtual black hole (free WordPress plugin)
  • Blackhole for Bad Bots — Trap bad bots in a virtual black hole (PHP/standalone script)
  • 7G Firewall — Super strong firewall (and bad-bot protection) for sites running on Apache or Nginx

So check ’em out if you want stronger protection against online threats. For just casual admins and site owners, the robots.txt user-agent rules provide a simple, effective way to reduce spam.

Jeff Starr
About the Author
Jeff Starr = Creative thinker. Passionate about free and open Web.
The Tao of WordPress: Master the art of WordPress.

Leave a reply

Name and email required. Email kept private. Basic markup allowed. Please wrap any small/single-line code snippets with <code> tags. Wrap any long/multi-line snippets with <pre><code> tags. For more info, check out the Comment Policy and Privacy Policy.

Subscribe to comments on this post

Welcome
Perishable Press is operated by Jeff Starr, a professional web developer and book author with two decades of experience. Here you will find posts about web development, WordPress, security, and more »
SAC Pro: Unlimited chats.
Thoughts
DIY: Monitor File Changes via Cron working perfectly for over a decade.
Mastodon social is a trip. Glad I found it.
As a strict rule, I never use cache plugins on any of my sites. They cause more problems than they solve, imho. Just not worth it.
Currently on a posting spree :)
6 must come before 7.
My top three favorite-to-write coding languages: CSS, PHP, JavaScript.
If you’re not 100% sure that you can trust something, you can’t.
Newsletter
Get news, updates, deals & tips via email.
Email kept private. Easy unsubscribe anytime.