Protect Your Site with a Blackhole for Bad Bots

One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots immediately [...] • Read more »

2010 IP Blacklist

Over the course of each year, I blacklist a considerable number of individual IP addresses. Every day, Perishable Press is hit with countless numbers of spammers, scrapers, crackers and all sorts of other hapless turds. Weekly examinations of my site’s error logs enable me to filter through the chaff and cherry-pick only the most heinous, nefarious attackers for blacklisting. Minor offenses are generally dismissed, but the evil bastards that insist on wasting resources running redundant automated scripts [...] • Read more »

Stop 404 Requests for Mobile Versions of Your Site

If you’ve been keeping an eye on your 404 errors recently, you will have noticed an increase in requests for nonexistent mobile files and directories, especially over the past year or so. The scripts and bots requesting these files from your server seem to be looking for a mobile version of your site. Unfortunately, they are wasting bandwidth and resources in the process. It has become common to see the following 404 errors constantly repeated in your [...] • Read more »

Pimp Your 404: Presentation and Functionality

I have been wanting to write about 404 error pages for quite awhile now. They have always been very important to me, with customized error pages playing a integral part of every well-rounded web-design strategy. Rather than try to re-invent the wheel with this, I think I will just go through and discuss some thoughts about 404 error pages, share some useful code snippets, and highlight some suggested resources along the way. In a sense, this post [...] • Read more »

4G Series: The Ultimate User-Agent Blacklist, Featuring Over 1200 Bad Bots

As discussed in my recent article, Eight Ways to Blacklist with Apache’s mod_rewrite, one method of stopping spammers, scrapers, email harvesters, and malicious bots is to blacklist their associated user agents. Apache enables us to target bad user agents by testing the user-agent string against a predefined blacklist of unwanted visitors. Any bot identifying itself as one of the blacklisted agents is immediately and quietly denied access. While this certainly isn’t the most effective method of securing [...] • Read more »

Yahoo! Slurp too Stupid to be a Robot

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our websites every minute of the day. For the most part, there are effective [...] • Read more »

Yahoo! Lies about Obeying Robots.txt Directives

There are two possibilities here: Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying Robots directives. Either case isn’t good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap: • Read more »

Evil Incarnate, but Easily Blocked

As my readers know, I spend a lot of time digging through error logs, preventing attacks, and reporting results. Occasionally, some moron will pull a stunt that deserves exposure, public humiliation, and banishment. There is certainly no lack of this type of nonsense, as many of you are well-aware. 3G Blacklist Even so, I have to admit that I am very happy with my latest strategy against crackers, spammers, and other scumbags, namely, the 3G Blacklist. Since [...] • Read more »

Yahoo! Once Again Caught Disobeying Robots.txt Rules

Hmmm.. Let’s see here. Google can do it. MSN/Live can do it. Even Ask can do it. So why oh why can’t Yahoo’s grubby Slurp crawler manage to adhere to robots.txt crawl directives? Just when I thought Yahoo! finally figured it out, I discover more Slurp tracks in my Blackhole trap for bad spiders: • Read more »

Yahoo Incongruities.

When frustration builds, and finally reaches its the boiling point, it’s nice to be able to express yourself to someone. Although I really don’t enjoy ranting about things, but when it comes to certain aspects of Yahoo!, I just can’t he’p myse’f. So, thanks to recent attempt at using My Yahoo!, it’s time to get some of this off my chest, clear the decks, and give Yahoo! (yet another) chance to clean up its act. Here are [...] • Read more »

Unexplained Crawl Behavior Involving Tagged Query Strings

I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 Errors generated from msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following: http://perishablepress.com/press/page/2/?tag=spam http://perishablepress.com/press/page/3/?tag=code http://perishablepress.com/press/page/2/?tag=email http://perishablepress.com/press/page/2/?tag=xhtml http://perishablepress.com/press/page/4/?tag=notes http://perishablepress.com/press/page/2/?tag=flash http://perishablepress.com/press/page/2/?tag=links http://perishablepress.com/press/page/3/?tag=theme http://perishablepress.com/press/page/2/?tag=press ..plus hundreds and hundreds more 1. The URL pattern is always the same: a different page number followed [...] • Read more »

Over 150 of the Worst Spammers, Scrapers and Crackers from 2007

Update 2010/07/07: Please visit the 2010 IP Blacklist for more current information. Over the course of each year, I blacklist a considerable number of individual IP addresses. Every day, Perishable Press is hit with countless numbers of spammers, scrapers, crackers and all sorts of other hapless turds. Weekly examinations of my site’s error logs enable me to filter through the chaff and cherry-pick only the most heinous, nefarious attackers for blacklisting. Minor offenses are generally dismissed, but [...] • Read more »

Blacklist Candidate Number 2008-02-10

Welcome to the Perishable Press “Blacklist Candidate” series. In this post, we continue our new tradition of exposing, humiliating and banishing spammers, crackers and other worthless scumbags.. Scumbag number 2008-02-10, “COME ON DOWN!!” — you’re the next baboon to get banished from the site! Like many bloggers, I like to spend a little quality time each week examining my site’s error logs. The data contained in Apache, 404, and even PHP error logs is always enlightening. In [...] • Read more »

Blacklist Candidate Number 2008-01-02

Come one, come all — today we officially begin a new series of posts here at Perishable Press: the public exposure, humiliation, and banishment of spammers, crackers, and other site attackers. Kicking things off for 2008: blacklist candidate number 2008-01-02! Every Wednesday, I take a little time to investigate my 404 error logs. In addition to spam, crack attacks, and other deliberate mischief, the 404 logs for Perishable Press contain errors due to missing resources, mistyped URLs, [...] • Read more »

Yahoo! Slurp in My Blackhole (Yet Again)

Yup, ‘ol Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five different times via three different IPs over the course of four different days. Here is the data recorded in my site’s blackhole log (I know, that sounds terrible): • Read more »

Yahoo! in my Blackhole

Okay, I realize that the title sounds a bit odd, but nowhere near as odd as my recent discovery of Slurp ignoring explicit robots.txt rules and digging around in my highly specialized bot trap, which I have lovingly dubbed “the blackhole”. What is up with that, Yahoo!? — does your Slurp spider obey robots.txt directives or not? I have never seen Google crawling around that side of town, neither has MSN nor even Ask ventured into the [...] • Read more »