Yahoo! Slurp in My Blackhole (Yet Again)
Yup, ol’ Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five times via three different IPs on three separate days over a four-day span. Here is the data recorded in my site’s blackhole log (I know, that sounds terrible):
74.6.22.164
[2007-11-29 (Thu) 07:15:36] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
[2007-11-29 (Thu) 22:45:16] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
[2007-12-01 (Sat) 22:56:45] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
[2007-12-02 (Sun) 01:29:51] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
202.160.181.126
[2007-12-02 (Sun) 02:52:08] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
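For anyone who wants to tally a log like this rather than count by eye, here is a minimal Python sketch. It assumes every record is exactly three non-empty lines (IP, timestamped request, user agent), as in the entries above; the helper name `parse_blackhole_log` is made up for illustration and is not part of the actual blackhole script.

```python
import re
from collections import Counter

# The five entries from the log above: IP, timestamped request, user agent.
BLACKHOLE_LOG = """\
74.6.22.164
[2007-11-29 (Thu) 07:15:36] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
[2007-11-29 (Thu) 22:45:16] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
[2007-12-01 (Sat) 22:56:45] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
[2007-12-02 (Sun) 01:29:51] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
202.160.181.126
[2007-12-02 (Sun) 02:52:08] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
"""

# Assumed layout: the request line starts with "[YYYY-MM-DD".
DATE_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2})")


def parse_blackhole_log(text):
    """Return a list of (ip, date, user_agent) tuples (hypothetical helper)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    # Walk the log three lines at a time: IP, request, user agent.
    for i in range(0, len(lines) - 2, 3):
        ip, request, agent = lines[i:i + 3]
        match = DATE_RE.match(request)
        if match is None:
            continue  # skip anything that does not fit the assumed layout
        records.append((ip, match.group(1), agent))
    return records


records = parse_blackhole_log(BLACKHOLE_LOG)
hits_by_ip = Counter(ip for ip, _, _ in records)
days = sorted({date for _, date, _ in records})

print(len(records), "hits from", len(hits_by_ip), "IPs on", len(days), "days")
for ip, count in hits_by_ip.most_common():
    print(ip, count)
```

Grouping by IP makes it easy to see whether the visits come from one stray crawler or from several machines.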
You know, I think I speak for a majority of webmasters and bloggers when I say that I really don’t want to blacklist Yahoo! However, if they continue disobeying rules that they claim to follow, I may have no choice. Just because Yahoo! sends me a few hits every now and then is no reason to treat them any differently than other bad bots. Granted, this would be much simpler if we were dealing with a single Slurp bot instead of three (or possibly more) of them. If it were only a single experimental Slurp bot sniffing around, this might almost be forgivable, but alas, we are talking about numerous IPs here, clearly demonstrating a non-compliant policy regarding robots.txt specifications.
So, until I decide to resort to the undesirable action of blacklisting Yahoo! Slurp, either partially or entirely, I have taken extra measures to ensure that there is absolutely no mistaking the fact that I explicitly forbid any bots (including yours, Yahoo) from accessing the site’s blackhole directory. My robots.txt file now includes the following redundant directives, just in case ol’ Slurp is having a hard time understanding wildcard operators:
Disallow: /blackhole
Disallow: /blackhole/
Disallow: */blackhole
Disallow: */blackhole/*
I mean, let’s be absolutely clear about this. In the real world, this is the equivalent of posting a 50-foot, flashing neon billboard that says KEEP THE F* OUT!!! Here’s hoping they finally take the hint.
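To sanity-check how a compliant parser reads directives like these, here is a rough Python sketch using the standard library’s `urllib.robotparser`. The `User-agent: *` line is an assumption on my part, since the excerpt above shows only the Disallow rules, and the `/about/` path is just a made-up example of an unrelated URL. As far as I can tell, this particular parser treats each rule as a plain path prefix and does not expand wildcards, so the first two lines are the ones doing the work here.

```python
from urllib.robotparser import RobotFileParser

# The Disallow rules from above; the User-agent line is assumed, not quoted.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blackhole
Disallow: /blackhole/
Disallow: */blackhole
Disallow: */blackhole/*
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(agent, path):
    """Return True if the parsed rules permit `agent` to fetch `path`."""
    return parser.can_fetch(agent, path)


# A well-behaved crawler should refuse the trap but may fetch other pages.
print("blackhole allowed:", is_allowed("Yahoo! Slurp", "/blackhole/"))
print("other page allowed:", is_allowed("Yahoo! Slurp", "/about/"))
```

This only shows what one parser makes of the file; it says nothing about how Slurp itself interprets the rules.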
Related articles
- Yahoo! in my Blackhole
- Suspicious Behavior from Yahoo! Slurp Crawler
- How to Verify the Four Major Search Engines
- Robots Notes Plus
- Disobedient Robots and Company
- Invite Only: Visitor Exclusivity via the Opt-In Method
- Website Attack Recovery
About this article
This is article #461, posted by Perishable on Sunday, December 16, 2007 @ 06:23pm. Categorized as Websites, and tagged with ip, robots, search, security, server, spider, yahoo. Updated on December 17, 2007.
1 • December 25, 2007 at 4:28 am — Ibnu Asad says:
I’ve been watching Yahoo’s behaviour on my site, and each time they access it, they cause a high load on the server. I don’t get it. Why can Google index efficiently without causing any serious load on the server when Yahoo can’t?
I have wanted to block Yahoo from my site a few times, but I backed off because some of my sites’ traffic comes from Yahoo.
What do you think? Should I just concentrate on Google instead?