
Yahoo! Slurp in My Blackhole (Yet Again)

Yup, ol’ Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five different times via three different IPs on three different days within a four-day span. Here is the data recorded in my site’s blackhole log (I know, that sounds terrible):

74.6.22.164
    [2007-11-29 (Thu) 07:15:36] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
    [2007-11-29 (Thu) 22:45:16] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
    [2007-12-01 (Sat) 22:56:45] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
    [2007-12-02 (Sun) 01:29:51] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
202.160.181.126
    [2007-12-02 (Sun) 02:52:08] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
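
For anyone who wants to double-check the tally, here is a minimal Python sketch that counts hits, unique IPs, and calendar days from the entries above. The log text is copied from this post; the parsing rules (an IP on its own line, followed by a bracketed timestamp) are my assumption based on how those entries are laid out, not a description of how the blackhole script itself stores its data.

```python
import re
from datetime import datetime

# Log text copied from the entries above.
LOG = """\
74.6.22.164
    [2007-11-29 (Thu) 07:15:36] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
    [2007-11-29 (Thu) 22:45:16] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
    [2007-12-01 (Sat) 22:56:45] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
74.6.26.167
    [2007-12-02 (Sun) 01:29:51] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
202.160.181.126
    [2007-12-02 (Sun) 02:52:08] "GET /blackhole/ HTTP/1.0"
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
"""

# Assumed layout: a bare IPv4 address line, then a "[date (day) time]" line.
IP_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")
STAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}) \(\w{3}\) (\d{2}:\d{2}:\d{2})\]")


def summarize(log_text):
    """Count hits, unique IPs, distinct dates, and the overall span in days."""
    ips, stamps = [], []
    for raw in log_text.splitlines():
        line = raw.strip()
        if IP_RE.match(line):
            ips.append(line)
            continue
        match = STAMP_RE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H:%M:%S"
            )
            stamps.append(stamp)
    days = sorted({stamp.date() for stamp in stamps})
    span = (days[-1] - days[0]).days + 1 if days else 0
    return {
        "hits": len(stamps),
        "unique_ips": len(set(ips)),
        "distinct_days": len(days),
        "span_days": span,
    }


summary = summarize(LOG)
print(summary)
```

Run against the entries above, this reports five hits from three IPs on three calendar dates (Nov 29, Dec 1, Dec 2), spanning four days in total.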

You know, I think I speak for a majority of webmasters and bloggers when I say that I really don’t want to blacklist Yahoo!, but if they continue disobeying rules that they claim to follow, I may have no choice. Just because Yahoo! sends me a few hits every now and then is no reason to treat them any differently than other bad bots. Granted, this would be much simpler if we were dealing with a single Slurp bot instead of three (or possibly more) of them. If it were only a single experimental Slurp bot sniffing around, this might almost be forgivable, but alas, we are talking about numerous IPs here, which points to a policy of non-compliance with robots.txt rules rather than a one-off glitch.

So, until I decide to resort to the undesirable action of blacklisting Yahoo! Slurp, either partially or entirely, I have taken extra measures to ensure that there is absolutely no mistaking the fact that I explicitly forbid any bots (including yours, Yahoo) from accessing the site’s blackhole directory. My robots.txt file now includes the following redundant directives, just in case ol’ Slurp is having a hard time understanding wildcard operators:

Disallow: /blackhole
Disallow: /blackhole/
Disallow: */blackhole
Disallow: */blackhole/*
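
If you want to sanity-check rules like these before trusting a crawler to honor them, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The four Disallow lines are the ones above; the `User-agent: *` line is an assumption on my part, since the snippet does not show which group the rules sit under. Note that this parser does plain prefix matching and does not interpret `*` inside a path, so only the first two rules do any work here. That says nothing about how Slurp itself reads them.

```python
from urllib.robotparser import RobotFileParser

# Disallow lines from the post; the "User-agent: *" group is assumed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blackhole
Disallow: /blackhole/
Disallow: */blackhole
Disallow: */blackhole/*
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="Yahoo! Slurp"):
    """Return True if the parsed rules permit `agent` to fetch `path`."""
    return parser.can_fetch(agent, path)


for path in ("/blackhole/", "/blackhole", "/index.html"):
    print(path, "allowed" if allowed(path) else "disallowed")
```

A compliant crawler reading the same file should reach the same verdict for `/blackhole/`, which is exactly why a hit in the blackhole log is so telling.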

I mean, let’s be absolutely clear about this. In the real world, this is the equivalent of posting a 50-foot, flashing neon billboard that says KEEP THE F* OUT!!! Here’s hoping they finally take the hint.

About this article

This is article #461, posted by Perishable on Sunday, December 16, 2007 @ 06:23pm. Categorized as Websites, and tagged with ip, robots, search, security, server, spider, yahoo. Updated on December 17, 2007. Visited 5802 times. 11 Responses »




11 Responses

1 • December 25, 2007 at 4:28 am — Ibnu Asad says:

I’ve been watching Yahoo’s behaviour on my site, and each time they access it, they cause a high load on the server. I don’t get it. Why can Google index efficiently without causing any serious load on the server when Yahoo can’t?

I wanted to block Yahoo from my site a few times, but I backed off because some of my sites’ traffic comes from Yahoo.

What do you think? Should I just concentrate on Google instead?

2 • December 26, 2007 at 8:54 am — DeepFreeze says:

Ban them all. Search engine giants never tell you exactly how they work. If you can’t handle the abuse those bots are giving you, kick them out.

3 • December 26, 2007 at 10:47 am — Perishable says:

@Ibnu Asad:

I do not get much traffic from Yahoo (although they crawl around incessantly), so personally I would not lose much if I banned them entirely. However, I would advise you to weigh the decision very carefully, consider your traffic, your target audience, and long-term goals. If you are well-established, and Yahoo sends little or no traffic, then by all means, block them and save the bandwidth and server load for other search engines and actual visitors. On the other hand, if you are receiving a reasonable amount of traffic, and feel the server load is worth it, then I would suggest allowing them to crawl, but keep a close eye on them.

4 • December 26, 2007 at 10:52 am — Perishable says:

@DeepFreeze:

Although I frequently find myself thinking the exact same thing, I try to keep my head in check by looking at the big picture. After all, there is a fine line between paranoia and legitimate abuse, especially when it comes to corporate machines such as Google, Yahoo, and MSN. Everyone assumes that they are legit just because they are the dominant players in the search engine game; however, that fact alone fails to justify continued submission to questionable tactics.

5 • December 26, 2007 at 10:35 pm — DeepFreeze says:

I came across something funny. When I was checking how many backlinks are pointing to my blog, I got the following stats:

1. Google: 0
2. Yahoo: 430
3. AltaVista: 480
4. AllTheWeb: 477

What’s with Google? Why no backlinks? Google’s bots must be taking the holidays seriously. lol.

PS: I used smartpagerank.com to get the stats.

6 • December 29, 2007 at 8:38 am — Perishable says:

DeepFreeze,

That is rather odd. I wonder if Google has penalized your site for something. It looks like you have a few sponsored links that Google may have discovered. They are getting fairly aggressive about nofollowing paid links, but I have (so far) only heard of larger sites getting dropped. But alas, I digress: this is a post about Yahoo, not Google!

7 • December 29, 2007 at 9:18 am — DeepFreeze says:

^^ You’re right, I should stick to the subject. But still, Google’s bots are able to crawl sites much more effectively than Yahoo’s.

8 • January 4, 2008 at 4:39 pm — lizard says:

so … did it work? did it stop? i found your site when researching weird behavior from the yahoo bot, and i’m debating banning it entirely — if that is even possible?

9 • January 5, 2008 at 3:21 pm — Perishable says:

Apparently so. Since implementing the new rules, I have only seen legit behavior from Slurp. Of course, who knows if Yahoo always identifies its crawlers as “Slurp”, but that’s another post :)

Additionally, pulling the plug on “Big Y” is as easy as one of these:

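# BYE BYE YAHOO SLURP
# Flag any request whose User-Agent contains "Slurp" (case-insensitive)
SetEnvIfNoCase User-Agent "Slurp" keep_out
# Apache 2.2-style access rules (Apache 2.4 needs mod_access_compat for these)
<Limit GET POST>
 order allow,deny
 allow from all
 deny from env=keep_out
</Limit>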

Add that to your site’s root htaccess file and kiss ol’ Slurp goodbye!
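
To preview which visitors a rule like that would catch, here is a rough Python sketch of the same test: an unanchored, case-insensitive match for "Slurp" in the User-Agent string. The function name and the sample list are hypothetical, and the Googlebot string is there purely as a non-matching example. This only imitates the pattern matching, not Apache itself.

```python
import re

# Same pattern as the SetEnvIfNoCase line: unanchored and case-insensitive.
SLURP_RE = re.compile(r"Slurp", re.IGNORECASE)


def would_be_denied(user_agent):
    """Return True if the User-Agent would be flagged with keep_out."""
    return SLURP_RE.search(user_agent) is not None


# Illustrative sample strings; the first is taken from the blackhole log.
samples = [
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

for agent in samples:
    verdict = "denied" if would_be_denied(agent) else "allowed"
    print(f"{verdict}: {agent}")
```

Because the match is on the self-reported User-Agent, a crawler that identifies itself as something else would sail right past it.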

10 • January 5, 2008 at 6:26 pm — lizard says:

thanks, i may actually do that. it’s really messing with my mind, the way that bot acts. it behaves exactly like a bad guy trying to guess at where i hide stuff. on the Yahoo blog it tries to say they’re “looking to see what kind of 404 page” or something, but instead of using something random, it guesses at stuff using things it finds in the pages. it even found an unlinked page with script errors that i forgot to delete, and then tried to use the text of error messages it found on that page, to guess at other pages. that is so sneaky and black-hat-like.

heh. at least it’s helped me clean up the site. but still. eww.

11 • January 6, 2008 at 9:33 am — Perishable says:

I know what you mean. They even like to grab keywords from sites that you link to and use them as query strings on your own site. For a while, I had a prominent link to a site on the same server and received frequent visits from Slurp appending various terms from the linked site to the URL. The behavior you describe is typical, I think, of Slurp. It just makes me cringe when I read their “explanation” as to why they are acting like such scumbags. You would think that an organization as large as Yahoo (or whoever owns them) would be able to run a clean, consistent crawl without the shady tactics.
