Perishable Press

Yahoo! Slurp too Stupid to be a Robot

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our sites every minute of the day.

For the most part, effective methods are available to protect our sites against the endless hordes of irrelevant and mischievous bots. Such evil is easily blocked with virtually zero side effects because these bots are simply irrelevant.

But what about bad bots that aren’t exactly irrelevant, such as Yahoo’s mindless Slurp crawler? By failing to obey the robots.txt protocol as promised, Yahoo’s Slurp clearly falls into the “bad-bot” category. Unlike typical “nonsense” bots, Slurp is not exactly irrelevant (yet), so simply blocking it is not a reasonable solution.

And Yahoo must know this. Why else would they allow their Slurp software to flagrantly disobey robots.txt directives? Yahoo certainly benefits from proclaiming standards compliance, wherein they front credibility by claiming adherence to the same guidelines as industry leaders such as Google and Microsoft. I have never seen (nor heard of) a single instance of either googlebot (Google’s web crawler) or msnbot (Microsoft’s web crawler) appearing in locations forbidden by robots.txt directives.

So what’s up, Yahoo? There are only two possibilities here: Slurp is disobeying either erroneously or intentionally. Neither case looks good for Slurp’s master, Yahoo. Either there is an error in the Slurp software that is causing Slurp to roam around like a drunken sailor, or the software is correct and Slurp is behaving exactly as it has been directed. If the problem is an error, you would think that Yahoo would have been able to get a handle on it after a few days, months, or even years. If the problem is that complex or unsolvable, then Slurp should be retired immediately. Nobody benefits from a stupid web crawler: not you, not me, and certainly not Yahoo.

On the other hand, if there is no problem with Slurp’s ability to obey its own programming, then the programming must be instructing Slurp to disobey robots.txt directives. This of course is an even worse case scenario than if Slurp were simply malfunctioning. Yahoo would then be guilty of lying to users, webmasters, and shareholders by claiming to obey the rules while secretly programming Slurp to disobey them. Hopefully, this is not the case and this whole mess is easily explained by the simple fact that Yahoo’s Slurp is too stupid to be a robot.

What do you think? Why does Slurp continue to disobey the clearly stated and agreed-upon robots.txt directives? Is it because Slurp is broken or because it has been told to do so?

Log entries showing Yahoo’s Slurp crawler accessing forbidden directories

At the bottom of the source code of my current theme, I include the following markup:

<!-- Warning: please do NOT follow the next link ("Welcome to the Blackhole") or you may be banned from this site -->
<div style="display:none;">
	<a href="https://perishablepress.com/blackhole/" title="Welcome to the Blackhole" rel="nofollow">Attention: Do NOT follow this link!</a>
</div>

Then, in my site’s robots.txt file, I include the following directives:

User-agent: *
Disallow: */blackhole/*
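
For illustration, here is a minimal sketch of how a wildcard-aware crawler might evaluate that rule. It assumes the common wildcard extension, in which `*` matches any run of characters and a trailing `$` anchors the end of the path; the original robots.txt convention described plain prefix matching only. The function names `rule_to_regex` and `is_disallowed` are hypothetical and are not taken from any crawler's actual code.

```python
import re
from typing import Iterable


def rule_to_regex(rule_path: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a regular expression.

    Assumed semantics (wildcard extension, not the original convention):
    '*' matches any run of characters, a trailing '$' anchors the end of
    the URL path, and everything else is literal text matched from the
    start of the path.
    """
    anchored_end = rule_path.endswith("$")
    if anchored_end:
        rule_path = rule_path[:-1]
    # Escape the literal pieces, then rejoin them with '.*' where '*' stood.
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored_end else ""))


def is_disallowed(url_path: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if any Disallow rule matches the given URL path."""
    return any(rule_to_regex(rule).match(url_path) for rule in disallow_rules)


# The Disallow value from the robots.txt directives above.
RULES = ["*/blackhole/*"]

print(is_disallowed("/blackhole/", RULES))  # True
print(is_disallowed("/about/", RULES))      # False
```

Under these assumed semantics, any path containing `/blackhole/` is off limits, while the rest of the site stays crawlable.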

Taken together, the message is clear: stay OUT of my blackhole, especially if you are a robot. As simple and clear as it gets, right? Google certainly gets it, and so does good ol’ MSN. Yet somehow, Yahoo’s stupid Slurp crawler can’t seem to figure it out. Consider the following entries taken from the access log of my blackhole directory:

Yahoo Slurp disobeys robots.txt on November 19th, 2008

72.30.81.166
[2008-11-19 (Wed) 12:06:09] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-11-18 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on December 12th, 2008

74.6.17.165
[2008-12-12 (Fri) 01:08:23] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  abechtel@inktomi.com

# ARIN WHOIS database, last updated 2008-12-11 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on January 5th, 2009

72.30.142.217
[2009-01-05 (Mon) 08:20:15] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-01-04 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on February 25th, 2009

72.30.79.95
[2009-02-25 (Wed) 01:38:02] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-02-24 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on March 3rd, 2009

72.30.65.54
[2009-03-03 (Tue) 18:26:43] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-02 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on March 3rd, 2009 (again)

72.30.65.54
[2009-03-03 (Tue) 18:26:43] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-02 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on March 10th, 2009

72.30.65.54
[2009-03-10 (Tue) 05:34:38] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-09 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on March 15th, 2009

72.30.142.167
[2009-03-15 (Sun) 06:41:23] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  abechtel@inktomi.com

# ARIN WHOIS database, last updated 2009-03-14 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

As you can see, despite the clear robots.txt directives, and despite the fact that every respectable web crawler manages to obey them, my infamous blackhole directory is a regular destination for the disobedient Yahoo Slurp crawler. This behavior is not only disrespectful to the entire online community, but it makes Yahoo look either incompetent, dishonest, or both.
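
The WHOIS records above place each logged address in one of two Inktomi netblocks. As a rough sketch, that membership check can be automated with Python's standard `ipaddress` module. The CIDR blocks and addresses below are copied from the entries above; the names `INKTOMI_BLOCKS`, `LOGGED_IPS`, and `in_known_blocks` are hypothetical, and the allocations reflect the 2008–2009 records shown here and may have changed since.

```python
import ipaddress

# CIDR blocks copied from the WHOIS records above
# (INKTOMI-BLK-5 and INKTOMI-BLK-6).
INKTOMI_BLOCKS = [
    ipaddress.ip_network("72.30.0.0/16"),
    ipaddress.ip_network("74.6.0.0/16"),
]

# Client addresses copied from the blackhole log entries above.
LOGGED_IPS = [
    "72.30.81.166",
    "74.6.17.165",
    "72.30.142.217",
    "72.30.79.95",
    "72.30.65.54",
    "72.30.142.167",
]


def in_known_blocks(ip: str) -> bool:
    """Return True if the address falls inside one of the listed netblocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in INKTOMI_BLOCKS)


for ip in LOGGED_IPS:
    print(ip, in_known_blocks(ip))
```

A netblock match is only one piece of evidence. User-agent strings are trivially spoofed, so the address check is what ties a request to the network owner rather than to whoever claims to be Slurp.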

Jeff Starr
About the Author Jeff Starr = Web Developer. Book Author. Secretly Important.
25 responses
  1. Jeff Starr

    Yes, that’s the whole point right there: this is not the kind of behavior that people would expect from one of the “big three” search engines. But the sad truth of the matter is that Yahoo Slurp is definitely and verifiably demonstrating malicious behavior. The verification comes from doing a forward/reverse IP/Host lookup. I wrote an article describing this technique not too long ago.

    Also, bummer about your site getting hacked. Did you know that there is a “4G” version of the blacklist available? In any case, keep fighting the good fight against those nefarious cracker scumbags.

  2. holy mother! i was in the blackhole, please don’t ban me from this site.

    btw: hopefully after switching yahoo! to the bing algorithm, this incompetence/impudence will come to an end.

  3. Jeff Starr

    @Emanuel: Lol! I think you’re safe — let me know if you get locked out :)

  4. Hi all, glad (not really) to see many have my same problem.
    The fact is that I don’t understand much; I just know that my bandwidth is all going to Slurp, yet I told him to stay out!!

    He comes in every day and there goes 4.38GB.

    Well, just wanted to say thanks for your info.

  5. Jeff Starr

    @Sara: That seems excessive even for Slurp. If Yahoo is really chewing up over four gigs of bandwidth every visit, you may want to weigh the pros and cons of blacklisting them. How much traffic are they sending to your site? Hardly anything here.

  6. Oh, you are right. My hosting service explained to me that the 4.35GB is per month,
    but it is true that I told him to stay out and he comes in anyway.
    I told them they are not respecting the robots.txt.
    I’m still waiting for an answer.
    Well, for the traffic they are sending I’m not sure, but I do have many pages on Yahoo search now.

    thank you for your reply.

  7. Jeff Starr

    Oh per month, you mean for the entire bandwidth usage, not just for Slurp, correct? If so, that seems more reasonable, but I would still keep an eye on how much bandwidth and resources Slurp is using — I have heard scary reports that would surprise most people.

    It’s good that you reported the issue to them, although I can tell you that they know about it. I have posted many times about Yahoo’s flagrant lying about obeying the robots.txt directives. Also, remember that indexed pages are worthless in an engine that nobody is using.

  8. No, per month only for Slurp.

    (It is a lot more than Google, and I get visits from Google.)

    This is what they say:

    Hello Sara,

    Thank you for writing to Yahoo! Search.

    I understand that you’d like to prevent Yahoo! Slurp from crawling your
    website. I’d be happy to look into this for you.

    The robots.txt file you submitted needs a space after the colon as
    follows:

    User-agent: Slurp
    Disallow: /

    although this will not block Slurp immediately, it will take a few days.
    Please correct the robots.txt file and then within a few days the
    crawling should stop. We will continue checking the robots.txt file for
    changes so we may still access your server, but we shouldn’t crawl any
    file other than the robots.txt after that point.

    We hope this information has helped. Please let us know if our answer
    did not resolve your issue and if you need further assistance.

    Thank you again for contacting Yahoo! Search.

    The strange thing is that I always copy and paste, so the space should have been there.
    So let’s wait and see.

  9. Jeff Starr

    ..but we shouldn’t crawl any file other than the robots.txt after that point.

    Notice the word “shouldn’t” there? That tells me that these guys know what Slurp is doing but refuse to do anything about it. In other words, saying that Slurp “will not” would’ve been more reassuring (even if it is a complete lie), but they can’t say that, so they don’t.

    Then accusing you of an error in your syntax.. that’s exactly how they responded to one of my posts here in the past. Honestly, I have tried every possible way of writing the robots.txt and scrutinized everything, and have waited for months for Slurp to “get it” and begin to do what it’s told. But no dice — it doesn’t matter how perfect your robots.txt directives are, because Slurp does whatever Yahoo wants anyway.

    Thanks for sharing the feedback from Yahoo. Let us know if the new rules are effective at keeping out that nasty Slurp bot.

  10. @ Fuzion

    any chance you can send me a link for this file

    /banme/index.php being a standard “add ip to ../.htaccess deny list” script

    cheers

    jon

  11. After seeing Slurp download the same image 39 times in one day, I decided to see if others were having the same problem. I found articles from 2007 complaining about this issue, and here it is 4 years later and it’s still a problem. I hate Yahoo with a vengeance, they block my email saying I’m a spammer, while much of the spam I get is from Yahoo’s servers. I’m pretty close to just blocking them completely like you did.

  12. Jeff Starr

    Hey Matt, great article you posted at http://bit.ly/qIHIYh – amazing that after how many years Yahoo still can’t control their stupid bots. I did my best to post numerous articles about it to get their attention, but apparently all in vain. It’s good to see someone else speaking up about Slurp’s intentional or ignorant behavior, which in either case is a total failure.

[ Comments are closed for this post ]