Yahoo! Slurp too Stupid to be a Robot

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our websites every minute of the day.

For the most part, effective methods exist for protecting our sites against the endless hordes of irrelevant and mischievous bots. Such evil is easily blocked with virtually zero side effects because the bots themselves are simply irrelevant.

But what about bad bots that aren’t exactly irrelevant, such as Yahoo’s mindless Slurp crawler? By failing to obey the robots.txt protocol as promised, Yahoo’s Slurp clearly falls into the “bad-bot” category. Unlike typical “nonsense” bots, Slurp is not exactly irrelevant (yet), so simply blocking it is not a reasonable solution.

And Yahoo must know this. Why else would they allow their Slurp software to flagrantly disobey robots.txt directives? Yahoo certainly benefits from proclaiming standards compliance, wherein they front credibility by claiming adherence to the same guidelines as industry leaders such as Google and Microsoft. I have never seen (nor heard of) a single instance of either googlebot (Google’s web crawler) or msnbot (Microsoft’s web crawler) appearing in locations forbidden by robots.txt directives.

So what’s up, Yahoo? There are only two possibilities here: Slurp is disobeying either erroneously or intentionally. Neither case looks good for Slurp’s master, Yahoo. Either there is an error in the Slurp software that is causing Slurp to roam around like a drunken sailor, or the software is correct and Slurp is behaving exactly as it has been directed. If the problem is an error, you would think that Yahoo would have been able to get a handle on it after a few days, months, or even years. If the problem is that complex or unsolvable, then Slurp should be retired immediately. Nobody benefits from a stupid web crawler: not you, not me, and certainly not Yahoo.

On the other hand, if there is no problem with Slurp’s ability to obey its own programming, then the programming must be instructing Slurp to disobey robots.txt directives. This, of course, is an even worse scenario than if Slurp were simply malfunctioning. Yahoo would then be guilty of lying to users, webmasters, and shareholders by claiming to obey the rules while secretly programming Slurp to disobey them. Hopefully, this is not the case and this whole mess is easily explained by the simple fact that Yahoo’s Slurp is too stupid to be a robot.

What do you think? Why does Slurp continue to disobey the clearly stated and agreed-upon robots.txt directives? Is it because Slurp is broken or because it has been told to do so?

Log entries showing Yahoo’s Slurp crawler accessing forbidden directories

At the bottom of the source code of my current theme, I include the following markup:

<!-- Warning: please do NOT follow the next link ("Welcome to the Blackhole") or you may be banned from this site -->
<div style="display:none;">
	<a href="http://perishablepress.com/blackhole/" title="Welcome to the Blackhole" rel="nofollow">Attention: Do NOT follow this link!</a>
</div>

Then, in my site’s robots.txt file, I include the following directives:

User-agent: *
Disallow: */blackhole/*
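
To make the intent of that rule concrete, here is a minimal sketch of how a crawler might evaluate it. It assumes the crawler treats `*` as a wildcard (an extension beyond plain prefix matching) and a trailing `$` as an end-of-path anchor. The helper names `rule_to_regex` and `is_disallowed` are hypothetical, invented for this illustration, and are not taken from any real crawler.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule with wildcards into a regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL path (the common wildcard extension).
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal chunk, then join the chunks with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any non-empty Disallow rule matches from the start of path."""
    return any(
        rule_to_regex(rule).match(path) is not None
        for rule in disallow_rules
        if rule  # an empty Disallow value means "nothing is disallowed"
    )


# The single rule from the robots.txt shown above.
RULES = ["*/blackhole/*"]

for path in ("/blackhole/", "/blackhole/index.php", "/about/"):
    print(path, is_disallowed(path, RULES))
```

Under those assumptions, any URL path containing `/blackhole/` is off limits, including the directory itself.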

Taken together, the message is clear: stay OUT of my blackhole, especially if you are a robot. As simple and clear as it gets, right? Google certainly gets it, and so does good ol’ MSN. Yet somehow, Yahoo’s stupid Slurp crawler can’t seem to figure it out. Consider the following entries taken from the access log of my blackhole directory:

Yahoo Slurp disobeys robots.txt directives on November 19th, 2008

72.30.81.166
[2008-11-19 (Wed) 12:06:09] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-11-18 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
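
The WHOIS lookup above ties the requesting address to a netblock registered to Inktomi Corporation, with Yahoo name servers. The same membership check can be sketched with Python's standard `ipaddress` module. The two networks below are copied from the CIDR lines of the WHOIS records in this post; the list name `INKTOMI_BLOCKS` and the function `in_listed_blocks` are hypothetical, and `192.0.2.10` is a documentation-only address used as a negative example.

```python
import ipaddress

# Netblocks copied from the "CIDR:" lines of the WHOIS records in this post.
INKTOMI_BLOCKS = [
    ipaddress.ip_network("72.30.0.0/16"),
    ipaddress.ip_network("74.6.0.0/16"),
]


def in_listed_blocks(ip: str) -> bool:
    """Return True if the address falls inside one of the listed netblocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INKTOMI_BLOCKS)


for ip in ("72.30.81.166", "74.6.17.165", "192.0.2.10"):
    print(ip, in_listed_blocks(ip))
```

This only confirms that an address sits inside a registered range; it is not a substitute for the full WHOIS record.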

Yahoo Slurp disobeys robots.txt directives on December 12th, 2008

74.6.17.165
[2008-12-12 (Fri) 01:08:23] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  abechtel@inktomi.com

# ARIN WHOIS database, last updated 2008-12-11 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt directives on January 5th, 2009

72.30.142.217
[2009-01-05 (Mon) 08:20:15] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-01-04 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt directives on February 25th, 2009

72.30.79.95
[2009-02-25 (Wed) 01:38:02] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-02-24 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt directives on March 3rd, 2009

72.30.65.54
[2009-03-03 (Tue) 18:26:43] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-02 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt directives on March 3rd, 2009 (again)

72.30.65.54
[2009-03-03 (Tue) 18:26:43] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-02 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt directives on March 10th, 2009

72.30.65.54
[2009-03-10 (Tue) 05:34:38] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-09 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt directives on March 15th, 2009

72.30.142.167
[2009-03-15 (Sun) 06:41:23] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  abechtel@inktomi.com

# ARIN WHOIS database, last updated 2009-03-14 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

As you can see, despite the clear robots.txt directives, and despite the fact that every respectable web crawler manages to obey them, my infamous blackhole directory is a regular destination for the disobedient Yahoo Slurp crawler. This behavior is not only disrespectful to the entire online community, but it also makes Yahoo look incompetent, dishonest, or both.
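
For anyone who wants to tally this kind of evidence from their own trap directory, here is a rough sketch that parses entries in the three-line format shown above (IP address, bracketed timestamp with the request line, then the user agent) and counts Slurp hits per month. The regular expression and the names `ENTRY`, `parse_entries`, and `SAMPLE` are assumptions made for this illustration; the sample text simply repeats the first two entries from this post.

```python
import re
from collections import Counter
from datetime import datetime

# One entry = IP line, "[date (Day) time] "request"" line, user-agent line.
ENTRY = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\n'
    r'\[(?P<date>\d{4}-\d{2}-\d{2}) \(\w{3}\) (?P<time>\d{2}:\d{2}:\d{2})\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*"\n'
    r'(?P<agent>.+)$',
    re.MULTILINE,
)


def parse_entries(log_text):
    """Yield one dict per log entry found in the text."""
    for m in ENTRY.finditer(log_text):
        stamp = m["date"] + " " + m["time"]
        yield {
            "ip": m["ip"],
            "when": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
            "path": m["path"],
            "agent": m["agent"],
        }


SAMPLE = """\
72.30.81.166
[2008-11-19 (Wed) 12:06:09] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

74.6.17.165
[2008-12-12 (Fri) 01:08:23] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
"""

# Keep only requests whose user agent identifies itself as Slurp.
slurp_hits = [e for e in parse_entries(SAMPLE) if "Yahoo! Slurp" in e["agent"]]
hits_by_month = Counter(e["when"].strftime("%Y-%m") for e in slurp_hits)
print(len(slurp_hits), dict(hits_by_month))
```

Keep in mind that a user-agent string can be spoofed, which is why each entry above is paired with a WHOIS lookup of the requesting address.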