Perishable Press

Yahoo! Slurp too Stupid to be a Robot

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our sites every minute of the day.

For the most part, effective methods are available for protecting our sites against the endless hordes of irrelevant and mischievous bots. Such bots are easily blocked with virtually zero side effects because their presence is simply irrelevant.

But what about bad bots that aren’t exactly irrelevant, such as Yahoo’s mindless Slurp crawler? By disobeying the robots.txt protocol that it promises to obey, Yahoo’s Slurp clearly falls into the “bad-bot” category. Unlike typical “nonsense” bots, Slurp is not exactly irrelevant (yet), so simply blocking it is not a reasonable solution.

And Yahoo must know this. Why else would they allow their Slurp software to flagrantly disobey robots.txt directives? Yahoo certainly benefits from proclaiming standards compliance, wherein they front credibility by claiming adherence to the same guidelines as industry leaders such as Google and Microsoft. I have never seen (nor heard of) a single instance of either googlebot (Google’s web crawler) or msnbot (Microsoft’s web crawler) appearing in locations forbidden by robots.txt directives.

So what’s up, Yahoo? There are only two possibilities here: Slurp is disobeying either erroneously or intentionally. Either case does not look good for Slurp’s master, Yahoo. There is either an error in the Slurp software that is causing Slurp to roam around like a drunken sailor, or else the software is correct and Slurp is behaving exactly as it has been directed. If the problem is an error, you would think that Yahoo would have been able to get a handle on it after a few days, months, or even years. If the problem is that complex or unsolvable, then Slurp should be retired immediately. Nobody benefits from a stupid web crawler — not you, not me, and certainly not Yahoo.

On the other hand, if there is no problem with Slurp’s ability to obey its own programming, then the programming must be instructing Slurp to disobey robots.txt directives. This of course is an even worse case scenario than if Slurp were simply malfunctioning. Yahoo would then be guilty of lying to users, webmasters, and shareholders by claiming to obey the rules while secretly programming Slurp to disobey them. Hopefully, this is not the case and this whole mess is easily explained by the simple fact that Yahoo’s Slurp is too stupid to be a robot.

What do you think? Why does Slurp continue to disobey the clearly stated and agreed-upon robots.txt directives? Is it because Slurp is broken or because it has been told to do so?

Log entries showing Yahoo’s Slurp crawler accessing forbidden directories

At the bottom of the source code of my current theme, I include the following markup:

<!-- Warning: please do NOT follow the next link ("Welcome to the Blackhole") or you may be banned from this site -->
<div style="display:none;">
	<a href="https://perishablepress.com/blackhole/" title="Welcome to the Blackhole" rel="nofollow">Attention: Do NOT follow this link!</a>
</div>

Then, in my site’s robots.txt file, I include the following directives:

User-agent: *
Disallow: */blackhole/*
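
How a crawler reads a pattern like this depends on which matching rules it applies. The sketch below is only an illustration under stated assumptions: `prefix_match` and `wildcard_match` are hypothetical helper names, the first treats a Disallow value as a literal path prefix (the original robots.txt convention), and the second applies the wildcard extension in which `*` stands for any run of characters and a trailing `$` anchors the end of the path. It says nothing about how Slurp itself is implemented.

```python
import re


def prefix_match(rule: str, path: str) -> bool:
    """Original convention: the rule is a literal prefix of the URL path."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Wildcard extension: '*' matches any run of characters, '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then join them with '.*' where each '*' stood.
    regex = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, like a prefix rule does.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    path = "/blackhole/"
    for rule in ("/blackhole", "/blackhole/", "*/blackhole", "*/blackhole/*"):
        print(rule, "prefix:", prefix_match(rule, path), "wildcard:", wildcard_match(rule, path))
```

Under the prefix-only reading, a rule that begins with `*` never matches `/blackhole/`, so a crawler without wildcard support would treat that path as allowed unless a plain `Disallow: /blackhole/` line is also present.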

Taken together, the message is clear: stay OUT of my blackhole, especially if you are a robot. As simple and clear as it gets, right? Google certainly gets it, and so does good ol’ MSN. Yet somehow, Yahoo’s stupid Slurp crawler can’t seem to figure it out. Consider the following entries taken from the access log of my blackhole directory:

Yahoo Slurp disobeys robots.txt on November 19th, 2008

72.30.81.166
[2008-11-19 (Wed) 12:06:09] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-11-18 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on December 12th, 2008

74.6.17.165
[2008-12-12 (Fri) 01:08:23] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  abechtel@inktomi.com

# ARIN WHOIS database, last updated 2008-12-11 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on January 5th, 2009

72.30.142.217
[2009-01-05 (Mon) 08:20:15] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-01-04 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on February 25th, 2009

72.30.79.95
[2009-02-25 (Wed) 01:38:02] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-02-24 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on March 3rd, 2009

72.30.65.54
[2009-03-03 (Tue) 18:26:43] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-02 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on March 3rd, 2009 (again)

72.30.65.54
[2009-03-03 (Tue) 18:26:43] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-02 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on March 10th, 2009

72.30.65.54
[2009-03-10 (Tue) 05:34:38] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2009-03-09 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

Yahoo Slurp disobeys robots.txt on March 15th, 2009

72.30.142.167
[2009-03-15 (Sun) 06:41:23] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   72.30.0.0 - 72.30.255.255
CIDR:       72.30.0.0/16
NetName:    INKTOMI-BLK-5
NetHandle:  NET-72-30-0-0-1
Parent:     NET-72-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2005-01-28
Updated:    2005-10-19

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  abechtel@inktomi.com

# ARIN WHOIS database, last updated 2009-03-14 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

As you can see, despite the clear robots.txt directives, and despite the fact that every respectable web crawler manages to obey them, my infamous blackhole directory is a regular destination for the disobedient Yahoo Slurp crawler. This behavior is not only disrespectful to the entire online community, but it also makes Yahoo look incompetent, dishonest, or both.
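
The log entries above can be cross-checked mechanically: each client address should fall inside the CIDR block reported by its WHOIS record. The following is a minimal sketch using Python's standard `ipaddress` module. The addresses and blocks are copied from the entries above, while the names `INKTOMI_BLOCKS`, `LOGGED_IPS`, and `owning_block` are made up for this illustration. A match only shows that the address sits in a range registered to Inktomi/Yahoo; it does not by itself prove which software sent the request.

```python
import ipaddress

# CIDR blocks copied from the WHOIS output in the log entries above.
INKTOMI_BLOCKS = [
    ipaddress.ip_network("72.30.0.0/16"),
    ipaddress.ip_network("74.6.0.0/16"),
]

# Client addresses copied from the same log entries.
LOGGED_IPS = [
    "72.30.81.166",
    "74.6.17.165",
    "72.30.142.217",
    "72.30.79.95",
    "72.30.65.54",
    "72.30.142.167",
]


def owning_block(ip: str):
    """Return the listed network that contains ip, or None if there is none."""
    addr = ipaddress.ip_address(ip)
    for net in INKTOMI_BLOCKS:
        if addr in net:
            return net
    return None


if __name__ == "__main__":
    for ip in LOGGED_IPS:
        print(ip, "->", owning_block(ip))
```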

Jeff Starr
About the Author Jeff Starr = Web Developer. Book Author. Secretly Important.
25 responses
  1. So is there no way to stop them from doing this? Is the robots.txt file the only defense one has, albeit an on-your-honor defense?

    BTW, excellent website and excellent articles.

  2. */blackhole/* doesn’t look standard.

  3. Jeff Starr

    @Brad: That’s the whole dilemma, right there. Generally, dealing with disobedient bots and user agents is as simple as adding them to the blacklist. As one of the major search engines, however, it may not be in your site’s best interests to deny Yahoo access.

    @fuzion: Well, let’s break it down and see. Alphanumeric characters are obviously supported, as are forward slashes. The asterisks are actually wildcard characters, each of which represents any sequence of valid characters. Wildcards may not be standard, but Google and MSN support them, as does Yahoo (or so they claim).

  4. I’ve used the following for as long as my domain’s been up and haven’t had any issues:

    User-agent: *
    Disallow: /banme/

    /banme/index.php being a standard “add ip to ../.htaccess deny list” script

    /banme/index.php is added to the header in a similar way

    I use it on 5 domains (3 are writing to a shared root .htaccess) and get new bans almost daily.

    I saw on wiki that Disallow: * just isn’t recommended (for reasons not well explained). Perhaps it has something to do with drafts that allow for regex where the original standard does not.
    http://www.conman.org/people/spc/robots2.html#format.directives.disallow

  5. Jeff Starr

    Yeh, I have actually tried using a variety of different directives (at the same time) for my blackhole directory, including:

    Disallow: /blackhole
    Disallow: /blackhole/
    Disallow: */blackhole
    Disallow: */blackhole/*

    Unfortunately, Slurp still couldn’t manage to understand and/or obey them, and continually found itself visiting explicitly forbidden territory, as discussed here. Nevertheless, it should be emphasized that Yahoo! claims to support wildcards. But maybe that’s the issue here: perhaps Slurp doesn’t support wildcards after all.

  6. I noticed that your /blackhole/ directory displays:

    Please note that the following Whois data will be reviewed carefully. If it is determined that you suck, you will be banned from this site forever.

    followed by the WHOIS information. I was wondering how you obtained this information.

  7. Jeff Starr

    @Patrick: The WHOIS data is obtained by a script that queries the WHOIS database and echoes the formatted results to browser. I am thinking about sharing the technique with my readers, and probably will as soon as my book is finished.

  8. Maybe “Slurp is not exactly irrelevant (yet)” is wrong.

    My heaviest traffic site is a regional news blog doing 5 daily media releases and occasional in-depth, so Gbot is all over it like a rash (and, yep, slurp like a drunken sailor :0).

    Yahoo visitors are 1/10 of Live who are 1/100 of Google.

  9. Jeff Starr

    Good point, Phil. My stats are similar, with Yahoo traffic representing less than a fraction of a percentage of total visitors. Unless they pull a magic rabbit out of their hat, Yahoo may be joining the ranks of Ask et al.

  10. YES! Slurp has been crawling my site, but actually accessing files and folders not allowed by my robots.txt. I’d just as soon block it – it does no good for the site by completely disregarding the robots.txt file.

    I was waiting for an article on this to be posted!

    -Monkey

    P.S. First post to your site, but I must say I absolutely love it. The best articles I’ve read in a long while.

  11. Jeff Starr

    @Monkey: I am one step away from blacklisting them myself. Will I miss the three daily visitors they send? Maybe. But protection against the malicious Slurp crawler would be well worth the sacrifice. I’m glad you found the article useful!

    Also, thanks for the compliment on the site — it’s a labor of love, so I am glad you enjoy the content. :)

  12. Haha just maybe ;)

    And something tells me Slurp might actually be a malicious script that uses the name from a not originally created by Yahoo!. This blatant disobeying of the robots.txt is just too weird, and Yahoo, at least I would hope, would not be as intrusive as this on purpose. Perhaps a programmer just spaced on the universal rules of a robots.txt file? O.o

    -Monkey

    P.S. Yes, It’s my new favorite site! Especially since I’ve been hacked about 3 times in the past 2 days, customized 3G blacklist and all -.-” I’ll keep pounding away in a desperate attempt to stop them.
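
Comment 4 above describes a trap page that adds each visitor's IP address to an .htaccess deny list. A rough sketch of that idea follows, written in Python rather than the PHP the commenter mentions. The function name `ban_ip` is hypothetical, and the `Deny from` line is an assumption about the server setup: it is the older Apache access-control syntax, and the right directive depends on the Apache version and modules in use.

```python
import ipaddress
import tempfile
from pathlib import Path


def ban_ip(htaccess: Path, ip: str) -> bool:
    """Append 'Deny from <ip>' to the file unless it is already listed.

    Returns True if a line was added, False if the address was already banned.
    Raises ValueError if ip is not a valid IP address.
    """
    ip = str(ipaddress.ip_address(ip))  # validate before touching the file
    line = f"Deny from {ip}"
    text = htaccess.read_text() if htaccess.exists() else ""
    if line in text.splitlines():
        return False
    # Start on a fresh line even if the file lacks a trailing newline.
    prefix = "" if not text or text.endswith("\n") else "\n"
    with htaccess.open("a") as fh:
        fh.write(prefix + line + "\n")
    return True


if __name__ == "__main__":
    # Demonstrate against a throwaway file, never a live .htaccess.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / ".htaccess"
        print(ban_ip(target, "72.30.65.54"))
        print(target.read_text())
```

Validating the address before writing keeps arbitrary request data out of the server configuration, which matters because the trap is fed by untrusted visitors.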

[ Comments are closed for this post ]