Yahoo! Lies about Obeying Robots.txt Directives

Post #629 categorized as Websites, last updated on Nov 17, 2008
Tagged with blacklist, ip, robots, search, security, server, spider, yahoo

There are two possibilities here: Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying Robots directives. Either case isn’t good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:

74.6.22.102
[2008-10-06 (Mon) 15:37:31] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:    
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-10-05 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

...and another, recorded later that month:

74.6.22.112
[2008-10-28 (Tue) 01:10:39] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  jluster@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-10-27 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
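
Since a user-agent string is trivial to spoof, the WHOIS records are what tie these requests to Yahoo!. As a minimal sketch (not part of the trap setup described in this post), the check that both logged addresses fall inside the 74.6.0.0/16 block reported by ARIN can be reproduced with Python's standard ipaddress module. The helper name in_block is hypothetical; the addresses and the CIDR range are copied from the log entries and WHOIS output above.

```python
import ipaddress

# CIDR block copied from the ARIN WHOIS output above.
YAHOO_BLOCK = ipaddress.ip_network("74.6.0.0/16")

# Addresses copied from the two blackhole log entries above.
TRAPPED_IPS = ["74.6.22.102", "74.6.22.112"]


def in_block(ip, block=YAHOO_BLOCK):
    """Return True if the address falls inside the given network.

    Hypothetical helper for illustration only.
    """
    return ipaddress.ip_address(ip) in block


for ip in TRAPPED_IPS:
    # Report whether each trapped address belongs to the Inktomi/Yahoo! block.
    print(ip, in_block(ip))
```

Both addresses fall inside the block, which is consistent with the requests coming from Yahoo!'s own network rather than from an impostor borrowing the Slurp user-agent string.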

Come on, Yahoo! Program your spiders to behave according to your officially stated policy of adherence to the Robots Exclusion Standard:

Yahoo! Slurp obeys the Robot Exclusion Standard. Specifically, Yahoo! Slurp adheres to the 1996 Robots Exclusion Standard (RES). — source: http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html

Obviously this is not the case. Yahoo! needs to either fix their disobedient, broken Slurp crawlers, or else amend their policy to more accurately reflect reality:

Yahoo! Slurp may obey the Robot Exclusion Standard, unless we tell it otherwise or it gets confused about the meaning of various robots.txt directives. Either way, Yahoo! likes people to believe that Slurp adheres to the 1996 Robots Exclusion Standard (RES), even though it doesn’t always do so.

At least then I wouldn’t feel compelled to call them out every couple of months with new violations. </rant>
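
To make the stated policy concrete, here is a rough sketch of what obeying the standard would mean for the trap directory, using Python's standard urllib.robotparser. The robots.txt rules below are an assumption (the actual file for this site is not shown in this post), and may_fetch is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# ASSUMED rules: the real robots.txt for this site is not shown in the post.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /blackhole/
"""


def may_fetch(user_agent, path, robots_txt=ASSUMED_ROBOTS_TXT):
    """Return True if the rules permit user_agent to request path.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(may_fetch("Slurp", "/blackhole/"))  # False under the assumed rules
print(may_fetch("Slurp", "/"))            # True under the assumed rules
```

Under rules like these, a crawler that honors the standard would never request /blackhole/ at all, yet the log entries above show Slurp doing exactly that.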

10 Responses

#1 Sue

Great. And Slurp is also the biggest draw on resources. I’ve added directives to my robots.txt with crawl delays to try and combat this, to stay out of certain parts of my site, and more. So it’s all been for naught if Slurp is so easily confused. *Sigh*

#2 Shadow Caster

Hi. To give Yahoo! the benefit of the doubt you could say that the robots rule is to not have that directory put into the Yahoo! index. It might crawl it for statistical analysis purposes or spying (paranoid yet?) but it won’t add it to its index.

#3 Jeff Starr

@Shadow Caster: the Robots Exclusion Protocol “is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable,” according to the relevant entry at Wikipedia. So, to “give Yahoo! the benefit of the doubt” you would have to change the meaning, purpose, and functionality of the entire Robots Exclusion Protocol. Perhaps you were thinking of the meta noindex attribute?

#4 Shadow Casters

OK Jeff, I agree with you now. By any chance do you have any files in that folder that Yahoo! accessed?

#5 Donace

Well, we always knew Yahoo!’s bot had issues :p… wouldn’t adding an .htaccess file to those folders, blocking all access barring exceptions, be a more ‘secure’ alternative?

Also, I noticed Feedburner throws an email after every 3 posts instead of one :p Fix it! I want to read them asap!!!

Side issue: have a look at the site behind the trackback on this post… it seems to be generated using ‘ping crawl’ from bluehatseo.com… just make sure the site is ‘kosher’ if you want to allow it! :p

#6 Jeff Starr

@Shadow Casters: yes, but only an index.php file that tells the visitor that they have “fallen into a trap” and displays their IP and other server information. Yet this file is completely isolated and accessible from only the pages on this site..

@Donace: indeed, htaccess could easily prevent Slurp from treading into the forbidden realms, but there is nothing there to protect, really. The whole setup is just a big “spider trap” to catch bad bots — strictly for my own amusement! ;)

About your second point, I am not sure what you mean about Feedburner throwing emails.. Perhaps you refer to the “Subscribe to Comments” feature available for post comments? If so, I hope everything is working correctly, the plugin is supposed to send an email for every comment left on a post.

Also, thanks for the catch on the ping/crawl trackback — flushed! ;)

#7 Donace

@Jeff: no, I mean the subscribe function. Normally it sends me an email every time you publish a post, etc… this time it sent one email with three posts, i.e.:

Fruit Loop: Separate any Number of Odd and Even Posts from any Category in WordPress

Yahoo! Lies about Obeying Robots.txt Directives

Three Years and Counting

Maybe they were published on the same date… too lazy to check :p

#8 Jeff Starr

@Donace: Ah, I see what you mean.. yes, all three of those posts were published on the same day (yesterday). Are you getting updates via Feed or email?

#9 Donace

@Jeff: Email… easier to read on the move… if they were all published yesterday, that explains it then! I’m used to your one-every-few-days posting schedule :p

#10 Jeff Starr

@Donace: ah, cool — I won’t worry about it then.. I guess I shouldn’t get so ambitious and stick to only a couple posts per week! ;)
