Yahoo! Lies about Obeying Robots.txt Directives
Post #629 categorized as Websites, last updated on Nov 17, 2008
Tagged with blacklist, ip, robots, search, security, server, spider, yahoo
There are two possibilities here: either Yahoo!’s Slurp crawler is broken, or Yahoo! lies about obeying robots.txt directives. Neither is good. Slurp just can’t seem to keep its nose out of my private business, and, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:
74.6.22.102
[2008-10-06 (Mon) 15:37:31] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 74.6.0.0 - 74.6.255.255
CIDR: 74.6.0.0/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName: Network Abuse
OrgAbusePhone: +1-408-349-3300
OrgAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgTechHandle: NA258-ARIN
OrgTechName: Netblock Admin
OrgTechPhone: +1-408-349-3300
OrgTechEmail: rauschen@yahoo-inc.com
# ARIN WHOIS database, last updated 2008-10-05 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
...and another from later that month:
74.6.22.112
[2008-10-28 (Tue) 01:10:39] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 74.6.0.0 - 74.6.255.255
CIDR: 74.6.0.0/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName: Network Abuse
OrgAbusePhone: +1-408-349-3300
OrgAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgTechHandle: NA258-ARIN
OrgTechName: Netblock Admin
OrgTechPhone: +1-408-349-3300
OrgTechEmail: jluster@yahoo-inc.com
# ARIN WHOIS database, last updated 2008-10-27 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
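Both WHOIS records list the same netblock, 74.6.0.0/16. As a quick sanity check, here is a minimal Python sketch confirming that the two logged addresses really do fall inside that range. The helper name `in_block` and the constant names are made up for illustration; only the IPs and the CIDR come from the records above.

```python
import ipaddress

# CIDR taken from the WHOIS records above (NetName: INKTOMI-BLK-6).
YAHOO_BLOCK = ipaddress.ip_network("74.6.0.0/16")

# The two addresses recorded by the blackhole trap.
LOGGED_IPS = ["74.6.22.102", "74.6.22.112"]


def in_block(ip, block=YAHOO_BLOCK):
    """Return True if the dotted-quad string `ip` lies inside `block`."""
    return ipaddress.ip_address(ip) in block


for ip in LOGGED_IPS:
    print(ip, "inside" if in_block(ip) else "outside", YAHOO_BLOCK)
```

This only shows that the requests came from address space registered to Yahoo!; it says nothing about why the crawler went where it was told not to go.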
Come on, Yahoo! Program your spiders to behave according to your officially stated policy of adherence to the Robots Exclusion Standard:
Yahoo! Slurp obeys the Robot Exclusion Standard. Specifically, Yahoo! Slurp adheres to the 1996 Robots Exclusion Standard (RES). — source: http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html
Obviously this is not the case. Yahoo! needs to either fix their disobedient, broken Slurp crawlers, or else amend their policy to more accurately reflect reality:
Yahoo! Slurp may obey the Robot Exclusion Standard, unless we tell it otherwise or it gets confused about the meaning of various robots.txt directives. Either way, Yahoo! likes people to believe that Slurp adheres to the 1996 Robots Exclusion Standard (RES), even though it doesn’t always do so.
At least then I wouldn’t feel compelled to call them out every couple of months with new violations. </rant>
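For anyone who wants to check how a compliant crawler should read their own rules, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt rules below are hypothetical, since the actual file for this site isn't reproduced in this post, and `is_allowed` is just an illustrative helper name. The user-agent string is the one from the log entries above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: assumes the trap directory is disallowed for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blackhole/
"""

# User-agent string exactly as it appears in the trap's log.
SLURP_UA = (
    "Mozilla/5.0 (compatible; Yahoo! Slurp; "
    "http://help.yahoo.com/help/us/ysearch/slurp)"
)


def is_allowed(robots_txt, user_agent, path):
    """Return True if `user_agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print("/blackhole/ allowed:", is_allowed(ROBOTS_TXT, SLURP_UA, "/blackhole/"))
print("/about/ allowed:", is_allowed(ROBOTS_TXT, SLURP_UA, "/about/"))
```

Under these assumed rules, a crawler that honors the standard has no business requesting /blackhole/ at all, which is exactly why a hit in the trap's log counts as a violation.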
#1 — Sue
Great. And Slurp is also the biggest draw on resources. I’ve added directives to my robots.txt to set crawl delays to combat this, to keep it out of certain parts of my site, and more. So it’s all been for naught if Slurp is so easily confused. *Sigh*