Yahoo! Lies about Obeying Robots.txt Directives
Published Sunday, November 16, 2008 @ 9:36 am • 10 Responses
There are two possibilities here: either Yahoo!’s Slurp crawler is broken, or Yahoo! lies about obeying robots.txt directives. Neither is good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:
74.6.22.102
[2008-10-06 (Mon) 15:37:31] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 74.6.0.0 - 74.6.255.255
CIDR: 74.6.0.0/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName: Network Abuse
OrgAbusePhone: +1-408-349-3300
OrgAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgTechHandle: NA258-ARIN
OrgTechName: Netblock Admin
OrgTechPhone: +1-408-349-3300
OrgTechEmail: rauschen@yahoo-inc.com
# ARIN WHOIS database, last updated 2008-10-05 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
...and another later that month:
74.6.22.112
[2008-10-28 (Tue) 01:10:39] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 74.6.0.0 - 74.6.255.255
CIDR: 74.6.0.0/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName: Network Abuse
OrgAbusePhone: +1-408-349-3300
OrgAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgTechHandle: NA258-ARIN
OrgTechName: Netblock Admin
OrgTechPhone: +1-408-349-3300
OrgTechEmail: jluster@yahoo-inc.com
# ARIN WHOIS database, last updated 2008-10-27 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
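A user-agent string is easy to fake, so it is worth confirming that both logged addresses really sit inside the netblock that WHOIS attributes to Inktomi/Yahoo!. The following is a minimal Python sketch of that check. It uses only the two IP addresses and the CIDR shown in the log entries above; the helper name `in_netblock` is made up for illustration.

```python
import ipaddress

# CIDR copied from the WHOIS output in the log entries above.
INKTOMI_BLOCK = ipaddress.ip_network("74.6.0.0/16")

# The two addresses recorded by the blackhole trap.
LOGGED_IPS = ["74.6.22.102", "74.6.22.112"]


def in_netblock(ip: str, block: ipaddress.IPv4Network = INKTOMI_BLOCK) -> bool:
    """Return True if the address falls inside the given network block."""
    return ipaddress.ip_address(ip) in block


if __name__ == "__main__":
    for ip in LOGGED_IPS:
        print(ip, in_netblock(ip))
```

Both addresses fall within 74.6.0.0/16, which is consistent with the requests coming from Yahoo!'s own network rather than from an impostor borrowing the Slurp user-agent.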
Come on, Yahoo! Program your spiders to behave according to your officially stated policy of adhering to the Robots Exclusion Standard:
Yahoo! Slurp obeys the Robot Exclusion Standard. Specifically, Yahoo! Slurp adheres to the 1996 Robots Exclusion Standard (RES). — source: http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html
Obviously this is not the case. Yahoo! needs to either fix their disobedient, broken Slurp crawlers, or else amend their policy to more accurately reflect reality:
Yahoo! Slurp may obey the Robot Exclusion Standard, unless we tell it otherwise or it gets confused about the meaning of various robots.txt directives. Either way, Yahoo! likes people to believe that Slurp adheres to the 1996 Robots Exclusion Standard (RES), even though it doesn’t always do so.
At least then I wouldn’t feel compelled to call them out every couple of months with new violations. </rant>
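For reference, here is roughly what "obeying" a robots.txt rule means in practice, sketched with Python's standard `urllib.robotparser`. The robots.txt content below is an assumption for illustration only (the actual file is not reproduced in this post), and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the site's real file is not shown in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blackhole/
"""


def is_allowed(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "Slurp" is the crawler's name token, as used in robots.txt records.
    print(is_allowed("Slurp", "/blackhole/"))
    print(is_allowed("Slurp", "/"))
```

Under these assumed rules, a compliant crawler would run a check like this before each request and never issue `GET /blackhole/` at all, which is exactly the request recorded in both log entries.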
Dialogue
10 Responses
November 16, 2008 at 12:08 pm
Hi. To give Yahoo! the benefit of the doubt you could say that the robots rule is to not have that directory put into the Yahoo! index. It might crawl it for statistical analysis purposes or spying (paranoid yet?) but it won’t add it to its index.
November 17, 2008 at 8:10 am
OK Jeff, I agree with you now. By any chance do you have any files in that folder that Yahoo! accessed?
November 17, 2008 at 8:32 am
Well, we always knew Yahoo!’s bot had issues :p … wouldn’t adding an .htaccess file to those folders, blocking all access barring exceptions, be a more ‘secure’ alternative?
Also, I noticed FeedBurner sends an email after every 3 posts instead of one :p Fix it! I want to read them ASAP!!!
Side issue: the trackback on this post. Have a look at the site… it seems to be generated using ‘ping crawl’ from bluehatseo.com … just make sure the site is ‘kosher’ if you want to allow it! :p
November 17, 2008 at 9:29 am
@Jeff No, I mean the subscribe function. Normally it sends me an email every time you publish a post, etc. … this time it sent one email with three posts, i.e.:
Fruit Loop: Separate any Number of Odd and Even Posts from any Category in WordPress
Yahoo! Lies about Obeying Robots.txt Directives
Three Years and Counting
Maybe they were published on the same date… too lazy to check :p
November 17, 2008 at 9:46 am
@Jeff Email… easier to read on the move… if they were all published yesterday, that explains it then! I’m used to your one-every-few-days posting schedule :p
1 • Sue
November 16, 2008 at 10:21 am
Great. And Slurp is also the biggest draw on resources. I’ve added directives to my robots.txt with crawl delays to try and combat this, to stay out of certain parts of my site, and more. So it’s all been for naught if Slurp is so easily confused. *Sigh*