Yahoo! Lies about Obeying Robots.txt Directives
by Jeff Starr on Sunday, November 16, 2008 – 13 Responses
There are two possibilities here: either Yahoo!’s Slurp crawler is broken, or Yahoo! lies about obeying Robots directives. Neither is good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:
74.6.22.102
[2008-10-06 (Mon) 15:37:31] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 74.6.0.0 - 74.6.255.255
CIDR: 74.6.0.0/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName: Network Abuse
OrgAbusePhone: +1-408-349-3300
OrgAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgTechHandle: NA258-ARIN
OrgTechName: Netblock Admin
OrgTechPhone: +1-408-349-3300
OrgTechEmail: rauschen@yahoo-inc.com
# ARIN WHOIS database, last updated 2008-10-05 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
..and another appearing later that month:
74.6.22.112
[2008-10-28 (Tue) 01:10:39] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 74.6.0.0 - 74.6.255.255
CIDR: 74.6.0.0/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName: Network Abuse
OrgAbusePhone: +1-408-349-3300
OrgAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgTechHandle: NA258-ARIN
OrgTechName: Netblock Admin
OrgTechPhone: +1-408-349-3300
OrgTechEmail: jluster@yahoo-inc.com
# ARIN WHOIS database, last updated 2008-10-27 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
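As a quick sanity check on the two records above, here is a minimal Python sketch confirming that both logged addresses fall inside the 74.6.0.0/16 block that ARIN lists for Inktomi. The IPs and the CIDR are copied from the log entries and WHOIS output; the helper name `in_block` is purely illustrative. It only shows that an address belongs to that netblock, nothing about why the request was made.

```python
import ipaddress

# Values copied from the trap log and the ARIN WHOIS records above.
YAHOO_BLOCK = ipaddress.ip_network("74.6.0.0/16")  # CIDR for INKTOMI-BLK-6
TRAPPED_IPS = ["74.6.22.102", "74.6.22.112"]


def in_block(ip: str, block=YAHOO_BLOCK) -> bool:
    """Return True if the address falls inside the given network block.

    Hypothetical helper for illustration; not part of the spider trap itself.
    """
    return ipaddress.ip_address(ip) in block


for ip in TRAPPED_IPS:
    print(ip, "inside", YAHOO_BLOCK, "->", in_block(ip))
```

A stricter check would also do a reverse DNS lookup on each address, since a user-agent string alone is trivial to fake.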
Come on, Yahoo! Program your spiders to behave according to your officially stated policy declaring adherence to the Robots Exclusion Standard:
Yahoo! Slurp obeys the Robot Exclusion Standard. Specifically, Yahoo! Slurp adheres to the 1996 Robots Exclusion Standard (RES). — source: http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html
Obviously this is not the case. Yahoo! needs to either fix their disobedient, broken Slurp crawlers, or else amend their policy to more accurately reflect reality:
Yahoo! Slurp may obey the Robot Exclusion Standard, unless we tell it otherwise or it gets confused about the meaning of various robots.txt directives. Either way, Yahoo! likes people to believe that Slurp adheres to the 1996 Robots Exclusion Standard (RES), even though it doesn’t always do so.
At least then I wouldn’t feel compelled to call them out every couple of months with new violations. </rant>
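For reference, here is a rough sketch of the check a compliant crawler is expected to make before requesting a URL, using Python's standard `urllib.robotparser`. The robots.txt content below is an assumption made for illustration (this post does not show the site's actual file), and `may_fetch` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a hypothetical robots.txt that bars every crawler from the trap
# directory. The site's real robots.txt is not reproduced in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blackhole/
"""


def may_fetch(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules allow this agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# "Slurp" stands in for the crawler's robots.txt user-agent token.
print("Slurp may fetch /blackhole/:", may_fetch("Slurp", "/blackhole/"))
print("Slurp may fetch /about/:", may_fetch("Slurp", "/about/"))
```

Under those assumed rules, a crawler that honors the standard would never request anything under /blackhole/, which is why a hit in the trap log counts as a violation.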






13 Responses
Sue – #1
Great. And Slurp is also the biggest draw on resources. I’ve added directives to my robots.txt with crawl delays to try and combat this, to stay out of certain parts of my site, and more. So it’s all been for naught if Slurp is so easily confused. *Sigh*
Shadow Caster – #2
Hi. To give Yahoo! the benefit of the doubt you could say that the robots rule is to not have that directory put into the Yahoo! index. It might crawl it for statistical analysis purposes or spying (paranoid yet?) but it won’t add it to its index.
Jeff Starr – #3
@Shadow Caster: the Robots Exclusion Protocol “is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable..” according to the relevant entry at Wikipedia. So, to “give Yahoo! the benefit of the doubt” you would have to change the meaning, purpose, and functionality of the entire Robots Exclusion Protocol. Perhaps you were thinking of the meta noindex attribute?
Shadow Casters – #4
OK Jeff, I agree with you now. By any chance do you have any files in that folder that Yahoo! accessed?
Donace – #5
Well, we always knew Yahoo’s bot had issues :p… wouldn’t adding a .htaccess in the folders blocking all access barring exceptions be a more ‘secure’ alternative?
Also, I noticed Feedburner throws an email after every 3 posts instead of one :p Fix it! I want to read them ASAP!!!
Sideline issue: the trackback on this post, have a look at the site… it seems to be generated using ‘ping crawl’ from bluehatseo.com… just make sure the site is ‘kosher’ if you want to allow it! :p
Jeff Starr – #6
@Shadow Casters: yes, but only an index.php file that tells the visitor that they have “fallen into a trap” and displays their IP and other server information. Yet this file is completely isolated and accessible from only the pages on this site..
@Donace: indeed, htaccess could easily prevent Slurp from treading into the forbidden realms, but there is nothing there to protect, really. The whole setup is just a big “spider trap” to catch bad bots, strictly for my own amusement! ;)
About your second point, I am not sure what you mean about Feedburner throwing emails.. Perhaps you refer to the “Subscribe to Comments” feature available for post comments? If so, I hope everything is working correctly, the plugin is supposed to send an email for every comment left on a post.
Also, thanks for the catch on the ping/crawl trackback — flushed! ;)
Donace – #7
@Jeff: no, I mean the subscribe function. Normally it sends me an email every time you publish a post etc… this time it sent one email with three posts, i.e.:
Fruit Loop: Separate any Number of Odd and Even Posts from any Category in WordPress
Yahoo! Lies about Obeying Robots.txt Directives
Three Years and Counting
Maybe they were published on the same date… too lazy to check :p
Jeff Starr – #8
@Donace: Ah, I see what you mean.. yes, all three of those posts were published on the same day (yesterday). Are you getting updates via Feed or email?
Donace – #9
@Jeff: Email… easier to read on the move… if they were all published yesterday, that explains it then! I’m used to your one-every-few-days posting schedule :p
Jeff Starr – #10
@Donace: ah, cool — I won’t worry about it then.. I guess I shouldn’t get so ambitious and stick to only a couple posts per week! ;)
allie – #11
I came across this page because I was wondering why this IP address is checking up on me: 67.195.111.221. All I know is that it comes from Sunnyvale, CA. I have a ZabaSearch account, and they inform me whenever someone searches for me online. I have a very rare name, so I know it’s me they’re looking at. I have no clue why they keep doing this. I know nothing about this stuff. Any help would be greatly appreciated. Also, how can I get a hold of them to find out what they want? I kinda wanna give them a piece of my mind. It’s making me a little nervous that they’re looking me up all the time. Thanks
Christina – #12
Me too, they have been constantly checking on me since last August. I think it is really creepy and I do not know anything about any of this stuff either, any help is appreciated.
allie – #13
Christina,
I contacted Yahoo about this yesterday and they acted like they didn’t know what I was talking about. So I explained to them in detail what was going on, and I haven’t heard anything back. Also, I’m not sure if you have noticed this, but you may if you have Yahoo Messenger. When I logged onto Messenger yesterday, there’s that screen that pops up and tells you the time and the weather for the city you live in. WELL, it had my city listed as Sunnyvale, CA, the place that this thing is coming from. I have a feeling they’re monitoring everything I do on Yahoo. Now for some reason my webcam won’t work with Messenger and other weird things are going on too. I threatened to delete my account (which I’ve had for 9 years) if they didn’t tell me what’s going on. I don’t like this one bit. I uninstalled Messenger just to see what would happen, and I was still receiving notifications from ZabaSearch saying they were still looking me up. DOES ANYONE KNOW WHY THEY ARE DOING THIS? I view it as harassment.