Yahoo! Lies about Obeying Robots.txt Directives
There are two possibilities here: Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying Robots directives. Either case isn’t good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:
74.6.22.102
[2008-10-06 (Mon) 15:37:31] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 74.6.0.0 - 74.6.255.255
CIDR: 74.6.0.0/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName: Network Abuse
OrgAbusePhone: +1-408-349-3300
OrgAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgTechHandle: NA258-ARIN
OrgTechName: Netblock Admin
OrgTechPhone: +1-408-349-3300
OrgTechEmail: rauschen@yahoo-inc.com
# ARIN WHOIS database, last updated 2008-10-05 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
..and another appearing later that month:
74.6.22.112
[2008-10-28 (Tue) 01:10:39] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 74.6.0.0 - 74.6.255.255
CIDR: 74.6.0.0/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName: Network Abuse
OrgAbusePhone: +1-408-349-3300
OrgAbuseEmail: network-abuse@cc.yahoo-inc.com
OrgTechHandle: NA258-ARIN
OrgTechName: Netblock Admin
OrgTechPhone: +1-408-349-3300
OrgTechEmail: jluster@yahoo-inc.com
# ARIN WHOIS database, last updated 2008-10-27 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
Come on, Yahoo! program your spiders to behave according to your officially stated policy declaring adherence to the Robots Exclusion Standard:
Yahoo! Slurp obeys the Robot Exclusion Standard. Specifically, Yahoo! Slurp adheres to the 1996 Robots Exclusion Standard (RES).
Obviously this is not the case. Yahoo! needs to either fix their disobedient, broken Slurp crawlers, or else amend their policy to more accurately reflect reality:
Yahoo! Slurp may obey the Robot Exclusion Standard, unless we tell it otherwise or it gets confused about the meaning of various robots.txt directives. Either way, Yahoo! likes people to believe that Slurp adheres to the 1996 Robots Exclusion Standard (RES), even though it doesn’t always do so.
At least then I wouldn’t feel compelled to call them out every couple of months with new violations. </rant>
15 responses to “Yahoo! Lies about Obeying Robots.txt Directives”
Christina,
I contacted Yahoo about this yesterday and they acted like they didn’t know what I was talking about. SO I explained to them in detail what was going on and I haven’t heard anything back. Also, I not sure if you have noticed this, but you may if you have yahoo messenger. When I logged onto messenger yesterday theres that screen that pops up and tells you the time and the weather for the city you live in. WELL, it had my city listed as Sunnyvale CA the place that this thing is coming from. I have a feeling their monitoring everything I do on Yahoo. Now for some reason my webcam won’t work with messenger and other weird things are going on too. I threatened to delete my account (which I’ve had for 9 years) if they didn’t tell me what’s going on. I don’t like this one bit. I uninstalled messenger just to see what would happen, and I was still receiving notifications from ZabaSearch saying they were still looking me up. DOES ANYONE KNOW WHY THEY ARE DOING THIS? I view it as harassment.
All of the search engines do this. If you want them to obey your rules, you will have to enforce the rules yourself. I am not saying this to be inflammatory. It is a reality I live with after getting into numerous verbal battles with Google on this matter.
You can use mod_rewrite to keep them out of directories they are not supposed to index. If you are not concerned with search engine rankings, you can even firewall their networks, though they will flag your site as malware if you do this.
Yes, I was a bit confused to see a squillion Slurp hits when I knew I was explicitly blocking “.crawl.” and anything yahoo or msn. I’m going to have to get more aggressive with them, obviously. Love your site – lots of useful info and I even understand some of it ;)