Perishable Press

Yahoo! Lies about Obeying Robots.txt Directives

There are two possibilities here: Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying Robots directives. Either case isn’t good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:

74.6.22.102
[2008-10-06 (Mon) 15:37:31] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:    
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-10-05 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

...and another appearing later that month:

74.6.22.112
[2008-10-28 (Tue) 01:10:39] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  jluster@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-10-27 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
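
A user-agent string is easy to spoof, so what ties these requests to Yahoo! is the address range rather than the "Slurp" label. As a rough sanity check, both logged addresses can be tested against the CIDR block that ARIN reports above. This is only a minimal sketch using Python's standard `ipaddress` module; the names `INKTOMI_BLOCK`, `logged_ips`, and `in_block` are illustrative and are not part of the blackhole trap itself.

```python
import ipaddress

# CIDR block reported by ARIN for INKTOMI-BLK-6 (taken from the WHOIS output)
INKTOMI_BLOCK = ipaddress.ip_network("74.6.0.0/16")

# The two addresses recorded in the blackhole log
logged_ips = ["74.6.22.102", "74.6.22.112"]


def in_block(ip, network=INKTOMI_BLOCK):
    """Return True if the address falls inside the given network."""
    return ipaddress.ip_address(ip) in network


for ip in logged_ips:
    status = "inside" if in_block(ip) else "outside"
    print(ip, "is", status, INKTOMI_BLOCK)
```

Both addresses fall inside 74.6.0.0/16, which matches the NetRange line in each lookup.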

Come on, Yahoo! Program your spiders to behave according to your officially stated policy declaring adherence to the Robots Exclusion Standard:

Yahoo! Slurp obeys the Robot Exclusion Standard. Specifically, Yahoo! Slurp adheres to the 1996 Robots Exclusion Standard (RES).

Obviously this is not the case. Yahoo! needs to either fix their disobedient, broken Slurp crawlers, or else amend their policy to more accurately reflect reality:

Yahoo! Slurp may obey the Robot Exclusion Standard, unless we tell it otherwise or it gets confused about the meaning of various robots.txt directives. Either way, Yahoo! likes people to believe that Slurp adheres to the 1996 Robots Exclusion Standard (RES), even though it doesn’t always do so.

At least then I wouldn’t feel compelled to call them out every couple of months with new violations. </rant>
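
For anyone who wants to confirm what a compliant crawler should do, here is a hedged sketch using Python's standard `urllib.robotparser`. The robots.txt rules shown are an assumption made for illustration (the actual file for this site is not reproduced in this post), and `may_fetch` is a hypothetical helper name. The user-agent string is the one recorded in the log entries above.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: illustrative rules only, not the site's real robots.txt
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /blackhole/
"""

# User-agent string as recorded in the blackhole log
SLURP_UA = (
    "Mozilla/5.0 (compatible; Yahoo! Slurp; "
    "http://help.yahoo.com/help/us/ysearch/slurp)"
)


def may_fetch(user_agent, path, robots_txt=ASSUMED_ROBOTS_TXT):
    """Return True if the given rules permit user_agent to request path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print("/blackhole/ allowed:", may_fetch(SLURP_UA, "/blackhole/"))
print("/ allowed:", may_fetch(SLURP_UA, "/"))
```

Under these assumed rules, the parser reports /blackhole/ as off limits and the site root as allowed, so a crawler that honors the standard has no reason to request the trap directory.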

Jeff Starr
About the Author Jeff Starr = Web Developer. Book Author. Secretly Important.
15 responses
  1. Great. And Slurp is also the biggest draw on resources. I’ve added directives to my robots.txt with crawl delays to try and combat this, to stay out of certain parts of my site, and more. So it’s all been for naught if Slurp is so easily confused. *Sigh*

  2. Shadow Caster November 16, 2008 @ 12:08 pm

    Hi. To give Yahoo! the benefit of the doubt you could say that the robots rule is to not have that directory put into the Yahoo! index. It might crawl it for statistical analysis purposes or spying (paranoid yet?) but it won’t add it to its index.

  3. Jeff Starr

    @Shadow Caster: the Robots Exclusion Protocol “is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable,” according to the relevant entry at Wikipedia. So, to “give Yahoo! the benefit of the doubt” you would have to change the meaning, purpose, and functionality of the entire Robots Exclusion Protocol. Perhaps you were thinking of the meta noindex attribute?

  4. Well, we always knew Yahoo’s bot had issues :p… wouldn’t adding a .htaccess in the folders, blocking all access barring exceptions, be a more ‘secure’ alternative?

    Also, I noticed Feedburner throws an email after every 3 posts instead of one :p Fix it! I want to read them ASAP!!!

    Sideline issue: regarding the trackback on this post, have a look at the site… it seems to be generated using ‘ping crawl’ from bluehatseo.com… just make sure the site is ‘kosher’ if you want to allow it! :p

  5. Shadow Casters November 17, 2008 @ 8:10 am

    OK Jeff, I agree with you now. By any chance do you have any files in that folder that Yahoo! accessed?

  6. @Jeff: no, I mean the subscribe function. Normally it sends me an email every time you publish a post, etc… this time it sent one email with three posts, i.e.:

    Fruit Loop: Separate any Number of Odd and Even Posts from any Category in WordPress

    Yahoo! Lies about Obeying Robots.txt Directives

    Three Years and Counting

    Maybe they were published on the same date… too lazy to check :p

  7. @Jeff: Email… easier to read on the move… if they were all published yesterday, that explains it then! I’m used to your one-every-few-days posting schedule :p

  8. Jeff Starr

    @Shadow Casters: yes, but only an index.php file that tells the visitor that they have “fallen into a trap” and displays their IP and other server information. Yet this file is completely isolated and accessible from only the pages on this site..

    @Donace: indeed, htaccess could easily prevent Slurp from treading into the forbidden realms, but there is nothing there to protect, really. The whole setup is just a big “spider trap” to catch bad bots — strictly for my own amusement! ;)

    About your second point, I am not sure what you mean about Feedburner throwing emails. Perhaps you refer to the “Subscribe to Comments” feature available for post comments? If so, I hope everything is working correctly; the plugin is supposed to send an email for every comment left on a post.

    Also, thanks for the catch on the ping/crawl trackback — flushed! ;)

  9. Jeff Starr

    @Donace: Ah, I see what you mean.. yes, all three of those posts were published on the same day (yesterday). Are you getting updates via Feed or email?

  10. Jeff Starr

    @Donace: ah, cool — I won’t worry about it then.. I guess I shouldn’t get so ambitious and stick to only a couple posts per week! ;)

  11. I came across this page because I was wondering why this IP address is checking up on me: 67.195.111.221. All I know is that it comes from Sunnyvale, CA. I have a zaba search account, and they inform me whenever someone searches for me online. I have a very rare name, so I know it’s me they’re looking at. I have no clue why they keep doing this. I know nothing about this stuff. Any help would be greatly appreciated. Also, how can I get a hold of them to find out what they want? I kinda wanna give them a piece of my mind. It’s making me a little nervous that they’re looking me up all the time. Thanks

  12. Me too, they have been constantly checking on me since last August. I think it is really creepy, and I do not know anything about any of this stuff either. Any help is appreciated.

[ Comments are closed for this post ]