Yahoo! Lies about Obeying Robots.txt Directives

Published Sunday, November 16, 2008 @ 9:36 am • 10 Responses

There are two possibilities here: either Yahoo!’s Slurp crawler is broken, or Yahoo! lies about obeying robots.txt directives. Neither is good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:

74.6.22.102
[2008-10-06 (Mon) 15:37:31] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:    
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  rauschen@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-10-05 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.

...and another later that month:

74.6.22.112
[2008-10-28 (Tue) 01:10:39] "GET /blackhole/ HTTP/1.0"
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

OrgName:    Inktomi Corporation
OrgID:      INKT
Address:    701 First Ave
City:       Sunnyvale
StateProv:  CA
PostalCode: 94089
Country:    US

NetRange:   74.6.0.0 - 74.6.255.255
CIDR:       74.6.0.0/16
NetName:    INKTOMI-BLK-6
NetHandle:  NET-74-6-0-0-1
Parent:     NET-74-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate:    2006-02-13
Updated:    2007-03-09

RAbuseHandle: NETWO857-ARIN
RAbuseName:   Network Abuse
RAbusePhone:  +1-408-349-3300
RAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgAbuseHandle: NETWO857-ARIN
OrgAbuseName:   Network Abuse
OrgAbusePhone:  +1-408-349-3300
OrgAbuseEmail:  network-abuse@cc.yahoo-inc.com

OrgTechHandle: NA258-ARIN
OrgTechName:   Netblock Admin
OrgTechPhone:  +1-408-349-3300
OrgTechEmail:  jluster@yahoo-inc.com

# ARIN WHOIS database, last updated 2008-10-27 19:10
# Enter ? for additional hints on searching ARIN's WHOIS database.
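
For anyone who wants to double-check the WHOIS results, here is a minimal sketch (Python standard library only) that tests whether the two logged addresses fall inside the 74.6.0.0/16 block listed above. The helper name `in_inktomi_block` is just an illustration, not part of my actual trap.

```python
import ipaddress

# Netblock copied from the ARIN WHOIS output above (NetName: INKTOMI-BLK-6).
INKTOMI_BLOCK = ipaddress.ip_network("74.6.0.0/16")

# The two client addresses recorded in the blackhole log.
LOGGED_IPS = ["74.6.22.102", "74.6.22.112"]


def in_inktomi_block(ip: str) -> bool:
    """Return True if the address lies inside the 74.6.0.0/16 allocation."""
    return ipaddress.ip_address(ip) in INKTOMI_BLOCK


if __name__ == "__main__":
    for ip in LOGGED_IPS:
        print(ip, in_inktomi_block(ip))
```

This only confirms that the addresses belong to the netblock ARIN lists for Inktomi; it says nothing about why the requests were made.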

Come on, Yahoo! Program your spiders to behave according to your officially stated policy of adherence to the Robots Exclusion Standard:

Yahoo! Slurp obeys the Robot Exclusion Standard. Specifically, Yahoo! Slurp adheres to the 1996 Robots Exclusion Standard (RES). — source: http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html

Obviously this is not the case. Yahoo! needs to either fix their disobedient, broken Slurp crawlers, or else amend their policy to more accurately reflect reality:

Yahoo! Slurp may obey the Robot Exclusion Standard, unless we tell it otherwise or it gets confused about the meaning of various robots.txt directives. Either way, Yahoo! likes people to believe that Slurp adheres to the 1996 Robots Exclusion Standard (RES), even though it doesn’t always do so.

At least then I wouldn’t feel compelled to call them out every couple of months with new violations. </rant>
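
For reference, here is a rough sketch of what "obeying" looks like from the crawler's side, using Python's standard `urllib.robotparser`. The robots.txt content below is an assumption for illustration (a wildcard rule disallowing `/blackhole/`); the site's actual file is not reproduced in this post, and `may_fetch` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: an illustrative robots.txt, not the site's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blackhole/
"""


def may_fetch(user_agent: str, path: str) -> bool:
    """Ask the parser whether this agent is allowed to request the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # A compliant crawler would run a check like this before each request.
    print(may_fetch("Slurp", "/blackhole/"))
    print(may_fetch("Slurp", "/about/"))
```

Under those assumed rules, a cooperating robot would skip `/blackhole/` entirely, which is exactly what the trap relies on.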


Dialogue


1. Sue

November 16, 2008 at 10:21 am

Great. And Slurp is also the biggest draw on resources. I’ve added directives to my robots.txt with crawl delays to try and combat this, to stay out of certain parts of my site, and more. So it’s all been for naught if Slurp is so easily confused. *Sigh*

2. Shadow Caster

November 16, 2008 at 12:08 pm

Hi. To give Yahoo! the benefit of the doubt you could say that the robots rule is to not have that directory put into the Yahoo! index. It might crawl it for statistical analysis purposes or spying (paranoid yet?) but it won’t add it to its index.

3. Jeff Starr

November 16, 2008 at 1:31 pm

@Shadow Caster: the Robots Exclusion Protocol is “a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable,” according to the relevant entry at Wikipedia. So, to “give Yahoo! the benefit of the doubt” you would have to change the meaning, purpose, and functionality of the entire Robots Exclusion Protocol. Perhaps you were thinking of the meta noindex attribute?

4. Shadow Casters

November 17, 2008 at 8:10 am

OK Jeff, I agree with you now. By any chance do you have any files in that folder that Yahoo! accessed?

5. Donace

November 17, 2008 at 8:32 am

Well, we always knew Yahoo’s bot had issues :p… wouldn’t adding a .htaccess in the folders, blocking all access barring exceptions, be a more ‘secure’ alternative?

also I noticed feedburner throws an email after every 3 posts instead of one :p fix it! I want to read them asap!!!

Sideline issue: the trackback on this post. Have a look at the site… it seems to be generated using ‘ping crawl’ from bluehatseo.com… just make sure the site is ‘kosher’ if you want to allow it! :p

6. Jeff Starr

November 17, 2008 at 9:22 am

@Shadow Casters: yes, but only an index.php file that tells the visitor that they have “fallen into a trap” and displays their IP and other server information. Yet this file is completely isolated and accessible from only the pages on this site..

@Donace: indeed, htaccess could easily prevent Slurp from treading into the forbidden realms, but there is nothing there to protect, really. The whole setup is just a big “spider trap” to catch bad bots — strictly for my own amusement! ;)

About your second point, I am not sure what you mean about Feedburner throwing emails.. Perhaps you refer to the “Subscribe to Comments” feature available for post comments? If so, I hope everything is working correctly; the plugin is supposed to send an email for every comment left on a post.

Also, thanks for the catch on the ping/crawl trackback — flushed! ;)

7. Donace

November 17, 2008 at 9:29 am

@Jeff: no, I mean the subscribe function. Normally it sends me an email every time you publish a post, etc… this time it sent one email with three posts, i.e.:

Fruit Loop: Separate any Number of Odd and Even Posts from any Category in WordPress

Yahoo! Lies about Obeying Robots.txt Directives

Three Years and Counting

Maybe they were published on the same date… too lazy to check :p

8. Jeff Starr

November 17, 2008 at 9:37 am

@Donace: Ah, I see what you mean.. yes, all three of those posts were published on the same day (yesterday). Are you getting updates via Feed or email?

9. Donace

November 17, 2008 at 9:46 am

@Jeff: Email… easier to read on the move… if they were all published yesterday, that explains it then! I’m used to your one-every-few-days posting schedule :p

10. Jeff Starr

November 17, 2008 at 10:08 am

@Donace: ah, cool — I won’t worry about it then.. I guess I shouldn’t get so ambitious and stick to only a couple posts per week! ;)


