Suspicious Behavior from Yahoo! Slurp Crawler

[ Image: Black and white illustration of the upper half of a man's suspicious, paranoid face ]

Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces.

Of course, this policy presents a bit of a dilemma when the culprit is one of the four major search engines. Slamming the door on Yahoo! would be unwise, but if their Slurp crawler continues behaving suspiciously, I may have no choice. Check out the following records, pulled directly from one of my error logs, where Yahoo! exhibits some extremely questionable behavior.

July 29th, 2007

The first observed record of suspicious activity shows Yahoo! attempting to access a nonexistent URL. Note that the first portion of the URL does exist; it is the Perishable Press Redirection Lounge. Only explicitly predefined 301 redirects (via htaccess) should arrive at the Redirection Lounge.

July 29th 2007, 04:07pm   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2Fpublic_%2F
REFERRER: 
QUERY STRING: path=.%2Fpublic_%2F
REMOTE ADDRESS: 74.6.20.166
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
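For anyone who wants to work with records like the one above programmatically, here is a minimal sketch of a parser. It assumes every record follows the same five-line layout shown in this article; the `parse_record` helper and its field names are hypothetical and not part of any logging script used on this site.

```python
import re

# Field labels exactly as they appear in the log excerpts in this article.
FIELDS = ("REFERRER", "QUERY STRING", "REMOTE ADDRESS", "USER AGENT")

def parse_record(text):
    """Parse one error-log record into a dict (hypothetical helper)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # First line has the form "<timestamp>   >>   <requested URL>".
    timestamp, url = re.split(r"\s*>>\s*", lines[0], maxsplit=1)
    record = {"timestamp": timestamp, "url": url}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in the value stay intact.
        label, _, value = line.partition(":")
        if label in FIELDS:
            record[label.lower().replace(" ", "_")] = value.strip()
    return record

sample = """\
July 29th 2007, 04:07pm   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2Fpublic_%2F
REFERRER:
QUERY STRING: path=.%2Fpublic_%2F
REMOTE ADDRESS: 74.6.20.166
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
"""

record = parse_record(sample)
print(record["timestamp"], record["remote_address"], record["query_string"])
```

Splitting each field on the first colon only keeps the URL inside the user-agent string in one piece.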

July 30th, 2007

Yahoo! dropped in eight times on the 30th. Hmmm. We have several additional requests for nonexistent subdirectories of the Redirection Lounge, as well as several new and insightful clues into Yahoo!’s suspicious behavior. Allow me to break a few of these log entries on down..

The first record of the day presents an interesting clue as to what on earth ol’ Slurp might be doing. Check out the value of the query string. “88t” is an abbreviation for the username of fellow DLa [Dead Letter Art] member 88teeth. The string in question is used as a filename prefix on several of 88teeth’s DLa images and appears nowhere on the Perishable Press domain. Further, there is no “gallery” directory here either.

July 30th 2007, 12:50am   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2F88t_%2F
REFERRER: 
QUERY STRING: path=.%2F88t_%2F
REMOTE ADDRESS: 74.6.69.38
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
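The query string in these records is percent-encoded, which makes the embedded prefix harder to spot. As a small illustration, the sketch below decodes the value from the record above using Python's standard `urllib.parse` module. The variable names are just for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://perishablepress.com/press/2007/03/19/"
       "perishable-press-redirection-lounge/gallery/index.php?path=.%2F88t_%2F")

parts = urlsplit(url)
params = parse_qs(parts.query)       # parse_qs percent-decodes the values
path_value = params["path"][0]       # %2F decodes to "/"
prefix = path_value.strip("./")      # drop the leading "./" and trailing "/"

print(path_value)  # ./88t_/
print(prefix)      # 88t_
```

Decoded, the value reads as a relative directory path ("./88t_/"), which fits the idea that it was lifted from a gallery script on another site.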

This next one is interesting as well. The URL once again begins with the Redirection Lounge location, but then continues on to “/paintboard/editor.php”. I can assure you, there is no such directory or resource here at Perishable Press, but there are references to it over at the DLa website. What are you up to, Yahoo!?

July 30th 2007, 02:46am   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/paintboard/editor.php
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.25.173
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Here is a 404 error that is not unique to the Yahoo! Slurp crawler. I have been unable to ascertain the originating source of myriad URL requests looking for nonexistent locations named after various PHP functions (the names match the page names used in the PHP manual, such as “function.opendir”). With almost every visit, Google, MSN, Ask, and (as shown here) Yahoo! generate 404 errors such as the following:

July 30th 2007, 06:16am   >>   http://perishablepress.com/press/category/nothing/feed/function.opendir
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.28.79
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Later that day, Yahoo! strolls into town again, this time looking to “view” “/easyboard.php”. Huh? There are no links to or mentions of any such resource on this domain! Any ideas as to why Yahoo! is picking up keywords, image-name prefixes, and PHP function names from a different domain and looking for them here at Perishable Press? I would love to know..

July 30th 2007, 01:53pm   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/comments/easyboard.php?action=view
REFERRER: 
QUERY STRING: action=view
REMOTE ADDRESS: 74.6.73.32
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Several hours later, Yahoo! is looking for another nonexistent resource (named after the username of another DLa member)..

July 30th 2007, 08:58pm   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/thanec.html
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.71.153
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

This one confuses me. While there is an article on this domain with a URL matching the first part of the one shown below, absolutely nothing associated with that post has anything to do with “function.array-rand”. This is confusing because, in previous log entries, Yahoo! is apparently following links created on a different domain, or at least somewhere that redirects them to the Lounge. In this case, however, the request is not redirected, but rather goes directly to the 404 status code.

July 30th 2007, 09:19pm   >>   http://perishablepress.com/press/2005/11/07/dead-letter-art/feed/function.array-rand
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.26.221
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
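One mechanism that could, in principle, produce URLs shaped like these is ordinary relative-link resolution: a crawler that finds a relative link such as "function.array-rand" on a page will resolve it against the URL of that page. The sketch below only shows how such resolution works with Python's `urljoin`. That any page or feed here actually contained such relative links is an assumption and is not confirmed by the logs.

```python
from urllib.parse import urljoin

# Assumption: a crawler saw these relative hrefs while fetching the base URLs.
# The pairings are hypothetical; only the resulting URLs come from the logs.
cases = [
    ("http://perishablepress.com/press/2005/11/07/dead-letter-art/feed/",
     "function.array-rand"),
    ("http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/",
     "paintboard/editor.php"),
]

# Resolve each relative href against its base, as a crawler would.
resolved = [urljoin(base, href) for base, href in cases]
for url in resolved:
    print(url)
```

Both results match URLs recorded in the log entries above, so relative links resolved against the wrong base URL are at least one candidate explanation worth checking.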

Similar to the previous log entry, this one shows Yahoo! looking again for a PHP function, only this time the requested URL sits beneath a category view instead of a single-post view.

July 30th 2007, 10:23pm   >>   http://perishablepress.com/press/category/nothing/feed/function.array-rand
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.26.221
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Another request for the nonexistent “/paintboard/editor.php”..

July 30th 2007, 11:59pm   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/paintboard/editor.php
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.72.222
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

July 31st, 2007

More suspicious behavior from Yahoo!. Here is another request for a PHP function:

July 31st 2007, 04:05am   >>   http://perishablepress.com/press/category/nothing/feed/function.opendir
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.70.229
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Totally shady. Yahoo! Slurp is not only looking for what appears to be a section of the PHP manual (“ref.outcontrol”), it is looking for it in an incomplete, nonexistent directory. In previous cases, at least the first part of the URL was a legitimate resource. Here, that is simply not the case. If you ask me, this is very unusual behavior for the Slurp crawler.

July 31st 2007, 07:59am   >>   http://perishablepress.com/press/2006/10/23/ref.outcontrol
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.20.200
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Similar to the first error shown for July 30th (above), this case shows Yahoo! taking a filename prefix (another DLa member username) from a different domain and passing it as a query-string value in a request that was ultimately redirected to the Redirection Lounge. I should also mention that this type of request has not happened before (to my knowledge), and that both of the domains that seem to be involved (perishablepress.com and deadletterart.com) have been crawled many times by Yahoo! without such errors. Further, I have not altered the DLa site in any way within the previous several months (sadly). Perishable Press has been changed here and there, but nothing (to my mind) that would result in sudden crawl errors for Slurp.

July 31st 2007, 07:59am   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2Fweech_%2F
REFERRER: 
QUERY STRING: path=.%2Fweech_%2F
REMOTE ADDRESS: 74.6.27.105
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

August 1st, 2007

At the time of this writing, August 1st is the most recent date for which I have been able to gather such data. Again, we see more of the previously demonstrated behavior. First, the ol’ PHP function trick:

August 1st 2007, 02:36am   >>   http://perishablepress.com/press/category/websites/accessibility/feed/function.array-rand
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.20.202
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Then, another request for a gallery resource passing a variable with a value matching the pseudonym of yet another DLa member (yours truly)..

August 1st 2007, 02:46am   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2Fmon_%2F
REFERRER: 
QUERY STRING: path=.%2Fmon_%2F
REMOTE ADDRESS: 74.6.72.188
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

And another DLa member..

August 1st 2007, 03:09am   >>   http://perishablepress.com/press/2007/03/19/perishable-press-redirection-lounge/ronaldo.html
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.73.171
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

And finally, another PHP function..

August 1st 2007, 10:07am   >>   http://perishablepress.com/press/category/websites/accessibility/feed/function.opendir
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.22.47
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Conclusion

Summarizing our clues, let’s first examine reasons why such behavior is genuinely suspicious for the Yahoo!/Slurp crawler:

  • Two different sites seem to be involved, Perishable Press and Dead Letter Art.
  • No changes whatsoever have been made to the DLa site in several months.
  • Changes have been made to this site, but nothing related to DLa, paintboards, or the function names seen in these requests.
  • The pattern of 404 errors presented in this article is unique and has not been seen before.
  • Reverse IP lookups verify these crawl errors belong to Yahoo!/Slurp.
  • None of the other major search engines have demonstrated any similar patterns of behavior.
  • None of the resources requested in these examples exist on the Perishable Press domain.
  • None of the resources requested in these examples are referred to anywhere on this domain.
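The reverse IP lookups mentioned in the list above can be scripted. Here is a minimal sketch of a forward-confirmed reverse DNS check: look up the hostname for the IP, confirm it falls under the crawler's domain, then resolve that hostname forward and confirm it maps back to the same IP. The `is_verified_slurp` helper is hypothetical, and the `crawl.yahoo.net` suffix is an assumption about how Slurp hosts are named. It is not taken from the logs in this article.

```python
import socket

# Assumption: genuine Slurp hosts reverse-resolve to names under this suffix.
CRAWLER_SUFFIX = ".crawl.yahoo.net"

def is_verified_slurp(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check (hypothetical helper)."""
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(CRAWLER_SUFFIX):
            return False
        # The forward lookup must map the hostname back to the same IP.
        return ip in forward(hostname)[2]
    except OSError:
        # Covers socket.herror and socket.gaierror (failed lookups).
        return False

# Offline demonstration with stub resolvers; the hostname is made up.
def fake_reverse(ip):
    return ("example-host.crawl.yahoo.net", [], [ip])

def fake_forward(hostname):
    return (hostname, [], ["74.6.20.166"])

print(is_verified_slurp("74.6.20.166", fake_reverse, fake_forward))
```

Calling `is_verified_slurp("74.6.20.166")` without the stub arguments performs real DNS lookups. The forward confirmation matters because anyone who controls the reverse DNS for their own IP range can make it claim any hostname.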

On the other hand, here are a few reasons why such behavior may not be so suspicious after all:

  • The IP addresses are from different domains.
  • Relatively speaking, there are few unresolved requests.
  • The unresolved requests occur over the course of several days.
  • A significant amount of time transpired between requests.
  • There is no recognizable pattern to the unresolved requests.
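To put a number on the time between requests, the sketch below computes the gaps between the eight July 30th timestamps listed earlier in this article. The `parse_stamp` helper is hypothetical, and the parsing assumes an English locale for month names and am/pm markers.

```python
import re
from datetime import datetime

def parse_stamp(stamp):
    """Parse a timestamp such as 'July 30th 2007, 12:50am' (hypothetical helper)."""
    # Drop the ordinal suffix (st, nd, rd, th) that follows the day number.
    clean = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", stamp)
    return datetime.strptime(clean, "%B %d %Y, %I:%M%p")

# Timestamps copied from the July 30th log entries in this article.
stamps = [
    "July 30th 2007, 12:50am",
    "July 30th 2007, 02:46am",
    "July 30th 2007, 06:16am",
    "July 30th 2007, 01:53pm",
    "July 30th 2007, 08:58pm",
    "July 30th 2007, 09:19pm",
    "July 30th 2007, 10:23pm",
    "July 30th 2007, 11:59pm",
]

times = [parse_stamp(s) for s in stamps]
# Minutes elapsed between each pair of consecutive requests.
gaps = [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]

print("gaps in minutes:", gaps)
print("shortest:", min(gaps), "longest:", max(gaps))
```

For that one day, the gaps range from 21 minutes to a little over seven and a half hours, which supports the point that these requests are spread out rather than arriving in bursts.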

Beyond these clues, I can only guess at the reason behind such unusual crawl behavior. This weekend, I plan on digging into the DLa site to determine if there is an underlying cause for these errors. If/when anything turns up, I will post an update in this article. Until then, if you have any ideas or clues as to why Yahoo! is doing this, please let me know. Also, if you have experienced similar behavior from Yahoo!/Slurp, I would love to hear about it. Otherwise, thanks for your kind attention.