Perishable Press

Suspicious Behavior from Yahoo! Slurp Crawler

[ Image: Black and white illustration of the upper half of a man's suspicious, paranoid face ] Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces.

Of course, this policy presents a bit of a dilemma when the culprit is one of the four major search engines. Slamming the door on Yahoo! would be unwise, but if their Slurp crawler continues behaving suspiciously, I may have no choice. Check out the following records, pulled directly from one of my error logs, where Yahoo! exhibits some extremely questionable behavior.

July 29th, 2007

The first observed record of suspicious activity shows Yahoo! attempting to access a nonexistent URL. Note that the first portion of the URL does exist; it is the Perishable Press Redirection Lounge. Only explicitly predefined 301 redirects (via htaccess) should arrive at the Redirection Lounge.

Note: in the following log entries, each instance of perishablepress.com was replaced with example.com. This was required to prevent endless 404 errors from googlebot constantly crawling plain-text URLs.
July 29th 2007, 04:07pm   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2Fpublic_%2F
REFERRER: 
QUERY STRING: path=.%2Fpublic_%2F
REMOTE ADDRESS: 74.6.20.166
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
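
For readers who want to work with entries like the one above, here is a minimal Python sketch that splits a record into fields and decodes the percent-encoded query string. The `parse_entry` helper and the field names it produces are hypothetical, chosen for illustration; the sketch assumes only the five-line layout shown in this article.

```python
from urllib.parse import urlsplit, parse_qs

# Sample entry copied from the log above (example.com stands in for the real domain).
ENTRY = """\
July 29th 2007, 04:07pm   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2Fpublic_%2F
REFERRER: 
QUERY STRING: path=.%2Fpublic_%2F
REMOTE ADDRESS: 74.6.20.166
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
"""


def parse_entry(text):
    """Turn one log entry into a dict (hypothetical helper, not part of any logging tool)."""
    lines = text.strip().splitlines()
    # The first line holds the timestamp and the requested URL, separated by ">>".
    timestamp, _, url = lines[0].partition(">>")
    record = {"timestamp": timestamp.strip(), "url": url.strip()}
    # Remaining lines are "LABEL: value" pairs; split on the first colon only.
    for line in lines[1:]:
        key, _, value = line.partition(":")
        record[key.strip().lower().replace(" ", "_")] = value.strip()
    parts = urlsplit(record["url"])
    record["path"] = parts.path
    # parse_qs decodes percent-encoding, so %2F becomes "/".
    record["params"] = parse_qs(parts.query)
    return record


record = parse_entry(ENTRY)
print(record["remote_address"], record["params"])
```

Decoded, the `path` parameter reads `./public_/`, which makes the directory-style value a little easier to recognize.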

July 30th, 2007

Yahoo! dropped in eight times on the 30th. Hmmm. We have several additional requests for nonexistent subdirectories of the Redirection Lounge, as well as several new and insightful clues into Yahoo!’s suspicious behavior. Allow me to break a few of these log entries on down..

The first record of the day presents an interesting clue as to what on earth ol’ Slurp might be doing. Check out the value of the query string. “88t” is an abbreviation for the username of fellow DLa [Dead Letter Art] member 88teeth. The query string in question is prepended to several of 88teeth’s DLa images and appears nowhere on the Perishable Press domain. Further, there is no specific “gallery” directory either.

July 30th 2007, 12:50am   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2F88t_%2F
REFERRER: 
QUERY STRING: path=.%2F88t_%2F
REMOTE ADDRESS: 74.6.69.38
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

This next one is interesting as well. The URL once again begins with the Redirection Lounge location, but then continues on to “/paintboard/editor.php”. I can assure you, there is no such directory or resource here at Perishable Press, but there is a reference to such over at the DLa website. What are you up to, Yahoo!?

July 30th 2007, 02:46am   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/paintboard/editor.php
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.25.173
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Here is a 404 error that is not unique to the Yahoo! Slurp crawler. I have been unable to ascertain the originating source of myriad URL requests looking for nonexistent locations named after various JavaScript functions. With almost every visit, Google, MSN, Ask, and (as shown here) Yahoo! generate 404 errors such as the following:

July 30th 2007, 06:16am   >>   https://example.com/press/category/nothing/feed/function.opendir
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.28.79
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Later that day, Yahoo! strolls into town again. This time looking to “view” “/easyboard.php”. Huh? There are no links or mentions of any such resource on this domain! Any ideas as to why Yahoo! is picking up keywords, image-name prefixes, and JavaScript functions from a different domain and looking for them here at Perishable Press? I would love to know..

July 30th 2007, 01:53pm   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/comments/easyboard.php?action=view
REFERRER: 
QUERY STRING: action=view
REMOTE ADDRESS: 74.6.73.32
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Several hours later, Yahoo! is looking for another nonexistent resource (named after the username of another DLa member)..

July 30th 2007, 08:58pm   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/thanec.html
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.71.153
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

This one confuses me. While there is an article on this domain with a URL matching the first part of that shown below, there is absolutely nothing associated with that post that has anything to do with “function.array-rand”. This is confusing because, in previous log entries, Yahoo! is apparently following links created on a different domain, or at least somewhere that redirects them to the Lounge. In this case, however, the request is not redirected, but rather followed directly to the 404 status code.

July 30th 2007, 09:19pm   >>   https://example.com/press/2005/11/07/dead-letter-art/feed/function.array-rand
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.26.221
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Similar to the previous log entry, this one shows Yahoo! looking again for a JavaScript function, only this time the referral is from a category view instead of a single-post view.

July 30th 2007, 10:23pm   >>   https://example.com/press/category/nothing/feed/function.array-rand
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.26.221
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Another request for the nonexistent “/paintboard/editor.php”..

July 30th 2007, 11:59pm   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/paintboard/editor.php
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.72.222
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

July 31st, 2007

More suspicious behavior from Yahoo!. Here is another request for a JavaScript function:

July 31st 2007, 04:05am   >>   https://example.com/press/category/nothing/feed/function.opendir
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.70.229
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Totally shady. Yahoo! Slurp is not only looking for a JavaScript function, it is looking for it in an incomplete, nonexistent directory. In previous cases, at least the first part of the URL was a legitimate resource. Here, that is simply not the case. If you ask me, this is very unusual behavior for the Slurp crawler.

July 31st 2007, 07:59am   >>   https://example.com/press/2006/10/23/ref.outcontrol
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.20.200
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Similar to the first error shown for July 30th (above), this case shows Yahoo! passing the value of a filename prefix (another DLa member username) located on a different domain to the URL of a referral that was ultimately redirected to the Redirection Lounge. I should also mention that this type of request has not happened before (to my knowledge), and that both of the domains that seem to be involved (perishablepress.com and deadletterart.com) have been crawled previously many times by Yahoo! without such errors. Further, I have not altered the DLa site in any way within the previous several months (sadly). Perishable Press has been changed here and there, but nothing (to my mind) that would result in sudden crawl errors for Slurp.

July 31st 2007, 07:59am   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2Fweech_%2F
REFERRER: 
QUERY STRING: path=.%2Fweech_%2F
REMOTE ADDRESS: 74.6.27.105
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

August 1st, 2007

At the time of this writing, August 1st is the most recent date for which I have been able to gather such data. Again, we see more of the previously demonstrated behavior. First, the ol’ JavaScript function trick:

August 1st 2007, 02:36am   >>   https://example.com/press/category/websites/accessibility/feed/function.array-rand
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.20.202
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Then, another request for a gallery resource passing a variable with a value matching the pseudonym of yet another DLa member (yours truly)..

August 1st 2007, 02:46am   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/gallery/index.php?path=.%2Fmon_%2F
REFERRER: 
QUERY STRING: path=.%2Fmon_%2F
REMOTE ADDRESS: 74.6.72.188
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

And another DLa member..

August 1st 2007, 03:09am   >>   https://example.com/press/2007/03/19/perishable-press-redirection-lounge/ronaldo.html
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.73.171
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

And finally, another JavaScript function..

August 1st 2007, 10:07am   >>   https://example.com/press/category/websites/accessibility/feed/function.opendir
REFERRER: 
QUERY STRING: 
REMOTE ADDRESS: 74.6.22.47
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
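
To see the requests recorded above side by side, here is a rough Python sketch that groups them by the shape of the URL. The `classify` helper and the category labels are hypothetical, introduced only for illustration; the paths are copied from the log entries in this article, with the scheme and host omitted.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

LOUNGE = "/press/2007/03/19/perishable-press-redirection-lounge"

# Requested paths from the log entries above, in chronological order.
REQUESTS = [
    LOUNGE + "/gallery/index.php?path=.%2Fpublic_%2F",
    LOUNGE + "/gallery/index.php?path=.%2F88t_%2F",
    LOUNGE + "/paintboard/editor.php",
    "/press/category/nothing/feed/function.opendir",
    LOUNGE + "/comments/easyboard.php?action=view",
    LOUNGE + "/thanec.html",
    "/press/2005/11/07/dead-letter-art/feed/function.array-rand",
    "/press/category/nothing/feed/function.array-rand",
    LOUNGE + "/paintboard/editor.php",
    "/press/category/nothing/feed/function.opendir",
    "/press/2006/10/23/ref.outcontrol",
    LOUNGE + "/gallery/index.php?path=.%2Fweech_%2F",
    "/press/category/websites/accessibility/feed/function.array-rand",
    LOUNGE + "/gallery/index.php?path=.%2Fmon_%2F",
    LOUNGE + "/ronaldo.html",
    "/press/category/websites/accessibility/feed/function.opendir",
]

# Final path segments such as "function.opendir" or "ref.outcontrol".
SUFFIX = re.compile(r"^(function|ref)\.[\w-]+$")


def classify(request):
    """Label a request by URL shape (hypothetical categories)."""
    path = urlsplit(request).path
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    if SUFFIX.match(last_segment):
        return "function/ref suffix"
    if path.startswith(LOUNGE + "/"):
        return "below the Redirection Lounge"
    return "other"


counts = Counter(classify(r) for r in REQUESTS)
for label, total in counts.most_common():
    print(f"{total:2d}  {label}")
```

Run against the sixteen requests listed, the sketch puts nine below the Redirection Lounge and seven in the function/ref group, with none left over.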

Conclusion

Summarizing our clues, let’s first examine reasons why such behavior is genuinely suspicious for the Yahoo!/Slurp crawler:

  • Two sites seem to be involved, Perishable Press and Dead Letter Art.
  • No changes whatsoever have been made to the DLa site in several months.
  • Changes have been made to this site, but nothing related to DLa, paintboards, or JavaScript.
  • The pattern of 404 errors is unique and has not been seen before.
  • Reverse IP lookups verify these crawl errors belong to Yahoo/Slurp.
  • None of the other major search engines have demonstrated any similar patterns of behavior.
  • None of the resources requested exist on the Perishable Press domain.
  • None of the resources requested are referred to anywhere on this domain.
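
The lookup check mentioned in the list above can be sketched in Python as a forward-confirmed reverse DNS test: resolve the IP to a hostname, confirm the hostname belongs to the crawler's domain, then resolve that hostname back and confirm it includes the original IP. The `is_verified_crawler` helper and its parameters are hypothetical, and the `.crawl.yahoo.net` suffix is an assumption about how Yahoo! named its crawler hosts; check it against Yahoo!'s own documentation before relying on it.

```python
import socket


def is_verified_crawler(ip, allowed_suffixes=(".crawl.yahoo.net",),
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check (illustrative sketch).

    allowed_suffixes is an assumption, not taken from the log data.
    reverse/forward default to the standard resolver calls and can be
    swapped out for offline testing.
    """
    try:
        # Reverse lookup: IP -> primary hostname.
        hostname = reverse(ip)[0]
    except OSError:
        return False
    # The hostname must end with one of the expected crawler suffixes.
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False
    try:
        # Forward lookup: hostname -> list of IPv4 addresses.
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # Only a match in both directions counts as verified.
    return ip in addresses
```

Calling `is_verified_crawler("74.6.20.166")` performs live DNS queries, so the result depends on the network and on current DNS records rather than on the state of things in 2007. A user-agent string alone proves nothing, which is why the check runs in both directions.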

On the other hand, here are a few reasons why such behavior may not be so suspicious after all:

  • The IP addresses are from different domains.
  • Relatively speaking, there are few unresolved requests.
  • The unresolved requests occur over the course of several days.
  • A significant amount of time transpired between requests.
  • There is no recognizable pattern to the unresolved requests.

Beyond these clues, I may only guess at the reason behind such unusual crawl behavior. This weekend, I plan on digging into the DLa site to determine if there is an underlying cause for these errors. If/when anything turns up, I will post an update here on this article. Until then, if you have any ideas or clues as to why Yahoo! is doing this, please let me know. Also, if you have experienced similar behavior from Yahoo!/Slurp, I would love to hear about it. Otherwise, thanks for your kind attention.

Jeff Starr
About the Author Jeff Starr = Designer. Developer. Producer. Writer. Editor. Etc.
16 responses
  1. Strange requests indeed, I’ve seen them in my own logfiles too.

    The two functions (array_rand and opendir) you define as JavaScript functions are PHP functions. Some servers generate clickable links to the PHP manual (which uses function.NAMEOFFUNCTION in its URLs) in PHP scripting error messages. Maybe that’s also the cause of these problems.

  2. Sick of Debt August 16, 2007 @ 10:03 am

    One reason I can think of for some of the items may be them testing for a mirrored site (re: items not on this site but on the other one). Some unscrupulous people may register several domains and have the exact same site on them, but with the links just a little different, pointing to each other. Having more locations with links pointing to a site can raise its ranking on some search engines.

  3. Jeff Starr

    @Bas, thank you for pointing that out! While banging my head against the wall trying to troubleshoot the bizarre request errors, my thinking became hazy (at best) and I failed to recognize the functions as PHP (JavaScript – yikes! ..time to hit the books). If I understand you correctly, it seems as if PHP error messages (as recorded in a log file) may be creating links that the search engines are following? I want to look into this further. Would you happen to have a link for me to check out or perhaps some well-focused keywords to google..? I would be most appreciative! Thanks again for your help.

    @Sick of Debt, that is very possible, indeed. I wonder what would cause ol’ Slurp to suddenly start testing my domain/pages.. As mentioned, all sites involved are well-established and have not seen any significant changes in quite some time. Are there other reasons why a search engine would suddenly grow suspicious? I have nothing to hide, and would like to get to the bottom of this. I recently checked a few new entries in the error log and things haven’t changed; I am still seeing all kinds of mysterious URL requests. We’ll see.. Either way, thanks for commenting. I need all the help I can get!!

  4. Try a search for “function.fopen” (without the quotes) on Google and go to page 10 or further. There, the long-standing error message pages are shown, mostly with clickable function names. There you see the cause of the function.FUNCTIONNAME requests.

    If you look at the URLs in Google’s search results, you will see many sites having the same problem you had.

  5. Jeff Starr

    Ahh yes, I see.. many sites indeed.
    Very nice work, Bas. This is exactly the information I need. There must be a way to eliminate the link or otherwise prevent it from being created in the first place. Or perhaps an htaccess redirect solution of some sort.
    And here I thought I was going to get a post written today.. ;)
    Thanks for your help!

  6. Thanks for the posting. While I agree with you that the Yahoo Slurp seems to behave strangely, I beg to differ with your conclusions. It’s definitely a hack.

    I had 2 websites’ databases erased this month, September 07, and after digging through the raw log files, I discovered that all 550 entries in one database were deleted one by one by an agent masquerading as Googlebot, with genuine Google IP addresses (I did a reverse DNS lookup), while the other database was deleted in the same manner but through an agent masquerading as Yahoo Slurp with, here again, a genuine Yahoo IP address.

    It’s simply impossible that both Search Engines became crazy at the same time. It’s just that the dickheads behind this are using Yahoo and Google ip addresses so that you won’t block them as you suggested earlier, for fear of not being spydered by their bots when they revisit for future referencing.

    What amazes me here is that they are using some kind of ip hopping software and using the same technique, could do some serious damage to programs like AdSense if they were to click ads by rotating ip addresses and browsers headers…

    Thankfully, I had backups, so I restored both databases, which are now fully protected. You always learn something when such a problem occurs. Anyway thanks again for the great post.

  7. Jeff Starr

    Very interesting information, Steven. However, I have verified that the crawlers are indeed Yahoo! Slurp machines via reverse/forward DNS lookups. Thus, as much as I would like to think that the suspicious behavior was coming from hacks instead of Yahoo, I don’t see how it would be possible.

    What I want to know is, how on earth did rogue agents manage to delete database content? That just blows my mind! If there was some security hole, please let me know so that I may examine my setup and fix any potential problems immediately.

    Nevertheless, I am happy to hear that you were wise enough to make backups of your database content. I do the same, but am seriously concerned after reading some of the things mentioned in your comment.

  8. i think it’s kind of silly how you assumed something sinister was going on without any evidence. based on what was actually happening, it seems far more likely that it was just a bug. occasional requests for nonexistent urls don’t seem like a huge problem. anyway, this is probably your answer (404 link removed 2017/01/20).

  9. Jeff Starr

    Hi d,

    Thank you for the feedback, your kind criticism helps to improve the overall content-value and trustworthiness of my site ;)

    In the article, I present numerous examples of unusual/unexpected crawl behavior from slurp as recorded in my error logs. I have also verified each IP as belonging to Yahoo via forward-reverse DNS lookup (Yahoo’s recommended method). After verifying that the odd requests were in fact coming from slurp, I desired to understand and share the information with my readers. I assure you, I have nothing against Yahoo and have no reason to fabricate such data; it is published with the hope that bright individuals such as yourself will help contribute to a possible explanation, which you have indeed managed to do. While the evidence is no “slam dunk”, it certainly warrants the use of the word “suspicious” to describe the behavior — it’s not like I am accusing Yahoo of murder or something. As you say, it is definitely not a “huge problem”. Chances are that your suspicions are correct: the evidence presented in this article is probably the result of a technical failure at Yahoo, as opposed to some deliberate attempt to act suspiciously.

    Regards,
    Jeff

  10. Home Theater Tech December 26, 2007 @ 11:33 am

    Well Jeff – at least Yahoo is paying attention to your site. :) I used to see Slurp in my logs frequently – then back in the Spring Yahoo suddenly and inexplicably de-indexed my site. For months I had no pages in the Yahooligan database … then in August they re-indexed my home page and then gradually (over a period of about 3 months) re-added all of my pages. Word around the webster forums was that I was not alone in the deal – a lot of folks were finding their sites totally dropped – but it was disconcerting nonetheless. Nowadays I rarely ever see Slurp visit – but it doesn’t concern me all that much as “Big G” has always delivered at least 100 times the traffic “Little Y” ever did.

    I’m wondering if Yahoo’s suspected technical glitch in their system that was related to your issue also contributed to what happened to myself and numerous other sites…??

  11. Jeff Starr

    Sure, why not! Technology is so vast and convoluted these days that even relatively small glitches on large systems have the potential of causing devastating results. I find it much like the weather, where the flap of a moth’s wing can flood a tribal village on the opposite side of the globe. With a system as large as Google’s or even Yahoo’s, it would not surprise me in the least if a small database hiccup resulted in a cascading avalanche of subtle changes. I mean, with a system that deep and wide, is it even possible to be aware of every glitch and bug and their respective consequences?

  12. I think slurp tried to log into my website’s database. wtf slurp

    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) Error Executing Database Query. [Macromedia][SQLServer JDBC Driver][SQLServer] Cannot open database "-databasename-" requested by the login. The login failed. The error occurred on line 13.

[ Comments are closed for this post ]