
Unexplained Crawl Behavior Involving Tagged Query Strings

I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 errors generated by msnbot, Yahoo! Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following:

https://example.com/press/page/2/?tag=spam
https://example.com/press/page/3/?tag=code
https://example.com/press/page/2/?tag=email
https://example.com/press/page/2/?tag=xhtml
https://example.com/press/page/4/?tag=notes
https://example.com/press/page/2/?tag=flash
https://example.com/press/page/2/?tag=links
https://example.com/press/page/3/?tag=theme
https://example.com/press/page/2/?tag=press
Note: For these example URLs, I replaced my domain, perishablepress.com, with the generic example.com. It turns out that listing the plain-text URLs was a mistake, because Google, Bing, Yahoo, et al. keep following them (even though they are not actual links), which in turn results in even more 404 errors.

..plus hundreds and hundreds more. The URL pattern is always the same: a different page number followed by a query string containing one of the tags used here at Perishable Press, for example: “/?tag=something”. The problem is that there are no such links anywhere on the site. The site employs the permalink format for all WordPress-generated links (e.g., post/page links, search queries, tag queries, etc.). In an effort to locate the source of these URLs, I have performed multiple thorough searches through every aspect of my site (files, database, code, examples, etc.), and the results are always the same: nothing. UGH! What is the source of these misguided URLs? Hopefully you will be able to help shed some light on this unexplained crawl behavior (please, I beg you!!).
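
As a rough, hypothetical way to tally such requests straight from the raw server log (the filename access.log and the Apache combined log format are assumptions, not details from my actual setup):

```shell
# Count 404 responses for "?tag=" URLs, grouped by crawler.
# Assumes Apache combined log format: field $7 is the request
# path and field $9 is the HTTP status code.
awk '$9 == 404 && $7 ~ /\?tag=/ {
  line = tolower($0)
  if      (line ~ /msnbot/) hits["msnbot"]++
  else if (line ~ /slurp/)  hits["slurp"]++
  else                      hits["other"]++
}
END { for (agent in hits) print hits[agent], agent }' access.log
```

Each qualifying line is counted once under whichever agent name appears in it, so the output is a quick per-crawler summary of the bogus tag requests.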

More Information..

Many of these errors show up in the 404 log every day. They are almost always from either msnbot or Slurp — I have never seen one from googlebot — but occasionally some other, unknown agent is involved. The requests happen in chronologically sporadic fashion; that is, relatively significant periods of time pass between requests — they seldom occur in immediate succession. Further, all of the queried tags are indeed used within the site, but they are all (as far as I know) referred to via permalink-formatted URLs. None of the tag archives are referred to via query string. In the past, however, all tags were referred to via query strings (not permalinks). I forget the cause of the change, but this is something I plan to investigate further..

Anything and Everything..

Several weeks ago, I finally threw my hands up and decided to just block the spiders from requesting the query-based tag URLs to begin with. Supposedly, both Yahoo and MSN obey robots.txt directives, so I added the following, admittedly redundant rules:

Disallow: */?tag=*
Disallow: */*?*
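
For what it’s worth, wildcard matching in Disallow is a nonstandard extension (honored by Yahoo and MSN, among others), and Disallow paths conventionally begin at the site root. A sketch of a fuller block, assuming the rules should apply to every user agent:

```
User-agent: *
Disallow: /*?tag=
Disallow: /*?
```

Note that the second rule blocks every query-string URL, including legitimate requests such as searches, so it casts a very wide net.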

As I said, that was several weeks ago, so any positive results may simply not have manifested yet. In the meantime, while waiting for the spiders to check and follow the new robots.txt directives, here are a few of the other things that I have tried or considered:

  • examination of other sites under the root domain on the same server
  • web search for similar stories, articles, discussions, comments, clues, etc.
  • thorough Google/Yahoo/MSN search for anything remotely resembling the non-existent tagged query-string URLs
  • using mod_rewrite to pattern match and redirect “tag” query strings to their permalink counterparts
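
For that last item, a minimal sketch of such a rewrite, assuming WordPress’s default tag base (“tag”) — adjust the target path to match your own permalink settings:

```
# Hypothetical sketch: permanently redirect "?tag=foo" requests
# to the permalink /tag/foo/ (assumes the tag base is "tag")
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{QUERY_STRING} ^tag=([a-zA-Z0-9-]+)$ [NC]
  RewriteRule .* /tag/%1/? [R=301,L]
</IfModule>
```

The trailing “?” in the substitution strips the original query string, so the redirected URL is a clean permalink and no rewrite loop occurs.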

Unfortunately, my efforts to solve this mystery have yielded zero fruits. What I need at this point are more leads to investigate. Hopefully I have explained the issue well enough to get your wheels turning, but if something isn’t clear, let me know. Otherwise, if you have any ideas or clues at all, please leave a comment or contact me directly. Thanks!

About the Author: Jeff Starr = Creative thinker. Passionate about the free and open Web.
6 responses
  1. SneakyWho_am_i September 26, 2008 @ 4:10 pm

    I’m guessing they are taking information they’ve learned from some other site and trying to use it to navigate here, even though it fails every time and is totally wrong. The spiders don’t leave referrer information, do they?

    I see a lot of strange things on my site… MSNBot is ever-present but never doing something it’s actually allowed to do.

    The Google AdSense bot actually follows users to every page they visit, especially when the links are visible only to registered users (which Mediapartners-Google is NOT right now)

    Strangely, even though the bot has no way to see onto the hidden pages, the ads are still relevant…

  2. Jeff Starr

    Yes, that seems to be the best explanation.. A while ago, I discovered some suspicious behavior where the Slurp crawler seemed to be using keywords from a few of my other sites (on the same server). The keywords were appended to existing URLs for this site, resulting in a slew of 404 errors. This behavior is either an error or deliberate — each disconcerting for its own reasons. And don’t even get me started about Yahoo disobeying robots.txt rules..

  3. i have a problem with the yahoo bot..
    it bombs me with queries like
    mysite/x22NJIptSlouvkiJG/default.html
    or
    mysite/listings.html?word=x22NJIptSlouvkiJG

    these queries are random

    so do you think these lines will work in my case?

    Disallow: */*?word=*
    Disallow: */*/default.html

    thanks. i await your answer

  4. Jeff Starr

    Hi yli, if Yahoo obeyed robots.txt directives, then yes, that would work. Unfortunately, Yahoo has shown a consistent pattern of disobeying robots.txt directives.

    I have actually considered using something like htaccess to block Yahoo Slurp, but Yahoo continues to send modest amounts of traffic. If that ever changes, they’re blacklisted.

    But there is no need to block Yahoo. Instead, we can block access to those random, nonexistent directories. Here is how to use htaccess to block access to those directories for any request (i.e., from any user agent):

    # deny any request whose query string contains "word="
    RewriteCond %{QUERY_STRING} word= [NC]
    RewriteRule .* - [F,L]
    # deny any request for a default.html file
    RedirectMatch 403 default\.html

    For more information on blacklisting different kinds of requests, check out some of my other articles.

  5. thanks man. i will do this.
    in my case these nonexistent keywords are inserted into the db and then shown in a tag cloud, and then google crawls me with them, which creates non-relevant pages.
    thanks again.

  6. Chris Vendilli April 16, 2013 @ 9:17 am

    I know I’m a little late to the party, but Google WMT might shed some light on this if you can see whether any other sites are linking to a tag URL that doesn’t exist and getting the bots confused. Considering they are finding all sorts of wrongly formatted URLs, perhaps it was a sitemap issue?

    I’m curious to hear if you ever figured this one out Jeff…
