I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 errors generated by msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following:
https://example.com/press/page/2/?tag=spam
https://example.com/press/page/3/?tag=code
https://example.com/press/page/2/?tag=email
https://example.com/press/page/2/?tag=xhtml
https://example.com/press/page/4/?tag=notes
https://example.com/press/page/2/?tag=flash
https://example.com/press/page/2/?tag=links
https://example.com/press/page/3/?tag=theme
https://example.com/press/page/2/?tag=press
Note: the URLs above originally used perishablepress.com, which I have replaced with the generic example.com. It turns out that listing the plain-text URLs was a mistake, because Google, Bing, Yahoo, et al. keep following them (even though they are not actual links), which in turn results in even more 404 errors.
..plus hundreds and hundreds more¹. The URL pattern is always the same: a page number followed by a query string containing one of the tags used here at Perishable Press, for example: “/?tag=something”. The problem is that there are no such links anywhere on the site. The site uses permalink format for all WordPress-generated links (post/page links, search queries, tag queries, and so on). In an effort to locate the source of these URLs, I have performed multiple, thorough searches through every aspect of my site (files, database, code, examples, etc.), and the results are always the same: nothing. UGH! What is the source of these misguided URLs? Hopefully, you will be able to help shed some light on this unexplained crawl behavior (please, I beg you!!).
Many of these errors show up in the 404 log every day. They are almost always from either msnbot or Slurp² (I have never seen one from googlebot), but occasionally some other, unknown agent is involved. The requests arrive sporadically: relatively long periods of time pass between them, and they seldom occur in immediate succession. Further, all of the queried tags are indeed used within the site, but they are all (as far as I know) referred to using permalink-formatted URLs. None of the tag archives are referred to via query string. In the past, however, all tags were referred to via query strings (and not permalinks). I forget the cause of the change, but this is something I plan to investigate further.
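For anyone who wants to look for the same pattern in their own logs, here is a minimal Python sketch that tallies 404 responses for paged “?tag=” URLs by user agent. It assumes the server writes Apache's Combined Log Format; the function name and the sample lines are made up for illustration and are not taken from the actual log.

```python
import re
from collections import Counter

# Assumption: Apache Combined Log Format, i.e.
# IP identity user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The mystery pattern: a page number followed by a "tag" query string.
TAG_QUERY = re.compile(r'/page/\d+/\?tag=[^&\s]+')


def tally_tag_404s(lines):
    """Count 404 responses for paged ?tag= URLs, grouped by user agent."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if match.group("status") == "404" and TAG_QUERY.search(match.group("url")):
            counts[match.group("agent")] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Jan/2008:10:00:00 +0000] "GET /press/page/2/?tag=spam HTTP/1.1" 404 512 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.20 - - [01/Jan/2008:13:30:00 +0000] "GET /press/page/3/?tag=code HTTP/1.0" 404 512 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.10 - - [01/Jan/2008:17:45:00 +0000] "GET /press/tag/spam/ HTTP/1.1" 200 9000 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.10 - - [02/Jan/2008:02:15:00 +0000] "GET /press/page/2/?tag=email HTTP/1.1" 404 512 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"',
]

if __name__ == "__main__":
    for agent, hits in tally_tag_404s(SAMPLE_LINES).most_common():
        print(hits, agent)
```

Grouping by the full user-agent string keeps the sketch simple; in practice it may be more useful to group by a short crawler name or by IP address.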
Anything and Everything..
Several weeks ago, I finally threw my hands up and decided to just block the spiders from requesting the query-based tag URLs to begin with. Supposedly, both Yahoo and MSN obey robots.txt directives, so I added the following, admittedly redundant rules:
Disallow: */?tag=*
Disallow: */*?*
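As a rough sanity check that these two patterns cover the offending URLs, here is a small Python sketch of the wildcard matching that the major crawlers describe as an extension to robots.txt, where an asterisk matches any run of characters and a trailing dollar sign anchors the end of the URL. The function names are hypothetical, and individual spiders may differ in the details, so treat this as an approximation rather than a description of how msnbot or Slurp actually behaves.

```python
import re

# The two Disallow patterns from the rules above.
RULES = ["*/?tag=*", "*/*?*"]


def robots_pattern_to_regex(pattern):
    """Translate a wildcard robots.txt pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the URL. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query, patterns=RULES):
    """Return True if any pattern matches the given path plus query string."""
    return any(robots_pattern_to_regex(p).match(path_and_query) for p in patterns)


if __name__ == "__main__":
    for candidate in ("/press/page/2/?tag=spam", "/press/tag/spam/"):
        print(candidate, is_disallowed(candidate))
```

Under this interpretation, the second rule already matches every URL that contains a query string, which is why the first rule is redundant.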
As I said, that was several weeks ago, so the new rules may simply not have had time to take effect. In the meantime, while waiting for the spiders to check and follow the new robots.txt directives, here are a few of the other things that I have tried or considered:
- examination of other sites under the root domain on the same server
- web search for similar stories, articles, discussions, comments, clues, etc.
- thorough Google/Yahoo/MSN search for anything remotely resembling the non-existent tagged query-string URLs
- mod_rewrite to pattern match and redirect “tag” query strings to their permalink counterparts
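The redirect idea in the last item boils down to a URL mapping, which can be sketched in Python before committing it to mod_rewrite rules. This sketch assumes that tag archives live at /press/tag/NAME/ and that paged archives append page/N/; that layout is an assumption for illustration, as is the function name, so adjust both to the real permalink structure.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches an optional base path followed by /page/N/ at the end of the path.
PAGED_PATH = re.compile(r'^(?P<base>.*?)/page/(?P<page>\d+)/?$')


def tag_query_to_permalink(url):
    """Map a '?tag=' URL to its assumed permalink counterpart.

    Returns None when the URL carries no 'tag' query parameter.
    Assumed layout: <base>/tag/<name>/ and <base>/tag/<name>/page/<n>/.
    """
    parts = urlsplit(url)
    tags = parse_qs(parts.query).get("tag")
    if not tags:
        return None
    tag = tags[0]

    match = PAGED_PATH.match(parts.path)
    if match:
        base, page = match.group("base"), int(match.group("page"))
    else:
        base, page = parts.path.rstrip("/"), 1

    target = f"{base}/tag/{tag}/"
    if page > 1:
        target += f"page/{page}/"
    return f"{parts.scheme}://{parts.netloc}{target}"


if __name__ == "__main__":
    print(tag_query_to_permalink("https://example.com/press/page/2/?tag=spam"))
```

A permanent (301) redirect to the returned URL would tell the spiders to drop the query-string form, but whether the requested page number actually exists in the permalink archive is a separate question that this sketch does not answer.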
Unfortunately, my efforts to solve this mystery have borne no fruit. What I need at this point are more leads to investigate. Hopefully I have explained the issue well enough to get your wheels turning, but if something isn’t clear, let me know. Otherwise, if you have any ideas or clues at all, please leave a comment or contact me directly. Thanks!
- 1 More examples of this behavior may be seen in this log excerpt.
- 2 As verified via forward-reverse IP lookup.