Clean Up Malicious Links with HTAccess


I recently spent some time analyzing Perishable Press pages as they appear in the search results for Google, Bing, et al. Google Webmaster Tools provides a wealth of information about crawl errors, as well as the URLs of any pages that link to missing content. Combined with your site’s access/error logs, you have everything needed to track down 404 errors and clean up your listings in the search engine results.

So far so good, but unfortunately not everyone understands and/or practices proper link etiquette, so even if you manage to clean up all of your 404 and other crawl errors, you could see something like this in the search results:

[ Screenshot: Google Search Results ]

Notice the “scamdex” query string? Apparently Google considers these URLs valid even though there is no matching resource or functionality for specific query strings. That is, the pages exist, so Google includes them in the search index. Depending on who/what is linking to you, many of your site’s lesser-ranked pages could be indexed with some random query string appended to the URLs. Yuck.

Sabotage or Ignorance?

How does this happen? I scoured my database and files and found no trace of any “scamdex” query strings, URLs, or anything else, so what’s up? Turns out that somebody, for whatever reason, posted a link to my site that looks like this:

http://perishablepress.com/?scamdex

So then Googlebot follows the link, sees a valid page, and continues to crawl my site. The problem is that the “scamdex” query string is passed from one link to the next, and eventually your normal URLs are replaced with weird query-string URLs in the search results. Hence the screenshot above.

Consequences of the ill behavior

So the big question is “why append a query string when linking to a URL?” Who knows. But fortunately most people working on the Web understand link etiquette and don’t append weird query strings when linking to your pages.

In this particular case, no real harm was done – and certainly nothing that can’t be fixed – but the problem is that apparently anybody can just append whatever query string they want to your URLs, causing Google to replace your original URLs with irrelevant query-string versions.

What’s the worst that can happen? I suppose someone could link to your site with a threatening or obscene query string. Some random examples:

  • http://starbucks.com/?overpriced-coffee
  • http://www.wireless.att.com/?terrible-service
  • http://www.house.gov/?corrupt-politics

Then as Google crawls and indexes these valid pages, the malicious query string would begin replacing the original URLs in the search results. Granted, this is all hypothetical, but as they say, “if it happened to me, it can happen to anyone”. In my case, one scamdex link was all it took for Google to index all sorts of pages with the appended query string. So yeah, definitely something to keep an eye on, and just in case it ever happens to your site, here is how to fix it.

Clean up malicious links with htaccess

There are numerous ways to clean up sloppy incoming links. Here is how I did it with a simple slice of .htaccess:

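# CLEAN MALICIOUS LINKS
<IfModule mod_rewrite.c>
 RewriteEngine On
 # Match any request whose query string contains "querystring" (case-insensitive)
 RewriteCond %{QUERY_STRING} querystring [NC]
 # Capture the requested path as $1; the trailing "?" drops the query string
 RewriteRule (.*) http://example.com/$1? [R=301,L]
</IfModule>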

Just place this into the .htaccess file in your site’s web-accessible root directory, replace “querystring” with whatever is plaguing you, and replace “example.com” with your site URL. Adding more query strings is easy: just replace the RewriteCond with something like this:

RewriteCond %{QUERY_STRING} (apples|oranges|bananas) [NC]

Then replace the fruits with whatever query strings you need. After implementing this technique, Google et al got the message and cleared out all but one of the scamdex URLs, and I’m guessing it’s just a matter of time before the results are completely clean.
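
For reference, here is a rough sketch of how the pieces might fit together in one block. It rests on a few assumptions that are not part of my original fix: the fruit names and “example.com” are still placeholders, the pattern is anchored with ^ and $ so that only a query string consisting entirely of one of the listed words triggers the redirect (a legitimate query string that merely contains the word is left alone), and the server runs Apache 2.4 or later, where the QSD flag discards the query string.

# CLEAN MALICIOUS LINKS (combined sketch, placeholders throughout)
<IfModule mod_rewrite.c>
 RewriteEngine On
 # Match only when the entire query string is one of the listed words
 RewriteCond %{QUERY_STRING} ^(apples|oranges|bananas)$ [NC]
 # Redirect to the same path; QSD (Apache 2.4+) drops the query string
 RewriteRule (.*) http://example.com/$1 [QSD,R=301,L]
</IfModule>

On older Apache versions without QSD, ending the substitution with a question mark (http://example.com/$1?) should have the same effect. Whether to anchor the pattern is a judgment call: the anchored form is safer for sites that use real query strings, while the unanchored form also catches variations such as “?scamdex=1”.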