Clean Up Malicious Links with HTAccess

I recently spent some time analyzing Perishable Press pages as they appear in the search results for Google, Bing, et al. Google Webmaster Tools provides a wealth of information about crawl errors, as well as the URLs of any pages that link to missing content. Combined with your site’s access/error logs, you have everything needed to track down 404 errors and clean up your listings in the search engine results.

So far so good, but unfortunately not everyone understands and/or practices proper link etiquette, so even if you manage to clean up all of your 404 and other crawl errors, you could see something like this in the search results:

[ Screenshot: Google Search Results ]

Notice the “scamdex” query string? Apparently Google considers these URLs valid even though there is no matching resource or functionality for that query string. That is, the server ignores the unrecognized query string and returns the normal page, so Google includes the URL in the search index. Depending on who/what is linking to you, many of your site’s lesser-ranked pages could be indexed with some random query string appended to the URLs. Yuck.

Sabotage or Ignorance?

How does this happen? Scouring my database and files, there is no trace of any “scamdex” query strings, URLs, or anything else, so what’s up? Turns out that somebody, for whatever reason, posted a link to my site that looks like this:

https://perishablepress.com/?scamdex

So then Googlebot follows the link, sees a valid page, and continues to crawl my site. The problem is that the “scamdex” query string is passed from one link to the next, and eventually your normal URLs are replaced with weird query-string URLs in the search results. Hence the screenshot above.

Consequences of the ill behavior

So the big question is “why append a query string when linking to a URL?” Who knows. But fortunately most people working on the Web understand link etiquette and don’t append weird query strings when linking to your pages.

In this particular case, no real harm was done – and certainly nothing that can’t be fixed – but the problem is that apparently anybody can just append whatever query string they want to your URLs, causing Google to replace your original URLs with irrelevant query-string versions.

What’s the worst that can happen? I suppose someone could link to your site with a threatening or obscene query string. Some random examples:

  • http://starbucks.com/?overpriced-coffee
  • http://www.wireless.att.com/?terrible-service
  • http://www.house.gov/?corrupt-politics

Then as Google crawls and indexes these valid pages, the malicious query string would begin replacing the original URLs in the search results. Granted, this is all hypothetical, but as they say, “if it happened to me, it can happen to anyone”. In my case, one scamdex link was all it took for Google to index all sorts of pages with the appended query string. So yeah, definitely something to keep an eye on, and just in case it ever happens to your site, here is how to fix it.

Clean up malicious links with htaccess

There are numerous ways to clean up sloppy incoming links. Here is how I did it with a simple slice of .htaccess:

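# CLEAN MALICIOUS LINKS
<IfModule mod_rewrite.c>
 RewriteEngine On
 # Match any request whose query string contains "querystring" (case-insensitive)
 RewriteCond %{QUERY_STRING} querystring [NC]
 # Redirect to the same path; the trailing "?" strips the query string
 RewriteRule (.*) http://example.com/$1? [R=301,L]
</IfModule>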

Just place this in the .htaccess file in your site’s web-accessible root directory, replace “querystring” with whatever is plaguing you, and replace “example.com” with your site’s domain. Adding more query strings is easy: just replace the RewriteCond with something like this:

RewriteCond %{QUERY_STRING} (apples|oranges|bananas) [NC]

Then replace the fruits with whatever query strings you need to remove. After implementing this technique, Google et al got the message and cleared out all but one of the scamdex URLs, and I’m guessing it’s just a matter of time before the results are completely clean.
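
As a concrete illustration, here is a minimal sketch of how the rule might look for the scamdex case described above. It assumes Apache 2.4 or later (for the QSD flag, which discards the query string in place of the trailing question mark) and uses “example.com” as a placeholder domain. It also anchors the pattern so that only a query string consisting of exactly “scamdex” is redirected; the anchoring is an added assumption to avoid touching legitimate query strings that happen to contain the same text, not part of the original rule:

```apache
# Hypothetical variant of the rule, filled in for the "scamdex" case
<IfModule mod_rewrite.c>
 RewriteEngine On
 # Match only when the entire query string is exactly "scamdex" (case-insensitive)
 RewriteCond %{QUERY_STRING} ^scamdex$ [NC]
 # Redirect to the same path; QSD (Apache 2.4+) discards the query string
 RewriteRule (.*) https://example.com/$1 [R=301,L,QSD]
</IfModule>
```

With this in place, a request for /some-page/?scamdex should get a 301 redirect to /some-page/ on the placeholder domain, while URLs with other query strings pass through untouched. If the same .htaccess file contains other rewrite rules (a WordPress permalink block, for example), it is probably safest to put this block above them so the redirect fires before any internal rewriting.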

About the Author
Jeff Starr = Web Developer. Security Specialist. WordPress Buff.

40 responses to “Clean Up Malicious Links with HTAccess”

  1. Michael Miller 2011/04/08 2:00 pm

    My site got swamped with .htaccess files in almost every directory. How did this happen? How can I prevent it in the future? Thanks, MM

  2. Hi Jeff.

    I’m finding invented URLs on my website. The URL pattern is interesting and, I suspect, dishonest; I can’t find any written references to it anywhere.

    The URL pattern is always the same, but the last part always changes.

    It is:

    http://www.example.com/com.google.crawl.wmconsole.fe.util.gxp.UrlItem$2@172e05f

    Anyway, I’m trying to solve this with this trick and I can not find the solution.

    Perhaps I can fix it somehow?

    thanks

    • Hi dqeva,

      To block these types of requests, we look for the least common character string, which if I’m guessing correctly is something along the lines of util.gxp.UrlItem, so a rule like this should work:

      RedirectMatch 403 util.gxp.UrlItem

      that should do it, but you may need to trim down the matching pattern, or perhaps get more specific with it. Some experimenting should get you there. Good luck.
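
      As one possible way to get more specific, the dots can be escaped so they match only literal periods rather than any character. This is a sketch of that variation, not a tested fix for this particular site:

      ```apache
      # Escaped dots (\.) match literal periods only
      RedirectMatch 403 util\.gxp\.UrlItem
      ```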

  3. Thank you very much Jeff.

    Requests for this type of pattern are resolved.

    I appreciate your help!

    Have a nice day!

  4. Hi Jeff,

    I just left a comment on a different section of your blog too. An additional one for attention if I may… Bing webmaster tells me I have malware on Home page. Getting rid of it is one thing, but how do I find it first?

    I thought about changing themes but I think that might not work… Any suggestions???

    Regards
    Terry.a

  5. I have an interesting url that i am trying to remove. It kind of goes like this:

    domain.com/news/?#/...[malicious code]

    How can i use htaccess to fix this issue?
