Clean Up Malicious Links with HTAccess
I recently spent some time analyzing Perishable Press pages as they appear in the search results for Google, Bing, et al. Google Webmaster Tools provides a wealth of information about crawl errors, as well as the URLs of any pages that link to missing content. Combined with your site’s access/error logs, you have everything needed to track down 404 errors and clean up your listings in the search engine results.
So far so good, but unfortunately not everyone understands and/or practices proper link etiquette, so even if you manage to clean up all of your 404 and other crawl errors, you could see something like this in the search results:
Notice the “scamdex” query string? Apparently Google considers these URLs valid even though there is no matching resource or functionality behind the query string. That is, the pages exist, so Google includes them in the search index. Depending on who/what is linking to you, many of your site’s lesser-ranked pages could be indexed with some random query string appended to the URLs. Yuck.
Sabotage or Ignorance?
How does this happen? Scouring my database and files, I found no trace of any “scamdex” query strings, URLs, or anything else, so what’s up? Turns out that somebody, for whatever reason, posted a link to my site that looks like this:
https://perishablepress.com/?scamdex
So then Googlebot follows the link, sees a valid page, and continues to crawl my site. The problem is that the “scamdex” query string is passed from one link to the next, and eventually your normal URLs are replaced with weird query-string URLs in the search results. Hence the screenshot above.
Consequences of the ill behavior
So the big question is “why append a query string when linking to a URL?” Who knows. But fortunately most people working on the Web understand link etiquette and don’t append weird query strings when linking to your pages.
In this particular case, no real harm was done – and certainly nothing that can’t be fixed – but the problem is that apparently anybody can just append whatever query string they want to your URLs, causing Google to replace your original URLs with irrelevant query-string versions.
What’s the worst that can happen? I suppose the worst that could happen is that someone could link to your site with a threatening or obscene query string. Some random examples:
http://starbucks.com/?overpriced-coffee
http://www.wireless.att.com/?terrible-service
http://www.house.gov/?corrupt-politics
Then as Google crawls and indexes these valid pages, the malicious query string would begin replacing the original URLs in the search results. Granted this is all hypothetical, but as they say, “if it happened to me, it can happen to anyone”. In my case, one scamdex link was all it took for Google to index all sorts of pages with the appended query string. So yeah, definitely something to keep an eye on, and just in case it ever happens to your site, here is how to fix it.
Clean up malicious links with htaccess
There are numerous ways to clean up sloppy incoming links. Here is how I did it with a simple slice of .htaccess:
# CLEAN MALICIOUS LINKS
<IfModule mod_rewrite.c>
RewriteCond %{QUERY_STRING} querystring [NC]
RewriteRule (.*) http://example.com/$1? [R=301,L]
</IfModule>
Just place this into the .htaccess file in your web-accessible root, replace “querystring” with whatever is plaguing you, and also replace “example.com” with your own domain. Adding more query strings is easy: just replace the RewriteCond with something like this:
RewriteCond %{QUERY_STRING} (apples|oranges|bananas) [NC]
…and replace the fruits with whatever query strings you need to block. After implementing this technique, Google et al. got the message and cleared out all but one of the scamdex URLs, and I’m guessing it’s just a matter of time before the results are completely clean.
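For reference, here is a sketch of what the complete block might look like with multiple query strings in place. The “apples|oranges|bananas” pattern and “example.com” are placeholders only; swap in whatever strings are actually plaguing your URLs and your own domain:

# CLEAN MALICIOUS LINKS
<IfModule mod_rewrite.c>
# match any of the unwanted query strings, case-insensitively
RewriteCond %{QUERY_STRING} (apples|oranges|bananas) [NC]
# 301-redirect to the same URL; the trailing "?" strips the query string
RewriteRule (.*) http://example.com/$1? [R=301,L]
</IfModule>

The trailing “?” on the substitution URL is what actually drops the query string from the redirect target, so the clean URL is what the search engines see going forward.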
40 responses to “Clean Up Malicious Links with HTAccess”
My site got swamped with .htaccess files in almost every directory. How did this happen? How can I prevent it in the future? Thanks, MM
Hi Jeff.
I’m finding invented URLs on my website. The URL pattern is interesting and I suspect something dishonest, but I can’t find any written references to it anywhere.
The URL pattern is always the same, but the final item always changes.
It is:
http://www.example.com/com.google.crawl.wmconsole.fe.util.gxp.UrlItem$2@172e05f
Anyway, I’ve been trying to solve this with your trick, but I can’t find the solution.
Perhaps I can fix it somehow?
Thanks
Hi dqeva,
To block these types of requests, we look for the least common character string, which if I’m guessing correctly is something along the lines of util.gxp.UrlItem, so a rule like this should work:

RedirectMatch 403 util.gxp.UrlItem

That should do it, but you may need to trim down the matching pattern, or perhaps get more specific with it. Some experimenting should get you there. Good luck.
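If it helps, that line goes in the root .htaccess file on its own. The sketch below also wraps it in an <IfModule> check for mod_alias (the module that provides RedirectMatch) and escapes the dots so they match literally; neither is strictly required:

# return 403 Forbidden for the bogus crawler-style URLs
<IfModule mod_alias.c>
RedirectMatch 403 util\.gxp\.UrlItem
</IfModule>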
Thank you very much Jeff.
Requests with this type of pattern are now resolved.
I appreciate your help!
Have a nice day!
Hi Jeff,
I just left a comment on a different section of your blog too. An additional one for attention if I may… Bing Webmaster tells me I have malware on my home page. Getting rid of it is one thing, but how do I find it first?
I thought about changing themes but I think that might not work… Any suggestions?
Regards
Terry.a
I have an interesting URL that I am trying to remove. It kind of goes like this:
domain.com/news/?#/...[malicious code]
How can I use .htaccess to fix this issue?