Clean Up Malicious Links with HTAccess
I recently spent some time analyzing Perishable Press pages as they appear in the search results for Google, Bing, et al. Google Webmaster Tools provides a wealth of information about crawl errors, as well as the URLs of any pages that link to missing content. Combined with your site’s access/error logs, you have everything needed to track down 404 errors and clean up your listings in the search engine results.
So far so good, but unfortunately not everyone understands and/or practices proper link etiquette, so even if you manage to clean up all of your 404 and other crawl errors, you could see something like this in the search results:
Notice the “scamdex” query string? Apparently Google considers these URLs valid even though there is no matching resource or functionality for specific query strings. That is, the pages exist, so Google includes them in the search index. Depending on who/what is linking to you, many of your site’s lesser-ranked pages could be indexed with some random query string appended to the URLs. Yuck.
Sabotage or Ignorance?
How does this happen? Scouring my database and files, I found no trace of any “scamdex” query strings, URLs, or anything else, so what’s up? Turns out that somebody, for whatever reason, posted a link to my site that looks like this:
https://perishablepress.com/?scamdex
So then Googlebot follows the link, sees a valid page, and continues to crawl my site. The problem is that the “scamdex” query string is passed from one link to the next, and eventually your normal URLs are replaced with weird query-string URLs in the search results. Hence the screenshot above.
Consequences of the ill behavior
So the big question is “why append a query string when linking to a URL?” Who knows. But fortunately most people working on the Web understand link etiquette and don’t append weird query strings when linking to your pages.
In this particular case, no real harm was done – and certainly nothing that can’t be fixed – but the problem is that apparently anybody can just append whatever query string they want to your URLs, causing Google to replace your original URLs with irrelevant query-string versions.
What’s the worst that can happen? I suppose the worst that could happen is that someone could link to your site with a threatening or obscene query string. Some random examples:
http://starbucks.com/?overpriced-coffee
http://www.wireless.att.com/?terrible-service
http://www.house.gov/?corrupt-politics
Then as Google crawls and indexes these valid pages, the malicious query string would begin replacing the original URLs in the search results. Granted, this is all hypothetical, but as they say, “if it happened to me, it can happen to anyone”. In my case, one scamdex link was all it took for Google to index all sorts of pages with the appended query string. So yeah, definitely something to keep an eye on, and just in case it ever happens to your site, here is how to fix it.
Clean up malicious links with htaccess
There are numerous ways to clean up sloppy incoming links. Here is how I did it with a simple slice of .htaccess:
# CLEAN MALICIOUS LINKS
<IfModule mod_rewrite.c>
# match the offending query string (case-insensitive)
RewriteCond %{QUERY_STRING} querystring [NC]
# redirect to the same URL with the query string stripped (the trailing "?")
RewriteRule (.*) http://example.com/$1? [R=301,L]
</IfModule>
Just place this into your web-accessible root .htaccess file, replace “querystring” with whatever is plaguing you, and replace “example.com” with your site URL. Adding more query strings is easy; just replace the RewriteCond with something like this:
RewriteCond %{QUERY_STRING} (apples|oranges|bananas) [NC]
…and replace the fruits with whatever query strings you need to block. After implementing this technique, Google et al. got the message and cleared out all but one of the scamdex URLs, and I’m guessing it’s just a matter of time before the results are completely clean.
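One refinement worth noting: the RewriteCond pattern matches anywhere in the query string, so if false positives against legitimate query strings are a concern, you can anchor the pattern. A minimal sketch, using the scamdex string from this post (example.com remains a placeholder for your own domain):

# CLEAN MALICIOUS LINKS (anchored variant)
<IfModule mod_rewrite.c>
# match only when the query string is exactly "scamdex"
RewriteCond %{QUERY_STRING} ^scamdex$ [NC]
RewriteRule (.*) http://example.com/$1? [R=301,L]
</IfModule>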
40 responses to “Clean Up Malicious Links with HTAccess”
Great little article and a good tale of what can result from some careful analysis and clever remedial action.
Here’s to ‘clean’ links.
Thanks for the info. I was wondering if you could do some posts on creating a good htaccess file for WordPress sites.
Thanks again for the info
Yes indeed, I’ve got something along those lines in store for a soon-future article. Excellent idea ;)
‘Yes indeed, I’ve got something along those lines in store for a soon-future article. Excellent idea ;)’
Yes!
A lot of websites out there have bits and pieces. A few have information overload. Some, I don’t think they know what they are talking about… copy and paste, perhaps. Definitely like this idea.
Question: why does this new layout and server you have appear waterfally (or whatever you may call it)? Is it because of the VPS, or CS3 and my browser? Or perhaps some other code?
Awesome, great to hear the interest! :)
I’ve actually got many htaccess posts to share, but one of the next ones will be something like a solid WP htaccess template with some optional features. Gotta do it right though.
About the new design, others have mentioned jagged/rough scrolling in some browsers on various systems, but I’m still working on pinning it down. What makes it difficult is that I can’t replicate the issue, so it’s kinda just guessing at stuff ;)
Jeff, I’ve had 100s of different links of this variety:
/wp-content/view.php?q=cauliflower+ear
attached to my site. Should I use the /wp-content/view.php as the query string? Or what? Thanks.

This will stop all requests for anything remotely resembling a cauliflower ear:
# CLEAN MALICIOUS LINKS
<IfModule mod_rewrite.c>
RewriteCond %{QUERY_STRING} cauliflower\+ear [NC]
RewriteRule (.*) http://example.com/$1? [R=301,L]
</IfModule>
The term “view.php” is in the request, but not in the query string. But it doesn’t matter, because you’ll stop them all with this code.

But I have many queries with different endings, all beginning with mysite.com/wp-content/view.php?q=

So do I need to do it for each one?
Ah gotcha. Yeah, then it all pretty much depends on whether or not your site is using a file named “view.php” located in the wp-content directory. If it is not, then stopping the nonsense is even easier; just use something like this:

RedirectMatch 401 /wp-content/view\.php

Otherwise, a more careful analysis of the malicious request patterns is required to formulate an effective strategy.
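For the record, here is a rough sketch of how those view.php requests could be targeted with mod_rewrite instead; the q= pattern is an assumption based on the example links above, so test before relying on it:

# BLOCK BOGUS view.php QUERY REQUESTS
<IfModule mod_rewrite.c>
# match requests for wp-content/view.php that carry a q= query string
RewriteCond %{QUERY_STRING} ^q= [NC]
RewriteRule ^wp-content/view\.php$ - [F,L]
</IfModule>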
You should have asked me!
I’m a fan of PP and have used your information to hand-craft my .htaccess file to weed out spammers, hackers and exploiters. Any unfound links on my website are emailed to me so that I can take remedial action if I see any content theft or major scripting going on.
When I put a link on my website pointing to a friendly site, I often append the querystring ‘?scamdex’ just so that the website owner, looking at his logs will notice that a lot of incoming links are in fact coming from me (and he will hopefully think what a nice guy I am for sending customers his way).
You do use advertising on your site and I am a Google Page Rank 5 site so incoming links from me should ‘lift both our boats’ as the saying goes.
I had no intent to abuse you and if you want, I will remove my link to you but perhaps we should talk?
Hey Mark,
Yes I really should have thought of contacting you about the link, but it’s really not a big deal in this case. I didn’t name the source of the link in the post, but people who know your site will certainly get the reference. Your site is a great resource, but for some reason seeing that phrase “scam”-whatever appended to my search engine URLs is just kinda spooky. Then again I’m pretty paranoid about stuff, but seriously no harm or offense intended.
As for why you (or anyone) does that, I get it, but I’m getting the same information from the referrer fields in my access logs. Analytical/statistical software shows you the same thing – you can look at who is sending you what kind of traffic, so when cool people link and send traffic, it’s pretty easy to tell. IMHO appending names (or anything) to other people’s URLs is just unnecessary. But that’s just my view: keep it clean, keep it simple, use existing tools to do the job, etc.
This would be an interesting conversation indeed. Shoot me an email to set it up.
This is off topic from your post, but it concerns your site.

So I am using Win XP SP3 on an older laptop with Firefox 3.6.13. My problem is that trying to scroll down on your site is just about impossible.

I have had this problem on other sites that use a huge image as a background, and as a temporary fix I use Adblock Plus to block that image. Once that is done I can scroll with no stutter or sluggishness at all.

Now with your new design I removed the background image (bg1.png) and no change. The scrolling problem is still there.

Oh, now I see the image that needs to be blocked, but you turned it into a data URL.
Any idea as to why those bg images cause my scrolling problem??
Any help at all would be GREATLY APPRECIATED!

Thank you.
Hey, thanks for the heads up about this. I have heard other reports, but cannot replicate the issue, so there is little I can do to fix it. But I am looking for ideas, and for now have removed a large background image, bg2.png, from the design. It wasn’t a data URL, but it was a relatively huge file that could have been slowing things down on older machines. Let me know if it helps. Thank you.

I have blocked both the bg1 & bg2 PNGs and the scrolling problem is still there…
After testing more, I now know it’s a “background” image causing the problem (not the bg1 or bg2 PNGs). I used the WebDev add-on for FF, and under the Images menu I set it to “hide background images” and unblocked the bg1 & bg2 PNGs… BAM!

Scrolling works perfectly again.

Which leads me to believe that it is one of the data URL images causing the problem.
Sidenote: what happened to the option to get notification emails when new comments have been posted??
Thanks for the help! :)
I’m going to go ahead and replace the data URLs with images and we’ll see how it goes. Note that bg2.png was removed a while ago, but there is a second background image, bg0.png (bg-zero), still in use.
If you get a chance to test further (after about 10 minutes from now), let me know if it’s working. I would love to get everything scrolling smoothly for everyone.
Sidenote: That functionality is still available for all of my old posts, but I haven’t installed the plugin yet for this new WP install. Will get to it eventually. Side-sidenote: I’m surprised Subscribe to Comments hasn’t yet been integrated into the core. The WP devs have certainly adopted lesser functionality, and it would be awesome to not have to install it on every site.
</opinion>
OK, to get smooth scrolling fixed I used 2 Adblock Plus filters to block 2 images: bg1.png & bg0.png
|https://perishablepress.com/perish/wp-content/themes/tether/images/bg1.png$domain=perishablepress.com
|https://perishablepress.com/perish/wp-content/themes/tether/images/bg0.png$domain=perishablepress.com
Once those 2 images are blocked, BAM! PerishablePress scrolls like buttaa :)
So your background may not be as stylish, but the content more than makes up for it 100 times over.
Excellent information – Thanks again for the help :)
I’m thinking it might be an issue with the fixed property applied to both of the background images. Is there a way to set up a brief time frame to test this? I’ll just remove the fixed properties for a few minutes while you check the scrolling?

I can’t use this laptop at work, otherwise I would say any time during the day.
We can do it right now if you’re awake…
So how about 2/23/2011 at 7pm cst?
Hit me on twitter @digideth
Sounds like a plan: I’ll remove the fixed properties just before 7:00 CST, and then leave it that way until your report or for an hour or so, whichever happens first.

Thanks again for your help with this :)
YES! Removing the fixed properties did the trick. Scrolls like butta with nothing blocked!
Can you do a post on this issue please?
You are the first web dev who actually tried to fix the problem, out of about 10 sites that I have reported this problem to. Here’s the kicker… the majority of those sites discussed web dev & design. I couldn’t even get a reply from them, so when you acted on it I was jumping for joy and was happy to help you fix the problem.
Thanks for caring!
Awesome, I am stoked to get this figured out. Smooth scrolling is an essential part of good usability, so Thanks again for the help.
As tweeted, I’m going to replace the 2 background images with the original data URLs and let’s see what happens.
Update: I forgot that only one of the two bg images was a data URL. So now the current setup is one actual image, one data URL, and no fixed properties.
My thoughts exactly on the usability!

If you put the data URL images back and there is no change, then the scroll problem is solved!

So how many data URL images are you using now? 15?
So it looks like the scrolling issue is entirely related to the fixed positioning. The question now is: was it the combination of two fixed backgrounds, or will scrolling lag even with only one fixed property?

I went ahead and added fixed to the bg0.png data URL (for the HTML element). Let me know if scrolling breaks.

Thanks again :)
Scrolls just fine still. Nothing broke on that last change.
A curious phenomenon. Why didn’t the canonical tag prevent this?
Should it have? I’m not using them on a majority of pages, but if canonical tags work to prevent it, maybe that’s a good alternative to using the htaccess directives.
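For what it’s worth, canonical hints can also be sent at the server level. Here is a rough sketch using mod_headers (assuming Apache with mod_headers enabled; example.com is a placeholder, and a real site would need per-page values, which is why the in-page tag is more common):

<IfModule mod_headers.c>
# send a canonical Link header (static example for a single URL)
Header set Link '<http://example.com/>; rel="canonical"'
</IfModule>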
Why does the URL in the RewriteRule end with a question mark? Wouldn’t this result in URLs redirected to “example.com/?”?

That’s what’s doing the work. It tells Apache to remove the query string from the destination URL. The bare trailing ? replaces the query string with an empty one, and mod_rewrite drops an empty ? entirely, so the redirect goes to example.com/ without it.
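As a side note, on Apache 2.4 and later the QSD (query string discard) flag accomplishes the same thing more explicitly, with no trailing ? needed:

RewriteRule (.*) http://example.com/$1 [QSD,R=301,L]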
Oh, and what about a method to get rid of ALL query strings? This would — of course — only work for sites which never ever use any querystring.
That would look like this:
RewriteCond %{QUERY_STRING} .
RewriteRule (.*) http://example.com/$1? [R=301,L]

This will eliminate query strings for all requests. The RewriteCond fires only when a query string is actually present, which keeps the rule from redirecting every single request (and looping).
How can ?scamdex be indexed in Google?

Is it due to your sitemap XML?
I think that Google Webmaster Tools lets you see which parameters are used in your indexed URLs, and choose which ones to keep, ignore, or let Google decide for itself.

This is another way, but it can also be used in combination with your technique: you just use this tool to detect the malicious parameters in this case.

Anyway, great information here as usual, you are an eye opener! Thanks.
Jeff, thanks for another great post. I’ve learned much from you. Question on this one though… since this seems like a search problem, couldn’t you simply block Google from indexing the bogus URLs using robots.txt? Example: Disallow: /*?
Yes, but Google already does a decent job of that, plus you can just use canonical tags to effectively do the same thing. Also, htaccess gives you the benefit of actually keeping the bad bots away from your site. Not so much with robots.txt.
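To illustrate that last point, here is a rough sketch of the kind of bot blocking robots.txt can only politely request; the user agent names are hypothetical placeholders, not a recommended blocklist:

# DENY BAD BOTS (robots.txt asks nicely; this actually blocks)
<IfModule mod_rewrite.c>
RewriteCond %{HTTP_USER_AGENT} (badbot|evilscraper) [NC]
RewriteRule .* - [F,L]
</IfModule>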