How to Deal with Content Scrapers
Chris Coyier of CSS-Tricks recently declared that people should do “nothing” in response to other sites scraping their content. I totally get what Chris is saying here. He is basically saying that the original source of content is better than scrapers because:
- it’s on a domain with more trust.
- you published that article first.
- it’s coded better for SEO than theirs.
- it’s better designed than theirs.
- it isn’t at risk for serious penalization from search engines.
If these things are all true, then I agree, you have nothing to worry about. Unfortunately, that’s a tall order for many sites on the Web today. Although most scraping sites are pure and utter crap, the software available for automating the production of decent-quality websites is getting better every day. More and more I’ve been seeing scraper sites that look and feel authentic because they are using some sweet WordPress theme and a few magical plugins. In the past, it was easy to spot a scraper site, but these days it’s getting harder to distinguish between scraped and original content. Not just for visitors, but for search engines too.
Here are some counter-points to consider:
- Many new and smaller/less-trafficked sites have no more trust than most scraper sites
- It’s trivial to fake an earlier publication date, and search engines can’t tell the difference
- A well-coded, SEO-optimized WP theme is easy to find (often for free)
- There are thousands of quality-looking themes that scrapers can install (again for free, see previous point)
- If the search engines don’t distinguish between original and scraper sites, your site could indeed be at risk for SE penalization
To quote Chris, instead of worrying about scraper sites..
..you could spend that time doing something enjoyable, productive, and ultimately more valuable for the long-term success of your site.
I agree that pursuing and stopping content scrapers is not as fun as some things. But unless your site is well-built, well-known, and well-trusted, doing nothing about content scrapers may not be the best solution. Certainly an awesome site such as CSS-Tricks doesn’t need to do anything about scraped content because it is a well-coded, well-optimized, well-known and trusted domain. But most sites are nowhere near that level of success, and thus should have some sort of strategy in place for dealing with scraped content. So..
How to Deal with Content Scrapers
So what is the best strategy for dealing with content-scraping scumbags? My personal three-tiered strategy includes the following levels of action:
- Do nothing.
- Always include lots of internal links
- Stop the bastards with a well-placed slice of htaccess
These are the tools I use when dealing with content scrapers. For bigger sites like DigWP.com, I agree with Chris that no action is really required. As long as you are actively including plenty of internal links in your posts, scraped content equals links back to your pages. For example, getting a link in a Smashing Magazine article instantly provides hundreds of linkbacks thanks to all of the thieves and leeches stealing Smashing Mag’s content. Sprinkling a few internal links throughout your posts benefits you in some fantastic ways:
- Provides links back to your site from stolen/scraped content
- Helps your readers find new and related pages/content on your site
- Makes it easy for search engines to crawl deeply into your site
So do nothing if you can afford not to worry about it; otherwise, get in the habit of adding lots of internal links to take advantage of the free link juice. This strategy works great unless you start getting scraped by some of the more sinister sites. In which case..
Kill ’em all
So you’re trying to be cool and let scrapers be scrapers. You’re getting some free linkbacks so it’s all good, right? Not if the scrapers are removing the links from your content. Some of the more depraved scrapers will actually run your content through a script to strip out all of the hyperlinks. Then all of your hard work is benefiting some grease-bag out there and you’re getting NOTHING in return. Fortunately there is a quick, easy way to stop content scrapers from using your feeds to populate their crap sites.
HTAccess to the rescue!
If you want to stop some moron from stealing your feed, check your access logs for their IP address and then block it with something like this in your root htaccess file (or Apache configuration file):
Deny from 123.456.789
That will stop them from stealing your feed from that particular IP address. You could also do something like this:
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^123\.456\.789\.
RewriteRule .* http://dummyfeed.com/feed [R,L]
Instead of blocking requests from the scraper, this code delivers some “dummy” content of your choosing. You can do anything here, so get creative. For example, you might send them oversize feed files filled with Lorem Ipsum text. Or maybe you prefer to send them some disgusting images of bad things. You could also send them right back to their own server – always fun to watch infinite loops crash a website. Of course, there are other ways to block and redirect the bad guys if the IP isn’t available.
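Here is a rough sketch of two variations on the rules above, under assumptions that are not part of the original snippets. The first block is the Apache 2.4 way of writing the same IP block, for servers where the older Deny syntax is not enabled. The second redirects only feed requests, and matches on the user agent instead of the IP. The address 192.0.2 is a reserved documentation range used as a placeholder, "BadScraperBot" is a made-up agent name, and the feed path assumes a typical WordPress permalink setup.

```apache
# Variation 1 (assumes Apache 2.4+): block a placeholder address range.
# 192.0.2 is a documentation range; swap in the IP from your access logs.
<RequireAll>
    Require all granted
    Require not ip 192.0.2
</RequireAll>

# Variation 2 (assumes mod_rewrite and a WordPress-style /feed/ URL):
# send only feed requests from a hypothetical scraper agent to the dummy feed.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # "BadScraperBot" is illustrative only; use the agent string from your logs
    RewriteCond %{HTTP_USER_AGENT} BadScraperBot [NC]
    RewriteRule ^feed/?$ http://dummyfeed.com/feed [R,L]
</IfModule>
```

On a WordPress site, rewrite rules like these generally need to sit above the block that WordPress writes to the htaccess file, otherwise the request is handed to index.php before your rule gets a chance to match.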
Bottom Line
If you don’t have to worry about scrapers then don’t. In my experience that’s a luxury that we can’t always afford, especially for newer, fledgling sites that are still building up content and establishing reputation. Fortunately, you can protect your content quickly and easily with a simple line of code. And that’s pretty “fix-it-and-forget-it”-type stuff that anyone can do in a few minutes. And finally, regardless of your site’s condition, you should make a practice of including LOTS of internal links in your content to ride the wave of free linkbacks.
Here are some more ideas and techniques for stopping scrapers. Check the comments too – there are some real gems in there.
42 responses to “How to Deal with Content Scrapers”
I’ve had scrapers steal my content on several occasions. My final solution was to simply report them to Google …
It worked — They wiped the jerks out of the index.
:-)
BTW Jeff
You and your site rock
EB
Hey everyone,
I’m definitely intrigued about this subject as I’ve had several client sites scraped. I’m an SEO guy, so I catch scraped sites all the time when I do research about the content that my clients index.
One scrape/hack even copied all of the code down to the META and ONLY changed the brand name and links. It actually was ranking directly behind my client’s website (same Title, Same Description, same everything except for some Flash elements). The worst is that the client site CMS/DB was breached and left with an empty database and virtually no content.
It seems to me that most people here are addressing some kind of post-attack prevention (which I think is good). However, @Edward, I totally agree with the spam reports. Generally my strategy is as follows:
1) Contact the website and inform them of their copyright infringement and the consequences. I don’t give any ultimatum, wiggle room, or opportunity to engage in a dialogue. I just tell the ‘scraper’ what is going to happen and what they did was illegal. I try to keep things short and civil because the person who scraped the site sucks enough anyway not to deserve anyone’s emotional energy.
2) Contact Yahoo!, Bing, Google to report the site. Google never really responds with a live person for anything, but Yahoo! and Bing are surprisingly responsive regarding spam/copyright infringements. Overall, search engines take copyrights VERY seriously, so taking the ‘Remove the content from your website that you stole or else you’ll get dropped from search engine indexes’ route works pretty well. After all what good is scraped stolen content that no one can find?
3) Do a reverse IP/WhoIs lookup to find the hosting provider, contact the hosting provider and tell them that their client is hosting illegal content. This works especially fast.
Hope this helps!
Echoing EdWard, Perishable Press rocks indeed.
You may add in the headers of your articles pages a meta tag like for example:
Hi Jeff
I haven’t read all of the comments yet, but has anybody mentioned bandwidth?
My bandwidth is finite and those scrapers keep coming back and using more and more.
My service provider has blocked a couple of IPs, but it seems like every month another one takes its place.
Thanks for giving me a few ideas to stop those bots – much appreciated.
What I do is add a link back to my original post and the homepage to the end of every post in the feed. It’s a simple matter of adding a custom function to your functions.php file where you check if the content is_feed and then use concatenation to add a link back to the end of $content. I’d post it here but WordPress would probably filter it out…
Anyway, the theory is that if your content does get scraped, the spammer site will provide a link back to the original, so it shouldn’t rank any higher than your website.
The most difficult is to deal with scrapers that copy-paste the article manually. They do not include backlinks to the original source at all.
Do you have experience reporting scrapers to a domain registrar? And will they respond to our DMCA letter?
@Bebe: they will, it just takes a long time. Make sure your DMCA letter meets every requirement, and keep copies for your records. The DMCA notices I’ve filed were all eventually resolved, but like I said it took forever to happen.
Hey,
Very good post! I remember the first time I had my content scraped in this fashion. I was furious, because I had spent serious time and energy researching and writing a good article, just to have it stolen. After about a year and a half, I have come to appreciate the theft. My links leading back to my site were left intact and gave me 4 backlinks from a PR4 site. This brought me up to a PR3 from a PR1 and my site is optimized for long tail keywords. All in all, even though it’s crap to steal someone else’s content, it actually helped me in this case.
My 2 cents,
Chaz
I have disabled my RSS feed, because I found out that even when you block people, they can still read your content outside your website using RSS readers, and that is not the intent. So I have disabled my RSS feeds and use .htaccess files to block them. What else can you do?
You’ve offered some useful tips here. My site’s hosted by TypePad so I’m unable to get to my site’s htaccess, but I do have Pubsubhubbub enabled through feedburner. What’s so disheartening about content theft/plagiarism is that some of these criminals actually promote their tech on the search engines. I was doing an online search for “content scrapers” and several content scraper tech sites appeared in Bing’s top results.
Something needs to be done about this problem. We need a clean and honest internet, not one rife with content stolen from other websites.
thanks a lot man you saved my day…although just one doubt can you tell me where to insert the rewritecond rules in my htaccess file..u see i m kinda a paranoid when it comes to dealing with scripts and things that will mess up my site…thanks in advance! :)
Okay so here’s one for you:
User agent Mozilla 4.0, all by itself, is for some kind of image service that I want to block. Adding it to the list in my htaccess file blocks it.
However, “Mozilla 4.0” is the start of the string identifying MSIE 8. So everyone using IE8 is now also blocked.
Is there some way to block the short string (Mozilla 4.0) but then allow the long string Mozilla 4.0 (compatible; MSIE 8.0; … ?
Happy to look at this.. will you copy/paste the entire UA string for each one (ie, Mozilla 4.0 and IE8)? The key to finding a match is utilizing the entire string(s).
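As a minimal sketch of the "match the entire string" idea, and assuming the unwanted agent really does send nothing but the short string, anchoring the pattern at both ends keeps it from matching the longer IE8 string that merely starts the same way. The exact agent string here is an assumption until the real one is pulled from the access logs.

```apache
# Assumption: the unwanted agent is exactly "Mozilla/4.0" (or "Mozilla 4.0"),
# with nothing after it. Check your logs for the real string before using this.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # ^ and $ anchor both ends of the string, so the longer
    # "Mozilla/4.0 (compatible; MSIE 8.0; ..." agent does not match.
    # The unescaped "." matches whatever single character separates the two parts.
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla.4\.0$
    # Return 403 Forbidden for any request from that agent
    RewriteRule .* - [F,L]
</IfModule>
```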