How to Deal with Content Scrapers

♦ Posted by Jeff Starr in .htaccess, Security

Updated June 29, 2018 • 42 comments

Chris Coyier of CSS-Tricks recently declared that people should do “nothing” in response to other sites scraping their content. I totally get what Chris is saying here. He is basically saying that the original source of content is better than scrapers because:

it’s on a domain with more trust.
you published that article first.
it’s coded better for SEO than theirs.
it’s better designed than theirs.
it isn’t at risk for serious penalization from search engines.

If these things are all true, then I agree, you have nothing to worry about. Unfortunately, that’s a tall order for many sites on the Web today. Although most scraping sites are pure and utter crap, the software available for automating the production of decent-quality websites is getting better every day. More and more I’ve been seeing scraper sites that look and feel authentic because they are using some sweet WordPress theme and a few magical plugins. In the past, it was easy to spot a scraper site, but these days it’s getting harder to distinguish between scraped and original content. Not just for visitors, but for search engines too.

Here are some counter-points to consider:

Many new and smaller/less-trafficked sites have no more trust than most scraper sites
It’s trivial to fake an earlier publication date, and search engines can’t tell the difference
A well-coded, SEO-optimized WP theme is easy to find (often for free)
There are thousands of quality-looking themes that scrapers can install (again for free, see previous point)
If the search engines don’t distinguish between original and scraper sites, your site could indeed be at risk for SE penalization

To quote Chris, instead of worrying about scraper sites..

..you could spend that time doing something enjoyable, productive, and ultimately more valuable for the long-term success of your site.

I agree that pursuing and stopping content scrapers is ~~no fun at all~~ not as fun as some things. But unless your site is well-built, well-known, and well-trusted, doing nothing about content scrapers may not be the best solution. Certainly an awesome site such as CSS-Tricks doesn’t need to do anything about scraped content because it is a well-coded, well-optimized, well-known and trusted domain. But most sites are nowhere near that level of success, and thus should have some sort of strategy in place for dealing with scraped content. So..

How to Deal with Content Scrapers

So what is the best strategy for dealing with content-scraping scumbags? My personal three-tiered strategy includes the following levels of action:

Do nothing
Always include lots of internal links
Stop the bastards with a well-placed slice of htaccess

These are the tools I use when dealing with content scrapers. For bigger sites like DigWP.com, I agree with Chris that no action is really required. As long as you are actively including plenty of internal links in your posts, scraped content equals links back to your pages. For example, getting a link in a Smashing Magazine article instantly provides hundreds of linkbacks thanks to all of the thieves and leeches stealing Smashing Mag’s content. Sprinkling a few internal links throughout your posts benefits you in some fantastic ways:

Provides links back to your site from stolen/scraped content
Helps your readers find new and related pages/content on your site
Makes it easy for search engines to crawl deeply into your site

So do nothing if you can afford not to worry about it; otherwise, get in the habit of adding lots of internal links to take advantage of the free link juice. This strategy works great unless you start getting scraped by some of the more sinister sites. In which case..

Kill ’em all

So you’re trying to be cool and let scrapers be scrapers. You’re getting some free linkbacks so it’s all good, right? Not if the scrapers are removing the links from your content. Some of the more depraved scrapers will actually run your content through a script to strip out all of the hyperlinks. Then all of your hard work is benefiting some grease-bag out there and you’re getting NOTHING in return. Fortunately there is a quick, easy way to stop content scrapers from using your feeds to populate their crap sites.

HTAccess to the rescue!

If you want to stop some moron from stealing your feed, check your access logs for their IP address and then block it with something like this in your root htaccess file (or Apache configuration file):

Deny from 123.456.789

That will stop them from stealing your feed from that particular IP address. (With only the first three octets given, as here, the rule covers every address in that range.) You could also do something like this:

# Redirect every request from the scraper's address range to a dummy feed
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^123\.456\.789\.
RewriteRule .* http://dummyfeed.com/feed [R,L]

Instead of blocking requests from the scraper, this code delivers some “dummy” content of your choosing. You can do anything here, so get creative. For example, you might send them oversize feed files filled with Lorem Ipsum text. Or maybe you prefer to send them some disgusting images of bad things. You could also send them right back to their own server – always fun to watch infinite loops crash a website. Of course, there are other ways to block and redirect the bad guys if the IP isn’t available.
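When the scraper hops between IP addresses, one option is to match on the user agent instead. The following is only a sketch: "BadScraperBot" is a made-up name standing in for whatever string shows up in your access logs, and the commented-out condition assumes your feed URLs end in /feed/.

```apache
RewriteEngine On
# Hypothetical user-agent substring; swap in the one from your own logs
RewriteCond %{HTTP_USER_AGENT} BadScraperBot [NC]
# Optional: uncomment to limit the block to feed URLs (assumes they end in /feed/)
# RewriteCond %{REQUEST_URI} /feed/?$ [NC]
RewriteRule .* - [F,L]
```

The [F] flag returns a 403 Forbidden response. User agents are trivial to spoof, so treat this as a speed bump rather than a wall, and test that regular browsers and legit feed readers still get through.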

Bottom Line

If you don’t have to worry about scrapers then don’t. In my experience that’s a luxury that we can’t always afford, especially for newer, fledgling sites that are still building up content and establishing reputation. Fortunately, you can protect your content quickly and easily with a simple line of code. And that’s pretty “fix-it-and-forget-it”-type stuff that anyone can do in a few minutes. And finally, regardless of your site’s condition, you should make a practice of including LOTS of internal links in your content to ride the wave of free linkbacks.
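One caveat on that simple line of code: Deny from is the older Apache 2.2 syntax, and Apache 2.4 only supports it through the mod_access_compat module. On a 2.4 server, a rough equivalent might look like the following sketch, where 203.0.113 is a placeholder range reserved for documentation, not a real scraper:

```apache
<RequireAll>
    # Allow everyone...
    Require all granted
    # ...except the placeholder range 203.0.113.x
    Require not ip 203.0.113
</RequireAll>
```

Depending on how your host has configured AllowOverride, this may need to go in the server configuration rather than htaccess.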

Here are some more ideas and techniques for stopping scrapers. Check the comments too – there are some real gems in there.

About the Author

Jeff Starr = Fullstack Developer. Book Author. Teacher. Human Being.

42 responses to “How to Deal with Content Scrapers”

Robert G Mears 2012/10/25 10:49 pm

Offender:
Agent: Mozilla/4.0

Internet Explorer 8:
Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; AskTbORJ/5.14.1.20007; .NET4.0C)

And I see I blocked you:
Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)

I removed the block for Mozilla/4.0 so you can visit the site now.
- Jeff Starr 2012/10/25 11:57 pm • Post Author
  
  If the offending agent is literally “Mozilla/4.0” and nothing else, we can do this:
  
  RewriteCond %{HTTP_USER_AGENT} ^(Mozilla/4\.0)$ [NC]
  RewriteRule .* - [F,L]
  
  This will not match anything other than “Mozilla/4.0”, so old IE and whatever else will be allowed through. Test thoroughly ;)
Robert G Mears 2012/10/26 12:32 am

Added. I’ll see how it works. Thanks.
- Robert G Mears 2012/10/26 11:37 am
  
  Thanks Jeff. That did the trick.
  
  While I’m here I should also thank you for your thorough tutorials over the years. I have made considerable use of the information you’ve presented to create an effective htaccess file.
  
  Thanks again!
Kyl 2013/01/18 8:22 am

There are 1001 ways for scrapers to steal our content.. thanks for the tricks, worth a try

