How to Protect Your Site Against Content Thieves

Stolen content is the bane of every blogger who provides a publicly available RSS feed. By delivering your content via feed, you make it easy for scrapers to assimilate and re-purpose your material on their crap Adsense sites. It’s bad enough that someone would re-post your entire feed without credit, but to use it for cheap money-making schemes is about as pathetic as it gets. If you’re lucky, the bastards may leave all the links intact, so at least you will get a few back-links (if you have been linking internally) and get notified of the stolen content as well (via pingback or Google Alert). Lately, however, many of the scraper sites that I have seen are completely removing all links within the stolen content. Incidentally, there are some tell-tale signs that the site you are visiting is a scraper site:

  • No RSS feed available
  • Many quality posts that contain no links
  • Many quality posts but very low subscriber count
  • Great content but with zero comments on any posts
  • Lots of good content but with lots of Adsense or other ads
  • No “About” page or business information
  • And the number one brain-dead giveaway: no contact form or email address

If you pay attention as you surf around, keep an eye out for these dead giveaways. If a site looks like it is profiting from stolen content, leave immediately and locate an original source of the information (you could even be cool and report the scraper site to the original author). In other words, help strengthen the legit blogging community and don’t support scrapers in any way. But avoiding scraper sites is merely an afterthought. The real challenge is to have a solid strategy in place that will help you identify, eliminate, and prevent stolen content. Unfortunately, there is no “magic cure” that will stop scrapers from stealing your hard work (apart from running a private site or not providing a feed), but there are many great tools that have proven quite effective in the fight against stolen content. While not completely exhaustive, here are some powerful tips and tricks that have served me well over the years:

Use partial feeds
This is arguably the most effective way to immunize your content against scrapers, who prefer to steal entire articles rather than excerpts.
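In WordPress this is a one-click setting (under Settings → Reading, choose “Summary” for feed items), but the idea is simple enough to sketch anywhere. Here is a rough Python illustration of turning a full post into a feed-friendly excerpt (the function name is my own invention; the 55-word cutoff just mirrors WordPress’s default excerpt length):

```python
import re

def make_excerpt(html, max_words=55):
    """Strip tags and truncate a post to a short excerpt for the feed.
    55 words mirrors WordPress's default excerpt length."""
    text = re.sub(r"<[^>]+>", " ", html)   # crude tag removal
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " [...]"

post = "<p>Scrapers prefer full posts, so serve them less.</p>"
print(make_excerpt(post, max_words=5))
# -> "Scrapers prefer full posts, so [...]"
```

Scrapers re-posting your feed now get a teaser and a link instead of the whole article.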
Use a monitoring service
Services such as Fairshare or Copyscape will help you find out who is stealing your content.
Use a feed footer plugin
WordPress users have many to choose from including the excellent Copyfeed and Ozh’ Better Feed.
Set up Google Alerts
Keep an eye on what Google discovers around the Web. If you have a specific phrase or URL (perhaps your own) that appears in all (or most) of your posts, set up a comprehensive daily Google Alert to be notified any time Google finds a match. This is particularly useful if you have pingbacks disabled at your site.
Analyze your access logs
Keep an eye out for requests for your images that arrive with a Referer from an external domain (i.e., hotlinking). Scrapers often republish your markup wholesale, images and all, so these requests can lead you straight to the stolen copy.
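Grepping logs by hand gets old fast, so a small script helps. Here is a rough Python sketch that flags hotlinked image requests, assuming Apache’s combined log format (the domain and file extensions are placeholders to adjust for your own site):

```python
import re

# Apache "combined" log format: IP, identity, user, [date], "request",
# status, size, "referer", "user agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d+ \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def find_hotlinkers(lines, my_domain="example.com"):
    """Return (ip, referer) pairs for image requests whose Referer
    points at some other site -- a common sign of re-posted content."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        path, referer = m.group("path"), m.group("referer")
        if not path.lower().endswith((".jpg", ".jpeg", ".png", ".gif")):
            continue
        if referer and my_domain not in referer:
            hits.append((m.group("ip"), referer))
    return hits

sample = [
    '1.2.3.4 - - [10/Oct/2024:13:55:36 +0000] "GET /img/chart.png HTTP/1.1" '
    '200 512 "http://scraper.tld/stolen-post" "Mozilla/5.0"',
]
print(find_hotlinkers(sample))
```

Each Referer in the output is a page worth visiting: if your article is sitting there, you have found your scraper.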
Tell them to stop
Stay vigilant and confront scrapers with formal “Cease and Desist” emails. Gather information about the scraper, then contact them as directly and clearly as possible. Tell them to cease and desist, include all relevant URLs, and explain the consequences of non-compliance. Then follow through.
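If you have never written one, here is a rough starting point (every bracketed item is a placeholder to fill in yourself; this is a sketch, not legal advice):

```text
Subject: Cease and Desist - Unauthorized Republication of Copyrighted Content

To whom it may concern,

I am the author and copyright holder of the following original work:

  [URL of your original post]

This work has been republished without permission at:

  [URL of the infringing copy]

I demand that you remove the infringing material within [X] days of this
notice. If you fail to comply, I will pursue further action, including
DMCA takedown notices to your web host and the major search engines.

[Your name and contact information]
```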
File a DMCA notice
Take action by filing a formal DMCA notice with each of the major search engines. If you are serious about stopping a scraper, filing a DMCA report is an essential step in documenting and potentially resolving the situation. The process is slow and tedious, but worth the effort.
Register your work
To verify your work in the case of a dispute, register your content with services like Numly (404 link removed 2014/10/22) or Registered Commons. For situations when registering is not possible, Archive.org may provide some indirect evidence of original authorship (just make sure you aren’t blocking them with your robots.txt file).
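None of this stops you from also keeping your own records. It costs nothing to fingerprint each post when you publish it; a content hash plus a timestamp is not a legal registration, but it is a cheap, reproducible record you can point to later. A minimal Python sketch (the fingerprint helper is hypothetical, just something you might call from a publishing script):

```python
import hashlib
import json
import time

def fingerprint(post_text, url):
    """Record a SHA-256 hash of a post plus a publication timestamp.
    Not a substitute for formal registration -- just a personal,
    reproducible record of what you published and when."""
    digest = hashlib.sha256(post_text.encode("utf-8")).hexdigest()
    return {"url": url, "sha256": digest, "recorded": int(time.time())}

record = fingerprint("My original article text.", "http://domain.tld/my-post")
print(json.dumps(record))
```

Append each record to a file (or email it to yourself for a third-party timestamp) and you can later re-hash the original text to show it matches.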
Deliver your own feeds
Serve your feeds from your own domain rather than through a third-party service. You can’t block individual scrapers with Feedburner, but when you deliver your own feeds you can target them directly by banning their IP address or User Agent, which is a great way to prevent stolen content.
Blacklist by IP address
Once you have determined the IP of someone who is scraping your content, block them from accessing your feed by blacklisting their address. Something as simple as the following placed in your .htaccess file should do the trick:
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^123\.123\.123\.
RewriteRule .* http://domain.tld/feed [R=302,L]
Replace the IP address with that of the scraper (note the escaped dots; the pattern is a regular expression) and replace the destination URL with anything you wish. I prefer to use the feed of the scraper’s own site. Works awesome ;)
Blacklist the User Agent
If the scraper is using a unique or obscure User Agent when accessing your content, you can blacklist that particular User Agent.
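Something like the following in .htaccess would do it (“EvilScraper” is a placeholder; substitute the agent string you actually see in your access logs):

```apache
# Deny requests from a specific User Agent
# "EvilScraper" is a placeholder -- use the string from your logs
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} EvilScraper [NC]
RewriteRule .* - [F,L]
```

The [F] flag returns a 403 Forbidden, and [NC] makes the match case-insensitive. Keep in mind that User Agents are trivial to spoof, so this only works against lazy scrapers who don’t bother changing theirs.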

And there are probably many other effective techniques to be found around the Web. The take-home point is that yes, scrapers suck, but there are many ways to prevent your content from being misused. Developing your own anti-scraping strategy while staying vigilant, informed, and proactive will definitely help minimize the degree to which scrapers misuse your content.