How to Protect Your Site Against Content Thieves (and Other Scumbags)
Stolen content is the bane of every blogger who provides a publicly available RSS feed. By delivering your content via feed, you make it easy for scrapers to assimilate and re-purpose your material on their crap Adsense sites. It’s bad enough that someone would re-post your entire feed without credit, but to use it for cheap money-making schemes is about as pathetic as it gets. If you’re lucky, the bastards may leave all the links intact, so at least you will get a few back-links (if you have been linking internally) and get notified of the stolen content as well (via pingback or Google Alert). Lately, however, many of the scraper sites that I have seen are completely removing all links within the stolen content. Incidentally, there are some tell-tale signs that the site you are visiting is a scraper site:
- No RSS feed available
- Many quality posts that contain no links
- Many quality posts but very low subscriber count
- Great content but with zero comments on any posts
- Lots of good content but with lots of Adsense or other ads
- No “About” page or business information
- And the number one brain-dead giveaway: no contact form or email address
As you surf around, keep an eye out for these giveaways. If it looks like a site is profiting from stolen content, leave immediately and locate an original source of information (you could even be cool and report the scraper site to the original author). In other words, help strengthen the legit blogging community and don’t support scrapers in any way.
But avoiding scraper sites is merely an afterthought. The real challenge is to have a solid strategy in place that will help you identify, eliminate and prevent stolen content. Unfortunately, there is no “magic cure” that will stop the scrapers from stealing your hard work — apart from running a private site or not providing a feed — but there are many great tools that have proven quite effective in fighting the war against stolen content. While not completely exhaustive, here are some powerful tips and tricks that have served me well over the years:
- Use partial feeds
- This is arguably the most effective way to immunize against scrapers, who prefer to steal entire articles as opposed to excerpts.
- Use a monitoring service
- Services such as Copyscape will help you find out who is stealing your content.
- Use a feed footer plugin
- WordPress users have many to choose from including the excellent Copyfeed.
- Set up Google Alerts
- Keep an eye on what Google discovers around the Web. If you have a specific phrase or URL (perhaps your own) that appears in all (or most) of your posts, set up a comprehensive daily Google Alert to be notified any time that Google finds a match. This is particularly useful if you have pingbacks disabled at your site.
- Analyze your access logs
- Keep an eye out for image requests whose referrer is an external site. These are usually associated with stolen content.
- Tell them to stop
- Stay vigilant and confront scrapers with formal “Cease and Desist” emails. Gather information about the scraper and then contact them as directly and clearly as possible. Tell them to cease and desist, include all relevant URLs, and explain the consequences of non-compliance. Then follow through.
- File a DMCA notice
- Take action by filing a formal DMCA notice with each of the major search engines. If you are serious about stopping a scraper, filing a DMCA report is an essential step in documenting and potentially resolving the situation. The process is slow and tedious, but worth the effort.
- Register your work
- To verify your work in the case of a dispute, register your content with services like Registered Commons. For situations when registering is not possible, Archive.org may provide some indirect evidence of original authorship (just make sure you aren’t blocking them via robots.txt).
- Deliver your own feeds
- Serving your feed from your own server lets you target scrapers directly by banning their IP or user agent from accessing it. You can’t do this with Feedburner, but it is a great way to prevent stolen content.
- Blacklist the User Agent
- If the scraper is using a unique or obscure User Agent when accessing your content, you can blacklist that particular User Agent.
- Blacklist the IP address
- Once you have determined the IP of someone who is scraping your content, block them from accessing your feed by blacklisting their address. Something as simple as the following placed in your HTAccess should do the trick.
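# Send requests from the scraper's IP range (123.123.123.*) to another feed
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^123\.123\.123\.
RewriteRule ^(.*)$ http://domain.tld/feed [R=302,L]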
Replace the IP address with that of the scraper and replace the feed URL with anything you wish. I prefer to use the feed of the scraper’s site — works awesome ;)
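The User Agent approach from the list above can be sketched in much the same way. Treat this as a minimal example rather than a tested drop-in: "EvilScraper" is a made-up User Agent string, and the rule assumes that mod_rewrite is enabled and that your feed is served from a URL beginning with /feed.

```apache
# Sketch only: "EvilScraper" is a hypothetical User-Agent string;
# substitute whatever shows up in your own access logs.
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Match the User-Agent header, ignoring case
  RewriteCond %{HTTP_USER_AGENT} EvilScraper [NC]
  # Deny feed requests only (assumes the feed lives under /feed)
  RewriteRule ^feed - [F]
</IfModule>
```

On a WordPress site, rules like this generally need to sit above the default WordPress rewrite block, since that block hands most requests to index.php before later rules are reached. Returning 403 Forbidden is only one option; redirecting as in the IP example would work here too.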
And there are probably many other effective techniques to be found around the Web. The take-home point is that yes, scrapers suck, but there are many ways to prevent your content from being misused. Developing your own anti-scraping strategy while staying vigilant, informed, and proactive will definitely help minimize the degree to which scrapers misuse your content.
13 responses to “How to Protect Your Site Against Content Thieves (and Other Scumbags)”
Thanks for this post, and thanks also for linking to my plugin ©Feed. I will write a new version with more possibilities and new code.
Another option is to use a website like http://www.copyscape.com, where you can check a page/website to see if the content has been replicated elsewhere.
@Frank: My pleasure. Keep up the excellent work with the plugin, it is one of the best! :)
@Caroline: Yes, copyscape was included in the article. It is a great service.
There are two types of feed scrapers: those who filter out markup, and those who don’t.
For those who filter out markup, that’s easy enough to counter; just sprinkle these in the feed:
<SPAN STYLE="display: none">
This is stolen content. Check out perishablepress.com
for the original content, and please use perishablepress.com/contact/
to report the infringer.</SPAN>
Meanwhile, for those who don’t, we get even sneakier:
<IMG SRC="http://www.perishablepress.com/scraper.php" ALT="Innocuous Image">
Now, for those who access the site with a pre-blessed “Referer” (sic) header, that loads a 1×1 transparent GIF/PNG. For those coming from an outside host after a few “grace period” hits to allow for people reading using strange download methods, it pops up a nice, big image telling them they’re reading stolen content.
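A simplified, static version of this idea can be sketched with mod_rewrite alone, with no script involved. It is not the scraper.php described here: there is no hit counting and no grace period, only a check of the Referer header. The domain example.com and the file /stolen-notice.png are placeholders, and mod_rewrite is assumed to be enabled.

```apache
# Sketch only: example.com and /stolen-notice.png are placeholders.
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Let requests with no Referer through (many feed readers and privacy tools)
  RewriteCond %{HTTP_REFERER} !^$
  # Let requests referred by your own site through
  RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
  # Do not rewrite the notice image itself, which would cause a loop
  RewriteCond %{REQUEST_URI} !^/stolen-notice\.png$
  # Any other image request is answered with the notice image
  RewriteRule \.(gif|jpe?g|png)$ /stolen-notice.png [NC,L]
</IfModule>
```

Web-based feed readers send their own Referer, so in practice you would likely add further RewriteCond lines to allow the readers your subscribers use; otherwise legitimate readers would see the notice too.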
Of course, if the person stealing your content is completely brain-dead, there’s always this:
<SCRIPT SRC="http://www.perishablepress.com/js-scraper.php" TYPE="text/javascript"></SCRIPT>
What you do with this tag I leave up to your imagination.
~~BMDan
Great comment, BMDan. Thanks for sharing these techniques. A couple of thoughts in response:
I think it might be possible to remove only anchor tags from RSS content. I have seen certain scraped posts where my markup seemed to be perfectly intact, except for the missing <a> tags.
It is also possible to selectively filter out <script> tags (and anything else, for that matter). When markup is removed entirely or selectively, including an external script would not work, but the inline display:none; trick would work a treat.
The only case where the inline span message would fail is when CSS is not available on the user’s browsing device. Mobile phones are good examples of this.
What I am really digging is the image tactic. I can see something like that working very well, depending on the contents of the script. A great idea for a future article ;)
Great stuff, Dan — thanks again. You have given me some new ideas to think about.
I’ve been wondering about this content scraping problem for some time. Your tips are very interesting and useful.
Actually, I think your site is a mine of useful information, so useful in fact, I’m going to link to it and help, in a small way, spread the word.
Love the design too!
Very best regards,
Alex
Thanks Alex, much appreciated :)
I realized that some of your signs reflect my blog currently since it is a virgin blog. I love your content and would never steal someone’s content, but I don’t have to worry about this… yet.
Thanks for the word.
Hi Tyler, so true – I never had to worry about any of this before things started picking up around here. Once traffic and content began to increase, fighting scrapers, spammers, and other scumbags became an essential part of my daily routine. Any preventative measures that you can take now will pay off 100-fold as your site begins to grow.
BMDan, I am a newbie at websites but have a pretty good following on the blog I am running, and I recently found out that the whole thing was cloned in China… How do I use those scripts you mentioned?
Jay,
It’s not as simple as a drop-in. My post was designed to trigger new thoughts on ways to prevent scraping, not as a step-by-step on how to do so. For that, you’d need to pay me. ;)
~~BMDan
Great post Jeff.
@BMDan: love the tips. very useful stuff.