Perishable Press

How to Deal with Content Scrapers

Chris Coyier of CSS-Tricks recently declared that people should do “nothing” in response to other sites scraping their content. I totally get where Chris is coming from. His point is that the original source of content beats the scrapers because:

  • it’s on a domain with more trust.
  • you published that article first.
  • it’s coded better for SEO than theirs.
  • it’s better designed than theirs.
  • it isn’t at risk for serious penalization from search engines.

If these things are all true, then I agree, you have nothing to worry about. Unfortunately, that’s a tall order for many sites on the Web today. Although most scraping sites are pure and utter crap, the software available for automating the production of decent-quality websites is getting better every day. More and more I’ve been seeing scraper sites that look and feel authentic because they are using some sweet WordPress theme and a few magical plugins. In the past, it was easy to spot a scraper site, but these days it’s getting harder to distinguish between scraped and original content, not just for visitors but for search engines too.

Here are some counter-points to consider:

  • Many new and smaller/less-trafficked sites have no more trust than most scraper sites
  • It’s trivial to fake an earlier publication date, and search engines can’t tell the difference
  • A well-coded, SEO-optimized WP theme is easy to find (often for free)
  • There are thousands of quality-looking themes that scrapers can install (again for free, see previous point)
  • If the search engines don’t distinguish between original and scraper sites, your site could indeed be at risk for SE penalization

To quote Chris, instead of worrying about scraper sites..

..you could spend that time doing something enjoyable, productive, and ultimately more valuable for the long-term success of your site.

I agree that pursuing and stopping content scrapers is no fun at all. But unless your site is well-built, well-known, and well-trusted, doing nothing about content scrapers may not be the best solution. Certainly an awesome site such as CSS-Tricks doesn’t need to do anything about scraped content because it is a well-coded, well-optimized, well-known and trusted domain. But most sites are nowhere near that level of success, and thus should have some sort of strategy in place for dealing with scraped content. So..

How to Deal with Content Scrapers

So what is the best strategy for dealing with content-scraping scumbags? My personal three-tiered strategy includes the following levels of action:

  • Do nothing.
  • Always include lots of internal links
  • Stop the bastards with a well-placed slice of htaccess

These are the tools I use when dealing with content scrapers. For bigger sites like DigWP.com, I agree with Chris that no action is really required. As long as you are actively including plenty of internal links in your posts, scraped content equals links back to your pages. For example, getting a link in a Smashing Magazine article instantly provides hundreds of linkbacks, thanks to all of the thieves and leeches stealing Smashing Mag’s content. Sprinkling a few internal links throughout your posts benefits you in some fantastic ways:

  • Provides links back to your site from stolen/scraped content
  • Helps your readers find new and related pages/content on your site
  • Makes it easy for search engines to crawl deeply into your site

So do nothing if you can afford not to worry about it; otherwise, get in the habit of adding lots of internal links to take advantage of the free link juice. This strategy works great unless you start getting scraped by some of the more sinister sites. In which case..

Kill ’em all

So you’re trying to be cool and let scrapers be scrapers. You’re getting some free linkbacks so it’s all good, right? Not if the scrapers are removing the links from your content. Some of the more depraved scrapers will actually run your content through a script to strip out all of the hyperlinks. Then all of your hard work is benefiting some grease-bag out there and you’re getting NOTHING in return. Fortunately there is a quick, easy way to stop content scrapers from using your feeds to populate their crap sites.

HTAccess to the rescue!

If you want to stop some moron from stealing your feed, check your access logs for their IP address and then block it with something like this in your root htaccess file (or Apache configuration file):

Deny from 203.0.113

That will stop them from stealing your feed from that particular IP address. You could also do something like this:

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteRule .* http://dummyfeed.com/feed [R,L]

Instead of blocking requests from the scraper, this code delivers some “dummy” content of your choosing. You can do anything here, so get creative. For example, you might send them oversize feed files filled with Lorem Ipsum text. Or maybe you prefer to send them some disgusting images of bad things. You could also send them right back to their own server – always fun to watch infinite loops crash a website. Of course, there are other ways to block and redirect the bad guys if the IP isn’t available.
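Note that “Deny from” is the older Apache 2.2 access-control syntax; on Apache 2.4 and later, the same block is expressed with mod_authz_core’s Require directives. Here is a minimal 2.4 sketch, where 203.0.113.0/24 is a placeholder from the documentation address range and you would substitute the scraper’s address or range from your own access logs:

```apacheconf
# Apache 2.4+ equivalent of a 2.2-style "Deny from" block.
# 203.0.113.0/24 is a documentation placeholder; substitute the
# scraper's address or range from your access logs.
<RequireAll>
	Require all granted
	Require not ip 203.0.113.0/24
</RequireAll>
```

Either way, the effect is the same: requests from that address range get a 403 instead of your feed.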

Bottom Line

If you don’t have to worry about scrapers then don’t. In my experience that’s a luxury that we can’t always afford, especially for newer, fledgling sites that are still building up content and establishing reputation. Fortunately, you can protect your content quickly and easily with a simple line of code. And that’s pretty “fix-it-and-forget-it”-type stuff that anyone can do in a few minutes. And finally, regardless of your site’s condition, you should make a practice of including LOTS of internal links in your content to ride the wave of free linkbacks.

Here are some more ideas and techniques for stopping scrapers. Check the comments too – there are some real gems in there.

About the Author Jeff Starr = Web Developer. Security Specialist. WordPress Buff.
42 responses
  1. @Jeff Starr: sorry, but I still don’t know how you can tell if an IP belongs to a scraper or to a human before they steal something on your site..

    Well, if you find a scraper site then you can get its IP and add it to your blacklist, that’s ok, but this is a retroactive approach..

    Another thing is, as you suggested with many internal links in the content, you could be talking about yourself, naming your blog domain often in the post..
    Or maybe hide some text via CSS so users on your site won’t be able to read it, but the scraper will grab it anyway.

    I admit that I’ve never dealt with scrapers, this is just my 2 cents ;)

  2. @Dave: A copyright hint works fine if the scraper really is a robot without emotions. In my experience, the hardest thieves are people, for instance young girls decorating their profiles on MySpace with photos from other people’s websites via hotlinking. Don’t make creepy people angry. So, I use a 1x1px BMP replacement in htaccess.

    I’d enhance Jeff’s internal link trick with semantic tags. We are using a lot of q-tags, mark-tags, title-tags etc. inline. This seems to prevent copying very well, as far as I can see.

    Blocking by user-agent also really helps. In my experience visitors with agents like “Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)” or “Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)” can never be good visitors.

    At last: There is a button in the Google Webmaster tools that says: Help me Google, someone is stealing my content.
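The user-agent blocking mentioned in the comment above can be done with a couple of lines of mod_rewrite in .htaccess. This is a sketch only: the patterns are just the example agent strings quoted above, and over-broad matches can block legitimate visitors, so test carefully before deploying.

```apacheconf
# Return 403 for requests whose User-Agent matches the suspect strings.
# Patterns are examples only; over-broad matches can block real visitors.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "MSIE [56]\.0; Windows 98" [NC]
RewriteRule .* - [F,L]
```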

  3. ok, so i use Magpie for pulling in feeds and displaying content. i’m pretty sure i’m doing it the *right* way, as in not saying the content is mine, and also providing the real author’s linkbacks, etc., but i’m just wondering if there is a protocol for this? how do i know what i’m doing is ok? i would never steal content or want anyone to think someone else’s work is mine.
    ps – per usual, great topic.

  4. I like the .htaccess solution, but it only works if you host your own feed, doesn’t it? If you use FeedBurner/FeedBlitz/etc, scrapers can swipe your content all the livelong day! How would you suggest dealing with that (without having to quit using your feed aggregator)?

  5. I think there is going to be a rise in services like https://www.digiprove.com/secure/subscribe.aspx. I wonder if google might offer some kind of service like this in the future? Just like there are SSL keys for verifying secure connections, there could just as easily be keys to verify content ownership. Not that this would stop content scraping, but it could be a way for search engines to start to verify ownership of content.

  6. Lee Kowalkowski September 28, 2010 @ 4:34 am

    How about, in addition to links (which would have to be absolute – grr!), just casually mentioning the name of your site once in a while? e.g. “Here at Perishable Press, we…”

  7. I like the sound of adding rules to your HTAccess, but what if you don’t have access to the server? What can you do then to deny scrapers?

    I like the idea of using lots of internal links. I definitely can do a better job of that.

    Thanks for any advice.

    Matt

  8. @daniele: Scrapers are easily discouraged: they became content thieves precisely because they have no creativity or work ethic. So banning their IPs quickly sends them a message that you are on to them and that they should find easier targets.

    @Jeff – I love the infinite loop concept and the dummy content redirect. I have a client that is feeling the pressure from scrapers, so I am thinking of setting up a new domain full of automated trash spam terms and then unleashing 3–5 articles a day. In my mind this will then penalise them as far as Google is concerned; even if it doesn’t, it allows me the satisfaction of pro-actively deterring them from coming back.

    Great site, I use a lot of your techniques.

  9. Good article and I agree totally. There are definitely far better things to do with time than worry about others ‘stealing’ content!
    Google in particular has become far better recently at spotting original content, and scraping sites are being identified more reliably. There is little one can do about article scrapers, so leave it up to Google to identify their sites!

  10. Rockford Home Remodeling October 21, 2010 @ 3:12 am

    Never thought about the free link backs from content scrapers. However I have always included internal links on my blog posts.

    Nevertheless, always good info and things to think about, try and do. On the other hand, in my line of work and what I usually post about, there are only so many ways to say the same thing. Not much of a writer but more of a craftsman. Still learning though, because I am somewhat of a techno geek too.

  11. Putting internal links in your article is a good way to get back some (foul) juice from scraper sites, but this strategy may actually backfire if thousands of spam sites suddenly point to various sections of your website. Some news aggregators are fine (authority and trust attached) but millions of automated WordPress blogs aren’t. In the end, the sites that can withstand scraping can also handle a few thousand spammy links, the others can’t either way. A few internal links back may not even rank your feeble site above a legit and trusted news site relaying it. Google may be able to determine the original version of a document, but in some cases site authority prevails over authorship or quality.

  12. Hi Jeff,

    I must say that using .htaccess is not efficient. Many scrapers (at least here) use open proxies, hijacked sites or even Tor to remain undetected. They often switch IP addresses, making .htaccess IP blocks useless.

    Personally I prefer checking the visitor’s IP against DNSBLs – this allows you to effectively block open proxies and Tor (e.g., xbl.spamhaus.org, dnsbl.ahbl.org, ip-port.exitlist.torproject.org), virus/worm sources – they often automatically scan websites for vulnerabilities (virus.rbl.jp, wormrbl.imp.ch, dnsbl.ahbl.org), hijacked sites (zombie.dnsbl.sorbs.net, httpbl.abuse.ch, drone.abuse.ch) or even comment spammers (dnsbl.ahbl.org, list.blogspambl.com, sbl.spamhaus.org).
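The DNSBL check described in this comment boils down to reversing the IP’s octets and querying the result against the blacklist zone: an answer means the address is listed, NXDOMAIN means it is not. A minimal Python sketch (the zone name is just an example; real deployments should cache results and respect each list’s usage policy):

```python
import socket

def dnsbl_query(ip, zone):
    # Build the DNSBL lookup name: reversed octets + blacklist zone,
    # e.g. 203.0.113.7 -> 7.113.0.203.zen.spamhaus.org
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone="zen.spamhaus.org"):
    # An A-record answer means the IP is listed; NXDOMAIN means it is not
    try:
        socket.gethostbyname(dnsbl_query(ip, zone))
        return True
    except socket.gaierror:
        return False
```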

[ Comments are closed for this post ]