Perishable Press

How to Deal with Content Scrapers

Chris Coyier of CSS-Tricks recently declared that people should do “nothing” in response to other sites scraping their content. I totally get what Chris is saying here. He is basically saying that the original source of content is better than scrapers because:

  • it’s on a domain with more trust.
  • you published that article first.
  • it’s coded better for SEO than theirs.
  • it’s better designed than theirs.
  • it isn’t at risk for serious penalization from search engines.

If these things are all true, then I agree, you have nothing to worry about. Unfortunately, that’s a tall order for many sites on the Web today. Although most scraping sites are pure and utter crap, the software available for automating the production of decent-quality websites is getting better every day. More and more I’ve been seeing scraper sites that look and feel authentic because they are using some sweet WordPress theme and a few magical plugins. In the past, it was easy to spot a scraper site, but these days it’s getting harder to distinguish between scraped and original content. Not just for visitors, but for search engines too.

Here are some counter-points to consider:

  • Many new and smaller/less-trafficked sites have no more trust than most scraper sites
  • It’s trivial to fake an earlier publication date, and search engines can’t tell the difference
  • A well-coded, SEO-optimized WP theme is easy to find (often for free)
  • There are thousands of quality-looking themes that scrapers can install (again for free, see previous point)
  • If the search engines don’t distinguish between original and scraper sites, your site could indeed be at risk for SE penalization

To quote Chris, instead of worrying about scraper sites..

..you could spend that time doing something enjoyable, productive, and ultimately more valuable for the long-term success of your site.

I agree that pursuing and stopping content scrapers is not as fun as some things. But unless your site is well-built, well-known, and well-trusted, doing nothing about content scrapers may not be the best solution. Certainly an awesome site such as CSS-Tricks doesn’t need to do anything about scraped content because it is a well-coded, well-optimized, well-known and trusted domain. But most sites are nowhere near that level of success, and thus should have some sort of strategy in place for dealing with scraped content. So..

How to Deal with Content Scrapers

So what is the best strategy for dealing with content-scraping scumbags? My personal three-tiered strategy includes the following levels of action:

  • Do nothing.
  • Always include lots of internal links
  • Stop the bastards with a well-placed slice of htaccess

These are the tools I use when dealing with content scrapers. For bigger sites like DigWP.com, I agree with Chris that no action is really required. As long as you are actively including plenty of internal links in your posts, scraped content equals links back to your pages. For example, getting a link in a Smashing Magazine article instantly provides hundreds of linkbacks thanks to all of the thieves and leeches stealing Smashing Mag’s content. Sprinkling a few internal links throughout your posts benefits you in some fantastic ways:

  • Provides links back to your site from stolen/scraped content
  • Helps your readers find new and related pages/content on your site
  • Makes it easy for search engines to crawl deeply into your site

So do nothing if you can afford not to worry about it; otherwise, get in the habit of adding lots of internal links to take advantage of the free link juice. This strategy works great unless you start getting scraped by some of the more sinister sites. In which case..

Kill ’em all

So you’re trying to be cool and let scrapers be scrapers. You’re getting some free linkbacks so it’s all good, right? Not if the scrapers are removing the links from your content. Some of the more depraved scrapers will actually run your content through a script to strip out all of the hyperlinks. Then all of your hard work is benefiting some grease-bag out there and you’re getting NOTHING in return. Fortunately there is a quick, easy way to stop content scrapers from using your feeds to populate their crap sites.

HTAccess to the rescue!

If you want to stop some moron from stealing your feed, check your access logs for their IP address and then block it with something like this in your root htaccess file (or Apache configuration file):

Deny from 123.456.789
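If your server runs Apache 2.4, the older Deny syntax relies on the mod_access_compat module being loaded. A rough equivalent using the newer Require directives might look like the following sketch. The 203.0.113 prefix is a placeholder from a documentation-only range, not a real scraper, and the snippet assumes your host allows authorization directives in htaccess files (AllowOverride AuthConfig):

```apache
# Sketch for Apache 2.4+ (mod_authz_core and mod_authz_host)
# 203.0.113 is a placeholder; swap in the address found in your access logs
<RequireAll>
    # Let everyone in by default...
    Require all granted
    # ...except requests from this address block
    Require not ip 203.0.113
</RequireAll>
```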

That will stop them from stealing your feed from that particular IP address. You could also do something like this:

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^123\.456\.789\.
RewriteRule .* http://dummyfeed.com/feed [R,L]

Instead of blocking requests from the scraper, this code delivers some “dummy” content of your choosing. You can do anything here, so get creative. For example, you might send them oversize feed files filled with Lorem Ipsum text. Or maybe you prefer to send them some disgusting images of bad things. You could also send them right back to their own server – always fun to watch infinite loops crash a website. Of course, there are other ways to block and redirect the bad guys if the IP isn’t available.
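As one example of those other ways, when a scraper’s IP keeps changing but its requests carry a recognizable user-agent string, you can match on that instead. This is only a sketch: the ScraperBot and FeedThief names are made up, and the rule assumes your feed lives at /feed/ (as with WordPress pretty permalinks) and that the snippet sits in the root htaccess file above any WordPress rewrite rules:

```apache
RewriteEngine On
# Hypothetical user-agent fragments; use strings seen in your own access logs
RewriteCond %{HTTP_USER_AGENT} (ScraperBot|FeedThief) [NC]
# Answer matching feed requests with 403 Forbidden, leaving other pages alone
RewriteRule ^feed/?$ - [F]
```

Keep in mind that user-agent strings are trivial to fake, so this only stops the lazier scrapers.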

Bottom Line

If you don’t have to worry about scrapers then don’t. In my experience that’s a luxury that we can’t always afford, especially for newer, fledgling sites that are still building up content and establishing reputation. Fortunately, you can protect your content quickly and easily with a simple line of code. And that’s pretty “fix-it-and-forget-it”-type stuff that anyone can do in a few minutes. And finally, regardless of your site’s condition, you should make a practice of including LOTS of internal links in your content to ride the wave of free linkbacks.

Here are some more ideas and techniques for stopping scrapers. Check the comments too – there are some real gems in there.

Jeff Starr
About the Author Jeff Starr = Web Developer. Security Specialist. WordPress Buff.
42 responses
  1. Timothy Warren September 24, 2010 @ 10:00 am

    For static, blog, and content-heavy sites, I see what you mean, but on very dynamic websites, scrapers are going to have a much more difficult time getting content.

    By the way, have you ever looked at using something like the browser capabilities database (http://browsers.garykeith.com/downloads.asp) as a basis for determining which user-agents to block?

  2. Instead of sending dummy text, send them a policy on infringement. That would get the message across to users that even though the site looks good, it’s still a scraping site stealing content… anyone have good generic language that could be used for this?

  3. Jeff you and Chris both make some good points. In the end I feel it comes down to a judgment call on the website at hand.

    The real irony is that this article has been scraped itself. My guess is that this is just one of many to come.

    http://www.webplus.me/2010/09/24/how-to-deal-with-content-scrapers/

    Editor’s note: (Update 2012 April 25th) just a note that the “do-nothing” approach is almost always effective in the long run, as scraper sites just don’t have the staying power that real, determined, hard-working people have going for them. Prime example: the link provided in this comment is now MIA (as is the entire crappy site).

  4. Jeff Starr

    @Timothy Warren: I think most content scrapers are just grabbing feeds and displaying them on their own sites. Most of the dynamic, content-driven sites I’ve seen these days provide free feeds. True that smaller sites and e-commerce sites may have less to worry about. Thank you for the tip on the browsers capabilities database. Looking forward to using it :)

    @Dave: Excellent idea, even using everyday language saying something like “this site is stealing content from mysite.com” would be effective at communicating the message to visitors.

    @Shay Howe: Yes it absolutely depends on the site and content. This post is just basically saying that doing more than nothing is easy and can protect your content from evil people. As you say, this post is already out there and has been scraped numerous times, but those internal links I sprinkled throughout the post help both people and machines identify the original source.

  5. I don’t think the htaccess way (blocking IP addresses) is reliable.

    IPs change frequently, really frequently, and in the long term you will ban all the users.

    Chris wrote a post about this a few days ago, “For how long should we ban ip?” (or something else, I don’t remember the title exactly).. it was about spam, but for scrapers it is different: spammers likely switch IPs every hour, but in a few minutes they can fill up your site with tons of spam, so it’s a good idea to ban an IP for a few hours.

    But scrapers don’t: they come, get the content, and leave; they’ll come back maybe next week. But tomorrow the IP you banned belongs to a ‘normal’ user.

    And, another point is: how can you tell scrapers from humans?

    I don’t think a scraper is so stupid as to crawl the content from his website’s server, using the server’s IP… that would be really, really stupid nowadays.

    @Timothy Warren: I saw a service just a few days ago that aims to create a real-time mobile version of your website.
    I don’t want to spam so I won’t name the service, but it is amazing how easily and accurately it retrieves the content from a website, telling main content from (for example) side blocks or navigation menus.
    SEO tries to make your site’s content easy for bots to read; scrapers are bots..

    The only way (I see) to protect from scrapers? Take your site back 10 years! Dismiss CSS, well-formatted HTML markup, forget about accessibility: put your text into images, fill up the page with tables and so on.

    p.s: sorry for my bad english ;)

  6. Jeff Starr

    @daniele: I think a few lines of htaccess is an ideal way to stop content thieves. I use the method all the time and you know what? It usually works immediately. Then I wait a few days or weeks and remove the block. Keeping an eye on those sites, I’ve never seen one try it again. So it works, it’s fast, and it’s easy.

    IPs change frequently, really frequently, and in the long term you will ban all the users.

    Most scraping is done with a simple script or plugin that grabs a feed, parses it, and spits it out on the scraper site. The IPs generally don’t change, so a “scraper blacklist” is going to remain relatively short (depending on popularity and other factors) and extremely accurate.

    But scrapers don’t: they come, get the content, and leave; they’ll come back maybe next week. But tomorrow the IP you banned belongs to a ‘normal’ user.

    No, usually they use a plugin to pull your feed every x hours to check for new content. When something new appears in the feed, it is automatically published as an article on the scraper site. The IP address isn’t going to suddenly change, and if it does, it’s most likely going to point to the same IP block, so the chances of blocking legitimate users is nil.

    And, another point is: how can you tell scrapers from humans? I don’t think a scraper is so stupid as to crawl the content from his website’s server, using the server’s IP… that would be really, really stupid nowadays.

    With all due respect, scrapers are lazy bastards. That’s why they’re stealing content. The easiest way to scrape content is to install WordPress, grab a few plugins and away you go – up and running in ten minutes. Fortunately this also makes it easy to stop them with a single line of code.

    The only way (I see) to protect from scrapers? Take your site back 10 years! Dismiss CSS, well-formatted HTML markup, forget about accessibility: put your text into images, fill up the page with tables and so on.

    Completely disagree. To say that you have to go back to the Stone Age to protect your website is ludicrous. With a little knowledge and a few lines of code, you can protect your site against just about anything. Even those fancy modern sites that use CSS and well-formatted HTML. ;)

    Sorry if that sounds harsh, but it really sounds like you are saying that people are powerless against those uber-smart content scrapers. That is so opposite of how I see things. These guys aren’t as smart as you think they are and it is usually quick and easy to stop them from stealing your content. If you want to.

  7. The sad part, Shay, is that the scraped article is in the Trackbacks …. I figure if you disable trackbacks … it often discourages some spammers… Another good solution is putting your own ads in the feed.

  8. I am usually somewhat surprised to see that the larger sites that are getting scraped don’t watermark their images. Something like “original content at perishablepress.com” small in the corner.

    Maybe it is time for Akismet to come to the rescue and deny access to your RSS feed from known scraper sites? Is that possible?

  9. Funny. I’m at home right now and my blackhole trap bookmark is on my work comp so I googled it and came up with a scraped article of it. Internal links were all intact so you do get the link back bonuses with Google.

  10. Wow, found your article through Chris’s site and now I understand it... especially since I am in the process of creating my own blog about web design, and of course its reputation and popularity would be a million miles away from CSS-Tricks, it would be much safer for me to do everything you have mentioned above. But if and when I get to the stage where I already have an established readership then I would have to do what Chris is saying, which is to spend time creating the good stuff rather than going after the scrapers.

    Thanks Jeff.

  11. Web Technology News September 26, 2010 @ 12:10 am

    Whatever methods you use to avoid it, they’ll still get around it.

    If you have an htaccess rule then it’s just a challenge (and an easy one) to the scraper. If you redirect them to a different page then again it’s just another simple challenge to get around it.

    Scraping scripts are generally done in languages where a proxy server switcher can be easily plugged in. If you add in lots of link backs then these can be removed with a simple regular expression.

    The thing you’ve got to be aware of is when a scraper grabs your text and then uses an algorithm to swap the words and sentences around so Google recognises it as unique content. I’m sure we’ll be seeing this more soon.

    I don’t personally scrape the web for plagiarism, I do it for other reasons.

  12. Great and helpful points, I learned something today.
