Perishable Press

Fixing WordPress Infinite Duplicate Content Issue

Jeff Morris recently demonstrated a potential issue with the way WordPress handles multipaged posts and comments. The issue involves WordPress’ inability to discern between multipaged posts and comments that actually exist and those that do not. By serving the original post in response to requests for nonexistent numbered pages, WordPress creates a potentially infinite amount of duplicate content for your site. In this article, we explain the issue, discuss the implications, and provide an easy, working solution.

Understanding the “infinite duplicate content” issue

Using the <!--nextpage--> tag, WordPress makes it easy to split your post content into multiple pages. It also makes it easy to paginate the display of your comment threads. For both paged posts and paged comments, WordPress appends the page number to the permalink. So for example, if we have a post split into 3 pages, WordPress will generate the following set of completely valid permalinks (based on name-only permalink structure):

https://perishablepress.com/wordpress-infinite-duplicate-content/
https://perishablepress.com/wordpress-infinite-duplicate-content/1/
https://perishablepress.com/wordpress-infinite-duplicate-content/2/
https://perishablepress.com/wordpress-infinite-duplicate-content/3/

Likewise for paged comments, WordPress generates a predictable sequence of URLs:

https://perishablepress.com/wordpress-infinite-duplicate-content/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-1/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-2/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-3/

In each of these examples, we see duplicate content for the canonical URL and the first page of each series. But it gets worse. We can append any integer to the canonical URL and get the exact same page. In fact, with the way WordPress 3.0 (and older) handles paged content, there are an infinite number of fully valid (i.e., header status “200 OK”) URLs that all point to the original post. Consider the following URLs:

https://perishablepress.com/wordpress-infinite-duplicate-content/10
https://perishablepress.com/wordpress-infinite-duplicate-content/100/
https://perishablepress.com/wordpress-infinite-duplicate-content/2000/
https://perishablepress.com/wordpress-infinite-duplicate-content/300000/

As it turns out, you can append any page number to the URL and, if the page doesn’t actually exist, WordPress will return a valid “200 OK” response and deliver the highest numbered existing page to the user. And not just for paged content — WordPress exhibits this behavior for any permalink on your site. This of course is not good from an SEO perspective, where an infinite number of URLs all pointing to the same page containing the exact same title, description, and content could potentially devastate your rankings.

Try it yourself! Add a page number to the end of any WordPress-powered URL and witness the insanity.

The worst part is that WordPress does this not only for multipaged posts and comments, but for any post or page. Try it on any WordPress site on the Web: just append some number to the end of any URL and watch WordPress work its magic.
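For anyone who would rather script the experiment than type URLs by hand, here is a minimal sketch in plain PHP. The function names (phantom_urls, first_status_line) and the example.com address are made up for illustration; they are not part of WordPress. The status check relies on PHP's built-in get_headers(), which needs allow_url_fopen enabled and performs a real HTTP request.

```php
<?php
// Hypothetical helpers for illustration only; not part of WordPress.

// Build "phantom" test URLs by appending each number to a permalink.
function phantom_urls(string $permalink, array $ordinals): array
{
    $base = rtrim($permalink, '/');
    $urls = [];
    foreach ($ordinals as $n) {
        $urls[] = $base . '/' . (int) $n . '/';
    }
    return $urls;
}

// Return the first status line sent by the server, or null on failure.
// Assumes allow_url_fopen is enabled; this makes a real network request.
function first_status_line(string $url): ?string
{
    $headers = @get_headers($url);
    return $headers === false ? null : $headers[0];
}

// Example usage (the URL is a placeholder, swap in a real permalink):
// foreach (phantom_urls('https://example.com/some-post/', [10, 100, 2000]) as $url) {
//     echo $url, ' => ', first_status_line($url), "\n";
// }
```

On an affected site, each phantom URL would be expected to report a "200 OK" status line rather than a 404 or a redirect.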

Why is this a potential problem?

To be fair, having an infinite number of unique URLs all pointing to the same page is only a problem if someone makes it a problem. In other words, as long as no one is linking to your site using these “phantom” URLs (as Jeff puts it), then everything is probably going to be fine. But as soon as you get some nefarious scumbag deliberately linking to your post using a few hundred differently numbered permalinks, it’s just a matter of time before Google pays a visit, tags your site as spam, and ultimately shuts it down. Admittedly, the chances of something like this happening are probably slim to none, but the potential exists for some serious duplicate-content mayhem.

One possible (less-than-ideal) solution

As Jeff points out, “some bright spark might argue: ‘Hey, nowadays WP puts a canonical link in your head element, this is no problem.’” Certainly this is a possible way of dealing with the “infinite duplicate content issue,” but you have to keep in mind that the canonical link element is purely informational and, at the time of writing, acknowledged by only three search engines: Google, Bing, and Yahoo. The canonical link is nothing more than a hint designed to help Google better apply its various indexing algorithms. It is not currently backed by an RFC, and there is neither obligation to adhere to it nor consequence for ignoring it. It’s a step in the right direction, but should not be considered a “foolproof” solution for handling duplicate content.

[The canonical tag] is designed to help [Google] avoid shooting themselves in the foot. It’s no more than a hint. It’s not backed up by an RFC and there’s no obligation to honour it. Its use has a few caveats, and some cop-outs. – Jeff Morris

The canonical link is useful in helping the “big three” search engines determine your preferred URL when presented with duplicate content. There are many situations where canonical links may help, including complex query strings and www/non-www hostname variations. There is certainly no reason not to have canonical links for your site, so long as you understand that they aren’t a “bulletproof” solution for duplicate content. Again, from Jeff:

…the issues about WP that I’m grappling with concern the bit that lies in-between the host name and query string: the path component. The path is an expression of my blog’s internal structure, and I’ve delegated management of that to WP. And right now, WP is doing it wrong, it’s responding with phantom structures, and it’s bustin’ my chops.

When dealing with WordPress’ multipaged duplicate content issue, canonical links are like cold medicine – they mask the symptoms, but the underlying problem remains: WordPress answers requests for invalid URLs with valid pages, essentially creating infinite amounts of duplicate content.

For the record, as of version 2.9, WordPress automatically generates canonical tags for your single posts and pages. But it doesn’t provide canonical tags for paged comments. Fortunately, Jean-Baptiste Jung explains how with a simple function:

// canonical links for comments
function canonical_for_comments() {
	global $cpage, $post;
	if ($cpage > 1) :
		echo "\n";
		echo "<link rel='canonical' href='";
		echo get_permalink( $post->ID );
		echo "' />\n";
	endif;
}
add_action('wp_head', 'canonical_for_comments');

Add that to your custom functions.php template and you’re good to go. With this in place, your WordPress site is fully equipped with canonical tags, which may or may not be the best solution. Keep in mind that canonical links are only a suggestion designed to help Google et al. wade through some of the murkier URLs on the Web. So, if you’re looking for a better solution for the infinite duplicate content issue, read on.

A better solution

Instead of merely suggesting to the search engines that they should ignore nonexistent multipaged posts and comments, a better solution is to prevent WordPress from handling them in the first place. For this, we employ the following bit of scripted logic:

  1. Requests for legitimate/existing URLs are handled normally: no redirection takes place and the pages appear as normal at their corresponding URLs.
  2. Requests for URLs ending with “0” or “1” are effectively redundant, and are permanently redirected (via 301) to the canonical URL.
  3. Requests for URLs ending with any number greater than the post’s actual page count are “soft” redirected via a “302 Found” (temporary) header to the highest-numbered existing page (for an unpaged post, that in turn resolves to the original post), thereby consolidating all “phantom” requests onto real URLs.

Putting this logic into practice, we get the following function:

<?php // malicious post page ordinal fix by Jeff Morris
// [post URL for reference]

    global $posts, $numpages;

    $request_uri = $_SERVER['REQUEST_URI'];

    // capture the whole trailing number (e.g. "300000"), not just its last digit
    $result = preg_match('%/(\d+)/?$%', $request_uri, $matches);

    $ordinal = $result ? intval($matches[1]) : FALSE;

    if(is_numeric($ordinal)) {

        // a numbered page was requested: validate it
        // look-ahead: initialises the global $numpages

        setup_postdata($posts[0]); // yes, hack

        // 0 or 1: canonical URL; beyond the last page: last page; otherwise leave alone
        $redirect_to = ($ordinal < 2) ? '/': (($ordinal > $numpages) ? "/$numpages/" : FALSE);

        if(is_string($redirect_to)) {

            // we got us a phantom
            // replace only the trailing ordinal, escaped so it is matched literally
            $redirect_url = get_option('home') . preg_replace('%'.preg_quote($matches[0], '%').'$%', $redirect_to, $request_uri);

            // if page = 0 or 1, redirect permanently
            if($ordinal < 2) {
                header($_SERVER['SERVER_PROTOCOL'] . ' 301 Moved Permanently');
            } else {
                header($_SERVER['SERVER_PROTOCOL'] . ' 302 Found');
            }

            header("Location: $redirect_url");
            exit();

        }
    }
?>

To implement this technique, just place it into your active theme’s functions.php file. Consider this a “beta” fix and test thoroughly. I have tested this successfully on WordPress 2.3 and 2.9, so it will probably work fine on just about any version of WordPress in use today.
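If you want to sanity-check the redirect rules without a WordPress install, the decision logic can be pulled out into a pure function. This is only a sketch of the three rules listed earlier: the function name resolve_page_ordinal and the shape of its return value are invented here for illustration and do not come from Jeff Morris’ code or from WordPress.

```php
<?php
// Hypothetical helper for illustration; not part of WordPress or the original fix.
// Given the requested page number and the post's real page count, return
// null when no redirect is needed, or the status code and the URL suffix
// that should replace the trailing ordinal.
function resolve_page_ordinal(int $ordinal, int $numpages): ?array
{
    if ($ordinal < 2) {
        // "/0/" and "/1/" duplicate the canonical URL: redirect permanently
        return ['status' => 301, 'suffix' => '/'];
    }
    if ($ordinal > $numpages) {
        // phantom page: soft-redirect to the highest-numbered existing page
        return ['status' => 302, 'suffix' => "/$numpages/"];
    }
    // a page that actually exists: serve it as normal
    return null;
}
```

For a post split into 3 pages, a request for page 2 or 3 is left alone, page 1 is sent to the canonical URL with a 301, and page 10 is sent to “/3/” with a 302.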

This solution “fixes” the infinite duplicate content issue on the server side of the equation. It eliminates the extraneous URLs by redirecting them to their canonical counterparts. Search engines that follow the redirects will end up at the canonical versions of your pages, thus eliminating the duplicate content.

Conversely, using the canonical tag is more of a “cosmetic suggestion” that fails to resolve the issue in a reliable way. As Matt Cutts states several times in his presentation on canonical links, it’s always better for site admins to resolve any potential duplication issues on their site at the upstream end. Similar ideas are expressed in this video on the canonical tag.

Thoughts?

This all seems like a significant issue to the few people that have now heard about this, but we may be getting caught up in the sheer discovery of it all. What do you think? Is this something that needs to be addressed in an upcoming version of WordPress? Is the canonical tag a good enough fix? Share your thoughts and help us sort it all out.

Jeff Starr
About the Author Jeff Starr = Fullstack Developer. Book Author. Teacher. Human Being.
46 responses
  1. This is strange. I’ve tested with all the stock custom permalinks (the one I’m actually using on the site in question is “day & name”). The site I’m building is from the ground up (not a mod of someone else’s template) on WP3 and is clean and lean. I’ve also tested with a fresh install of WP3 with the template in question, twentyten, classic, and default, all with the same result – everything’s great except for the pagination of the blog homepage when output from a page template. No plugins are running, nothing else of sway in functions.php, no htaccess except the regular wp block

  2. Is there any reason why this type of solution would have any better/worse results?

    It looks to be a simpler way to do essentially the same thing, or have I missed the point?

    <?php if(is_single() || is_page() || is_home()) { ?>
    <meta name="googlebot" content="index,follow" />
    <meta name="robots" content="index,follow" />
    <meta name="msnbot" content="index,follow" />
    <?php } else { ?>
    <meta name="googlebot" content="noindex,follow" />
    <meta name="robots" content="noindex,follow" />
    <meta name="msnbot" content="noindex,follow" />
    <?php }?>

    From: http://technicallyeasy.net/2009/02/preventing-duplicate-content-in-search-engines-with-wordpress/

  3. Hi,
    any news about this SEO issue being natively fixed in the latest WP version 3.1.1? I was trying to look up this topic on wordpress.org without success.

  4. J.Grisham June 18, 2011 @ 11:32 pm

    im new use wordpress, and i found duplicate content on my webmaster tools, can anybody help me, what easy plugin for removing duplicate content?.. i don’t understand code :(

  5. You seem to be not using this solution on your blog now. Any reasons? I was looking for a good solution to this, as the duplicate content issue is getting nastier with Google Panda, but I just want to check with you whether this is still needed and relevant.

  6. Jeff Starr

    Hey Rajesh,

    It depends on how far you want to take the SEO of your site. Some people go the extra mile, others do not.. it’s really up to you to decide whether or not it’s worth your time. For me, I’ve got too much to do as-is, so I haven’t implemented this technique at Perishable Press. If/when I ever find the time, I would probably use it.

  7. Thanks Jeff.

  8. Hi Jeff,

    I just switched a site over to Thesis, & for some reason that seemed to cause hundreds of the /comment-page-1 duplicate titles & meta descriptions. Not sure what Thesis would have to do w/ it but either way, the problem now exists…

    Rather than fiddling w/ WP, why not simply add the following to the robots.txt file:

    Disallow: /comment-page-*/

    It would seem that it would do the job until the time if/when WP decides to remedy the problem.

    I am no expert so I look forward to your reply. Thanks!!

  9. Jeff Starr

    Hi Alan, That’s a great idea that should work for compliant search engine bots, but unfortunately there’s not many of them these days. Even some of the larger bots like Yahoo Slurp are known to ignore robots.txt directives and crawl whatever they want. It gives them an “edge” in finding content for their index. So relying solely on robots.txt is good in theory, but doing so may be insufficient to prevent duplicate content.

  10. Hi,

    True, but since most traffic comes from G, & not as much from the others, was thinking at least I would make G happy!

    If you don’t mind…here is some code from girlie over at Thesis. What do you think of it & how is it different than yours?

    function robots_comment_pages() {
         if (get_query_var('cpage') >= 1 || get_query_var('cpage') < get_comment_pages_count())
              echo '';
         }
    add_action('wp_head','robots_comment_pages');

    http://diythemes.com/forums/showthread.php?36441-rel-Canonical-does-not-always-work-on-paginated-comments.-Webmaster-Tools-shows-as-dup-content

    Thanks!!

  11. Dudes

    Pardon me, but this dupe-content-comment-paging thing has been a non-issue for …like …ages.

    ZB Phantom is fixing it, yawn, etc…

    Please don’t blow me out on my birthday of all days, Perishable.

  12. Huh?

    It’s an issue on my site.

    Prefer not to use a plugin to fix it…
