Perishable Press

Fixing WordPress Infinite Duplicate Content Issue

Jeff Morris recently demonstrated a potential issue with the way WordPress handles multipaged posts and comments. The issue involves WordPress’ inability to distinguish multipaged posts and comments that actually exist from those that do not. By answering requests for nonexistent numbered pages with the original post, WordPress effectively creates an infinite amount of duplicate content for your site. In this article, we explain the issue, discuss the implications, and provide an easy, working solution.

Understanding the “infinite duplicate content” issue

Using the <!--nextpage--> tag, WordPress makes it easy to split your post content into multiple pages, and also makes it easy to paginate the display of your comment threads. For both paged posts and paged comments, WordPress appends the page number to the permalink. So for example, if we have a post split into 3 pages, WordPress will generate the following set of completely valid permalinks (based on name-only permalink structure):

https://perishablepress.com/wordpress-infinite-duplicate-content/
https://perishablepress.com/wordpress-infinite-duplicate-content/1/
https://perishablepress.com/wordpress-infinite-duplicate-content/2/
https://perishablepress.com/wordpress-infinite-duplicate-content/3/

Likewise for paged comments, WordPress generates a predictable sequence of URLs:

https://perishablepress.com/wordpress-infinite-duplicate-content/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-1/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-2/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-3/
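
As a rough illustration of the two URL patterns above, the following sketch builds both sequences from a permalink. The helper names `post_page_urls` and `comment_page_urls` are hypothetical (they are not WordPress functions), and the sketch assumes a name-only permalink structure that ends with a trailing slash.

```php
<?php
// Hypothetical helpers (not part of WordPress): build the URL sequences
// described above for a permalink that ends with a trailing slash.

function post_page_urls(string $permalink, int $pages): array {
    $urls = [$permalink]; // the canonical URL comes first
    for ($n = 1; $n <= $pages; $n++) {
        $urls[] = $permalink . $n . '/'; // e.g. .../my-post/2/
    }
    return $urls;
}

function comment_page_urls(string $permalink, int $pages): array {
    $urls = [$permalink];
    for ($n = 1; $n <= $pages; $n++) {
        $urls[] = $permalink . 'comment-page-' . $n . '/';
    }
    return $urls;
}

print_r(post_page_urls('https://example.com/my-post/', 3));
print_r(comment_page_urls('https://example.com/my-post/', 3));
```

Note that in both lists the first two entries (the canonical URL and page 1) lead to the same content.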

In each of these examples, we see duplicate content for the canonical URL and the first page of each series. But it gets worse. We can append any integer to the canonical URL and get the exact same page. In fact, with the way WordPress 3.0 (and older) handles paged content, there are an infinite number of fully valid (i.e., header status “200 OK”) URLs that all point to the original post. Consider the following URLs:

https://perishablepress.com/wordpress-infinite-duplicate-content/10
https://perishablepress.com/wordpress-infinite-duplicate-content/100/
https://perishablepress.com/wordpress-infinite-duplicate-content/2000/
https://perishablepress.com/wordpress-infinite-duplicate-content/300000/

As it turns out, you can append any page number to the URL and, if the page doesn’t actually exist, WordPress will return a valid “200 OK” response and deliver the highest numbered existing page to the user. And not just for paged content — WordPress exhibits this behavior for any permalink on your site. This of course is not good from an SEO perspective, where an infinite number of URLs all pointing to the same page containing the exact same title, description, and content could potentially devastate your rankings.

Try it yourself! Add a page number to the end of any WordPress-powered URL and witness the insanity.

The worst part is that WordPress does this not only for multipaged posts and comments, but for any post or page, on any WordPress site on the Web: just append some number to the end of a URL and watch WordPress work its magic.

Why is this a potential problem?

To be fair, having an infinite number of unique URLs all pointing to the same page is only a problem if someone makes it a problem. In other words, as long as no one is linking to your site using these “phantom” URLs (as Jeff puts it), then everything is probably going to be fine. But as soon as you get some nefarious scumbag deliberately linking to your post using a few hundred differently numbered permalinks, it’s just a matter of time before Google pays a visit, tags your site as spam, and ultimately drops it from the index. Admittedly, the chances of something like this happening are probably slim to none, but the potential exists for some serious duplicate-content mayhem.

One possible (less-than-ideal) solution

As Jeff points out, “some bright spark might argue: ‘Hey, nowadays WP puts a canonical link in your head element, this is no problem.’” Certainly this is a possible way of dealing with the “infinite duplicate content issue,” but you have to keep in mind that the canonical link element is purely informational and only acknowledged by three search engines: Google, Bing, and Yahoo. The canonical link is nothing more than a hint designed to help search engines such as Google better apply their indexing algorithms. It is not backed by an RFC, and there is neither obligation to adhere to it nor consequence for ignoring it. It’s a step in the right direction, but should not be considered a “foolproof” solution for handling duplicate content.

[The canonical tag] is designed to help [Google] avoid shooting themselves in the foot. It’s no more than a hint. It’s not backed up by an RFC and there’s no obligation to honour it. Its use has a few caveats, and some cop-outs. – Jeff Morris

The canonical link is useful in helping the “big three” search engines determine your preferred URL when presented with duplicate content. There are many situations where canonical links may help, including complex query strings and www/non-www hostname variations. There is certainly no reason not to have canonical links for your site, so long as you understand that they aren’t a “bulletproof” solution for duplicate content. Again, from Jeff:

…the issues about WP that I’m grappling with concern the bit that lies in-between the host name and query string: the path component. The path is an expression of my blog’s internal structure, and I’ve delegated management of that to WP. And right now, WP is doing it wrong, it’s responding with phantom structures, and it’s bustin’ my chops.

When dealing with WordPress’ multipaged duplicate content issue, canonical links are like cold medicine – they mask the symptoms, but the underlying problem remains: WordPress is answering invalid URLs with valid pages, essentially creating infinite amounts of duplicate content.

For the record, as of version 2.9, WordPress automatically generates canonical tags for your single posts and pages. But it doesn’t provide canonical tags for paged comments. Fortunately, Jean-Baptiste Jung explains how with a simple function:

// canonical links for paged comments
function canonical_for_comments() {
	global $cpage, $post;
	// $cpage holds the current comment page number, if any
	if ($cpage > 1) :
		echo "\n";
		echo "<link rel='canonical' href='";
		echo esc_url( get_permalink( $post->ID ) ); // escape before output
		echo "' />\n";
	endif;
}
add_action('wp_head', 'canonical_for_comments');

Add that to your custom functions.php template and you’re good to go. With this in place, your WordPress site is fully equipped with canonical tags, which may or may not be the best solution. Keep in mind that the canonical links are only a suggestion designed to help Google et al. wade through some of the murkier URLs on the Web. So, if you’re looking for a better solution for the infinite duplicate content issue, read on.
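
As a side note, the effect of a canonical tag for paged comments can be sketched outside WordPress. The helper `canonical_tag_for` below is hypothetical: it assumes the `comment-page-N` URL pattern shown earlier and simply strips that trailing segment, whereas the WordPress function above obtains the canonical URL from `get_permalink()`.

```php
<?php
// Hypothetical helper (not a WordPress API): derive the canonical URL for a
// paged-comments URL by stripping a trailing "comment-page-N" segment,
// then build the corresponding <link> tag.

function canonical_tag_for(string $url): string {
    // ".../my-post/comment-page-3/" becomes ".../my-post/"
    $canonical = preg_replace('%/comment-page-\d+/?$%', '/', $url);
    // escape the URL before placing it inside an HTML attribute
    return "<link rel='canonical' href='" . htmlspecialchars($canonical, ENT_QUOTES) . "' />";
}

echo canonical_tag_for('https://example.com/my-post/comment-page-3/'), "\n";
```

Every comment page of a post yields the same tag, which is the hint the search engines are expected to pick up.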

A better solution

Instead of merely suggesting to the search engines that they should ignore nonexistent multipaged posts and comments, a better solution is to prevent WordPress from handling them in the first place. For this, we employ the following bit of scripted logic:

  1. Requests for legitimate/existing URLs are handled normally: no redirection takes place and the pages appear as normal at their corresponding URLs.
  2. Requests for URLs whose trailing page number is “0” or “1” are effectively redundant, and are permanently redirected (via 301) to the canonical URL.
  3. Requests for URLs whose trailing page number is greater than the post’s actual number of pages are “soft” redirected via “302 Temporary” header to the highest existing page (the original post itself, when the post is not paged), thereby consolidating all “phantom” requests into URLs that actually exist.

Putting this logic into practice, we get the following function:

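<?php // malicious post page ordinal fix, based on code by Jeff Morris
// [post URL for reference]

// Hooked to template_redirect: by then the main query has populated $posts
// and no output has been sent, so redirect headers can still be issued.
function fix_phantom_page_ordinals() {

    global $posts, $numpages;

    // only single posts and pages; leave archive pagination (/page/2/) alone
    if (!is_singular() || empty($posts)) {
        return;
    }

    $request_uri = $_SERVER['REQUEST_URI'];

    // capture the whole trailing number, e.g. "/300/" gives "300"
    $result = preg_match('%/(\d+)/?$%', $request_uri, $matches);

    $ordinal = $result ? intval($matches[1]) : FALSE;

    if (is_int($ordinal)) {

        // a numbered page was requested: validate it
        // look-ahead: initialises the global $numpages
        setup_postdata($posts[0]); // yes, hack

        // last real page, or the post itself when it is not paged
        $last_page = ($numpages > 1) ? "/$numpages/" : '/';

        $redirect_to = ($ordinal < 2) ? '/' : (($ordinal > $numpages) ? $last_page : FALSE);

        if (is_string($redirect_to)) {

            // we got us a phantom: swap only the trailing ordinal
            $redirect_url = get_option('home') . substr($request_uri, 0, -strlen($matches[0])) . $redirect_to;

            // if page = 0 or 1, redirect permanently; otherwise temporarily
            $status = ($ordinal < 2) ? 301 : 302;

            wp_redirect($redirect_url, $status);
            exit();
        }
    }
}
add_action('template_redirect', 'fix_phantom_page_ordinals');
?>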

To implement this technique, just place it into your active theme’s functions.php file. Consider this a “beta” fix and test thoroughly. I have tested this successfully on WordPress 2.3 and 2.9, so it will probably work fine on just about any version of WordPress in use today.
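
For testing the redirect rules without a WordPress install, the decision itself can be isolated as a pure function. This is only a sketch under stated assumptions: `resolve_page_request` and its return shape are hypothetical, and `$num_pages` stands in for the value WordPress keeps in its global `$numpages`.

```php
<?php
// Sketch only: the three redirect rules as a pure function, independent of
// WordPress. The function name and return shape are hypothetical;
// $num_pages stands in for WordPress's global $numpages.

function resolve_page_request(string $path, int $num_pages): ?array {
    // no trailing number: not a numbered-page request
    if (!preg_match('%/(\d+)/?$%', $path, $m)) {
        return null;
    }
    $ordinal = (int) $m[1];
    $base = substr($path, 0, -strlen($m[0])); // path without the trailing number

    if ($ordinal < 2) {
        // /0/ and /1/ duplicate the canonical URL: permanent redirect
        return ['status' => 301, 'location' => $base . '/'];
    }
    if ($ordinal > $num_pages) {
        // phantom page: temporary redirect to the last real page,
        // or to the post itself when it is not paged
        $target = ($num_pages > 1) ? "/$num_pages/" : '/';
        return ['status' => 302, 'location' => $base . $target];
    }
    return null; // a page that really exists: serve it normally
}

var_dump(resolve_page_request('/my-post/300000/', 3)); // phantom page
var_dump(resolve_page_request('/my-post/2/', 3));      // existing page
```

Keeping the decision separate from the header calls makes it easy to try edge cases such as `/my-post/10` (no trailing slash) or a post with a single page before touching a live site.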

This solution “fixes” the infinite duplicate content issue on the server-side of the equation. It eliminates the extraneous URLs by redirecting them to their canonical counterparts, so search engines that follow the redirects land on pages that actually exist instead of indexing endless duplicates.

Conversely, using the canonical tag is more of a “cosmetic suggestion” that fails to resolve the issue in a reliable way. As Matt Cutts states several times in his presentation on canonical links, it’s always better for site admins to resolve any potential duplication issues on their site at the upstream end. Similar ideas are expressed in this video on the canonical tag.

Thoughts?

This all seems like a significant issue to the few people that have now heard about this, but we may be getting caught up in the sheer discovery of it all. What do you think? Is this something that needs to be addressed in an upcoming version of WordPress? Is the canonical tag a good enough fix? Share your thoughts and help us sort it all out.

Jeff Starr
About the Author Jeff Starr = Web Developer. Security Specialist. WordPress Buff.
46 responses
  1. BTW,

    I also noticed something else in regards to the paginated comments…

    All of the dups reported in GWT that I checked have less than 50 comments yet are reporting as such:

    mypost.com
    mypost.com/comment-page-1

    So, not sure why the latter is being created when there are not even 50 comments. In other words, pagination isn’t necessary. The posts on the site that actually have more than 50 comments, haven’t yet reported! This all began immediately after switching to Thesis. What could be causing this?

    Hope this isn’t too much Jeff. Thanks

  2. @ Jeff M,

    Didn’t realize who you were until now! Considering that you are the man behind the code in this post..I am all ears! ;-)

    I would greatly appreciate your help as well, if you’re so inclined. Please read my last post for a better explanation.

    Sorry for all the comments Jeff S!

  3. Hi Jeff S,

    I have been getting some great assistance & explanations from Jeff M. in regards to this WP pagination issue. Curious of your opinion about that code I mentioned above from the Thesis forum. Any thoughts on handling that way? Don’t want to bother him w/ more :-)

    Although I see the actual meta tag is somehow missing from that code so here’s that part again:

    This is based on a condition but the content=”noindex” scares me some! Should it? :-)

    Thanks!

  4. Shabnam Sultan March 9, 2012 @ 9:21 pm

    I am also getting hundreds of duplicate titles & meta descriptions after shifting to Thesis and I am not able to find a proper solution to it.

    @Alan were you able to resolve your issue?

  5. Shabnam,

    What sort of duplication issues are you seeing? Are they of the duplicate comment page type like so?:

    mypost.com
    mypost.com/comment-page-1

    If so, simply disable comment pagination from within your WordPress admin dashboard by going to Settings-Discussion & be sure to uncheck the box for “Break comments into pages with top level comments per page and the page displayed by default. Comments should be displayed with the comments at the top of each page”

    WordPress will automatically 301 redirect all preexisting paginated comment pages to the original url, i.e.:

    mypost.com/comment-page-1 will redirect to mypost.com
    mypost.com/comment-page-2 will redirect to mypost.com

    etc, etc

    Hope that helps!

  6. Shabnam Sultan March 11, 2012 @ 12:16 am

    Thanks for the reply Alan :) I was facing the duplicate comment page issue even though “Break comments into pages” was unchecked in the WP dashboard.

    However, issue is resolved now by blocking Disallow: /comment-page-*/.

  7. Dear all,
    Thanks for the info. I have a question on duplicate content (I think). I have posted on my blog page information on directors duties, had comments on it, then updated it with a fuller post at the same URL.
    Now when I try to see the full post I can only see it through the Recent Comments widget rather than via the blog or the tags. It may help to see the problem @ 4alaw dot com directors duties. It’s the directors duties post not appearing fully when in my view it should.
    Please help.
    Thanks terry

  8. Hello,

    I have a problem with infinite duplicate content.
    I am getting /page/2/ etc., all the way up to millions and beyond. My site’s just been taken off Google (again!), and I need a fix that will work. Does this code actually work, or will it get me in even more bother?

    I’m having this problem with a few sites, so I need something that definitely works.

    Best regards,
    Shane

  9. I’ve tried this code, when I added this code then it shows 404 error for blog pages, it is redirected to example.com/page/ instead of example.com/page/3/

    How to solve this, please help me.

  10. Instead of filtering numbers ending with 0 and 1, it is filtering all numbers. And I get 404 for most pages
