Perishable Press

Fixing WordPress Infinite Duplicate Content Issue

Jeff Morris recently demonstrated a potential issue with the way WordPress handles multipaged posts and comments. The issue involves WordPress’ inability to discern between multipaged posts and comments that actually exist and those that do not. By answering requests for nonexistent numbered pages with the content of the original post, WordPress creates an infinite amount of duplicate content for your site. In this article, we explain the issue, discuss the implications, and provide an easy, working solution.

Understanding the “infinite duplicate content” issue

Using the <!--nextpage--> tag, WordPress makes it easy to split your post content into multiple pages, and also makes it easy to paginate the display of your comment threads. For both paged posts and paged comments, WordPress appends the page number to the permalink. So for example, if we have a post split into 3 pages, WordPress will generate the following set of completely valid permalinks (based on name-only permalink structure):

https://perishablepress.com/wordpress-infinite-duplicate-content/
https://perishablepress.com/wordpress-infinite-duplicate-content/1/
https://perishablepress.com/wordpress-infinite-duplicate-content/2/
https://perishablepress.com/wordpress-infinite-duplicate-content/3/

Likewise for paged comments, WordPress generates a predictable sequence of URLs:

https://perishablepress.com/wordpress-infinite-duplicate-content/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-1/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-2/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-3/

In each of these examples, we see duplicate content for the canonical URL and the first page of each series. But it gets worse. We can append any integer to the canonical URL and get the exact same page. In fact, with the way WordPress 3.0 (and older) handles paged content, there are an infinite number of fully valid (i.e., header status “200 OK”) URLs that all point to the original post. Consider the following URLs:

https://perishablepress.com/wordpress-infinite-duplicate-content/10
https://perishablepress.com/wordpress-infinite-duplicate-content/100/
https://perishablepress.com/wordpress-infinite-duplicate-content/2000/
https://perishablepress.com/wordpress-infinite-duplicate-content/300000/

As it turns out, you can append any page number to the URL and, if the page doesn’t actually exist, WordPress will return a valid “200 OK” response and deliver the highest numbered existing page to the user. And not just for paged content — WordPress exhibits this behavior for any permalink on your site. This of course is not good from an SEO perspective, where an infinite number of URLs all pointing to the same page containing the exact same title, description, and content could potentially devastate your rankings.

Try it yourself! Add a page number to the end of any WordPress-powered URL and witness the insanity.

The worst part is that WordPress does this not only for multipaged posts and comments, but for any post or page. Try it yourself on any WordPress site on the Web: just append some number to the end of any URL and watch WordPress work its magic.
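To see how these URLs collapse onto a single page, it can help to strip the trailing number yourself. The following is a minimal sketch in plain PHP, not part of WordPress; the function name `split_trailing_ordinal()` is hypothetical. It separates a URL path into the base permalink and the appended number, if there is one.

```php
<?php
// Hypothetical helper (not a WordPress function): split a URL path into the
// base permalink and the trailing page number, if one is present.
function split_trailing_ordinal($path) {
    // (\d+) captures the whole number; the trailing slash is optional
    if (preg_match('%^(.*?)/(\d+)/?$%', $path, $m)) {
        return array('base' => $m[1] . '/', 'ordinal' => intval($m[2]));
    }
    return array('base' => $path, 'ordinal' => null);
}

// The example paths from this article all reduce to the same base
$paths = array(
    '/wordpress-infinite-duplicate-content/10',
    '/wordpress-infinite-duplicate-content/100/',
    '/wordpress-infinite-duplicate-content/2000/',
    '/wordpress-infinite-duplicate-content/300000/',
);
foreach ($paths as $path) {
    $parts = split_trailing_ordinal($path);
    echo $parts['base'] . ' <- ' . $parts['ordinal'] . "\n";
}
```

Every line of output starts with the same base path, which is the point: as far as content goes, the number on the end makes no difference.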

Why is this a potential problem?

To be fair, having an infinite number of unique URLs all pointing to the same page is only a problem if someone makes it a problem. In other words, as long as no one is linking to your site using these “phantom” URLs (as Jeff puts it), then everything is probably going to be fine. But as soon as you get some nefarious scumbag deliberately linking to your post using a few hundred differently numbered permalinks, it’s just a matter of time before Google pays a visit, tags your site as spam, and ultimately shuts it down. Admittedly, the chances of something like this happening are probably slim to none, but the potential exists for some serious duplicate-content mayhem.

One possible (less-than-ideal) solution

As Jeff points out, “some bright spark might argue: ‘Hey, nowadays WP puts a canonical link in your head element, this is no problem.’” Certainly this is a possible way of dealing with the “infinite duplicate content issue,” but you have to keep in mind that the canonical link element is purely informational and only acknowledged by three search engines: Google, Bing, and Yahoo. The canonical link is nothing more than a hint designed to help Google better apply its various indexing algorithms. It is not backed by an RFC, and there is neither an obligation to adhere to it nor any consequence for ignoring it. It’s a step in the right direction, but should not be considered a “foolproof” solution for handling duplicate content.

[The canonical tag] is designed to help [Google] avoid shooting themselves in the foot. It’s no more than a hint. It’s not backed up by an RFC and there’s no obligation to honour it. Its use has a few caveats, and some cop-outs. – Jeff Morris

The canonical link is useful in helping the “big three” search engines determine your preferred URL when presented with duplicate content. There are many situations where canonical links may help, including complex query strings and www/non-www hostname variations. There is certainly no reason not to have canonical links for your site, so long as you understand that they aren’t a “bulletproof” solution for duplicate content. Again, from Jeff:

…the issues about WP that I’m grappling with concern the bit that lies in-between the host name and query string: the path component. The path is an expression of my blog’s internal structure, and I’ve delegated management of that to WP. And right now, WP is doing it wrong, it’s responding with phantom structures, and it’s bustin’ my chops.

When dealing with WordPress’ multipaged duplicate content issue, canonical links are like cold medicine – they mask the symptoms, but the underlying problem remains: WordPress is answering invalid URLs with valid pages, essentially creating infinite amounts of duplicate content.

For the record, as of version 2.9, WordPress automatically generates canonical tags for your single posts and pages. But it doesn’t provide canonical tags for paged comments. Fortunately, Jean-Baptiste Jung explains how to add them with a simple function:

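// canonical links for paged comments
function canonical_for_comments() {
	global $cpage, $post;
	// $cpage is the comment-page query var; only pages past the first get the tag
	if ($cpage > 1) :
		echo "\n";
		echo "<link rel='canonical' href='";
		// escape the permalink before printing it into an attribute
		echo esc_url( get_permalink( $post->ID ) );
		echo "' />\n";
	endif;
}
add_action('wp_head', 'canonical_for_comments');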

Add that to your theme’s functions.php file and you’re good to go. With this in place, your WordPress site is fully equipped with canonical tags, which may or may not be the best solution. Keep in mind that canonical links are only a suggestion designed to help Google et al. wade through some of the murkier URLs on the Web. So, if you’re looking for a better solution for the infinite duplicate content issue, read on.
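If you want to check which URLs such a tag would consolidate, the mapping is easy to reproduce outside WordPress. This is a rough sketch that assumes the `comment-page-N` URL format shown earlier; `comment_page_canonical()` and `canonical_link_tag()` are hypothetical names, and neither function touches WordPress.

```php
<?php
// Hypothetical helper: map a paged-comments URL to its permalink by
// dropping a trailing "comment-page-N" segment.
function comment_page_canonical($url) {
    return preg_replace('%/comment-page-\d+/?$%', '/', $url);
}

// Hypothetical helper: build a canonical link tag, escaping the URL
// for use inside an HTML attribute.
function canonical_link_tag($url) {
    return "<link rel='canonical' href='" . htmlspecialchars($url, ENT_QUOTES) . "' />";
}

echo canonical_link_tag(comment_page_canonical(
    'https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-2/'
)) . "\n";
```

URLs without a `comment-page-N` suffix pass through unchanged, so the helper is safe to run over a mixed list of URLs from a crawl or a log file.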

A better solution

Instead of merely suggesting to the search engines that they should ignore nonexistent multipaged posts and comments, a better solution is to prevent WordPress from handling them in the first place. For this, we employ the following bit of scripted logic:

  1. Requests for legitimate/existing URLs are handled normally: no redirection takes place and the pages appear as normal at their corresponding URLs.
  2. Requests for URLs ending with “0” or “1” are effectively redundant, and are permanently redirected (via 301) to the canonical URL.
  3. Requests for URLs ending with a number greater than the post’s actual page count are “soft” redirected via a “302” temporary header to the highest-numbered page that exists (for an unpaged post, that in turn resolves to the original post), thereby consolidating all “phantom” requests onto URLs that actually exist.

Putting this logic into practice, we get the following function:

<?php // malicious post page ordinal fix by Jeff Morris
// [post URL for reference]

    global $posts, $numpages;

    $request_uri = $_SERVER['REQUEST_URI'];

    // capture the whole trailing number; (\d)+ would keep only its last digit
    $result = preg_match('%/(\d+)/?$%', $request_uri, $matches);

    $ordinal = $result ? intval($matches[1]) : FALSE;

    if(is_numeric($ordinal)) {

        // a numbered page was requested: validate it
        // look-ahead: initialises the global $numpages

        setup_postdata($posts[0]); // yes, hack

        $redirect_to = ($ordinal < 2) ? '/': (($ordinal > $numpages) ? "/$numpages/" : FALSE);

        if(is_string($redirect_to)) {

            // we got us a phantom: swap out only the trailing segment,
            // leaving identical segments earlier in the path untouched
            $redirect_url = get_option('home') . substr($request_uri, 0, -strlen($matches[0])) . $redirect_to;

            // if page = 0 or 1, redirect permanently
            if($ordinal < 2) {
                header($_SERVER['SERVER_PROTOCOL'] . ' 301 Moved Permanently');
            } else {
                header($_SERVER['SERVER_PROTOCOL'] . ' 302 Found');
            }

            header("Location: $redirect_url");
            exit();

        }
    }
?>

To implement this technique, just place it into your active theme’s functions.php file. Consider this a “beta” fix and test thoroughly. I have tested this successfully on WordPress 2.3 and 2.9, so it will probably work fine on just about any version of WordPress in use today.

This solution “fixes” the infinite duplicate content issue on the server side of the equation. It eliminates the extraneous URLs by redirecting them to their canonical counterparts. Search engines that follow those redirects arrive at the canonical versions of your pages, so the duplicates drop out of the picture.

Conversely, using the canonical tag is more of a “cosmetic suggestion” that fails to resolve the issue in a reliable way. As Matt Cutts states several times in his presentation on canonical links, it’s always better for site admins to resolve any potential duplication issues on their site at the upstream end. Similar ideas are expressed in this video on the canonical tag.
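Because the redirect script reads `$_SERVER` and sends headers directly, it is awkward to test on its own. One option is to isolate the decision into a pure function first. The sketch below is a restatement of the three rules under stated assumptions, not Jeff Morris’s code: `resolve_numbered_request()` is a hypothetical name, and the page count is passed in as an argument rather than read from the WordPress global `$numpages`.

```php
<?php
// Hypothetical pure version of the decision logic: returns null when the
// request should be served normally, or array(status, path) for a redirect.
function resolve_numbered_request($request_uri, $numpages) {
    if (!preg_match('%/(\d+)/?$%', $request_uri, $m)) {
        return null; // no trailing number: nothing to do
    }
    $ordinal = intval($m[1]);
    // everything before the trailing "/N" or "/N/" segment
    $base = substr($request_uri, 0, strlen($request_uri) - strlen($m[0]));

    if ($ordinal < 2) {
        // "0" and "1" duplicate the canonical URL: permanent redirect
        return array(301, $base . '/');
    }
    if ($ordinal > $numpages) {
        // phantom page: temporary redirect to the last page that exists
        return array(302, $base . '/' . $numpages . '/');
    }
    return null; // a page that really exists
}

// A post split into 3 pages, as in the example near the top of the article
var_dump(resolve_numbered_request('/wordpress-infinite-duplicate-content/2/', 3));
var_dump(resolve_numbered_request('/wordpress-infinite-duplicate-content/1/', 3));
var_dump(resolve_numbered_request('/wordpress-infinite-duplicate-content/300000/', 3));
```

Called with a page count of 1 (an unpaged post), a phantom request is sent to `/1/`, which a second pass then resolves to the canonical URL with a 301. Like the script itself, this sketch knows nothing about date archives or `/page/N/` pagination, which is why the conditional tags discussed in the comments still matter.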

Thoughts?

This all seems like a significant issue to the few people who have now heard about it, but we may be getting caught up in the sheer discovery of it all. What do you think? Is this something that needs to be addressed in an upcoming version of WordPress? Is the canonical tag a good enough fix? Share your thoughts and help us sort it all out.

Jeff Starr
About the Author Jeff Starr = Web Developer. Security Specialist. WordPress Buff.
46 responses
  1. @thomas

    Hmm, so it’s doing exactly what it’s supposed to do. The script in the article just illustrates one way to handle this for pretty permalinks.
    See below for more…

    @ryanve

    Some fine-tuning is inevitable, you should take this script as a starting point for your own solution. Typically, you’ll want to wrap it in some conditional tag that targets – or excludes – certain posts or pages.

    In your case, it might be if( is_single() ){ /* script here */ }.

    See the WordPress Conditional Tags page at the codex for a list of conditional tags.

    HTH

  2. @ryanve

    It’s evident that requests for date archives represent a special case and need to be filtered out of the process.

    It might seem that a simple conditional test (for NOT a date archive) would suffice:

    if( ! is_date() ) {/* process it */}

    …and to an extent it does. But when we get a request for a future date – say, /2010/5/ while it’s still only /2010/4/, WP flags the situation as a 404 error, not a date archive, so if( ! is_date() ) is satisfied, returning TRUE — the opposite of what we expect — and blows our logic into the weeds. Here, the redirection to the canonical /2010/ is probably inappropriate, although some folks might be happy with it.

    Note that we get the same outcome with requests for real past dates where, say, we had an apathetic month and couldn’t be arsed to post anything (maybe we felt run down, having been hit by a truck). ;)

    It’s likely, however, that there will one day exist archives for /2010/5/ (the following month, I guess). That’s the reasoning behind the decision to redirect temporarily with a 302 (and this is actually a clearer answer to Rilwis in post #5). Who knows what the future holds?

    On the other hand, if you’re marginally OCD like myself, you probably want to do it slightly more ‘proper’ and let WP consign bad requests to your gorgeous custom 404 page (that you were inspired by Perishable to create all those moons ago). Fair enough.

    if( ( ! is_date() ) && ( ! is_404() ) ){ /* sweet, off we go */}

    FYI, this is how I implement it.

    And while I’m on a roll, here’s a line for your .htaccess that nobbles requests for anything ‘0’. Avoids waking the slumbering giant.

    ### we don't got '0' of anything
    RewriteCond %{REQUEST_URI} (.*)/0/?$
    RewriteRule .* %1/ [R=301,L]

    Works for me.

    HTH

    Now, where did I put that Smirnoff?

  3. That looks like it’d do the trick. Thanks for the extra info!

  4. Wow! Using this code changed a Google Site Analysis report from
    You have minor duplicate tag issues.
    220 pages are using the same description tag.
    174 pages are using the same title tag.
    to
    You have minor duplicate tag issues.
    58 pages are using the same description tag.
    14 pages are using the same title tag.
    Within 10 minutes of updating the functions.php file.
    One thing though, as a newbie I don’t always understand where to place code so it’s a bit scary. Learning as I go all about “The Loop”. Thanks, again.

  5. This problem has been known and ignored by the WordPress team for quite a long time. It was probably first explored around 3 years ago on a blog I stopped updating soon after:

    Yet Another Duplicate Content Vulnerability Hits WordPress, Movable Type Blogs (Part 2) (404 link removed 2017/01/27)

    Note that the mod_rewrite-based fix explored there is no longer really suitable.

    The real fix belongs with the WordPress developers, but unfortunately the problem was ignored as ‘invalid’ when someone reported it to the developers a year later (because gosh, who would ever think of such a URL?):

    Serious SEO Bug with duplicated content

    Some of the related problems caused by design decisions with the commenting code introduced in 2.7 have been addressed:

    Comment Paging leads to duplicated URLs for the same content

    While others have been half-fixed or left unfixed:

    Function get_comment_link returns duplicate URLs for comments available at permalink
    Function get_page_of_comment() returns incorrect page numbers

    All the best,
    Greg

  6. I’ve hit a snag implementing this –

    I’ve got a design with a static homepage and the blog main page output via a page template. The script works perfectly (thanks so much), except for the pagination of the blog main page (older posts / newer posts) which 404.

    I’ve tried absolutely everything my mediocre skills will allow, every kind of conditional wrap etc. that I can cogitate, testing for subpages etc…

  7. @aljuk

    I guess you’re calling next_posts_link() and previous_posts_link() to get your navigation links.
    These functions take their cue from your posts-per-page option in Settings->Reading: Blog pages show at most ?? posts and they append ‘/page/[n]/’ to the URL. Stripping out the last segment here will generate a 404.

    The conditional is_paged() indicates whether this path extension is in play, for page 2 and higher on index, home, front_page and archive pages. And if is_paged() returns true, the global $paged variable holds the ordinal of the current page of posts.

    The following conditional should fix your snag:

    if(!is_date() && !is_paged() && !is_404()){

         /* code here */

    }

    HTH

  8. Hi Jeff,

    thanks for the reply. I don’t think I was too clear, sorry (a very late night …)

    I already had the conditional in place, but it doesn’t work for the is_home page if the is_front_page is a “page” page, as set in Settings > Reading.

    It appears to be a permalink issue specific to that one situation – it works perfectly with the default permalink structure, but not with any custom structure.

  9. Sorry, when I say “but it doesn’t work for the is_home page” what I actually mean is that the next_posts_link() and previous_posts_link() on the is_home page still 404, but only when a custom permalink structure is in place.

  10. aljuk

    Hey, your latest came in while I was typing this, quote…

    That’s odd. I just checked it out using a templated page as the static front page, and it worked fine. It only borks if I drop the is_paged() clause. Btw, I’m using the ‘Month and name’ permalinks (/%year%/%monthnum%/%postname%/).

    …unquote

    I hear what you’re saying, aljuk. It must be down to your custom permalinks – whatever they are. You don’t elaborate – is it ‘TM’? ;)

    This all goes to show how pervasive the problem is, I suppose. WP 3.0 RC3 takes us no further either. Maybe better to implement our fixes on new WP builds, since retro-fitting can have so many issues.

    Incidentally, consider grabbing that global $paged value if it’s set, and tag it onto your document title and/or meta description. Avoids duplicating the sucker.

    Thanks for your input, stuff I hadn’t anticipated.

    ;)

  11. This is strange. I’ve tested with all the stock custom permalinks (the one I’m actually using on the site in question is “day & name”). The site I’m building is from the ground up (not a mod of someone else’s template) on WP3 and is clean and lean. I’ve also tested with a fresh install of WP3 with the template in question, twentyten, classic, and default, all with the same result – everything’s great except for the pagination of the blog homepage when output from a page template. No plugins are running, nothing else of sway in functions.php, no htaccess except the regular wp block.

  12. On further testing, it seems that my conditional is simply being ignored. This is what I’m using (below), in functions.php – am I applying this conditional in the right way?

    In functions.php:

    <?php
    if (!is_paged() && !is_date() && !is_404()) {
         global $posts, $numpages;
         $request_uri = $_SERVER['REQUEST_URI'];
         $result = preg_match('%\/(\d)+(\/)?$%', $request_uri, $matches);
         $ordinal = $result ? intval($matches[1]) : FALSE;
         if(is_numeric($ordinal)) {
              // a numbered page was requested: validate it
              // look-ahead: initialises the global $numpages
              setup_postdata($posts[0]); // yes, hack
              $redirect_to = ($ordinal < 2) ? '/': (($ordinal > $numpages) ? "/$numpages/" : FALSE);
              if(is_string($redirect_to)) {
                   // we got us a phantom
                   $redirect_url = get_option('home') . preg_replace('%'.$matches[0].'%', $redirect_to, $request_uri);
                   // if page = 0 or 1, redirect permanently
                   if($ordinal < 2) {
                        header($_SERVER['SERVER_PROTOCOL'] . ' 301 Moved Permanently');
                   } else {
                        header($_SERVER['SERVER_PROTOCOL'] . ' 302 Found');
                   }
                   header("Location: $redirect_url");
                   exit();
              }
         }
    }
    ?>

    In index.php:

    <?php if ( $wp_query->max_num_pages > 1 ) : ?>
    <div class="navigation" id="nav-above">
    <div class="nav-previous"><?php next_posts_link('Previous posts'); ?></div>
    <div class="nav-next"><?php previous_posts_link('Recent posts;'); ?></div>
    </div><!-- #nav-above -->
    <?php endif; ?>
