Fixing WordPress Infinite Duplicate Content Issue

Jeff Morris recently demonstrated a potential issue with the way WordPress handles multipaged posts and comments. The issue involves WordPress’ inability to distinguish between multipaged posts and comments that actually exist and those that do not. By answering requests for nonexistent numbered pages with the original post, WordPress creates an infinite amount of duplicate content for your site. In this article, we explain the issue, discuss the implications, and provide an easy, working solution.

Understanding the “infinite duplicate content” issue

Using the <!--nextpage--> tag, WordPress makes it easy to split your post content into multiple pages, and also makes it easy to paginate the display of your comment threads. For both paged posts and paged comments, WordPress appends the page number to the permalink. So for example, if we have a post split into 3 pages, WordPress will generate the following set of completely valid permalinks (based on name-only permalink structure):

http://perishablepress.com/wordpress-infinite-duplicate-content/
http://perishablepress.com/wordpress-infinite-duplicate-content/1/
http://perishablepress.com/wordpress-infinite-duplicate-content/2/
http://perishablepress.com/wordpress-infinite-duplicate-content/3/
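
To make the "valid" set concrete, here is a minimal sketch in plain PHP (no WordPress required) that counts the pages in a post body and lists the permalinks that really exist. The helper names count_post_pages() and valid_paged_permalinks() are made up for illustration, and the splitting is simplified compared to what WordPress does internally:

```php
<?php
// Hypothetical helper (not a WordPress function): count the pages
// that a post body is split into by the <!--nextpage--> tag.
function count_post_pages(string $content): int {
    // every <!--nextpage--> tag starts one additional page
    return substr_count($content, '<!--nextpage-->') + 1;
}

// Hypothetical helper: list the canonical URL plus every numbered
// page that actually exists for the post.
function valid_paged_permalinks(string $permalink, string $content): array {
    $base = rtrim($permalink, '/') . '/';
    $urls = [$base]; // canonical URL (shows the same content as page 1)
    $numpages = count_post_pages($content);
    for ($page = 1; $page <= $numpages; $page++) {
        $urls[] = $base . $page . '/';
    }
    return $urls;
}

// A post split into 3 pages, as in the example above
$content = "Part one<!--nextpage-->Part two<!--nextpage-->Part three";
foreach (valid_paged_permalinks('http://perishablepress.com/wordpress-infinite-duplicate-content/', $content) as $url) {
    echo $url, "\n";
}
```

Anything outside this list is a URL that WordPress should arguably not answer with a "200 OK".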

Likewise for paged comments, WordPress generates a predictable sequence of URLs:

http://perishablepress.com/wordpress-infinite-duplicate-content/
http://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-1/
http://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-2/
http://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-3/
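
The same idea applies to comments. The sketch below assumes a flat (unthreaded) comment list, where the page count is simply the comment total divided by the per-page setting, rounded up. With threaded comments WordPress paginates on top-level comments, so treat this as an approximation. Both function names are hypothetical:

```php
<?php
// Hypothetical helper: number of comment pages for a flat comment list.
function comment_page_count(int $total_comments, int $per_page): int {
    if ($total_comments < 1 || $per_page < 1) {
        return 0; // nothing to paginate
    }
    return (int) ceil($total_comments / $per_page);
}

// Hypothetical helper: build the comment-page URLs that really exist.
function comment_page_urls(string $permalink, int $total_comments, int $per_page): array {
    $base = rtrim($permalink, '/') . '/';
    $urls = [];
    $pages = comment_page_count($total_comments, $per_page);
    for ($n = 1; $n <= $pages; $n++) {
        $urls[] = $base . 'comment-page-' . $n . '/';
    }
    return $urls;
}
```

For example, 120 comments at 50 per page gives three comment pages, matching the three URLs listed above.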

In each of these examples, we see duplicate content for the canonical URL and the first page of each series. But it gets worse. We can append any integer to the canonical URL and get the exact same page. In fact, with the way WordPress 3.0 (and older) handles paged content, there are an infinite number of fully valid (i.e., header status “200 OK”) URLs that all point to the original post. Consider the following URLs:

http://perishablepress.com/wordpress-infinite-duplicate-content/10
http://perishablepress.com/wordpress-infinite-duplicate-content/100/
http://perishablepress.com/wordpress-infinite-duplicate-content/2000/
http://perishablepress.com/wordpress-infinite-duplicate-content/300000/

As it turns out, you can append any page number to the URL and, if the page doesn’t actually exist, WordPress will return a valid “200 OK” response and deliver the highest numbered existing page to the user. And not just for paged content: WordPress exhibits this behavior for any permalink on your site. This, of course, is not good from an SEO perspective, where an infinite number of URLs all pointing to the same page containing the exact same title, description, and content could potentially devastate your rankings.

Try it yourself! Add a page number to the end of any WordPress-powered URL and witness the insanity.

The worst part is that WordPress does this not only for multipaged posts and comments, but for any post or page. Try it yourself on any WordPress site on the Web: just append some number to the end of any URL and watch WordPress work its magic.
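
If you would rather script the experiment than type URLs by hand, something like the following sketch would do. It is plain PHP, the helper names phantom_urls() and status_line() are hypothetical, and the network call is left commented out so that nothing is requested until you point it at a site you control:

```php
<?php
// Hypothetical helper: append each test number to a permalink.
function phantom_urls(string $permalink, array $ordinals): array {
    $base = rtrim($permalink, '/') . '/';
    return array_map(function ($n) use ($base) {
        return $base . (int) $n . '/';
    }, $ordinals);
}

// Hypothetical helper: return the first status line of the response
// (for example "HTTP/1.1 200 OK"), or null if the request fails.
// Requires network access and allow_url_fopen.
function status_line(string $url): ?string {
    $headers = @get_headers($url);
    return $headers ? $headers[0] : null;
}

// Example run (uncomment to probe a site you control):
// foreach (phantom_urls('http://example.com/some-post/', [10, 100, 2000]) as $url) {
//     echo $url, ' => ', status_line($url) ?? 'no response', "\n";
// }
```

A "200 OK" on every line would confirm the behavior described here; a 301, 302, or 404 would suggest the site already handles phantom page numbers.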

Why is this a potential problem?

To be fair, having an infinite number of unique URLs all pointing to the same page is only a problem if someone makes it a problem. In other words, as long as no one is linking to your site using these “phantom” URLs (as Jeff puts it), then everything is probably going to be fine. But as soon as you get some nefarious scumbag deliberately linking to your post using a few hundred differently numbered permalinks, it’s just a matter of time before Google pays a visit, tags your site as spam, and ultimately shuts it down. Admittedly, the chances of something like this happening are probably slim to none, but the potential exists for some serious duplicate-content mayhem.

One possible (less-than-ideal) solution

As Jeff points out, “some bright spark might argue: ‘Hey, nowadays WP puts a canonical link in your head element, this is no problem.’” Certainly this is a possible way of dealing with the “infinite duplicate content issue,” but you have to keep in mind that the canonical link element is purely informational and only acknowledged by three search engines: Google, Bing, and Yahoo. The canonical link is nothing more than a hint designed to help Google better apply its various indexing algorithms. At the time of writing, it is not backed by an RFC, and there is neither obligation to adhere to it nor consequence for ignoring it. It’s a step in the right direction, but should not be considered a “foolproof” solution for handling duplicate content.

[The canonical tag] is designed to help [Google] avoid shooting themselves in the foot. It’s no more than a hint. It’s not backed up by an RFC and there’s no obligation to honour it. Its use has a few caveats, and some cop-outs. – Jeff Morris

The canonical link is useful in helping the “big three” search engines determine your preferred URL when presented with duplicate content. There are many situations where canonical links may help, including complex query strings and www/non-www hostname variations. There is certainly no reason not to have canonical links for your site, so long as you understand that they aren’t a “bulletproof” solution for duplicate content. Again, from Jeff:

…the issues about WP that I’m grappling with concern the bit that lies in-between the host name and query string: the path component. The path is an expression of my blog’s internal structure, and I’ve delegated management of that to WP. And right now, WP is doing it wrong, it’s responding with phantom structures, and it’s bustin’ my chops.

When dealing with WordPress’ multipaged duplicate content issue, canonical links are like cold medicine – they mask the symptoms, but the underlying problem remains: WordPress is answering invalid URLs with valid pages, essentially creating infinite amounts of duplicate content.

For the record, as of version 2.9, WordPress automatically generates canonical tags for your single posts and pages. But it doesn’t provide canonical tags for paged comments. Fortunately, Jean-Baptiste Jung explains how with a simple function:

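// canonical links for comments
function canonical_for_comments() {
	global $post;
	// 'cpage' is the query var WordPress sets for paged comments
	$cpage = (int) get_query_var( 'cpage' );
	// only act on single posts/pages showing comment page 2 or higher
	if ( is_singular() && $cpage > 1 ) {
		// point search engines at the post's main permalink
		echo "\n";
		echo "<link rel='canonical' href='";
		echo esc_url( get_permalink( $post->ID ) );
		echo "' />\n";
	}
}
add_action( 'wp_head', 'canonical_for_comments' );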

Add that to your theme’s functions.php file and you’re good to go. With this in place, your WordPress site is fully equipped with canonical tags, which may or may not be the best solution. Keep in mind that canonical links are only a suggestion designed to help Google et al. wade through some of the murkier URLs on the Web. So, if you’re looking for a better solution for the infinite duplicate content issue, read on.

A better solution

Instead of merely suggesting to the search engines that they should ignore nonexistent multipaged posts and comments, a better solution is to prevent WordPress from handling them in the first place. For this, we employ the following bit of scripted logic:

  1. Requests for legitimate/existing URLs are handled normally: no redirection takes place and the pages appear as normal at their corresponding URLs.
  2. Requests for URLs ending with “0” or “1” are effectively redundant, and are permanently redirected (via 301) to the canonical URL.
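  3. Requests for URLs ending with any number greater than the last existing page are “soft” redirected via a “302 Found” (temporary) header to the highest numbered existing page, thereby consolidating all “phantom” requests into valid URLs. For a post that isn’t paged, that target is page “1”, which rule 2 then redirects to the canonical URL.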
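
Stripped of everything WordPress-specific, the decision behind these rules can be sketched as one small pure function. The name classify_page_request() and the [status, target] return shape are assumptions made for illustration. Like the code presented in this section, the sketch sends out-of-range numbers to the last page that actually exists:

```php
<?php
// Hypothetical helper: decide how to answer a request for page $ordinal
// of a post that has $numpages real pages.
// Returns [status, target]: a null target means the canonical URL,
// an integer target means that numbered page.
function classify_page_request(int $ordinal, int $numpages): array {
    if ($ordinal < 2) {
        return [301, null];      // /0/ and /1/ duplicate the canonical URL
    }
    if ($ordinal > $numpages) {
        return [302, $numpages]; // phantom page: send to the last real page
    }
    return [200, $ordinal];      // real page: serve it normally
}
```

For a 3-page post, page 2 is served as usual, page 1 is permanently redirected to the canonical URL, and page 2000 is temporarily redirected to page 3.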

Putting this logic into practice, we get the following code:

<?php // malicious post page ordinal fix by Jeff Morris
// [post URL for reference]

// Runs on template_redirect, after the main query has populated $posts
// and before any output is sent.
function redirect_phantom_page_ordinals() {

    global $posts, $numpages;

    // only single posts and pages: date archives and /page/2/ also end in numbers
    if (!is_singular() || empty($posts)) {
        return;
    }

    $request_uri = $_SERVER['REQUEST_URI'];

    // capture the whole trailing number, with or without a final slash
    if (!preg_match('%/(\d+)/?$%', $request_uri, $matches)) {
        return;
    }

    $ordinal = intval($matches[1]);

    // a numbered page was requested: validate it
    // look-ahead: initialises the global $numpages
    setup_postdata($posts[0]); // yes, hack

    $redirect_to = ($ordinal < 2) ? '/' : (($ordinal > $numpages) ? "/$numpages/" : FALSE);

    if (is_string($redirect_to)) {

        // we got us a phantom: swap the trailing ordinal for the valid target
        // (assumes WordPress is installed at the domain root)
        $redirect_url = get_option('home') . preg_replace('%/\d+/?$%', $redirect_to, $request_uri);

        // if page = 0 or 1, redirect permanently; otherwise temporarily
        wp_redirect($redirect_url, ($ordinal < 2) ? 301 : 302);
        exit();
    }
}
add_action('template_redirect', 'redirect_phantom_page_ordinals');
?>

To implement this technique, just place it into your active theme’s functions.php file. Consider this a “beta” fix and test thoroughly. I have tested this successfully on WordPress 2.3 and 2.9, so it will probably work fine on just about any version of WordPress in use today.
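
As part of that testing, it may help to check the page-number parsing on its own, outside of WordPress. The sketch below is plain PHP with a hypothetical helper name, trailing_ordinal(). One detail worth verifying in any variant of the pattern: the digits need to be captured as a single group, (\d+), because a repeated one-digit group, (\d)+, keeps only the last digit it matched.

```php
<?php
// Hypothetical helper: pull the trailing page number out of a request URI.
// Returns null when the URI does not end in /<digits> or /<digits>/.
function trailing_ordinal(string $request_uri): ?int {
    // (\d+) captures every digit; (\d)+ would keep only the final one
    if (preg_match('%/(\d+)/?$%', $request_uri, $matches)) {
        return (int) $matches[1];
    }
    return null;
}

// Quick checks against the kinds of URLs discussed in this article
var_dump(trailing_ordinal('/wordpress-infinite-duplicate-content/2000/'));
var_dump(trailing_ordinal('/wordpress-infinite-duplicate-content/10'));
var_dump(trailing_ordinal('/wordpress-infinite-duplicate-content/'));
var_dump(trailing_ordinal('/wordpress-infinite-duplicate-content/comment-page-3/'));
```

Note that comment-page URLs do not match this pattern, so phantom comment pages are outside the scope of the redirect and remain a job for the canonical tag.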

This solution “fixes” the infinite duplicate content issue on the server side of the equation. It eliminates the extraneous URLs by redirecting them to their valid counterparts. Search engines that follow a phantom link are sent on to a page that actually exists, so the duplicates never get a “200 OK” of their own.

Conversely, using the canonical tag is more of a “cosmetic suggestion” that fails to resolve the issue in a reliable way. As Matt Cutts states several times in his presentation on canonical links, it’s always better for site admins to resolve any potential duplication issues on their site at the upstream end. Similar ideas are expressed in this video on the canonical tag.

Thoughts?

This all seems like a significant issue to the few people who have now heard about it, but we may be getting caught up in the sheer discovery of it all. What do you think? Is this something that needs to be addressed in an upcoming version of WordPress? Is the canonical tag a good enough fix? Share your thoughts and help us sort it all out.