Perishable Press

Fixing WordPress Infinite Duplicate Content Issue

Jeff Morris recently demonstrated a potential issue with the way WordPress handles multipaged posts and comments. The issue involves WordPress’ inability to distinguish between multipaged posts and comments that actually exist and those that do not. By answering requests for nonexistent numbered pages with the content of the original post, WordPress creates an effectively infinite amount of duplicate content for your site. In this article, we explain the issue, discuss the implications, and provide an easy, working solution.

Understanding the “infinite duplicate content” issue

Using the <!--nextpage--> tag, WordPress makes it easy to split your post content into multiple pages, and also makes it easy to paginate the display of your comment threads. For both paged posts and paged comments, WordPress appends the page number to the permalink. So for example, if we have a post split into 3 pages, WordPress will generate the following set of completely valid permalinks (based on name-only permalink structure):

https://perishablepress.com/wordpress-infinite-duplicate-content/
https://perishablepress.com/wordpress-infinite-duplicate-content/1/
https://perishablepress.com/wordpress-infinite-duplicate-content/2/
https://perishablepress.com/wordpress-infinite-duplicate-content/3/

Likewise for paged comments, WordPress generates a predictable sequence of URLs:

https://perishablepress.com/wordpress-infinite-duplicate-content/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-1/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-2/
https://perishablepress.com/wordpress-infinite-duplicate-content/comment-page-3/
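
To make the two URL patterns concrete, here is a minimal sketch that builds both lists for a given permalink. The function names are hypothetical helpers for illustration, not part of WordPress, and the sketch assumes a permalink structure that ends in a trailing slash.

```php
<?php
// Hypothetical helper (not a WordPress function): lists the URLs that
// exist for a post split into $numpages pages.
function sketch_post_page_urls($permalink, $numpages) {
    $base = rtrim($permalink, '/') . '/';
    $urls = array($base);               // the canonical URL doubles as page 1
    for ($i = 1; $i <= $numpages; $i++) {
        $urls[] = $base . $i . '/';
    }
    return $urls;
}

// Same idea for paged comments, which use the "comment-page-N" suffix.
function sketch_comment_page_urls($permalink, $comment_pages) {
    $base = rtrim($permalink, '/') . '/';
    $urls = array($base);
    for ($i = 1; $i <= $comment_pages; $i++) {
        $urls[] = $base . 'comment-page-' . $i . '/';
    }
    return $urls;
}
```

Calling either helper with this post's permalink and a count of 3 reproduces the four-line lists shown above.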

In each of these examples, we see duplicate content for the canonical URL and the first page of each series. But it gets worse. We can append any integer to the canonical URL and get the exact same page. In fact, with the way WordPress 3.0 (and older) handles paged content, there are an infinite number of fully valid (i.e., header status “200 OK”) URLs that all point to the original post. Consider the following URLs:

https://perishablepress.com/wordpress-infinite-duplicate-content/10
https://perishablepress.com/wordpress-infinite-duplicate-content/100/
https://perishablepress.com/wordpress-infinite-duplicate-content/2000/
https://perishablepress.com/wordpress-infinite-duplicate-content/300000/

As it turns out, you can append any page number to the URL and, if the page doesn’t actually exist, WordPress will return a valid “200 OK” response and deliver the highest numbered existing page to the user. And not just for paged content — WordPress exhibits this behavior for any permalink on your site. This of course is not good from an SEO perspective, where an infinite number of URLs all pointing to the same page containing the exact same title, description, and content could potentially devastate your rankings.

Try it yourself! Add a page number to the end of any WordPress-powered URL and witness the insanity.

The worst part is that WordPress does this not only for multipaged posts and comments, but for any post or page. Try it on any WordPress site on the Web: just append some number to the end of any URL and watch WordPress work its magic.
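
One way to spot such requests in code is to pull the trailing number off the request path and compare it with the number of pages the post really has. The sketch below is only an illustration: both function names are hypothetical, and it assumes you already know the real page count.

```php
<?php
// Hypothetical helper: returns the trailing page number of a URL path,
// or null when the path does not end in "/<digits>" or "/<digits>/".
function sketch_trailing_ordinal($path) {
    if (preg_match('%/(\d+)/?$%', $path, $m)) {
        return (int) $m[1];
    }
    return null;
}

// A request is a "phantom" when it names a page beyond the last real one.
function sketch_is_phantom($path, $numpages) {
    $n = sketch_trailing_ordinal($path);
    return $n !== null && $n > $numpages;
}
```

Note that the pattern requires a slash directly before the digits, so a comment URL such as `/some-post/comment-page-3/` is not treated as a numbered post page.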

Why is this a potential problem?

To be fair, having an infinite number of unique URLs all pointing to the same page is only a problem if someone makes it a problem. In other words, as long as no one is linking to your site using these “phantom” URLs (as Jeff puts it), then everything is probably going to be fine. But as soon as you get some nefarious scumbag deliberately linking to your post using a few hundred differently numbered permalinks, it’s just a matter of time before Google pays a visit, tags your site as spam, and ultimately shuts it down. Admittedly, the chances of something like this happening are probably slim to none, but the potential exists for some serious duplicate-content mayhem.

One possible (less-than-ideal) solution

As Jeff points out, “some bright spark might argue: ‘Hey, nowadays WP puts a canonical link in your head element, this is no problem.’” Certainly this is a possible way of dealing with the “infinite duplicate content issue,” but keep in mind that the canonical link element is purely informational and acknowledged by only three search engines: Google, Bing, and Yahoo. It is nothing more than a hint designed to help search engines such as Google better apply their indexing algorithms. At the time of writing, it is not backed by an RFC, and there is neither obligation to adhere to it nor consequence for ignoring it. It’s a step in the right direction, but should not be considered a “foolproof” solution for handling duplicate content.

[The canonical tag] is designed to help [Google] avoid shooting themselves in the foot. It’s no more than a hint. It’s not backed up by an RFC and there’s no obligation to honour it. Its use has a few caveats, and some cop-outs. – Jeff Morris

The canonical link is useful in helping the “big three” search engines determine your preferred URL when presented with duplicate content. There are many situations where canonical links may help, including complex query strings and www/non-www hostname variations. There is certainly no reason not to have canonical links for your site, so long as you understand that they aren’t a “bulletproof” solution for duplicate content. Again, from Jeff:

…the issues about WP that I’m grappling with concern the bit that lies in-between the host name and query string: the path component. The path is an expression of my blog’s internal structure, and I’ve delegated management of that to WP. And right now, WP is doing it wrong, it’s responding with phantom structures, and it’s bustin’ my chops.
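
Jeff’s distinction between host name, path, and query string can be made concrete with PHP’s `parse_url()`. The sketch below is an assumption-laden illustration, not anything WordPress does: the helper name and the `example.com` URLs are hypothetical. It normalises the host and drops the query string, and the path passes through untouched.

```php
<?php
// Hypothetical helper: normalises the parts of a URL that canonical links
// commonly deal with (host variation, query string) and leaves the path alone.
function sketch_normalise_host_and_query($url, $preferred_host) {
    $p      = parse_url($url);
    $scheme = isset($p['scheme']) ? $p['scheme'] : 'http';
    $path   = isset($p['path']) ? $p['path'] : '/';
    // the host is forced to the preferred one; query string and fragment are dropped
    return $scheme . '://' . $preferred_host . $path;
}
```

Given `https://www.example.com/some-post/300000/?ref=feed` and a preferred host of `example.com`, the helper still returns a URL whose path ends in `/300000/`. The phantom page number lives in the path, which is exactly the part Jeff is concerned about.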

When dealing with WordPress’ multipaged duplicate content issue, canonical links are like cold medicine: they mask the symptoms, but the underlying problem remains. WordPress answers invalid URLs with valid pages, essentially creating infinite amounts of duplicate content.

For the record, as of version 2.9, WordPress automatically generates canonical tags for your single posts and pages. But it doesn’t provide canonical tags for paged comments. Fortunately, Jean-Baptiste Jung explains how to add them with a simple function:

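// canonical links for comments
function canonical_for_comments() {
	// $cpage is the current comment page number; $post is the current post
	global $cpage, $post;
	// only comment pages after the first get the extra tag
	if ($cpage > 1) :
		echo "\n";
		echo "<link rel='canonical' href='";
		// point every comment page back at the post's own permalink
		echo esc_url( get_permalink( $post->ID ) );
		echo "' />\n";
	endif;
}
add_action('wp_head', 'canonical_for_comments');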

Add that to your theme’s functions.php file and you’re good to go. With this in place, your WordPress site is fully equipped with canonical tags, which may or may not be the best solution. Keep in mind that canonical links are only a suggestion designed to help Google et al. wade through some of the murkier URLs on the Web. So, if you’re looking for a better solution for the infinite duplicate content issue, read on.

A better solution

Instead of merely suggesting to the search engines that they should ignore nonexistent multipaged posts and comments, a better solution is to prevent WordPress from handling them in the first place. For this, we employ the following bit of scripted logic:

  1. Requests for legitimate/existing URLs are handled normally: no redirection takes place and the pages appear as normal at their corresponding URLs.
  2. Requests for URLs ending with “0” or “1” are effectively redundant, and are permanently redirected (via 301) to the canonical URL.
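  3. Requests for URLs ending with a number greater than the post’s actual page count are “soft” redirected via a “302 Found” (temporary) header to the highest numbered page that exists, thereby consolidating all “phantom” requests onto real URLs. For a post with only one page, that target is “/1/”, which rule 2 then redirects to the canonical URL.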
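
Stripped of any WordPress specifics, the three rules reduce to a small decision function. This is a sketch only: the function name is hypothetical, and the target for rule 3 follows Jeff’s function, which sends phantoms to the highest existing page.

```php
<?php
// Hypothetical helper mirroring the three rules above.
// Returns null for a legitimate page, otherwise array(status code, path suffix).
function sketch_redirect_decision($ordinal, $numpages) {
    if ($ordinal < 2) {
        return array(301, '/');             // rule 2: page 0 or 1 is redundant
    }
    if ($ordinal > $numpages) {
        return array(302, "/$numpages/");   // rule 3: phantom page
    }
    return null;                            // rule 1: page exists, no redirect
}
```

The returned suffix is meant to replace the trailing number in the requested path. The status code decides whether the redirect is permanent or temporary.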

Putting this logic into practice, we get the following function:

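<?php // malicious post page ordinal fix by Jeff Morris
// [post URL for reference]

// Hooked to template_redirect: the main query has run, but no output has been sent yet.
// Assumes post permalinks do not themselves end in a number.
function fix_phantom_page_ordinals() {

    global $posts, $numpages;

    // only single posts and pages; leave archives and /page/N/ alone
    if(!is_singular() || empty($posts)) return;

    // separate the path from any query string
    $parts = explode('?', $_SERVER['REQUEST_URI'], 2);
    $path  = $parts[0];
    $query = isset($parts[1]) ? '?' . $parts[1] : '';

    // capture the whole trailing number, not just its last digit
    if(!preg_match('%/(\d+)/?$%', $path, $matches)) return;

    $ordinal = intval($matches[1]);

    // a numbered page was requested: validate it
    // look-ahead: initialises the global $numpages

    setup_postdata($posts[0]); // yes, hack

    $redirect_to = ($ordinal < 2) ? '/': (($ordinal > $numpages) ? "/$numpages/" : FALSE);

    if(is_string($redirect_to)) {

        // we got us a phantom: swap only the trailing ordinal
        $new_path = substr($path, 0, -strlen($matches[0])) . $redirect_to;

        // REQUEST_URI already carries any subdirectory, so prepend scheme and host only
        $home = parse_url(get_option('home'));
        $port = isset($home['port']) ? ':' . $home['port'] : '';
        $redirect_url = $home['scheme'] . '://' . $home['host'] . $port . $new_path . $query;

        // if page = 0 or 1, redirect permanently
        if($ordinal < 2) {
            header($_SERVER['SERVER_PROTOCOL'] . ' 301 Moved Permanently');
        } else {
            header($_SERVER['SERVER_PROTOCOL'] . ' 302 Found');
        }

        header("Location: $redirect_url");
        exit();

    }
}
add_action('template_redirect', 'fix_phantom_page_ordinals');
?>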

To implement this technique, just place it into your active theme’s functions.php file. Consider this a “beta” fix and test thoroughly. I have tested this successfully on WordPress 2.3 and 2.9, so it will probably work fine on just about any version of WordPress in use today.

This solution “fixes” the infinite duplicate content issue on the server-side of the equation. It eliminates the extraneous URLs by redirecting them to their canonical counterparts. Search engines will always find the canonical versions of your pages thus eliminating any duplicate content.

Conversely, using the canonical tag is more of a “cosmetic suggestion” that fails to resolve the issue in a reliable way. As Matt Cutts states several times in his presentation on canonical links, it’s always better for site admins to resolve any potential duplication issues on their site at the upstream end. Similar ideas are expressed in this video on the canonical tag.

Thoughts?

This all seems like a significant issue to the few people who have now heard about it, but we may be getting caught up in the sheer discovery of it all. What do you think? Is this something that needs to be addressed in an upcoming version of WordPress? Is the canonical tag a good enough fix? Share your thoughts and help us sort it all out.

Jeff Starr
About the Author Jeff Starr = Creative thinker. Passionate about free and open Web.
46 responses
  1. If this works like it sounds, it seems like a great idea.

    But isn’t this something that should be wrapped in a function and added via add_action to something like the ‘init’ or ‘send_headers’ action, instead of just free floating in the functions.php file?

    I would think there is a specific spot in the order of WordPress loading that this would be a good fit, but loading when the theme’s functions file does seems to have the potential for problems.

    -Marty

  2. Thomas Scholz April 6, 2010 @ 6:42 pm

    Funny, I wrote about the same issue in the last days. For the numbering problem I created a plugin, which works without regexes: http://toscho.de/2010/wordpress-plugin-canonical-permalink/

    There are some other issues, which are best handled in the .htaccess: http://toscho.de/2010/wordpress-htaccess-request-saeubern/

  3. I’ve just started digging around for answers on this but one other issue is the home page + index.php … wordpress serves up the page (domain.com/index.php). For individual posts + index.php, it at least has the link rel='canonical' . How to fix this for the home page?

  4. It’s really a good solution. But I agree with Thornley, this may have a performance problem.

  5. I have a question: why don’t we use a 301 redirect when the requested page number is greater than the maximum? And why don’t we redirect to the original post (paged = 0) instead of to the highest-numbered page?

  6. @Marty T

    Short of hacking the core, your first opportunity to do anything is going to be in functions.php.
    The objective should be to commit as few resources to a phantom as possible. Without hacking the core.

    @Thomas S

    Liked your article. I hope this has given you more to ponder.
    The regex is ubiquitous; it’s trivial to port it to .htaccess for pages 0 and 1.

    @shimu

    The performance hit is negligible. If Perishable’s ‘nefarious scumbag’ should manage to dump your site into the abyss, I doubt many folks will even be able to comment on its performance.

    @Rilwis

    ‘Permanent’ should mean ‘permanent’, right? We don’t want to be painting ourselves into a corner, do we?

    This is an obscure issue, admittedly, but its true import we have yet to discover.

  7. So, I could be wrong, but this seems to break viewing older posts on the “home page”.

    So, if I go to /page/2 on my homepage, it redirects me back to just /. Even though /page/2 is valid.

    Any thoughts?

  8. @jeff

    re: functions file…

    Of course no hacking the core :)

    I think you misunderstood. Of course this would be placed in the functions.php file, but the function itself should be added using ‘add_action’ to the ‘init’ function to ensure this is one of the first things loaded into the page.

    I could see potential issues with allowing a function that plays with the headers to load at a random point.

    Any function that is meant to be loaded at a specific time, should take advantage of the WordPress hooks to lock it in place.

    Just my two cents.

  9. @Marty T

    OK, we’re on the same page, pun intended.

    But I don’t consider functions.php to be ‘some random point’. In a regular WP theme without output buffering it’s about the only place you can send headers and bail out. It’s what it’s for.

    What’s too-easily missed (or misunderstood) in Perishable’s article is that this is an issue ‘for both paged posts and paged comments’ (qv).

    Fact is we need to invoke setup_postdata() to initialise $numpages. And we need to invoke comments_template() to get a $cpage value.

    There’s a plugin here, for sure. I’d write it myself if it wasn’t for this dashed idleness. ;)

    JeffM

  10. hi Jeff….

    Using your function inside my child theme’s functions.php is no fun on pages or articles using <!--nextpage-->.

    Paged links like: blog.example.com/hallo-welt/2

    give me the result: blog.example.com///hallo-welt

    …and just let me see the first page….sic!

    using wp 2.9.2

  11. Pretty cool, but the thing is it kills the archives, which you want to be paged, right? e.g. it causes /2010/ to forward back to the home page. What would be the best way to only apply it to single posts?

  12. Cool thanks yea I tried adding if(is_single()) around the whole thing and around the other if statements but I haven’t got it to work, yet. ;)
