Comprehensive URL Canonicalization via htaccess for WordPress-Powered Sites


Permalink URL canonicalization is automated via PHP in WordPress 2.3+. However, for those of us running sites on pre-2.3 versions or preferring to deal with rewrites directly via Apache, comprehensive WordPress URL canonicalization via htaccess may seem impossible. While there are several common methods that are partially effective, a complete, user-friendly solution designed specifically for WordPress has not been available. Until now..

In this article, I share my “secret” htaccess URL canonicalization formula. I originally developed this method in July of 2007, and have been using it successfully on a variety of WordPress-powered sites since that time. Thus, having verified the effective performance of this technique, I feel confident in sharing it with the open-source community. But first, a bit of context..

What is URL canonicalization?

Technically speaking, a canonical URL is the definitive URL for any given web page. For example, let’s say you have a web page located at the following URL:

http://domain.tld/subdirectory/canonicalized/

Assuming that this URL is the preferred, official URL, it is referred to as the canonical URL for that particular web page. As the canonical URL, it is the address that you want listed in the search engines, bookmarked by visitors, and so on. Unfortunately, on sites running WordPress versions earlier than 2.3, that same web page is accessible at all of the following URLs, only the first of which is canonical:

http://domain.tld/subdirectory/canonicalized/
http://domain.tld/subdirectory/canonicalized
http://domain.tld/subdirectory/index.php/canonicalized/
http://domain.tld/subdirectory/index.php/canonicalized
http://domain.tld/subdirectory/index.php?p=77
http://domain.tld/subdirectory/?p=77

http://www.domain.tld/subdirectory/canonicalized/
http://www.domain.tld/subdirectory/canonicalized
http://www.domain.tld/subdirectory/index.php/canonicalized/
http://www.domain.tld/subdirectory/index.php/canonicalized
http://www.domain.tld/subdirectory/index.php?p=77
http://www.domain.tld/subdirectory/?p=77

And so on.. Each of these non-canonical URLs is accessible by visitors and search engines, creating many potential sources of duplicate content. Duplicate content is undesirable as it may damage the search engine placement of the associated web page. This happens when the search engines index multiple versions of the same page and effectively divide the attributed link equity among the duplicates. Consolidating the worth of your pages by serving only a single instance of each of them is beneficial to you, your visitors, and your search engine rankings.

Should you use this canonicalization method?

As previously mentioned, WordPress 2.3+ provides a built-in URL canonicalization technique via PHP. Thus, if you are using WP 2.3+, you technically don’t need to use this method. However, if you prefer to handle URL rewriting at the server level, you may still benefit from its use.

For those of us running WP versions earlier than 2.3, WordPress includes no comprehensive canonicalization method. In fact, this technique was developed before WordPress 2.3, and is specifically designed to provide complete URL canonicalization for WordPress-powered sites. If this sounds like you, and you are concerned about duplicate content issues, I would definitely recommend implementing this solution.

Presenting Comprehensive URL Canonicalization for WordPress

Now that we’re all on the same page (so to speak), let’s have a look at the complete htaccess canonicalization code, as used here at Perishable Press:

# Comprehensive URL Canonicalization for WordPress
RedirectMatch permanent index\.php/(.*) http://perishablepress.com/press/$1

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.perishablepress\.com$ [NC]
RewriteRule ^(.*)$ http://perishablepress.com%{REQUEST_URI} [R=301,L]

RewriteBase /press/

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(html|php)$ http://perishablepress.com/press/$1 [R=301,L] 

RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /press/index.php [L]
</IfModule>

To verify the effectiveness of this method, try accessing a few non-canonical URLs for this site. For example, entering any of the non-canonical URL formats previously presented will return the official, canonical URL for any given page.

The Breakdown

Before presenting a generalized, copy-&-paste version of this htaccess canonicalization solution, let’s examine the functionality involved with each of its various directives:

Remove all instances of index.php from URLs
In the first line, we call upon RedirectMatch to match and remove all instances of index.php occurring within our permalink URLs.
Check and initialize the mod_rewrite module
In the second line, we check for the presence of the mod_rewrite module via an <IfModule mod_rewrite.c> container. Then, in the third line, we initialize the rewrite engine with RewriteEngine On.
Specify www prefix preference for all URLs
In the fourth and fifth lines, we effectively match and eliminate all instances of the www prefix from our URLs. With a simple modification we could ensure that the www prefix was included in our canonical URLs. For more information on this, check out our 2-for-1 special.
Specify the base directory for the rewrite rules
In the sixth line, we specify the base directory for the rewrite rules using the RewriteBase directive. For those of us familiar with the htaccess rules used for WordPress permalinks, this line should look familiar.
Remove all instances of index.php from the end of URLs
In the seventh and eighth lines, we are matching and removing all instances of index.php (or index.html) from the end of our permalink URLs.
Ensure a trailing slash at the end of all URLs
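In the ninth and tenth lines, we ensure that all permalink URLs include a trailing slash. URLs whose final segment contains a dot, such as requests for files with an extension, are left alone.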
Execute the remaining portion of the default permalink directives
The next three lines constitute the remainder of the default WordPress htaccess permalink rules. These should look familiar, and are required in order for WordPress to utilize its native permalink functionality.
Wrap it up by closing the IfModule container
Back in the second line, we opened an <IfModule mod_rewrite.c> container to check for the presence of the required mod_rewrite module. Now that we have finished specifying our complete set of rewrite directives, we close the container with </IfModule>.
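
As a reading aid, the two least obvious rule pairs can be annotated as follows. This is only a sketch: it uses the placeholder names domain.tld and subdirectory from the earlier example URLs rather than the live Perishable Press values, and the comments reflect my understanding of standard mod_rewrite behavior rather than any additional directives.

```apache
# THE_REQUEST holds the original request line sent by the client, for
# example "GET /subdirectory/index.php HTTP/1.1". Internal rewrites do
# not change it, so this condition matches only when the visitor asked
# for index.php directly, not when the final permalink rule rewrites a
# request to index.php. That guard is what prevents a redirect loop.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(html|php)$ http://domain.tld/subdirectory/$1 [R=301,L]

# For URLs that do not already end in a slash, the condition matches
# only when the final segment contains no dot, so requests such as
# /subdirectory/style.css are skipped. The rule pattern then requires
# that the last character is not a slash, and the redirect appends one.
RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]
```

Because the first condition requires a space immediately after index.php, query-string URLs such as index.php?p=77 do not match it and are left for WordPress to handle.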

Generalized Examples

Now that we have seen a site-specific version of this technique, and have gained a clearer picture as to its underlying functionality, let’s wrap things up with two generalized examples: one for root installations of WordPress, and the other for subdirectory installations. After each example, I will describe the few edits required for proper implementation and usage.

WordPress installed in the ROOT directory

# Comprehensive URL Canonicalization for WordPress in Root
RedirectMatch permanent index\.php/(.*) http://domain.tld/$1

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld%{REQUEST_URI} [R=301,L]

RewriteBase /

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(html|php)$ http://domain.tld/$1 [R=301,L] 

RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

To use, replace your existing root htaccess WordPress permalink rules with this code. Then, replace all instances of “domain.tld” with your domain name. That’s it. Upload and test like crazy. If everything seems like it is working, you can have some fun by entering various non-canonical URL variations and watching it work.

WordPress installed in a SUB-directory

# Comprehensive URL Canonicalization for WordPress in Subdirectory
RedirectMatch permanent index\.php/(.*) http://domain.tld/subdirectory/$1

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld%{REQUEST_URI} [R=301,L]

RewriteBase /subdirectory/

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(html|php)$ http://domain.tld/subdirectory/$1 [R=301,L] 

RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /subdirectory/index.php [L]
</IfModule>

To use, replace your existing subdirectory htaccess WordPress permalink rules with this code. Then, replace all instances of “domain.tld” with your domain name. Finally, replace all instances of “subdirectory” with the name of your own subdirectory. That’s it. Upload and test like spider pig. If everything seems like it is working, you can entertain the kids by entering various non-canonical URL variations and watching as they “magically” switch back to their correct locations.

Modifications & Variations

Canonical URL for root domain name

If you are running WordPress in a subdirectory, you may also want to ensure canonicalization of your site’s root URL. To do this, add the following code to your site’s root htaccess file:

# Remove index.php from root URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php [NC]
RewriteRule ^index\.php$ http://domain.tld/ [R=301,L] 

# Permanently redirect from www domain to non-www domain
RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld/$1 [R=301,L]

Once in place, this code will ensure that only the non-www, non-index.php versions of your site’s root domain URL are served. Remember, you only need this code if you are running WordPress in a directory other than root.
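
Note that the two rule pairs above are a fragment, and mod_rewrite rules are processed only where the rewrite engine is enabled. Assuming your root htaccess file contains no other rewrite directives, a minimal self-contained sketch might look like the following. The IfModule wrapper and the RewriteEngine On line are my additions here, mirroring the structure of the generalized examples.

```apache
# Sketch of a complete root htaccess file for a subdirectory install.
# Assumes no other rewrite rules are present in this file.
<IfModule mod_rewrite.c>
RewriteEngine On

# Remove index.php from root URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php [NC]
RewriteRule ^index\.php$ http://domain.tld/ [R=301,L]

# Permanently redirect from www domain to non-www domain
RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld/$1 [R=301,L]
</IfModule>
```

As with the other examples, replace “domain.tld” with your own domain name and test thoroughly after uploading.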

Serve www version instead of non-www version of URLs

If you prefer to use the www prefix in your URLs (as opposed to omitting it), simply replace these two lines:

RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld%{REQUEST_URI} [R=301,L]

..with these two lines:

RewriteCond %{HTTP_HOST} ^domain\.tld$ [NC]
RewriteRule ^(.*)$ http://www.domain.tld%{REQUEST_URI} [R=301,L]

..and, again, remember to replace all instances of “domain.tld” with your domain name.
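
If you would rather not hard-code the domain name at all, the non-www redirect can also be written generically. The following is a hedged sketch and not part of the original formula: it relies on the %1 backreference, which mod_rewrite fills from the captured group of the last matched RewriteCond pattern.

```apache
# Generic www removal: capture everything after "www." in the host
# and reuse it via %1 (a backreference to the RewriteCond pattern).
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^ http://%1%{REQUEST_URI} [R=301,L]
```

These two lines would take the place of the domain-specific non-www pair. The RedirectMatch line and the index.php rule still contain “domain.tld” and need editing as before.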

Conclusion

The canonicalization solution presented in this article is both comprehensive and effective. As mentioned before, I have been using this technique on myriad sites since July of 2007 with great success. Since employing this method, I have virtually eliminated all opportunity for duplicate content to be served from my site. Of course, there are other duplicate content issues not associated with canonical URLs, but we will save that for another article :)

Hopefully, sharing these rules will open the doors to improvements. If you see something that could be improved, optimized, or otherwise enhanced, please share your wisdom with us. As an open-source community, we only benefit when everyone shares, and we all benefit when somebody does.
