Comprehensive URL Canonicalization via htaccess for WordPress-Powered Sites

by Jeff Starr on Wednesday, January 16, 2008

Permalink URL canonicalization is automated via PHP in WordPress 2.3+; however, for those of us running sites on pre-2.3 versions or preferring to deal with rewrites directly via Apache, comprehensive WordPress URL canonicalization via htaccess may seem impossible. While there are several common methods that are partially effective, there has not yet been available a complete, user-friendly solution designed specifically for WordPress. Until now..

In this article, I share my “secret” htaccess URL canonicalization formula. I originally developed this method in July of 2007, and have been using it successfully on a variety of WordPress-powered sites since that time. Thus, having verified the effective performance of this technique, I feel confident in sharing it with the open-source community. But first, a bit of context..

What is URL canonicalization?

Technically speaking, a canonical URL is the definitive URL for any given web page. For example, let’s say you have a web page located at the following URL:

http://domain.tld/subdirectory/canonicalized/

Assuming that this URL is the preferred, official URL, it is referred to as the canonical URL for that particular web page. As the canonical URL, that is the address that you want listed in the search engines, bookmarked by visitors, and so on. Unfortunately, on sites running WordPress versions less than 2.3, that web page would also be accessible at the following URLs:

http://domain.tld/subdirectory/canonicalized/
http://domain.tld/subdirectory/canonicalized
http://domain.tld/subdirectory/index.php/canonicalized/
http://domain.tld/subdirectory/index.php/canonicalized
http://domain.tld/subdirectory/index.php?p=77
http://domain.tld/subdirectory/?p=77

http://www.domain.tld/subdirectory/canonicalized/
http://www.domain.tld/subdirectory/canonicalized
http://www.domain.tld/subdirectory/index.php/canonicalized/
http://www.domain.tld/subdirectory/index.php/canonicalized
http://www.domain.tld/subdirectory/index.php?p=77
http://www.domain.tld/subdirectory/?p=77

And so on.. Each of these non-canonical URLs is accessible by visitors and search engines, creating many potential sources of duplicate content. Duplicate content is undesirable as it may damage the search engine placement of the associated web page. This happens when the search engines index multiple versions of the same page and effectively divide the attributed link equity among the duplicates. Consolidating the worth of your pages by serving only a singular instance of each of them is beneficial to you, your visitors, and your search engine rankings.

Should you use this canonicalization method?

As previously mentioned, WordPress 2.3+ provides a built-in URL canonicalization technique via PHP. Thus, if you are using WP 2.3+, you technically don’t need to use this method. However, if you prefer to handle URL rewriting at the server level, then you may indeed benefit from its use.

For those of us running WP versions earlier than 2.3, WordPress includes no comprehensive canonicalization method. In fact, this technique was developed before WordPress 2.3, and is specifically designed to provide complete URL canonicalization for WordPress-powered sites. If this sounds like you, and you are concerned about duplicate content issues, I would definitely recommend implementing this solution.

Presenting Comprehensive URL Canonicalization for WordPress

Now that we’re all on the same page (so to speak), let’s have a look at the complete htaccess canonicalization code, as used here at Perishable Press:

# Comprehensive URL Canonicalization for WordPress
RedirectMatch permanent index\.php/(.*) http://perishablepress.com/press/$1

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.perishablepress\.com$ [NC]
RewriteRule ^(.*)$ http://perishablepress.com%{REQUEST_URI} [R=301,L]

RewriteBase /press/

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(html|php)$ http://perishablepress.com/press/$1 [R=301,L] 

RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /press/index.php [L]
</IfModule>

To verify the effectiveness of this method, try accessing a few non-canonical URLs for this site. For example, entering any of the non-canonical URL formats previously presented will return the official, canonical URL for any given page.

The Breakdown

Before presenting a generalized, copy-&-paste version of this htaccess canonicalization solution, let’s examine the functionality involved with each of its various directives:

Remove all instances of index.php from URLs
In the first line, we call upon RedirectMatch to match and remove all instances of index.php occurring within our permalink URLs.
Check and initialize the mod_rewrite module
In the second line, we check for the presence of the mod_rewrite module via an <IfModule mod_rewrite.c> container. Then, in the third line, we initialize the rewrite engine with RewriteEngine On.
Specify www prefix preference for all URLs
In the fourth and fifth lines, we effectively match and eliminate all instances of the www prefix from our URLs. With a simple modification we could ensure that the www prefix was included in our canonical URLs. For more information on this, check out our 2-for-1 special.
Specify the base directory for the rewrite rules
In the sixth line, we specify the base directory for the rewrite rules using the RewriteBase directive. For those of us familiar with the htaccess rules used for WordPress permalinks, this line should look familiar.
Remove all instances of index.php from the end of URLs
In the seventh and eighth lines, we are matching and removing all instances of index.php (or index.html) from the end of our permalink URLs.
Ensure a trailing slash at the end of all URLs
In the ninth and tenth lines, we ensure that all permalink URLs include a trailing slash.
Execute the remaining portion of the default permalink directives
The next three lines constitute the remainder of the default WordPress htaccess permalink rules. These should look familiar, and are required in order for WordPress to utilize its native permalink functionality.
Wrap it up by closing the IfModule container
Back in the second line, we opened an <IfModule mod_rewrite.c> container to check for the presence of the required mod_rewrite module. Now that we have finished specifying our complete set of rewrite directives, we close the container so that everything inside it is applied only when mod_rewrite is available.
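
A possible refinement of the trailing-slash step, offered as a sketch rather than a tested rule: as written, that step redirects any URL whose last path segment contains no dot, which includes POST requests and any extensionless file that actually exists on disk. The two extra conditions below are assumptions that are not part of the original formula; they limit the redirect to GET and HEAD requests for paths that do not map to a real file.

```apache
# Sketch only: the first two conditions are assumptions added for illustration
# Redirect only GET and HEAD requests, since a 301 typically drops a POST body
RewriteCond %{REQUEST_METHOD} ^(GET|HEAD)$
# Leave alone any request that maps to an existing file
RewriteCond %{REQUEST_FILENAME} !-f
# Original condition: the last path segment contains no dot
RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]
```

If used, these lines would take the place of the existing two-line trailing-slash block; the rest of the formula stays unchanged.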

Generalized Examples

Now that we have seen a site-specific version of this technique, and have gained a clearer picture as to its underlying functionality, let’s wrap things up with two generalized examples: one for root-installations of WordPress, and the other for subdirectory installations. After each example, I will describe the few edits required for proper implementation and usage.

WordPress installed in the ROOT directory

# Comprehensive URL Canonicalization for WordPress in Root
RedirectMatch permanent index\.php/(.*) http://domain.tld/$1

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld%{REQUEST_URI} [R=301,L]

RewriteBase /

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(html|php)$ http://domain.tld/$1 [R=301,L] 

RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

To use, replace your existing root htaccess WordPress permalink rules with this code. Then, replace all instances of “domain.tld” with your domain name. That’s it. Upload and test like crazy. If everything seems like it is working, you can have some fun by entering various non-canonical URL variations and watching it work.

WordPress installed in a SUB-directory

# Comprehensive URL Canonicalization for WordPress in Subdirectory
RedirectMatch permanent index\.php/(.*) http://domain.tld/subdirectory/$1

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld%{REQUEST_URI} [R=301,L]

RewriteBase /subdirectory/

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(html|php)$ http://domain.tld/subdirectory/$1 [R=301,L] 

RewriteCond %{REQUEST_URI} /+[^\.]+$
RewriteRule ^(.+[^/])$ %{REQUEST_URI}/ [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /subdirectory/index.php [L]
</IfModule>

To use, replace your existing subdirectory htaccess WordPress permalink rules with this code. Then, replace all instances of “domain.tld” with your domain name. Finally, replace all instances of “subdirectory” with the name of your own subdirectory. That’s it. Upload and test like spider pig. If everything seems like it is working, you can entertain the kids by entering various non-canonical URL variations and watching as they “magically” switch back to their correct locations.

Modifications & Variations

Canonical URL for root domain name

If you are running WordPress in a subdirectory, you may also want to ensure canonicalization of your site’s root URL. To do this, place the following code in your site’s root htaccess file:

# Enable the rewrite engine (harmless if your root htaccess file already does this)
RewriteEngine On

# Remove index.php from root URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php [NC]
RewriteRule ^index\.php$ http://domain.tld/ [R=301,L]

# Permanently redirect from www domain to non-www domain
RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld/$1 [R=301,L]

Once in place, this code will ensure that only the non-www, non-index.php versions of your site’s root domain URL are served. Remember, you only need this code if you are running WordPress in a directory other than root.

Serve www version instead of non-www version of URLs

If you prefer to use the www prefix in your URLs (as opposed to omitting it), simply replace these two lines:

RewriteCond %{HTTP_HOST} ^www\.domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld%{REQUEST_URI} [R=301,L]

..with these two lines:

RewriteCond %{HTTP_HOST} ^domain\.tld$ [NC]
RewriteRule ^(.*)$ http://www.domain.tld%{REQUEST_URI} [R=301,L]

..and, again, remember to replace all instances of “domain.tld” with your domain name.
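
For sites that serve several domains from one htaccess file, the hostname can also be captured rather than hard-coded. The following is a minimal sketch, not part of the original formula, and it assumes that every hostname reaching this file should lose its www prefix. Here, %1 refers to the group captured by the preceding RewriteCond.

```apache
# Sketch: strip the www prefix from whatever hostname was requested
# (assumes no hostname served by this file should keep its www prefix)
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^ http://%1%{REQUEST_URI} [R=301,L]
```

With this variant there is no “domain.tld” to replace in the www rule, although the other rules in the formula still need your domain name.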

Conclusion

The canonicalization solution presented in this article is both comprehensive and effective. As mentioned before, I have been using this technique on myriad sites since July of 2007 with great success. Since employing this method, I have virtually eliminated all opportunity for duplicate content to be served from my site. Of course, there are other duplicate content issues not associated with canonical URLs, but we will save that for another article :)

Hopefully, sharing these rules will open the doors to improvements. If you see something that could be improved, optimized, or otherwise enhanced, please share your wisdom with us. As an open-source community, we only benefit when everyone shares, and we all benefit when somebody does.

About the author

Jeff Starr is a web developer, graphic designer and content producer with over 10 years of experience and a passion for quality and detail. Jeff is co-author of the book Digging into WordPress and strives to help people be the best they can be on the Web.


45 Responses

Rick Beckman#1

Great tips; I’d use ‘em if I wasn’t already on WP 2.3. :P On a related note, would you be willing to help a reader out with a rewriting problem (see my URL)?

Canonicalized URLs are great, but when one wants to switch permalink structures (say, to make them more human-friendly or to increase keyword density or what-have-you), links to old style permalinks will break.

I’ve been trying to figure out how to rewrite date based permalinks to simple post title based permalinks (per my URL) for days now, to no avail. I assume it’s simple if one can get the regex correct. I’m hoping you can. (WordPress support forums have been absolutely no help in this matter.)

Greg#2

Interesting information!

Perishable#3

Rick,

As mentioned elsewhere, I will definitely look into an htaccess solution for rewriting permalinks to remove the associated year/month/date portion of the URL. In fact, I have been contemplating changing my permalink structure as well, although I am not yet convinced that the work would be worth the payoff. Who knows.. I guess there is only one way to find out, and now I have a good reason to look into it :)

Rick Beckman#4

If you can devise a solution, I’ll definitely be appreciative!

Best of luck!

Donace#5

@ Rick
not sure what version (think it was 2.3) my mate had running on his site and he had similar permalinks.

I was snooping at his backend and found that you could add a custom setting.

I set it up as /%category/%title and it ended by having url’s as follows:

Post
http://cvrle77.byethost13.com/funny-videos/light-sabercarefull

Category
http://cvrle77.byethost13.com/category/blondes-jokes

A few kinks as you can see by the second url but i hope to iron these out and will experiment later with htaccess but my knowledge is limited but i will let you know if i find the grail.

Rick Beckman#6

Regarding my problem of changing permalinks, I figured out how to do it, and I blogged it.

Lemme know if you can improve upon it. Essentially, it’s a one line addition to .htaccess before the WordPress-generated permalink code:

RedirectMatch permanent ^/[0-9]{4}/[0-9]{2}/[0-9]{2}/([a-z0-9\-/]*) http://rickbeckman.org/$1

Rick Beckman#7

In my previous comment, there’s a mistake in my code that I just discovered. Using it “As Is” will result in day archives (e.g., /2008/10/31/) being inaccessible; changing the asterisk (*) in the code to a plus sign (+) resolves. The proper code then would be

RedirectMatch permanent ^/[0-9]{4}/[0-9]{2}/[0-9]{2}/([a-z0-9\-/]+) http://rickbeckman.org/$1

Perishable#8

Remarkable work, Rick. Thanks for taking the time to share it with us. Also, your article explaining the technique is great, and I am now forced to resign my previously prepared version to the circular file.. - oh well :)

Seriously, I am this close to modifying the permalinks here at Perishable Press. Other than the reason(s) mentioned at your site, I am extremely concerned about retaining link equity and avoiding 404 errors.

Graham Smith#9

Watcha Mr J,
Been reading this article that you sent me. I did actually read it at the time it came out.

I’ll just explain why. My host doesn’t have mod-rewrite, so I was forced to use the ‘almost friendly permalinks’ with the ‘index.php’ which i felt was simply uncool.

I asked him, probably a little naively if there was anything that could be done with this.

He then came back with a server side solution, using some filters, some knowledge etc.

The result of that is the ‘index.php’ has now been dropped and I have yummy friendly permalinks, even though we can’t use mod-rewrite.

Is that cool?

Graham Smith
ImJustCreative.com
Blog & Web Ramblings from ‘my’ Gutter.

Perishable#10

Sounds great, Graham. I tend to forget that not everyone is using Apache software, so if you are on something else, like MS IIS for example, you will definitely need to devise an alternate approach. I just checked your site, and the permalinks are indeed working well. Congrats!

Sean#11

I love you. Great Post!

Perishable#12

Hi Sean :) Thanks for the love, it is reciprocated in full! Looking forward to following you on Twitter ;)

bill#13

The only issue with the automatic redirect that WordPress has added to “help with” SEO is that they have made it a 302 redirect when it should be a 301… so they are not helping.

I would suggest working with your htaccess file like this post says and not leaving it up to wordpress to give you the 302 redirect

Jeff Starr#14

There are other reasons to use the htaccess method as well, including the fact that WordPress fails to address certain types of canonicalization issues. For example, WordPress does not redirect permalink URLs containing the index.php string to their canonical address; the HTAccess method described in my article canonicalizes all instances of index.php-laced URLs.

Michelle#15

Hi, I’ve found your article very useful but still can’t do something that seems very simple. I hope you can help because I’ve spent hours and every variation I try doesn’t seem to work!

I have WP in the sub-directory /blog. My .htaccess file is in my root.

Having moved over from a REALLY dated Movable Type installation, my permalinks are in this format:
/blog/archives/123456.php (where 123456 is the postid MT assigned)

I’ve moved to WP (v2.5.1) and want to use this format:
/blog/archives/123456

I’m getting errors when I try to access the old permalinks with the php extension. What rewrite rule should I add to my .htaccess? Please, please help!

Michelle#16

Update - I found a solution that worked, but it didn’t use the rewrite rules.

RedirectMatch permanent /blog/archives/([0-9]{6})\.php$ http://domain.com/blog/archives/$1

Jeff Starr#17

Very nice! RedirectMatch is a powerful tool for handling sensitive redirects. Thanks for posting your solution; I am glad you resolved the issue! :)

federico#18

Hi guys…hope someone can help me…

I’ve got wordpress running in /htdocs/wordpress/
and I’d like to get rid of the “/wordpress” part in the displayed url “http://www.mysite/wordpress/”, turning it in “http://www.mysite.com/”

As i can’t edit the httpd.conf to change the DocumentRoot, I must solve the issue through the .htaccess file.
I’ve tried a couple of solutions but none worked: i get “Permission denied” or 404 errors.

Can anyone help?

tnx in advance.

F.

Jeff Starr#19

Hi federico: check out the first few paragraphs of this post:

Working with Multiple Themes Outside of the WordPress Installation Directory

..hopefully that will give you some ideas.

federico#20

Hi Jeff…
tnx so much…it worked perfectly…

c u

Federico

Deb Phillips#21

Jeff - I want to say “Thanks a million!” for this great detailed and easy-to-understand article. This has so simply solved several issues I’ve wanted to fix for my WordPress site. Thank you, thank you, thank you!

Deb Phillips#22

Help! I’m suddenly getting errors such as the following browser error (Safari) when clicking on all internal links (navigation, tags, categories):

Too many redirects occurred trying to open “http://lewisvillephotos.com/tag/shallowford-square”. This might occur if you open a page that is redirected to open another page which then is redirected to open the original page.

=====

I’m not sure how to use the above error info. I haven’t set a separate 301 redirect on the above-referenced tag or the other links I’ve just now checked.

Has the URL Canonicalization code caused this type of error, or is there something else going on? If so, can you suggest what might be occurring?

Thank you!

Deb Phillips#23

I failed to mention in my previous “Help!” post that I’ve temporarily resorted back to a previous version of my .htaccess file. So all links on my site should be working properly at the moment. Thanks.

Jeff Starr#24

Hi Deb, I haven’t encountered this issue before. There are several things to look at, however, that will help diagnose the issue. First, as you suggest, thoroughly investigate any additional redirects that may be coming from elsewhere on your server. Second, does the error message appear for all pages, and if so, for which non-canonical URLs is it happening? Lastly, which version of WordPress are you using? The code has not been tested on the latest versions, so the conflict could be there as well. Oh, and one more question.. are you experiencing the error in other browsers as well, or only Safari?

Deb Phillips#25

Hi, Steve —

I’m running WP 2.6 (not any of the subsequent increments).

I’ve looked at the existing redirects in the .htaccess file. They only reference specific pages and were not related to any of the random links I was checking.

However, looking at those redirects leads to another question. If I am able to successfully implement your URL Canonicalization code, since I’ve changed from a www to a non-www URL, should I also change the other existing redirect code lines to point to “http://lewisvillephotos.com” — and not leave the current references to “http://www.lewisvillephotos.com” in those redirects? Does that matter?

As far as testing any non-canonical pages, would that type of page be, for example, a comments page or a website search results page? Whether either of those is non-canonical, my only random checks, though, were for navigation pages, tags and categories, and other internal links referencing previous posts.

Since I resorted back to my old .htaccess code as soon as I realized there was a problem, I therefore can’t test any non-canonical pages until I try your code again. And I may try that later at night, when traffic tends to be lower.

Any other thoughts?

Thank you!

Jeff Starr#26

@Deb: Hmm.. I might be able to get a better handle on this issue if I could examine the contents of your htaccess file (the version in which you have implemented the canonicalization code). Send it to Jeff (my name) at this domain via regular email and I will take a look at it.

Jason#27

You have no idea how helpful this post was. THANK YOU!!

Proxy#28

I want to say “Thanks!” for this great detailed and easy-to-understand article.

get domain name#29

I was wondering how I’d handle changing from example.com/year/month/day/title permalinks to example.com/title style permalinks and I think I’ve got it.

I just need a rewrite rule to strip out the year/month/day/ part of permalinks

Steve Harold#30

I am still trying to get my brain around this permalink issue. At the moment my wordpress blog uses the number convention whereas I would love it to include keywords to help its ranking.

Jeff Starr#31

@Steve Harold: It looks like you are still using WordPress’ default URL structure. To get the post title to appear as keywords in the URL, enable permalinks via the WordPress Admin.

Simone#32

I had issues with permalinks in the beginning as well, but it is worth solving them once.

Steve Harold#33

hi Jeff
Thanks for the tip I will persist with the permalinks challenge as Simone says.

mike#34

wow..really complicated! is there a book somewhere I can pick up to get more information?

Jeff Starr#35

@mike: It’s really not that complicated if you take the time to read through carefully. I haven’t seen a book on this topic per se, but you may be able to find some “absolute beginner” articles that will help you with the basics. Good luck! :)

Instant Free People Search Guy#36

Interesting information. I just made the move from the blogger platform to WP - lots to digest. This post will help my progress - thanks.
