Bulletproof Sitemap Redirects via .htaccess

Sitemaps help search engines and other visitors understand and navigate your website. This tutorial gives you a simple yet powerful .htaccess technique for ensuring that search engines and other visitors can easily find your sitemap files. Even if they look for your sitemap in the wrong location, they’ll be redirected to the actual, existing sitemap for your site. This strategy helps to improve consistency, minimize 404 errors, and save server resources, so it’s good for both performance and SEO.

Help visitors find your sitemap

Typically, your sitemap is located in the root directory of your website, for example:

https://perishablepress.com/sitemap.xml

In general, sitemaps are served in XML format, but they are also often served compressed, typically in GZ (g-zip) format. So it is common for a site to provide two versions of its sitemap: one with an .xml extension, and another with an .xml.gz extension:

https://perishablepress.com/sitemap.xml
https://perishablepress.com/sitemap.xml.gz

Yet regardless of where it’s located or how it’s formatted, your sitemap is useless unless search engines and other visitors can find it. Thus, it is highly recommended that your site provide a robots.txt file that declares the location of your sitemap. Something like this added to robots.txt works great:

Sitemap: https://perishablepress.com/sitemap.xml

For compliant search engines that know what they’re doing, this simple robots declaration is all that’s needed. The spiders will hit your robots.txt file before the crawl, locate your sitemap, and continue to crawl your site accordingly. Sounds easy, right?
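To see what a compliant crawler does with that declaration, here is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt body is a made-up example (only the Sitemap line comes from this article), and `find_sitemaps` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; only the Sitemap line is taken from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://perishablepress.com/sitemap.xml
"""

def find_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present (Python 3.8+).
    return parser.site_maps() or []

print(find_sitemaps(ROBOTS_TXT))  # ['https://perishablepress.com/sitemap.xml']
```

A crawler that skips this step has to guess at sitemap locations instead.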

Unfortunately, things don’t always go according to plan. The spiders and scripts crawling your site may not recognize or obey your robots.txt directives. In my experience, very few actually do. Bad bots and malicious scripts will relentlessly pound your server looking for the elusive sitemap in myriad subdirectories, hoping to detect a vulnerability. And if you happen to keep your sitemap in an unconventional location, the confusion can get really ugly. Here are some examples of sitemap URLs requested by “lost” bots that are scanning for vulnerabilities (or just being stupid):

http://example.com/something/random_sitemap.htm
http://example.com/another/guess/whatever_sitemap.html
http://example.com/just/cant/find/the/elusive_sitemap.xml
.
.
.

If you examine your site’s access and error logs, you may find all sorts of 404 “Not Found” errors for these sorts of requests. This type of apparently random scanning for “secret” sitemaps happens constantly around the Web, wasting valuable server resources like memory and bandwidth. It’s the sort of malicious activity that’s a real nuisance for anyone paying attention. Fortunately, there is no need to tolerate such lunacy.
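If you want a rough idea of how often this happens on your own site, a short script can tally the misses. This is only a sketch: it assumes Apache's Common or Combined Log Format, the sample log lines are invented for illustration, and `count_sitemap_404s` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumes Common/Combined Log Format; adjust the pattern for other formats.
LOG_LINE = re.compile(r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def count_sitemap_404s(lines):
    """Count 404 responses whose requested path mentions "sitemap"."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        path = match.group("path")
        if match.group("status") == "404" and "sitemap" in path.lower():
            hits[path] += 1
    return hits

# Invented sample entries, for illustration only.
sample = [
    '203.0.113.5 - - [10/Mar/2016:13:55:36 +0000] "GET /something/random_sitemap.htm HTTP/1.1" 404 512',
    '203.0.113.5 - - [10/Mar/2016:13:55:37 +0000] "GET /sitemap.xml HTTP/1.1" 200 2048',
    '198.51.100.7 - - [10/Mar/2016:13:56:01 +0000] "GET /something/random_sitemap.htm HTTP/1.1" 404 512',
]

for path, count in count_sitemap_404s(sample).most_common():
    print(count, path)
```

To run it against a real log, pass an open file object instead of the sample list.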

.htaccess to the rescue

We can make certain that our sitemaps are always found by anyone or anything requesting them from any location on our site. All you need is the ability to create and/or edit your site’s root .htaccess file (or server configuration file). If this is possible, you’re in business. Here are the requirements for the “bulletproof” sitemap technique provided in this tutorial:

  • Apache 2.x server with .htaccess and mod_alias enabled (the pattern uses PCRE syntax)
  • Sitemap(s) located in the site’s root directory (e.g., example.com/sitemap.xml)
  • Sitemap(s) served only in XML and g-zip format

These requirements cover most setups. Many WordPress sites, for instance, use a plugin to automatically generate their sitemap(s) in two formats, XML and g-zip, which is exactly what’s required for the bulletproof technique to work properly. Here at Perishable Press, I use the popular Google XML Sitemaps plugin, which generates the following set of sitemaps (and sub-sitemaps):

/sitemap.xml
/sitemap.xml.gz

/sitemap-pt-post-2016-03.xml
/sitemap-pt-post-2016-03.xml.gz
.
.
.

So in the site root directory is my main sitemap, which includes lots of “sub-sitemaps”. Pretty sure this is the most common (and recommended) structure for sitemaps, but let me know if I’m sorely mistaken about this. Hopefully this scenario covers your own setup; if unsure, you can examine your sitemap(s) and verify accordingly.

So if that sounds like you, make a quick backup of your site’s root .htaccess file and get ready for the magic bullet.

Bulletproof sitemap redirects

The actual implementation of this redirect technique couldn’t be easier. Simply include the following .htaccess directive in your site’s root .htaccess file:

# Bulletproof sitemap redirects
<IfModule mod_alias.c>
	RedirectMatch 301 (?i)(?<!^)/(.*)?sitemap(.*)?\.(html|htm|xml)(\.gz)? /sitemap.xml$4
</IfModule>

Then save changes, upload to the server, and you’re good to go. No modifications are required — strictly plug-&-play. Here is how this code works:

  1. Checks if mod_alias is available
  2. Sets the regex to case-insensitive
  3. Skips the redirect if at site root
  4. Matches any request that includes “sitemap” followed by “.htm”, “.html”, or “.xml”
  5. Optionally matches if the request is appended with “.gz”
  6. If the request fits the conditions, it is redirected to the root sitemap (either XML or g-zip version)
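The steps above can be tried offline with a small Python simulation before touching a live server. This is a sketch, not Apache itself: it assumes Python's `re` module treats this pattern the same way Apache's PCRE engine does, and `redirect_target` is a hypothetical helper name. The pattern is the one from the directive, with `html` tried before `htm` in the alternation so that a trailing `.gz` on an `.html.gz` request is captured into `$4`.

```python
import re

# Regex portion of the RedirectMatch directive, with "html" listed before "htm".
PATTERN = re.compile(r"(?i)(?<!^)/(.*)?sitemap(.*)?\.(html|htm|xml)(\.gz)?")

def redirect_target(path):
    """Return the redirect target for a URL path, or None if no redirect applies."""
    match = PATTERN.search(path)
    if match is None:
        return None
    # $4 is the optional ".gz" suffix; an unmatched group is treated as empty here.
    return "/sitemap.xml" + (match.group(4) or "")

for path in [
    "/sitemap.xml",                     # at site root: left alone
    "/sitemap-pt-post-2016-03.xml",     # sub-sitemap at root: left alone
    "/random/sitemap.htm",
    "/random/sitemap.html.gz",
    "/Random/Random_SITEMAP_random.XML",
    "/about/contact.html",              # no "sitemap" in the path: left alone
]:
    print(path, "->", redirect_target(path))
```

Requests at the site root never match because the only slash in such a path is the first character, which the `(?<!^)` lookbehind rules out.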

Looks simple, but this directive has literally been years in the making. I’ve been fine-tuning the technique since back in 2008, after I wrote Redirect All Requests for a Nonexistent File to the Actual File, tweaking things a little bit with each iteration of my .htaccess file until finally it’s perfect and ready for public consumption :)

Test before going live

After implementing this technique, you can (and should) verify that everything is working properly by requesting the following URLs:

http://example.com/sitemap.xml
http://example.com/sitemap.xml.gz

http://example.com/random/sitemap.htm
http://example.com/random/sitemap.html
http://example.com/random/sitemap.xml

http://example.com/random/sitemap.htm.gz
http://example.com/random/sitemap.html.gz
http://example.com/random/sitemap.xml.gz

http://example.com/random/random_sitemap.htm
http://example.com/random/random_sitemap.html
http://example.com/random/random_sitemap.xml

http://example.com/random/random_sitemap.htm.gz
http://example.com/random/random_sitemap.html.gz
http://example.com/random/random_sitemap.xml.gz

http://example.com/random/random_sitemap_random.htm
http://example.com/random/random_sitemap_random.html
http://example.com/random/random_sitemap_random.xml

http://example.com/random/random_sitemap_random.htm.gz
http://example.com/random/random_sitemap_random.html.gz
http://example.com/random/random_sitemap_random.xml.gz

With these examples, you can replace each instance of “random” with any string. You can also prepend more directories to each path. Go ahead and try to break it: you can’t because the code is bulletproof (insert maniacal laughter). Also, remember to replace “example.com” with your own domain. Test until satisfied ;)
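Requesting twenty URLs by hand gets tedious, so here is a hedged Python sketch that rebuilds the list above and checks each response without following redirects. The helper names `build_test_urls` and `fetch_redirect` are hypothetical, and the script only uses the standard library.

```python
import http.client
from urllib.parse import urlsplit

NAMES = ["sitemap", "random_sitemap", "random_sitemap_random"]
EXTENSIONS = [".htm", ".html", ".xml"]

def build_test_urls(base="http://example.com", word="random"):
    """Rebuild the test URL list, swapping "random" for any string."""
    urls = [f"{base}/sitemap.xml", f"{base}/sitemap.xml.gz"]
    for name in NAMES:
        for suffix in ("", ".gz"):
            for ext in EXTENSIONS:
                path = f"/random/{name}{ext}{suffix}".replace("random", word)
                urls.append(base + path)
    return urls

def fetch_redirect(url, timeout=10):
    """Send a HEAD request without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

# Example usage against your own site (makes real network requests):
# for url in build_test_urls("https://your-domain.example"):
#     print(url, fetch_redirect(url))
```

If the directive is working, the two root URLs should be served directly, and every other URL should return a 301 with a Location header pointing at the root sitemap.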