Bulletproof Sitemap Redirects via .htaccess
Sitemaps have been shown to help search engines and other visitors understand and navigate your website. This tutorial gives you a simple yet powerful .htaccess technique for ensuring that search engines and other visitors can easily find your sitemap files. So even if they are looking for your sitemap in the wrong location, they’ll always be redirected to the actual, existing sitemap for your site. This strategy helps to improve consistency, minimize 404 errors, and save server resources. So it’s good for performance and SEO.
Help visitors find your sitemap
Typically, your sitemap is located in the root directory of your website, for example:
https://perishablepress.com/sitemap.xml
In general, sitemaps are served in XML format, but they also come in a compressed flavor, typically gzip (GZ). So for most sites, it is common to provide two versions of your sitemap: one with an .xml extension, and another with an .xml.gz extension.
https://perishablepress.com/sitemap.xml
https://perishablepress.com/sitemap.xml.gz
Yet regardless of where it’s located or how it’s formatted, your sitemap is useless unless search engines and other visitors can find it. Thus, it is highly recommended that your site include a robots.txt file that includes the location of your sitemap. Something like this added to robots.txt works great:
Sitemap: https://perishablepress.com/sitemap.xml
For compliant search engines that know what they’re doing, this simple robots declaration is all that’s needed. The spiders will hit your robots.txt file before the crawl, locate your sitemap, and continue to crawl your site accordingly. Sounds easy, right?
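To see roughly what a compliant crawler does with that declaration, here is a minimal Python sketch that pulls the Sitemap line out of a robots.txt body using the standard library's `urllib.robotparser`. The `User-agent` and `Disallow` lines in the sample are assumptions added for illustration, and `find_sitemaps` is a hypothetical helper name, not something from this tutorial.

```python
from urllib.robotparser import RobotFileParser


def find_sitemaps(robots_txt: str) -> list[str]:
    """Return the Sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present (Python 3.8+)
    return parser.site_maps() or []


# Sample robots.txt content; only the Sitemap line comes from the tutorial
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://perishablepress.com/sitemap.xml
"""

print(find_sitemaps(ROBOTS_TXT))
```

A crawler that skips this step has no reliable way to know where the sitemap lives, which is where the guessing described next comes from.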
Unfortunately, things don’t always go according to plan. The spiders and scripts crawling your site may not recognize or obey your robots.txt directives. In my experience, very few actually do. Bad bots and malicious scripts will relentlessly pound your server looking for the elusive sitemap in myriad subdirectories, hoping to detect a vulnerability. And if you happen to keep your sitemap in an unconventional location, the confusion can get real ugly. Here are some examples of sitemap URLs requested by “lost” bots that are scanning for vulnerabilities (or just being stupid):
http://example.com/something/random_sitemap.htm
http://example.com/another/guess/whatever_sitemap.html
http://example.com/just/cant/find/the/elusive_sitemap.xml
.
.
.
If you examine your site’s access and error logs, you may find all sorts of 404 “Not Found” errors for these sorts of requests. This type of apparently random scanning for “secret” sitemaps happens constantly around the Web, wasting valuable server resources like memory and bandwidth. It’s the sort of malicious activity that’s a real nuisance for anyone paying attention. Fortunately there is no need to tolerate such lunacy.
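If you want a quick count of how often this happens on your own site, a small script can tally the 404s for sitemap-like paths. The sketch below assumes your access log uses the Common or Combined Log Format; the sample lines, IP addresses, and the `count_sitemap_404s` name are made up for illustration.

```python
import re
from collections import Counter

# Pulls the request path and status code out of a Common/Combined Log Format line
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_sitemap_404s(lines):
    """Count 404 responses for request paths that mention "sitemap"."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m["status"] == "404" and "sitemap" in m["path"].lower():
            hits[m["path"]] += 1
    return hits


# Hypothetical log lines, for illustration only
SAMPLE = [
    '203.0.113.5 - - [10/Mar/2016:10:00:00 +0000] "GET /something/random_sitemap.htm HTTP/1.1" 404 512',
    '203.0.113.5 - - [10/Mar/2016:10:00:01 +0000] "GET /something/random_sitemap.htm HTTP/1.1" 404 512',
    '198.51.100.7 - - [10/Mar/2016:10:00:02 +0000] "GET /sitemap.xml HTTP/1.1" 200 2048',
    '198.51.100.7 - - [10/Mar/2016:10:00:03 +0000] "GET /about/ HTTP/1.1" 404 512',
]

for path, count in count_sitemap_404s(SAMPLE).most_common():
    print(count, path)
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the log path depends on your server setup.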
.htaccess to the rescue
We can make absolutely certain that our sitemaps are always found by anyone or anything that is requesting them from any location on our site. All you need is the ability to create and/or edit your site’s root .htaccess file (or server configuration file). If this is possible, you’re in business. Here are the requirements for the “bulletproof” sitemap technique provided in this tutorial:
- Apache server (any version) with .htaccess and mod_alias enabled
- Sitemap(s) located in the site’s root directory (e.g., example.com/sitemap.xml)
- Sitemap(s) served only in XML and g-zip format
These requirements cover most setups. For example, many WordPress sites use a plugin to automatically generate their sitemap(s) in two formats: XML and g-zip, exactly what’s required for the bulletproof technique to work properly. Here at Perishable Press, for instance, I use the popular Google XML Sitemaps plugin, which generates the following set of sitemaps (and sub-sitemaps):
/sitemap.xml
/sitemap.xml.gz
/sitemap-pt-post-2016-03.xml
/sitemap-pt-post-2016-03.xml.gz
.
.
.
So in the site root directory is my main sitemap, which includes lots of “sub-sitemaps”. Pretty sure this is the most common (and recommended) structure for sitemaps, but let me know if I’m sorely mistaken about this. Hopefully this scenario covers your own setup; if unsure, you can examine your sitemap(s) and verify accordingly.
So if that sounds like you, make a quick backup of your site’s root .htaccess file and get ready for the magic bullet.
Bulletproof sitemap redirects
The actual implementation of this redirect technique couldn’t be easier. Simply include the following .htaccess directive in your site’s root .htaccess file:
# Bulletproof sitemap redirects
<IfModule mod_alias.c>
RedirectMatch 301 (?i)(?<!^)/(.*)?sitemap(.*)?\.(htm|html|xml)(\.gz)? /sitemap.xml$4
</IfModule>
Then save changes, upload to the server, and you’re good to go. No modifications are required — strictly plug-&-play. Here is how this code works:
- Checks if mod_alias is available
- Sets the regex to case-insensitive
- Skips the redirect if at site root
- Matches any request that includes “sitemap” followed by “.htm”, “.html”, or “.xml”
- Optionally matches if the request is appended with “.gz”
- If the request fits the conditions, it is redirected to the root sitemap (either XML or g-zip version)
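The steps above can be tried out offline before touching the server. The following Python sketch is an approximation, not Apache itself: it copies the pattern and target from the directive and assumes Python's `re` engine treats them the same way Apache's PCRE does (both are backtracking engines that support this syntax). The `redirect_target` helper is a hypothetical name.

```python
import re

# Pattern copied from the RedirectMatch directive above
PATTERN = re.compile(r"(?i)(?<!^)/(.*)?sitemap(.*)?\.(htm|html|xml)(\.gz)?")


def redirect_target(path):
    """Return the redirect target for a URL path, or None if the rule does not apply."""
    # RedirectMatch is unanchored here, so search anywhere in the path
    m = PATTERN.search(path)
    if m is None:
        return None
    # $4 is ".gz" when the optional group matched, otherwise empty
    return "/sitemap.xml" + (m.group(4) or "")


for path in (
    "/sitemap.xml",                   # at site root: left alone
    "/sitemap-pt-post-2016-03.xml",   # root-level sub-sitemap: left alone
    "/random/sitemap.htm",
    "/random/random_sitemap.xml.gz",
    "/a/b/RANDOM_SITEMAP_random.XML",
):
    print(path, "->", redirect_target(path))
```

One thing this emulation surfaces: because `htm` is listed before `html` in the alternation, a request ending in `.html.gz` matches as `.htm` and lands on /sitemap.xml rather than /sitemap.xml.gz. The request is still redirected to a real sitemap either way, and the same probably holds on the server, but it is worth confirming there.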
Looks simple, but this directive is literally years in the making. I’ve been fine-tuning the technique since, oh, back in 2008, after I wrote Redirect All Requests for a Nonexistent File to the Actual File, tweaking things a little bit with each iteration of my .htaccess file, until finally now it’s perfect and ready for public consumption :)
Test before going live
After implementing this technique, you can (and should) verify that everything is working properly by requesting the following URLs:
http://example.com/sitemap.xml
http://example.com/sitemap.xml.gz
http://example.com/random/sitemap.htm
http://example.com/random/sitemap.html
http://example.com/random/sitemap.xml
http://example.com/random/sitemap.htm.gz
http://example.com/random/sitemap.html.gz
http://example.com/random/sitemap.xml.gz
http://example.com/random/random_sitemap.htm
http://example.com/random/random_sitemap.html
http://example.com/random/random_sitemap.xml
http://example.com/random/random_sitemap.htm.gz
http://example.com/random/random_sitemap.html.gz
http://example.com/random/random_sitemap.xml.gz
http://example.com/random/random_sitemap_random.htm
http://example.com/random/random_sitemap_random.html
http://example.com/random/random_sitemap_random.xml
http://example.com/random/random_sitemap_random.htm.gz
http://example.com/random/random_sitemap_random.html.gz
http://example.com/random/random_sitemap_random.xml.gz
With these examples, you can replace each instance of “random” with any string. You can also prepend more directories to each path. Go ahead and try to break it: you can’t because the code is bulletproof (insert maniacal laughter). Also, remember to edit “example.com” to match your own domain. Test until satisfied ;)
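Requesting twenty URLs by hand gets tedious, so here is a rough Python sketch that rebuilds the list above for any domain and checks one URL at a time without following the redirect. Both function names are hypothetical, and `check_redirect` makes a real network request, so only point it at a site you control.

```python
import http.client
from urllib.parse import urlsplit


def build_test_urls(domain="example.com"):
    """Rebuild the 20 test URLs listed above for a given domain."""
    base = f"http://{domain}"
    urls = [f"{base}/sitemap.xml", f"{base}/sitemap.xml.gz"]
    for stem in ("sitemap", "random_sitemap", "random_sitemap_random"):
        for gz in ("", ".gz"):
            for ext in ("htm", "html", "xml"):
                urls.append(f"{base}/random/{stem}.{ext}{gz}")
    return urls


def check_redirect(url, timeout=10):
    """Send one HEAD request without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


# Listing the URLs makes no network requests
print(len(build_test_urls()), "URLs to test")
```

To run the live check, loop over `build_test_urls("yourdomain.com")` and call `check_redirect` on each URL. The two root URLs should come back without a redirect, and the rest should return a 301 with a Location header pointing at the root sitemap.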
Update: WP Sitemaps
As explained in these posts, WordPress 5.5 and beyond features built-in sitemaps that are enabled by default. And because WordPress handles redirection for “near-miss” URL requests, it should redirect all sitemap requests to the correct location, with no extra .htaccess necessary. At least, that’s how it should work; in practice your results may vary.
For example, I did some testing on my own sites, to see if WordPress was automatically redirecting near-miss requests to the correct sitemaps. For most of the sites tested, the redirection is working fine. For this site however, sitemap redirects were not working as expected.
So if running WordPress 5.5 or better, check if your site is redirecting sitemap requests properly. It should be working fine. If not, here is an updated bulletproof technique that you can add to your site’s root/public .htaccess file:
RedirectMatch 301 (?i)^/sitemap(.*)\.(html|xml(\.gz)?)/? /wp-sitemap.xml
That code redirects all sitemap requests to the new WordPress Sitemaps. As always, test thoroughly before going live.
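As with the first rule, the WordPress variant can be sanity-checked offline. This Python sketch is an approximation under the same assumption as before (that Python's `re` behaves like Apache's regex engine for this pattern). The pattern is copied from the rule above with the dot before the extension group written as `\.`, and `wp_redirect_target` is a hypothetical helper name.

```python
import re

# Pattern copied from the rule above, with the extension dot written as \.
WP_PATTERN = re.compile(r"(?i)^/sitemap(.*)\.(html|xml(\.gz)?)/?")


def wp_redirect_target(path):
    """Return "/wp-sitemap.xml" if the rule matches the URL path, else None."""
    return "/wp-sitemap.xml" if WP_PATTERN.search(path) else None


for path in (
    "/sitemap.xml",
    "/sitemap.xml.gz",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",       # the target itself: must not match, or it would loop
    "/random/sitemap.xml",   # not at the root: not matched by this rule
):
    print(path, "->", wp_redirect_target(path))
```

Note that this pattern is anchored with `^/sitemap`, so in this emulation it only catches requests at the site root. Sitemap guesses in subdirectories are left for WordPress itself to handle.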
2 responses to “Bulletproof Sitemap Redirects via .htaccess”
Dear Jeff,
I’m facing a problem that may be related to .htaccess, and I’ve read your posts trying to decipher whether that’s the problem.
The permalinks of many images on my website are not the same as the file URLs, and it’s creating many 404 errors. I’ve checked the sitemap and I don’t understand why this is happening.
Example:
Permalink:
http://www.globalaircrafts.com/globalaircrafts/aircrafts/2011-agusta-a109s-grand/agusta-a109s-gra…aft-venda-sale-8/
FILE URL:
http://www.globalaircrafts.com/globalaircrafts/wp-content/uploads/2017/06/AGUSTA-A109S-GRAND-HELICOPTER-TURBINE-GLOBAL-AIRCRAFT-VENDA-SALE-8.jpg
Any ideas how can I fix it?
Thanks
Ps: I’m not very familiar with coding.. But I guess you already noticed that :)
Hi Monique, It looks like something is interfering with normal functionality. My best advice would be to troubleshoot your plugins and theme to determine if there is any issue. Also investigate any .htaccess rules that you may have in place, to see if any of them affect related URLs.