Better Robots.txt Rules for WordPress
Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.
This post summarizes my research and gives you a near-perfect robots.txt file, so you can copy/paste completely “as-is”, or use it as a template for your own customization.
Robots.txt in 30 seconds
Primarily, robots directives disallow obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al know where they can go when visiting your site. You can also do nifty stuff like instruct specific user-agents and declare sitemaps. For just a simple text file, robots.txt wields considerable power. And we want to use whatever power we can get to our greatest advantage.
Better robots.txt for WordPress
Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. With that in mind, here are the new and improved robots.txt rules for WordPress:
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Only one small edit is required: change the Sitemap URL to match the location of your sitemap (or remove the line if no sitemap is available).
I use this exact code on nearly all of my major sites. It’s also fine to customize the rules, say if you need to exclude any custom directories and/or files, based on your actual site structure and SEO strategy.
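As a quick sanity check, you can run paths through Python’s standard-library robots parser to confirm what compliant bots may crawl. One caveat: urllib.robotparser resolves rules in file order (first match wins) rather than by Google’s longest-match precedence, so it won’t honor the admin-ajax.php exception the way googlebot does; it’s still handy for verifying the plain Disallow lines. The example paths below are hypothetical.

```python
from urllib import robotparser

# The "better robots.txt" rules from above
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "/wp-admin/"))          # False: core admin is blocked
print(rp.can_fetch("*", "/feed/"))              # False: feeds are blocked
print(rp.can_fetch("*", "/2023/hello-world/"))  # True: ordinary posts stay crawlable
```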
Usage
To add the robots rules code to your WordPress-powered site, just copy/paste the code into a blank file named robots.txt. Then add the file to your web-accessible root directory, for example:
https://perishablepress.com/robots.txt
If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice an additional robots directive that forbids crawl access to the site’s blackhole for bad bots. Let’s have a look:
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Disallow: /blackhole/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://perishablepress.com/wp-sitemap.xml
Spiders don’t need to be crawling around anything in /wp-admin/, so that’s disallowed. Likewise, trackbacks, xmlrpc, and feeds don’t need to be crawled, so we disallow those as well. Also, notice that we add an explicit Allow directive that allows access to the WordPress Ajax file, so crawlers and bots have access to any Ajax-generated content. Lastly, we make sure to declare the location of our sitemap, just to make it official.
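The reason the Allow line can coexist with the broader Disallow is precedence: Google resolves conflicting rules by the most specific (longest) matching path. Here’s a minimal sketch of that longest-match logic (simplified: it ignores wildcards and Google’s allow-wins tie-break; paths are hypothetical):

```python
def most_specific_verdict(rules, path):
    """Return True if crawling is allowed under longest-match precedence.

    rules: list of (directive, prefix) pairs, e.g. ("disallow", "/wp-admin/").
    The longest matching prefix wins; no match at all means allowed.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if path.startswith(prefix) and len(prefix) > best_len:
            best_len, allowed = len(prefix), (directive == "allow")
    return allowed

rules = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(most_specific_verdict(rules, "/wp-admin/admin-ajax.php"))  # True: the Allow rule is longer
print(most_specific_verdict(rules, "/wp-admin/options.php"))     # False: only the Disallow matches
```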
Notes & Updates
Update! The following directives have been removed from the tried and true robots.txt rules in order to appease Google’s new requirement that googlebot always be allowed complete crawl access to any publicly available file.
Disallow: /wp-content/
Disallow: /wp-includes/
Because /wp-content/ and /wp-includes/ include some publicly accessible CSS and JavaScript files, it’s recommended to simply allow googlebot complete access to both directories at all times. Otherwise you’ll be spending valuable time chasing structural and file-name changes in WordPress, trying to keep them synchronized with some elaborate set of robots rules. It’s just easier to allow open access to these directories. Thus the two directives above were removed permanently from robots.txt, and are not recommended in general.
Apparently Google is so hardcore about this new requirement [1] that they actually are penalizing sites (a LOT) for non-compliance [2]. Bad news for hundreds of thousands of site owners who have better things to do than keep up with Google’s constant, often arbitrary changes.
- [1] Google demands complete access to all publicly accessible files.
- [2] Note that it may be acceptable to disallow bot access to /wp-content/ and /wp-includes/ for other (non-Google) bots. Do your research though, before making any assumptions.
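If you do want to keep non-Google bots out of those directories while satisfying Google, robots.txt lets you declare a separate group for googlebot. Compliant bots obey only the most specific matching user-agent group, so this arrangement (shown only as a sketch, not a recommendation) gives Googlebot full access while restricting everyone else:

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /wp-content/
Disallow: /wp-includes/
```

An empty Disallow value means “nothing is disallowed”, so Googlebot ignores the general group entirely and crawls freely.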
Previously on robots.txt..
As mentioned, my previous robots.txt file went unchanged for several years (which just vanished in the blink of an eye). The previous rules proved quite effective, especially with compliant spiders like googlebot. Unfortunately, they contain language that only a few of the bigger search engines understand (and thus obey). Consider the following robots rules, which were used here at Perishable Press way back in the day.
User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: https://perishablepress.com/sitemap.xml
User-agent: ia_archiver
Disallow: /
Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (dollar sign $) is probably not well-supported either, although Google certainly gets it.
These patterns may be better supported in the future, but going forward there is no reason to include them. As seen in the “better robots” rules (above), the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.
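To picture how those two symbols behave where they are supported: `*` matches any run of characters, and a trailing `$` anchors the end of the URL. One way to see the semantics is as a regex translation (a rough sketch of the general behavior, not any particular engine’s implementation; the test paths are hypothetical):

```python
import re

def robots_pattern_to_regex(pattern):
    # '*' becomes '.*' (any run of characters); a trailing '$' anchors the URL's end
    regex = re.escape(pattern).replace(r'\*', '.*')
    if regex.endswith(r'\$'):
        regex = regex[:-2] + '$'
    return re.compile('^' + regex)

# Disallow: /*/*.js$  -- block .js files, but only when ".js" ends the URL
js_rule = robots_pattern_to_regex('/*/*.js$')

print(bool(js_rule.match('/press/wp-content/script.js')))  # True: ends in .js
print(bool(js_rule.match('/press/script.js?ver=2')))       # False: a query string follows
```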
106 responses to “Better Robots.txt Rules for WordPress”
Jeff
What if Google is indexing my former temp url? I had a domain transfer over. We were working on the site on this temp url, but then the host transferred over the url, and now Google is occasionally dropping it from searches. I was wondering if I could disallow them from crawling that old url, and if I did that, would it tell Google to only pay attention to the new url?
Yes, disallowing a URL should be enough to get Google to stop crawling it, but if it’s already been crawled it could remain in the index for a while. I would recommend redirecting the old URL to the new URL via .htaccess. That way, Google et al do not have a choice in the matter :)
Jeff, how would you suggest the code be done for the .htaccess? See, my hosting tells me that all their hosting accounts also have temp URLs associated with them. We bought our URL from GoDaddy and had it released over to Bluehost. In the meantime, we were building the site on its temp URL. When Go released it, Blue said everything had been pointed in the right direction. We are disappearing from search results like daily at 1:30pm EST on our site’s title, then reappearing at 7pm the same night. Yet Google is still delivering search results on that old temp URL, but when you click on them in Google you get a 404. I am just lost as to what to do?
The first thing I would do is get everything straightened out with the hosts that are involved.. then once they’ve done their part, .htaccess can be used to redirect traffic from one URL to another, like so:
RedirectMatch 301 /old-url/ http://example.com/new-url/
It can be tricky setting up efficient redirects when more than one URL is involved, but if you let me know two of the URLs that are involved – one old URL and one new URL – I can try to help with a more specific .htaccess example.
Hi Jeff, I’m really hoping you can help me. We just changed over from Google XML Sitemaps to Yoast’s Sitemap and I’m having a major problem.
I’m getting over 2,000 warnings. It says “Sitemap contains URLs which are blocked by robots.txt” and it refers to this: wp-content/uploads, which is not blocked by the robots.txt file.
Can you offer any advice on how to fix it? I would greatly appreciate it, Thank you!
Here’s what I have in my robots.txt file:
User-agent: *
Disallow: /feed/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: ?wptheme=
Disallow: /unpublished/
Allow: /tag/mint/
Allow: /tag/feed/
Allow: /wp-content/online/
Allow: /wp-content/uploads/
Sitemap: http://www.singleinstilettos.com/sitemap_index.xml
Sitemap: http://www.singleinstilettos.com/post-sitemap.xml
Sitemap: http://www.singleinstilettos.com/page-sitemap.xml
Sitemap: http://www.singleinstilettos.com/attachment-sitemap.xml
User-agent: ia_archiver
Disallow: /
The first thing I would do is verify that warning by using an online robots.txt checker. Here is a good one:
http://www.searchenginepromotionhelp.com/m/robots-text-tester/
Go there, enter the entire robots.txt contents, and then enter some different URLs to check.
I’m thinking that the warning is because the explicit “Allow” line isn’t recognized..
Thanks so much, Jeff! I really appreciate all your help…you’re the best! :)
Boy
That’s great. Thanks.
So here’s our .hta
# BEGIN WordPress
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
# END WordPress
our temp url was:
http://69.89.31.128/~uglychr1/brony
and recently I found this below along with three others on google search, while my normal url is gone
http://69.89.31.128/~uglychr1/brony/product/my-little-pony-rainbow-dash-cutie-mens-sky-blue-costume-hoodie-sweatshirt-2/
our current url is brony.com
thanks for your help.
Yes, so the redirect from the temp URL to the current URL should look something like this:
RedirectMatch 301 /~uglychr1/(.*) http://brony.com/$1
Note that this may need fine-tuning depending on your URL structure, but hopefully will open the door to a potential solution.
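For reference, RedirectMatch works by testing the request path against a regular expression and substituting any captured groups into the target URL. A rough Python sketch of what the rule above does (the product path is hypothetical):

```python
import re

# Approximates: RedirectMatch 301 /~uglychr1/(.*) http://brony.com/$1
pattern = re.compile(r'/~uglychr1/(.*)')

def redirect_target(path):
    # Return the redirect destination, or None if the rule does not apply
    m = pattern.search(path)
    return 'http://brony.com/' + m.group(1) if m else None

print(redirect_target('/~uglychr1/brony/product/some-item/'))
# → http://brony.com/brony/product/some-item/
print(redirect_target('/unrelated/page/'))  # None: rule does not apply
```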
cool, so where does that line go in the .hta code at?
It should work when placed anywhere, but it’s good practice to add new techniques to .htaccess from top to bottom, so adding it to the end of the file should be fine. Also, if you’re new to .htaccess, it’s recommended that you make a backup of the original file before making any changes. That way, if something goes wrong or there is an error, you can restore original functionality.
so like this?
# BEGIN WordPress
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
RedirectMatch 301 /~uglychr1/(.*) http://brony.com/$1
# END WordPress
Technically that should work, but the rule isn’t WP-specific, so it would be more correct to place it after the line that says “# END WordPress”.
thanks for your help I appreciate it.
Jeff,
the .htaccess seemed to do the trick, but I notice in the error log that this is happening now:
[Mon Apr 08 08:27:20 2013] [error] [client 66.249.76.181] Request exceeded the limit of 10 internal redirects due to probable configuration error. Use ‘LimitInternalRecursion’ to increase the limit if necessary. Use ‘LogLevel debug’ to get a backtrace.
[Mon Apr 08 08:27:20 2013] [error] [client 66.249.76.181] Request exceeded the limit of 10 internal redirects due to probable configuration error. Use ‘LimitInternalRecursion’ to increase the limit if necessary. Use ‘LogLevel debug’ to get a backtrace.
is this something I should be worried about, or fix? If so, how do I fix it?
Yes, that means there is an endless loop happening for whichever URL(s) were being requested. If the redirects are pointing at the same domain, mod_rewrite should provide a way to redirect properly. The trick is to distinguish between the temporary URL at http://69.89.31.128/~uglychr1/brony, and the actual domain name http://brony.com/. Until you get it sorted, I would remove the RedirectMatch directive and restore original functionality. Endless loops generally are bad for SEO.
Ok, so how do I fix it?
Like I said, first I would remove the redirect directive to resolve the infinite loop. That will give you time to investigate and resolve the issue. You might want to ask your host about it, they should be able to help provide some clues.
I’ve tried talking to the host, and they don’t seem to care. They keep saying that it’s out of their support range.
Hey Jeff,
Not sure if this has already been asked before, but what if your WordPress installation is in a subdirectory, should your robots.txt file include the subdirectory name?
Example:
User-agent: *
Disallow: /subdirectory/wp-content/plugins/
Disallow: /subdirectory/wp-admin/
Disallow: /subdirectory/wp-includes/
Or is this not necessary if there is an .htaccess RewriteRule and an index.php (require('./subdirectory/wp-blog-header.php');) already in place in the root folder?
Thanks for your attention… Your site is awesome!
Hi Chris, robots.txt rules match against the URL path starting from its beginning, so a rule like Disallow: /wp-content/plugins/ only covers that exact path at the site root. If your WordPress files are actually served from the subdirectory, either include the subdirectory explicitly, or use a wildcard so a single rule covers the directory wherever it appears:
Disallow: /*/wp-content/plugins/
(if that makes sense, lol!) Hopefully that’s clear :)
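To make the matching behavior concrete, here’s a small Python sketch contrasting plain prefix matching (the baseline robots.txt semantics) with the `*` wildcard matching that Google and other major engines support (paths are hypothetical):

```python
import re

def prefix_match(rule, path):
    # Baseline robots.txt semantics: a rule anchors at the start of the URL path
    return path.startswith(rule)

def wildcard_match(rule, path):
    # Google-style extension: '*' matches any run of characters
    regex = '^' + re.escape(rule).replace(r'\*', '.*')
    return re.match(regex, path) is not None

# A root-level rule does not cover a subdirectory install...
print(prefix_match('/wp-content/plugins/', '/subdir/wp-content/plugins/x.php'))     # False
# ...but a wildcard rule matches the plugins directory wherever it appears
print(wildcard_match('/*wp-content/plugins/', '/subdir/wp-content/plugins/x.php'))  # True
```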
That absolutely makes sense, but raises another question. I have multiple subdirectories on the same domain, with previous versions of the WordPress site in them. Should I specifically disallow those directories?
Nope, a benefit of using wildcard patterns in your robots.txt rules is that a single rule matches every occurrence of the string, so there’s no need to repeat anything.
My concern is actually for the other content in those subdirectories: pages, posts, links, etc… Wouldn’t that get picked up by the crawlers? (I have no idea, I just want a proper representation of my current WordPress installation to appear in Google et al.)…
Thanks for your time, as always.
Good question.. actually the “wp-” directories aren’t referenced anywhere in the URLs of your web pages. By denying access to the underlying directory structure, you’re keeping compliant search engines and bots away from sensitive files. One common exception is the /uploads/ directory, which houses uploaded media files and other content. So to allow Google et al to crawl media content, the uploads directory can be whitelisted with an explicit “Allow” directive.