Better Robots.txt Rules for WordPress
Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s
robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.
This post summarizes my research and gives you a near-perfect robots.txt file, so you can copy/paste it completely “as-is”, or use it as a template and starting point for your own customization.
Robots.txt in 30 seconds
Primarily, robots directives disallow obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al know where they can go when visiting your site. You can also do nifty stuff like instruct specific user-agents and declare sitemaps. For just a simple text file,
robots.txt wields considerable power. And we want to use whatever power we can get to our greatest advantage.
Better robots.txt for WordPress
Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. With that in mind, here are the new and improved robots.txt rules for WordPress:
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Only one small edit is required: change the
Sitemap to match the location of your sitemap (or remove the line if no sitemap is available).
I use this exact code on nearly all of my major sites. It’s also fine to customize the rules, say if you need to exclude any custom directories and/or files, based on your actual site structure and SEO strategy.
To add the robots rules code to your WordPress-powered site, just copy/paste the code into a blank file named
robots.txt. Then add the file to your web-accessible root directory, for example:

https://example.com/robots.txt
If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice an additional robots directive that forbids crawl access to the site’s blackhole for bad bots. Let’s have a look:
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Disallow: /blackhole/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://perishablepress.com/wp-sitemap.xml
Spiders don’t need to be crawling around anything in
/wp-admin/, so that’s disallowed. Likewise, trackbacks, xmlrpc, and feeds don’t need to be crawled, so we disallow those as well. Also, notice that we add an explicit
Allow directive that allows access to the WordPress Ajax file, so crawlers and bots have access to any Ajax-generated content. Lastly, we make sure to declare the location of our sitemap, just to make it official.
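As a sanity check, the rules above can be fed to Python’s standard-library robots.txt parser to confirm which URLs come back as crawlable. One caveat: urllib.robotparser applies rules in file order rather than Google’s longest-match precedence, so it will not honor the later Allow for admin-ajax.php; the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The "better robots.txt" rules from the article, pasted as-is.
RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Disallowed paths are reported as not fetchable for any user-agent.
print(parser.can_fetch("*", "https://example.com/wp-admin/options.php"))  # False
print(parser.can_fetch("*", "https://example.com/feed/"))                 # False
# Ordinary content remains crawlable.
print(parser.can_fetch("*", "https://example.com/2024/some-post/"))       # True
```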
Notes & Updates
Update! The following directives have been removed from the tried-and-true robots.txt rules in order to appease Google’s new requirement that googlebot always be allowed complete crawl access to any publicly available file.
Disallow: /wp-content/
Disallow: /wp-includes/
Apparently Google is so hardcore about this new requirement [1] that they actually are penalizing sites (a LOT) for non-compliance [2]. Bad news for hundreds of thousands of site owners who have better things to do than keep up with Google’s constant, often arbitrary changes.
- 1 Google demands complete access to all publicly accessible files.
- 2 Note that it may be acceptable to disallow bot access to
/wp-includes/ for other (non-Google) bots. Do your research, though, before making any assumptions.
Previously on robots.txt..
As mentioned, my previous
robots.txt file went unchanged for several years (which just vanished in the blink of an eye). The previous rules proved quite effective, especially with compliant spiders like
googlebot. Unfortunately, they contained language that only a few of the bigger search engines understand (and thus obey). Consider the following robots rules, which were used here at Perishable Press way back in the day.
User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: https://perishablepress.com/sitemap.xml

User-agent: ia_archiver
Disallow: /
Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (dollar sign
$) is probably not well-supported either, although Google certainly gets it.
These patterns may be better supported in the future, but going forward there is no reason to include them. As seen in the “better robots” rules (above), the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.
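A quick way to see how a wildcard-unaware crawler reads those old rules: Python’s standard-library parser implements plain prefix matching with no support for * or $, so a rule like Disallow: /*/feed/* matches nothing at all. A minimal sketch (the example.com URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# One wildcard rule from the old ruleset; a wildcard-unaware parser
# treats "*" as a literal character, not a pattern.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /*/feed/*",
])

# Google would block this URL via the wildcard; a plain prefix-matching
# parser does not, because the path does not literally start with "/*/feed/".
print(parser.can_fetch("*", "https://example.com/press/feed/"))  # True
```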
Check out the following recommended sources to learn more about robots.txt, SEO, and more:
is there any wordpress plugin free/premium to generate a better robots.txt file
I’ve not heard of one, but it’s a great idea. Hmmm… ;)
hi, if the WP install is not in the root (let’s say it’s in
/newsite/ like in http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory ) where does the robots.txt go (root or install folder) and what are the paths to be:
The robots.txt file should always be located in the web-accessible root directory, never a subdirectory. Then for the actual rules, either of your examples should work, and you can test them in Google’s Webmaster Tools. The difference between the two examples here is that the first set will “disallow” any URL whose path begins with
/wp-admin/ and so on. The second set of rules will only disallow URLs under
/newsite/wp-admin/ et al, which is more specific. Also, I think the rules in your second example should include an initial slash, like so:
Thanks Jeff. Still a bit confused… so when wordpress is installed in a subdirectory (i did this basically: http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory#Moving_a_Root_install_to_its_own_directory) but runs in the root, should the robots.txt file (in the root):
– point to where the physical files/folders are
– point to virtual root because the index.php tells it to go to
webmaster tools just shows you what’s written in the robots.txt
Hmm.. it looks like Google removed their Robots.txt analyzer, which is unfortunate because it was a super-useful tool. There should be something similar online, but I haven’t checked.
For the two different Disallow rules, the rules both work because each path is matched as a prefix against the URL path, which means that googlebot will check whether your site URLs begin with
/wp-admin/ or /newsite/wp-admin/ (whichever is used). So for example if you use this:
Googlebot will ignore the following URLs:
..and so forth. BUT
/newsite/wp-admin/ won’t match these URLs:
So when you use
Disallow: /wp-admin/, you’re basically blocking any URL whose path begins with
/wp-admin/, which is broader than
Disallow: /newsite/wp-admin/, since that rule only covers the install under /newsite/. Because matching starts at the beginning of the path, the shorter rule will not block /newsite/wp-admin/ URLs, so use whichever rule matches where your admin URLs actually live (or include both).
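The prefix behavior is easy to verify with Python’s standard-library robots.txt parser, which, like the spec, matches a Disallow path against the beginning of the URL path (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Prefix matching in practice: "Disallow: /wp-admin/" only applies to
# URL paths that BEGIN with /wp-admin/, not paths that merely contain it.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
])

print(parser.can_fetch("*", "https://example.com/wp-admin/post.php"))  # False
print(parser.can_fetch("*", "https://example.com/newsite/wp-admin/"))  # True
```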
What about the following (and the rest of your txt examples): when WP is installed inside a directory, do they also need the directory name added before them?
I assumed everything to do with wordpress needs to have the installation directory beforehand, so I went with /wp before everything; hope that’s right. Is there any way to check the robots.txt for errors besides google webmaster tools?
Actually, see previous comment: whether you need the /wp depends on your URLs. Robots paths are matched from the start of the URL path, so include /wp whenever the URLs you want to block actually begin with it (e.g., /wp/wp-admin/).
To check for robots errors, there are general robots.txt validation tools online that check syntax, etc. At the moment there is no way to check for crawler-specific errors, not even in Google Webmaster Tools (they took that feature down).
Whether you need to add the
/wp/ depends on where the URLs you want to block actually live. Robots rules are matched as path prefixes, where the disallow pattern is compared against the beginning of each URL path. So if you write this for example:
..compliant search engines will block any URL whose path begins with the string “
/category/categories”, such as:
I hope that helps!
I installed the WP Robots Txt plugin on my website and it rocks. Added custom instructions to the robots.txt file.
Thank you very very much for your help.
I’ll preface my question with….I’m not a techie at all, so any help you can give would be appreciated.
I noticed you have the following on the robots.txt file:
We don’t have these as directories on the system. Should we still have them on the robots.txt file, just in case? Any guidance would be appreciated! Thanks
Excellent question. Those rules won’t hurt anything if the directories don’t exist, but I recommend removing them if not needed.
Thanks so much Jeff for the help! It’s very much appreciated. :)
Hi Thanks for the explanation. I would like to know whether I should block pages, categories and archives for better ranking?
If you don’t need those resources, I suggest using .htaccess to redirect to the home page.
I am a little confused w/ regard to the robots txt file and seo.
For example you have Disallow: /wp-content/
This seems like it would prevent for example google from indexing blog post images, gallery images and so on.
I thought, may be incorrectly, that having google index a site’s images was good for seo.
Happy trails, Mike
Good point. I actually keep my images in a folder named
/online/, and then allow for crawling/indexing with the following line:
So if you’re using
/uploads/, you would replace that with this:
Thank you Jeff. Would you then have the following in the robots.txt? Would the
Allow: /wp-content/uploads/ override the Disallow: /wp-content/ rule?
Happy trails, Mike Foate
Yes that is correct, the more specific Allow directive will override the Disallow directive, so the bot will be able to crawl and index the uploads content :)
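For reference, Google-style precedence (the most specific matching rule wins, with Allow winning ties) can be sketched in a few lines of Python. This is an illustrative model, not a full robots.txt implementation, and the paths are made up:

```python
# A minimal sketch of Google-style rule precedence: among all rules whose
# path is a prefix of the URL path, the LONGEST one wins; Allow beats
# Disallow on a tie.

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", path_prefix) pairs."""
    best_len = -1
    best_verdict = True  # no matching rule means crawling is allowed
    for kind, prefix in rules:
        if path.startswith(prefix):
            # Longer (more specific) matches take precedence;
            # on equal length, prefer "allow".
            if len(prefix) > best_len or (len(prefix) == best_len and kind == "allow"):
                best_len = len(prefix)
                best_verdict = (kind == "allow")
    return best_verdict

rules = [
    ("disallow", "/wp-content/"),
    ("allow", "/wp-content/uploads/"),
]

print(is_allowed("/wp-content/plugins/x.php", rules))    # False
print(is_allowed("/wp-content/uploads/pic.jpg", rules))  # True
```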
Here is what we have…..not very good with setting up robots.txt files….what changes would you make if any?
# User-agent: *
# Disallow: /wp-admin/
# Disallow: /wp-includes/
# Disallow: /wp-trackback
# Disallow: /wp-feed
# Disallow: /wp-comments
# Disallow: /wp-content/plugins
# Disallow: /wp-content/themes
# Disallow: /wp-login.php
# Disallow: /wp-register.php
# Disallow: /feed
# Disallow: /trackback
# Disallow: /cgi-bin
# Disallow: /comments
# Disallow: *?s=
Hi Stephen, I recommend removing the pound signs “#” from the beginning of each line, and also removing one of the “Sitemap” directives, as duplicates may cause issues.
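To confirm that the leading “#” characters neutralize the rules, you can feed a commented-out file to Python’s standard-library parser; everything comes back as crawlable, because # starts a comment in robots.txt (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Lines beginning with "#" are comments, so this file contains no rules at all.
commented = [
    "# User-agent: *",
    "# Disallow: /wp-admin/",
    "# Disallow: /wp-includes/",
]

parser = RobotFileParser()
parser.parse(commented)

# Nothing is disallowed; even /wp-admin/ is reported as crawlable.
print(parser.can_fetch("*", "https://example.com/wp-admin/"))  # True
```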
In Stephen’s robots.txt file, what purpose does the following directive serve?
I googled it and can’t find anything on it.
Happy trails, Mike
That directive is to prevent compliant bots from crawling any search results, which use the “?s=” in the requesting URLs. :-)
This is my robots.txt file. Since I updated it about 10 days ago, some search results are gone from Google, especially tag results. I think my site’s robots.txt file is not correct.
# disallow all files in these directories
Hey prashant, yes those robots.txt rules could be simplified; currently there is a lot of redundancy, such as with the trackback rules, where really you only need one, such as
/trackback. Until you get it sorted, I would remove all of those rules and just add something simple yet effective, such as the rules presented in this article. That will give you time to sort things out while allowing Google et al to crawl your content.