Better Robots.txt Rules for WordPress
Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt
file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.
This post summarizes my research and gives you a near-perfect robots.txt file, so you can copy/paste it completely “as-is”, or use it as a template for your own customization.
Robots.txt in 30 seconds
Primarily, robots directives disallow obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al know where they can go when visiting your site. You can also do nifty stuff like instruct specific user-agents and declare sitemaps. For just a simple text file, robots.txt
wields considerable power. And we want to use whatever power we can get to our greatest advantage.
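To illustrate the syntax, here is a minimal, generic example; the paths and the Bingbot block are placeholders for illustration only, not recommendations:

User-agent: *
Disallow: /private/
Allow: /private/public-file.html

User-agent: Bingbot
Disallow: /no-bing/

Sitemap: https://example.com/sitemap.xml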
Better robots.txt for WordPress
Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. With that in mind, here are the new and improved robots.txt rules for WordPress:
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Only one small edit is required: change the Sitemap line to match the location of your sitemap (or remove the line if no sitemap is available).
I use this exact code on nearly all of my major sites. It’s also fine to customize the rules, say if you need to exclude any custom directories and/or files, based on your actual site structure and SEO strategy.
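For example, if your site had a members-only area in a custom directory, you could add a Disallow for it beneath the existing rules. Here is a quick sketch, where the /private-downloads/ path is a hypothetical placeholder and not part of the recommended rules:

User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Disallow: /private-downloads/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml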
Usage
To add the robots rules code to your WordPress-powered site, just copy/paste the code into a blank file named robots.txt. Then add the file to your web-accessible root directory, for example:
https://perishablepress.com/robots.txt
If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice an additional robots directive that forbids crawl access to the site’s blackhole for bad bots. Let’s have a look:
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Disallow: /blackhole/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://perishablepress.com/wp-sitemap.xml
Spiders don’t need to be crawling around anything in /wp-admin/, so that’s disallowed. Likewise, trackbacks, xmlrpc, and feeds don’t need to be crawled, so we disallow those as well. Also, notice that we add an explicit Allow directive that allows access to the WordPress Ajax file, so crawlers and bots have access to any Ajax-generated content. Lastly, we make sure to declare the location of our sitemap, just to make it official.
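If you want to sanity-check a deployed robots.txt programmatically, here is a rough sketch using Python's standard urllib.robotparser module. The example.com URLs are placeholders, and note that this simple parser may evaluate the Allow override for admin-ajax.php differently than Google's longest-match logic, so only clearly blocked or clearly permitted URLs are tested here:

from urllib.robotparser import RobotFileParser

# Point the parser at the live robots.txt file (placeholder domain)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# URLs that the rules above should clearly block or permit
print(rp.can_fetch("*", "https://example.com/wp-admin/options.php"))  # expected: False
print(rp.can_fetch("*", "https://example.com/feed/"))                 # expected: False
print(rp.can_fetch("*", "https://example.com/xmlrpc.php"))            # expected: False
print(rp.can_fetch("*", "https://example.com/sample-post/"))          # expected: True

# Python 3.8+ can also report any declared sitemaps
print(rp.site_maps())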
Notes & Updates
Update! The following directives have been removed from the tried and true robots.txt rules in order to appease Google’s new requirement that googlebot always be allowed complete crawl access to any publicly available file.
Disallow: /wp-content/
Disallow: /wp-includes/
Because /wp-content/ and /wp-includes/ include some publicly accessible CSS and JavaScript files, it’s recommended to simply allow googlebot complete access to both directories at all times. Otherwise you’ll be spending valuable time chasing structural and file-name changes in WordPress, and trying to keep them synchronized with some elaborate set of robots rules. It’s just easier to allow open access to these directories. Thus the two directives above were removed permanently from robots.txt, and are not recommended in general.
Apparently Google is so hardcore about this new requirement¹ that they actually are penalizing sites (a LOT) for non-compliance². Bad news for hundreds of thousands of site owners who have better things to do than keep up with Google’s constant, often arbitrary changes.
- ¹ Google demands complete access to all publicly accessible files.
- ² Note that it may be acceptable to disallow bot access to /wp-content/ and /wp-includes/ for other (non-Google) bots. Do your research though, before making any assumptions.
Previously on robots.txt..
As mentioned, my previous robots.txt file went unchanged for several years (which seemed to vanish in the blink of an eye). The previous rules proved quite effective, especially with compliant spiders like googlebot. Unfortunately, they contain language that only a few of the bigger search engines understand (and thus obey). Consider the following robots rules, which were used here at Perishable Press way back in the day.
User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: https://perishablepress.com/sitemap.xml
User-agent: ia_archiver
Disallow: /
Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (the dollar sign, $) is probably not well-supported either, although Google certainly gets it.
These patterns may be better supported in the future, but going forward there is no reason to include them. As seen in the “better robots” rules (above), the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.
Learn more..
Check out the following recommended sources to learn more about robots.txt, SEO, and more:
106 responses to “Better Robots.txt Rules for WordPress”
Is there any WordPress plugin (free or premium) to generate a better robots.txt file?
I’ve not heard of one, but it’s a great idea. Hmmm… ;)
Hi, if the WP install is not in the root (let’s say it’s in /newsite/, as described at http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory), where does the robots.txt go (root or install folder), and what should the paths be?
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
OR
Disallow: newsite/wp-admin/
Disallow: newsite/wp-includes/
Disallow: newsite/wp-content/plugins/
The robots.txt file should always be located in the web-accessible root directory, never a subdirectory. Then for the actual rules, either of your examples should work, and you can test them in Google’s Webmaster tools. The difference between the two examples here is that the first set will “disallow” any URL that contains /wp-admin/ and so on. The second set of rules will only disallow URLs that contain newsite/wp-admin/ et al, which is more specific. Also, I think the rules in your second example should include an initial slash, like so:
Disallow: /newsite/wp-admin/
Disallow: /newsite/wp-includes/
Disallow: /newsite/wp-content/plugins/
Thanks Jeff. Still a bit confused… so when WordPress is installed in a subdirectory (I did this, basically: http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory#Moving_a_Root_install_to_its_own_directory) but runs from the root, should the robots.txt file (in the root) be:
Disallow: /newsite/wp-admin/
– point to where the physical files/folders are
or
Disallow: /wp-admin/
– point to the virtual root, because the index.php tells it to go to /newsite/ anyway?
Webmaster Tools just shows you what’s written in the robots.txt file.
Hmm.. it looks like Google removed their Robots.txt analyzer, which is unfortunate because it was a super-useful tool. There should be something similar online, but I haven’t checked.
For the two different Disallow rules, the rules both work because each path is treated as a regular expression, which means that googlebot will check your site URLs for the presence of either /newsite/wp-admin/ or /wp-admin/ (whichever is used). So for example if you use this:
Disallow: /newsite/wp-admin/
Googlebot will ignore the following URLs:
http://example.com/newsite/wp-admin/some-file.php
http://example.com/newsite/wp-admin/some-other-file.php
http://example.com/subfolder/newsite/wp-admin/some-file.php
http://example.com/subfolder/newsite/wp-admin/some-other-file.php
http://example.com/subfolder/subfolder/newsite/wp-admin/some-file.php
http://example.com/subfolder/subfolder/newsite/wp-admin/some-other-file.php
..and so forth. BUT /newsite/wp-admin/ won’t match these URLs:
http://example.com/wp-admin/some-file.php
http://example.com/subfolder/wp-admin/some-other-file.php
So when you use Disallow: /wp-admin/, you’re basically blocking any URL that includes /wp-admin/, which is less restrictive than if you use Disallow: /newsite/wp-admin/. So either of these example robots directives will work fine, but it’s easier to just roll with /wp-admin/ because it makes the code more portable, and covers any other installations of WP that might be added in the future.
Thanks Jeff…
What about the following (and the rest of your txt examples)? When WP is installed inside a directory, do they also need the directory name added before them?
Disallow: /tag/
Disallow: /category/categories
Disallow: /author/
Disallow: /feed/
Disallow: /trackback/
Disallow: /print/
Disallow: /2001/
TO
Disallow: /wp/tag/
Disallow: /wp/category/categories
Disallow: /wp/author/
Disallow: /wp/feed/
Disallow: /wp/trackback/
Disallow: /wp/print/
Disallow: /wp/2001/
I assumed everything to do with WordPress needs to have the installation directory beforehand, so I went with /wp before everything; hope that’s right. Is there any way to check the robots.txt for errors besides Google Webmaster Tools?
Actually, see previous comment — there’s no need to add the /wp, but it’s also fine if you do add it. Either way.
To check for robots errors, there are general online validation tools that check robots.txt syntax, etc. At the moment there is no way to check for crawler-specific errors, not even in Google Webmaster Tools (they took that tool down).
No need to add /wp/, unless you’re running multiple sites and for some reason want to allow crawling of same-name directories in other WP installs. The reason why you don’t have to add the /wp/ is because robots rules are treated similar to regex patterns, where the disallow pattern is checked against each URL. So if you write this, for example:
Disallow: /category/categories
..compliant search engines will match any URL that contains the string “/category/categories”, such as:
http://example.com/wp/category/categories
http://example.com/wp/category/categories/whatever
http://example.com/something/category/categories/whatever
http://example.com/something/else/category/categories/whatever
I hope that helps!
I installed the WP Robots Txt plugin on my website and it rocks. Added custom instructions to the robots.txt file.
Thank you very very much for your help.
Best Regards
I’ll preface my question with….I’m not a techie at all, so any help you can give would be appreciated.
I noticed you have the following on the robots.txt file:
Disallow: /blackhole/
Disallow: /transfer/
Disallow: /tweets/
Disallow: /mint/
We don’t have these as directories on the system. Should we still have them on the robots.txt file, just in case? Any guidance would be appreciated! Thanks
Excellent question. Those rules won’t hurt anything if the directories don’t exist, but I recommend removing them if not needed.
Thanks so much Jeff for the help! It’s very much appreciated. :)
Hi, thanks for the explanation. I would like to know whether I should block pages, categories, and archives for better ranking?
If you don’t need those resources, I suggest using .htaccess to redirect to the home page.
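For example, here is a minimal .htaccess sketch using Apache’s mod_alias; the /tag/ and /author/ paths and the example.com domain are placeholders, so adjust them to whichever archives you actually want to retire:

# Permanently redirect unwanted archive URLs to the home page
RedirectMatch 301 ^/tag/ https://example.com/
RedirectMatch 301 ^/author/ https://example.com/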
Hi Jeff,
I am a little confused with regard to the robots.txt file and SEO.
For example, you have Disallow: /wp-content/
This seems like it would prevent Google, for example, from indexing blog post images, gallery images, and so on.
I thought, maybe incorrectly, that having Google index a site’s images was good for SEO.
Your thoughts?
Happy trails, Mike
Good point. I actually keep my images in a folder named /online/, and then allow for crawling/indexing with the following line:
Allow: /wp-content/online/
So if you’re using /uploads/, you would replace that with this:
Allow: /wp-content/uploads/
Thank you Jeff. Would you then have the following in the robots.txt file?
Disallow: /wp-content/
Allow: /wp-content/uploads/
Does the Allow: /wp-content/uploads/ override the Disallow: /wp-content/?
Happy trails, Mike Foate
Yes that is correct, the more specific Allow directive will override the Disallow directive, so the bot will be able to crawl and index the uploads content :)
Here is what we have….. I’m not very good with setting up robots.txt files…. What changes would you make, if any?
Thanks
# User-agent: *
# Disallow: /wp-admin/
# Disallow: /wp-includes/
# Disallow: /wp-trackback
# Disallow: /wp-feed
# Disallow: /wp-comments
# Disallow: /wp-content/plugins
# Disallow: /wp-content/themes
# Disallow: /wp-login.php
# Disallow: /wp-register.php
# Disallow: /feed
# Disallow: /trackback
# Disallow: /cgi-bin
# Disallow: /comments
# Disallow: *?s=
Sitemap: http://www.attractionmarketingdirect.com/sitemap.xml
Sitemap: http://www.attractionmarketingdirect.com/sitemap.xml
Hi Stephen, I recommend removing the pound signs “#” from the beginning of each line, and also removing one of the “Sitemap” directives, as duplicates may cause issues.
Howdy,
In Stephen’s robots.txt file, what purpose does the following directive serve?
Disallow: *?s=
I googled it and can’t find anything on it.
Happy trails, Mike
That directive is to prevent compliant bots from crawling any search results, which use the “?s=” in the requesting URLs. :-)
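For reference, WordPress search requests look something like the following (example.com is a placeholder), which is the kind of URL the ?s= pattern is meant to keep out of the crawl:

https://example.com/?s=search+terms
https://example.com/page/2/?s=search+terms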
This is my robots.txt file. When I updated it about 10 days ago, some search results disappeared from Google, especially the tag results. I think this is not the correct robots.txt file for my site.
Sitemap: http://www.youthfundoo.com/sitemap.xml
Sitemap: http://www.youthfundoo.com/sitemap-image.xml
User-agent: Mediapartners-Google
Disallow:
User-Agent: *
Allow: /
User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /blog/wp-admin/
Disallow: /blog/wp-includes/
Disallow: /blog/wp-content/plugins/
Disallow: /blog/wp-content/themes/
Disallow: /blog/wp-content/upgrade/
Disallow: /blog/page/
disallow: /blog/*?*
Disallow: /blog/comments/feed/
Disallow: /blog/tag
Disallow: /blog/author
Disallow: /blog/trackback
Disallow: /blog/*trackback
Disallow: /blog/*trackback*
Disallow: /blog/*/trackback
Disallow: /blog/*.html/$
Disallow: /blog/feed/
Disallow: /blog/xmlrpc.php
Disallow: /blog/?s=*
Disallow: *?wptheme
Disallow: ?comments=*
Disallow: /blog/?p=*
Disallow: /blog/search?
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /page/
Disallow: /*?*
Disallow: /comments/feed/
Disallow: /tag
Disallow: /author
Disallow: /trackback
Disallow: /*trackback
Disallow: /*trackback*
Disallow: /*/trackback
Disallow: /*.html/$
Disallow: /feed/
Disallow: /xmlrpc.php
Disallow: /?s=*
Disallow: /?p=*
Disallow: /search?
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Mobile
Allow: /
Hey prashant, yes those robots.txt rules could be simplified. Currently there is a lot of redundancy.. such as with the trackback rules; really you only need one, such as /trackback. Until you get it sorted, I would remove all of these rules and just add something simple yet effective, such as the rules presented in this article. That will give you time to sort things out while allowing Google et al to crawl your content.