Perishable Press

Better Robots.txt Rules for WordPress

[ Better Robots.txt Rules for WP ] Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.

This post summarizes my research and gives you a near-perfect robots.txt file, so you can copy/paste it completely “as-is”, or use it as a starting point for your own customization.

Robots.txt in 30 seconds

Primarily, robots directives disallow obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al know where they can go when visiting your site. You can also do nifty stuff like instruct specific user-agents and declare sitemaps. For just a simple text file, robots.txt wields considerable power. And we want to use whatever power we can get to our greatest advantage.
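To make the mechanics concrete, here is a minimal sketch using Python’s built-in robotparser module; the rules, bot name, and URLs are illustrative only, not from any real site:

```python
from urllib import robotparser

# Illustrative rules: all bots blocked from /private/, one named bot
# blocked entirely, plus a sitemap declaration.
rules = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant spider asks "may I fetch this URL?" before crawling:
print(rp.can_fetch("*", "https://example.com/private/x.html"))  # False
print(rp.can_fetch("BadBot", "https://example.com/anything"))   # False
print(rp.can_fetch("*", "https://example.com/public.html"))     # True
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```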

Better robots.txt for WordPress

Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. With that in mind, here are the new and improved robots.txt rules for WordPress:

User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

Only one small edit is required: change the Sitemap to match the location of your sitemap (or remove the line if no sitemap is available).

I use this exact code on nearly all of my major sites. It’s also fine to customize the rules, say if you need to exclude any custom directories and/or files, based on your actual site structure and SEO strategy.
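If you’d like to sanity-check the rules before deploying, Python’s built-in robotparser can run them against sample URLs. This is a rough check using an example domain; note that Python’s parser applies rules in file order rather than Google’s longest-match precedence, so the admin-ajax.php Allow line is not exercised here:

```python
from urllib import robotparser

# The recommended WordPress rules from above, with an example sitemap URL.
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/wp-admin/"))        # False
print(rp.can_fetch("*", "https://example.com/feed/"))            # False
print(rp.can_fetch("*", "https://example.com/xmlrpc.php"))       # False
print(rp.can_fetch("*", "https://example.com/2012/some-post/"))  # True
```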

Usage

To add the robots rules code to your WordPress-powered site, just copy/paste the code into a blank file named robots.txt. Then add the file to your web-accessible root directory, for example:

https://perishablepress.com/robots.txt

If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice an additional robots directive that forbids crawl access to the site’s blackhole for bad bots. Let’s have a look:

User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Disallow: /blackhole/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://perishablepress.com/sitemap.xml

Spiders don’t need to be crawling around anything in /wp-admin/, so that’s disallowed. Likewise, trackbacks, xmlrpc, and feeds don’t need to be crawled, so we disallow those as well. Also, notice that we add an explicit Allow directive that allows access to the WordPress Ajax file, so crawlers and bots have access to any Ajax-generated content. Lastly, we make sure to declare the location of our sitemap, just to make it official.
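About that Allow directive: Google documents longest-match precedence, where the most specific (longest) matching rule wins when both an Allow and a Disallow match a URL. A minimal sketch of that logic, handling simple prefixes only (no wildcards), might look like this:

```python
# Sketch of Google's documented longest-match precedence: when several
# rules match a URL path, the longest (most specific) rule wins.
def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", prefix) pairs."""
    best = ("allow", "")  # default: everything is crawlable
    for kind, prefix in rules:
        if path.startswith(prefix) and len(prefix) > len(best[1]):
            best = (kind, prefix)
    return best[0] == "allow"

# Hypothetical ruleset mirroring the Disallow/Allow pair above:
ruleset = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/options.php", ruleset))     # False
print(is_allowed("/wp-admin/admin-ajax.php", ruleset))  # True
print(is_allowed("/2012/better-robots/", ruleset))      # True
```

The longer Allow rule out-scores the shorter Disallow for the Ajax file, which is exactly why the pair of directives can coexist.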

Notes & Updates

Update! The following directives have been removed from the tried and true robots.txt rules, in order to comply with Google’s new requirement that googlebot always be allowed complete crawl access to any publicly available file.

Disallow: /wp-content/
Disallow: /wp-includes/

Because /wp-content/ and /wp-includes/ include some publicly accessible CSS and JavaScript files, it’s recommended to simply allow googlebot complete access to both directories. Otherwise you’ll spend valuable time chasing structural and file-name changes in WordPress, trying to keep them synchronized with some elaborate set of robots rules. It’s just easier to allow open access to these directories. Thus the two directives above were removed permanently from robots.txt, and are no longer recommended in general.
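To see the problem concretely: core scripts and theme styles live under these directories, so the removed rules told compliant crawlers to skip exactly the files Google needs in order to render the page. A quick illustration with Python’s built-in robotparser (the file paths are typical examples):

```python
from urllib import robotparser

# The two directives that were removed from the recommended rules:
old_rules = """\
User-agent: *
Disallow: /wp-content/
Disallow: /wp-includes/
"""

rp = robotparser.RobotFileParser()
rp.parse(old_rules.splitlines())

# Core JavaScript and theme CSS live under these directories, so the old
# rules blocked the very files Google needs to render the page:
print(rp.can_fetch("Googlebot", "https://example.com/wp-includes/js/jquery/jquery.js"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/wp-content/themes/twentytwelve/style.css"))  # False
```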

Apparently Google is so hardcore about this new requirement1 that they are actually penalizing sites (a LOT) for non-compliance2. Bad news for the hundreds of thousands of site owners who have better things to do than keep up with Google’s constant, often arbitrary changes.

  • 1 Google demands complete access to all publicly accessible files.
  • 2 Note that it may be acceptable to disallow bot access to /wp-content/ and /wp-includes/ for other (non-Google) bots. Do your research though, before making any assumptions.

Previously on robots.txt..

As mentioned, my previous robots.txt file went unchanged for several years (which seemed to vanish in the blink of an eye). The previous rules proved quite effective, especially with compliant spiders like googlebot. Unfortunately, they contain language that only a few of the bigger search engines understand (and thus obey). Consider the following robots rules, which were used here at Perishable Press way back in the day.

Important! Please do not use the following rules on any live site. They are for reference and learning purposes only. For live sites, use the Better robots.txt rules, provided in the previous section.
User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: https://perishablepress.com/sitemap.xml

User-agent: ia_archiver
Disallow: /

Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (dollar sign $) is probably not well-supported either, although Google certainly gets it.

These patterns may be better supported in the future, but going forward there is no reason to include them. As seen in the “better robots” rules (above), the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.
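For reference, the extended syntax works like this: * matches any sequence of characters, and a trailing $ anchors the match to the end of the URL. Here is a rough sketch of that matching logic, based on Google’s documented semantics (not an official implementation):

```python
import re

def rule_matches(rule, path):
    """Google-style rule matching: '*' is a wildcard, trailing '$' anchors the end."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

# Wildcard matches any characters in the path:
print(rule_matches("/*/feed/*", "/press/feed/rss2"))          # True
# Dollar sign means "ends with":
print(rule_matches("/*/*.js$", "/press/scripts/app.js"))      # True
print(rule_matches("/*/*.js$", "/press/scripts/app.js?v=2"))  # False
# Without wildcards, a rule is just a prefix match:
print(rule_matches("/wp-admin/", "/wp-admin/options.php"))    # True
```

Bots that don’t understand the extended syntax treat the * and $ characters literally, which is why such rules silently fail for lesser crawlers.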


About the Author: Jeff Starr. Creative thinker. Passionate about the free and open Web.
106 responses
  1. Jithin Johny George December 3, 2012 @ 10:56 pm

    Is there any WordPress plugin, free or premium, to generate a better robots.txt file?

  2. Hi, if the WP install is not in the root (let’s say it’s in /newsite/, as in http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory), where does the robots.txt go (root or install folder), and what should the paths be:

    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-content/plugins/

    OR

    Disallow: newsite/wp-admin/
    Disallow: newsite/wp-includes/
    Disallow: newsite/wp-content/plugins/

    • Jeff Starr

      The robots.txt file should always be located in the web-accessible root directory, never a subdirectory. Then for the actual rules, either of your examples can work, and you can test them in Google’s Webmaster Tools. The difference between the two examples is that the first set will “disallow” any URL whose path begins with /wp-admin/ and so on, while the second set only disallows URLs whose paths begin with newsite/wp-admin/ et al, which is more specific. Also, the rules in your second example should include an initial slash, like so:

      Disallow: /newsite/wp-admin/
      Disallow: /newsite/wp-includes/
      Disallow: /newsite/wp-content/plugins/

      • Thanks Jeff. Still a bit confused… so when wordpress is installed in a subdirectory (i did this basically: http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory#Moving_a_Root_install_to_its_own_directory) but runs the root the robots.txt file (in the root) should be:

        Disallow: /newsite/wp-admin/

        – point to where the physical files/folders are

        or

        Disallow: /wp-admin/

        – point to virtual root because the index.php tells it to go to /newsite/ anyway?

        webmaster tools just shows you what’s written in the robots.txt

      • Jeff Starr

        Hmm.. it looks like Google removed their Robots.txt analyzer, which is unfortunate because it was a super-useful tool. There should be something similar online, but I haven’t checked.

        For the two different Disallow rules, each path is treated as a prefix that is matched against the beginning of the URL path. So for example if you use this:

        Disallow: /newsite/wp-admin/

        Googlebot will ignore any URL whose path begins with /newsite/wp-admin/, such as:

        http://example.com/newsite/wp-admin/some-file.php
        http://example.com/newsite/wp-admin/some-other-file.php

        ..and so forth. BUT /newsite/wp-admin/ won’t match these URLs, because their paths don’t begin with it:

        http://example.com/wp-admin/some-file.php
        http://example.com/subfolder/newsite/wp-admin/some-file.php

        So when you use Disallow: /wp-admin/, you’re blocking URLs that begin with /wp-admin/ at the root of the site, while Disallow: /newsite/wp-admin/ blocks the same directory under /newsite/. Use whichever form matches the paths that actually appear in your site’s URLs.
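Disallow values are matched as prefixes against the beginning of the URL path, which you can verify with Python’s built-in robotparser (placeholder domain and paths):

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /newsite/wp-admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The rule is a prefix match against the URL path:
print(rp.can_fetch("*", "http://example.com/newsite/wp-admin/some-file.php"))  # False
# It does not match /wp-admin/ at the root:
print(rp.can_fetch("*", "http://example.com/wp-admin/some-file.php"))          # True
```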

  3. Thanks Jeff…

  4. What about the following (and the rest of your txt examples), when WP is installed inside a directory do they also need the directory name added before them?

    Disallow: /tag/
    Disallow: /category/categories
    Disallow: /author/
    Disallow: /feed/
    Disallow: /trackback/
    Disallow: /print/
    Disallow: /2001/

    TO

    Disallow: /wp/tag/
    Disallow: /wp/category/categories
    Disallow: /wp/author/
    Disallow: /wp/feed/
    Disallow: /wp/trackback/
    Disallow: /wp/print/
    Disallow: /wp/2001/

    • I assumed everything to do with WordPress needs to have the installation directory beforehand, so I went with /wp before everything; hope that’s right. Is there any way to check the robots.txt for errors besides Google Webmaster Tools?

      • Jeff Starr

        Actually, see previous comment: whether you need the /wp depends on whether it appears in your site’s URLs. If your URLs include /wp/, keep it in the rules; if not, leave it out.

        To check for robots errors, the robots.txt sites have general validation tools for syntax, etc. At the moment there is no way to check for crawler-specific errors, not even in Google Webmaster Tools (they took it down).

    • Jeff Starr

      Whether you need to add the /wp/ depends on whether it appears in your site’s URLs. The reason is that robots rules are matched as prefixes against the beginning of the URL path. So if you write this for example:

      Disallow: /category/categories

      ..compliant search engines will block any URL whose path begins with the string “/category/categories”, such as:

      http://example.com/category/categories
      http://example.com/category/categories/whatever

      But it won’t match URLs where the directory sits deeper in the path, such as http://example.com/wp/category/categories/whatever. To cover those, either include the /wp/ prefix in the rule, or use a wildcard (supported by the major engines): Disallow: /*/category/categories

      I hope that helps!

  5. I installed WP Robots Txt Plugin in my website and it rocks. Added custom instructions in Robots.txt file.
    Thank you very very much for your help.
    Best Regards

  6. I’ll preface my question with….I’m not a techie at all, so any help you can give would be appreciated.

    I noticed you have the following on the robots.txt file:

    Disallow: /blackhole/
    Disallow: /transfer/
    Disallow: /tweets/
    Disallow: /mint/

    We don’t have these as directories on the system. Should we still have them on the robots.txt file, just in case? Any guidance would be appreciated! Thanks

  7. Hi Thanks for the explanation. I would like to know whether I should block pages, categories and archives for better ranking?

  8. cowboy Mike March 11, 2013 @ 10:47 am

    Hi Jeff,
    I am a little confused w/ regard to the robots txt file and seo.

    For example you have Disallow: /wp-content/

    This seems like it would prevent for example google from indexing blog post images, gallery images and so on.

    I thought, may be incorrectly, that having google index a site’s images was good for seo.

    Your thoughts?

    Happy trails, Mike

    • Jeff Starr

      Good point. I actually keep my images in a folder named /online/, and then allow for crawling/indexing with the following line:

      Allow: /wp-content/online/

      So if you’re using the default /uploads/ directory instead, you would use this:

      Allow: /wp-content/uploads/

  9. cowboy Mike March 11, 2013 @ 1:46 pm

    Thank you Jeff. Would you then have the following in the robots text?

    Disallow: /wp-content/
    Allow: /wp-content/uploads/

    Does the Allow: /wp-content/uploads/ override the Disallow: /wp-content/?

    Happy trails, Mike Foate

    • Jeff Starr

      Yes that is correct, the more specific Allow directive will override the Disallow directive, so the bot will be able to crawl and index the uploads content :)

  10. Stephen Malan March 18, 2013 @ 2:32 pm

    Here is what we have... not very good with setting up robots.txt files... what changes would you make, if any?

    Thanks

    # User-agent: *
    # Disallow: /wp-admin/
    # Disallow: /wp-includes/
    # Disallow: /wp-trackback
    # Disallow: /wp-feed
    # Disallow: /wp-comments
    # Disallow: /wp-content/plugins
    # Disallow: /wp-content/themes
    # Disallow: /wp-login.php
    # Disallow: /wp-register.php
    # Disallow: /feed
    # Disallow: /trackback
    # Disallow: /cgi-bin
    # Disallow: /comments
    # Disallow: *?s=
    Sitemap: http://www.attractionmarketingdirect.com/sitemap.xml
    Sitemap: http://www.attractionmarketingdirect.com/sitemap.xml

    • Jeff Starr

      Hi Stephen, I recommend removing the pound signs “#” from the beginning of each line, and also remove one of the “Sitemap” directives, as duplicates may cause issues.
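The pound signs matter because any line beginning with # in robots.txt is treated as a comment, so a file of commented-out rules disallows nothing. A quick check with Python’s built-in robotparser (placeholder domain):

```python
from urllib import robotparser

# Rules commented out with "#", as in the example above:
commented = """\
# User-agent: *
# Disallow: /wp-admin/
Sitemap: http://www.example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(commented.splitlines())

# Every rule is commented out, so nothing is actually blocked:
print(rp.can_fetch("*", "http://www.example.com/wp-admin/"))  # True
```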

  11. cowboy Mike March 19, 2013 @ 12:24 pm

    Howdy,
    In Stephen’s robots.txt file, what purpose does the following directive serve?

    Disallow: *?s=

    I googled it and can’t find anything on it.

    Happy trails, Mike

  12. This is my robots.txt file. When I updated it about 10 days ago, some search results disappeared from Google, especially tag results. I think my site’s robots.txt file is not correct:

    Sitemap: http://www.youthfundoo.com/sitemap.xml
    Sitemap: http://www.youthfundoo.com/sitemap-image.xml

    User-agent: Mediapartners-Google
    Disallow:

    User-Agent: *
    Allow: /

    User-agent: *
    # disallow all files in these directories
    Disallow: /cgi-bin/
    Disallow: /blog/wp-admin/
    Disallow: /blog/wp-includes/
    Disallow: /blog/wp-content/plugins/
    Disallow: /blog/wp-content/themes/
    Disallow: /blog/wp-content/upgrade/
    Disallow: /blog/page/
    disallow: /blog/*?*
    Disallow: /blog/comments/feed/
    Disallow: /blog/tag
    Disallow: /blog/author
    Disallow: /blog/trackback
    Disallow: /blog/*trackback
    Disallow: /blog/*trackback*
    Disallow: /blog/*/trackback
    Disallow: /blog/*.html/$
    Disallow: /blog/feed/
    Disallow: /blog/xmlrpc.php
    Disallow: /blog/?s=*
    Disallow: *?wptheme
    Disallow: ?comments=*
    Disallow: /blog/?p=*
    Disallow: /blog/search?
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-content/
    Disallow: /page/
    Disallow: /*?*
    Disallow: /comments/feed/
    Disallow: /tag
    Disallow: /author
    Disallow: /trackback
    Disallow: /*trackback
    Disallow: /*trackback*
    Disallow: /*/trackback
    Disallow: /*.html/$
    Disallow: /feed/
    Disallow: /xmlrpc.php
    Disallow: /?s=*
    Disallow: /?p=*
    Disallow: /search?

    User-agent: Googlebot-Image
    Allow: /wp-content/uploads/

    User-agent: Adsbot-Google
    Allow: /

    User-agent: Googlebot-Mobile
    Allow: /

    • Jeff Starr

      Hey prashant, yes those robots.txt rules could be simplified. Currently there is a lot of redundancy.. such as with the trackback rules, really you only need one, such as /trackback. Until you get it sorted I would remove all of these rules and just add something simple yet effective, such as the rules presented in this article. That will give you time to sort things out while allowing google et al to crawl your content.

[ Comments are closed for this post ]