Perishable Press

Better Robots.txt Rules for WordPress

Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.

This post summarizes my research and gives you a near-perfect robots file, which you can copy/paste completely “as-is”, or use as a template and starting point for your own customization.

Robots.txt in 30 seconds

Primarily, robots directives disallow obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al know where they can go when visiting your site. You can also do nifty stuff like instruct specific user-agents and declare sitemaps. For just a simple text file, robots.txt wields considerable power. And we want to use whatever power we can get to our greatest advantage.
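
To make that concrete, here is a minimal sketch that uses Python’s standard-library urllib.robotparser to ask whether a given bot may fetch a given URL. The rules below, the example.com URLs, and the bot name “ExampleBot” are assumptions made up for illustration; they are not the rules recommended in this post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

def load_rules(text):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = load_rules(RULES)

# A generic bot falls under the "*" group: public pages yes, /private/ no.
print(parser.can_fetch("ExampleBot", "https://example.com/about/"))              # True
print(parser.can_fetch("ExampleBot", "https://example.com/private/notes.html"))  # False
# The ia_archiver group blocks that one bot everywhere.
print(parser.can_fetch("ia_archiver", "https://example.com/about/"))             # False
```

Keep in mind that this only models what an obedient bot would do; nothing in robots.txt actually enforces the rules.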

Better robots.txt for WordPress

Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. With that in mind, here are the new and improved robots.txt rules for WordPress:

User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

Only one small edit is required: change the Sitemap to match the location of your sitemap (or remove the line if no sitemap is available).

I use this exact code on nearly all of my major sites. It’s also fine to customize the rules, say if you need to exclude any custom directories and/or files, based on your actual site structure and SEO strategy.
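
If you maintain several sites, the customization can be scripted. The following is only a sketch: the function name build_robots_txt and its parameters are hypothetical, and the base rules are simply the ones listed above.

```python
def build_robots_txt(sitemap_url=None, extra_disallow=()):
    """Return robots.txt text: the base rules from this post plus extra Disallow paths."""
    base_disallow = ["/wp-admin/", "/trackback/", "/xmlrpc.php", "/feed/"]
    lines = ["User-agent: *"]
    for path in list(base_disallow) + list(extra_disallow):
        lines.append(f"Disallow: {path}")
    lines.append("Allow: /wp-admin/admin-ajax.php")
    # The Sitemap line is optional; leave it out when no sitemap exists.
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"

# Example: add one custom directory and declare a sitemap.
print(build_robots_txt("https://example.com/sitemap.xml", ["/blackhole/"]))
```

Save the returned text as robots.txt. Any custom paths you pass in should reflect your own site structure.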

Usage

To add the robots rules code to your WordPress-powered site, just copy/paste the code into a blank file named robots.txt. Then add the file to your web-accessible root directory, for example:

https://perishablepress.com/robots.txt

If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice an additional robots directive that forbids crawl access to the site’s blackhole for bad bots. Let’s have a look:

User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Disallow: /blackhole/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://perishablepress.com/sitemap.xml

Spiders don’t need to be crawling around anything in /wp-admin/, so that’s disallowed. Likewise, trackbacks, xmlrpc, and feeds don’t need to be crawled, so we disallow those as well. Also, notice that we add an explicit Allow directive that allows access to the WordPress Ajax file, so crawlers and bots have access to any Ajax-generated content. Lastly, we make sure to declare the location of our sitemap, just to make it official.
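
Why does the Allow line win even though /wp-admin/ is disallowed? Google’s documentation and RFC 9309 describe a “most specific rule wins” approach: the longest matching path takes precedence, and Allow wins a tie. Here is a minimal sketch of that logic, assuming plain prefix matching (no wildcards); the function name is_allowed is made up for this example.

```python
# The rules from this post, as (kind, path-prefix) pairs.
RULES = [
    ("disallow", "/wp-admin/"),
    ("disallow", "/trackback/"),
    ("disallow", "/xmlrpc.php"),
    ("disallow", "/feed/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

def is_allowed(path, rules=RULES):
    """Longest matching prefix wins; on a tie, allow wins; no match means allowed."""
    best_len, allowed = -1, True
    for kind, prefix in rules:
        if path.startswith(prefix):
            is_allow = (kind == "allow")
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed

for path in ["/wp-admin/options.php", "/wp-admin/admin-ajax.php",
             "/feed/", "/2018/09/some-post/"]:
    print(path, "->", "allowed" if is_allowed(path) else "blocked")
```

Not every parser works this way. Some, including Python’s own urllib.robotparser at the time of writing, apply the first matching rule in file order, so for those the position of the Allow line matters.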

Notes & Updates

Update! The following directives have been removed from the tried and true robots.txt rules to satisfy Google’s new requirement that googlebot always be allowed complete crawl access to any publicly available file.

Disallow: /wp-content/
Disallow: /wp-includes/

Because /wp-content/ and /wp-includes/ include some publicly accessible CSS and JavaScript files, it’s recommended to just allow googlebot complete access to both directories always. Otherwise you’ll be spending valuable time chasing structural and file name changes in WordPress, and trying to keep them synchronized with some elaborate set of robots rules. It’s just easier to allow open access to these directories. Thus the two directives above were removed permanently from robots.txt, and are not recommended in general.
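
To see the effect, here is a quick sketch using Python’s standard-library urllib.robotparser. It compares a shortened version of the old rules with the current approach for two asset URLs. The example.com domain and the theme name are hypothetical, and only /wp-admin/ is kept from the current rules to keep the example short.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

OLD_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
"""

NEW_RULES = """\
User-agent: *
Disallow: /wp-admin/
"""

# Hypothetical CSS and JavaScript URLs that a page might load.
ASSETS = [
    "https://example.com/wp-content/themes/example-theme/style.css",
    "https://example.com/wp-includes/js/jquery/jquery.js",
]

old, new = parse_rules(OLD_RULES), parse_rules(NEW_RULES)
for url in ASSETS:
    print(url)
    print("  old rules:", old.can_fetch("Googlebot", url))  # False
    print("  new rules:", new.can_fetch("Googlebot", url))  # True
```

With the old rules, both files are off limits to a compliant crawler, which is exactly what Google objects to.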

Apparently Google is so hardcore about this new requirement [1] that they actually are penalizing sites (a LOT) for non-compliance [2]. Bad news for hundreds of thousands of site owners who have better things to do than keep up with Google’s constant, often arbitrary changes.

  • [1] Google demands complete access to all publicly accessible files.
  • [2] Note that it may be acceptable to disallow bot access to /wp-content/ and /wp-includes/ for other (non-Google) bots. Do your research though, before making any assumptions.

Previously on robots.txt..

As mentioned, my previous robots.txt file went unchanged for several years (which just vanished in the blink of an eye). The previous rules proved quite effective, especially with compliant spiders like googlebot. Unfortunately, they contain language that only a few of the bigger search engines understand (and thus obey). Consider the following robots rules, which were used here at Perishable Press way back in the day.

Important! Please do not use the following rules on any live site. They are for reference and learning purposes only. For live sites, use the Better robots.txt rules, provided in the previous section.

User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: https://perishablepress.com/sitemap.xml

User-agent: ia_archiver
Disallow: /

Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (dollar sign $) is probably not well-supported either, although Google certainly gets it.

These patterns may be better supported in the future, but going forward there is no reason to include them. As seen in the “better robots” rules (above), the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.
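
For the curious, here is a rough sketch of how a wildcard-aware bot might interpret these patterns, next to what a bot limited to literal prefix matching would see. The helper names pattern_to_regex and matches are made up for this example, and real crawlers may differ in the details.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$' anchor) to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern, path):
    """True if the path matches the pattern from its first character."""
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/*/*.js$", "/press/script.js"))        # True
print(matches("/*/*.js$", "/press/script.js?ver=1"))  # False: '$' requires .js at the very end
print(matches("/*/feed/*", "/press/feed/atom/"))      # True

# A bot that only does literal prefix matching finds no match at all:
print("/press/script.js".startswith("/*/*.js$"))      # False
```

That last line is the whole problem: to a simple bot, the wildcard rule is just a path that never occurs, so the rule silently does nothing.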


Jeff Starr
About the Author Jeff Starr = Designer. Developer. Producer. Writer. Editor. Etc.
106 responses
  1. Excuse me for being ignorant, but do I use “Disallow: ?wptheme=” as is or do I have to change it? Also the other lines, do I use them as is? Obviously I know to change the sitemap to mine, but the rest?

    thanks

    • Chris Abbott February 23, 2011 @ 1:59 pm

      Yes, you should leave the “?wptheme=” line as is, as well as all the other lines. They use the agnostic absolute path starting with “/” so it will work on any domain.

      Note, though, that you will need to change the paths if you’ve installed wordpress into a directory other than the root. For example “Disallow: /press/wp-content/”.

      See http://www.robotstxt.org/robotstxt.html for more details or http://tool.motoricerca.info/robots-checker.phtml for validation. Google Webmaster Tools also provides a robots.txt validation service.

      /chris

  2. Chris, Thanks for the explanation and the links.

    I tried validating my newly made robots.txt file via the link you gave, however it’s saying that almost every line has an error.

    eg. User-agent: *
    You specified both the generic user-agent “*” and specific user-agents for this block of code; this could be misinterpreted.

    Disallow: /feed/
    You specified both a generic path (“/” or empty disallow) and specific paths for this block of code; this could be misinterpreted.
    You should not separate with blank lines commands belonging to the same block of code. Please, remove the empty line(s) above this row.

    etc. etc.

    • Chris Abbott February 26, 2011 @ 1:27 pm

      Yeah bee, after testing with Jeff’s provided format, I found that re-ordering it to…

      User-agent: ia_archiver
      Disallow: /

      User-agent: *
      Disallow: /feed/
      Disallow: /cgi-bin/
      Disallow: /wp-admin/
      Disallow: /wp-content/
      Disallow: /wp-includes/
      Disallow: /trackback/
      Disallow: /xmlrpc.php
      Disallow: ?wptheme=
      Disallow: /blackhole/
      Disallow: /transfer/
      Disallow: /tweets/
      Disallow: /mint/
      Allow: /tag/mint/
      Allow: /tag/feed/
      Allow: /wp-content/online/

      Sitemap: https://perishablepress.com/sitemap-perish.xml
      Sitemap: https://perishablepress.com/sitemap-press.xml

      …plays nicer with the validation tool. Specifically, I removed the empty line after User-agent line and placed the specific robots rules first, then the all robots asterisk rule next (last) just before the sitemaps rule (really last).

      I think with this approach it’s attempting to maximize compatibility with lesser quality bots, as they seem to choke on the all robots asterisk and sitemap rules.

      Either way, like Jeff mentioned, robots.txt is just a suggestion and all the worthwhile bots can process this just fine I’m sure.

      /chris

  3. Been reading through all of this and like you, haven’t revised my robots file for a long time. Combined some of your stuff and others. This is what I have. Let me know if I’m going overboard please?

    User-agent: *
    Allow: /wp-content/uploads/
    Allow: /*/20*/wp-*
    Allow: /*/wp-content/online/*
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: ?wptheme=
    Disallow: /wp-
    Disallow: /xmlrpc.php
    Disallow: /cgi-bin/
    Disallow: /trackback
    Disallow: /comments
    Disallow: /feed/
    Disallow: /blackhole/
    Disallow: /category/*/*
    Disallow: /transfer/
    Disallow: /tweets/
    Disallow: /mint/
    Disallow: /*?*
    Disallow: /*?
    Disallow: /*.php$
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: /*.gz$
    Disallow: /*.wmv$
    Disallow: /*.cgi$
    Disallow: /*.xhtml$
    User-agent: ia_archiver
    Disallow: /
    User-agent: ia_archiver/1.6
    Disallow: /
    # Google Image
    User-agent: Googlebot-Image
    Disallow:
    Allow: /*
    # Google AdSense
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /
    # digg mirror
    User-agent: duggmirror
    Disallow: /
    # BEGIN XML-SITEMAP-PLUGIN
    Sitemap: http://moondogsports.com/sitemap.xml.gz
    # END XML-SITEMAP-PLUGIN

  4. You are missing just one thing in your robots.txt to take care of one last duplicate content issue.

    Google has indexed: perishablepress.com/page/2/
    which is duplicate content.

    In the robots.txt you could add:
    Disallow: /page
    Disallow: /page/*
    Disallow: */page/*

    Cheers!

    • …unless you’re using a static front page which is itself ‘paged’, then you might not want to do that.

      http://codex.wordpress.org/Conditional_Tags#A_Paged_Page

    • Jeff Starr

      I would argue that it’s not duplicate content, but that’s just me.

      For your robots rules, blocking “/page” is not a good idea, and the wildcard selector isn’t recommended.

      A better way to keep the second page out of the index of all compliant search engines would be something like:

      Disallow: /page/

      And omit the other two rules.

  5. JeffM

    Not sure I understand. Could you give an example of the URL which would be noindexed but doesn’t have duplicate content?

    If you are using a static front page, then the URL of that page would simply be your website homepage which would get indexed.

  6. The content of a static front page can be split into multiple pages, using the <!--nextpage--> tag, just like you would with regular posts. When you browse to page 2 or higher, your permalinked URI ends in /page/x/

    If you have your Front page set to ‘Your latest posts’, the number of posts returned to the loop is limited to the ‘Blog pages show at most’ setting (unless you use a custom loop to override that). When you browse away from the first block (to ‘Older posts’, for example), your permalinked URI ends up like described above.

    The phantom URI/dupe content issue is something we’ve discussed here at length, and it’s a known flaw in WP. I have a downloadable fix for it on my site.

    I can’t figure why Perishable would have a phantom URI indexed, but it just underlines the importance of dealing with it.

    [A penny for your thoughts, JS?]

    • Jeff Starr

      I don’t consider it duplicate content, mostly because of the way I have my other page types set up with regards to controlling search enginz.

      A visitor sees the first X number of posts on my site and then clicks the “previous posts” button to arrive at page/2/. There the visitor sees a different set of posts, the next set of X or whatever. Nowhere on my site is the content of page 2 (or any other page) duplicated.

      Sure, some content is similar on different types of page views (e.g., tag and cat archives), but all of my date-based archives are noindex/nofollow and ultimately unique.

      Then again, I just crawled out of bed, so I may be dreaming about all of this.

      • With you on that, sensei. I see no dupe content on paged Perishables, wtf?

        I also see no point whatsoever in Disallow:ing a /page/ segment, given it’s WP’s protocol for a paged front end, and robots do what they like.

  7. You saved me a lot of time with this. I was wondering why some pages were not getting indexed on Google; only the home page was. Your solutions instantly corrected that. Thank you.

  8. What’s /wp-content/online/ — is this a rename of /wp-content/uploads/ ?

    I noticed that you’ve changed your robots.txt since you’ve written this. Why the change from
    Disallow: ?wptheme= to Disallow: ?theme=? What does this mean?

    • Jeff Starr

      Hey Yael, the /online/ directory is non-wp, used for demos, code, and other assets throughout the site.

      The change in ?wptheme= to ?theme= is due to a change in the plugin used for visitors to switch themes on the site. The new plugin uses a different parameter, and disallowing it via robots.txt helps prevent duplicate content (over 20 themes worth).

  9. Should I allow /wp-content/uploads/?

    Thanks for answering.

  10. lee hughes June 27, 2012 @ 11:28 pm

    Hi,

    I have added your robots.txt to my website but I’m getting errors in Webmaster Tools saying that the robots rules are blocking some URLs.

    You can see the errors here http://www.ephemeralproject.com/wp-content/uploads/2012/06/sitemap.jpg

    Any idea on how to fix this?

    Thanks

  11. I’m having an issue where Google Webmaster Tools says there are 1400 broken URLs on my website. So I tried using robots.txt to block things like preview pages that might be causing this. However, Google says those broken links are still there, and now it also complains that my robots.txt is blocking googlebot. Any idea? Thanks!

    Google: You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one).

    While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.

  12. I would recommend considering the “Crawl-delay” directive. Maybe something like the following stuck right after “User-agent: *”.

    Crawl-delay: 10

    It asks friendly bots to slow down their crawl rate. Your systems administrator will thank you.

    Paul
