Latest TweetsGreat post about the latest power grab: www.eff.org/deeplinks/2018/09/…
Perishable Press

Better Robots.txt Rules for WordPress

[ Better Robots.txt Rules for WP ] Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.

This post summarizes my research and gives you a near-perfect robots file, so you can copy/paste completely “as-is”, or use a template to give you a starting point for your own customization.

Robots.txt in 30 seconds

Primarily, robots directives disallow obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al know where they can go when visiting your site. You can also do nifty stuff like instruct specific user-agents and declare sitemaps. For just a simple text file, robots.txt wields considerable power. And we want to use whatever power we can get to our greatest advantage.

Better robots.txt for WordPress

Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. With that in mind, here are the new and improved robots.txt rules for WordPress:

User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

Only one small edit is required: change the Sitemap to match the location of your sitemap (or remove the line if no sitemap is available).

I use this exact code on nearly all of my major sites. It’s also fine to customize the rules, say if you need to exclude any custom directories and/or files, based on your actual site structure and SEO strategy.

Usage

To add the robots rules code to your WordPress-powered site, just copy/paste the code into a blank file named robots.txt. Then add the file to your web-accessible root directory, for example:

https://perishablepress.com/robots.txt

If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice an additional robots directive that forbids crawl access to the site’s blackhole for bad bots. Let’s have a look:

User-agent: *
Disallow: /wp-admin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /feed/
Disallow: /blackhole/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://perishablepress.com/sitemap.xml

Spiders don’t need to be crawling around anything in /wp-admin/, so that’s disallowed. Likewise, trackbacks, xmlrpc, and feeds don’t need to be crawled, so we disallow those as well. Also, notice that we add an explicit Allow directive that allows access to the WordPress Ajax file, so crawlers and bots have access to any Ajax-generated content. Lastly, we make sure to declare the location of our sitemap, just to make it official.

Notes & Updates

Update! The following directives have been removed from the tried and true robots.txt rules in order to appease Google’s new requirements that googlebot always is allowed complete crawl access to any publicly available file.

Disallow: /wp-content/
Disallow: /wp-includes/

Because /wp-content/ and /wp-includes/ include some publicly accessible CSS and JavaScript files, it’s recommended to just allow googlebot complete access to both directories always. Otherwise you’ll be spending valuable time chasing structural and file name changes in WordPress, and trying to keep them synchronized with some elaborate set of robots rules. It’s just easier to allow open access to these directories. Thus the two directives above were removed permanently from robots.txt, and are not recommended in general.

Apparently Google is so hardcore about this new requirement1 that they actually are penalizing sites (a LOT) for non-compliance2. Bad news for hundreds of thousands of site owners who have better things to do than keep up with Google’s constant, often arbitrary changes.

  • 1 Google demands complete access to all publicly accessible files.
  • 2 Note that it may be acceptable to disallow bot access to /wp-content/ and /wp-includes/ for other (non-Google) bots. Do your research though, before making any assumptions.

Previously on robots.txt..

As mentioned, my previous robots.txt file went unchanged for several years (which just vanished in the blink of an eye). The previous rules proved quite effective, especially with compliant spiders like googlebot. Unfortunately, it contains language that only a few of the bigger search engines understand (and thus obey). Consider the following robots rules, which were used here at Perishable Press way back in the day.

Important! Please do not use the following rules on any live site. They are for reference and learning purposes only. For live sites, use the Better robots.txt rules, provided in the previous section.
User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: https://perishablepress.com/sitemap.xml

User-agent: ia_archiver
Disallow: /

Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (dollar sign $) is probably not well-supported either, although Google certainly gets it.

These patterns may be better supported in the future, but going forward there is no reason to include them. As seen in the “better robots” rules (above), the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.

Learn more..

Check out the following recommended sources to learn more about robots.txt, SEO, and more:

Jeff Starr
About the Author Jeff Starr = Creative thinker. Passionate about free and open Web.
Archives
106 responses
  1. Is it advisable to block /feed url on Robots.txt File?

    User-agent: *
    Disallow: /feed/

    If Yes, Why Would You Do That ?

    • Jeff Starr

      You don’t want search engines to return your feeds in the search results.. you want them to return your web pages :)

      • Jeff,

        In that regard, what are your thoughts on Joost’s comment on this matter:

        Blocking /feed/ is a bad idea because an RSS feed is actually a valid sitemap for Google. Blocking it would prevent Google from using that to find new content on your site.

        Here’s the (short) post where he says that: http://yoast.com/example-robots-txt-wordpress/

      • Jeff Starr

        Sure that makes sense if you don’t have a sitemap ;) Otherwise, keeping your feed content out of search results keeps juice focused on your actual web pages.

  2. @Jeff: Now that’s some sense-talk! :-)

    Thanks!

  3. Thanks for publishing this helpful article, its very informative, keep it Up!

[ Comments are closed for this post ]