Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.
Robots.txt in 30 seconds
Primarily, robots directives disallow obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al know where they can go when visiting your site. You can also do nifty stuff like instruct specific user-agents and declare sitemaps. For just a simple text file, robots.txt wields considerable power. And we want to use whatever power we can get to our greatest advantage.
Robots.txt and WordPress
Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. Here is a good starting point for your next WP-based robots.txt:
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Allow: /wp-content/uploads/
Sitemap: http://example.com/sitemap.xml
That’s a plug-n-play version that you can further customize to fit your specific site structure as well as your own SEO strategy. To use this code for your WordPress-powered site, just copy/paste it into a blank file named “robots.txt” located in your web-accessible root directory, for example:
http://perishablepress.com/robots.txt
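Before uploading a file like the starter above, it can help to check a few URLs against it locally. Here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `build_parser` and the sample paths are illustrative assumptions, not part of any real site. Python's parser is only one implementation, and crawlers differ in how they resolve an Allow that overlaps a Disallow, so the `/wp-content/uploads/` case is deliberately not checked here.

```python
from urllib.robotparser import RobotFileParser

# The starter rules from above, pasted in verbatim.
STARTER_RULES = """\
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Allow: /wp-content/uploads/
Sitemap: http://example.com/sitemap.xml
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = build_parser(STARTER_RULES)

# Hypothetical paths: one ordinary post plus a few core WP locations.
for path in ("/2011/02/sample-post/", "/wp-admin/", "/feed/", "/xmlrpc.php"):
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(f"{path}: {verdict}")

# site_maps() requires Python 3.8 or later.
print(parser.site_maps())
```

A check like this only tells you how one parser reads the file; the testing tools offered by the search engines themselves remain the better reference for how their own bots behave.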
If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice some additional robots directives that are used to forbid crawl access to stuff like the site’s blackhole for bad bots. Let’s have a look:
User-agent: *
Disallow: /feed/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: ?wptheme=
Disallow: /blackhole/
Disallow: /transfer/
Disallow: /tweets/
Disallow: /mint/
Allow: /tag/mint/
Allow: /tag/feed/
Allow: /wp-content/online/
Sitemap: http://perishablepress.com/sitemap-perish.xml
Sitemap: http://perishablepress.com/sitemap-press.xml
User-agent: ia_archiver
Disallow: /
Spiders don’t need to be crawling around anything in my /cgi-bin/, so that’s disallowed, as are several non-WP subdirectories such as my Twitter Archive, FTP directory, and Mint installation.
Then I add a few explicit Allow directives to unblock access to specific URLs otherwise disallowed by existing rules. I also declare both of my sitemaps (yes you can have more than one), and then finally completely disallow access to everything for the annoying ia_archiver user-agent.
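To see how the per-agent groups and the two Sitemap lines are read, here is a small sketch with Python's `urllib.robotparser`, using a trimmed subset of the rules above. The `/about/` path and the choice of `Googlebot` as the "ordinary" crawler are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# A trimmed subset of the rules discussed above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /blackhole/
Disallow: /mint/
Sitemap: http://perishablepress.com/sitemap-perish.xml
Sitemap: http://perishablepress.com/sitemap-press.xml

User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# An ordinary crawler falls under the "*" group.
print(parser.can_fetch("Googlebot", "/about/"))    # hypothetical page
print(parser.can_fetch("Googlebot", "/mint/"))

# ia_archiver has its own group, which shuts it out of everything.
print(parser.can_fetch("ia_archiver", "/about/"))

# Both sitemap declarations are picked up (Python 3.8+).
print(parser.site_maps())
```

The point of the sketch is that a crawler matching a specific User-agent group follows that group instead of the general one, and that Sitemap lines apply to the file as a whole.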
Previously on robots.txt
As mentioned, my previous robots.txt file went unchanged for several years (which just vanished in the blink of an eye), but proved quite effective, especially with compliant spiders like googlebot. Unfortunately, it contains language that only a few of the bigger search engines understand (and thus obey):
User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: http://perishablepress.com/sitemap.xml
User-agent: ia_archiver
Disallow: /
Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (dollar sign $) is probably not well-supported either, although Google certainly gets it.
These patterns may be better supported in the future, but going forward there is no reason to include them. As with the robots examples above, the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.
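The difference between the two styles can be sketched in a few lines of Python. The two functions below are hypothetical helpers written for this illustration, not any crawler's actual code: `prefix_match` mimics a bot that only understands plain path prefixes, while `wildcard_match` mimics the extended syntax where `*` matches any run of characters and a trailing `$` anchors the end of the URL.

```python
import re


def prefix_match(rule: str, path: str) -> bool:
    """Basic matching: the rule is a literal prefix of the path."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Extended matching: '*' is a wildcard, a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with ".*" where "*" appeared.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None


# A wildcard rule from the old file only works for bots that speak wildcards.
print(wildcard_match("/*/feed/*", "/press/feed/"))
print(prefix_match("/*/feed/*", "/press/feed/"))

# A plain prefix rule is understood the same way by both kinds of bot.
print(prefix_match("/feed/", "/feed/atom/"))
print(wildcard_match("/feed/", "/feed/atom/"))
```

A bot limited to prefix matching treats the asterisk as a literal character, so the wildcard rule silently matches nothing, whereas the plain `/feed/` rule behaves identically under both interpretations.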
106 Responses
bee – February 23, 2011 •
excuse me for being ignorant but do i use “Disallow: ?wptheme=” as is or do i have to change it? Also the other lines, do i use them as is, obviously I know to change the sitemap to mine, but the rest? thanks
Chris Abbott – February 23, 2011 •
Yes, you should leave the “?wptheme=” line as is, as well as all the other lines. They use the agnostic absolute path starting with “/” so it will work on any domain.
Note, though, that you will need to change the paths if you’ve installed wordpress into a directory other than the root. For example “Disallow: /press/wp-content/”.
See http://www.robotstxt.org/robotstxt.html for more details or http://tool.motoricerca.info/robots-checker.phtml for validation. Google Webmaster Tools also provides a robots.txt validation service.
/chris
bee – February 23, 2011 •
Chris, Thanks for the explanation and the links.
I tried validating my newly made robots.txt file via the link you gave, however it’s saying that almost every line has an error.
eg.
User-agent: *
You specified both the generic user-agent “*” and specific user-agents for this block of code; this could be misinterpreted.
Disallow: /feed/
You specified both a generic path (“/” or empty disallow) and specific paths for this block of code; this could be misinterpreted.
You should not separate with blank lines commands belonging to the same block of code. Please, remove the empty line(s) above this row.
etc. etc.
Chris Abbott – February 26, 2011 •
Yeah bee, after testing with Jeff’s provided format, I found that re-ordering it to…
User-agent: ia_archiver
Disallow: /
User-agent: *
Disallow: /feed/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: ?wptheme=
Disallow: /blackhole/
Disallow: /transfer/
Disallow: /tweets/
Disallow: /mint/
Allow: /tag/mint/
Allow: /tag/feed/
Allow: /wp-content/online/
Sitemap: http://perishablepress.com/sitemap-perish.xml
Sitemap: http://perishablepress.com/sitemap-press.xml
…plays nicer with the validation tool. Specifically, I removed the empty line after User-agent line and placed the specific robots rules first, then the all robots asterisk rule next (last) just before the sitemaps rule (really last).
I think with this approach it’s attempting to maximize compatibility with lesser quality bots, as they seem to choke on the all robots asterisk and sitemap rules.
Either way, like Jeff mentioned, robots.txt is just a suggestion and all the worthwhile bots can process this just fine I’m sure.
/chris
MoonDog – February 25, 2011 •
Been reading through all of this and like you, haven’t revised my robots file for a long time. Combined some of your stuff and others. This is what I have. Let me know if I’m going overboard please?
User-agent: *
Allow: /wp-content/uploads/
Allow: /*/20*/wp-*
Allow: /*/wp-content/online/*
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: ?wptheme=
Disallow: /wp-
Disallow: /xmlrpc.php
Disallow: /cgi-bin/
Disallow: /trackback
Disallow: /comments
Disallow: /feed/
Disallow: /blackhole/
Disallow: /category/*/*
Disallow: /transfer/
Disallow: /tweets/
Disallow: /mint/
Disallow: /*?*
Disallow: /*?
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
User-agent: ia_archiver
Disallow: /
User-agent: ia_archiver/1.6
Disallow: /
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /
# digg mirror
User-agent: duggmirror
Disallow: /
# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://moondogsports.com/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN
Xarj – March 6, 2011 •
You are missing just one thing in your robots.txt to take care of one last duplicate content issue.
Google has indexed:
perishablepress.com/page/2/
which is duplicate content.
In the robots.txt you could add:
Disallow: /page
Disallow: /page/*
Disallow: */page/*
Cheers!
jeffM – March 10, 2011 •
…unless you’re using a static front page which is itself ‘paged’, then you might not want to do that.
http://codex.wordpress.org/Conditional_Tags#A_Paged_Page
Jeff Starr – March 10, 2011 •
I would argue that it’s not duplicate content, but that’s just me.
For your robots rules, blocking “/page” is not a good idea, and the wildcard selector isn’t recommended.
A better way to keep the second page out of the index of all compliant search engines would be something like:
Disallow: /page/
And omit the other two rules.
Xarj – March 10, 2011 •
JeffM
Not sure I understand. Could you give an example of the URL which would be noindexed but doesn’t have duplicate content?
If you are using a static front page, then the URL of that page would simply be your website homepage which would get indexed.
jeffM – March 10, 2011 •
The content of a static front page can be split into multiple pages, using the <!--nextpage--> tag, just like you would with regular posts. When you browse to page 2 or higher, your permalinked URI ends in /page/x/
If you have your Front page set to ‘Your latest posts’, the number of posts returned to the loop is limited to the ‘Blog pages show at most’ setting (unless you use a custom loop to override that). When you browse away from the first block (to ‘Older posts’, for example), your permalinked URI ends up like described above.
The phantom URI/dupe content issue is something we’ve discussed here at length, and it’s a known flaw in WP. I have a downloadable fix for it on my site.
I can’t figure why Perishable would have a phantom URI indexed, but it just underlines the importance of dealing with it.
[A penny for your thoughts, JS?]
Jeff Starr – March 10, 2011 •
I don’t consider it duplicate content, mostly because of the way I have my other page types set up with regards to controlling search enginz.
A visitor sees the first X number of posts on my site and then clicks the “previous posts” button to arrive at page/2/. There the visitor sees a different set of posts, the next set of X or whatever. Nowhere on my site is the content of page 2 (or any other page) duplicated.
Sure, some content is similar on different types of page views (e.g., tag and cat archives), but all of my date-based archives are noindex/nofollow and ultimately unique.
Then again, I just crawled out of bed, so I may be dreaming about all of this.
jeffM – March 10, 2011
With you on that, sensei. I see no dupe content on paged Perishables, wtf?
I also see no point whatsoever in Disallow:ing a /page/ segment, given it’s WP’s protocol for a paged front end, and robots do what they like.
Blog Designer – May 4, 2012 •
You saved me a lot of time with this. I was wondering why some pages were not getting indexed on google, only the home page was. Your solutions instantly corrected that. Thank you
Yael K. Miller – June 18, 2012 •
What’s /wp-content/online/ — is this a rename of /wp-content/uploads/ ?
I noticed that you’ve changed your robots.txt since you’ve written this. Why the change from Disallow: ?wptheme= to Disallow: ?theme=? What does this mean?
Jeff Starr – June 18, 2012 •
Hey Yael, the /online/ directory is non-wp, used for demos, code, and other assets throughout the site.
The change in ?wptheme= to ?theme= is due to a change in the plugin used for visitors to switch themes on the site. The new plugin uses a different parameter, and disallowing it via robots.txt helps prevent duplicate content (over 20 themes worth).
Yael K. Miller – June 18, 2012 •
Should I allow /wp-content/uploads/?
Thanks for answering.
Jeff Starr – June 18, 2012 •
That’s up to you, whether or not there is content in there that you want indexed.. e.g., images for Google image search :)
Yael K. Miller – June 19, 2012
Thanks again.
lee hughes – June 27, 2012 •
Hi,
I have added your robots.txt to my website but i’m getting errors in webmaster tools that the robots are blocking some urls.
You can see the errors here http://www.ephemeralproject.com/wp-content/uploads/2012/06/sitemap.jpg
Any idea on how to fix this?
Thanks
Lenny21 – July 8, 2012 •
I’m having the issue of google webmaster tools saying there are 1400 broken urls on my website. So I tried with robots.txt to block stuff like preview pages that might cause this. However google says those broken links are still there, and now they complain that my robots.txt is blocking googlebot. Any idea? Thanks!
Google: You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one).
While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
Paul G – September 5, 2012 •
I would recommend considering the “Crawl-delay” directive. Maybe something like the following stuck right after “User-agent: *”.
Crawl-delay: 10
It asks friendly bots to slow down their crawl rate. Your systems administrator will thank you.
Paul
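For anyone curious how a polite crawler might read that directive, here is a minimal sketch using Python's `urllib.robotparser`, which exposes it through `crawl_delay()` (Python 3.6+). The rules in the sketch are a made-up two-rule file, not anyone's real robots.txt. Keep in mind that Crawl-delay is a non-standard extension and not every crawler honors it; Google, for one, documents that it ignores the directive.

```python
from urllib.robotparser import RobotFileParser

# A tiny hypothetical file with the directive placed right after User-agent.
RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The delay (in seconds) that applies to a bot covered by the "*" group.
delay = parser.crawl_delay("*")
print(delay)
```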