Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt.
Robots.txt in 30 seconds
Primarily, robots directives disallow obedient spiders access to specified parts of your site. They can also explicitly “allow” access to specific files and directories. So basically they’re used to let Google, Bing et al know where they can go when visiting your site. You can also do nifty stuff like instruct specific user-agents and declare sitemaps. For just a simple text file, robots.txt wields considerable power. And we want to use whatever power we can get to our greatest advantage.
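To make the idea concrete, here is a minimal sketch of how an obedient spider might consult a robots.txt file before fetching a URL. It uses Python's standard-library `urllib.robotparser`; the rules, the `ExampleBot` user-agent, the example.com URLs, and the `is_allowed` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, not taken from any real site.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(rules_text, user_agent, url):
    """Return True if the parsed rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)

# A compliant crawler would run a check like this before each request.
print(is_allowed(EXAMPLE_RULES, "ExampleBot", "http://example.com/private/notes.html"))
print(is_allowed(EXAMPLE_RULES, "ExampleBot", "http://example.com/blog/"))
```

The first check should come back disallowed and the second allowed. Nothing enforces the result, though: it only matters if the bot chooses to honor it.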
Robots.txt and WordPress
Running WordPress, you want search engines to crawl and index your posts and pages, but not your core WP files and directories. You also want to make sure that feeds and trackbacks aren’t included in the search results. It’s also good practice to declare a sitemap. Here is a good starting point for your next WP-based robots.txt:
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Allow: /wp-content/uploads/
Sitemap: http://example.com/sitemap.xml
That’s a plug-n-play version that you can further customize to fit your specific site structure as well as your own SEO strategy. To use this code for your WordPress-powered site, just copy/paste it into a blank file named “robots.txt” located in your web-accessible root directory, for example:
http://perishablepress.com/robots.txt
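Before uploading, it can be worth testing the rules against a few sample paths. The sketch below is one way to do that, and it is only an illustration. It compares two evaluation strategies: Python's `urllib.robotparser`, which (in CPython at the time of writing) applies the first rule that matches in file order, and a hand-rolled "longest match wins" check similar to what Google documents. The helper names and sample paths are assumptions, and the longest-match helper ignores user-agent groups and wildcards for simplicity.

```python
from urllib.robotparser import RobotFileParser

STARTER_RULES = """\
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Allow: /wp-content/uploads/
Sitemap: http://example.com/sitemap.xml
"""

def first_match_allowed(rules_text, path):
    """Evaluate with the stdlib parser, which uses the first matching rule."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch("*", path)

def longest_match_allowed(rules_text, path):
    """Simplified longest-prefix evaluation (assumes one group, no wildcards)."""
    best_len, allowed = -1, True
    for line in rules_text.splitlines():
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field not in ("allow", "disallow") or not value:
            continue
        if path.startswith(value):
            is_allow = field == "allow"
            # The longer rule wins; on a tie, Allow wins.
            if len(value) > best_len or (len(value) == best_len and is_allow):
                best_len, allowed = len(value), is_allow
    return allowed

for sample in ("/wp-admin/options.php", "/wp-content/uploads/photo.jpg", "/2011/02/some-post/"):
    print(sample, first_match_allowed(STARTER_RULES, sample), longest_match_allowed(STARTER_RULES, sample))
```

The two strategies should agree on everything here except the uploads path. A first-match parser hits `Disallow: /wp-content/` before it ever reaches the `Allow` line. If that kind of bot matters to you, one option is to move the `Allow` line above the `Disallow` lines it is meant to override.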
If you take a look at the contents of the robots.txt file for Perishable Press, you’ll notice some additional robots directives that are used to forbid crawl access to stuff like the site’s blackhole for bad bots. Let’s have a look:
User-agent: *
Disallow: /feed/
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: ?wptheme=
Disallow: /blackhole/
Disallow: /transfer/
Disallow: /tweets/
Disallow: /mint/
Allow: /tag/mint/
Allow: /tag/feed/
Allow: /wp-content/online/
Sitemap: http://perishablepress.com/sitemap-perish.xml
Sitemap: http://perishablepress.com/sitemap-press.xml
User-agent: ia_archiver
Disallow: /
Spiders don’t need to be crawling around anything in my /cgi-bin/, so that’s disallowed, as are several non-WP subdirectories such as my Twitter Archive, FTP directory, and Mint installation.
Then I add a few explicit Allow directives to unblock access to specific URLs otherwise disallowed by existing rules. I also declare both of my sitemaps (yes you can have more than one), and then finally completely disallow access to everything for the annoying ia_archiver user-agent.
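As a rough illustration of the per-agent group and the multiple sitemap declarations, the following sketch parses an abbreviated copy of the file above with Python's `urllib.robotparser`. Only a handful of the rules are reproduced, the sample paths are made up, and `site_maps()` requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the rules shown above (not the complete file).
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /blackhole/
Disallow: /mint/
Sitemap: http://perishablepress.com/sitemap-perish.xml
Sitemap: http://perishablepress.com/sitemap-press.xml
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Generic bots fall under the "*" group.
generic_ok = parser.can_fetch("Googlebot", "/about/")
generic_blocked = parser.can_fetch("Googlebot", "/mint/")

# ia_archiver has its own group, which disallows everything.
archiver_ok = parser.can_fetch("ia_archiver", "/about/")

# Sitemap lines are independent of user-agent groups.
sitemaps = parser.site_maps()

print(generic_ok, generic_blocked, archiver_ok)
print(sitemaps)
```

A bot that matches a named group uses that group instead of the `*` group. That is why the single `Disallow: /` is enough to shut out ia_archiver entirely, assuming it obeys.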
Previously on robots.txt
As mentioned, my previous robots.txt file went unchanged for several years (which just vanished in the blink of an eye), but proved quite effective, especially with compliant spiders like googlebot. Unfortunately, it contains language that only a few of the bigger search engines understand (and thus obey):
User-agent: *
Disallow: /mint/
Disallow: /labs/
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Disallow: /transfer/
Disallow: /*/cgi-bin/*
Disallow: /*/blackhole/*
Disallow: /*/trackback/*
Disallow: /*/xmlrpc.php
Allow: /*/20*/wp-*
Allow: /press/feed/$
Allow: /press/tag/feed/$
Allow: /*/wp-content/online/*
Sitemap: http://perishablepress.com/sitemap.xml
User-agent: ia_archiver
Disallow: /
Apparently, the wildcard character isn’t recognized by lesser bots, and I’m thinking that the end-pattern symbol (dollar sign $) is probably not well-supported either, although Google certainly gets it.
These patterns may be better supported in the future, but going forward there is no reason to include them. As with the robots examples above, the same pattern-matching is possible without using wildcards and dollar signs, enabling all compliant bots to understand your crawl preferences.
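To see why wildcard rules get lost on simpler bots, here is a small sketch contrasting two matchers. `prefix_match` treats the rule as a literal path prefix, the way a basic parser would. `wildcard_match` interprets `*` as "any sequence of characters" and a trailing `$` as "end of URL", which is the behavior Google documents. Both function names and the sample paths are invented for illustration.

```python
import re

def prefix_match(rule_path, url_path):
    """Literal prefix comparison: '*' and '$' have no special meaning."""
    return url_path.startswith(rule_path)

def wildcard_match(pattern, url_path):
    """Interpret '*' as any character sequence and a trailing '$' as end-of-URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then join the chunks with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + body + ("$" if anchored else "")
    return re.match(regex, url_path) is not None

old_rule = "/*/*.js$"   # from the previous robots.txt
new_rule = "/mint/"     # plain prefix rule from the current file

print(wildcard_match(old_rule, "/press/js/site.js"))  # wildcard-aware bot
print(prefix_match(old_rule, "/press/js/site.js"))    # prefix-only bot
print(prefix_match(new_rule, "/mint/stats/"))
print(wildcard_match(new_rule, "/mint/stats/"))
```

The old rule only takes effect for the wildcard-aware matcher, while the plain prefix rule means the same thing to both.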
106 Responses
Rick Beckman – February 10, 2011 •
WordPress has a hook to modify the robots.txt data programmatically, I think. Would be nice to have this as a plugin that could be updated as you improve the method. A more advanced plugin would allow for turning rules on & off as desired, adding custom rules, etc.
Jeff Starr – February 10, 2011 •
That’s a great idea, I wish I had the time!
Now that I think about it, I think there is already a plugin that does this to some degree, but if not somebody should definitely do it.
Peter Wilson – February 10, 2011 •
Yep, it’s the do_robots hook.
I add a few rules using a standard plugin I put in the must use directory (/wp-content/mu-plugins/). I might add a few more from Jeff’s post above once I’ve had the chance to consider it.
It’s tuned for a WordPress Network but I’ve added it to pastebin anyway. http://wordpress.pastebin.com/j9W2JYTr
Otto – February 10, 2011 •
Not so sure that blocking the feed is a great move. Google is generally pretty good at parsing feed content.
Jeff Starr – February 10, 2011 •
It’s a close call, with duplicate content vs having your feed indexed. Unless feed content is different than the site content, blocking /feed/ is a good move because it preserves page rank and keeps the focus on the site.
You’ll see in my previous robots file that I allowed the main feed to be indexed. These days however, I’m trying to keep duplicate content down to a minimum.
Kaspars – February 11, 2011
I really doubt that. Google crawls RSS feeds for Google Reader and it knows it’s RSS or ATOM. It even finds your RSS feeds and allows you to add them as sitemaps in Google Webmaster Tools.
And that’s why there is rel="alternate" in the link to the feed (in head).
Jeff Starr – February 11, 2011
Perhaps, but eliminating duplicate content in the search index should take precedence over a bit of convenience in the Webmaster Tools area.
Also, rel="alternate" is meaningless if your feed content is identical to your blog content, which is the case 99.99% of the time.
Rachael – February 10, 2011 •
Yeah, but it’s often duplicate content and you’d prefer someone land on the html article than an xml feed from a search query.
There are other considerations, of course, but that’s typically why I disallow feeds.
jeffM – February 11, 2011 •
Jeff
Now that’s a clean and mean robots.txt file!
Glad you eventually buried the paranoia that must have gripped you back in the day ;)
UR so a dude, dude.
iceflatline – February 15, 2011 •
Great post. Could you elaborate on why you use Disallow: ?wptheme= I’ve not seen that one before. Is it a directive specific to your particular theme?
Jeff Starr – February 15, 2011 •
Yes, the ?wptheme= string is for the WP Theme Switch plugin. The goal of course is to keep duplicate versions of your site out of the search index.
iceflatline – February 16, 2011 •
Jeff, thank you. May I ask one more question? What is the rationale for disallowing xmlrpc.php? My understanding is that it is an API primarily for remote publishing, so I am unclear on what a crawler would glean from it. Thanks in advance for your thoughts…
Jeff Starr – February 16, 2011 •
As far as I know, there is no reason the xmlrpc.php file needs to be crawled and indexed. The API is there for scripts and apps to work with directly. Disallowing robots access in no way affects the xmlrpc functionality.
Yael K. Miller – February 21, 2011 •
Why disallow sitemap? Isn’t the whole point of the sitemap so spiders can crawl it?
Jeff Starr – February 21, 2011 •
Nope, you want spiders to crawl your canonical content: posts, pages, etc.
Unless feed content is different than the site content, blocking /feed/ is a good move because it preserves page rank and keeps the focus on the site.
Yael K. Miller – February 21, 2011
Where do posts and pages actually live? In /wp-content/ ?
Jeff Starr – February 21, 2011
In the database, and the URLs are dynamically generated by WordPress.
Yael K. Miller – February 21, 2011 •
Thanks.
Yael K. Miller – February 21, 2011 •
What’s /wp-content/online/ that you allow it?
Jeff Starr – February 21, 2011 •
The /online/ directory houses demos, scripts, and other assets for articles and such.
Yael K. Miller – February 21, 2011
It’s a directory you created for your own site, correct? It’s not universal for all WP installations.
Jeff Starr – February 21, 2011
Yes, correct.
Yael K. Miller – February 21, 2011 •
If you disallow ia_archiver how is it that according to the SearchStatus Firefox addon, you still have an Alexa rating?
Jeff Starr – February 21, 2011 •
Maybe because they don’t obey robots.txt..? Remember, robots rules are merely suggestions.
Yael K. Miller – February 21, 2011 •
What should the file permission for robots.txt be?
Jeff Starr – February 21, 2011 •
rw- r-- r-- (644) should work fine :)
yhanpolo – February 22, 2011 •
Hi Jeff, thanks for this great post, I have already modified my robots.txt file according to this post. thanks a lot.