I am in the process of migrating my sites from A Small Orange to Media Temple. Part of that process involves canonicalizing domain URLs to help maximize SEO strategy. At ASO, URL canonicalization required just a few htaccess directives:
# enforce no www prefix
<IfModule mod_rewrite.c>
RewriteCond %{HTTP_HOST} !^domain\.tld$ [NC]
RewriteRule ^(.*)$ http://domain.tld/$1 [R=301,L]
</IfModule>
When placed in the web-accessible root directory’s htaccess file, that snippet will ensure that all requests for your site are not prefixed with www. There’s also a force-www technique if that’s how you roll. Either way, the point is that on most shared hosting, URL canonicalization is simple.
Continue Reading
If you’ve been keeping an eye on your 404 errors recently, you will have noticed an increase in requests for nonexistent mobile files and directories, especially over the past year or so. The scripts and bots requesting these files from your server seem to be looking for a mobile version of your site. Unfortunately, they are wasting bandwidth and resources in the process. It has become common to see the following 404 errors constantly repeated in your log files:
http://domain.tld/apple-touch-icon.png
http://domain.tld/iphone
http://domain.tld/mobile
http://domain.tld/mobi
http://domain.tld/m
So some bot comes along, assumes that your site includes a mobile version, and then tries its hand at guessing the location. In the common request-set listed above, we see the bot looking first for an “apple-touch icon,” and then for mobile content in various directories. If this only happens once in awhile, it’s no big deal. But these days I’ve been seeing many different bots requesting these nonexistent resources.
Even worse, these mobile-hungry bots can’t seem to remember where they’ve been – they typically request the same resources repeatedly, and in multiple locations within the directory structure. I frequently see hundreds of these types of requests in my weekly error-log analyses. Needless to say, this is an incredible waste of time, bandwidth, and server resources.
Continue Reading
Jeff Morris recently demonstrated a potential issue with the way WordPress handles multipaged posts and comments. The issue involves WordPress’ inability to discern between multipaged posts and comments that actually exist and those that do not. By redirecting requests for nonexistent numbered pages to the original post, WordPress creates an infinite amount of duplicate content for your site. In this article, we explain the issue, discuss the implications, and provide an easy, working solution.
Understanding the “infinite duplicate content” issue
Using the <!--nextpage--> tag, WordPress makes it easy to split your post content into multiple pages, and also makes it easy to paginate the display of your comment threads. For both paged posts and paged comments, WordPress appends the page number to the permalink. So for example, if we have a post split into 3 pages, WordPress will generate the following set of completely valid permalinks (based on name-only permalink structure):
Continue Reading
A few months ago, many WordPress sites were attacked with some extremely malicious code. While searching for a good solution, I discovered the following gem of a plugin in the pastebin repository:
<?php /* Plugin Name: Block Bad Queries */
if (strlen($_SERVER['REQUEST_URI']) > 255 ||
strpos($_SERVER['REQUEST_URI'], "eval(") ||
strpos($_SERVER['REQUEST_URI'], "base64")) {
@header("HTTP/1.1 414 Request-URI Too Long");
@header("Status: 414 Request-URI Too Long");
@header("Connection: Close");
@exit;
} ?>
This script checks for excessively long request strings (i.e., greater than 255 characters), as well as the presence of either “eval(” or “base64” in the request URI. These sorts of nefarious requests were implicated in the September 2009 WordPress attacks.
Continue Reading
Canonical URLs are important for maintaining consistent linkage, reducing duplicate content issues, and increasing the overall integrity of your site. In addition to cleaning up trailing slashes and removing extraneous index.php and index.html strings, removing the www subdirectory prefix is an excellent way to shorten links and deliver consistent, canonical URLs.
Of course, an optimal way of removing (or adding) the www prefix is accomplished via HTAccess canonicalization:
Continue Reading
When building web pages, it is often necessary to add links that require parameterized query strings. For example, when adding links to the various validation services, you may find yourself linking to an accessibility checker, such as the freely available Cynthia service:
Continue Reading
Simple one for you today. After posting on how to use HTAccess to redirect subordinate URLs to the root (or parent) directory, I thought I would share an alternate way of accomplishing the same trick using PHP. Fortunately, using this PHP redirect technique doesn’t require access to or fiddling with your site’s HTAccess (or Apache configuration) file and it is very easy to implement.
The scene, as discussed in greater detail in my previous article on this topic, involves a very specific type of redirect whereby some target URL is redirected up the directory structure to one of its parent directories. Redirecting such subordinate content generally falls into one of two categories:
Continue Reading
In my previous article on redirecting 404 requests for favicon files, I presented an HTAccess technique for redirecting all requests for nonexistent favicon.ico files to the actual file located in the site’s web-accessible root directory:
# REDIRECT FAVICONZ
<ifmodule mod_rewrite.c>
RewriteCond %{THE_REQUEST} favicon.ico [NC]
RewriteRule (.*) http://domain.tld/favicon.ico [R=301,L]
</ifmodule>
As discussed in the article, this code is already in effect here at Perishable Press, as may be seen by clicking on any of the following links:
Clearly, none of these URL requests target the “real” favicon.ico file, yet thanks to the previous method they are all happily redirected to the proper location. This is useful for a variety of reasons, including preventing excessive and unnecessary server strain due to malicious scripts.
Continue Reading
For the last several months, I have been seeing an increasing number of 404 errors requesting “favicon.ico” appended onto various URLs:
http://perishablepress.com/press/favicon.ico
http://perishablepress.com/press/2007/06/12/favicon.ico
http://perishablepress.com/press/2007/09/25/absolute-horizontal-and-vertical-centering-via-css/favicon.ico
http://perishablepress.com/press/2007/08/01/temporary-site-redirect-for-visitors-during-site-updates/favicon.ico
http://perishablepress.com/press/2007/01/16/maximum-and-minimum-height-and-width-in-internet-explorer/favicon.ico
When these errors first began appearing in the logs several months ago, I didn’t think too much of it — “just another idiot who can’t find my site’s favicon..” As time went on, however, the frequency and variety of these misdirected requests continued to increase. A bit frustrating perhaps, but not serious enough to justify immediate action. After all, what’s the worst that can happen? The idiot might actually find the blasted thing? Wouldn’t that be nice..
Continue Reading
I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 Errors generated from msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following:
- http://perishablepress.com/press/page/2/?tag=spam
- http://perishablepress.com/press/page/3/?tag=code
- http://perishablepress.com/press/page/2/?tag=email
- http://perishablepress.com/press/page/2/?tag=xhtml
- http://perishablepress.com/press/page/4/?tag=notes
- http://perishablepress.com/press/page/2/?tag=flash
- http://perishablepress.com/press/page/2/?tag=links
- http://perishablepress.com/press/page/3/?tag=theme
- http://perishablepress.com/press/page/2/?tag=press
..plus hundreds and hundreds more 1. The URL pattern is always the same: a different page number followed by a query string containing one of the tags used here at Perishable Press, for example: “/?tag=something”. The problem is that there are no such links anywhere on the site. The site employs permalink format for all WordPress-generated links (e.g., post/page links, search queries, tag queries, etc.). In an effort to locate the source of these URLs, I have performed multiple, thorough searches through every aspect of my site (files, database, code, examples, etc.), and the results are always the same: nothing. UGH! What is the source of these misguided URLs? Hopefully, you will be able to help shed some light on this unexplained crawl behavior (please, I beg you!!).
Continue Reading
During my previous rendezvous involving comprehensive canonicalization for WordPress, I offer my personally customized technique for ensuring consistently precise and accurate URL delivery. That particular method targets WordPress exclusively (although the logic could be manipulated for general use), and requires a bit of editing to adapt the code to each particular configuration. In this follow-up tutorial, I present a basic www-canonicalization technique that accomplishes the following:
- requires or removes the
www prefix for all URLs
- absolutely no editing when requiring the
www prefix
- minimal amount of editing when removing the
www prefix
- minimal amount of code used to execute either technique
I have found this “universal” www-canonicalization technique extremely useful in its simplicity and elegance. Especially when requiring the www prefix, nothing could be easier: simply copy, paste, done — absolutely no hard-coding necessary!
Continue Reading
For future reference, this article covers each of the many ways to access your WordPress-generated feeds. Several different URL formats are available for the various types of WordPress feeds — posts, comments, and categories — for both permalink and default URL structures. For each example, replace “http://domain.tld/” with the URL of your blog. Note: even though your blog’s main feed is accessible through many different URLs, there are clear benefits to using a single, consistent feed URL throughout your site.
Continue Reading
How to streamline and maximize the effectiveness of your WordPress URLs by using htaccess to remove extraneous post-date information: years, months, and days..
Recently, there has been much discussion about whether or not to remove the post-date information from WordPress permalinks 1. Way back during the WordPress 1.2/1.5 days, URL post-date inclusion had become very popular, in part due to reports of potential conflicts with post-name-only permalinks. Throw in the inevitable “monkey-see, monkey-do” mentality typical of many bloggers, and suddenly an entire wave of WordPressers had adopted the following permalink structure:
/%year%/%monthnum%/%day%/%postname%/
The benefits of using this format are primarily organizational in nature. Post-date information that is “built-in” to every URL provides immediate, “at-a-glance” knowledge of post “freshness”. Looking ahead ten, twenty or even a hundred years into the future of the blogosphere, there will be trillions of posts and articles, each with their own unique URL. Archived copies of content may or may not include creation date: dynamically archived pages require deliberate database queries, while those archived statically may no longer have access to post-date data. Including post dates in permalinks provides permanent, facilitative record of content origination. Needless to say, most adopters of dated permalinks probably jump on board because the WordPress Admin makes it super-easy to follow the crowd.
Continue Reading
Permalink URL canonicalization is automated via PHP in WordPress 2.3+, however, for those of us running sites on pre-2.3 versions or preferring to deal with rewrites directly via Apache, comprehensive WordPress URL canonicalization via htaccess may seem impossible. While there are several common methods that are partially effective, there has not yet been available a complete, user-friendly solution designed specifically for WordPress. Until now..
In this article, I share my “secret” htaccess URL canonicalization formula. I originally developed this method in July of 2007, and have been using it successfully on a variety of WordPress-powered sites since that time. Thus, having verified the effective performance of this technique, I feel confident in sharing it with the open-source community. But first, a bit of context..
Continue Reading
While planning my current site renovation project, I considered changing the format of my permalinks. Reasons for modifying the permalink structure of a site include:
- Optimizing URLs for the search engines
- Simplifying URL structure for improved readability
- Removing the implication that your site content is somehow organized chronologically
- Removing other unwanted organizational implications (e.g., categorically, topically, etc.)
Like many people who configured WordPress permalinks a couple of years ago, I chose to include the day, month, and year along with the blog URL and post title. For over two years now, Perishable Press has employed the following permalink format:
Continue Reading
Recently, while deliberating an optimal method for eliminating nofollow link attributes from Perishable Press, I collected, installed, tested and reviewed every WordPress no-nofollow/dofollow plugin that I could find. As of the writing of this post, I have evaluated 12 15 dofollow plugins, all of which are freely available on the Internet.
In this article, I present a concise, current, and comprehensive reference for WordPress no-nofollow and dofollow plugins. Every attempt has been made to provide accurate, useful, and complete information for each of the plugins represented below. Further, as this subject is a newfound interest of mine, it is my intention to keep this post updated with fresh information, so please bookmark it for future reference. Finally, please help expand/enhance this list by dropping any relevant information via comment area below. Thanks & enjoy!
Continue Reading
Recently, after researching comment links for an upcoming article, I realized that my default <input> values were being submitted as the URL for all comments left without associated website information. During the most recent site redesign, I made the mistake of doing this in comments.php:
...
<input class="input" name="url" id="url" value="[website]" onfocus="this.select();" type="text" tabindex="3" size="44" maxlength="133" alt="website" />
...
Notice the value="[website]" attribute? It seemed like a good idea at the time — I even threw in a nice onfocus auto-highlighting snippet for good measure. I ran the form with this in place for around eight weeks before finally noticing multiple comments using this for their site URL:
Continue Reading
![[ Image: Harvesting the Land ]](http://perishablepress.com/press/wp-content/images/2007/misc-chunks/harvesting-com-logs.gif)
Harvesting Raw Logs For those of us using cPanel as the control panel for our websites, a wealth of information is readily available via cPanel ‘Raw Access Logs’. These logs are perpetually updated with data involving user agents, IP addresses, HTTP activity, resource access, and a whole lot more. Here is a quick tutorial on accessing and interpreting your cPanel raw access logs.
Part One: Grab ‘em
To grab a copy of your raw access logs, log into cPanel and click on the "Raw Access Logs" icon. Within the Raw Access Log interface, scroll through the list of available log files and download the raw access log(s) of your choice.
Exit cPanel and navigate to your local copy of the raw access log, which should have been downloaded as a zipped/g-zipped file (i.e., .zip or .gz file extension), with a name similar to accesslog_your-domain.com_4_20_2007.gz.
Unzip the file and extract its contents, which should be a single file named your-domain.com. Rename the file by appending a .log or .txt extension to the file name. Alternatively, if the file is not named with a .com, .net, or .whatever extension, no rename is necessary, as it also may be opened via right-click » ‘Open With…’.
That’s all there is to it. If you understand how to interpret the contents of your Raw Access Log, you’re solid gold, baby. Otherwise, continue reading for a breif tutorial to get you started with the basics..
Continue Reading
URL’s frequently employ potentially conflicting characters such as question marks, ampersands, and pound signs. Fortunately, it is possible to encode such characters via their escaped hexadecimal ASCII representations. For example, we would write "?" as "%3F". Here are a few more URL character codes (case-insensitive):
Continue Reading