I recently launched a simple one-page website called Wutsearch. Wutsearch is a search-engine launchpad that I use for my homepage. Usually I work with several browsers open, each set to open multiple tabs on startup. For a long time, all those tabs were set to the Google homepage. Then DuckDuckGo. Now, it’s Wutsearch 🚀 Continue reading »
This post is about how I cleaned up an incorrect URL in the Google search results. My business site is basically a one-page portfolio site, located at the URL https://monzillamedia.com/. But in the Google search results, the URL was showing as https://monzilla.biz/, which did not exist. So all potential customers were getting an error page. Fortunately I was able to re-acquire the monzilla.biz domain and redirect all traffic to monzillamedia.com. Continue reading »
These days, in this crazy world, it makes sense to keep local archives of any critical online data. That way, when the Internet is not working (for whatever reason), you still have access to your important info and data. For those who are listening and interested in being prepared, here is the quickest, easiest way that I have found to archive complete offline copies of websites to your Mac (or iOS). It makes use of a free […] Continue reading »
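The excerpt elides the specific free tool the post describes, but as a general illustration of the idea, a site can be mirrored for offline viewing from the Mac terminal with wget (a hypothetical stand-in here, not necessarily the tool the post recommends):

    # Mirror a site for offline browsing; example.com is a placeholder
    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/

The --convert-links flag rewrites internal links so the archived copy remains browsable with no Internet connection.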
A while ago, I was confused by repetitive 404 “Not Found” errors in my server logs. The 404 requests look like someone is typing out various words, a few letters at a time. This post shows what these weird 404s look like from the server’s perspective, and then goes on to explain why they happen and why there is no practical way of preventing them. Continue reading »
Here is a working list of all user agents for the major search engines. I use this information frequently in my plugins, such as Blackhole for Bad Bots and BBQ Pro, so I figured it would be useful to post it online for the benefit of others. Having the user agents for these popular bots all in one place helps to streamline my development process. Each search engine entry includes references and a regex pattern to match all known […] Continue reading »
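As a rough sketch of what such a regex pattern might look like in practice (this is an invented illustration, not the actual pattern from the post or plugins):

    <?php
    // Hypothetical example: case-insensitive match against a few
    // well-known search-engine user agents.
    $pattern = '/googlebot|bingbot|slurp|duckduckbot|baiduspider|yandex/i';
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (preg_match($pattern, $ua)) {
        // Request appears to come from a major search-engine bot.
    }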
There are several ways to instruct Google to stay away from various pages on your site: robots.txt directives, nofollow attributes on links, meta noindex/nofollow directives, X-Robots noindex/nofollow directives, and so on. These directives all function in different ways, but they all serve the same basic purpose: control how Google crawls the various pages on your site. For example, you can use meta noindex to instruct Google not to index your sitemap, RSS feed, or any other page you wish. This […] Continue reading »
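For instance, a robots.txt rule is the simplest of these directives (the path below is hypothetical):

    User-agent: *
    Disallow: /private/

Note that robots.txt controls crawling while the meta and X-Robots noindex directives control indexing; a page blocked via robots.txt can still show up in search results if other sites link to it.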
One way to prevent Google from crawling certain pages is to use <meta /> elements in the <head></head> section of your web documents. For example, if I want to prevent Google from indexing and archiving a certain page, I would add the following code to the head of my document: Continue reading »
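The excerpt cuts off before the snippet itself, but based on the description (prevent indexing and archiving), the code would be something along these lines:

    <meta name="googlebot" content="noindex,noarchive" />

Here noindex tells Google not to include the page in its index, and noarchive tells it not to keep a cached copy.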
I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our sites every minute of the day. For the most part, there are effective methods available enabling […] Continue reading »
I recently added OpenSearch functionality to Perishable Press. Now, OpenSearch-enabled browsers such as Firefox and IE 7 give users the option to add an exclusive OpenSearch-powered search option for Perishable Press to their browser’s built-in search feature. The autodiscovery feature of supporting browsers detects the custom search protocol and enables users to easily add it to their collection of readily available site-specific search options. Users may then search the entire Perishable Press domain with the click of a button. […] Continue reading »
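Autodiscovery works via a link element in the document head that points to an OpenSearch description file (the file name and URL below are illustrative):

    <link rel="search" type="application/opensearchdescription+xml" title="Perishable Press" href="/opensearch.xml" />

The referenced XML file then declares the search template, e.g. using a WordPress-style search query:

    <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
       <ShortName>Perishable Press</ShortName>
       <Description>Search Perishable Press</Description>
       <Url type="text/html" template="https://perishablepress.com/?s={searchTerms}"/>
    </OpenSearchDescription>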
There are two possibilities here: either Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying robots directives. Neither case is good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap: Continue reading »
Hmmm… Let’s see here. Google can do it. MSN/Live can do it. Even Ask can do it. So why oh why can’t Yahoo!’s grubby Slurp crawler manage to adhere to robots.txt crawl directives? Just when I thought Yahoo! finally figured it out, I discover more Slurp tracks in my Blackhole trap for bad spiders: Continue reading »
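For context, the kind of robots.txt directive that Slurp is expected to honor looks like this (the trap’s actual path is not shown in the excerpt, so the one below is a placeholder):

    User-agent: Slurp
    Disallow: /blackhole/

A compliant crawler reads these rules before requesting anything else on the site, and simply never touches the disallowed path.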
While writing my previous article on creating the perfect WordPress title tags, I deliberately avoided discussing the use of separators in titles. I feel that the topic is worthy of its own article, enabling a more thorough exploration of the details. Title separators are the symbols, punctuation, and other characters used to distinguish between various parts of the page title. For example, a title may include the blog name, post title and blog description, with each element separated by a […] Continue reading »
When frustration builds and finally reaches the boiling point, it’s nice to be able to express yourself to someone. I really don’t enjoy ranting about things, but when it comes to certain aspects of Yahoo!, I just can’t he’p myse’f. So, thanks to a recent attempt at using My Yahoo!, it’s time to get some of this off my chest, clear the decks, and give Yahoo! (yet another) chance to clean up its act. Here are a few complaints […] Continue reading »
Controlling the spidering, indexing and caching of your (X)HTML-based web pages is possible with meta robots directives such as these:

    <meta name="googlebot" content="index,archive,follow,noodp"/>
    <meta name="robots" content="all,index,follow"/>
    <meta name="msnbot" content="all,index,follow"/>

I use these directives here at Perishable Press and they continue to serve me well for controlling how the “big bots” crawl and represent my (X)HTML-based content in search results. For other, non-(X)HTML types of content, however, using meta robots directives to control indexing and caching is not an option. An […] Continue reading »
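One standard way to handle non-(X)HTML content is the X-Robots-Tag HTTP header. A minimal sketch using Apache’s mod_headers in .htaccess (the file extensions here are chosen for illustration):

    # Requires mod_headers; sends noindex/noarchive for matching files
    <FilesMatch "\.(pdf|jpe?g|png|gif)$">
       Header set X-Robots-Tag "noindex, noarchive"
    </FilesMatch>

Because the directive travels as an HTTP header rather than markup, it works for PDFs, images, and any other resource that has no head section to put a meta element in.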
Yup, ol’ Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo! has been caught behaving badly. This time, Yahoo! was caught trespassing five different times via three different IPs over the course of four different days. Here is the data recorded in my site’s blackhole log (I know, that sounds terrible): Continue reading »
In this article, I discuss how to get the most out of your site’s images by optimizing them for both people and search engines. For many sites, images play an important role in the communication process. If used correctly, images have the power to make your articles come alive with clarity and vibrancy. Some visitors may merely notice an image and continue reading, while others will want to know more about your images and dig deeper. While checking out your […] Continue reading »
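A minimal sketch of the kind of markup that serves both audiences (the file name and alt text are invented for illustration):

    <img src="/images/perch-fishing-dawn.jpg" alt="Angler fishing for perch at dawn" width="600" height="400" />

A descriptive file name and accurate alt text give search engines context about the image, while the alt text also serves visitors using screen readers or browsing with images disabled.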