I recently launched a simple one-page website called Wutsearch. Wutsearch is a search-engine launchpad that I use as my homepage. Usually I work with several browsers open, each set to open multiple tabs on startup. For a long time, all those tabs were set to the Google homepage. Then DuckDuckGo. Now, it’s Wutsearch 🚀 Continue reading »
Lately I’ve been getting a significant number of really weird 404 requests for one of my sites. At first I ignored them. Then upon closer inspection, I realized that the requests were reporting user agents such as Googlebot, Bingbot, and those of other top search engines. So there was cause for concern. You don’t want legitimate search engines tripping over endless 404 requests that are completely unrelated to your site content. That gets into “negative SEO” territory and should be investigated and resolved […] Continue reading »
Here is a working list of all user agents for the top search engines. I use this information frequently for my plugins, such as Blackhole for Bad Bots and BBQ Pro, so I figured it would be useful to post it online for the benefit of others. Having the user agents for these popular bots all in one place helps to streamline my development process. Each search engine includes references and a regex pattern to match all known […] Continue reading »
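For illustration only (the full article lists each engine’s agents with references and complete patterns), a combined pattern along these lines can flag the common crawlers in Apache; the alternation here is a partial sketch and the environment-variable name is made up:

    # Partial sketch: case-insensitive match against a few well-known crawler user agents
    SetEnvIfNoCase User-Agent "(googlebot|bingbot|slurp|duckduckbot|baiduspider)" is_search_bot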
One way to prevent Google from crawling certain pages is to use <meta /> elements in the <head></head> section of your web documents. For example, if I want to prevent Google from indexing and archiving a certain page, I would add the following code to the head of my document: Continue reading »
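The exact snippet is behind the link, but the markup in question is likely something along these lines, a meta robots directive aimed at Google’s crawler:

    <!-- ask Google's crawler not to index or archive this page -->
    <meta name="googlebot" content="noindex,noarchive">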
I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our sites every minute of the day. For the most part, there are effective methods available enabling […] Continue reading »
There are two possibilities here: Yahoo!’s Slurp crawler is broken, or Yahoo! lies about obeying robots directives. Neither case is good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap: Continue reading »
A great way to improve your CSS skills is to check out the stylesheets used by other websites. Digging behind the scenes and exploring some applied CSS provides new ideas and insights about everything from specificity and formatting to hacks and shortcuts. Learning CSS by reading about ideal cases and theoretical applications is certainly important, but actually seeing how the language is applied in “real-world” scenarios provides first-hand knowledge and insight. While there are millions of standards-based, CSS-designed websites to […] Continue reading »
Hmmm.. Let’s see here. Google can do it. MSN/Live can do it. Even Ask can do it. So why oh why can’t Yahoo’s grubby Slurp crawler manage to adhere to robots.txt crawl directives? Just when I thought Yahoo! finally figured it out, I discover more Slurp tracks in my Blackhole trap for bad spiders: Continue reading »
When frustration builds and finally reaches the boiling point, it’s nice to be able to express yourself to someone. I really don’t enjoy ranting about things, but when it comes to certain aspects of Yahoo!, I just can’t he’p myse’f. So, thanks to a recent attempt at using My Yahoo!, it’s time to get some of this off my chest, clear the decks, and give Yahoo! (yet another) chance to clean up its act. Here are a few complaints […] Continue reading »
I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 errors generated from msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following:

https://example.com/press/page/2/?tag=spam
https://example.com/press/page/3/?tag=code
https://example.com/press/page/2/?tag=email
https://example.com/press/page/2/?tag=xhtml
https://example.com/press/page/4/?tag=notes
https://example.com/press/page/2/?tag=flash
https://example.com/press/page/2/?tag=links
https://example.com/press/page/3/?tag=theme
https://example.com/press/page/2/?tag=press

Note: For these example URLs, I replaced my domain, perishablepress.com, with the generic example.com. Turns out that listing the plain-text […] Continue reading »
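For what it’s worth, one hypothetical way to neutralize these requests while the root cause is tracked down (this is a sketch, not necessarily the fix the full article arrives at) is to redirect the paged tag-query URLs back to their query-free counterparts via .htaccess:

    # Hypothetical mitigation: strip lone ?tag= query strings from paged /press/ URLs
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{QUERY_STRING} ^tag=[^&]+$
    RewriteRule ^press/page/(\d+)/$ /press/page/$1/? [R=301,L]
    </IfModule>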
Yup, ol’ Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five different times via three different IPs over the course of four different days. Here is the data recorded in my site’s blackhole log (I know, that sounds terrible): Continue reading »
Okay, I realize that the title sounds a bit odd, but nowhere near as odd as my recent discovery of Slurp ignoring explicit robots.txt rules and digging around in my highly specialized bot trap, which I have lovingly dubbed “the blackhole”. What is up with that, Yahoo!? Does your Slurp spider obey robots.txt directives or not? I have never seen Google crawling around that side of town, nor have MSN or even Ask ventured into the forbidden realms. Has […] Continue reading »
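For reference, the kind of directive being ignored is an ordinary disallow rule; a sketch with an illustrative path (the trap’s real location is not published):

    User-agent: *
    Disallow: /blackhole/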
By now, everyone is familiar with the Yahoo Developer Network’s 14 best-practices for speeding up your website. Certainly, many (if not all) of these performance optimization tips are ideal for high-traffic sites such as Yahoo or Google, but not all of them are recommended for smaller sites such as Perishable Press. Nonetheless, throughout the current site renovation project, I have attempted to implement as many of these practices as possible. At the time of this writing, I somehow have managed […] Continue reading »
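As one concrete example, the “Add an Expires header” rule translates into a few lines of Apache configuration; a minimal sketch, assuming mod_expires is available, with example lifetimes:

    # Far-future Expires headers for static assets
    <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png "access plus 1 month"
    ExpiresByType text/css "access plus 1 week"
    </IfModule>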
Keeping track of your access and error logs is a critical component of any serious security strategy. Many times, you will see a recorded entry that looks legitimate and is easily dismissed as genuine Google fare, only to discover upon closer investigation that the agent is fraudulent. There are many such cloaked or disguised agents crawling around these days, mimicking various search engines to stay beneath the radar. So it’s always a good idea to implement a procedure for […] Continue reading »
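One common verification procedure (possibly the sort of thing the full article describes) is a reverse-then-forward DNS lookup on the requesting IP; a command-line sketch, using an example address from Google’s crawl range:

    # Reverse lookup: a genuine Googlebot IP should resolve to a googlebot.com hostname
    host 66.249.66.1
    # Forward lookup on the returned hostname should point back to the same IP
    host crawl-66-249-66-1.googlebot.com

If the two lookups don’t agree, the “Googlebot” in your logs is an impostor.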
Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces. Of course, this policy presents a bit of a dilemma when the culprit is one of the four major search engines. Slamming the door on […] Continue reading »
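The blacklisting itself is usually just a few lines of .htaccess; a minimal sketch using Apache 2.2 syntax and an address from the reserved documentation range:

    # Hypothetical example: slam the door on a single bad IP
    Order Allow,Deny
    Allow from all
    Deny from 203.0.113.10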
About the Robots Exclusion Standard: The robots exclusion standard, or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are specified in a file called robots.txt in the top-level directory of the website. Notes on the robots.txt rules: rules of specificity apply, not inheritance, and you should always include a blank line between rules. Note also that not all robots […] Continue reading »
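To make those notes concrete, here is a small illustrative robots.txt (the paths are placeholders): Googlebot obeys only its own, more specific record rather than inheriting from the wildcard record, and the two records are separated by a blank line:

    User-agent: Googlebot
    Disallow: /private/

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/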