New book on WordPress Theme Development: WordPress Themes In Depth
ip
Tag Archive

Yahoo! Lies about Obeying Robots.txt Directives

There are two possibilities here: Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying Robots directives. Either case isn’t good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap: Read more »

Yahoo! Once Again Caught Disobeying Robots.txt Rules

Hmmm.. Let’s see here. Google can do it. MSN/Live can do it. Even Ask can do it. So why oh why can’t Yahoo’s grubby Slurp crawler manage to adhere to robots.txt crawl directives? Just when I thought Yahoo! finally figured it out, I discover more Slurp tracks in my Blackhole trap for bad spiders: Read more »

Unexplained Crawl Behavior Involving Tagged Query Strings

I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 Errors generated from msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following: http://perishablepress.com/press/page/2/?tag=spam http://perishablepress.com/press/page/3/?tag=code http://perishablepress.com/press/page/2/?tag=email http://perishablepress.com/press/page/2/?tag=xhtml http://perishablepress.com/press/page/4/?tag=notes http://perishablepress.com/press/page/2/?tag=flash http://perishablepress.com/press/page/2/?tag=links http://perishablepress.com/press/page/3/?tag=theme http://perishablepress.com/press/page/2/?tag=press ..plus hundreds and hundreds more 1. The URL pattern is always the same: a different page number followed by a query string containing one of the tags used here at Perishable Press, for example: “/?tag=something”. The problem is that there are […] Read more »

Series Summary: Building the 3G Blacklist

In the now-complete series, Building the 3G Blacklist, I share insights and discoveries concerning website security and protection against malicious attacks. Each article in the series focuses on unique blacklist strategies designed to protect sites transparently, effectively, and efficiently. The five articles culminate in the release of the next generation 3G Blacklist. For the record, here is a quick summary of the entire Building the 3G Blacklist series: Read more »

Building the 3G Blacklist, Part 5: Improving Site Security by Selectively Blocking Individual IPs

In this continuing five-article series, I share insights and discoveries concerning website security and protecting against malicious attacks. Wrapping up the series with this article, I provide the final key to our comprehensive blacklist strategy: selectively blocking individual IPs. Previous articles also focus on key blacklist strategies designed to protect your site transparently, effectively, and efficiently. In the next article, these five articles will culminate in the release of the next generation 3G Blacklist. Improving Site Security by Selectively Blocking Individual IPs The final component of the 3G Blacklist establishes a vehicle through which individual IPs may be blocked. As […] Read more »

Over 150 of the Worst Spammers, Scrapers and Crackers from 2007

Update 2010/07/07: Please visit the 2010 IP Blacklist for more current information. Over the course of each year, I blacklist a considerable number of individual IP addresses. Every day, Perishable Press is hit with countless numbers of spammers, scrapers, crackers and all sorts of other hapless turds. Weekly examinations of my site’s error logs enable me to filter through the chaff and cherry-pick only the most heinous, nefarious attackers for blacklisting. Minor offenses are generally dismissed, but the evil bastards that insist on wasting resources running redundant automated scripts are immediately investigated via IP lookup and denied access via simple […] Read more »

Yahoo! Slurp in My Blackhole (Yet Again)

Yup, ‘ol Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five different times via three different IPs over the course of four different days. Here is the data recorded in my site’s blackhole log (I know, that sounds terrible): Read more »

A Dramatic Week Here at Perishable Press..

..And we’re back. After an insane week spent shopping for a new host, dealing with some Bad Behavior, and transferring Perishable Press to its new home on a virtual private server (VPS), everything is slowly falling back into place. Along the way, there have been some interesting challenges and many lessons learned. Here are a few of the highlights.. The tide may be turning for A Small Orange I am certainly not alone when I say that shopping for a new hosting provider and transferring websites is one of my least favorite aspects of web development. In my experience, switching […] Read more »

Yahoo! in my Blackhole

Okay, I realize that the title sounds a bit odd, but nowhere near as odd as my recent discovery of Slurp ignoring explicit robots.txt rules and digging around in my highly specialized bot trap, which I have lovingly dubbed “the blackhole”. What is up with that, Yahoo!? — does your Slurp spider obey robots.txt directives or not? I have never seen Google crawling around that side of town, neither has MSN nor even Ask ventured into the forbidden realms. Has anyone else experienced such unexpected behavior from one the four major search engines? Hmmm.. let’s dig a little further.. Here […] Read more »

How to Verify the Four Major Search Engines

Keeping track of your access and error logs is a critical component of any serious security strategy. Many times, you will see a recorded entry that looks legitimate, such that it may easily be dismissed as genuine Google fare, only to discover upon closer investigation a fraudulent agent. There are many such cloaked or disguised agents crawling around these days, mimicking various search engines to hide beneath the radar. Thus, it is a good idea to implement a procedure for scanning and checking select agents for authenticity. In general, the verification process involves a “forward/reverse” DNS lookup, which is then […] Read more »

Temporary Site Redirect for Visitors during Site Updates

In our article Stupid htaccess Tricks, we present the htaccess code required for redirecting visitors temporarily during periods of site maintenance. Although the article provides everything needed to implement the temporary redirect, I think readers would benefit from a more thorough examination of the process — nothing too serious, just enough to get it right. After discussing temporary redirects via htaccess, I’ll also explain how to accomplish the same thing using only PHP. Read more »

WP-ShortStat Slowing Down Root Index Pages

For over a year now, I have been using Markus Kämmerer’s (Happy Arts Blog) WP-ShortStat plugin for WordPress. The plugin is relatively well-maintained and remains one of my favorite admin tools. Great for popping in on stats without logging into Mint. Nonetheless, due to its IP/country-detection functionality, WP-ShortStat has experienced its share of difficulties (e.g., read through the change log on the plugin’s home page). In this article, I describe how WP-Shortstat slows down the root index-page of a site, and then suggest a (temporary) fix for the issue. Read more »

How to Block IP Addresses with PHP

Figuratively speaking, hunting down and killing spammers, scrapers, and other online scum remains one of our favorite pursuits. Once we have determined that a particular IP address is worthy of banishment, we generally invoke the magical powers of htaccess to lock the gates. When htaccess is not available, we may summon the versatile functionality of PHP to get the job done. This method is relatively straightforward. Simply edit, copy and paste the following code example into the top of any PHP for which you wish to block access: Read more »

Harvesting cPanel Raw Access Logs

Harvesting Raw Logs For those of us using cPanel as the control panel for our websites, a wealth of information is readily available via cPanel ‘Raw Access Logs’. These logs are perpetually updated with data involving user agents, IP addresses, HTTP activity, resource access, and a whole lot more. Here is a quick tutorial on accessing and interpreting your cPanel raw access logs. Part One: Grab ‘em To grab a copy of your raw access logs, log into cPanel and click on the "Raw Access Logs" icon. Within the Raw Access Log interface, scroll through the list of available log […] Read more »

Disobedient Robots and Company

In our never-ending battle against spammers, leeches, scrapers, and other online undesirables, we have implemented several powerful security measures to improve the operational integrity of our perpetual virtual existence. Here is a rundown of the new behind-the-scenes security features of Perishable Press: Automated spambot trap, designed to identify bots (and/or stupid people) that disobey rules specified in the site’s robots.txt file. Automated disobedient-robot identification (via reverse IP lookup), admin-notification (via email) and blacklist inclusion (via htaccess). Automated inclusion of disobedient robot identification on our now public "Disobedient Robots" page. Imroved htaccess rules, designed to eliminate scum-sucking worms and other useless […] Read more »

Latest Tweets Plugin update! User Submitted Posts now w/ better performance, custom form templates and 14 new action/filter hooks: bit.ly/1CBVbPL