Tag Archive: robots

Suspicious Behavior from Yahoo! Slurp Crawler

Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces. Of course, this policy presents a bit of a dilemma when the culprit is one of the four major search engines. Slamming the door on Yahoo! would be unwise, but if their Slurp crawler continues behaving suspiciously, I may have no choice. Check out the […] Read more »

Invite Only: Visitor Exclusivity via the Opt-In Method

Web developers trying to control comment spam, bandwidth theft, and content scraping must choose between two fundamentally different approaches: selectively deny targeted offenders (the "blacklist" method) or selectively allow desirable agents (the "opt-in", or "whitelist", method). Judging by various online forums and discussion boards, the blacklist method is currently the more popular of the two. It requires the webmaster to create and maintain a working list of undesirable agents, usually blocking their access via htaccess or PHP. The downside of blacklisting is that staying current with the ever-growing number of evolving threats requires considerable effort and exceedingly long lists to remain effective. […] Read more »
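To make the contrast concrete, here are two rough (and mutually exclusive) htaccess sketches; the agent names are placeholders rather than a recommended list, so use either block on its own — the whitelist alternative is commented out here:

    # Blacklist: allow everyone, deny known-bad agents (names are placeholders)
    SetEnvIfNoCase User-Agent "EvilScraper" bad_agent
    Order Allow,Deny
    Allow from all
    Deny from env=bad_agent

    # Whitelist (opt-in): deny everyone, allow only trusted agents (names are placeholders)
    # SetEnvIfNoCase User-Agent "(Mozilla|Googlebot|Slurp)" good_agent
    # Order Deny,Allow
    # Deny from all
    # Allow from env=good_agent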

Disobedient Robots and Company

In our never-ending battle against spammers, leeches, scrapers, and other online undesirables, we have implemented several powerful security measures to improve the operational integrity of our perpetual virtual existence. Here is a rundown of the new behind-the-scenes security features of Perishable Press: an automated spambot trap, designed to identify bots (and/or stupid people) that disobey the rules specified in the site’s robots.txt file; automated disobedient-robot identification (via reverse IP lookup), admin notification (via email), and blacklist inclusion (via htaccess); automated listing of identified disobedient robots on our now-public "Disobedient Robots" page; and improved htaccess rules, designed to eliminate scum-sucking worms and other useless […] Read more »
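For the curious, the blacklist-inclusion step amounts to appending something like the following to htaccess whenever the trap flags an offender; the address below is just a placeholder from the documentation range, not an actual banned visitor:

    # Deny an IP flagged by the spambot trap (placeholder address)
    <Limit GET POST PUT>
     Order Allow,Deny
     Allow from all
     Deny from 192.0.2.1
    </Limit>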

Stop Bitacle from Stealing Content

If you have yet to encounter the content-scraping site, bitacle.org, consider yourself lucky. The scum-sucking worm-holes at bitacle.org are well known for literally (404 link removed 2013/03/28), blatantly, and piggishly stealing blog content and using it for financial gain through advertising. While I am not here to discuss the legal, philosophical, or technical ramifications of Bitacle’s illegal behavior, I am here to provide a few critical tools that will help stop Bitacle from stealing your content. The htaccess Finger: Perhaps the most straightforward and effective method for keeping the Bitacle thieves away from your site is adding the following htaccess rules to […] Read more »
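To give a rough idea before you click through, the rules boil down to something like this mod_rewrite sketch; the matching conditions here are illustrative, not the article’s exact ruleset:

    # Send Bitacle requests packing (illustrative conditions only)
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} bitacle [NC,OR]
    RewriteCond %{HTTP_REFERER} bitacle\. [NC]
    RewriteRule .* - [F,L]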

Robots Notes Plus

About the Robots Exclusion Standard: The robots exclusion standard, or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are specified in a file called robots.txt in the top-level directory of the website. Notes on the robots.txt Rules: Rules of specificity apply, not inheritance. Always include a blank line between rules. Note also that not all robots obey the robots rules; even Google has been reported to ignore certain robots rules. Also, comments are allowed […] Read more »
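By way of a quick, hypothetical example, a robots.txt illustrating comments, the blank line between rules, and a more specific rule for one robot (paths and robot names are placeholders):

    # keep crawlers out of the private area (comments are allowed)
    User-agent: *
    Disallow: /private/

    # a more specific rule for one robot; note the blank line separating rules
    User-agent: Googlebot
    Disallow: /private/temp/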
