Perishable Press

WordPress, Web Design, Code & Tutorials

robots tag archive

Suspicious Behavior from Yahoo! Slurp Crawler

Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces. Of course, this policy presents a bit of a dilemma when the culprit is one of the four major search engines. Slamming [...] • Read more »

Invite Only: Visitor Exclusivity via the Opt-In Method

Web developers trying to control comment-spam, bandwidth-theft, and content-scraping must choose between two fundamentally different approaches: selectively deny target offenders (the "blacklist" method) or selectively allow desirable agents (the "opt-in", or "whitelist" method). Currently popular according to various online forums and discussion boards is the blacklist method. The blacklist method requires the webmaster to create and maintain a working list of undesirable agents, usually blocking their access via htaccess or php. The downside of "blacklisting" is that [...] • Read more »

Disobedient Robots and Company

In our never-ending battle against spammers, leeches, scrapers, and other online undesirables, we have implemented several powerful security measures to improve the operational integrity of our perpetual virtual existence. Here is a rundown of the new behind-the-scenes security features of Perishable Press: Automated spambot trap, designed to identify bots (and/or stupid people) that disobey rules specified in the site’s robots.txt file. Automated disobedient-robot identification (via reverse IP lookup), admin-notification (via email) and blacklist inclusion (via htaccess). Automated [...] • Read more »

Stop Bitacle from Stealing Content

If you have yet to encounter the content-scraping site, bitacle.org, consider yourself lucky. The scum-sucking worm-holes at bitacle.org are well-known for literally (404 link removed 2013/03/28), blatantly, and piggishly stealing blog content and using it for financial gains through advertising. While I am not here to discuss the legal, philosophical, or technical ramifications of illegal bitacle behavior, I am here to provide a few critical tools that will help stop bitacle from stealing your content. The htaccess [...] • Read more »

Robots Notes Plus

About the Robots Exclusion Standard 1: The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website. Notes on the robots.txt Rules: Rules of specificity apply, not inheritance. Always include a blank line between rules. Note also [...] • Read more »