Latest TweetsHeads up: Blasty (DMCA service) MIA: perishablepress.com/avoid-blas…
Perishable Press
Tag: robots
Found 26 matching results
Page 2 of 2

Yahoo! Slurp in My Blackhole (Yet Again)

Yup, ‘ol Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five different times via three different IPs over the course of […] Continue »

Yahoo! in my Blackhole

Okay, I realize that the title sounds a bit odd, but nowhere near as odd as my recent discovery of Slurp ignoring explicit robots.txt rules and digging around in my highly specialized bot trap, which I have lovingly dubbed “the blackhole”. What is up with that, Yahoo!? — does your Slurp spider obey robots.txt directives […] Continue »

Comprehensive Reference for WordPress No-Nofollow/Dofollow Plugins

Recently, while deliberating an optimal method for eliminating nofollow link attributes from Perishable Press, I collected, installed, tested and reviewed every WordPress no-nofollow/dofollow plugin that I could find. In this article, I present a concise, current, and comprehensive reference for WordPress no-nofollow and dofollow plugins. Every attempt has been made to provide accurate, useful, and […] Continue »

Stop WordPress from Leaking PageRank to Admin Pages

During the most recent Perishable Press redesign, I noticed that several of my WordPress admin pages had been assigned significant levels of PageRank. Not good. After some investigation, I realized that my ancient robots.txt rules were insufficient in preventing Google from indexing various WordPress admin pages. Specifically, the following pages have been indexed and subsequently […] Continue »

Eliminate 404 Errors for PHP Functions

Recently, I discussed the suspicious behavior recently observed by the Yahoo! Slurp crawler. As revealed by the site’s closely watched 404-error logs, Yahoo! had been requesting a series of nonexistent resources. Although a majority of the 404 errors were exclusive to the Slurp crawler, there were several instances of requests that were also coming from […] Continue »

Suspicious Behavior from Yahoo! Slurp Crawler

Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces. […] Continue »

Invite Only: Traffic Control via Whitelist

Web developers trying to control comment-spam, bandwidth-theft, and content-scraping must choose between two fundamentally different approaches: selectively deny target offenders (the “blacklist” method) or selectively allow desirable agents (the “opt-in”, or “whitelist” method). Currently popular according to various online forums and discussion boards is the blacklist method. The blacklist method requires the webmaster to create […] Continue »

Disobedient Robots and Company

In our never-ending battle against spammers, leeches, scrapers, and other online undesirables, we have implemented several powerful security measures to improve the operational integrity of our perpetual virtual existence. Here is a rundown of the new behind-the-scenes security features of Perishable Press. Continue »

Stop Bitacle from Stealing Content

If you have yet to encounter the content-scraping site, bitacle.org, consider yourself lucky. The scum-sucking worm-holes at bitacle.org are well-known for literally, blatantly, and piggishly stealing blog content and using it for financial gains through advertising. While I am not here to discuss the legal, philosophical, or technical ramifications of illegal bitacle behavior, I am […] Continue »

Robots Notes Plus

About the Robots Exclusion Standard1: The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the […] Continue »

12 Next Posts »