Articles tagged with “spider”
- Unexplained Crawl Behavior Involving Tagged Query Strings
- I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 errors generated by msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following:
http://perishablepress.com/press/page/2/?tag=spam
http://perishablepress.com/press/page/3/?tag=code
http://perishablepress.com/press/page/2/?tag=email
http://perishablepress.com/press/page/2/?tag=xhtml
http://perishablepress.com/press/page/4/?tag=notes
http://perishablepress.com/press/page/2/?tag=flash
http://perishablepress.com/press/page/2/?tag=links
http://perishablepress.com/press/page/3/?tag=theme
http://perishablepress.com/press/page/2/?tag=press
...plus hundreds and hundreds ...
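One way to see which spiders are behind requests like these is to tally them from the access log. The following is only a rough sketch: it assumes the server writes the Apache "combined" log format, and the sample lines, IP addresses, and the `tagged_404s` helper are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumed layout (Apache "combined" format):
# ip - - [date] "METHOD path HTTP/x" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def tagged_404s(lines):
    """Count 404 requests whose URL carries a ?tag= query string, per user agent."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group("status") != "404":
            continue
        query = parse_qs(urlsplit(match.group("path")).query)
        if "tag" in query:
            counts[match.group("agent")] += 1
    return counts

# Fabricated sample lines (documentation-range IPs), not real log data.
sample = [
    '192.0.2.1 - - [01/Jan/2008:00:00:01 +0000] "GET /press/page/2/?tag=spam HTTP/1.1" 404 512 "-" "msnbot/1.0"',
    '192.0.2.2 - - [01/Jan/2008:00:00:02 +0000] "GET /press/page/3/?tag=code HTTP/1.1" 404 512 "-" "Yahoo! Slurp"',
    '192.0.2.3 - - [01/Jan/2008:00:00:03 +0000] "GET /press/ HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

for agent, hits in tagged_404s(sample).items():
    print(agent, hits)
```

Grouping by user agent rather than by URL makes it easier to tell whether one crawler or several are generating the bad requests.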
- Taking Advantage of the X-Robots Tag
- Controlling the spidering, indexing and caching of your (X)HTML-based web pages is possible with meta robots directives such as these:
I use these directives here at Perishable Press and they continue to serve me well for controlling how the “big bots” crawl and represent my (X)HTML-based content in search results.
For other, non-(X)HTML ...
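For files that cannot carry a meta element, the same directives can travel in an `X-Robots-Tag` HTTP response header. The sketch below shows the idea only; the extension list, the default directives, and the `robots_headers` helper are assumptions for illustration, not the configuration used on this site.

```python
import posixpath

# Hypothetical policy: file types that cannot carry a <meta> robots element.
NON_HTML_EXTENSIONS = {".pdf", ".doc", ".txt", ".jpg", ".png"}

def robots_headers(path, directives=("noindex", "nofollow")):
    """Return extra response headers for a request path.

    Non-HTML files get an X-Robots-Tag header; everything else gets
    nothing extra, since HTML pages can use meta robots directives.
    """
    extension = posixpath.splitext(path)[1].lower()
    if extension in NON_HTML_EXTENSIONS:
        return {"X-Robots-Tag": ", ".join(directives)}
    return {}

print(robots_headers("/files/report.pdf"))
print(robots_headers("/press/index.html"))
```

In practice the header is usually set in the web server configuration rather than in application code; the function above just makes the decision rule explicit.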
- Yahoo! Slurp in My Blackhole (Yet Again)
- Yup, ol’ Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five different times via three different IPs over the course of four different days. Here is the data recorded ...
- How to Add Meta Noindex to Your Feeds
- Want to make sure that your feeds are not indexed by Google and other compliant search engines? Add the following code to the channel element of your XML-based (RSS, etc.) feeds:
Here is an example of how I use this tag for Perishable Press feeds (vertical spacing added for emphasis):
Perishable Press
http://perishablepress.com/press
Digital Design and Dialogue ~
Mon, 29 Oct 2007 21:38:24
en...
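The tag itself did not survive in this excerpt, so the sketch below rests on an assumption: that the element in question is the commonly used `xhtml:meta` robots element in the XHTML namespace. The `add_noindex` helper and the tiny sample feed are hypothetical and only show where such an element would sit inside `channel`.

```python
import xml.etree.ElementTree as ET

# Assumption: the noindex element lives in the XHTML namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("xhtml", XHTML_NS)

def add_noindex(feed_xml):
    """Insert a robots noindex meta element as the first child of <channel>."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("no <channel> element found")
    meta = ET.Element(f"{{{XHTML_NS}}}meta", {"name": "robots", "content": "noindex"})
    channel.insert(0, meta)
    return ET.tostring(root, encoding="unicode")

# Minimal made-up feed; a real feed has many more elements.
feed = """<rss version="2.0"><channel>
<title>Perishable Press</title>
<link>http://perishablepress.com/press</link>
</channel></rss>"""

print(add_noindex(feed))
```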
- Yahoo! in my Blackhole
- Okay, I realize that the title sounds a bit odd, but nowhere near as odd as my recent discovery of Slurp ignoring explicit robots.txt rules and digging around in my highly specialized bot trap, which I have lovingly dubbed “the blackhole”. What is up with that, Yahoo!? — does your Slurp spider obey robots.txt directives or not? I have never seen Google crawling around that side of town, neither has MSN nor even Ask ventured into the ...
- How to Verify the Four Major Search Engines
- Keeping track of your access and error logs is a critical component of any serious security strategy. Many times, you will see a recorded entry that looks legitimate, such that it may easily be dismissed as genuine Google fare, only to discover upon closer investigation a fraudulent agent. There are many such cloaked or disguised agents crawling around these days, mimicking various search engines to hide beneath the radar. Thus, it ...
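A common way to check such an entry is a forward-confirmed reverse DNS lookup: resolve the IP to a host name, confirm the name belongs to the search engine, then resolve the name back and make sure it returns the same IP. The sketch below assumes that approach; the domain suffixes are assumptions to be checked against each engine's current documentation, and `verify_crawler` is a hypothetical helper.

```python
import socket

# Assumed host-name suffixes per crawler; verify before relying on them.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "msnbot": (".search.msn.com",),
    "slurp": (".crawl.yahoo.net",),
}

def verify_crawler(ip, name,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Return True only if the IP passes a forward-confirmed reverse DNS check.

    The resolver functions are parameters so the logic can be exercised
    without network access.
    """
    suffixes = CRAWLER_DOMAINS.get(name.lower())
    if not suffixes:
        return False
    try:
        host = reverse(ip)[0]                  # IP -> host name
        if not host.lower().endswith(suffixes):
            return False                       # name is not in an expected domain
        return ip in forward(host)[2]          # host name -> IPs must include ours
    except OSError:                            # covers socket.herror / socket.gaierror
        return False
```

Checking the user-agent string alone is not enough, since it is trivial to fake; the reverse lookup alone is not enough either, because whoever controls an IP block also controls its reverse DNS records. The forward lookup is the step that closes that gap.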
- Suspicious Behavior from Yahoo! Slurp Crawler
- [ Keywords: yahoo, slurp, crawl, crawling, spider, url, 404, errors, suspicious, behavior ]
Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t ...
- Invite Only: Visitor Exclusivity via the Opt-In Method
- Web developers trying to control comment-spam, bandwidth-theft, and content-scraping must choose between two fundamentally different approaches: selectively deny target offenders (the "blacklist" method) or selectively allow desirable agents (the "opt-in", or "whitelist" method).
Judging by various online forums and discussion boards, the blacklist method is currently the more popular of the two. The blacklist method requires the webmaster to create and maintain a working list of undesirable agents, usually blocking their access via .htaccess or PHP. The downside of "blacklisting" is that ...
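The difference between the two approaches shows up most clearly with an agent that appears on neither list. Here is a minimal sketch, assuming simple substring matching on the user-agent string; the token lists and both helper functions are hypothetical.

```python
def blacklist_allows(user_agent, blocked):
    """Blacklist: allow everyone except agents matching a blocked token."""
    ua = user_agent.lower()
    return not any(token in ua for token in blocked)

def whitelist_allows(user_agent, allowed):
    """Opt-in (whitelist): deny everyone except agents matching an allowed token."""
    ua = user_agent.lower()
    return any(token in ua for token in allowed)

# Hypothetical token lists for illustration only.
BLOCKED = ("badscraper", "emailharvester")
ALLOWED = ("mozilla", "googlebot", "msnbot", "slurp")

unknown = "NewScraper/1.0"
print(blacklist_allows(unknown, BLOCKED))   # True: unknown agents slip through
print(whitelist_allows(unknown, ALLOWED))   # False: unknown agents are denied
```

User-agent strings are easy to forge, so either list is only a first filter.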
- Robots Notes Plus
- About the Robots Exclusion Standard:
The robots exclusion standard, or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the website.
Notes on the robots.txt Rules:
Rules of specificity apply, not inheritance. Always include a blank line between rules. Note also that not all robots ...
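The "specificity, not inheritance" point can be checked with Python's standard `urllib.robotparser`. This is a sketch under stated assumptions: the robots.txt content, the `/trap/` and `/private/` paths, and the bot names are invented for the example, and the result reflects how this one parser reads the file, which a given crawler may or may not match.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: a catch-all group and a group for one specific bot,
# separated by a blank line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/

User-agent: ExampleBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# ExampleBot matches its own group, so only that group's rules apply to it.
print(parser.can_fetch("ExampleBot", "/private/page.html"))  # False
print(parser.can_fetch("ExampleBot", "/trap/"))              # True: "*" rules are not inherited
print(parser.can_fetch("OtherBot", "/trap/"))                # False: falls back to "*"
```

So a rule meant for every robot has to be repeated inside each bot-specific group; listing it once under `*` does not cover bots that have a group of their own.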