Okay, I realize that the title sounds a bit odd, but nowhere near as odd as my recent discovery of Slurp ignoring explicit robots.txt rules and digging around in my highly specialized bot trap, which I have lovingly dubbed “the blackhole”. What is up with that, Yahoo!? Does your Slurp spider obey robots.txt directives or not? I have never seen Google crawling around that side of town, and neither MSN nor even Ask has ventured into the forbidden realms. Has anyone else experienced such unexpected behavior from one of the four major search engines? Hmmm.. let’s dig a little further..
Here is the carefully formulated, highly specific, properly placed robots.txt rule that explicitly and strictly forbids all agents from accessing my blackhole bot trap:
User-agent: *
Disallow: */blackhole/*
Nothing unusual here. This is standard stuff, right? But wait, what about that crazy wildcard character? Does Yahoo! acknowledge such a creature? Sure they do (404 link removed). So what’s up, then? I surely don’t know. This is such unexpected behavior from such a popular, highly visible search engine. Thus, let’s make sure that we are actually dealing with ol’ Slurp, and not some nasty impostor. To do this, we’ll follow Yahoo’s own advice and perform a forward-reverse IP lookup for verification. Here are the results:
Reverse lookup for IP:
OrgName: Inktomi Corporation
OrgID: INKT
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
NetRange: 188.8.131.52 - 184.108.40.206
CIDR: 220.127.116.11/16
NetName: INKTOMI-BLK-6
NetHandle: NET-74-6-0-0-1
Parent: NET-74-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.YAHOO.COM
NameServer: NS2.YAHOO.COM
NameServer: NS3.YAHOO.COM
NameServer: NS4.YAHOO.COM
NameServer: NS5.YAHOO.COM
Comment:
RegDate: 2006-02-13
Updated: 2007-03-09
RAbuseHandle: NETWO857-ARIN
RAbuseName: Network Abuse
RAbusePhone: +1-408-349-3300
RAbuseEmail: firstname.lastname@example.org
. . .
Forward lookup for hostname:
Yup, apparently it all checks out: it was Yahoo’s machine, all right. Tsk, tsk. Naughty Slurp! Even worse, this is not the first time Slurp has been caught sniffing around where it should not be sniffing. Needless to say, I will definitely be keeping a close eye on Yahoo! from now on.
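For anyone who wants to repeat this kind of check on their own visitors, here is a minimal Python sketch of a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, make sure the hostname sits under a domain you trust, then resolve that hostname back and confirm the original IP is among the results. The function name `is_verified_crawler` and the `.crawl.yahoo.net` suffix are my own assumptions for illustration, not something taken from the lookup above, so substitute whatever hostname pattern the search engine actually documents.

```python
import socket


def is_verified_crawler(ip, trusted_suffixes,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Return True only if ip passes a forward-confirmed reverse DNS check.

    trusted_suffixes is a tuple such as (".crawl.yahoo.net",); that value is
    an illustrative assumption, not a verified list. The reverse and forward
    resolvers are parameters so the logic can be tested without a network.
    """
    # Step 1: reverse lookup (IP -> hostname).
    try:
        hostname = reverse(ip)[0]
    except OSError:  # socket.herror and socket.gaierror both derive from OSError
        return False

    # Step 2: the hostname must end with a trusted suffix. The leading dot in
    # each suffix prevents "evilcrawl.yahoo.net" from slipping through.
    host = hostname.rstrip(".").lower()
    if not any(host.endswith(suffix) for suffix in trusted_suffixes):
        return False

    # Step 3: forward lookup (hostname -> IPs) must include the original IP.
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


# Usage (performs live DNS queries):
#   is_verified_crawler(visitor_ip, (".crawl.yahoo.net",))
```

The suffix check matters as much as the round trip: anyone who controls the reverse DNS for their own IP block can make it claim any hostname, but they cannot make the search engine's own forward DNS point back at their address.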
Finally, just for the record, here is the log entry for the blackhole event to which this article refers:
<dt class="special"><strong>18.104.22.168</strong></dt>
<dd class="data">[2007-11-21 (Wed) 20:52:37] "GET /blackhole/ HTTP/1.0"</dd>
<dd class="data">Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)</dd>
<dd class="data"><br />
OrgName: Inktomi Corporation <br />
OrgID: INKT<br />
Address: 701 First Ave<br />
City: Sunnyvale<br />
StateProv: CA<br />
PostalCode: 94089<br />
Country: US<br />
<br />
NetRange: 22.214.171.124 - 126.96.36.199 <br />
CIDR: 188.8.131.52/16 <br />
NetName: INKTOMI-BLK-6<br />
NetHandle: NET-74-6-0-0-1<br />
Parent: NET-74-0-0-0-0<br />
NetType: Direct Allocation<br />
NameServer: NS1.YAHOO.COM<br />
NameServer: NS2.YAHOO.COM<br />
NameServer: NS3.YAHOO.COM<br />
NameServer: NS4.YAHOO.COM<br />
NameServer: NS5.YAHOO.COM<br />
Comment: <br />
RegDate: 2006-02-13<br />
Updated: 2007-03-09<br />
<br />
RAbuseHandle: NETWO857-ARIN<br />
RAbuseName: Network Abuse <br />
RAbusePhone: +1-408-349-3300<br />
RAbuseEmail: email@example.com <br />
<br />
OrgAbuseHandle: NETWO857-ARIN<br />
OrgAbuseName: Network Abuse <br />
OrgAbusePhone: +1-408-349-3300<br />
OrgAbuseEmail: firstname.lastname@example.org<br />
<br />
OrgTechHandle: NA258-ARIN<br />
OrgTechName: Netblock Admin <br />
OrgTechPhone: +1-408-349-3300<br />
OrgTechEmail: email@example.com<br />
<br />
# ARIN WHOIS database, last updated 2007-11-20 19:10<br />
# Enter ? for additional hints on searching ARIN's WHOIS database.<br />
</dd>
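As a sanity check that the robots.txt rule really does cover the `/blackhole/` path requested in that log entry, here is a rough Python sketch of wildcard matching as the major engines commonly describe it: `*` matches any run of characters, a trailing `$` anchors the end of the path, and everything else is a prefix match. The helper names (`rule_to_regex`, `is_disallowed`, `is_disallowed_literal`) are hypothetical, and this is only my approximation of the matching logic, not Slurp's actual implementation.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal chunks and let each * stand for "any characters".
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_disallowed(path, disallow_patterns):
    """Wildcard-aware check: True if any non-empty Disallow pattern matches path."""
    # re.match anchors at the start of the path, giving prefix semantics.
    return any(rule_to_regex(p).match(path) for p in disallow_patterns if p)


def is_disallowed_literal(path, disallow_patterns):
    """For contrast: plain prefix matching with no wildcard support."""
    return any(path.startswith(p) for p in disallow_patterns if p)


rules = ["*/blackhole/*"]
print(is_disallowed("/blackhole/", rules))          # wildcard-aware check
print(is_disallowed("/about/", rules))              # unrelated path
print(is_disallowed_literal("/blackhole/", rules))  # literal prefix check
```

Under this sketch the wildcard-aware check treats `/blackhole/` as forbidden, while a parser that reads the pattern as a literal prefix would not match it at all, since no real path begins with an asterisk. That contrast is only relevant for crawlers that lack wildcard support.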
Know what’s up? Drop a comment and share the wealth..