Tag: yahoo

SEO Experiment: Let Google Sort it Out

Posted on May 10, 2009 in Optimization by Jeff Starr

One way to prevent Google from indexing certain pages is to use <meta> elements in the <head> section of your web documents. For example, if I want to prevent Google from indexing and archiving a certain page, I would add the following code to the head of my document:

<meta name="googlebot" content="noindex,noarchive" />

I’m no SEO guru, but it is my general understanding that it is possible to manipulate the flow of PageRank throughout a site through strategic implementation of <meta> directives.

Continue Reading

Yahoo! Slurp too Stupid to be a Robot

Posted on March 15, 2009 in Nonsense, Websites by Jeff Starr

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our websites every minute of the day.

For the most part, there are effective methods available enabling us to protect our sites against the endless hordes of irrelevant and mischievous bots. Such evil is easily blocked with virtually zero side-effects because their presence is simply irrelevant.

But what about bad bots that aren’t exactly irrelevant, such as Yahoo’s mindless Slurp crawler? By disobeying the robots.txt protocol it promised to obey, Yahoo’s Slurp clearly falls into the “bad-bot” category. Unlike typical “nonsense” bots, Slurp is not exactly irrelevant (yet), so simply blocking it is not a reasonable solution.

Continue Reading

Yahoo! Lies about Obeying Robots.txt Directives

Posted on November 16, 2008 in Websites by Jeff Starr

There are two possibilities here: Yahoo!’s Slurp crawler is broken or Yahoo! lies about obeying Robots directives. Neither case is good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap:

Continue Reading

CSS Implementations of the Rich and Famous

Posted on October 26, 2008 in Presentation by Jeff Starr

[ Robin Leach of 'Lifestyles of the Rich and Famous' ] A great way to improve your CSS skills is to check out the stylesheets used by other websites. Digging behind the scenes and exploring some applied CSS provides new ideas and insights about everything from specificity and formatting to hacks and shortcuts. Learning CSS by reading about ideal cases and theoretical applications is certainly important, but actually seeing how the language is applied in “real-world” scenarios provides first-hand knowledge and insight. While there are millions of standards-based, CSS-designed websites to explore, studying a few of the Web’s elite players and CSS experts helps to put things into perspective by providing context for subsequent CSS investigations. Prime candidates include industry leaders, standards buffs, CSS specialists, professional bloggers, and other successful establishments. In this article, we reveal the CSS implementations used by the following “rich and famous” websites:

Continue Reading

Yahoo Incongruities.

Posted on July 13, 2008 in Technology by Jeff Starr

[ Screenshot: My Yahoo Search Results ] When frustration builds, and finally reaches the boiling point, it’s nice to be able to express yourself to someone. I really don’t enjoy ranting about things, but when it comes to certain aspects of Yahoo!, I just can’t he’p myse’f. So, thanks to a recent attempt at using My Yahoo!, it’s time to get some of this off my chest, clear the decks, and give Yahoo! (yet another) chance to clean up its act. Here are a few complaints I have against various aspects of the Yahoo! enterprise..

First and foremost, Yahoo! sends virtually zero traffic. I know this sounds painfully selfish, but I don’t understand why the top-ranking search engine finds Perishable Press worthy of around 2000 referrals per day, but Yahoo! can’t seem to justify more than a few each week. I mean come on, even good ‘ol MSN/Live manages to send a few hundred uniques per month 1.

So that sucks, and honestly doesn’t win Yahoo! any golden biscuits from me, not that it needs them. But perhaps more relevant and disturbing is Yahoo!’s relentless army of Slurp bots. I have never seen so much unusual, unexpected, and unwanted crawl behavior from any other legitimate search-engine robot. Seriously, some of the URL requests recorded in my arsenal of Web and error logs are like from another planet or something. Where are they coming up with some of this stuff:

Continue Reading

Unexplained Crawl Behavior Involving Tagged Query Strings

Posted on June 4, 2008 in Websites by Jeff Starr

I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 Errors generated from msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following:

  • http://perishablepress.com/press/page/2/?tag=spam
  • http://perishablepress.com/press/page/3/?tag=code
  • http://perishablepress.com/press/page/2/?tag=email
  • http://perishablepress.com/press/page/2/?tag=xhtml
  • http://perishablepress.com/press/page/4/?tag=notes
  • http://perishablepress.com/press/page/2/?tag=flash
  • http://perishablepress.com/press/page/2/?tag=links
  • http://perishablepress.com/press/page/3/?tag=theme
  • http://perishablepress.com/press/page/2/?tag=press

..plus hundreds and hundreds more 1. The URL pattern is always the same: a different page number followed by a query string containing one of the tags used here at Perishable Press, for example: “/?tag=something”. The problem is that there are no such links anywhere on the site. The site employs permalink format for all WordPress-generated links (e.g., post/page links, search queries, tag queries, etc.). In an effort to locate the source of these URLs, I have performed multiple, thorough searches through every aspect of my site (files, database, code, examples, etc.), and the results are always the same: nothing. UGH! What is the source of these misguided URLs? Hopefully, you will be able to help shed some light on this unexplained crawl behavior (please, I beg you!!).
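For anyone sifting their own logs for the same pattern, here is a minimal Python sketch of how such requests could be picked out. It assumes the requested URLs have already been extracted from the log. The name `is_tagged_page_request`, the regular expression, and the third URL (a permalink-style tag URL used only for contrast) are illustrative assumptions, not anything taken from WordPress or from my actual logs.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed shape of a paginated archive path, e.g. /press/page/2/
PAGED_PATH = re.compile(r"/page/\d+/?$")

def is_tagged_page_request(url):
    """Return True for URLs shaped like /page/<number>/?tag=<something>."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # {"tag": ["spam"]} for "?tag=spam"
    return bool(PAGED_PATH.search(parts.path)) and "tag" in query

requests = [
    "http://perishablepress.com/press/page/2/?tag=spam",
    "http://perishablepress.com/press/page/3/?tag=code",
    "http://perishablepress.com/press/tag/spam/",  # hypothetical permalink form
]

# Keep only the requests that match the mystery pattern
suspects = [url for url in requests if is_tagged_page_request(url)]
```

Counting the matches per user agent would be a natural next step, but that depends on the log format, so it is left out here.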

Continue Reading

Yahoo! Slurp in My Blackhole (Yet Again)

Posted on December 16, 2007 in Websites by Jeff Starr

Yup, ‘ol Slurp is at it again, flagrantly disobeying specific robots.txt rules forbidding access to my bad-bot trap, lovingly dubbed the “blackhole.” As many readers know, this is not the first time Yahoo has been caught behaving badly. This time, Yahoo was caught trespassing five different times via three different IPs over the course of four different days. Here is the data recorded in my site’s blackhole log (I know, that sounds terrible):

Continue Reading

Yahoo! in my Blackhole

Posted on November 25, 2007 in Websites by Jeff Starr

Okay, I realize that the title sounds a bit odd, but nowhere near as odd as my recent discovery of Slurp ignoring explicit robots.txt rules and digging around in my highly specialized bot trap, which I have lovingly dubbed “the blackhole”. What is up with that, Yahoo!? Does your Slurp spider obey robots.txt directives or not? I have never seen Google crawling around that side of town, and neither MSN nor even Ask has ventured into the forbidden realms. Has anyone else experienced such unexpected behavior from one of the four major search engines? Hmmm.. let’s dig a little further..

Here is the carefully formulated, highly specific, properly placed robots.txt rule that explicitly and strictly forbids all agents from accessing my blackhole bot trap:

Continue Reading

Prevent JavaScript Elements from Breaking Page Layout when Following Yahoo Performance Tip #6: Place Scripts at the Bottom

Posted on November 12, 2007 in Presentation, Structure by Jeff Starr

By now, everyone is familiar with the Yahoo Developer Network’s 14 “best-practices” for speeding up your website. Certainly, many (if not all) of these performance optimization tips are ideal for high-traffic sites such as Yahoo or Google, but not all of them are recommended for smaller sites such as Perishable Press. Nonetheless, throughout the current site renovation project, I have attempted to implement as many of these practices as possible. At the time of this writing, I somehow have managed to score an average 77% (whoopee!) via the YSlow extension for Firebug.

Of the handful of these tips that I am able (or willing) to follow, number 6 — move scripts to the bottom — is definitely one of the easiest. The reason for doing this is at least twofold:

[…] it’s better to move scripts from the top to as low in the page as possible. One reason is to enable progressive rendering, but another is to achieve greater download parallelization.
— Yahoo! Developer Network

Many people mistakenly assume that the <script> element (and associated contents) must be located squarely in the document <head>; however, this simply isn’t true. As outlined in the official HTML 4.01 Document Type Definition and also in the official XHTML 1.1 Document Type Definition, the <script> element is allowed:

Continue Reading

How to Verify the Four Major Search Engines

Posted on October 9, 2007 in Websites by Jeff Starr

Keeping track of your access and error logs is a critical component of any serious security strategy. Many times, you will see a recorded entry that looks legitimate, such that it may easily be dismissed as genuine Google fare, only to discover upon closer investigation a fraudulent agent. There are many such cloaked or disguised agents crawling around these days, mimicking various search engines to hide beneath the radar. Thus, it is a good idea to implement a procedure for scanning and checking select agents for authenticity. In general, the verification process involves a “forward/reverse” DNS lookup, which is then cross-verified with the search engine in question. Let’s have a quick look at how to do this..

First, visit and bookmark the following articles (and/or this article). These resources explain how to identify and verify the agents for each of the four major search engines: Google, Yahoo!, MSN/Live, and Ask.
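The “forward/reverse” lookup described above can be sketched in a few lines of Python. This is only a rough illustration under some assumptions: the function name `verify_crawler` is made up, the list of acceptable host-name suffixes has to come from each search engine’s own documentation, and `socket.gethostbyname_ex` only covers IPv4. The two resolver parameters exist so the check can be tried out without touching the network.

```python
import socket

def verify_crawler(ip, allowed_suffixes,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IP claiming to be a crawler.

    allowed_suffixes: host-name endings the search engine documents for its
    crawler (an assumption supplied by the caller, not verified here).
    """
    try:
        hostname = reverse(ip)[0]           # reverse lookup: IP -> host name
    except OSError:
        return False                        # no PTR record, so not verified
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False                        # host is outside the engine's domain
    try:
        addresses = forward(hostname)[2]    # forward lookup: host name -> IPs
    except OSError:
        return False
    return ip in addresses                  # must resolve back to the same IP
```

With an address pulled from a log entry, a call would look something like `verify_crawler(ip_from_log, [".googlebot.com"])`, where the suffix shown is one that Google has documented for Googlebot and `ip_from_log` is a placeholder.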

Continue Reading

Suspicious Behavior from Yahoo! Slurp Crawler

Posted on August 13, 2007 in Websites by Jeff Starr

[ Image: Black and white illustration of the upper half of a man's suspicious, paranoid face ] Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces.

Of course, this policy presents a bit of a dilemma when the culprit is one of the four major search engines. Slamming the door on Yahoo! would be unwise, but if their Slurp crawler continues behaving suspiciously, I may have no choice. Check out the following records, pulled directly from one of my error logs, where Yahoo! exhibits some extremely questionable behavior.

Continue Reading

Robots Notes Plus

Posted on April 3, 2006 in Function by Jeff Starr

About the Robots Exclusion Standard 1:

The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.

Notes on the robots.txt Rules:

Rules of specificity apply, not inheritance. Always include a blank line between rules. Note also that not all robots obey the robots rules — even Google has been reported to ignore certain robots rules. Also, comments are allowed (and recommended) within any robots.txt file when written on a per-line basis. Simply begin each line of comments with a pound sign “#”.

Prevent Robots from Indexing the Entire Site:

User-agent: *
Disallow: /

Prevent a Specific Robot from Indexing the Entire Site:

User-agent: Googlebot-Image
Disallow: /

Prevent all Robots from Indexing Specific Pages/Directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.html

A Specific Example:

In this example, no robots are allowed to index anything except for Google, which is allowed to index everything except the specified pages/directories. Note the required blank line between the rules.

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
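To see how a well-behaved robot might read these rules, here is a small sketch using Python’s standard `urllib.robotparser` module. It shows how one particular parser interprets the file, not how Googlebot or any other real crawler is implemented. The agent name “SomeOtherBot” and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot matches its own group, so only the listed directories are off limits
googlebot_home = parser.can_fetch("Googlebot", "/index.html")       # True
googlebot_cgi = parser.can_fetch("Googlebot", "/cgi-bin/test.cgi")  # False

# Any other agent falls back to the "*" group and is shut out entirely
other_home = parser.can_fetch("SomeOtherBot", "/index.html")        # False
```

Note that the more specific Googlebot group replaces the “*” group for that agent rather than adding to it, which is the “specificity, not inheritance” behavior mentioned in the notes.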

Another Specific Example:

In this example, no agents are allowed to index anything except for Alexa, which is allowed to index anything. Note that the Disallow directive is left empty (nothing after the colon), which means that nothing is disallowed for that agent.

User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
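As a quick sanity check, Python’s standard `urllib.robotparser` treats an empty Disallow value as “nothing is disallowed”. The sketch below feeds it the rules from this example. The agent name “SomeOtherBot” and the path are made up, and other parsers may of course behave differently.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: ia_archiver",
    "Disallow:",            # empty value: no path is disallowed
])

alexa_ok = parser.can_fetch("ia_archiver", "/any/page.html")    # True
others_ok = parser.can_fetch("SomeOtherBot", "/any/page.html")  # False
```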

Prevent all Agents Except for Google:

Here is Google’s preferred way to disallow all agents except Google, which is allowed everything. Note that “Allow” is not part of the original robots exclusion standard, so other robots may not understand it, and therefore it is not recommended.

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

Notes on the “meta robots” Tag:

Certain robots rules may also be included in the head section of a web document. Examine the following examples:

<meta name="robots" content="noindex,nofollow,noarchive" />
<meta name="robots" content="noindex,nofollow" />
<meta name="googlebot" content="none" />
<meta name="alexa" content="all" />

Here is a general list of values available for the “content” attribute of the “meta robots” tag:

noindex, index — Determines indexing of site/pages.
nofollow, follow — Determines following of links.
nosnippet — Do not display excerpts or cached content.
noarchive — Do not display or collect cached content.

Additionally, Altavista supports:

noimageindex — Index text but not images.
noimageclick — Link to pages but not images.
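To illustrate how these values might be read by a program, here is a rough Python sketch built on the standard `html.parser` module. The class name `RobotsMetaParser` and the helper `may_index` are invented for this example, and the default list of agent names is an assumption. It treats “none” as shorthand for “noindex,nofollow”, and it ignores everything except the indexing question.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the content values of robots-style <meta> tags, keyed by name."""

    def __init__(self, agents=("robots", "googlebot")):
        super().__init__()
        self.agents = set(agents)   # meta names to watch for (assumed list)
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name not in self.agents:
            return
        content = attr.get("content") or ""
        values = {v.strip().lower() for v in content.split(",") if v.strip()}
        self.directives.setdefault(name, set()).update(values)

def may_index(directives, agent="robots"):
    """False if the agent-specific or generic tag says noindex (or none)."""
    values = directives.get(agent, set()) | directives.get("robots", set())
    return not ({"noindex", "none"} & values)

page = '<head><meta name="googlebot" content="noindex,noarchive" /></head>'
meta = RobotsMetaParser()
meta.feed(page)

googlebot_may_index = may_index(meta.directives, "googlebot")  # False
generic_may_index = may_index(meta.directives)                 # True
```

A real crawler would also look at nofollow, noarchive, and the rest, but the lookup works the same way for each value.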

References