Analyzing Weird 404 Search Engine Requests

Lately I’ve been getting a significant number of really weird 404 requests for one of my sites. At first I ignored them. Then upon closer inspection, I realized that the requests were reporting user agents like Googlebot, Bingbot, and other top search engines. So there was cause for concern. You don’t want legitimate search engines tripping over endless 404 requests that are completely unrelated to your site content. That gets into “negative SEO” territory, and should be investigated and resolved asap. This article explains what I was dealing with, how I investigated, and what I did to resolve the issue.

Bizarre, unrelated requests

The site in question focuses on web design snippets. People who follow my work will know which site I’m talking about, but basically it’s all about code snippets and tutorials (hint: it’s not Perishable Press). In any case, when you have a site that focuses on a specific niche, you expect to see the following types of 404 errors every now and then:

  • 404s due to moved resources
  • 404s due to link typos
  • 404s related to predictive algorithms (e.g., apple icons, favicons, etc.)
  • 404s due to malicious scanning
  • 404s that are similar in nature to actual/existing resources

These types of 404 “Not Found” errors are common and are no reason for alarm, except maybe the exploit scanning, which you can protect against using firewalls such as 6G and BBQ Pro. In general, these sorts of 404s are normal. Google itself basically says that there is no penalty for rocking a few 404s now and then. It’s just part of how the Web works. Stuff changes, resources move, no big deal.

What you don’t expect to see, and what isn’t part of the normal, occasional 404 errors, are legions of 404 requests that are completely unrelated to what your site is about. For example, the site in question is all about code snippets: HTML, JavaScript, CSS, and so forth. So why on earth did I begin seeing hundreds of 404 requests for the following resources:

http://example.com/produto/bacon-com-cheddar
http://example.com/produto/banana-com-chocolate
http://example.com/produto/bacon-3
http://example.com/produto/napolitana
http://example.com/cardapio/esfihas-salgadas/4
http://example.com/produto/carijo
http://example.com/produto/esfihas-salgadas-carne-com-catupiry

I mean, what in the world is going on with this sort of nonsense? These requests do not fit into any logical pattern of expected 404s for the site. Again, it’s a code snippets site; that’s the primary focus. These requests, which were hitting by the hundreds every day, are looking for resources related to food, colleges, and various international products. And in languages other than English. How on earth did Italian food resources and European colleges get associated with my site?

In order to get to the bottom of the mystery, I began to delve a little deeper..

Examining the user agents, are they spoofed?

After getting sick and tired of endless weird 404 errors, I decided to investigate the reported user agents and IP addresses. Here is a small sampling of what I found:

http://example.com/produto/bacon-com-cheddar
REMOTE ADDRESS: 68.180.229.107
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

http://example.com/produto/banana-com-chocolate
REMOTE ADDRESS: 68.180.229.107
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

http://example.com/produto/bacon-3
REMOTE ADDRESS: 207.46.13.115
USER AGENT: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

http://example.com/produto/napolitana
REMOTE ADDRESS: 207.46.13.115
USER AGENT: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

http://example.com/cardapio/esfihas-salgadas/4
REMOTE ADDRESS: 207.46.13.115
USER AGENT: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

http://example.com/produto/carijo
REMOTE ADDRESS: 66.249.75.86
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

http://example.com/produto/esfihas-salgadas-carne-com-catupiry
REMOTE ADDRESS: 66.249.75.102
USER AGENT: Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Going into it, I was expecting to see a bunch of known bad bots, so I was definitely surprised to see that the requests were reporting user agents belonging to the major search engines. I figured that these requests had to be spoofed, as they were so far removed from anything the site is actually about. So the next logical step was to determine the legitimacy of these requests by further examining the server logs.

Verifying identity via log analysis

The easiest way to verify the identity of anything hitting your site is to examine the server logs. Simply check the value of the HOST field (the host name reported for the requesting IP address) and compare it with the reported user agent and other request variables to see if everything lines up. For example, my site logged the following request not too long ago:

TIME: February 2nd 2016, 01:32pm
REQUEST: http://example.com/%22%20title=%22.guessed%20post%20title%22%3e%3cimg%20src=%22https:/example.com/wp-content/themes/whatever/scripts/timthumb.php?src=http://wordpress.com.zapeljivka.si/tim.php
SITE: https://example.com/
REFERRER: undefined
QUERY STRING: src=http://wordpress.com.zapeljivka.si/tim.php
REMOTE ADDRESS: 5.135.154.181
PROXY ADDRESS: 5.135.154.181
HOST: www.woodpeckermarket.com
HTTP HOST: example.com
SERVER NAME: example.com
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

As you can see here, the user agent reports as “Googlebot”, but the HOST info reports www.woodpeckermarket.com, which clearly is not anything Google would be doing. If the weird Italian food requests had been logged similar to this, I would have dismissed the 404s as just another pathetic scanner doing the only thing they know how to do.
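To make that comparison concrete, here is a minimal Python sketch of the "does the HOST line up with the user agent?" check. It is only an illustration, not the tool that produced the logs above. The function name and the suffix table are my own assumptions, with the suffixes taken from the host names that appear in the log entries in this article.

```python
from typing import Optional

# Assumed mapping: crawler token in the user agent -> host name suffixes
# seen for that crawler in the log entries shown in this article.
EXPECTED_HOST_SUFFIXES = {
    "googlebot": (".googlebot.com",),
    "bingbot": (".search.msn.com",),
    "yahoo! slurp": (".yahoo.net",),
}


def host_matches_agent(user_agent: str, host: str) -> Optional[bool]:
    """Compare the logged HOST with the crawler named in the user agent.

    Returns True or False when the user agent claims a known crawler,
    and None when it does not claim one at all.
    """
    ua = user_agent.lower()
    host = host.lower().rstrip(".")
    for token, suffixes in EXPECTED_HOST_SUFFIXES.items():
        if token in ua:
            return host.endswith(suffixes)
    return None


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# The fake "Googlebot" request logged above fails the check.
print(host_matches_agent(GOOGLEBOT_UA, "www.woodpeckermarket.com"))
```

A result of False only says that the logged data is inconsistent. A result of True is not proof of identity, because the host name comes from a DNS record that the owner of the IP address controls.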

Unfortunately, that’s not what the server log revealed. Instead, I discovered reports similar to the following (names of search engines added for clarity):

BING

TIME: January 28th 2016, 11:13pm
*404: http://example.com/produto/napolitana
SITE: https://example.com/
REFERRER: undefined
HTTP HOST: example.com
HOST: msnbot-207-46-13-85.search.msn.com
PROXY: 207.46.13.85
QUERY STRING: undefined
REMOTE ADDRESS: 207.46.13.85
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)


GOOGLE

TIME: January 29th 2016, 06:10pm
*404: http://example.com/halfandhalf/1
SITE: https://example.com/
REFERRER: undefined
HTTP HOST: example.com
HOST: crawl-66-249-75-102.googlebot.com
PROXY: 66.249.75.102
QUERY STRING: undefined
REMOTE ADDRESS: 66.249.75.102
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)


YAHOO
	
TIME: January 31st 2016, 10:47pm
*404: http://example.com/cardapio/pizzas-doces
SITE: https://example.com/
REFERRER: undefined
HTTP HOST: example.com
HOST: b115339.yse.yahoo.net
PROXY: 68.180.229.107
QUERY STRING: undefined
REMOTE ADDRESS: 68.180.229.107
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Notice the details reported for these requests: the host name matches up with both IP address and user agent. So the easy method of spotting a fake won’t work. They look real based on the reported data. In my experience, requests such as these are most likely legit, but the logged data alone doesn’t prove it: the user agent is trivial to spoof, and the host name comes from a reverse DNS record controlled by whoever owns the IP address. So I needed to investigate further to solve the mystery.

Verifying identity via DNS lookup

If analyzing log files proves insufficient, a quick forward/reverse DNS lookup should be more than convincing. There are numerous ways to do a forward/reverse lookup. The easiest method for most folks is to just use an online lookup service. For example, using such a service to do a reverse DNS lookup of the IP reported by “Googlebot”, 66.249.75.86, gives the following host name:

crawl-66-249-75-86.googlebot.com

Then doing a forward lookup of that host name, the service confirms that the request was made by Googlebot, returning the same IP address:

66.249.75.86

So from this, I conclude that yes, it is in fact Google making the weird 404 requests at my site. Note that if the forward/reverse lookup had failed, the mystery would have deepened, because the source IP address of a completed HTTP request is, for practical purposes, not something an outsider can spoof.

Note also that it’s possible to run a forward/reverse DNS lookup directly via the command line, using the host command:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Here we are using Terminal to first check the host of IP 66.249.66.1, and then the host of crawl-66-249-66-1.googlebot.com. Because the results match up, it’s safe to conclude that the reported identity is correct.
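The same two-step check can be scripted. The following is a rough Python sketch using only the standard library's socket module. The function name is hypothetical, and the two resolver parameters exist only so the logic can be exercised without touching the network. It handles IPv4 only, since socket.gethostbyname_ex does not resolve IPv6.

```python
import socket


def forward_confirmed_host(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the reverse-DNS host name of ip if that name resolves
    back to the same ip, otherwise None."""
    try:
        host = reverse(ip)[0]          # step 1: IP -> host name (PTR record)
        addresses = forward(host)[2]   # step 2: host name -> list of IPv4 addresses
    except OSError:                    # covers socket.herror and socket.gaierror
        return None
    return host if ip in addresses else None


# Stand-in resolvers that mimic the `host` output shown above,
# so the example runs without any DNS traffic.
def fake_reverse(ip):
    return ("crawl-66-249-66-1.googlebot.com", [], [ip])


def fake_forward(host):
    return (host, [], ["66.249.66.1"])


print(forward_confirmed_host("66.249.66.1", fake_reverse, fake_forward))
```

Calling forward_confirmed_host("66.249.66.1") with no extra arguments performs real lookups. You would still want to check that the returned host name ends in the domain you expect for the crawler in question.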

Performing this same test for the other reported search engines (Bing and Yahoo), everything turns up legit. In other words, the mystery 404s are real and need to be resolved.

Putting it all together

At this point in the story, I have determined the following:

  • Bing, Google, Yahoo, et al are making some really funky 404 requests at my site
  • The weird links/text are nowhere to be found on my site or anywhere else online
  • The weird 404s involve requests for food, colleges, and various international resources
  • The search engines tend to crawl these links frequently and at the same time (every 15 minutes or so)

So what does all of this mean? Well, I doubt there is some big conspiracy that involves the major search engines, Italian food, and my humble snippets site. Most likely there is something out there that is linking to my site, either purposely or inadvertently, using these weird URLs. So the question now is whether this behavior is malicious or not, and then how to deal with it. Is something out there feeding search engines “bad SEO” links to my site? Or is it more innocent than that? I tend to be cynical about this sort of thing, but it is impossible to say for sure without more evidence.

Google Webmaster Tools to the rescue?

Trying to make sense of this to resolve the issue, I signed up for Google Webmaster Tools, hoping that their 404 data would reveal the source of the 404 errors. About a week later, the weird 404s were in fact reported by Google, but there was no associated referral/source information. So not really helpful other than to verify my own conclusions that the 404s were actually coming from Google et al.

So again, something out there, for whatever reason, is pointing the search engines at my site looking for all sorts of completely unrelated resources. To supplement the Webmaster Tools data, I tried using several backlink checker services available online. Not only were they not helpful, they were very frustrating and a total waste of time. The three or so services that I tried all promised the moon, but ended up delivering nothing while trying to convince me to pay for the very information that they said I would get if I simply signed up. Very deceptive and just yuck.

At this point, it finally was time to wrap things up and move on with my life..

Solution: 410 Gone

After collecting data for several weeks, it was time to conclude the investigation and implement a solution. Really the choices for handling the situation were few:

  • Do nothing and hope that it all works out on its own
  • Take advantage of the link juice and add some new pages to my site
  • Handle the 404s via .htaccess

The first option is a tempting strategy for some aspects of working online, but not advisable when SEO may be involved. Better to be proactive about it. The second option, although potentially useful, would be counter-productive in this scenario. And so that leaves the obvious choice: manage the weird traffic at the server level with a snippet of .htaccess.

Handling these sorts of weird, nonexistent requests via .htaccess is straightforward. The only real question is the type of response to serve: 403, 404, 410, etc. We’ve already ruled out leaving it as 404, because we want to let the search engines know that the issue has been dealt with properly. In my experience, this situation calls for a 410 “Gone” response, because that’s pretty much what it is: the weird resources are just gone — as in they don’t exist. So it wouldn’t make sense to serve 403 “Forbidden” — that would be sending the wrong signal.

Going with a 410 response, the last thing to decide is the targeted request vector, which can be anything available to Apache/.htaccess. As these are actual search engines making the requests, we definitely do NOT want to target the user agent, IP address, host, or any other identifying information. Rather, it makes more sense to simply target the request string, or even better, the shortest string of characters that is common to the weird requests and unique to them.

In the examples given above, that clearly would be something like /produto/, which is a pattern that simply does not exist on any of my domains. So it would be an easy thing to do with zero chance of false positives.

Unfortunately, the variety of weird 404 errors goes far beyond just /produto/. As mentioned above, the search engines keep coming around looking for resources related to food, colleges, and various international products. So it took some time to cull through the data and assemble a complete, concise list of targeted patterns.
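For anyone facing a similar pile of 404s, here is a small Python sketch of one way to do that culling: tally the first path segment of each logged 404 URL so the most frequent candidates float to the top. This is an illustration, not the process I actually used. The function name is hypothetical, and the sample list is just the handful of URLs shown earlier in this article.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_first_segments(urls):
    """Count how often each first path segment appears in a list of URLs."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        if segments:
            counts[segments[0]] += 1
    return counts


# Sample data: the 404 URLs listed earlier in this article.
sample = [
    "http://example.com/produto/bacon-com-cheddar",
    "http://example.com/produto/banana-com-chocolate",
    "http://example.com/produto/bacon-3",
    "http://example.com/produto/napolitana",
    "http://example.com/cardapio/esfihas-salgadas/4",
    "http://example.com/produto/carijo",
    "http://example.com/produto/esfihas-salgadas-carne-com-catupiry",
]

for segment, hits in count_first_segments(sample).most_common():
    print(segment, hits)
```

Each candidate segment still needs a manual check against the site's real URLs before it goes anywhere near a blocking rule.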

.htaccess FTW

After gathering and sorting through all of the data, I wrote the following .htaccess directive:

RedirectMatch 410 /(ajuda|areas-de|atletico|bed-bumper|benutzer|bishop|brocolis|browserconfig|bryant|cadastro|cardapio|chosen\.css|cnmdirj|coletiva|comparetable|contato|curriculo|east-miss|estrella|federico|halfandhalf|hillcrest|international-college|localizacao|loesungen|padrao|pagamentos|pizzas|pramukh|pressem|produkt|produto|quatro-queijos|rankings-and|regency-beauty|restaurante|social-circuit|there-is-no-way-this-exists|tricoci|union-college)

This does exactly what it says: serves a 410 Gone response to any request whose URL path contains a slash followed by any of the strings defined in the regex. For the number of weird 404s that this deals with, this solution really is quite elegant. And the best part is that none of the targeted patterns ever will be included in any URL on my site. So it’s a total victory for dealing with what could have been a creeping SEO nightmare.
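Before deploying a rule like this, it can help to dry-run the regex against a list of paths. The Python sketch below copies the pattern from the directive above and uses an unanchored search, which is how I understand RedirectMatch to apply its regex to the URL path. Python's re module is not Apache's regex engine, but for a simple alternation like this one the two should agree. The helper name and the "legit" sample paths are hypothetical.

```python
import re

# Pattern copied from the RedirectMatch directive above.
PATTERN = re.compile(
    r"/(ajuda|areas-de|atletico|bed-bumper|benutzer|bishop|brocolis|browserconfig"
    r"|bryant|cadastro|cardapio|chosen\.css|cnmdirj|coletiva|comparetable|contato"
    r"|curriculo|east-miss|estrella|federico|halfandhalf|hillcrest"
    r"|international-college|localizacao|loesungen|padrao|pagamentos|pizzas"
    r"|pramukh|pressem|produkt|produto|quatro-queijos|rankings-and|regency-beauty"
    r"|restaurante|social-circuit|there-is-no-way-this-exists|tricoci|union-college)"
)


def would_get_410(path):
    """True if the URL path contains a slash followed by a targeted string."""
    return PATTERN.search(path) is not None


paths = [
    "/produto/napolitana",        # weird 404 from the logs
    "/cardapio/pizzas-doces",     # weird 404 from the logs
    "/halfandhalf/1",             # weird 404 from the logs
    "/themes/__css/boot.css",     # not covered by the pattern
    "/css/example-snippet/",      # hypothetical legit path
    "/archive/bishop-notes/",     # hypothetical path that WOULD be caught
]

for path in paths:
    print(path, would_get_410(path))
```

The last sample path shows why the list needs care: the pattern is not anchored to the start of the path, so a targeted string following any slash triggers the 410.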

Note: if you decide to try a solution such as this, the code itself should be placed in your site’s root .htaccess file. RedirectMatch is provided by Apache’s mod_alias (not mod_rewrite), so make sure that module is enabled on the server. Of course, you’ll want to customize the regular expression to match whatever strings you need to target.

Update: finally a referrer is reported

After collecting crawl data for several weeks with nary a referrer reported, finally the following visit appears today in my server log:

TIME: February 19th 2016, 07:38pm
*404: http://example.com/themes/__css/boot.css
SITE: https://example.com/
REFERRER: https://www.montarepizzaria.com.br/cardapio-pizzaria/?PDT=1&n=Esfihas%20Salgadas&lista=1
HTTP HOST: example.com
HOST: crawl-66-249-73-213.googlebot.com
PROXY: 66.249.73.213
QUERY STRING: undefined
REMOTE ADDRESS: 66.249.73.213
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This request is for a resource that had not yet been added to the .htaccess snippet:

http://example.com/themes/__css/boot.css

This looks so common to me that I almost ignored it.. but then taking a closer look at the referrer:

https://www.montarepizzaria.com.br/cardapio-pizzaria/?PDT=1&n=Esfihas%20Salgadas&lista=1

..I noticed the string, “cardapio”, which is one of the 404 strings blocked in the previous .htaccess snippet. So it’s safe to assume relation.

Needless to say, I was excited to discover an actual referrer, and began to investigate immediately. Unfortunately, the results were inconclusive. Not only does the site suffer from misconfigured SSL/HTTPS, it also redirects immediately to another page as soon as the associated non-SSL page is loaded. So there is no way to examine the actual referring page, and the redirected page yields no useful information. I’m keeping my eye on it though, and will follow up with any juicy developments.

Stay tuned and thanks for reading.

References & Resources