Analyzing Weird 404 Search Engine Requests
Lately I’ve been getting a significant number of really weird 404 requests for one of my sites. At first I ignored them. Then upon closer inspection, I realized that the requests were reporting user agents like Googlebot, Bingbot, and other top search engines. So there was cause for concern. You don’t want legitimate search engines tripping over endless 404 requests that are completely unrelated to your site content. That gets into “negative SEO” territory, and should be investigated and resolved asap. This article explains what I was dealing with, how I investigated, and what I did to resolve the issue.
Bizarre, unrelated requests
The site in question focuses on web design snippets. People who follow my work will know which site I’m talking about, but basically it’s all about code snippets and tutorials (hint: it’s not Perishable Press). In any case, when you have a site that focuses on a specific niche, you expect to see the following types of 404 errors every now and then:
- 404s due to moved resources
- 404s due to link typos
- 404s related to predictive algorithms (e.g., apple icons, favicons, etc.)
- 404s due to malicious scanning
- 404s that are similar in nature to actual/existing resources
These types of 404 “Not Found” errors are common and should be no reason for alarm, except maybe the exploit scanning, which you can protect against using firewalls such as 6G and BBQ Pro. But in general, these sorts of 404s are normal. Google itself basically says that there is no penalty for rocking a few 404s now and then. It’s just part of how the Web works. Stuff changes, resources move, no big deal.
http://example.com/produto/bacon-com-cheddar
http://example.com/produto/banana-com-chocolate
http://example.com/produto/bacon-3
http://example.com/produto/napolitana
http://example.com/cardapio/esfihas-salgadas/4
http://example.com/produto/carijo
http://example.com/produto/esfihas-salgadas-carne-com-catupiry
I mean, what in the world is going on with this sort of nonsense? These requests do not fit into any logical pattern of expected 404s for the site. Again, the site’s primary focus is code snippets. These requests, which were hitting daily by the tens of hundreds, are looking for resources related to food, colleges, and various international products. And in languages other than English. How on earth did Italian food resources and European colleges get associated with a website that’s focused on web-dev related topics?
In order to get to the bottom of the mystery, I began to delve a little deeper..
Examining the user agents, are they spoofed?
After getting sick and tired of endless weird 404 errors, I decided to investigate the reported user agents and IP addresses. Here is a small sampling of what I found:
http://example.com/produto/bacon-com-cheddar
REMOTE ADDRESS: 18.104.22.168
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

http://example.com/produto/banana-com-chocolate
REMOTE ADDRESS: 22.214.171.124
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

http://example.com/produto/bacon-3
REMOTE ADDRESS: 126.96.36.199
USER AGENT: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

http://example.com/produto/napolitana
REMOTE ADDRESS: 188.8.131.52
USER AGENT: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

http://example.com/cardapio/esfihas-salgadas/4
REMOTE ADDRESS: 184.108.40.206
USER AGENT: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

http://example.com/produto/carijo
REMOTE ADDRESS: 220.127.116.11
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

http://example.com/produto/esfihas-salgadas-carne-com-catupiry
REMOTE ADDRESS: 18.104.22.168
USER AGENT: Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Going into it, I was expecting to see a bunch of known bad bots, so I was definitely surprised to see that the requests were reporting user agents belonging to the major search engines. I figured that these requests had to be spoofed, as they were so far removed from anything remotely close to what the site was about content-wise. So the next logical step was to determine the legitimacy of these requests by further examining the server logs.
Verifying identity via log analysis
The easiest way to verify identity of anything hitting your site is to examine the server logs. Simply check the value of the HOST field and compare it with the reported user agent and other request variables to see if everything lines up. For example, my site logged the following request not too long ago:
TIME: February 2nd 2016, 01:32pm
REQUEST: http://example.com/%22%20title=%22.guessed%20post%20title%22%3e%3cimg%20src=%22https:/example.com/wp-content/themes/whatever/scripts/timthumb.php?src=http://wordpress.com.zapeljivka.si/tim.php
SITE: https://example.com/
REFERRER: undefined
QUERY STRING: src=http://wordpress.com.zapeljivka.si/tim.php
REMOTE ADDRESS: 22.214.171.124
PROXY ADDRESS: 126.96.36.199
HOST: www.woodpeckermarket.com
HTTP HOST: example.com
SERVER NAME: example.com
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
As you can see here, the user agent reports as “Googlebot”, but the HOST info reports www.woodpeckermarket.com, which clearly is not anything Google would be doing. If the weird Italian food requests had been logged similar to this, I would have dismissed the 404s as just another pathetic scanner doing the only thing they know how to do.
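To make that comparison less tedious across a large log, here is a minimal Python sketch of the same check. It is only an illustration, not the tooling used for this investigation: the function name `host_matches_agent` is hypothetical, and the host-name suffixes are assumptions taken from the log samples in this article, so verify them against each search engine’s own documentation before relying on them.

```python
# Host-name suffixes as they appear in this article's log samples.
# Treat this mapping as an assumption and extend it as needed.
CRAWLER_HOST_SUFFIXES = {
    "googlebot": (".googlebot.com",),
    "bingbot": (".search.msn.com",),
    "yahoo! slurp": (".yahoo.net",),
}


def host_matches_agent(user_agent, host):
    """Compare a logged HOST value with the crawler named in the user agent.

    Returns True or False when the user agent claims a known crawler,
    and None when it does not claim to be any crawler in the table.
    """
    agent = user_agent.lower()
    host = host.lower().rstrip(".")
    for token, suffixes in CRAWLER_HOST_SUFFIXES.items():
        if token in agent:
            return host.endswith(suffixes)
    return None
```

A False result flags an obvious fake such as the woodpeckermarket.com request above. A True result only means the log entry is internally consistent, not that the request is genuine.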
Unfortunately, that’s not what the server log revealed. Instead, I discovered reports similar to the following (names of search engines added for clarity):
BING
TIME: January 28th 2016, 11:13pm
*404: http://example.com/produto/napolitana
SITE: https://example.com/
REFERRER: undefined
HTTP HOST: example.com
HOST: msnbot-207-46-13-85.search.msn.com
PROXY: 188.8.131.52
QUERY STRING: undefined
REMOTE ADDRESS: 184.108.40.206
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

GOOGLE
TIME: January 29th 2016, 06:10pm
*404: http://example.com/halfandhalf/1
SITE: https://example.com/
REFERRER: undefined
HTTP HOST: example.com
HOST: crawl-66-249-75-102.googlebot.com
PROXY: 220.127.116.11
QUERY STRING: undefined
REMOTE ADDRESS: 18.104.22.168
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

YAHOO
TIME: January 31st 2016, 10:47pm
*404: http://example.com/cardapio/pizzas-doces
SITE: https://example.com/
REFERRER: undefined
HTTP HOST: example.com
HOST: b115339.yse.yahoo.net
PROXY: 22.214.171.124
QUERY STRING: undefined
REMOTE ADDRESS: 126.96.36.199
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Notice the details reported for these requests: the host name matches up with both IP address and user agent. So the easy method of verifying illegitimacy won’t work. They look real based on the reported data. In my experience, most likely requests such as these are going to be legit, but honestly I don’t know whether it’s possible at any level to fake an IP address (the host name and user agent are trivial to spoof). So I needed to investigate further to solve the mystery.
Verifying identity via WHOIS/DNS Lookup
If analyzing log files proves insufficient, a quick forward/reverse DNS lookup should be more than convincing. There are numerous ways to do a forward/reverse lookup. The easiest method for most folks is to just use an online service, such as this one. For example, using that site to do a reverse DNS lookup of the IP reported by “Googlebot”, 188.8.131.52, gives the following host name:
Then doing an IP lookup, the service confirms that the request is made via Googlebot, returning the correct IP address:
So from this, I conclude that yes, it is in fact Google making the weird 404 requests at my site. It should be noted at this point that if the forward/reverse lookup had failed, the mystery would have deepened, because spoofing the source IP address of a completed HTTP request should not be practical: the TCP handshake only completes if the replies actually reach the sender.
Note also that it’s possible to run a forward/reverse DNS lookup directly via the command line, using the host command:
> host 184.108.40.206
220.127.116.11.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 18.104.22.168
Here we are using Terminal to first check the host of IP 22.214.171.124, and then the host of crawl-66-249-66-1.googlebot.com. Because the results match up, it’s safe to conclude that the reported identity is correct.
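The same forward/reverse check can be scripted when there are many addresses to verify. The following is a minimal Python sketch using only the standard library. The function name `verify_crawler_ip` and its parameters are hypothetical, the allowed suffix list is an assumption you supply yourself, and `socket.gethostbyname_ex` covers IPv4 only. The two resolver arguments exist so that the logic can be exercised without touching the network.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: IP -> host name -> IP.

    Returns True only if the reverse lookup yields a host name ending in
    one of allowed_suffixes AND the forward lookup of that host name
    returns the original IP address.
    """
    try:
        # Reverse lookup: first tuple element is the primary host name.
        host = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no reverse (PTR) record, or lookup failed
    if not host.endswith(tuple(allowed_suffixes)):
        return False
    try:
        # Forward lookup: third tuple element is the list of IPv4 addresses.
        addresses = forward(host)[2]
    except OSError:
        return False
    return ip in addresses
```

For example, `verify_crawler_ip(remote_address, [".googlebot.com"])` would perform both lookups for a logged address. The forward step matters because whoever controls an IP block can set its reverse record to any name they like, while only the owner of the domain controls where that name actually points.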
Performing this same test for the other reported search engines (Bing and Yahoo), everything turns up legit. In other words, the mystery 404s are real and need to be resolved.
Putting it all together
At this point in the story, I have determined the following:
- Bing, Google, Yahoo, et al are making some really funky 404 requests
- The weird links/text are nowhere to be found on my site or anywhere else
- The weird 404s involve food, colleges, and various international resources
- The search engines tend to crawl these links frequently and at the same time (every 15 minutes or so)
So what does all of this mean? Well I doubt there is some big conspiracy that involves the major search engines, Italian food, and my humble snippets site. So most likely there is something out there that is linking to my site — either purposely or inadvertently — using these weird URLs. So the question now is a matter of determining whether this behavior is malicious or not, and then how to deal with it. Is something out there feeding search engines “bad SEO” links to my site? Or is it more innocent than that.. I tend to be cynical about this sort of thing, but it is impossible to say for sure without more evidence.
Google Webmaster Tools to the rescue?
Trying to make sense of this to resolve the issue, I signed up for Google Webmaster Tools, hoping that their 404 data would reveal the source of the 404 errors. About a week later, the weird 404s were in fact reported by Google, but there was no associated referral/source information. So not really helpful other than to verify my own conclusions that the 404s were actually coming from Google et al.
So again, something out there, for whatever reason, is pointing the search engines at my site looking for all sorts of completely unrelated resources. To supplement the Webmaster Tools data, I tried using several backlink checker services available online. Not only were they not helpful, they were very frustrating and a total waste of time. The three or so services that I tried all promised the moon, but ended up delivering nothing while trying to convince me to pay for the very information that they said I would get if I simply signed up. Very deceptive and just yuck.
At this point, it finally was time to wrap things up and move on with my life..
Solution: 410 Gone
After collecting data for several weeks, it was time to conclude the investigation and implement a solution. Really the choices for handling the situation were few:
- Do nothing and hope that it all works out on its own
- Take advantage of the link juice and add some new pages to my site
- Handle the 404s via .htaccess
The first option is a tempting strategy for some aspects of working online, but not advisable when SEO may be involved. Better to be proactive about it. The second option, although potentially useful, would be counter-productive in this scenario. And so that leaves the obvious choice: manage the weird traffic at the server level with a snippet of .htaccess.
Handling these sorts of weird, nonexistent requests via .htaccess is straightforward. The only real question is the type of response to serve: 410, etc. We’ve already ruled out leaving it as 404, because we want to let the search engines know that the issue has been dealt with properly. In my experience, this situation calls for a 410 “Gone” response, because that’s pretty much what it is: the weird resources are just gone, as in they don’t exist. So it wouldn’t make sense to serve 403 “Forbidden”, as that would be sending the wrong signal.
Going with a 410 response, the last thing to decide is the targeted request vector, which can be anything available to Apache/.htaccess. As these are actual search engines making the requests, we definitely do NOT want to target the user agent, IP address, host, or any other identifying information. Rather it’s gonna make more sense to simply target via the request string, or even better target the least common set of unique characters.
In the examples given above, that clearly would be something like /produto/, which is a pattern that simply does not exist on any of my domains. So it would be an easy thing to do with zero chance of false positives.
Unfortunately, the variety of weird 404 errors goes far beyond just “produtos”. As mentioned above, the search engines keep coming around looking for resources related to food, colleges, and various international products. So it took some time to cull through the data and assemble a complete, concise list of targeted patterns.
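Culling through the data by hand works, but a quick tally can speed it up. As a rough sketch (not the process actually used here), the following Python groups a list of 404 URLs by their first path segment, so the most frequent candidates for the pattern list float to the top. The function name `tally_first_segments` is hypothetical, and it assumes you have already extracted the 404 URLs from your logs into a list.

```python
from collections import Counter
from urllib.parse import urlsplit


def tally_first_segments(urls):
    """Count how often each first path segment appears in a list of URLs."""
    counts = Counter()
    for url in urls:
        # Split the path on "/" and drop empty pieces from leading/trailing slashes.
        segments = [s for s in urlsplit(url).path.split("/") if s]
        if segments:
            counts[segments[0]] += 1
    return counts


sample = [
    "http://example.com/produto/bacon-com-cheddar",
    "http://example.com/produto/napolitana",
    "http://example.com/cardapio/esfihas-salgadas/4",
]
for segment, hits in tally_first_segments(sample).most_common():
    print(segment, hits)
```

Each candidate still needs a manual check against the site’s real URLs before it goes into a blocking rule.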
After gathering and sorting through all of the data, I wrote the following .htaccess directive:
RedirectMatch 410 /(ajuda|areas-de|atletico|bed-bumper|benutzer|bishop|brocolis|browserconfig|bryant|cadastro|cardapio|chosen\.css|cnmdirj|coletiva|comparetable|contato|curriculo|east-miss|estrella|federico|halfandhalf|hillcrest|international-college|localizacao|loesungen|padrao|pagamentos|pizzas|pramukh|pressem|produkt|produto|quatro-queijos|rankings-and|regency-beauty|restaurante|social-circuit|there-is-no-way-this-exists|tricoci|union-college)
This does exactly what it says: serves a 410 Gone response to any request that includes any of the patterns defined in the regex. For the number of weird 404s that this deals with, this solution really is quite elegant. And the best part is that none of the targeted patterns ever will be included in any URL on my site. So it’s a total victory for dealing with what could have been a creeping SEO nightmare.
Note: if you decide to try a solution such as this, the code itself should be placed in your site’s root .htaccess file. And make sure that mod_alias (the Apache module that provides RedirectMatch) is enabled on the server. Of course, you’ll want to customize the regular expression to match whatever strings you need to target.
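Before deploying a customized pattern, it is worth confirming that it cannot match any real URL on the site. Here is a minimal Python sketch of such a check. The pattern is a shortened copy of the directive above, the function name `false_positives` and the sample paths are hypothetical, and `re.search` is used as an approximation of how RedirectMatch applies its regex unanchored to the URL path.

```python
import re

# Shortened copy of the article's pattern, for illustration only.
GONE_PATTERN = re.compile(r"/(cardapio|halfandhalf|pizzas|produto)")


def false_positives(site_paths, pattern=GONE_PATTERN):
    """Return the legitimate site paths that the pattern would also match."""
    return [path for path in site_paths if pattern.search(path)]


# Hypothetical paths standing in for a site's real URL list (e.g. from a sitemap).
legit_paths = [
    "/snippets/htaccess-410-gone/",
    "/tutorials/pizzas-and-regex/",
]
print(false_positives(legit_paths))
```

Note that the second sample path is reported, because an unanchored pattern matches anywhere after a slash. An empty result is the goal. If a real URL shows up, make the offending term more specific or remove it.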
Update: finally a referrer is reported
After collecting crawl data for several weeks with nary a referrer reported, finally the following visit appears today in my server log:
TIME: February 19th 2016, 07:38pm
*404: http://example.com/themes/__css/boot.css
SITE: https://example.com/
REFERRER: https://www.montarepizzaria.com.br/cardapio-pizzaria/?PDT=1&n=Esfihas%20Salgadas&lista=1
HTTP HOST: example.com
HOST: crawl-66-249-73-213.googlebot.com
PROXY: 126.96.36.199
QUERY STRING: undefined
REMOTE ADDRESS: 188.8.131.52
REMOTE IDENTITY: undefined
USER AGENT: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
This request is for a resource that had not yet been added to the .htaccess snippet:
/themes/__css/boot.css
This looks so common to me that I almost ignored it.. but then taking a closer look at the referrer:
https://www.montarepizzaria.com.br/cardapio-pizzaria/?PDT=1&n=Esfihas%20Salgadas&lista=1
..I noticed the string, “cardapio”, which is one of the 404 strings blocked in the previous .htaccess snippet. So it’s safe to assume the two are related.
Needless to say, I was excited to discover an actual referrer, and began to investigate immediately. Unfortunately, however, the results were inconclusive. Not only does the site suffer from misconfigured SSL/HTTPS, it also redirects immediately to another page as soon as the associated non-SSL page is loaded. So there is no way to examine the actual referring page, and the redirected page yields no useful information. I’m keeping my eye on it though, and will follow up with any juicy developments.
Stay tuned and thanks for reading.
References & Resources
- Verify Googlebot
- Massive 404 attack with non existent URLs. How to prevent this?
- Can we spoof $_SERVER[‘REMOTE_ADDR’] / user ip with php cURL?