Perishable Press

Block nuisance requests for .well-known, apple-app, etc.

Anyone who is paying attention to their server access and error logs has probably noticed that Google and other bots have been making endless requests for .well-known, apple-app-site-association, and various related files. This quick post explains how to save some server bandwidth and resources by blocking such repetitive requests, and also looks at a related problem with certain search engines <cough> not respecting a standard “410 Gone” server response.

What’s up

For the past several months I’ve noticed an uptick in requests for the following resources:

apple-app-site-association
.well-known/apple-app-site-association
.well-known/assetlinks.json

Googlebot especially is continually snooping for these files, even if there is nothing that actually links to them. I first noticed this trend while examining my sites in Google Webmaster Tools. Every site, every crawl, googlebot and others are requesting these files.

And that’s not a bad thing IF the files actually exist. But they don’t on my server, and I am getting tired of googlebot not heeding a simple “410 Gone” response, which I serve here on this site for example, for any/all requests for any of the above files.
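For readers who want to try the 410 approach, it could be served with mod_alias along these lines. This is a minimal sketch that assumes mod_alias is enabled, and it is not necessarily the exact set of rules used on this site:

```apache
# Sketch: answer nuisance requests with 410 Gone
# Assumes none of these files exist on the server
<IfModule mod_alias.c>
	# (?i) makes the match case-insensitive; no target URL is given for a non-3xx status
	RedirectMatch 410 (?i)\.well-known
	RedirectMatch 410 (?i)apple-app-site-association
</IfModule>
```

The first pattern also covers .well-known/assetlinks.json, because the match is not anchored to the start or end of the request path.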

Wake up

And why is Google reporting 410 responses as if they were 404? By definition a 410 response is designed to convey a clear message that the resource does not exist; i.e., it’s GONE. 410 is meant to provide webmasters with a way to clean up their servers. According to specification:

The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent.

So please pay attention and make a note, googlebot. 410 does NOT mean please keep checking over and over and over again because the resource is not found — it means that the resource is GONE. Please wake up, Google.

Shut up

So after a few months of getting these endless requests for this particular set of files, I finally decided to do something about it. Here is a quick snippet that I’ve been adding to my sites, basically telling unruly bots to shut up:

# Block nuisance requests
<IfModule mod_alias.c>
	RedirectMatch 403 (?i)\.well-known
	RedirectMatch 403 (?i)apple-app-site-association
</IfModule>

That will block all requests for .well-known and apple-app-site-association. So only implement these directives if you’re sure that these files do not exist on your server. Notice that we’re serving a crystal-clear 403 Forbidden response. At the time of this writing, Google seems to understand and respect the meaning of this particular response code, and thus the requests do not appear in the “errors” section of Webmaster Tools.
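If part of .well-known is actually in use on your server, the block can be narrowed with a negative lookahead. The following is only a sketch: the acme-challenge directory is an assumed example (it is the path used for HTTP-based certificate validation by ACME clients such as Let’s Encrypt), so substitute whatever path your server really serves:

```apache
# Sketch: block .well-known requests except one subdirectory that is in use
# "acme-challenge" is an assumption for illustration, not part of the original snippet
<IfModule mod_alias.c>
	# (?!acme-challenge/) is a negative lookahead: match only when that directory does NOT follow
	RedirectMatch 403 (?i)\.well-known/(?!acme-challenge/)
	RedirectMatch 403 (?i)apple-app-site-association
</IfModule>
```

Note that the first pattern requires the trailing slash, so a bare request for /.well-known falls through to the server’s normal handling instead of getting a 403.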

[ Google Webmaster Tools - Not Understanding the Meaning of a 410 Response ]
Case in point: these stupid 410 URLs are IMPOSSIBLE to get rid of because Webmaster Tools doesn’t respect 410

Technically a 410 Gone response would be more accurate, but as explained Google doesn’t seem to comprehend the meaning of an explicit message telling them that the requested resource is GONE. Most good bots understand and respect 410, and remove the resource from memory, so as to not keep endlessly requesting it. You know, so they’re not wasting time, energy, and resources.

Thus the whole point and not-so-hidden moral of the story:

410 was once used to erase a resource from memory, but now alas it’s meaningless because the largest search engine in the world treats it like a common 404.

And there’s nothing that any of us can do about it.

Jeff Starr
About the Author Jeff Starr = Creative thinker. Passionate about free and open Web.
9 responses
  1. [No I will not fix your computer] August 29, 2016 @ 1:05 pm

    I was wondering why the Google bots were acting relatively stupid with these requests – although I didn’t think they were bad (of course the wasted bandwidth as you mentioned is a nuisance).

    I might as well make a rule for the ridiculous https://{domain}/https:/{random letter here} requests I keep finding as well, which tend to make routine rounds in my logs from Google, Yahoo, and now MSN seems to be in on that action.

    • Jeff Starr

      What’s “bad” is the fact that Google effectively nullifies the usefulness of the 410 response. For years 410 gave us a way to clean up our servers and stop annoying 404 requests for non-existent resources. Now, thanks to Google, 410 is practically useless, as a majority of traffic comes from their search results. In other words, with Google it is no longer possible to permanently remove a resource from their index (whether or not the resource actually existed in the first place).

      • [No I will not fix your computer] August 30, 2016 @ 11:12 am

        I’ve been crawling the Google support section all morning trying to find a contact form (or something) to inform them of these errors in regards to their bots, but to no avail. It’s getting to the point that they’re looking for pages from other domains on the websites I monitor with the .well-known and the apple-app query strings causing some serious errors.

        Ideas?

      • Jeff Starr

        Yep, that’s another good reason to apply the .htaccess snippet provided in the tutorial. Just to stop endless 404 requests and to help Google understand that those files don’t always exist.

        It’s just pathetic that googlebot requests these files when there is absolutely nothing on the planet that links to them. They are assuming the files exist, and wasting a LOT of bandwidth, energy, time, and resources in the process. And even worse, when you try to tell them that the files do not exist (i.e., via 410), they totally ignore the response and continue mindlessly requesting the same files over and over.

        Frustrating? You betcha.

  2. Paul Driver September 4, 2016 @ 5:18 am

    Isn’t the .well-known rewrite rule an important thing for https these days?

    • Jeff Starr

      Only if you’re actually using it. As stated in the article:

      “And that’s not a bad thing IF the files actually exist.”

      So it may be important to some, but not all.

  3. Thanks for the tip, Jeff. I’d like to add the same blocking rules but using mod_rewrite, in case I discover those files are useful and decide to use them:

    RewriteBase /
    RewriteCond %{REQUEST_URI} ^/(\.well-known|apple-app-site-association) [NC]
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . - [R=403,L,NS,E=error_notes:well-known]

    However, I believe that it should be possible to find someone who works at Google and could pass this on to the search infrastructure team. There are real people working there, not some unapproachable supernatural beings.

    • Jeff Starr

      Thanks for the feedback. In my experience getting through to someone at Google is difficult, if not next to impossible. Is there a telephone hotline to the Google search team? Can you reach them via snail mail? Smoke signals? Or do you have to know someone on the inside and pray that your message gets through to the “real people” working the controls?

      Sure you can spend all day posting in their “support” forums, but in my experience it’s just a huge waste of time. Google is at the point now where they are too big to care. They can do whatever they want and really there is nothing that you or I can do to change it.

      The best that people with limited time and resources can do (i.e., the “common folk”) is post about their experiences and hope that some of those “real people” at Google are listening. AND able to actually do something about it.

      Don’t EVEN get me started on how Gmail marks people’s email replies as spam. Google all the way, baby.

  4. [No I will not fix your computer] September 7, 2016 @ 6:52 am

    I would like to give this piece of code to those doing this on a Windows Server (I had to fix these up on both Linux & Windows):

    <rule name="Block Stupid GoogleBot" stopProcessing="true">
    	<match url=".*" ignoreCase="false" />
    	<conditions logicalGrouping="MatchAny">
    		<add input="{URL}" pattern="\.well-known" />
    		<add input="{QUERY_STRING}" pattern="\.well-known" />
    		<add input="{URL}" pattern="apple-app-site-association" />
    		<add input="{QUERY_STRING}" pattern="apple-app-site-association" />
    		<add input="{URL}" pattern="https:/[a-z]$" />
    		<add input="{QUERY_STRING}" pattern="https:/[a-z]$" />
    	</conditions>
    	<action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
    </rule>

    All problems gone.
