Perishable Press

(Please) Stop Using Unsafe Characters in URLs

Just as there are specifications for designing with CSS, HTML, and JavaScript, there are specifications for working with URIs/URLs. The Internet Engineering Task Force (IETF) clearly defines these specifications in numerous documents, including RFC 3986 and RFC 1738.

The specifications for Uniform Resource Identifiers (URIs) and more specifically Uniform Resource Locators (URLs) provide a safe, consistent way to request, identify, and resolve resources on the Internet. As clearly stated in RFC3986:

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. This specification defines the generic URI syntax and a process for resolving URI references that might be in relative form, along with guidelines and security considerations for the use of URIs on the Internet. The URI syntax defines a grammar that is a superset of all valid URIs, allowing an implementation to parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier.

Thanks to the brilliant work of experts such as Tim Berners-Lee, Roy Fielding, Larry Masinter, and Mark McCahill, developers have a safe, consistent protocol for working with URIs/URLs on the Web. It is important that we adhere to these specifications when developing software, plugins, apps, and the like. Failing to do so introduces potential security vulnerabilities which may be exploited by nefarious individuals and malicious scripts.

Character Encoding Chart

To help promote the cause of Web Standards and adhering to specifications, here is a quick reference chart explaining which characters are “safe” and which characters should be encoded in URLs.

Classification: Safe characters
Included characters: Alphanumerics [0-9a-zA-Z]; the special characters $ - _ . + ! * ' ( ) , (the comma is part of the set); and reserved characters used for their reserved purposes (e.g., a question mark used to denote a query string)
Encoding required? NO

Classification: ASCII control characters
Included characters: The ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F hex (127 decimal)
Encoding required? YES

Classification: Non-ASCII characters
Included characters: The entire “top half” of the ISO-Latin set, 80-FF hex (128-255 decimal)
Encoding required? YES

Classification: Reserved characters
Included characters: ; / ? : @ = & (does not include the blank space)
Encoding required? YES [1]

Classification: Unsafe characters
Included characters: The blank/empty space and " < > # % { } | \ ^ ~ [ ] `
Encoding required? YES

[1] Reserved characters only need to be encoded when not used for their defined, reserved purposes.
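
To make the chart a little more concrete, here is a minimal Python sketch that sorts a single character into one of the classifications above. The function name `classify` and the category labels are my own placeholders, not anything defined by the RFCs.

```python
import string

# Character classes as listed in the chart (RFC 1738 terminology).
ALPHANUMERIC = set(string.ascii_letters + string.digits)
SPECIAL = set("$-_.+!*'(),")
RESERVED = set(";/?:@=&")
UNSAFE = set(' "<>#%{}|\\^~[]`')


def classify(ch: str) -> str:
    """Return the chart classification for a single character."""
    code = ord(ch)
    if ch in ALPHANUMERIC or ch in SPECIAL:
        return "safe"
    if ch in RESERVED:
        return "reserved"
    if ch in UNSAFE:
        return "unsafe"
    if code <= 0x1F or code == 0x7F:
        return "control"
    # Everything left is above 7F hex, i.e. outside US-ASCII.
    return "non-ascii"


for ch in ["a", ",", "?", "[", " ", "\t", "é"]:
    print(repr(ch), classify(ch))
```

Note that "reserved" here only flags the character; whether it actually needs encoding depends on whether it is being used for its reserved purpose.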

Unsafe Characters

More about “unsafe” characters from RFC1738:

Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters “<” and “>” are unsafe because they are used as the delimiters around URLs in free text; the quote mark (“"”) is used to delimit URLs in some systems. The character “#” is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character “%” is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are “{”, “}”, “|”, “\”, “^”, “~”, “[”, “]”, and “`”.

All unsafe characters must always be encoded within a URL. For example, the character “#” must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.
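
As a rough illustration of what encoding these characters looks like, the following Python sketch percent-encodes every character in the RFC1738 “unsafe” list. The helper name `encode_unsafe` and the sample file name are hypothetical. I am doing the encoding by hand here because Python's built-in `urllib.parse.quote` follows the later RFC3986 and leaves “~” unencoded.

```python
# The RFC 1738 "unsafe" set: space plus " < > # % { } | \ ^ ~ [ ] `
UNSAFE = ' "<>#%{}|\\^~[]`'


def encode_unsafe(url_part: str) -> str:
    """Percent-encode every RFC 1738 "unsafe" character in url_part."""
    return "".join(
        f"%{ord(ch):02X}" if ch in UNSAFE else ch
        for ch in url_part
    )


print(encode_unsafe("my file#1 [draft].txt"))
# my%20file%231%20%5Bdraft%5D.txt
```

Because “%” is itself in the unsafe list, this sketch should only be run on raw, not-yet-encoded text; running it twice would double-encode the result.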

Reserved Characters

More about “reserved” characters from RFC1738:

Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters “;”, “/”, “?”, “:”, “@”, “=” and “&” are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme.

Usually a URL has the same interpretation when an octet is represented by a character and when it encoded. However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL.

Thus, only alphanumerics, the special characters “$-_.+!*'(),”, and reserved characters used for their reserved purposes may be used unencoded.

On the other hand, characters that are not required to be encoded (including alphanumerics) may be encoded within the scheme-specific part of a URL, as long as they are not being used for a reserved purpose.
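
The point about reserved characters changing the semantics of a URL can be seen with Python's standard `urllib.parse` module. This is only a small sketch; the `/docs` path and the query values are made up for the example.

```python
from urllib.parse import parse_qs, urlsplit

# "?" used for its reserved purpose: it separates the path from the query.
plain = urlsplit("http://example.com/docs?page=2")
print(plain.path, plain.query)    # path is /docs, query is page=2

# The same character percent-encoded is just data inside the path.
escaped = urlsplit("http://example.com/docs%3Fpage=2")
print(escaped.path)               # /docs%3Fpage=2
print(repr(escaped.query))        # '' (there is no query string at all)

# "&" and "=" behave the same way inside a query string.
print(parse_qs("q=a&b=c"))        # {'q': ['a'], 'b': ['c']}
print(parse_qs("q=a%26b%3Dc"))    # {'q': ['a&b=c']}
```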

URLs in HTML and JavaScript

In earlier versions of HTML, the entire range of the ISO-8859-1 (ISO-Latin) character set could be used in documents. Since HTML4, the entire Unicode character set may also be used. In HTTP, however, the range of allowed characters is expressly limited to only a subset of the US-ASCII character set (see the Character Encoding Chart for details).

So, when writing HTML, ISO and Unicode characters may be used everywhere in the document except where URLs are referenced*. This includes the following elements:

<a>, <applet>, <area>, <base>, <bgsound>, <body>,
<embed>, <form>, <frame>, <iframe>, <img>, <input>, <link>,
<object>, <script>, <table>, <td>, <th>, <tr>
* Update: As Mathias explains, “it’s perfectly okay to leave those symbols unencoded, as browsers will take care of them as per the URL parsing algorithm in the HTML spec.”

As flexible as HTML is in terms of which characters may be used, there are strict limits to which characters may be used when referencing URLs. This limitation applies not only to URLs used in HTML, but also to URLs referenced in any coding language (e.g., JavaScript, PHP, Perl, etc.).
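
For scripts that build URLs themselves, here is a minimal sketch of encoding non-ASCII characters before a request is made, using Python's `urllib.parse.quote` (which encodes as UTF-8 by default). The path is hypothetical.

```python
from urllib.parse import quote, unquote

path = "/café/☺"

# quote() leaves "/" alone by default and percent-encodes the UTF-8 bytes
# of anything outside its unreserved set.
encoded = quote(path)
print(encoded)  # /caf%C3%A9/%E2%98%BA

# Decoding restores the original text.
assert unquote(encoded) == path
```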

Unsafe Characters in WordPress

In version 3.5, WordPress uses improper, unencoded URLs to enqueue JavaScript libraries. Specifically, in the WP Admin area, various URLs are called using square brackets “[” and “]”, which are clearly classified as unsafe characters. Here is an example:

http://example.com/wp-admin/load-scripts.php?c=1&load[]=swfobject,jquery,utils&ver=3.5
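
For comparison, here is a sketch of how the same query string could be built with the square brackets encoded, using Python's `urllib.parse.urlencode`. This is not how WordPress builds the URL; it only shows what the encoded form would look like.

```python
from urllib.parse import urlencode

params = [("c", "1"), ("load[]", "swfobject,jquery,utils"), ("ver", "3.5")]

# safe="," keeps the commas readable; they are in the "safe" set anyway.
query = urlencode(params, safe=",")
url = "http://example.com/wp-admin/load-scripts.php?" + query
print(url)
# http://example.com/wp-admin/load-scripts.php?c=1&load%5B%5D=swfobject,jquery,utils&ver=3.5
```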

Also affecting the WordPress Admin, here is an example of unsafe characters in URLs, pointed out in this comment:

http://test.site/wp-admin/post.php?t=1347548645469?t=1347548651124?t=1347548656685?t=1347548662469?t=1347548672300?t=1347548681615?

The specification reserves the question mark “?” for denoting a query string; when used for any other purpose, it must be encoded. Unfortunately, WordPress includes multiple unencoded question marks in URLs involved with its “preview” functionality. In other words, in any URL, the first question mark “?” may be left unencoded to denote the query string, but subsequent “?” characters must be encoded.
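
As a minimal sketch of that rule, the following Python helper leaves the first “?” alone and encodes any that follow. The function name is hypothetical, and this only illustrates the encoding itself; it does nothing about why the extra parameters were appended in the first place.

```python
def encode_extra_question_marks(url: str) -> str:
    """Keep the first "?" as the query delimiter; encode the rest as %3F."""
    head, sep, rest = url.partition("?")
    return head + sep + rest.replace("?", "%3F")


print(encode_extra_question_marks(
    "http://test.site/wp-admin/post.php?t=1347548645469?t=1347548651124?"
))
# http://test.site/wp-admin/post.php?t=1347548645469%3Ft=1347548651124%3F
```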

These errors may not be a huge deal, but they increase potential vulnerability and certainly should be fixed in the next WP update. Likewise, future versions of WordPress should keep URI/URL specifications in mind and verify that all URLs are properly encoded.

A Dangerous Trend

WordPress isn’t the only popular piece of software that’s not following specification; rather, we’re seeing a disturbing trend wherein big companies such as Google are including unsafe characters in their URLs. Here is a recently reported example:

http://blog.sergeys.us/beer?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+SergeySus+(Sergey+Sus+Photography+%C2%BB+Blog)&utm_content=Google+Reader

Notice the unencoded “:”? Apparently Google is including them in URLs for FeedBurner and Google Reader. Hopefully this is just an oversight that will be corrected in a future update.

For more examples of unsafe characters in popular apps and plugins, scan through some of the comments left on my 5G, 6G (beta), and BBQ plugin.

5G/6G Blacklist

For the record, the 5G Blacklist, 6G Blacklist (beta) — and all of my blacklists for that matter — are built on the foundation of IETF specifications. As explained in detail here and here, the .htaccess rules used in my G-series firewalls are designed to block malicious URL requests such as those that contain unsafe characters. Other firewall/security plugins and scripts operate in similar fashion, using standards and specifications to determine which URLs are potentially dangerous.

Developers, please stop using unsafe characters in URLs.

Many people rely on such plugins and blacklists to help protect their sites against threatening activity, but such security measures fail when developers ignore specification and include unencoded characters in URLs. Worse, by introducing inconsistency into the system, noncompliant scripts pose a potential security risk and open the doors to attacks.

WordPress and 5G Blacklist

As mentioned, WordPress 3.5 includes unencoded square brackets in various URLs in the Admin area. As explained, the 5G Blacklist blocks such unsafe characters to help users secure their WP-powered sites. Thus, if you’re running both WordPress and 5G, there will be an issue wherein certain URL requests are denied with a “403 – Forbidden” response.

So, until WordPress can get things fixed up, here is how to modify the 5G Blacklist (don’t even think about modifying any WP core files) to “allow” those unsafe URLs to pass through the firewall.

Step 1

In the 5G Blacklist, locate this section of code:

# 5G:[QUERY STRINGS]
<ifModule mod_rewrite.c>
 RewriteEngine On
 RewriteBase /
 RewriteCond %{QUERY_STRING} (environ|localhost|mosconfig|scanner) [NC,OR]
 RewriteCond %{QUERY_STRING} (menu|mod|path|tag)\=\.?/? [NC,OR]
 RewriteCond %{QUERY_STRING} boot\.ini  [NC,OR]
 RewriteCond %{QUERY_STRING} echo.*kae  [NC,OR]
 RewriteCond %{QUERY_STRING} etc/passwd [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\%27$   [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\\'$    [NC,OR]
 RewriteCond %{QUERY_STRING} \.\./      [NC,OR]
 RewriteCond %{QUERY_STRING} \?         [NC,OR]
 RewriteCond %{QUERY_STRING} \:         [NC,OR]
 RewriteCond %{QUERY_STRING} \[         [NC,OR]
 RewriteCond %{QUERY_STRING} \]         [NC]
 RewriteRule .* - [F]
</ifModule>

Step 2

Replace that entire block of code with this revised version that excludes the rules that block the unsafe characters:

# 5G:[QUERY STRINGS]
<ifModule mod_rewrite.c>
 RewriteEngine On
 RewriteBase /
 RewriteCond %{QUERY_STRING} (environ|localhost|mosconfig|scanner) [NC,OR]
 RewriteCond %{QUERY_STRING} (menu|mod|path|tag)\=\.?/? [NC,OR]
 RewriteCond %{QUERY_STRING} boot\.ini  [NC,OR]
 RewriteCond %{QUERY_STRING} echo.*kae  [NC,OR]
 RewriteCond %{QUERY_STRING} etc/passwd [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\%27$   [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\\'$    [NC,OR]
 RewriteCond %{QUERY_STRING} \.\./      [NC,OR]
 RewriteCond %{QUERY_STRING} \?         [NC,OR]
 RewriteCond %{QUERY_STRING} \:         [NC]
 RewriteRule .* - [F]
</ifModule>

Done. No further edits should be required, unless you’ve made any of your own modifications.
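
If you want to sanity-check the change before touching a live site, here is a rough Python approximation of the query-string rules. It is only a sketch: Python's `re` module is not the regex engine Apache uses, and the names `COMMON`, `BRACKETS`, and `is_blocked` are my own. The patterns themselves are copied from the rules above.

```python
import re

# Patterns shared by both versions of the 5G query-string block.
COMMON = [
    r"(environ|localhost|mosconfig|scanner)",
    r"(menu|mod|path|tag)\=\.?/?",
    r"boot\.ini",
    r"echo.*kae",
    r"etc/passwd",
    r"\=\\%27$",
    r"\=\\\'$",
    r"\.\./",
    r"\?",
    r"\:",
]
# The two square-bracket rules that Step 2 removes.
BRACKETS = [r"\[", r"\]"]


def is_blocked(query: str, patterns) -> bool:
    """True if any pattern matches; [NC] means case-insensitive."""
    return any(re.search(p, query, re.IGNORECASE) for p in patterns)


wp_query = "c=1&load[]=swfobject,jquery,utils&ver=3.5"
print(is_blocked(wp_query, COMMON + BRACKETS))  # True (original rules)
print(is_blocked(wp_query, COMMON))             # False (revised rules)
```

The WordPress query is rejected by the original rule set only because of the bracket patterns, while a request such as `file=../../etc/passwd` is still caught by the revised set.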

Take-home message

When developing for the Web, adherence to standards and protocols is important. By taking the time to properly encode your URLs, you eliminate inconsistency, eliminate vulnerabilities, facilitate extensibility, and ensure proper functionality. Hopefully this article serves as a reminder and helps clear up any confusion about which characters need to be encoded and why it’s so important to do so.

Jeff Starr
About the Author Jeff Starr = Fullstack Developer. Book Author. Teacher. Human Being.
24 responses
  1. This post is racist. The web is evolving, and there is enough space for other charsets, not only English chars.

    • Jeff Starr

      Wow, interesting opinion, emrah. Thanks for chiming in with that.

      I should add the post merely strives to explain existing specifications (and why they’re important), it makes no value judgments one way or another regarding which character sets are better/worse than others. Totally not the point here.

      And I’m pretty sure that UTF-8 includes non-english characters as well, so no need to start accusing anything/anyone of being “racist”. Sheesh.

  2. Admirably thorough article. Let’s hope Google is reading.

  3. Mathias Bynens January 2, 2013 @ 4:18 am

    The specifications for Uniform Resource Identifiers (URIs) and more specifically Uniform Resource Locators (URLs) provide a safe, consistent way to request, identify, and resolve resources on the Internet. As clearly stated in RFC3986: […]

    If only that were true. Sadly, RFC3986 doesn’t match reality. That’s why Anne van Kesteren has been working on a URL spec based on existing implementations.

    In HTTP, however, the range of allowed characters is expressly limited to only a subset of the US-ASCII character set […] So, when writing HTML, ISO and Unicode characters may be used everywhere in the document except where URLs are referenced.

    That’s not true at all. HTML != HTTP. <a href=☺>…</a> is perfectly valid HTML. It’s up to browsers to resolve special characters to their percent-encoded escape sequences as needed before any HTTP requests are made to the URL. For more information, see the URL parsing algorithm in the HTML spec (404 link removed 2014/07/20).

    Also, in a blog post on special characters in URLs, you might want to mention Punycode.

    • Jeff Starr

      Great feedback, Mathias, thank you. The new spec looks ambitious and promising. It will be interesting to watch as things unfold in 2013.

      For your second point, I guess I’m confused. Researching this topic on the Web, I came across numerous sources, including this article, which seem to contradict what you’re saying here. Also, not all HTTP requests are initiated by browsers, especially those involving unsafe characters, which are usually transmitted via script, command line, etc. The main point of the article is that unsafe characters should not be present in the URL, according to the referenced specifications. A smiley face may be perfectly valid HTML, but it needs to be encoded to be safe for HTTP.

      Then again, my expertise is admittedly limited in this area, so any further info is most welcome.

      • Mathias Bynens January 3, 2013 @ 6:04 am

        Hey Jeff, thanks for the reply! I’ll try to explain what I meant exactly, in case it was unclear.

        I fail to see where that article contradicts anything I’m saying. You’re right that those special characters shouldn’t be used without encoding them in URLs as far as HTTP and RFC3986 are concerned — but contrary to what you’re saying here, in HTML it’s perfectly okay to leave those symbols unencoded, as browsers will take care of them as per the URL parsing algorithm in the HTML spec.

      • Jeff Starr

        Ah, that makes sense, thanks for explaining. I went ahead and added a note in that section of the article to clarify.

        Question: where in the spec does it explain how browsers should handle unsafe and reserved characters? For example, if I have an HTML document that references a URL containing, say, square brackets (classified as “unsafe”):

        http://example.com/wp-admin/load-scripts.php?c=1&load[]=swfobject,jquery,utils&ver=3.5

        As mentioned, the issue with WordPress is that browsers aren’t encoding the square brackets that are included in some URLs. Similarly with Google, there are additional question marks (“reserved” characters) included in URLs that aren’t getting encoded. It would be great to hear your thoughts on what’s happening (or not) with this issue.

  4. Thanks for writing this post — I had no idea that an URL itself could be unsafe.

    Google also uses “?” when you create goal-specific URLs in Google Analytics.

    And Amazon loves making long URLs when you’re searching on amazon.com. Example (my business partner’s books): http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Ddigital-text&field-keywords=Phyllis+Zimbler+Miller

    (Although, I could be wrong: when ? or % is displayed does that mean it’s always un-encoded or is Firefox displaying encoded characters as un-encoded?)

    Ok, so where can I find the information to encode these unsafe characters?

    • Jeff Starr

      Hi Yael, thanks for the feedback. Just to be clear, the first instance of “?” in a URL denotes a query string and is valid; it’s only subsequent instances of “?” that need encoding. Also, the Amazon URL you mention looks valid, and technically there is nothing wrong with “long” URLs (although they can be unwieldy to work with).

      As Mathias mentions in the previous comment, it’s up to browsers to properly encode/escape URLs before sending the request, but I’m not sure if that’s always the case.

      And to encode unsafe characters, any online URL decoder/encoder should do the job, for example this one. There are various conversion charts also available online. If I have time, I’ll try posting an article about this.

  5. You
    Are
    Brilliant!

    thanks for sharing 5G and this solution to my newly 403 problems

  6. Sean Ellingham January 16, 2013 @ 9:13 am

    Unless I’m reading the RFCs wrongly (or misinterpreting your post), saying that all reserved characters must always be encoded is incorrect. For example, in an HTTP URL (at least as far as I can tell), the reserved characters “;”, “:”, “@”, “&” and “=” are perfectly acceptable in the path and query string without being encoded – see page 17 of RFC1738.

    • Jeff Starr

      saying that all reserved characters must always be encoded is incorrect.

      I don’t think I say that anywhere in the article, so if there is some confusion please let me know so I may clarify.

      Also, page 17 of RFC1738 refers to “ip based protocols”, which use the reserved characters according to their specifically defined reserved purpose. At least, that’s how I currently understand the RFC, please advise if I am misguided.

      • Sean Ellingham January 16, 2013 @ 3:46 pm

        True, it’s not said directly, but when I read the article that was the impression I received. I believe the culprit is the quick reference chart, which says encoding is required for everything except safe characters.

        The particular part of page 17 of RFC1738 I was referring to was this section:

        ; HTTP

        httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
        hpath = hsegment *[ "/" hsegment ]
        hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
        search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

        My understanding of that (with the definitions around it) was that those five reserved characters were allowed as is in the path and query string. Then again, that’s just my interpretation – I’m not making any claims that I’m an expert in this area, so I’ll gladly defer to better judgement. These RFCs aren’t exactly the easiest things to get your head around!

      • Jeff Starr

        Thanks for clarifying, I’ve updated the chart with a note about encoding of reserved characters.

        For RFC, page 3 states:

        “Thus, only alphanumerics, the special characters “$-_.+!*'(),”, and reserved characters used for their reserved purposes may be used unencoded within a URL.”

        I think the information you reference on p17 is showing that reserved characters are acceptable if used according to definition (i.e., p3).

        But you are correct that the RFCs are not the easiest thing to understand. And like you, I’m no expert in this area so if anything is incorrect I’ll be glad to revise accordingly.

      • Sean Ellingham January 16, 2013 @ 4:48 pm

        Hmm, I obviously missed that line on page 3 – that sentence does seem fairly conclusive. What is confusing me though is that the definitions for the schemes listed on pages 17 onwards seem to conflict with that, by implying that the reserved characters can potentially be used in certain other parts of a URL. Using HTTP as an example, if the definitions are expanded, it seems to say that a path can comprise multiple “/” delimited sections, where a section comprises any number of: unreserved characters, escape sequences, or the “:”, “;”, “@”, “&” or “=” characters – it is because these characters are explicitly listed, combined with the following statement on page 8, that I believe them to be valid.

        Within the and components, “/”, “;”, “?” are
        reserved. The “/” character may be used within HTTP to designate a
        hierarchical structure.

        However, it is worth considering that (if my interpretation is correct) whilst RFC1738 seems to allow unencoded “&” and “=” in the query string, we would expect these to be ‘reserved’ as we are used to their use in application/x-www-form-urlencoded style form submissions – although I don’t know where that encoding of key/value pairs is specified (I believe it might be in the HTML specification, although I’m not sure).

      • Sean Ellingham January 16, 2013 @ 4:51 pm

        Oops, the quote from page 8 got eaten as I forgot to replace the angle brackets with the HTML entities – need to get to bed! Here’s what it should have said:

        “Within the <path> and <searchpart> components, “/”, “;”, “?” are
        reserved. The “/” character may be used within HTTP to designate a
        hierarchical structure.”

  7. Really liked what you had to say in your post, (Please) Stop Using Unsafe Characters in URLs : Perishable Press, thanks for the good read!
    — Riva

    http://www.terrazoa.com

  8. I checked RFC 1738 since I was having a problem with commas in request strings triggering 403s using the 5G .htaccess.

    It appears that commas are fine, defined as “extra” characters (see the BNF section), one of the allowable sets of unreserved characters (alpha | digit | safe | extra). There is a comma in your initial list of safe characters in the OP, but it’s ambiguous whether it’s in the list or just punctuation. The comma does not seem to belong in the Reserved list.

  9. very interesting. I recently developed a search interface that included arrays of checkboxes and used the get method so that searches could be saved. I wonder if my technique is using unsafe characters or if this is acceptable:

    the options[] field is added as a url parameter like this:

    ?options%5B%5D=large&options%5B%5D=medium
