(Please) Stop Using Unsafe Characters in URLs

Just as there are specifications for designing with CSS, HTML, and JavaScript, there are specifications for working with URIs/URLs. The Internet Engineering Task Force (IETF) clearly defines these specifications in numerous documents, including the following:

The specifications for Uniform Resource Identifiers (URIs) and more specifically Uniform Resource Locators (URLs) provide a safe, consistent way to request, identify, and resolve resources on the Internet. As clearly stated in RFC3986:

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. This specification defines the generic URI syntax and a process for resolving URI references that might be in relative form, along with guidelines and security considerations for the use of URIs on the Internet. The URI syntax defines a grammar that is a superset of all valid URIs, allowing an implementation to parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier.

Thanks to the brilliant work of experts such as Tim Berners-Lee, Roy Fielding, Larry Masinter, and Mark McCahill, developers have a safe, consistent protocol for working with URIs/URLs on the Web. It is important that we adhere to these specifications when developing software, plugins, apps, and the like. Failing to do so introduces potential security vulnerabilities which may be exploited by nefarious individuals and malicious scripts.

Character Encoding Chart

To help promote the cause of Web Standards and adhering to specifications, here is a quick reference chart explaining which characters are “safe” and which characters should be encoded in URLs.

Classification Included characters Encoding required?
Safe characters Alphanumerics [0-9a-zA-Z], special characters $-_.+!*'(), and reserved characters used for their reserved purposes (e.g., question mark used to denote a query string) NO
ASCII Control characters Includes the ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F (127 decimal.) YES
Non-ASCII characters Includes the entire “top half” of the ISO-Latin set 80-FF hex (128-255 decimal.) YES
Reserved characters $ & + , / : ; = ? @ (not including blank space) YES*
Unsafe characters Includes the blank/empty space and " < > # % { } | \ ^ ~ [ ] ` YES

* Note: Reserved characters only need encoding when not used for their defined, reserved purposes.

Usafe Characters

More about “unsafe” characters from RFC1738:

Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters “<” and “>” are unsafe because they are used as the delimiters around URLs in free text; the quote mark (“"”) is used to delimit URLs in some systems. The character “#” is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character “%” is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are “{”, “}”, “|”, “\”, “^”, “~”, “[”, “]”, and “`”.

All unsafe characters must always be encoded within a URL. For example, the character “#” must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.

Reserved Characters

More about “reserved” characters from RFC1738:

Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters “;”, “/”, “?”, “:”, “@”, “=” and “&” are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme.

Usually a URL has the same interpretation when an octet is represented by a character and when it encoded. However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL.

Thus, only alphanumerics, the special characters “$-_.+!*'(),”, and reserved characters used for their reserved purposes may be used unencoded within a URL.

On the other hand, characters that are not required to be encoded (including alphanumerics) may be encoded within the scheme-specific part of a URL, as long as they are not being used for a reserved purpose.

URLs in HTML and JavaScript

In earlier versions of HTML, the entire range of the ISO-8859-1 (ISO-Latin) character set may be used in documents. Since HTML4, the entire Unicode character set may also be used. In HTTP, however, the range of allowed characters is expressly limited to only a subset of the US-ASCII character set (see the Character Encoding Chart for details).

So, when writing HTML, ISO and Unicode characters may be used everywhere in the document except where URLs are referenced*. This includes the following elements:

<a>, <applet>, <area>, <base>, <bgsound>, <body>,
<embed>, <form>, <frame>, <iframe>, <img>, <input>, <link>,
<object>, <script>, <table>, <td>, <th>, <tr>

* Update (2013/01/03): As Mathias explains, “it’s perfectly okay to leave those symbols unencoded, as browsers will take care of them as per the URL parsing algorithm in the HTML spec.”

As flexible as HTML is in terms of which characters may be used, there are strict limits to which characters may be used when referencing URLs. This limitation applies not only to URLs used in HTML, but also to URLs referenced in any coding language (e.g., JavaScript, PHP, Perl, etc.).

Unsafe Characters in WordPress

In version 3.5, WordPress uses improper, unencoded URLs to enqueue JavaScript libraries. Specifically, in the WP Admin area, various URLs are called using square brakets[” and “]”, which are clearly classified as unsafe characters. Here is an example:

http://example.com/wp-admin/load-scripts.php?c=1&load[]=swfobject,jquery,utils&ver=3.5

Also affecting the WordPress Admin, here is an example of unsafe characters in URLs, pointed out in this comment:

http://test.site/wp-admin/post.php?t=1347548645469?t=1347548651124?t=1347548656685?t=1347548662469?t=1347548672300?t=1347548681615?

“Special-use” specifies that the question mark “?” is reserved for the denotation of a query string, but must be encoded for any other purpose. Unfortunately, WordPress is including multiple unencoded question marks for URLs involved with its “preview” functionality. In other words, in any URL, the first question mark “?” may be unencoded to denote the query string, but subsequent “?” must be encoded.

These errors may not be a huge deal, but they increase potential vulnerability and certainly should be fixed in the next WP update. Likewise, future versions of WordPress should keep URI/URL specifications in mind and verify that all URLs are properly encoded.

A Dangerous Trend

WordPress isn’t the only popular piece of software that’s not following specification; rather, we’re seeing a disturbing trend wherein big companies such as Google are including unsafe characters in their URLs. Here is a recently reported example:

http://blog.sergeys.us/beer?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed:+SergeySus+(Sergey+Sus+Photography+%C2%BB+Blog)&amp;utm_content=Google+Reader

Notice the unencoded “:”? Apparently Google is including them in URLs for FeedBurner and Google Reader. Hopefully this is just an oversight that will be corrected in a future update.

For more examples of unsafe characters in popular apps and plugins, scan through some of the comments left on my 5G, 6G (beta), and BBQ plugin.

5G/6G Blacklist

For the record, the 5G Blacklist, 6G Blacklist (beta) — and all of my blacklists for that matter — are built on the foundation of IETF specifications. As explained in detail here and here, the .htaccess rules used in my G-series firewalls are designed to block malicious URL requests such as those that contain unsafe characters. Other firewall/security plugins and scripts operate in similar fashion, using standards and specifications to determine which URLs are potentially dangerous.

Developers, please stop using unsafe characters in URLs.

Many people rely on such plugins and blacklists to help protect their sites against threatening activity, but such security measures fail when developers ignore specification and include unencoded characters in URLs. Worse, by introducing inconsistency into the system, noncompliant scripts pose a potential security risk and open the doors to attacks.

WordPress and 5G Blacklist

As mentioned, WordPress 3.5 includes unencoded square brackets in various URLs in the Admin area. As explained, the 5G Blacklist blocks such unsafe characters to help users secure their WP-powered sites. Thus, if you’re running both WordPress and 5G, there will be an issue wherein certain URL requests are denied with a “403 – Forbidden” response.

So, until WordPress can get things fixed up, here is how to modify the 5G Blacklist (don’t even think about modifying any WP core files) to “allow” those unsafe URLs to pass through the firewall.

Step 1

In the 5G Blacklist, locate this section of code:

# 5G:[QUERY STRINGS]
<ifModule mod_rewrite.c>
 RewriteEngine On
 RewriteBase /
 RewriteCond %{QUERY_STRING} (environ|localhost|mosconfig|scanner) [NC,OR]
 RewriteCond %{QUERY_STRING} (menu|mod|path|tag)\=\.?/? [NC,OR]
 RewriteCond %{QUERY_STRING} boot\.ini  [NC,OR]
 RewriteCond %{QUERY_STRING} echo.*kae  [NC,OR]
 RewriteCond %{QUERY_STRING} etc/passwd [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\%27$   [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\\'$    [NC,OR]
 RewriteCond %{QUERY_STRING} \.\./      [NC,OR]
 RewriteCond %{QUERY_STRING} \?         [NC,OR]
 RewriteCond %{QUERY_STRING} \:         [NC,OR]
 RewriteCond %{QUERY_STRING} \[         [NC,OR]
 RewriteCond %{QUERY_STRING} \]         [NC]
 RewriteRule .* - [F]
</ifModule>

Step 2

Replace that entire block of code with this revised version that excludes the rules that block the unsafe characters:

# 5G:[QUERY STRINGS]
<ifModule mod_rewrite.c>
 RewriteEngine On
 RewriteBase /
 RewriteCond %{QUERY_STRING} (environ|localhost|mosconfig|scanner) [NC,OR]
 RewriteCond %{QUERY_STRING} (menu|mod|path|tag)\=\.?/? [NC,OR]
 RewriteCond %{QUERY_STRING} boot\.ini  [NC,OR]
 RewriteCond %{QUERY_STRING} echo.*kae  [NC,OR]
 RewriteCond %{QUERY_STRING} etc/passwd [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\%27$   [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\\'$    [NC,OR]
 RewriteCond %{QUERY_STRING} \.\./      [NC,OR]
 RewriteCond %{QUERY_STRING} \?         [NC,OR]
 RewriteCond %{QUERY_STRING} \:         [NC]
 RewriteRule .* - [F]
</ifModule>

Done. No further edits should be required, unless you’ve made any of your own modifications.

Take-home message

When developing for the Web, adherence to standards and protocols is important. By taking the time to properly encode your URLs, you eliminate inconsistency, eliminate vulnerabilities, facilitate extensibility, and ensure proper functionality. Hopefully this article serves as a reminder and helps clear up any confusion about which characters need encoded and why it’s so important to do so.