(Please) Stop Using Unsafe Characters in URLs

Updated October 19, 2023 • 26 comments

Just as there are specifications for designing with CSS, HTML, and JavaScript, there are specifications for working with URIs/URLs. The Internet Engineering Task Force (IETF) defines these specifications in RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. Within that document, there are guidelines regarding which characters may be used safely within URIs. This post summarizes that information and encourages developers to understand the rules and implement them accordingly.

FYI: A URL is a specific type of URI.

About the RFC 3986 Specification
Character Encoding Chart
More about Character Types
URLs in HTML and JavaScript
Unsafe Characters in WordPress
A Dangerous Trend
URLs and Firewall Security
WordPress and 5G Blacklist
Take-home Message

About the RFC 3986 Specification

The specifications for Uniform Resource Identifiers (URIs) and more specifically Uniform Resource Locators (URLs) provide a safe, consistent way to request, identify, and resolve resources on the Internet. As clearly stated in RFC 3986:

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. This specification defines the generic URI syntax and a process for resolving URI references that might be in relative form, along with guidelines and security considerations for the use of URIs on the Internet. The URI syntax defines a grammar that is a superset of all valid URIs, allowing an implementation to parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier.

Thanks to the brilliant work of experts such as Tim Berners-Lee, Roy Fielding, Larry Masinter, and Mark McCahill, developers have a safe, consistent protocol for working with URIs/URLs on the Web. It is important that we adhere to these specifications when developing software, plugins, apps, and the like. Failing to do so introduces potential security vulnerabilities which may be exploited by nefarious individuals and malicious scripts.

Character Encoding Chart

To help promote the cause of Web Standards and adhering to specifications, here is a quick reference chart explaining which characters are “safe” and which characters should be encoded in URLs.

Classification	Included characters	Encoding required?
Safe characters	Alphanumerics `[0-9a-zA-Z]` and unreserved characters. Also reserved characters when used for their reserved purposes (e.g., question mark used to denote a query string)	NO
Unreserved characters	`- . _ ~` (does not include blank space)	NO
Reserved characters	`: / ? # [ ] @ ! $ & ' ( ) * + , ; =` (does not include blank space)	YES¹
Unsafe characters	Includes the blank/empty space and " < > % { } | \ ^ `	YES
ASCII Control characters	Includes the ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F (127 decimal)	YES
Non-ASCII characters	Includes the entire “top half” of the ISO-Latin set 80-FF hex (128-255 decimal)	YES
All other characters	Any character(s) not mentioned above should be percent-encoded.	YES

¹ Reserved characters only need to be encoded when not used for their defined, reserved purposes.

FYI: As with the specifications, the above chart is a work in progress and subject to change. If you have any suggestions to improve it, please let me know.

The above chart is a summary of which characters need to be encoded in URIs/URLs, based on the current specification RFC 3986. Web developers should be mindful when working with URLs in their applications and implementations.
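As a rough illustration, the chart can be expressed as a small lookup function. This is only a sketch in Python: the function name `classify` and the set names are my own, and the sets simply mirror the rows of the chart above rather than any official API.

```python
import string

# Character sets copied from the chart above (names are illustrative).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
RESERVED = set(":/?#[]@!$&'()*+,;=")
UNSAFE = set(' "<>%{}|\\^`')

def classify(ch: str) -> str:
    """Return the chart classification for a single character."""
    if ch in UNRESERVED:
        return "unreserved"   # never needs encoding
    if ch in RESERVED:
        return "reserved"     # encode only when used as data
    if ch in UNSAFE:
        return "unsafe"       # always encode
    code = ord(ch)
    if code < 0x20 or code == 0x7F:
        return "control"      # always encode
    if code > 0x7F:
        return "non-ascii"    # always encode
    return "other"            # always encode

for ch in ["a", "~", "?", "[", " ", "{", "\n", "é"]:
    print(repr(ch), classify(ch))
```

Only "unreserved" characters, and "reserved" characters acting as delimiters, can appear literally; everything else should be percent-encoded.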

More about Character Types

Here is some further discussion about each of the various character types: Reserved Characters, Unreserved Characters, Unsafe Characters, and ASCII Characters.

Reserved Characters

More information about “reserved” characters from RFC 3986:

URIs include components and subcomponents that are delimited by characters in the “reserved” set. These characters are called “reserved” because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI’s dereferencing algorithm. If data for a URI component would conflict with a reserved character’s purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI. […]

If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character’s encoding in US-ASCII.
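To make the "conflicting data must be percent-encoded" rule concrete, here is a minimal Python sketch using the standard library's `urllib.parse`. The search URL and the parameter name `q` are made up for the example.

```python
from urllib.parse import quote, urlencode

# A value that happens to contain reserved characters (&, =, ?) as plain data.
value = "Tom & Jerry = friends?"

# Encode everything except unreserved characters.
encoded_value = quote(value, safe="")
print(encoded_value)
# Tom%20%26%20Jerry%20%3D%20friends%3F

# urlencode() does the same for key/value pairs, using "+" for spaces.
url = "http://example.com/search?" + urlencode({"q": value})
print(url)
# http://example.com/search?q=Tom+%26+Jerry+%3D+friends%3F
```

The literal `?` after `search` keeps its reserved role as the query delimiter, while the `&`, `=`, and `?` inside the value are encoded so they cannot be mistaken for delimiters.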

Unreserved Characters

More information about “unreserved” characters from RFC 3986:

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Normalization and Comparison). For consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
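The normalization step described in that last sentence can be sketched as follows. This is a simplified Python illustration, not a full RFC 3986 normalizer; `decode_unreserved` is a hypothetical helper name. Escapes that do not map to an unreserved character are left encoded, with their hex digits upper-cased as the RFC also recommends.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded octets only when they represent unreserved characters."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch                    # e.g. %7E becomes ~
        return match.group(0).upper()    # keep other escapes, e.g. %2f becomes %2F
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, uri)

print(decode_unreserved("http://example.com/%7Euser/a%2fb%2Dc"))
# http://example.com/~user/a%2Fb-c
```

Note that `%2f` (an encoded slash) stays encoded: decoding a reserved character would change how the URI is interpreted.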

Unsafe Characters

More about “unsafe” characters from RFC 1738. Note that RFC 1738 is now obsolete; however, the information remains useful in a general sense and is shared below for educational and reference purposes.

Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters < and > are unsafe because they are used as the delimiters around URLs in free text; the quote mark (") is used to delimit URLs in some systems. The character # is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character % is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are {, }, |, \, ^, ~, [, ], and `.

All unsafe characters must always be encoded within a URL. For example, the character # must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.

Again, the above “unsafe character” information is from RFC 1738, which is obsoleted and replaced by RFC 3986. For example, the tilde ~ character is now an unreserved character and does not need to be encoded. Likewise, the # hash (octothorpe) character is now a reserved character and does not need to be encoded when used for its reserved purpose of introducing a fragment identifier. If in doubt, follow RFC 3986.
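One way to spot problem characters is to scan a URL for anything that is neither unreserved, reserved, nor part of a valid percent-escape. The sketch below assumes that simplified rule; `find_disallowed` is a hypothetical helper, and it does not check whether a reserved character sits in a position where it is actually permitted.

```python
import re

# Characters RFC 3986 allows to appear literally: unreserved + reserved.
ALLOWED = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]")
# A well-formed percent-escape: "%" followed by two hex digits.
PCT = re.compile(r"%[0-9A-Fa-f]{2}")

def find_disallowed(url: str):
    """Return (index, character) pairs for characters that should have been encoded."""
    found = []
    i = 0
    while i < len(url):
        if PCT.match(url, i):
            i += 3                       # skip a valid escape such as %20
            continue
        if not ALLOWED.match(url[i]):
            found.append((i, url[i]))    # space, braces, stray "%", non-ASCII, etc.
        i += 1
    return found

print(find_disallowed("http://example.com/a b?x={1}"))
# [(20, ' '), (25, '{'), (27, '}')]
```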

ASCII Characters

More information about ASCII characters from W3C Internationalization:

Currently Web addresses are typically expressed using Uniform Resource Identifiers or URIs. The URI syntax defined in RFC 3986 STD 66 (Uniform Resource Identifier (URI): Generic Syntax) essentially restricts Web addresses to a small number of characters: basically, just upper and lower case letters of the English alphabet, European numerals and a small number of symbols.

The original reason for this was to aid in transcription and usability, both in computer systems and in non-computer communications, to avoid clashes with characters used conventionally as delimiters around URIs, and to facilitate entry using those input facilities available to most Internet users.

User’s expectations and use of the Internet have moved on since then, and there is now a growing need to enable use of characters from any language in Web addresses. A Web address in your own language and alphabet is easier to create, memorize, transcribe, interpret, guess, and relate to. It is also important for brand recognition. This, in turn, is better for business, better for finding things, and better for communicating. In short, better for the Web.

Imagine, for example, that all web addresses had to be written in Japanese katakana, as shown in the example below. How easy would it be for you, if you weren’t Japanese, to recognize the content or owner of the site, or type the address in your browser, or write the URI down on notepaper, etc.?

http://ヒキワリ.ナットウ.ニホン

There have been several developments recently that begin to make this possible. […] Learn more at w3.org »
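For host names like the katakana example above, the usual mechanism is Punycode: each non-ASCII label is converted to an ASCII form beginning with `xn--`. As a small sketch, Python's built-in `idna` codec can perform the conversion. Keep in mind that this codec implements the older IDNA 2003 rules, so treat it as an illustration rather than a complete solution.

```python
host = "ヒキワリ.ナットウ.ニホン"

# Convert each label of the host name to its ASCII-compatible (Punycode) form.
ascii_host = host.encode("idna")
print(ascii_host)

# The conversion is reversible, so the original name can be shown to users.
print(ascii_host.decode("idna"))
```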

URLs in HTML and JavaScript

In earlier versions of HTML, the entire range of the ISO-8859-1 (ISO-Latin) character set could be used in documents. Since HTML4, the entire Unicode character set may also be used. In HTTP, however, the range of allowed characters is expressly limited to only a subset of the US-ASCII character set (see the Character Encoding Chart for details).

So, when writing HTML, ISO and Unicode characters may be used everywhere in the document except where URLs are referenced*. This includes the following elements:

<a>, <applet>, <area>, <base>, <bgsound>, <body>,
<embed>, <form>, <frame>, <iframe>, <img>, <input>, <link>,
<object>, <script>, <table>, <td>, <th>, <tr>

* Update: As Mathias explains, “it’s perfectly okay to leave those symbols unencoded, as browsers will take care of them as per the URL parsing algorithm in the HTML spec.”

As flexible as HTML is in terms of which characters may be used, there are strict limits to which characters may be used when referencing URLs. This limitation applies not only to URLs used in HTML, but also to URLs referenced in any coding language (e.g., JavaScript, PHP, Perl, etc.).
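Browsers perform this encoding automatically, but scripts and command-line clients have to do it themselves. The following Python sketch shows roughly what that step looks like. The function name `to_ascii_url` is hypothetical, the host is left untouched (it would need separate Punycode handling), and existing percent-escapes are assumed to be intentional and are preserved.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Delimiters allowed in a path, plus "%" so existing escapes are not double-encoded.
SAFE = "/:@!$&'()*+,;=%"

def to_ascii_url(url: str) -> str:
    """Percent-encode unsafe and non-ASCII characters in path, query, and fragment."""
    parts = urlsplit(url)
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    # The netloc (host, port, userinfo) is passed through unchanged in this sketch.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))

print(to_ascii_url("http://example.com/☺?q=a b"))
# http://example.com/%E2%98%BA?q=a%20b
```

Non-ASCII characters are encoded as their UTF-8 bytes, which is why the single ☺ character becomes three escapes.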

Unsafe Characters in WordPress

Note: Things have changed a lot with WordPress and URI specification. So the following information may be outdated, but the example serves to illustrate how developers should be mindful of specification when crafting URIs in applications.

In version 3.5, WordPress uses improper, unencoded URLs to enqueue JavaScript libraries. Specifically, in the WP Admin area, various URLs are called using square brackets [ ], which RFC 1738 classifies as unsafe and RFC 3986 lists among the reserved general delimiters. Here is an example:

http://example.com/wp-admin/load-scripts.php?c=1&load[]=swfobject,jquery,utils&ver=3.5
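For comparison, here is how the same request could be built with the brackets encoded. This is a Python sketch, not WordPress code (WordPress itself is PHP). It keeps commas literal via `safe=","` to stay close to the original URL.

```python
from urllib.parse import urlencode, parse_qsl

base = "http://example.com/wp-admin/load-scripts.php"
params = [("c", "1"), ("load[]", "swfobject,jquery,utils"), ("ver", "3.5")]

# urlencode() percent-encodes the brackets in the parameter name.
url = base + "?" + urlencode(params, safe=",")
print(url)
# http://example.com/wp-admin/load-scripts.php?c=1&load%5B%5D=swfobject,jquery,utils&ver=3.5

# The receiving side decodes the name back to "load[]", so nothing is lost.
decoded = parse_qsl(url.split("?", 1)[1])
print(decoded)
```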

Also affecting the WordPress Admin, here is an example of unsafe characters in URLs, pointed out in this comment:

http://test.site/wp-admin/post.php?t=1347548645469?t=1347548651124?t=1347548656685?t=1347548662469?t=1347548672300?t=1347548681615?

The question mark “?” is reserved for denoting the start of a query string. Unfortunately, WordPress is including multiple unencoded question marks in URLs involved with its “preview” functionality. Under RFC 1738, the first question mark “?” in a URL may be unencoded to denote the query string, but subsequent “?” must be encoded. RFC 3986 is more lenient and allows “?” to appear as data within the query component, but the repeated “?t=…” segments shown here still look like a bug rather than intended data.
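Following the rule that only the first “?” stays literal, a repair could look like the sketch below. `encode_extra_question_marks` is a hypothetical helper, and it deliberately ignores fragments for brevity.

```python
def encode_extra_question_marks(url: str) -> str:
    """Keep the first "?" as the query delimiter and encode any later ones."""
    head, sep, rest = url.partition("?")
    return head + sep + rest.replace("?", "%3F")

broken = "http://test.site/wp-admin/post.php?t=1347548645469?t=1347548651124?"
print(encode_extra_question_marks(broken))
# http://test.site/wp-admin/post.php?t=1347548645469%3Ft=1347548651124%3F
```

Encoding makes the URL unambiguous, but the better fix is upstream: whatever code keeps appending `?t=…` should replace the existing parameter instead.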

These errors may not be a huge deal, but they increase potential vulnerability and certainly should be fixed in the next WP update. Likewise, future versions of WordPress should keep URI/URL specifications in mind and verify that all URLs are properly encoded.

A Dangerous Trend

Note: Things have changed a lot with Google and URI specification. So the following information may be outdated, but the example serves to illustrate how developers should be mindful of specification when crafting URIs in applications.

WordPress isn’t the only popular piece of software that is not following specification; rather, we’re seeing a disturbing trend wherein big companies such as Google are including unsafe characters in their URLs. Here is a recently reported example:

http://blog.sergeys.us/beer?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+SergeySus+(Sergey+Sus+Photography+%C2%BB+Blog)&utm_content=Google+Reader

Notice the unencoded colon : character? Apparently Google is including them in URLs for FeedBurner and Google Reader. Hopefully this is just an oversight that will be corrected in a future update.
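To see what a strictly encoded version of that query string would look like, the parameters can be decoded and re-encoded. This is a Python sketch using only the standard library; `urlencode` encodes every reserved character in the values, including the colon and parentheses.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

url = ("http://blog.sergeys.us/beer?utm_source=feedburner&utm_medium=feed"
       "&utm_campaign=Feed:+SergeySus+(Sergey+Sus+Photography+%C2%BB+Blog)"
       "&utm_content=Google+Reader")

# Decode the query into (name, value) pairs.
pairs = parse_qsl(urlsplit(url).query)
print(pairs[2])
# ('utm_campaign', 'Feed: SergeySus (Sergey Sus Photography » Blog)')

# Re-encode: ":" becomes %3A, "(" becomes %28, ")" becomes %29.
reencoded = urlencode(pairs)
print(reencoded)
```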

For more examples of unsafe characters in popular apps and plugins, scan through some of the comments left on my 5G, 6G (beta), and BBQ plugin.

URLs and Firewall Security

For the record, the 5G Blacklist, 6G Firewall (beta) — and all of my firewalls for that matter — are built on the foundation of IETF specifications. As explained in detail here and here, the .htaccess rules used in my G-series firewalls are designed to block malicious URL requests such as those that contain unsafe characters. Other firewall/security plugins and scripts operate in similar fashion, using standards and specifications to determine which URLs are potentially dangerous.

Developers, please stop using unsafe characters in URLs.

Many web application firewalls (WAFs) and security applications rely on pattern recognition to help protect sites against threatening activity, but such security measures fail when developers ignore specification and include unencoded unsafe characters in URLs. Worse, by introducing inconsistency into the system, noncompliant scripts pose a potential security risk and open the doors to attacks.

WordPress and 5G Blacklist

Note: This section contains outdated information. WordPress now is way beyond version 3.5, and the URI specification has changed considerably. Also, the 5G Blacklist is superseded by the 6G and 7G Firewall. And 8G is in the works :)

As mentioned, WordPress 3.5 includes unencoded square brackets in various URLs in the Admin area. As explained, the 5G Blacklist blocks such unsafe characters to help users secure their WP-powered sites. Thus, if you’re running both WordPress and 5G, there will be an issue wherein certain URL requests are denied with a “403 – Forbidden” response.

So, until WordPress can get things fixed up, here is how to modify the 5G Blacklist (don’t even think about modifying any WP core files) to “allow” those unsafe URLs to pass through the firewall.

Step 1

In the 5G Blacklist, locate this section of code:

# 5G:[QUERY STRINGS]
<ifModule mod_rewrite.c>
 RewriteEngine On
 RewriteBase /
 RewriteCond %{QUERY_STRING} (environ|localhost|mosconfig|scanner) [NC,OR]
 RewriteCond %{QUERY_STRING} (menu|mod|path|tag)\=\.?/? [NC,OR]
 RewriteCond %{QUERY_STRING} boot\.ini  [NC,OR]
 RewriteCond %{QUERY_STRING} echo.*kae  [NC,OR]
 RewriteCond %{QUERY_STRING} etc/passwd [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\%27$   [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\\'$    [NC,OR]
 RewriteCond %{QUERY_STRING} \.\./      [NC,OR]
 RewriteCond %{QUERY_STRING} \?         [NC,OR]
 RewriteCond %{QUERY_STRING} \:         [NC,OR]
 RewriteCond %{QUERY_STRING} \[         [NC,OR]
 RewriteCond %{QUERY_STRING} \]         [NC]
 RewriteRule .* - [F]
</ifModule>

Step 2

Replace that entire block of code with this revised version, which omits the two rules that block square brackets (the colon rule becomes the last condition, so its flags change from [NC,OR] to [NC]):

# 5G:[QUERY STRINGS]
<ifModule mod_rewrite.c>
 RewriteEngine On
 RewriteBase /
 RewriteCond %{QUERY_STRING} (environ|localhost|mosconfig|scanner) [NC,OR]
 RewriteCond %{QUERY_STRING} (menu|mod|path|tag)\=\.?/? [NC,OR]
 RewriteCond %{QUERY_STRING} boot\.ini  [NC,OR]
 RewriteCond %{QUERY_STRING} echo.*kae  [NC,OR]
 RewriteCond %{QUERY_STRING} etc/passwd [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\%27$   [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\\'$    [NC,OR]
 RewriteCond %{QUERY_STRING} \.\./      [NC,OR]
 RewriteCond %{QUERY_STRING} \?         [NC,OR]
 RewriteCond %{QUERY_STRING} \:         [NC]
 RewriteRule .* - [F]
</ifModule>

Done. No further edits should be required, unless you’ve made any of your own modifications.
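To check the effect of the change without touching a live server, the query-string rules can be approximated in a few lines of Python. This is only an emulation of the pattern matching: it is not how Apache evaluates mod_rewrite, and the names `COMMON`, `BRACKETS`, and `is_blocked` are my own.

```python
import re

# Patterns copied from the 5G [QUERY STRINGS] block; [NC] means case-insensitive.
COMMON = [
    r"(environ|localhost|mosconfig|scanner)",
    r"(menu|mod|path|tag)\=\.?/?",
    r"boot\.ini",
    r"echo.*kae",
    r"etc/passwd",
    r"\=\\%27$",
    r"\=\\\'$",
    r"\.\./",
    r"\?",
    r"\:",
]
# The two rules removed in Step 2.
BRACKETS = [r"\[", r"\]"]

def is_blocked(query_string: str, patterns) -> bool:
    """Return True if any pattern matches, mimicking the chained [OR] conditions."""
    return any(re.search(p, query_string, re.IGNORECASE) for p in patterns)

wp_query = "c=1&load[]=swfobject,jquery,utils&ver=3.5"
print(is_blocked(wp_query, COMMON + BRACKETS))   # True: original rules deny the request
print(is_blocked(wp_query, COMMON))              # False: revised rules let it through

encoded_query = "c=1&load%5B%5D=swfobject,jquery,utils&ver=3.5"
print(is_blocked(encoded_query, COMMON + BRACKETS))  # False: encoded brackets pass either way
```

The last line shows why encoding matters: had the brackets been percent-encoded in the first place, the original rules would not have needed any modification.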

Take-home Message

When developing for the Web, adherence to standards and protocols is important. By taking the time to properly encode your URLs, you eliminate inconsistency, eliminate vulnerabilities, facilitate extensibility, and ensure proper functionality. Hopefully this article serves as a reminder and helps clear up any confusion about which characters need to be encoded and why it’s so important to do so.

About the Author

Jeff Starr = Web Developer. Book Author. Secretly Important.

26 responses to “(Please) Stop Using Unsafe Characters in URLs”

emrah 2012/12/31 7:49 pm

this post is racist. web is evolving, and there are enough space for other charsets. Not only English chars.
- Jeff Starr 2012/12/31 8:23 pm • Post Author
  
  Wow, interesting opinion, emrah. Thanks for chiming in with that.
  
  I should add the post merely strives to explain existing specifications (and why they’re important), it makes no value judgments one way or another regarding which character sets are better/worse than others. Totally not the point here.
  
  And I’m pretty sure that UTF-8 includes non-english characters as well, so no need to start accusing anything/anyone of being “racist”. Sheesh.
  - emrah 2013/01/01 8:15 am
    
    ok i was kidding with a little truth. UTF-8 is fine enough.
  - Jeff Starr 2013/01/01 11:43 pm • Post Author
    
    That’s good to know.. I was genuinely concerned about you ;)
  - Mathias Bynens 2013/01/03 6:09 am
    
    Minor nitpick: UTF-8 is a character encoding, so technically there is no such thing as “UTF-8 characters”. You probably meant to say “Unicode symbols”.
  - Jeff Starr 2013/01/03 5:49 pm • Post Author
    
    Yes that is what I meant to say, thanks Mathias.
Paul 2013/01/01 7:16 am

Admirably thorough article. Let’s hope Google is reading.
Mathias Bynens 2013/01/02 4:18 am

The specifications for Uniform Resource Identifiers (URIs) and more specifically Uniform Resource Locators (URLs) provide a safe, consistent way to request, identify, and resolve resources on the Internet. As clearly stated in RFC3986: […]

If only that were true. Sadly, RFC3986 doesn’t match reality. That’s why Anne van Kesteren has been working on a URL spec based on existing implementations.

In HTTP, however, the range of allowed characters is expressly limited to only a subset of the US-ASCII character set […] So, when writing HTML, ISO and Unicode characters may be used everywhere in the document except where URLs are referenced.

That’s not true at all. HTML != HTTP. <a href=☺>…</a> is perfectly valid HTML. It’s up to browsers to resolve special characters to their percent-encoded escape sequences as needed before any HTTP requests are made to the URL. For more information, see the URL parsing algorithm in the HTML spec (404 link removed 2014/07/20).

Also, in a blog post on special characters in URLs, you might want to mention Punycode.
- Jeff Starr 2013/01/02 6:41 pm • Post Author
  
  Great feedback, Mathias, thank you. The new spec looks ambitious and promising. It will be interesting to watch as things unfold in 2013.
  
  For your second point, I guess I’m confused.. researching this topic on the Web, I came across numerous sources, including this article, which seem to contradict what you’re saying here.. Also, not all HTTP requests are initiated by browsers, especially those involving unsafe characters, which are usually transmitted via script, command line, etc. The main point of the article is that unsafe characters should not be present in the URL, according to the referenced specifications. A smiley face may be perfectly valid HTML, but it needs to be encoded to be safe for HTTP.
  
  Then again, my expertise is admittedly limited in this area, so any further infos are most welcome.
  - Mathias Bynens 2013/01/03 6:04 am
    
    Hey Jeff, thanks for the reply! I’ll try to explain what I meant exactly, in case it was unclear.
    
    I fail to see where that article contradicts anything I’m saying. You’re right that those special characters shouldn’t be used without encoding them in URLs as far as HTTP and RFC3986 are concerned — but contrary to what you’re saying here, in HTML it’s perfectly okay to leave those symbols unencoded, as browsers will take care of them as per the URL parsing algorithm in the HTML spec.
  - Jeff Starr 2013/01/03 6:17 pm • Post Author
    
    Ah, that makes sense, thanks for explaining. I went ahead and added a note in that section of the article to clarify.
    
    Question: where in the spec does it explain how browsers should handle unsafe and reserved characters? For example, if I have an HTML document that references a URL containing, say, square brackets (classified as “unsafe”):
    
    http://example.com/wp-admin/load-scripts.php?c=1&load[]=swfobject,jquery,utils&ver=3.5
    
    As mentioned, the issue with WordPress is that browsers aren’t encoding the square brackets that are included in some URLs. Similarly with Google, there are additional question marks (“reserved” characters) included in URLs that aren’t getting encoded. It would be great to hear your thoughts on what’s happening (or not) with this issue.
Yael K. Miller 2013/01/02 10:42 am

Thanks for writing this post — I had no idea that an URL itself could be unsafe.

Google also uses “?” when you create goal-specific URLs in Google Analytics.

And Amazon loves making long URLs when you’re searching on amazon.com Example (my business partner’s books): http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Ddigital-text&field-keywords=Phyllis+Zimbler+Miller

(Although, I could be wrong: when ? or % is displayed does that mean it’s always un-encoded or is Firefox displaying encoded characters as un-encoded?)

Ok, so where can I find the information to encode these unsafe characters?
- Jeff Starr 2013/01/02 6:53 pm • Post Author
  
  Hi Yael, thanks for the feedback. Just to be clear, the first instance of “?” in a URL denotes a query string and is valid; it’s only subsequent instances of “?” that need encoding. Also, the Amazon URL you mention looks valid, and technically there is nothing wrong with “long” URLs (although they can be unwieldy to work with).
  
  As Mathias mentions in the previous comment, it’s up to browsers to properly encode/escape URLs before sending the request, but I’m not sure if that’s always the case.
  
  And to encode unsafe characters, any online URL decoder/encoder should do the job, for example this one. There are various conversion charts also available online. If I have time, I’ll try posting an article about this.
  - Yael K. Miller 2013/01/03 10:18 am
    
    Thanks.
Maxi 2013/01/07 2:16 am

You
Are
Brilliant!

thanks for sharing 5G and this solution to my newly 403 problems
Sean Ellingham 2013/01/16 9:13 am

Unless I’m reading the RFCs wrongly (or misinterpreting your post), saying that all reserved characters must always be encoded is incorrect. For example, in an HTTP URL (at least as far as I can tell), the reserved characters “;”, “:”, “@”, “&” and “=” are perfectly acceptable in the path and query string without being encoded – see page 17 of RFC1738.
- Jeff Starr 2013/01/16 2:56 pm • Post Author
  
  saying that all reserved characters must always be encoded is incorrect.
  
  I don’t think I say that anywhere in the article, so if there is some confusion please let me know so I may clarify.
  
  Also, page 17 of RFC1738 refers to “ip based protocols”, which use the reserved characters according to their specifically defined reserved purpose. At least, that’s how I currently understand the RFC, please advise if I am misguided.
  - Sean Ellingham 2013/01/16 3:46 pm
    
    True, it’s not said directly, but when I read the article that was the impression I received. I believe the culprit is the quick reference chart, which says encoding is required for everything except safe characters.
    
    The particular part of page 17 of RFC1738 I was referring to was this section:
    
    ; HTTP
    
    httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
    hpath = hsegment *[ "/" hsegment ]
    hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
    search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
    
    My understanding of that (with the definitions around it) was that those five reserved characters were allowed as is in the path and query string. Then again, that’s just my interpretation – I’m not making any claims that I’m an expert in this area, so I’ll gladly defer to better judgement. These RFCs aren’t exactly the easiest things to get your head around!
  - Jeff Starr 2013/01/16 4:13 pm • Post Author
    
    Thanks for clarifying, I’ve updated the chart with a note about encoding of reserved characters.
    
    For RFC, page 3 states:
    
    “Thus, only alphanumerics, the special characters “$-_.+!*'(),”, and reserved characters used for their reserved purposes may be used unencoded within a URL.”
    
    I think the information you reference on p17 is showing that reserved characters are acceptable if used according to definition (i.e., p3).
    
    But you are correct that the RFCs are not the easiest thing to understand. And like you, I’m no expert in this area so if anything is incorrect I’ll be glad to revise accordingly.
  - Sean Ellingham 2013/01/16 4:48 pm
    
    Hmm, I obviously missed that line on page 3 – that sentence does seem fairly conclusive. What is confusing me though is that the definitions for the schemes listed on pages 17 onwards seem to conflict with that, by implying that the reserved characters can potentially be used in certain other parts of a URL. Using HTTP as an example, if the definitions are expanded, it seems to say that a path can comprise multiple “/” delimited sections, where a section comprises any number of: unreserved characters, escape sequences, or the “:”, “;”, “@”, “&” or “=” characters – it is because these characters are explicitly listed, combined with the following statement on page 8, that I believe them to be valid.
    
    Within the and components, “/”, “;”, “?” are
    reserved. The “/” character may be used within HTTP to designate a
    hierarchical structure.
    
    However, it is worth considering that (if my interpretation is correct) whilst RFC1738 seems to allow unencoded “&” and “=” in the query string, we would expect these to be ‘reserved’ as we are used to their use in application/x-www-form-urlencoded style form submissions – although I don’t know where that encoding of key/value pairs is specified (I believe it might be in the HTML specification, although I’m not sure).
  - Sean Ellingham 2013/01/16 4:51 pm
    
    Oops, the quote from page 8 got eaten as I forgot to replace the angle brackets with the HTML entities – need to get to bed! Here’s what it should have said:
    
    “Within the <path> and <searchpart> components, “/”, “;”, “?” are
    reserved. The “/” character may be used within HTTP to designate a
    hierarchical structure.”
Riva 2013/01/27 12:23 pm

Really liked what you had to say in your post, (Please) Stop Using Unsafe Characters in URLs : Perishable Press, thanks for the good read!
— Riva

http://www.terrazoa.com
P. Don 2013/01/31 10:15 am

I checked RFC 1738 since I was having a problem with commas in request strings triggering 403s using the 5G .htaccess.

It appears that commas are fine, defined as “extra” characters (see the BNF section), one of the allowable sets of unreserved characters (alpha | digit | safe | extra). There is a comma in your initial list of safe characters in the OP, but it’s ambiguous whether it’s in the list or just punctuation. The comma does not seem to belong in the Reserved list.
Tim 2013/04/16 7:52 am

very interesting. I recently developed a search interface that included arrays of checkboxes and used the get method so that searches could be saved. I wonder if my technique is using unsafe characters or if this is acceptable:

the options[] field is added as a url parameter like this:

?options%5B%5D=large&options%5B%5D=medium
G. C. 2022/05/31 11:39 am

Just so you are aware:

Classifying sub-delims as “unsafe” is inaccurate.

Sub-delimiters are a class for which frameworks and projects that involve URL parsing algorithms are the target. Specifically, from rfc 3986:

A subset of the reserved characters (gen-delims) is used as delimiters of the generic URI components described in Section 3. A component’s ABNF syntax rule will not use the reserved or gen-delims rule names directly; instead, each syntax rule lists the characters allowed within that component (i.e., not delimiting it), and any of those characters that are also in the reserved set are “reserved” for use as subcomponent delimiters within the component.

Only the most common subcomponents are defined by this specification; other subcomponents may be defined by a URI scheme’s specification, or by the implementation-specific syntax of a URI’s dereferencing algorithm, provided that such subcomponents are delimited by characters in the reserved set allowed within that component.

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character’s encoding in US-ASCII.

Simply they are safe for their intended use, as delimiters, and should otherwise be percent encoded.

For framework and library devs like me the distinction is kind of important.
Joe 2022/10/09 12:22 am

First; you should understand that “unsafe characters” no longer exists in the current standard as a character class, since you mention it. However you then go on to say it is still useful — no, that’s incorrect. It is now wrong information.

RFC 1738 is not just obsolete, it is deprecated.

When you echo 1738’s verbiage about gen-delims and sub-delims, you are repeating the very misconception which caused the standard to be revised.

{, }, |, \, ^, ~, [, ]
These are not unsafe characters.

Example: It is safe to use [ ] { } in URLs
It is non-standard to use them in resource names.
If you are writing a URL dereferencing algorithm, or USING such an algorithm, these characters are reserved FOR YOU

gen-delims = “:” “[” / “]” “@”

sub-delims = “!” / “$” / “&” / “‘” / “(” / “)”
/ “*” / “+” / “,” / “;” / “=”

During the host portions and query portion, rules are stricter, however during the path component
[]@:;,=()!*+

Each of these characters are reserved for URL systems to create semantics with, so that you may know no resource names are conflicting with your DSL syntax.