Character Sets for Random Strings

The character set you choose for random strings affects entropy, compatibility, and usability. This guide helps you select appropriate character sets for different use cases.

Standard Character Sets

Lowercase letters (a-z) provide 26 characters and about 4.7 bits of entropy per character. This is the most readable and compatible character set: it works in case-insensitive systems and is easy to communicate verbally. However, it requires longer strings to achieve the same total entropy as mixed-case alternatives.

Uppercase letters (A-Z) have identical properties to lowercase—26 characters and 4.7 bits per character. Combining lowercase and uppercase gives 52 characters and 5.7 bits per character, increasing security without adding special characters that might cause compatibility issues.

Digits (0-9) add 10 characters. Alphanumeric combinations (a-z, A-Z, 0-9) provide 62 characters and 5.95 bits per character. This is widely compatible and works in most contexts: URLs, databases, file names, and programming identifiers.

Special characters add the most entropy per character. Combining alphanumerics with all 32 ASCII punctuation characters (!@#$%^&*()_+-=[]{}|;:,.<>? plus the quote marks, slashes, backtick, and tilde) gives the 94 printable ASCII characters and 6.55 bits per character. However, special characters introduce compatibility challenges and may need escaping in URLs, shells, or databases.
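The set sizes quoted above can be checked with Python's standard string module. This is a minimal sketch; the constant names (LOWER, ALPHANUMERIC, PRINTABLE, and so on) are chosen here for illustration and are not part of any library.

```python
import string

# Building blocks from the standard library.
LOWER = string.ascii_lowercase    # a-z
UPPER = string.ascii_uppercase    # A-Z
DIGITS = string.digits            # 0-9
PUNCTUATION = string.punctuation  # all ASCII punctuation characters

# Combined sets discussed in this section (names are illustrative).
MIXED_CASE = LOWER + UPPER
ALPHANUMERIC = MIXED_CASE + DIGITS
PRINTABLE = ALPHANUMERIC + PUNCTUATION

for name, chars in [
    ("lowercase", LOWER),
    ("mixed case", MIXED_CASE),
    ("alphanumeric", ALPHANUMERIC),
    ("printable", PRINTABLE),
]:
    print(f"{name:<14}{len(chars):>3} characters")
```

Note that string.punctuation contains 32 characters, which is what brings the alphanumeric set of 62 up to 94.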

The relationship between character set size and entropy is logarithmic: entropy per character = log2(character_set_size). Doubling the character set size doesn't double entropy—it adds one bit per character. Going from 62 to 94 characters only increases entropy by about 0.6 bits per character.
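As a rough illustration of the formula, the following sketch computes entropy per character and total entropy for a string of a given length. The function names bits_per_char and total_bits are hypothetical, and the calculation assumes every character is chosen uniformly and independently.

```python
import math


def bits_per_char(set_size: int) -> float:
    """Entropy per character, assuming uniform random selection."""
    return math.log2(set_size)


def total_bits(set_size: int, length: int) -> float:
    """Total entropy of a string of `length` independent characters."""
    return length * bits_per_char(set_size)


for size in (26, 52, 62, 94):
    print(f"{size:>3} characters: {bits_per_char(size):.2f} bits per character")

# Doubling the set size adds exactly one bit per character.
print(bits_per_char(128) - bits_per_char(64))
```

Because the gain is logarithmic, adding length is usually a cheaper way to gain entropy than adding characters to the set.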

URL-Safe Character Sets

URL-safe strings avoid characters with special meanings in URLs. The unreserved characters in URLs are: A-Z, a-z, 0-9, hyphen (-), period (.), underscore (_), and tilde (~). These 66 characters never need URL encoding.
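One way to see the difference is to percent-encode a string and check whether it changes. The sketch below uses urllib.parse.quote from the Python standard library and assumes Python 3.7 or later, where the tilde is treated as unreserved. UNRESERVED and needs_encoding are names introduced here for illustration.

```python
import string
from urllib.parse import quote

# The unreserved set described above: letters, digits, and four symbols.
UNRESERVED = string.ascii_letters + string.digits + "-._~"


def needs_encoding(value: str) -> bool:
    """Return True if percent-encoding would change the value."""
    # safe="" asks quote() to encode everything except unreserved characters.
    return quote(value, safe="") != value


print(len(UNRESERVED))             # 66
print(needs_encoding(UNRESERVED))  # False
print(quote("a+b/c=d", safe=""))   # a%2Bb%2Fc%3Dd
```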

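Base64URL is a standard URL-safe encoding (defined in RFC 4648) that uses A-Z, a-z, 0-9, hyphen (-), and underscore (_), for 64 characters total and exactly 6 bits per character. It replaces standard Base64's plus (+), which is read as a space in form-encoded query strings, and slash (/), which separates path segments. The trailing = padding is usually omitted as well. This is the de facto standard for URL-safe random strings.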
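In Python, a token over this alphabet can be produced with the standard secrets module. The sketch below shows two approaches; the wrapper names token_from_bytes and token_from_alphabet are hypothetical, and the default sizes are only example values.

```python
import secrets
import string

# The 64-character Base64URL alphabet described above.
BASE64URL_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
)


def token_from_bytes(num_bytes: int = 16) -> str:
    """Encode random bytes as unpadded Base64URL text."""
    return secrets.token_urlsafe(num_bytes)


def token_from_alphabet(length: int = 22) -> str:
    """Pick each character independently from the alphabet."""
    return "".join(secrets.choice(BASE64URL_ALPHABET) for _ in range(length))


print(token_from_bytes())     # 16 random bytes, shown as 22 characters
print(token_from_alphabet())  # 22 characters, 6 bits each
```

The first form fixes the entropy (16 bytes is 128 bits) and lets the length follow; the second fixes the length and lets the entropy follow (22 characters is 132 bits).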

Query parameters have additional considerations. While you can URL-encode any characters, avoiding the need to encode simplifies code and makes URLs more readable. Stick to alphanumeric plus hyphen and underscore for query parameter values used as tokens.

Fragments (the part after #) have slightly different rules but generally benefit from the same character set restrictions. Some frameworks parse fragment content, so URL-safe characters prevent unexpected behavior.

Path segments benefit from restrictive character sets. While many special characters are technically valid in URL paths when encoded, using only alphanumeric plus hyphen makes paths cleaner and more compatible with various web servers and proxies.

Excluding Ambiguous Characters

Ambiguous characters look similar in many fonts, causing confusion when humans read or type strings. The most problematic fall into two groups: 0 (zero) and O (capital o); and 1 (one), l (lowercase L), and I (capital i).

When to exclude ambiguous characters depends on human interaction. If users never see the strings (purely internal tokens), include all characters for maximum entropy. If users might need to read strings (viewing confirmation codes), exclude ambiguous characters for clarity. If users must type strings (entering backup codes), definitely exclude ambiguous characters to prevent errors.

The entropy cost of excluding ambiguous characters is small. Removing 0, O, 1, l, I from alphanumeric (62 chars) leaves 57 characters. This reduces entropy from 5.95 to 5.83 bits per character—only 0.12 bits per character. A 20-character string loses about 2.4 bits of total entropy, easily compensated by adding one extra character.
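The arithmetic above, together with a generator for human-friendly codes, can be sketched as follows. AMBIGUOUS, UNAMBIGUOUS, and random_code are illustrative names, and the code assumes the five-character exclusion list used in this section.

```python
import math
import secrets
import string

AMBIGUOUS = "0O1lI"
ALPHANUMERIC = string.ascii_letters + string.digits

# Alphanumeric characters with the five ambiguous ones removed.
UNAMBIGUOUS = "".join(c for c in ALPHANUMERIC if c not in AMBIGUOUS)


def random_code(length: int, alphabet: str = UNAMBIGUOUS) -> str:
    """Generate a code using a cryptographically secure random source."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


# Entropy lost per character by dropping the ambiguous characters.
loss_per_char = math.log2(len(ALPHANUMERIC)) - math.log2(len(UNAMBIGUOUS))

print(len(UNAMBIGUOUS))                               # 57
print(f"{loss_per_char:.2f} bits lost per character")
print(f"{20 * loss_per_char:.1f} bits lost over 20 characters")
print(random_code(20))
```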

Additional exclusions can improve usability. Some exclude 0 and O but keep 1, l, I if the font context is clear. Others exclude Z and 2 which can look similar in some fonts. Vowels might be excluded to prevent accidentally generating words (which could be offensive or confusing). The tradeoff is always: how much does exclusion improve usability versus how much does it reduce entropy?

Custom character sets for specific contexts solve domain-specific problems. Hexadecimal (0-9, a-f) is universally understood by developers and provides clean, unambiguous strings at 4 bits per character. DNA sequences use A, C, G, T (2 bits per character). Morse code uses dots and dashes (1 bit per symbol). Define character sets based on your requirements, then calculate entropy to ensure strings are long enough for their purpose.
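To size a string for a custom character set, divide the target entropy by the bits per character and round up. The sketch below does this for a 128-bit target, which is only an example figure; min_length is a hypothetical helper name.

```python
import math


def min_length(alphabet_size: int, target_bits: float) -> int:
    """Smallest length whose total entropy reaches target_bits."""
    return math.ceil(target_bits / math.log2(alphabet_size))


TARGET_BITS = 128  # example target, not a recommendation from this guide

for name, size in [
    ("dots and dashes", 2),
    ("DNA (A, C, G, T)", 4),
    ("hexadecimal", 16),
    ("alphanumeric", 62),
]:
    print(f"{name:<18}{min_length(size, TARGET_BITS):>4} characters")
```

Smaller character sets remain usable as long as the string length grows to match.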
