String charset / raw bytes in strings
=====================================

Strings should be "possibly malformed best-effort UTF-8 with the most
precomposed characters". Here's why:

* Existing data, or data from an external system (where the SLUL part
  might just be a component "in the middle"), might have to deal with
  malformed data.
* It removes the need for separate binary and character sequence types.
* There is seldom any need to allow fully arbitrary UTF-8 text while
  forbidding binary data. For example, with text content you almost
  always want to strip out some control characters such as NUL, and
  often also RTL/LTR switch characters, lookalikes, newlines, tabs,
  HTML < > & ' " etc.
* Often you just want to copy data as-is, without knowing which parts
  are valid or invalid (and this also avoids duplication).
* Non-precomposed characters are redundant and overly long when a
  precomposed equivalent exists, and they are known to cause problems
  (though in some cases you may still want to preserve them).

Maybe the formal definition should be like this:

A string is a sequence of bytes, including UTF-8 characters in their
precomposed form. Only bytes that are part of a valid UTF-8 character
in the most precomposed form possible count as valid characters. A
string can then be represented as a sequence of valid characters and
not-valid-character byte sequences. Note that there is a distinction
between Unicode codepoints and characters; codepoints that are not
part of a valid Unicode character constitute a not-valid-character
byte sequence.
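
To make the definition concrete, here is a minimal sketch in C of how
a string could be walked as a sequence of valid characters and
not-valid-character byte runs. The names (utf8_char_len, scan_string)
are hypothetical, not part of SLUL. The sketch only checks UTF-8
well-formedness; checking for the *most precomposed* form would
additionally require Unicode normalization (NFC) data tables, which
are omitted here.

    #include <stddef.h>
    #include <stdint.h>

    /* Returns the length (1..4) of a valid UTF-8 sequence starting at
       s, or 0 if the bytes at s do not form a valid sequence.
       n is the number of bytes remaining in the buffer. */
    static size_t utf8_char_len(const uint8_t *s, size_t n)
    {
        if (n == 0) return 0;
        uint8_t b = s[0];
        if (b < 0x80) return 1;                    /* ASCII */
        size_t len;
        uint32_t cp;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return 0;              /* stray continuation/invalid lead */
        if (n < len) return 0;      /* truncated sequence */
        for (size_t i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80) return 0;  /* bad continuation */
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        /* Reject overlong encodings, surrogates and out-of-range
           codepoints; these are "codepoints that are not characters". */
        if (len == 2 && cp < 0x80)    return 0;
        if (len == 3 && cp < 0x800)   return 0;
        if (len == 4 && cp < 0x10000) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        if (cp > 0x10FFFF) return 0;
        return len;
    }

    /* Reports each span of the string as either one valid character
       or a maximal run of not-valid-character bytes. */
    void scan_string(const uint8_t *s, size_t n,
                     void (*on_char)(const uint8_t *, size_t),
                     void (*on_invalid)(const uint8_t *, size_t))
    {
        size_t i = 0;
        while (i < n) {
            size_t len = utf8_char_len(s + i, n - i);
            if (len > 0) {
                on_char(s + i, len);
                i += len;
            } else {
                /* Group consecutive invalid bytes into one span. */
                size_t start = i;
                while (i < n && utf8_char_len(s + i, n - i) == 0) i++;
                on_invalid(s + start, i - start);
            }
        }
    }

A caller could use this to iterate over characters, copy data as-is,
or replace invalid spans; the point is that malformed bytes remain
representable and round-trippable instead of being rejected up front.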