String charset / raw bytes in strings
=====================================

Strings should be "possibly malformed best-effort UTF-8, with characters
in their most precomposed form". Here's why:

* Programs that handle existing data, or data from external systems (where
  the SLUL part might just be a component "in the middle"), might have to
  deal with malformed data.
* It removes the need to have separate types for binary sequences and
  character sequences.
* There's seldom any need to allow fully arbitrary UTF-8 text while
  disallowing binary. For example, with text content you almost always
  want to strip out some control characters such as NUL, and often also
  RTL/LTR switch characters, lookalikes, newlines, tabs, the
  HTML-significant characters < > & ' " etc. (see the sketch after this
  list).
* Oftentimes you just want to copy data as-is, without knowing what is
  valid or invalid data (and this also avoids duplication).
* Non-precomposed characters are redundant and overly long when a
  precomposed equivalent exists, and they are known to cause problems
  (but in some cases you may still want to preserve them).

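As a sketch of the "strip control characters" point above: which
characters to strip is application policy, not something the string type
itself needs to enforce. The list below (NUL plus the Unicode
bidirectional controls) is only illustrative, and the example is in Rust
rather than SLUL:

    /// Strip NUL and the Unicode bidi control characters from
    /// already-valid text. The set of "risky" characters is an
    /// application-level choice; this list is only an example.
    fn strip_risky(input: &str) -> String {
        input
            .chars()
            .filter(|&c| !matches!(c,
                '\0'                      // NUL
                | '\u{202A}'..='\u{202E}' // bidi embeddings/overrides (LRE..RLO)
                | '\u{2066}'..='\u{2069}' // bidi isolates (LRI..PDI)
            ))
            .collect()
    }

    fn main() {
        // The bidi override and the NUL are dropped; everything else stays.
        assert_eq!(strip_risky("a\u{202E}b\0c"), "abc");
    }
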
Maybe the formal definition should be like this:

    A string is a sequence of bytes, which may include UTF-8-encoded
    characters in their precomposed form. Only bytes that are part of a
    valid UTF-8 character in the most precomposed form possible count as
    valid characters.

    A string can then be represented as a sequence of valid characters and
    not-valid-character byte sequences.

    Note that there's a distinction between Unicode codepoints and characters;
    codepoints that aren't part of a valid Unicode character constitute a
    not-valid-character byte sequence.
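
A minimal sketch of that representation, in Rust rather than SLUL (the
exact chunking strategy is an assumption, not something the definition
above pins down): std::str::from_utf8 reports how far a byte string is
valid UTF-8, which is enough to split it losslessly into valid-character
runs and not-valid-character byte runs. Note that this only checks UTF-8
validity; enforcing the "most precomposed form possible" would
additionally require Unicode normalization (NFC) data.

    use std::str;

    // One piece of a string: either a run of valid UTF-8 text or a run
    // of bytes that don't form a valid character.
    #[derive(Debug)]
    enum Chunk<'a> {
        Valid(&'a str),
        Invalid(&'a [u8]),
    }

    // Split a possibly-malformed byte string into valid-character and
    // not-valid-character runs, without losing any bytes.
    fn chunks(mut bytes: &[u8]) -> Vec<Chunk<'_>> {
        let mut out = Vec::new();
        while !bytes.is_empty() {
            match str::from_utf8(bytes) {
                Ok(s) => {
                    // The whole remainder is valid text.
                    out.push(Chunk::Valid(s));
                    break;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    if valid > 0 {
                        out.push(Chunk::Valid(
                            str::from_utf8(&bytes[..valid]).unwrap(),
                        ));
                    }
                    // error_len() is None when the input ends in a
                    // truncated sequence; treat the rest as invalid.
                    let bad = e.error_len().unwrap_or(bytes.len() - valid);
                    out.push(Chunk::Invalid(&bytes[valid..valid + bad]));
                    bytes = &bytes[valid + bad..];
                }
            }
        }
        out
    }

    fn main() {
        // "abc", a stray 0xFF byte, then "u" + a combining diaeresis.
        // The trailing pair is valid UTF-8 but not precomposed, which
        // this sketch (deliberately) does not detect.
        for c in chunks(b"abc\xffu\xcc\x88") {
            println!("{:?}", c);
        }
    }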