String charset / raw bytes in strings
=====================================
Strings should be "possibly malformed best-effort UTF-8 with
the most precomposed characters". Here's why:
* Existing data, or data from an external system (where the SLUL part might
  just be a component "in the middle"), may be malformed and still has to be
  handled.
* It removes the need to have separate types for binary sequences and
  character sequences.
* There's seldom any need to allow fully arbitrary UTF-8 but disallow binary
  data. For example, with text content you almost always want to strip out
  some control characters such as NUL, and often also RTL/LTR switch
  characters, lookalikes, newlines, tabs, HTML-significant characters
  (< > & ' ") etc. (see the sketch after this list).
* Oftentimes you just want to copy data as-is, without knowing which parts
  are valid or invalid (and this also avoids duplication).
* Non-precomposed characters, when a precomposed equivalent exists, are
  redundant and overly long, and they are known to cause problems.
  (But in some cases you may still want to preserve them.)
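To illustrate the filtering point above: a rough Python sketch (not SLUL code;
the exact set of stripped characters is application-specific and purely
illustrative here) of the kind of filtering text content typically needs on
top of plain UTF-8 validation:

    import unicodedata

    # Illustrative set of RTL/LTR switch characters.
    BIDI_CONTROLS = frozenset(
        "\u200e\u200f"                    # LRM, RLM
        "\u202a\u202b\u202c\u202d\u202e"  # embedding/override controls
        "\u2066\u2067\u2068\u2069"        # isolate controls
    )

    def strip_risky_chars(text: str) -> str:
        out = []
        for ch in text:
            if ch == "\x00" or ch in BIDI_CONTROLS:
                continue                  # drop NUL and RTL/LTR switches
            if unicodedata.category(ch) == "Cc" and ch not in "\n\t":
                continue                  # drop other control characters
            out.append(ch)
        return "".join(out)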
Maybe the formal definition should be like this:
A string is a sequence of bytes, which may include UTF-8 characters in their
precomposed form. Only bytes that are part of a valid UTF-8 character in the
most precomposed form possible can be valid characters.
A string can then be represented as a sequence of valid characters and
not-valid-character byte sequences.
Note that there's a distinction between Unicode codepoints and characters;
codepoints that aren't part of a valid Unicode character constitute a
not-valid-character byte sequence.
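As a sketch of what this definition could mean operationally, here is some
illustrative Python (not SLUL; the function name and the per-run NFC check are
assumptions, and the NFC check is only a simplified reading of "most
precomposed form possible"):

    import unicodedata

    def segment(data: bytes):
        """Split a byte string into ("char", str) and ("raw", bytes) runs."""
        runs = []
        i = 0
        while i < len(data):
            decoded = None
            for n in (1, 2, 3, 4):             # a UTF-8 character is 1..4 bytes
                chunk = data[i:i + n]
                try:
                    decoded = (chunk.decode("utf-8"), len(chunk))
                    break
                except UnicodeDecodeError:
                    continue
            if decoded is not None:
                ch, size = decoded
                prev = runs[-1][1] if runs and runs[-1][0] == "char" else ""
                # Valid only if appending keeps the run in precomposed (NFC) form.
                if unicodedata.normalize("NFC", prev + ch) == prev + ch:
                    if prev:
                        runs[-1] = ("char", prev + ch)
                    else:
                        runs.append(("char", ch))
                    i += size
                    continue
                raw = data[i:i + size]         # decodable, but not precomposed
            else:
                raw, size = data[i:i + 1], 1   # not decodable at all
            if runs and runs[-1][0] == "raw":
                runs[-1] = ("raw", runs[-1][1] + raw)
            else:
                runs.append(("raw", raw))
            i += size
        return runs

For example, segment("é".encode() + b"\xff") gives [("char", "é"),
("raw", b"\xff")], and "e" followed by a lone combining acute accent ends up
as a one-character "char" run followed by a "raw" run, since that pair is not
in its most precomposed form.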