What should SLUL do?
====================
SLUL will use "UTF-8 everywhere", but will allow broken encodings / WTF-8.
Invalid sequences are only turned into replacement characters on display,
and are only filtered/rejected on validation (e.g. an "is_valid_email"
function might reject invalid UTF-8).

This could be called "(Possibly broken) UTF-8 everywhere".

For code that really needs some other encoding, there could be an escape
hatch. See below.
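As a rough illustration of the "store as-is, replace on display, reject on validation" idea, here is a small Python sketch. It is not SLUL code; the function names `display_text` and `is_valid_utf8` are made up for this example, and the `is_valid_email` check is deliberately simplistic and only there to show where strict validation would sit.

```python
def display_text(raw: bytes) -> str:
    """Decode for display only: invalid sequences become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def is_valid_utf8(raw: bytes) -> bool:
    """Strict check, for validators that must reject broken input."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def is_valid_email(raw: bytes) -> bool:
    # Hypothetical and far from a real e-mail check: it only shows
    # that validation is the place where broken UTF-8 gets rejected.
    return is_valid_utf8(raw) and raw.count(b"@") == 1


# The stored bytes are never modified; only the view of them differs.
broken = b"caf\xc3\xa9 \xff menu"   # valid "café", then a stray 0xFF byte
lone_surrogate = b"\xed\xa0\x80"    # WTF-8 style encoding of U+D800

print(display_text(broken))          # stray byte shown as U+FFFD
print(is_valid_utf8(broken))         # False
print(is_valid_utf8(lone_surrogate)) # False: strict UTF-8 has no surrogates
```

The point of the split is that ordinary code can pass `broken` around untouched, and only the two edges (display and validation) need to care that it is not well-formed.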

Details
-------

A combination of the following could be both simple and powerful:

* byte strings for everything that does not care about encoding
* "possibly broken UTF-8" for everything that is unlikely to
  be in, or to support, other encodings (e.g. text in a user interface,
  filenames [which can be WTF-8])
* Maybe some kind of encoding override and/or storage of the original
  encoding. This does not have to be as fast as UTF-8. Code can be
  "capable of non-UTF-8" and use this data, or it can assume
  "UTF-8 everywhere" and just see the UTF-8 bytes (which can either be
  generated on the fly, or be stored in the string).

And this little modification makes things much more powerful:

    # in pseudo-code:
    type StringChunk = struct {
        # Most code will only support "possibly broken UTF-8"
        # Should it be possible to omit this, and generate UTF-8 on the fly?
        usize len
        byte[len] s
        # Code that does more advanced text operations
        # can choose to support loss-less access to the original encoding
        ?ref OrigEncChunk orig_encoding
    }

    type OrigEncChunk = struct {
        usize len
        ref OrigEncoding
        byte[len] s
    }

    type OrigEncoding<C,G,P,S> = struct {
        ?ref CodePointType<C> codepoint_type
        ?ref GraphemeType<G> grapheme_type
        ?ref PhonemeType<P> phoneme_type
        # "semanteme" = semantic unit
        ?ref SemantemeType<S> semanteme_type
        # TODO "slot var" type! i.e. qualifiers inside slot types!
        func pipe to_codepoint(Iterator<byte> b, ref var C codepoint)
        func pipe to_grapheme(Iterator<byte> b, ref var G grapheme)
        func pipe to_phoneme(Iterator<byte> b, ref var P phoneme)
    }
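
To make the two-level layout a bit more concrete, here is a minimal Python sketch of the same idea. It is only an approximation: a Python codec name stands in for `ref OrigEncoding`, the UTF-8 bytes are stored eagerly rather than generated on the fly, and only the code point level is modeled (no graphemes, phonemes or semantemes). The method names `from_encoded` and `code_points` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class OrigEncChunk:
    encoding: str  # assumption: a codec name stands in for "ref OrigEncoding"
    data: bytes    # the bytes exactly as they were received

    def code_points(self) -> Iterator[int]:
        """Loss-less path: decode the original bytes with the original codec."""
        for ch in self.data.decode(self.encoding):
            yield ord(ch)


@dataclass(frozen=True)
class StringChunk:
    utf8: bytes                          # "possibly broken UTF-8"
    orig: Optional[OrigEncChunk] = None  # only set for non-UTF-8 sources

    @classmethod
    def from_encoded(cls, data: bytes, encoding: str) -> "StringChunk":
        # Store the UTF-8 view eagerly and keep the original bytes alongside.
        text = data.decode(encoding)
        return cls(text.encode("utf-8"), OrigEncChunk(encoding, data))


chunk = StringChunk.from_encoded(b"caf\xe9", "latin-1")
print(chunk.utf8)                            # what "UTF-8 everywhere" code sees
print([hex(cp) for cp in chunk.orig.code_points()])
```

Code that only understands "UTF-8 everywhere" reads `utf8` and ignores `orig`; code that is "capable of non-UTF-8" can check whether `orig` is present and work on the original bytes instead.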

Further reading
---------------

https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape
https://ztdtext.readthedocs.io/en/latest/design.html
https://ztdtext.readthedocs.io/en/latest/design/lucky%207.html
- type: code unit
- type: code point
- max u/p and p/u (code units per code point, and code points per code unit)
- function: encode (code points -> code units)
- function: decode (code units -> code points)
- translation between encodings, where possible, if they use the same code point type
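
The last point in the list above (translation through a shared code point type) could look roughly like the following Python sketch. This is not the ztd.text API; the `Encoding` class, the `_codec` helper and `transcode` are names invented here, bytes serve as the code unit type, and integers (Unicode code points) serve as the shared code point type.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Encoding:
    name: str
    max_units_per_point: int              # the "max u/p" property
    encode: Callable[[List[int]], bytes]  # code points -> code units
    decode: Callable[[bytes], List[int]]  # code units -> code points


def _codec(name: str, max_units: int) -> Encoding:
    # Hypothetical helper: wraps a Python codec in the shape listed above.
    return Encoding(
        name,
        max_units,
        encode=lambda points: "".join(map(chr, points)).encode(name),
        decode=lambda units: [ord(c) for c in units.decode(name)],
    )


UTF8 = _codec("utf-8", 4)
LATIN1 = _codec("latin-1", 1)


def transcode(units: bytes, src: Encoding, dst: Encoding) -> bytes:
    """Translate via the code point type that both encodings share."""
    return dst.encode(src.decode(units))


print(transcode(b"caf\xe9", LATIN1, UTF8))      # Latin-1 in, UTF-8 out
print(transcode(b"caf\xc3\xa9", UTF8, LATIN1))  # and back again
```

The "where possible" caveat shows up naturally here: transcoding a code point that the target cannot represent (for example the euro sign into Latin-1) raises `UnicodeEncodeError` in this sketch instead of silently losing data.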