notes/strings_api.txt




What should SLUL do?
====================

SLUL will use "UTF-8 everywhere", but will tolerate broken encodings / WTF-8.
Invalid sequences only become replacement characters on display, and are
filtered or rejected on validation (e.g. an "is_valid_email" function might
reject invalid UTF-8).

This could be called "(Possibly broken) UTF-8 everywhere".
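
As a rough illustration of that policy (written in Python, since SLUL itself
is still being designed), the sketch below stores the raw bytes untouched,
substitutes U+FFFD only when producing text for display, and rejects invalid
UTF-8 during validation. The helper names "display_text" and "is_strict_utf8"
and the toy shape check inside "is_valid_email" are assumptions made for this
example, not part of any planned API.

```python
def display_text(raw: bytes) -> str:
    """Decode for display only; broken sequences become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def is_strict_utf8(raw: bytes) -> bool:
    """Return True only if the bytes are well-formed UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_valid_email(raw: bytes) -> bool:
    """Hypothetical validator: reject invalid UTF-8, then do a toy shape check."""
    if not is_strict_utf8(raw):
        return False
    local, sep, domain = raw.partition(b"@")
    return bool(local) and sep == b"@" and b"." in domain


# A Latin-1 e-acute (0xE9) is not valid UTF-8 on its own.
broken = b"caf\xe9@example.com"
shown = display_text(broken)      # replacement character appears only here
accepted = is_valid_email(broken) # validation rejects the same bytes
```

The stored bytes are never rewritten, so code that does not care about
encoding can still pass them through unchanged.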

For code that really needs some other encoding, there could be an escape
hatch. See below.

Details
-------

A combination of the following could be both simple and powerful:

* byte strings for everything that does not care about encoding
* "possibly broken UTF-8" for everything that is unlikely to
  be in, or to support, other encodings (e.g. text in a user interface,
  filenames [which can be WTF-8])
* Maybe some kind of encoding override and/or storage of the original
  encoding. This does not have to be as fast as UTF-8. Code can be
  "capable of non-UTF-8" and use this data, or it can assume
  "UTF-8 everywhere" and just see the UTF-8 bytes (which can either be
  generated on the fly, or be stored in the string).

And this little modification makes things much more powerful:

    # in pseudo-code:
    type StringChunk = struct {
        # Most code will only support "possibly broken UTF-8"
        # Should it be possible to omit this, and generate UTF-8 on the fly?
        usize len
        byte[len] s
        # Code that does more advanced text operations
        # can choose to support loss-less access to the original encoding
        ?ref OrigEncChunk orig_encoding
    }

    type OrigEncChunk = struct {
        usize len
        ref OrigEncoding encoding
        byte[len] s
    }

    type OrigEncoding<C,G,P,S> = struct {
        ?ref CodePointType<C> codepoint_type
        ?ref GraphemeType<G> grapheme_type
        ?ref PhonemeType<P> phoneme_type
        # "semanteme" = semantic unit
        ?ref SemantemeType<S> semanteme_type

        # TODO "slot var" type! i.e. qualifiers inside slot types!
        func pipe to_codepoint(Iterator<byte> b, ref var C codepoint)
        func pipe to_grapheme(Iterator<byte> b, ref var G grapheme)
        func pipe to_phoneme(Iterator<byte> b, ref var P phoneme)
    }
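
To make the two-layer idea more concrete, here is a minimal Python sketch of
a chunk that always carries (possibly broken) UTF-8 bytes and optionally keeps
the original bytes next to them. It is only an approximation of the pseudo-code
above: the encoding is represented by a Python codec name instead of a
"ref OrigEncoding", and the helpers "chunk_from_legacy" and "original_bytes"
are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OrigEncChunk:
    encoding: str  # assumption: a Python codec name stands in for OrigEncoding
    s: bytes       # the bytes exactly as they were in the original encoding


@dataclass(frozen=True)
class StringChunk:
    s: bytes                                      # possibly broken UTF-8
    orig_encoding: Optional[OrigEncChunk] = None  # absent for plain UTF-8 text


def chunk_from_legacy(raw: bytes, encoding: str) -> StringChunk:
    """Store a UTF-8 view plus the untouched original bytes."""
    utf8 = raw.decode(encoding, errors="replace").encode("utf-8")
    return StringChunk(utf8, OrigEncChunk(encoding, raw))


def original_bytes(chunk: StringChunk) -> Tuple[str, bytes]:
    """Loss-less access for code that is 'capable of non-UTF-8'."""
    if chunk.orig_encoding is not None:
        return chunk.orig_encoding.encoding, chunk.orig_encoding.s
    return "utf-8", chunk.s


chunk = chunk_from_legacy(b"caf\xe9", "latin-1")
utf8_view = chunk.s                   # what "UTF-8 everywhere" code sees
legacy_view = original_bytes(chunk)   # what encoding-aware code sees
```

Here the UTF-8 bytes are stored eagerly. Generating them on the fly, as the
comment in the pseudo-code asks about, would trade memory for conversion cost
on each access.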

Further reading
---------------

https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape

https://ztdtext.readthedocs.io/en/latest/design.html
https://ztdtext.readthedocs.io/en/latest/design/lucky%207.html
- type: code unit
- type: code point
- constants: max code units per code point (u/p) and max code points per code unit (p/u)
- function: encode (code points -> code units)
- function: decode (code units -> code points)
- translation between encodings is possible if they use the same code point type
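
The list above can be sketched in a few lines of Python. This is not the
ztd.text API. It is a simplified illustration in which the code unit is a
byte, the code point is an int, and the actual conversion is delegated to
Python's built-in codecs. The names "Encoding" and "transcode" are
assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Encoding:
    codec: str                # Python codec that backs this sketch
    max_code_units: int       # most bytes a single code point can encode to
    max_code_points: int = 1  # most code points produced by one decode step

    def decode(self, units: bytes) -> List[int]:
        """Code units -> code points (strict: raises on broken input)."""
        return [ord(ch) for ch in units.decode(self.codec)]

    def encode(self, points: Iterable[int]) -> bytes:
        """Code points -> code units (strict: raises if not representable)."""
        return "".join(chr(p) for p in points).encode(self.codec)


def transcode(src: Encoding, dst: Encoding, units: bytes) -> bytes:
    """Works because both encodings share one code point type (Unicode ints)."""
    return dst.encode(src.decode(units))


LATIN1 = Encoding("latin-1", max_code_units=1)
UTF8 = Encoding("utf-8", max_code_units=4)

as_utf8 = transcode(LATIN1, UTF8, b"caf\xe9")
round_trip = transcode(UTF8, LATIN1, as_utf8)
```

Unlike the "possibly broken UTF-8" handling discussed earlier, this sketch is
strict and raises an exception on malformed input.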