Strings
=======

Goals:
1. Empty and 1 char strings should not need pointers
2. Chunked strings should be supported (at least with 4 KiB arena blocks)
3. String constants must not need any relocations.


Goals in detail:
- Implicit string references
- Proper string constant support (e.g. constant arrays of string constants):
    - Support for relative string pointers? (this removes the need for jagged arrays)
    - READING or COPYING from a string pointer needs special handling.
      The relative pointers need to be converted to absolute pointers.
- Support for encoding 1 char printable ASCII and empty strings directly in pointers?

Encoding of String Pointers
---------------------------
Pointers:
- 0 = none
- 65536+ = normal, absolute, string pointer
- 1-65535 special values
    127 one char values (7 bit ASCII char only; UTF-8 chars need "long" encoding)
    1 empty value
    perhaps two ASCII char values also? (we have 15 bits - 1 value)
    others: special relative pointers
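
A rough sketch of how a pointer value could be classified. Only the 0 / 65536+
split comes from the list above. The placement of the empty value (1), the
one char values (2-128) and the relative pointer marker (bit 15) inside the
special range is an assumption made for illustration, and all names are
hypothetical.

```python
# Hypothetical layout of the special range 1-65535; the notes do not fix it.
NONE = 0
EMPTY = 1                 # assumed position of the single "empty" value
ONE_CHAR_FIRST = 2        # assumed: values 2..128 hold ASCII codes 1..127
ONE_CHAR_LAST = 128
RELATIVE_FLAG = 0x8000    # assumed: bit 15 marks a relative pointer
ABSOLUTE_MIN = 0x10000    # from the notes: 65536+ is an absolute pointer


def classify(ptr: int) -> str:
    """Return the kind of a string pointer value under the assumed layout."""
    if ptr == NONE:
        return "none"
    if ptr >= ABSOLUTE_MIN:
        return "absolute"
    if ptr == EMPTY:
        return "empty"
    if ONE_CHAR_FIRST <= ptr <= ONE_CHAR_LAST:
        return "one_char"
    if ptr & RELATIVE_FLAG:
        return "relative"
    return "reserved"     # e.g. room for two char values


def make_one_char(ch: str) -> int:
    """Encode a single 7 bit ASCII character directly in the pointer."""
    if len(ch) != 1 or not 1 <= ord(ch) <= 127:
        raise ValueError("only one 7 bit ASCII character fits in a pointer")
    return ord(ch) - 1 + ONE_CHAR_FIRST


def one_char_value(ptr: int) -> str:
    """Decode the character stored in a one char pointer."""
    if classify(ptr) != "one_char":
        raise ValueError("not a one char string pointer")
    return chr(ptr - ONE_CHAR_FIRST + 1)
```

With this layout neither the empty string nor a one char string touches
memory, which is what goal 1 asks for.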

Relative pointer encoding:
- We have 15 bits
- We can only encode relative offsets +/- 16 KiB. Solutions:
    - Use 1 bit for "fragmented pointers" to extend the range by
      "stealing" a few bits each from a few nearby string pointers.
    - Use chunked arrays (we already need to support this).
      That way, we can ensure that there are holes in the array
      to accommodate the strings.
      (This also means that chunked arrays would need to use relative pointers).

Problems:
- String dereferencing requires an additional check! In pseudo-code:
    reg := load(s_addr)
    if is_relative(reg)
        reg += s_addr
- Constant data structures with SEVERAL strings cannot be encoded if the
  strings are so large that the "farthest" string is more than 16 KiB away.
    - This can be solved with chunked strings!
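
The extra check can be sketched as follows. This assumes that relative
pointers are the special values with bit 15 set, that the low 15 bits hold a
two's complement offset, and that the offset is relative to the address where
the pointer itself is stored (s_addr). None of this is fixed by the notes; the
function names are hypothetical.

```python
RELATIVE_FLAG = 0x8000   # assumed marker bit for relative pointers
OFFSET_MASK = 0x7FFF     # the 15 bits available for the offset
ABSOLUTE_MIN = 0x10000   # 65536+ is an absolute pointer


def is_relative(value: int) -> bool:
    return value < ABSOLUTE_MIN and bool(value & RELATIVE_FLAG)


def relative_offset(value: int) -> int:
    """Sign-extend the low 15 bits; the range is -16384..16383 bytes."""
    raw = value & OFFSET_MASK
    return raw - 0x8000 if raw & 0x4000 else raw


def encode_relative(offset: int) -> int:
    """Pack a signed byte offset into a relative pointer value."""
    if not -0x4000 <= offset <= 0x3FFF:
        raise ValueError("offset does not fit in 15 bits (+/- 16 KiB)")
    return RELATIVE_FLAG | (offset & OFFSET_MASK)


def resolve(s_addr: int, stored: int) -> int:
    """Convert the value stored at s_addr into an absolute address.

    Callers are expected to handle none/empty/one char values first.
    """
    if is_relative(stored):
        return s_addr + relative_offset(stored)
    return stored
```

A 15 bit signed offset covers -16384..16383 bytes, which is where the
+/- 16 KiB limit comes from. COPYING a pointer to a new location would need
resolve() on the source and, if the result is to stay relative,
encode_relative() against the destination address.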


Encoding of String Data - Alternative 1
---------------------------------------
First byte:
- 0 = empty string (special case, in case we must have a valid pointer for some reason)
- 1-247 = null terminated string
- 248-251 = contiguous string, 8,16,32,64 bit length, respectively
- 252-255 = chained string, 8,16,32,64 bit length, respectively

This is followed by either:
- nothing (in case of an empty string)
- null terminated string
- first (or only) string segment

String segments:
- length (8,16,32,64 bit)
- string data
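
A minimal decoder sketch for Alternative 1. It assumes that for 1-247 the
first byte is already the first byte of the text, that length fields are
little-endian, and that the length follows the tag byte directly. How chained
segments link to each other is not specified above, so only the first segment
is returned. decode_alt1 is a hypothetical name.

```python
def decode_alt1(buf: bytes, pos: int = 0):
    """Return (kind, payload) for the string record starting at buf[pos]."""
    tag = buf[pos]
    if tag == 0:
        return "empty", b""
    if tag <= 247:
        # Assumption: the tag byte doubles as the first byte of the text.
        end = buf.index(0, pos)              # find the null terminator
        return "null_terminated", buf[pos:end]
    # Tags 248..251 and 252..255 select a 1, 2, 4 or 8 byte length field.
    width = 1 << ((tag - 248) % 4)
    length = int.from_bytes(buf[pos + 1:pos + 1 + width], "little")
    kind = "contiguous" if tag < 252 else "chained"
    start = pos + 1 + width
    # For a chained string this is only the first segment.
    return kind, buf[start:start + length]
```

Bytes 248-255 never occur in well-formed UTF-8, which is presumably why they
are free to act as tags here.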

Encoding of String Data - Alternative 2
---------------------------------------
First byte:
- 0-247: length of string
- 248-255: bits:
   0.. = contiguous string
   1.. = chained string
   .0. = 8 bit string length and segment length
   .1. = maximum bits for string length and segment length
   ..0 = absolute pointers
   ..1 = relative pointers

This is followed by either:
- nothing (in case of an empty string)
- string data (up to 247 bytes), plus zero terminator (for C compatibility and simpler string looping)
- string header (for contiguous or chained strings)

String header:
- length of string in bytes
- pointer to string data OR first segment

Segment:
- length of segment
- pointer to next segment
- pointer to string data in this segment
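
The first byte of Alternative 2 could be decoded like this. The sketch
assumes the three flag bits are the low three bits of the byte, in the order
written above (leftmost flag = bit 2). FirstByte and decode_alt2_first_byte
are hypothetical names.

```python
from typing import NamedTuple, Optional


class FirstByte(NamedTuple):
    inline_length: Optional[int]  # 0..247 for short strings, else None
    chained: bool                 # False = contiguous string
    wide_lengths: bool            # False = 8 bit lengths
    relative: bool                # False = absolute pointers


def decode_alt2_first_byte(b: int) -> FirstByte:
    """Split the first byte into an inline length or three flags."""
    if not 0 <= b <= 255:
        raise ValueError("not a byte")
    if b <= 247:
        return FirstByte(b, False, False, False)
    flags = b & 0b111             # 248 = 0b11111000, so the flags are bits 0..2
    return FirstByte(
        None,
        bool(flags & 0b100),      # 1.. = chained string
        bool(flags & 0b010),      # .1. = maximum bits for the lengths
        bool(flags & 0b001),      # ..1 = relative pointers
    )
```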

Encoding of String Data - Alternative 3
---------------------------------------
First byte:
- 0-254: length of string
- 255: chained string with relative pointers

This is followed by either:
- nothing (in case of an empty string)
- string data (up to 254 bytes), plus zero terminator (for C compatibility and simpler string looping)
- string header (for chained strings)

String header:
- length of string in bytes
- first segment

Segment:
- length of segment
- relative pointer to next segment
- relative pointer to string data in this segment
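
For the short (non-chained) case of Alternative 3, encoding and decoding
could look roughly like this. The function names are hypothetical, and the
chained case is left out because the field widths are not specified.

```python
def encode_alt3_inline(text: str) -> bytes:
    """Encode a short string as: length byte, data, zero terminator."""
    data = text.encode("utf-8")
    if len(data) > 254:
        raise ValueError("too long; needs a chained string (first byte 255)")
    if not data:
        return b"\x00"            # empty string: nothing follows the length
    return bytes([len(data)]) + data + b"\x00"


def decode_alt3_inline(buf: bytes, pos: int = 0) -> bytes:
    """Return the string bytes of a short record starting at buf[pos]."""
    length = buf[pos]
    if length == 255:
        raise NotImplementedError("chained strings are not handled here")
    return buf[pos + 1:pos + 1 + length]
```

The length counts bytes, not characters, so "é" takes two of the 254
available bytes.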

Encoding of String Data - Alternative 4
---------------------------------------
First byte:
- 0: chained string with relative pointers
- 1-255: length of string

This is followed by either:
- string header (for chained strings)
- string data (up to 255 bytes), plus zero terminator (for C compatibility and simpler string looping)

String header:
- length of string in bytes
- first segment

Segment:
- length of segment
- relative pointer to next segment
- relative pointer to string data in this segment
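
Reading a string under Alternative 4 might work as sketched below. The field
widths are not given above, so this assumes 32 bit little-endian fields,
relative pointers measured from the address of the pointer field itself, a
"next" pointer of 0 for the last segment, and the first segment stored inline
in the header. read_alt4 is a hypothetical name.

```python
import struct

# Assumed segment layout: length (u32), rel. next (i32), rel. data (i32).
SEGMENT = struct.Struct("<Iii")


def read_alt4(buf: bytes, pos: int = 0) -> bytes:
    """Return the bytes of the string record starting at buf[pos]."""
    first = buf[pos]
    if first != 0:
        # 1-255: inline string, the first byte is the length.
        return buf[pos + 1:pos + 1 + first]

    # 0: chained string. The header holds the total length and first segment.
    (total,) = struct.unpack_from("<I", buf, pos + 1)
    seg = pos + 5
    out = bytearray()
    while True:
        seg_len, rel_next, rel_data = SEGMENT.unpack_from(buf, seg)
        data = seg + 8 + rel_data       # relative to the data pointer field
        out += buf[data:data + seg_len]
        if rel_next == 0 or len(out) >= total:
            break
        seg = seg + 4 + rel_next        # relative to the next pointer field
    if len(out) != total:
        raise ValueError("segment lengths do not add up to the string length")
    return bytes(out)


# Usage: a 5 byte string split into the segments "abc" and "de".
example = (
    b"\x00" + struct.pack("<I", 5)
    + SEGMENT.pack(3, 11, 4) + b"abc"
    + SEGMENT.pack(2, 0, 4) + b"de"
)
assert read_alt4(example) == b"abcde"
```

Because every pointer is relative, the whole record can be placed in constant
data without relocations, which is goal 3.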


Old idea from 2021-02-25
------------------------
- Pointer: 0 = merge with none (for optional types)
           1-127: 1 byte string
           128+: pointer to string
- Data: First byte:
        0 = empty string
        1-247 = null terminated
        255 = chained string. 64 bit length
        254 = chained string. 32 bit length
        253 = chained string. 16 bit length
        252 = chained string. 8 bit length
        251 = contiguous string. 64 bit length
        250 = contiguous string. 32 bit length.
        249 = contiguous string. 16 bit length.
        248 = contiguous string. 8 bit length.
- Properties:
    - Can be passed around like a pointer
    - Can be null if the type allows this
    - Unverified UTF-8. Care must be taken when outputting, or performing index (etc.) operations.