Strings
=======

Goals:
1. Empty and 1-char strings should not need pointers
2. Chunked strings should be supported (at least with 4 KiB arena blocks)
3. String constants must not need any relocations.

Goals in detail:
- Implicit string references
- Proper string constant support (e.g. constant arrays of string constants):
  - Support for relative string pointers? (this removes the need for jagged
    arrays)
    - READING or COPYING from a string pointer needs special handling.
      The relative pointers need to be converted to absolute pointers.
  - Support for encoding 1-char printable ASCII and empty strings directly
    in pointers?

Encoding of String Pointers
---------------------------

Pointers:
- 0 = none
- 65536+ = normal, absolute, string pointer
- 1-65535 special values
    127 one-char values (7-bit ASCII char only; UTF-8 chars need "long"
        encoding)
    1 empty value
    perhaps two-ASCII-char values also? (we have 15 bits - 1 value)
    others: special relative pointers

Relative pointer encoding:
- We have 15 bits
- We can only encode relative offsets +/- 16 KiB. Solutions:
  - Use 1 bit for "fragmented pointers" to extend the range by "stealing"
    a few bits each from a few nearby string pointers.
  - Use chunked arrays (we already need to support this). That way, we can
    ensure that there are holes in the array to accommodate the strings.
    (This also means that chunked arrays would need to use relative
    pointers.)

Problems:
- String dereferencing requires an additional check! In pseudo-code, where
  s_addr is the address of the slot holding the string pointer:

      reg := load(s_addr)
      if is_relative(reg)
          reg += s_addr

- Constant data structures with SEVERAL strings cannot be encoded if the
  strings are so large that the "farthest" string is more than 16 KiB away.
  - This can be solved with chunked strings!
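The pointer classification and the extra dereference check described above can be sketched in Python. This is only an illustration, not a settled layout: the notes do not fix where the special values sit inside 1-65535, so the constants below (one-char values at 1-127, the empty string at 128, bit 15 as the relative-pointer flag with a 15-bit signed offset) and the names classify, relative_offset and resolve are all assumptions.

```python
# Hypothetical layout of the special range 1..65535 (not fixed by the notes).
NONE = 0
EMPTY = 128              # assumed: the single value reserved for ""
RELATIVE_FLAG = 0x8000   # assumed: bit 15 marks a relative pointer
ABSOLUTE_MIN = 65536     # from the notes: 65536+ is a normal absolute pointer


def classify(ptr):
    """Return the kind of string pointer that `ptr` encodes."""
    if ptr == NONE:
        return "none"
    if ptr >= ABSOLUTE_MIN:
        return "absolute"
    if 1 <= ptr <= 127:
        return "char"        # assumed: the value is the ASCII code itself
    if ptr == EMPTY:
        return "empty"
    if ptr & RELATIVE_FLAG:
        return "relative"
    return "reserved"


def relative_offset(ptr):
    """Decode the low 15 bits as a signed offset in -16384..16383."""
    raw = ptr & 0x7FFF
    return raw - 0x8000 if raw & 0x4000 else raw


def resolve(ptr, slot_addr):
    """Turn a pointer stored at `slot_addr` into an absolute address."""
    kind = classify(ptr)
    if kind == "absolute":
        return ptr
    if kind == "relative":
        # The additional check from the pseudo-code: add the slot address.
        return slot_addr + relative_offset(ptr)
    raise ValueError("pointer of kind %r has no address" % kind)
```

Under these assumptions, copying a relative pointer into a different slot means calling resolve first and storing the absolute result, which matches the special handling for READING and COPYING noted in the goals.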
Encoding of String Data - Alternative 1
---------------------------------------

First byte:
- 0 = empty string (special case, in case we must have a valid pointer for
  some reason)
- 1-247 = null terminated string
- 248-251 = contiguous string, 8,16,32,64 bit length, respectively
- 252-255 = chained string, 8,16,32,64 bit length, respectively

This is followed by either:
- nothing (in case of an empty string)
- null terminated string
- first (or only) string segment

String segments:
- length (8,16,32,64 bit)
- string data

Encoding of String Data - Alternative 2
---------------------------------------

First byte:
- 0-247: length of string
- 248-255: bits:
    0.. = contiguous string
    1.. = chained string
    .0. = 8 bit string length and segment length
    .1. = maximum bits for string length and segment length
    ..0 = absolute pointers
    ..1 = relative pointers

This is followed by either:
- nothing (in case of an empty string)
- string data (up to 247 bytes), plus zero terminator (for C compatibility
  and simpler string looping)
- string header (for contiguous or chained strings)

String header:
- length of string in bytes
- pointer to string data OR first segment

Segment:
- length of segment
- pointer to next segment
- pointer to string data in this segment

Encoding of String Data - Alternative 3
---------------------------------------

First byte:
- 0-254: length of string
- 255: chained string with relative pointers

This is followed by either:
- nothing (in case of an empty string)
- string data (up to 254 bytes), plus zero terminator (for C compatibility
  and simpler string looping)
- string header (for chained strings)

String header:
- length of string in bytes
- first segment

Segment:
- length of segment
- relative pointer to next segment
- relative pointer to string data in this segment

Encoding of String Data - Alternative 4
---------------------------------------

First byte:
- 0: chained string with relative pointers
- 1-255: length of string

This is followed by either:
- string header (for chained strings)
- string data (up to 255 bytes), plus zero terminator (for C compatibility
  and simpler string looping)

String header:
- length of string in bytes
- first segment

Segment:
- length of segment
- relative pointer to next segment
- relative pointer to string data in this segment

Old idea from 2021-02-25
------------------------

- Pointer:
    0 = merge with none (for optional types)
    1-127: 1 byte string
    128+: pointer to string
- Data:
    First byte:
      0 = empty string
      1-247 = null terminated
      255 = chained string. 64 bit length
      254 = chained string. 32 bit length
      253 = chained string. 16 bit length
      252 = chained string. 8 bit length
      251 = contiguous string. 64 bit length
      250 = contiguous string. 32 bit length
      249 = contiguous string. 16 bit length
      248 = contiguous string. 8 bit length
- Properties:
  - Can be passed around like a pointer
  - Can be null if the type allows this
  - Unverified UTF-8. Care must be taken when outputting, or performing
    index (etc.) operations.
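As a rough illustration of the first-byte table in the 2021 layout (the same table as Alternative 1), the following Python sketch reads the empty, null-terminated and contiguous forms from a byte buffer. Several details are assumptions, not something the notes specify: that a first byte in 1-247 is itself the first character of the string (plausible because bytes 248-255 never occur in UTF-8), that length fields are little-endian, and the names LENGTH_WIDTH and read_string. Chained strings are left out because the segment layout is not pinned down here.

```python
# Width in bytes of the length field for each contiguous-string tag.
LENGTH_WIDTH = {248: 1, 249: 2, 250: 4, 251: 8}


def read_string(buf, pos=0):
    """Return the raw (unverified UTF-8) bytes of the string stored at `pos`."""
    tag = buf[pos]
    if tag == 0:
        # Empty string: nothing follows the first byte.
        return b""
    if tag <= 247:
        # Assumed: the tag byte is the first character; scan to the terminator.
        end = buf.index(0, pos)
        return bytes(buf[pos:end])
    if tag in LENGTH_WIDTH:
        # Contiguous string: an explicit length field, then the data.
        width = LENGTH_WIDTH[tag]
        start = pos + 1 + width
        length = int.from_bytes(buf[pos + 1:start], "little")  # assumed byte order
        return bytes(buf[start:start + length])
    # Tags 252-255: chained strings, not covered by this sketch.
    raise NotImplementedError("chained strings are not handled")
```

The function returns bytes, not text, in keeping with the note that the data is unverified UTF-8 and needs care before output or indexing.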