aboutsummaryrefslogtreecommitdiffhomepage
path: root/notes/jagged_arrays.txt
blob: e43d7d591f9e285618ca6029a0d04cbd153d26c2 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213

Jagged arrays
=============

Jagged arrays are arrays with elements of different size,
where the elements are still stored sequentially *within* the array.

Jagged arrays are needed to allocate data as static constant data,
(which cannot contain references in SLUL). For example:
- For arrays of strings
- For list of lists (of possibly different size)

For arrays of strings:
- Large array-of-strings could perhaps be compressed as a whole.

Alternative solutions:
- Relative pointers
- Compile-time execution (this could provide compression as well)

A general solution: Serialize the array
---------------------------------------

List should be an opaque/private type, so normally it wouldn't be possible
to use them as non-reference type, and hence they could typically not be
allocated as static constant data.

Note: This could possibly make it easier to perform exploits, because the
serialized format does not involve any absolute pointers, and thereby
defeats ASLR in a way.

The following types need special handling:

- strings
- Lists
- Maps

A workaround could be to provide a stable serialization format for those
types, and have some marker to distinguish them from normal arrays at
runtime.

So two things are necessary. First, some syntax for stable serialization
in the interface:

    type List<T>
    # Syntax with a per-type sentinel value. Not perfect, because it
    # would be nice to have multiple sentinel values to support mixed
    # 8/16/32/64 bit sizes/lengths/offsets.
    serialization array(T) sentinel usize .max
    # Syntax with a fixed range of sentinel values per type (usize, ref, ...)
    serialization array(T) sentinel usize
    # Syntax where the sentinel value is a type-independent bit pattern.
    # On little-endian platforms, this means that the first field has to
    # reserve the highest N values (e.g. 0xF0..0xFF).
    # On big-endian platforms, a fixed number of bits are always 1 and the
    # high-order bits in the following machine-word contain the mode bits.
    # The same range of highest N values still has to be reserved for
    # source-level platform independence.
    serialize array(T)


Then the sentinel value in the impl also:

    type List<T> = struct {
        usize len cond != .max  # must be forbidden here, or you get an error!
        ...
    }

Then some special casing in methods:

    func List<T>.get(size index) -> T
    {
        if this.is_serialized() { # compiler built-in function
            Deserializer des = .new(this)
            des.skip(index)
            # note that the element is guaranteed to be of a serializable
            # type, or one would get a compiler error.
            return des.get()
        } else {
            # Handle normally
            ...
        }
    }

Methods that allocate Lists would have to limit the range to not
accidentally use the sentinel value as a length:

    func .new_allocated(arena, usize length, slot T initval) -> arena List<T>
        cond length < .max
        # also: should constraints go here or immediately after the parameter?

And then some usage of the serialization:

    # The type MUST be serializable here!
    data List<List<string>> the_lists = {
        { "Hello", "world" },
        { "abc", "123" },
    }

Note that this can't be extended to arrays, because arrays don't contain
any struct/metadata about the array, so there is no way to distinguish
between normal and serialized data. This means that []string cannot be
used for static constant data.

Format of the serialized data
-----------------------------

The serialized data would have to support pointers to interior
fields/elements!

And it cannot have normal pointers inside itself, only absolute
pointers.

So perhaps some encoding like this could be used:

    <sentinel>      i.e. max value
    <size>          total length of object (including elements and their
                    nested elements). Needed only if memory mapping of
                    non 100% trusted data is to be supported.
    <length>        numbers of elements
    <offset_1>      relative pointer to first element
    <offset_2>      relative pointer to second element
    ...
    <offset_n>      relative pointer to n'th element
    <element_1>     serialization of element 1 (with sentinel value again)
    ...             any nested objects inside element 1
    <element_2>     serialization of element 2 (with sentinel value again)
    ...             any nested objects inside element 2
    ...
    <element_1>     serialization of element n (with sentinel value again)
    ...             any nested objects inside element n

Mapping serialized data from files
----------------------------------

Note: There seems to be no safe way to do this, because the mapped file
could be modified after the safety checks (i.e. a race condition). So
it should only be done on unmodifiable system files. Perhaps it should
only be allowed from pre-defined directories (e.g. /usr/lib/mmap/ or
~/.mmap/).

Note 2: There should be some indication in the filename that the contents
could trigger security exploits. Perhaps the files should be plain ELF
.so files? Or perhaps they should end with .insecure, .yolo or something.

Note 3: Such files cannot be portable (consider that integers and booleans
are encoded as-is; if a reference is created to a such value, then it MUST
have the system endianess/size, i.e. it's not portable. And even the
sentinel values have this problem).

Note 4: When the safety checks below fail, there should be an error
message informing the user that the safety checks are prone to race
conditions.


An advantage of supporting "direct usage" of serialized data like this,
without any conversion, is that serialized data could possibly be mmap'ed
instead of requiring parsing, memory allocation, copying, etc.

But the files MUST have the sentinel values in place, or the safety checks
of the compiler could be bypassed. The following safety measures are
required:

1. Before getting the "root pointer" to the outer array/dictionary of the
   data, the object must be checked.
2. Before getting a pointer to a nested object, the nested object must be
   checked.
3. The mapping must not be modifiable by other processes.

Checks for objects:

1. It should be checked that there is a sentinel value at the start,
   so the data does not get parsed as non-serialized data (which would
   be used as-is, without deserialization).
   (And if there are different sentinel values for different types, then
    the type has to be known),
2. The end address (start address + <size>) must not exceed that of the
   outer type (or the size of the memory mapping, if it is the outermost
   object).

Related problem: Enums with values
----------------------------------

    type ItemKind = enum(string id, string description) {
        box("box", "a box")
        chair("chair", "a chair")
        table("table, "a table")
    }

Problem: Keeping the data private!
Problem: How to store the data?

Solution: Jagged arrays constants with checked indexes:

    type ItemKind = enum {
        box
        chair
        table
    }

     # in private impl file:
     stringtable<ItemKind> item_ids = {
        "box"
        "chair"
        "table"
     };

     stringtable<ItemKind> item_descriptions

Advantages with this solution:
1. Implementation (=values) can be hidden
2. stringtable<int> can be used as a generic stringtable
   Negative indices are not used.
   But what about the length?? Should it be stored somewhere?
3. Values can be defined in another module than the enum definition!