notes/array_data_structure.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148


Dynamic array data structure
============================

Goals:

* Size-efficient for small lists (including very small ones)
* Size-efficient for large lists
* Speed-efficient for large lists
    - get element by index
    - append element
    - pop last element
* Ability to have the same in-memory format as strings?

It is probably a good idea to take some middle ground/combination
of many different approaches. For large lists it should be OK to
allow for an additional level of indirection for the elements,
but perhaps with fast-paths for the first and last chunks?


Notation:

    n..m is an range of from n to m, including both n and m.
    [n] is either an array size or array element.
    / means "or"

Hybrid inline / chunked-linked-list
-----------------------------------

3 machine words:
(for 32b: 12 bytes, for 64b: 24 bytes)

    size
    a
    b

Have two modes:

1. Inline array mode
    - When all elements fit inside a,b
    - For byte arrays, size = 0..11 (32 bit systems) or 0..23 (64 bit)
2. Chunked-linked-list mode:
    - a = first chunk
    - b = last chunk

Inefficient word-oriented exponential/merging list
--------------------------------------------------

8 machine words:
(for 32b: 32 bytes, for 64b: 64 bytes)

    size
    chunks[7]

Have two modes:

1. Inline array mode
    - When all elements fit inside chunks[7]
    - For byte arrays, size = 0..31 (32 bit systems) or 0..63 (64 bit)
2. Initial exponential array mode:
    - chunks[] contain pointers to array-chunks
    - the initial array chunk goes into chunks[0] and could be ~64 bytes
    - following chunks increase exponentially in size, and go into the
      following chunks[] elements.
3. Merging exponential array mode:
    - when chunk 7 is full:
        - merge chunk 0 and 1
        - store the new chunk as chunk 0
        - "shift down" other chunks (n = 1..6):
            chunk[n] = chunk[n+1]
        - chunk[7] = new chunk of double size as chunk[6]

Pros:

* Somewhat fast lookups (only ever 1 level of indirection)
* Fully deterministic data structure

Cons:

* Adding a large number of elements at the same time is inefficient,
  since it might be stored in multiple chunks and/or lead to multiple
  merge operations.


Flexible word-oriented exponential/merging list
-----------------------------------------------

8 machine words:
(for 32b: 32 bytes, for 64b: 64 bytes)

    size
    word_0_size
    chunks[6]

Have two modes:

1. Inline array mode
    - When all elements fit inside {word_0_size,chunks[7]}
    - For byte arrays, size = 0..31 (32 bit systems) or 0..63 (64 bit)
2. Initial exponential array mode:
    - words[] contain pointers to array-chunks
    - the initial array chunk goes into chunks[0] and could be ~64 bytes
        - word_0_size is set accordingly
    - following chunks increase exponentially in size, and go into the
      following chunks[] elements.
3. Merging exponential array mode:
    - when chunk 6 is full:
        - merge chunk 0 and 1
        - store the new chunk as chunk 0
        - "shift down" other chunks (n = 1..6):
            chunk[n] = chunk[n+1]
        - chunk[7] = new chunk of double size as chunk[6]

Fast arbitrary-sized multi-insertion becomes possible in two ways:

1. When inserting to a new/empty array, just set:
    * size = word_0_size = The initial size
    * chunks[0] = Chunk with initial elements (can have any size)
    * chunks[1..7] = NULL
2. When inserting


Pros:

* Somewhat fast lookups (only ever 1 level of indirection)
* Fully deterministic data structure

Cons:

* If an array has a very large initial size, and is appended to,
  then chunk[1] becomes 2*initial_size, which could be quite inefficient.

String-compatible lists
-----------------------

String - byte-array compatibility:

* It would be nice to support read-only usage of strings as arrays.
  (And small strings should be really compact.)
* It would be nice to support byte-arrays as strings, once the byte-array
  is read-only.

Arrays without pointers in elements can be excluded from scanning during
garbage collection (just like strings). So for the point of GC scanning it
also makes sense to have a compatible in-memory format for strings and arrays.

- Merging an "array header" with an "object GC metadata header" could
  save space.
- Doing it for non-8-bit-types could be tricky due to alignment differences.