Dynamic array data structure
============================

Goals:

* Size-efficient for small lists (including very small ones)
* Size-efficient for large lists
* Speed-efficient for large lists
  - get element by index
  - append element
  - pop last element
* Ability to have the same in-memory format as strings?

It is probably a good idea to take some middle ground/combination of many
different approaches. For large lists it should be OK to allow for an
additional level of indirection for the elements, but perhaps with fast
paths for the first and last chunks?

Notation:

  n..m  is a range from n to m, including both n and m.
  [n]   is either an array size or an array element.
  /     means "or"

Hybrid inline / chunked-linked-list
-----------------------------------

3 machine words: (for 32b: 12 bytes, for 64b: 24 bytes)

  size
  a
  b

Have two modes:

1. Inline array mode
   - Used when all elements fit inside a,b
   - For byte arrays, size = 0..11 (32-bit systems) or 0..23 (64-bit)
2. Chunked-linked-list mode:
   - a = first chunk
   - b = last chunk

Inefficient word-oriented exponential/merging list
--------------------------------------------------

8 machine words: (for 32b: 32 bytes, for 64b: 64 bytes)

  size
  chunks[7]

Have three modes:

1. Inline array mode
   - Used when all elements fit inside chunks[7]
   - For byte arrays, size = 0..31 (32-bit systems) or 0..63 (64-bit)
2. Initial exponential array mode:
   - chunks[] contains pointers to array chunks
   - the initial array chunk goes into chunks[0] and could be ~64 bytes
   - following chunks increase exponentially in size, and go into the
     following chunks[] elements.
3. Merging exponential array mode:
   - when chunk[6] is full:
     - merge chunks 0 and 1
     - store the new chunk as chunk 0
     - "shift down" the other chunks (n = 1..5): chunk[n] = chunk[n+1]
     - chunk[6] = new chunk of double the size of chunk[5]

Pros:
* Somewhat fast lookups (only ever 1 level of indirection)
* Fully deterministic data structure

Cons:
* Adding a large number of elements at the same time is inefficient,
  since they might be stored in multiple chunks and/or lead to multiple
  merge operations.

Flexible word-oriented exponential/merging list
-----------------------------------------------

8 machine words: (for 32b: 32 bytes, for 64b: 64 bytes)

  size
  word_0_size
  chunks[6]

Have three modes:

1. Inline array mode
   - Used when all elements fit inside {word_0_size, chunks[6]}
   - For byte arrays, size = 0..31 (32-bit systems) or 0..63 (64-bit)
2. Initial exponential array mode:
   - chunks[] contains pointers to array chunks
   - the initial array chunk goes into chunks[0] and could be ~64 bytes
   - word_0_size is set accordingly
   - following chunks increase exponentially in size, and go into the
     following chunks[] elements.
3. Merging exponential array mode:
   - when chunk[5] is full:
     - merge chunks 0 and 1
     - store the new chunk as chunk 0
     - "shift down" the other chunks (n = 1..4): chunk[n] = chunk[n+1]
     - chunk[5] = new chunk of double the size of chunk[4]

Fast arbitrary-sized multi-insertion becomes possible in two ways:

1. When inserting into a new/empty array, just set:
   * size = word_0_size = the initial size
   * chunks[0] = chunk with the initial elements (can have any size)
   * chunks[1..5] = NULL
2. When inserting

Pros:
* Somewhat fast lookups (only ever 1 level of indirection)
* Fully deterministic data structure

Cons:
* If an array has a very large initial size and is appended to, then
  chunk[1] becomes 2*initial_size, which could be quite inefficient.

String-compatible lists
-----------------------

String / byte-array compatibility:

* It would be nice to support read-only usage of strings as arrays.
  (And small strings should be really compact.)
* It would be nice to support byte-arrays as strings, once the byte-array
  is read-only.

Arrays without pointers in their elements can be excluded from scanning
during garbage collection (just like strings). So from the point of view
of GC scanning it also makes sense to have a compatible in-memory format
for strings and arrays.

- Merging an "array header" with an "object GC metadata header" could
  save space.
- Doing it for non-8-bit types could be tricky due to alignment
  differences.
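As a concrete illustration of the 3-word "Hybrid inline / chunked-linked-list"
layout above, here is a minimal C sketch. The note does not specify how the
two modes are distinguished; this sketch assumes the size word itself acts as
the tag (a size at or below the inline capacity means inline mode, which works
because lists that small are always stored inline). It also spends a whole
word on size rather than packing it into one byte, so the inline capacity here
is 16 bytes on 64-bit instead of the 23 bytes the note achieves. All names
(`ByteList`, `Chunk`, `push_inline`) are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Bytes that fit in the two payload words a,b. */
#define INLINE_CAP (2 * sizeof(void *))

/* A chunk in the linked list used by mode 2. */
typedef struct Chunk Chunk;
struct Chunk {
    Chunk  *next;     /* next chunk in the list */
    size_t  len;      /* bytes used in data[] */
    uint8_t data[];   /* flexible array member (C99) */
};

/* The 3-machine-word header: size + two words a,b that are either
   raw inline storage (mode 1) or first/last chunk pointers (mode 2). */
typedef struct {
    size_t size;                                   /* also selects the mode */
    union {
        uint8_t inline_bytes[2 * sizeof(void *)];  /* mode 1: a,b as bytes */
        struct { Chunk *first, *last; } chunks;    /* mode 2: a,b as pointers */
    } u;
} ByteList;

static int is_inline(const ByteList *l) { return l->size <= INLINE_CAP; }

/* Append one byte while the list still fits inline (no heap use).
   Returns 0 when the caller would have to switch to chunked mode. */
static int push_inline(ByteList *l, uint8_t b) {
    if (!is_inline(l) || l->size == INLINE_CAP)
        return 0;
    l->u.inline_bytes[l->size++] = b;
    return 1;
}
```

The `sizeof(ByteList)` is three machine words on typical platforms where
`sizeof(size_t) == sizeof(void *)`, matching the 12/24-byte figures above.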
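The exponential chunk sizing in the "Inefficient word-oriented
exponential/merging list" is what keeps lookups at one level of indirection:
mapping a flat index to a (chunk, offset) pair is a short computation over at
most seven slots, not a pointer chase. A sketch, assuming chunk 0 holds 64
bytes and each following chunk doubles (sizes consistent with, but not fixed
by, the note):

```c
#include <stddef.h>

/* Assumed size of chunk 0; chunk k then holds BASE << k bytes, and
   chunks 0..k-1 together hold BASE * ((1 << k) - 1) bytes. */
#define BASE 64u

/* Map a flat byte index to (chunk slot, offset within that chunk). */
static void locate(size_t index, size_t *chunk, size_t *offset) {
    size_t k = 0, cap = BASE;
    while (index >= cap) {   /* skip whole chunks */
        index -= cap;
        cap <<= 1;           /* the next chunk is twice as big */
        k++;
    }
    *chunk = k;
    *offset = index;
}
```

With these sizes, the seven slots cover 64 * (2^7 - 1) = 8128 bytes before the
first merge of chunks 0 and 1 is needed; each merge frees the last slot for a
chunk of double the previous maximum, so capacity keeps growing geometrically
while the header stays 8 words.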