Memory sharing vs execution time overhead
=========================================

There are basically 7 common "systems" to share code:

1. Position-independent + GOT/PLT (most ELF/*nix systems?)
2. Position-independent + GOT/PLT + prelinked files that are used if possible
3. Prelinking, and on collision: relocate on startup (Windows NT?)
4. Prelinking, and on collision: relocate on page-in (Windows 9x)
5. Prelinking, without collision handling (very old Linux versions)
6. Relocation at installation time (maybe systems with AOT compilation?)
7. Central dispatcher function (e.g. microkernel system or OS/BIOS)

Number 4 is IMHO the best option (together with compressed pages/zram),
but it requires kernel support.

What is needed for SLUL?
------------------------

On most platforms, code can use:

- relative jumps
- relative data addressing (constant data, adjacent to the code, with
  1 shared page / tail-head merging)
- possibly relative function addressing (if function refs are stored as
  native pointers)

But:

- calls between modules need special care
- constant data access between modules needs special care
- function pointer addressing might need special care
  (special _addr_typeXXX data slot)
- function pointer invocation might need special care
  (special _call_typeXXX jump slot)

Solution 1:
- this can be very efficient at runtime:
  - only one or a few memory mappings
  - sharing is possible if done on a per-system basis rather than per-user
- a "binary image file" with all code/data compiled together.
  it can be mmap'ed in one operation.
- must be updated on changes in the underlying files.
- can be partitioned to reduce regeneration time, and to support more than
  2 GiB of code on 32-bit systems.
- can be stored in the home directory as well.

Solution 2:
- this reduces the number of mappings to one or a couple of unshared/private
  pages + 1 shared page per lib.
- encode a base pointer + an offset in the address of each mapped library.
  the base pointer points to an "offset table" / "GOT".
  the offset is the offset in the offset table.
- the base pointer could simply be found by zeroing out the low-order bits.
- the offset could be the "mid-order bits".
- example 32-bit address:
  XBBB MMMM MMMM LLLL LLLL LLLL LLLL LLLL
  X: 1 bit.   must be 0 to avoid kernel addresses.
  B: 3 bits.  base pointer
  M: 8 bits.  offset in the "offset table".
  L: 20 bits. code address (up to 1 MiB of code per "block")
- libraries that use more than 1 MiB can use more than one block
  (and also need duplicated entries in the offset table)
- how to efficiently call into the offset table?

Solution 3:
- old-Linux-style prelinking with a (literally) global address space
- all executables can use a constant address
- address format (32-bit version):
  XHHH HHHH HHHH HHHH HLLL LLLL LLLL LLLL
  X: 1 bit.   must be 0 to avoid kernel addresses.
  H: 16-bit name hash (oops - this will have collisions!)
  L: 15 bits. code address (up to 32 KiB, then following blocks are used
     sequentially)
- how big is the chance of collisions?
  (birthday approximation: p ≈ 1 - exp(-n(n-1)/(2N)), with N = 2^16 here)
  - exceeds 1% at 36 libraries
  - at 300 libraries, it is 50%
- with dedicated look-up pages without other code (=smaller), the number of
  hash bits could be increased (4 KiB per look-up table, 1 per library):
  XHHH HHHH HHHH HHHH HHHH LLLL LLLL LLLL
  X: 1 bit.   must be 0 to avoid kernel addresses.
  H: 19-bit name hash (oops - this will still have collisions!)
  L: 12 bits. code address (up to 4 KiB, then following blocks are used
     sequentially)
- how big is the chance of collisions?
  - exceeds 1% at 103 libraries
  - at 851 libraries, it is 50%

Solution 4: (best solution? but not that scalable, and reduces security by
creating a loophole in ASLR)
- "central dispatcher"
- have a "dispatcher" at a constant base address (globally constant, i.e.
  the same for everyone, everywhere)
- functions are deterministically "hashed" somehow to generate an absolute
  address (which is inside the "dispatcher" area).
- collisions in the same module can be solved by hashing again.
  (- this way, collisions with known implementations of dependencies could
     also be avoided)
- how to handle "dynamic" collisions, i.e. collisions between:
  1. the executable and libraries (= impls of dependencies), or
  2. between libraries
  - check the return address (link register or stack entry)?
    (must jump to a separate space first, which can be allocated at any
    address that the loader chooses)
- how to handle/avoid "saturation" (increasing collisions with an increasing
  number of functions)
  - the number of hash entries is roughly the number of unique imports across
    all modules that are directly or indirectly used in an executable.
  - 4096 / 8 bytes per jump = 512 entries
  - 2 MiB / 8 bytes = 262144 entries
  - could have multiple hash tables of exponentially growing sizes:
    - first 1 page, then 2 pages, then 4, ...
    - modules specify which hash table size they use.
    - solves the problem of modules with many direct imports
    - does NOT solve the problem of many indirect imports (via deep
      dependency trees, which can grow deeper after upgrading and/or when
      replacing an impl with another impl)
- how to avoid usage of a static address?
  - on platforms with "base-relative"/"bitmasked" jumps (instead of only
    plain relative jumps), it is possible to put the dispatcher at the
    lowest base-relative addresses.
    (this way, it is also possible to have multiple dispatchers!)

Solution 5: (simplest solution?)
- put a dispatcher page (or pages) before each module's code, containing
  only that module's imports.
- this way it can be accessed with a simple PC-relative jump
  (and there are no constant/fixed addresses at all)
- this requires (at least) 1 unshared/private page per module.
  (bad if there are many small modules)
- BUT with "dispatcher cache" files, the pages could be mmap'ed files.
  (but it still needs 1 additional mapping per module)
  (and might be slower to load, due to more disk accesses)
- BUT if the "dispatcher cache" data is placed at the end of each module's
  file, then loading would be fast, and it would no longer require any
  additional mappings.
- BUT with kernel support, the pages could be made shareable system-wide.