Memory sharing vs execution time overhead
=========================================

There are basically 7 common "systems" to share code:

1. Position-independent + GOT/PLT (most ELF/*nix systems?)
2. Position-independent + GOT/PLT + prelinked files that are used if possible
3. Prelinking, and on collision: relocate on startup (Windows NT?)
4. Prelinking, and on collision: relocate on page-in (Windows 9x)
5. Prelinking, without collision handling (very old Linux versions)
6. Relocation at installation time (maybe systems with AOT compilation?)
7. Central dispatcher function (e.g. microkernel system or OS/BIOS)

Number 4 is IMHO the best option (together with compressed pages/zram),
but it requires kernel support.

What is needed for SLUL?
------------------------

On most platforms, code can use:

- relative jumps
- relative data addressing (constant data, adjacent to the code, with
  1 shared page / tail-head merging)
- possibly relative function addressing (if function refs are stored as
  native pointers)

But:

- calls between modules need special care
- constant data access between modules needs special care
- function pointer addressing might need special care
  (special _addr_typeXXX data slot)
- function pointer invocation might need special care
  (special _call_typeXXX jump slot)

Solution 1:
- this can be very efficient at runtime:
  - only one or a few memory mappings
  - sharing is possible if done on a per-system basis rather than per-user
- a "binary image file" with all code/data compiled together.
  it can be mmap'ed in one operation.
- must be updated on changes in the underlying files.
- can be partitioned to reduce regeneration time, and to support more than
  2 GiB of code on 32-bit systems.
- can be stored in the home directory as well.

Solution 2:
- this reduces the number of mappings to one or a couple of unshared/private
  pages + 1 shared page per lib.
- encode a base pointer + an offset in the address of each mapped library.
  the base pointer points to an "offset table" / "GOT".
  the offset is the offset in the offset table.
- the base pointer could simply be found by zeroing out the low-order bits.
- the offset could be the "mid-order bits".
- example 32-bit address:
  XBBB MMMM MMMM LLLL LLLL LLLL LLLL LLLL
  X: 1 bit.   must be 0 to avoid kernel addresses.
  B: 3 bits.  base pointer
  M: 8 bits.  offset in the "offset table".
  L: 20 bits. code address (up to 1 MiB of code per "block")
- libraries that use more than 1 MiB can use more than one block
  (and also need duplicated entries in the offset table)
- how to efficiently call into the offset table?

Solution 3:
- old-Linux-style prelinking with a (literally) global address space
- all executables can use a constant address
- address format (32-bit version):
  XHHH HHHH HHHH HHHH HLLL LLLL LLLL LLLL
  X: 1 bit.   must be 0 to avoid kernel addresses.
  H: 16-bit name hash (oops - this will have collisions!)
  L: 15 bits. code address (up to 32 KiB, then following blocks are used
     sequentially)
- how big is the chance of collisions?
  (birthday approximation: p ≈ 1 - exp(-n(n-1)/(2N)), with N = 2^16 here)
  - exceeds 1% at 36 libraries
  - at 300 libraries, it is 50%
- with dedicated look-up pages without other code (=smaller), the number of
  hash bits could be increased (4 KiB per look-up table, 1 per library):
  XHHH HHHH HHHH HHHH HHHH LLLL LLLL LLLL
  X: 1 bit.   must be 0 to avoid kernel addresses.
  H: 19-bit name hash (oops - this will still have collisions!)
  L: 12 bits. code address (up to 4 KiB, then following blocks are used
     sequentially)
- how big is the chance of collisions?
  - exceeds 1% at 103 libraries
  - at 851 libraries, it is 50%

Solution 4: (best solution? but not that scalable, and reduces security by
creating a loophole in ASLR)
- "central dispatcher"
- have a "dispatcher" at a constant base address (globally constant, i.e.
  the same for everyone, everywhere)
- functions are deterministically "hashed" somehow to generate an absolute
  address (which is inside the "dispatcher" area).
- collisions in the same module can be solved by hashing again.
  (- this way, collisions with known implementations of dependencies could
     also be avoided)
- how to handle "dynamic" collisions, i.e. collisions between:
  1. the executable and libraries (= impls of dependencies), or
  2. between libraries
  - check the return address (link register or stack entry)?
    (must jump to a separate space first, which can be allocated at any
    address that the loader chooses)
- how to handle/avoid "saturation" (increasing collisions with an increasing
  number of functions)
  - the number of hash entries is roughly the number of unique imports across
    all modules that are directly or indirectly used in an executable.
  - 4096 / 8 bytes per jump = 512 entries
  - 2 MiB / 8 bytes = 262144 entries
  - could have multiple hash tables of exponentially growing sizes:
    - first 1 page, then 2 pages, then 4, ...
    - modules specify which hash table size they use.
    - solves the problem of modules with many direct imports
    - does NOT solve the problem of many indirect imports (via deep
      dependency trees, which can grow deeper after upgrading and/or when
      replacing an impl with another impl)
- how to avoid usage of a static address?
  - on platforms with "base-relative"/"bitmasked" jumps (instead of only
    plain relative jumps), it is possible to put the dispatcher at the
    lowest base-relative addresses.
    (this way, it is also possible to have multiple dispatchers!)

Solution 5: (simplest solution?)
- put a dispatcher page (or pages) before each module's code, containing
  only that module's imports.
- this way it can be accessed with a simple PC-relative jump
  (and there are no constant/fixed addresses at all)
- this requires (at least) 1 unshared/private page per module.
  (bad if there are many small modules)
- BUT with "dispatcher cache" files, the pages could be mmap'ed files.
  (but it still needs 1 additional mapping per module)
  (and might be slower to load, due to more disk accesses)
- BUT if the "dispatcher cache" data is placed at the end of each module's
  file, then loading would be fast, and it would no longer require any
  additional mappings.
- BUT with kernel support, the pages could be made shareable system-wide.