Memory sharing vs execution time overhead
=========================================
There are basically 7 common "systems" to share code:
1. Position Independent + GOT/PLT (most ELF/*nix systems?)
2. Position Independent + GOT/PLT + prelinked files that are used if possible
3. Prelinking, and on collision: relocate on startup (Windows NT?)
4. Prelinking, and on collision: relocate on page-in (Windows 9x)
5. Prelinking, without collision handling (very old Linux versions)
6. Relocation at installation time (maybe systems with AOT compilation?)
7. Central dispatcher function (e.g. microkernel system or OS/BIOS)
Number 4 is IMHO the best option (together with compressed pages/zram),
but it requires kernel support.
What is needed for SLUL?
------------------------
On most platforms, code can use:
- relative jumps
- relative data addressing (Constant data. Adjacent to code, with 1 shared page / tail-head merging)
- possibly relative function addressing (if function refs are stored as native pointers)
But:
- calls between modules need special care
- constant data access between modules needs special care
- function pointer addressing might need special care (special _addr_typeXXX data slot)
- function pointer invocation might need special care (special _call_typeXXX jump slot)
Solution 1:
- this can be very efficient at runtime:
- only one or a few memory mappings
- sharing is possible if done on a per-system basis rather than per-user.
- "binary image file" with all code/data compiled together. can be mmap'ed in one operation.
- must be updated on changes in the underlying files.
- can be partitioned to reduce regeneration time, and
to support more than 2GiB of code on 32 bit systems.
- can be stored in home directory as well.
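To illustrate the "one mmap operation" point, here is a small Python sketch. It builds a stand-in image file first (in the real design this would be the pregenerated binary image); everything else about the file is illustrative, not part of any real SLUL format:

```python
import mmap
import os
import tempfile

# Demonstration only: create a stand-in "binary image file", then map it
# with a single mmap() call, as Solution 1 proposes for the compiled image.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x90" * 4096)        # pretend: all code/data compiled together
    path = f.name

fd = os.open(path, os.O_RDONLY)
image = mmap.mmap(fd, 0, prot=mmap.PROT_READ)   # one read-only mapping
```

With the whole image in one read-only file mapping, the pages are automatically shared between all processes that map the same file, which is why per-system (rather than per-user) generation gives the best sharing.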
Solution 2:
- this reduces the number of mappings to one or a couple of unshared/private pages + 1 shared page per lib.
- encode a base pointer + an offset in the address of each mapped library.
the base pointer points to an "offset table" / "GOT".
the offset is the offset in the offset table.
- base pointer could simply be found by zeroing out low order bits.
- offset could be "mid order bits".
- example 32 bit address: XBBB MMMM MMMM LLLL LLLL LLLL LLLL LLLL
X: 1 bit. must be 0 to avoid kernel addresses.
B: 3 bits. base pointer
M: 8 bits. offset in "offset table".
L: 20 bits. code address (up to 1 MiB of code per "block")
- libraries that use more than 1 MiB can use more than 1 block
  (and also need duplicated entries in the offset table)
- how to efficiently call into the offset table?
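As a sanity check on the encoding above, a minimal Python sketch that recovers all three fields with plain masks and shifts (the function name and the example address are illustrative, not part of any real ABI):

```python
def decode(addr: int):
    """Split a 32-bit address laid out as X BBB MMMM MMMM + 20 L bits.
    Field names follow the diagram above; everything else is illustrative."""
    assert addr < 0x8000_0000            # X bit must be 0 (no kernel addresses)
    base = addr & 0x7000_0000            # B: found by zeroing out low-order bits
    index = (addr >> 20) & 0xFF          # M: slot in the "offset table" / "GOT"
    local = addr & 0x000F_FFFF           # L: offset within the 1 MiB code block
    return base, index, local

# Example: block at base 0x30000000, offset-table slot 0xA5, local 0x12345
base, index, local = decode(0x3A51_2345)
assert (base, index, local) == (0x3000_0000, 0xA5, 0x1_2345)
```

Decoding is cheap (two ANDs and a shift); the open question above is the jump sequence through the offset table, which this sketch does not answer.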
Solution 3:
- old-Linux-style prelinking with a (literally) global address space
- all executables can use a constant address
- address format (32 bit version):
- XHHH HHHH HHHH HHHH HLLL LLLL LLLL LLLL
X: 1 bit. must be 0 to avoid kernel addresses.
H: 16 bit name hash (oops - this will have collisions!)
L: 15 bits. code address (up to 32 KiB, then following blocks are used sequentially)
- how big is the chance of collisions?
- exceeds 1% at 36 libraries
- at 300 libraries, it is 50%
- with dedicated look-up pages without other code (=smaller), the number
of hash bits could be increased: (4 KiB per look-up table, 1 per library)
- XHHH HHHH HHHH HHHH HHHH LLLL LLLL LLLL
X: 1 bit. must be 0 to avoid kernel addresses.
H: 19 bit name hash (oops - this will have collisions!)
L: 12 bits. code address (up to 4 KiB, i.e. one look-up table, then following blocks are used sequentially)
- how big is the chance of collisions?
- exceeds 1% at 103 libraries
- at 851 libraries, it is 50%
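The collision figures above follow from the standard birthday-problem approximation; a quick Python check (the function name is mine):

```python
import math

def collision_prob(n_libs: int, hash_bits: int) -> float:
    """Birthday approximation: chance that at least two of n_libs
    name hashes collide in a 2**hash_bits space."""
    space = 2 ** hash_bits
    return 1.0 - math.exp(-n_libs * (n_libs - 1) / (2 * space))

# 16-bit hash: crosses ~1% around 36-37 libraries, ~50% around 300
# 19-bit hash: crosses ~1% around 103-104 libraries, ~50% around 851
```

Note the birthday effect: quadrupling the hash space (16 -> 19 bits, factor 8) only raises the "safe" library count by roughly the square root of that factor.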
Solution 4: (best solution? but not that scalable, and reduces security by creating a loophole in ASLR)
- "central dispatcher"
- have a "dispatcher" at a constant base address (globally constant, i.e. same for everyone, everywhere)
- functions are deterministically "hashed" somehow to generate an absolute address
(which is inside the "dispatcher" area).
- collisions in the same module can be solved by hashing again.
(- this way collisions with known implementations of dependencies could also be avoided)
- how to handle "dynamic" collisions, i.e. collisions between:
1. executable and libraries (=impls of dependencies), or
2. between libraries
- check return address (link register or stack entry)?
(must jump to a separate space first, which can be allocated at any address that the loader chooses)
- how to handle/avoid "saturation" (increasing collisions with increasing number of functions)
- the number of hash-entries is roughly the number of unique imports across all modules that are directly or indirectly used in an executable.
- 4096 / 8 bytes per jump = 512 entries
- 2 MiB / 8 bytes = 262144 entries
- could have multiple hash tables of exponentially growing sizes:
- first 1 page, then 2 pages, then 4, ...
- modules specify which hash table size they use.
- solves the problem of modules with many direct imports
- does NOT solve the problem with many indirect imports
(via deep dependency trees, that can grow deeper after upgrading
and/or when replacing an impl with another impl)
- how to avoid usage of a static address?
- on platforms with "base-relative"/"bitmasked" jumps (instead of only plain relative jumps),
it is possible to put the dispatcher at the lowest base-relative addresses.
(this way, it is also possible to have multiple dispatchers!)
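A toy model of the deterministic hashing with rehash-on-collision described above (the hash function, slot count, and all names are illustrative choices, not a spec):

```python
import hashlib

SLOTS = 512  # one 4 KiB dispatcher page at 8 bytes per jump slot (see above)

def slot_for(name: str, attempt: int = 0) -> int:
    """Map a function name to a dispatcher slot, deterministically:
    the same (name, attempt) pair yields the same slot on every machine.
    'attempt' is bumped to hash again on an intra-module collision."""
    digest = hashlib.sha256(f"{name}#{attempt}".encode()).digest()
    return int.from_bytes(digest[:4], "little") % SLOTS

def assign_slots(functions):
    """Resolve collisions *within* one module by rehashing; returns
    {slot: (name, attempt)} so the chosen attempt can be recorded."""
    table = {}
    for name in functions:
        attempt = 0
        while slot_for(name, attempt) in table:
            attempt += 1               # collision inside the module: rehash
        table[slot_for(name, attempt)] = (name, attempt)
    return table
```

This only handles collisions visible at build time; the "dynamic" collisions between separately built modules are exactly the case that needs the return-address check (or similar) discussed above.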
Solution 5: (simplest solution?)
- put a dispatcher page (or pages) before each module's code,
which contains only the module's imports.
- this way it can be accessed by a simple PC-relative jump (and there are no constant/fixed addresses at all)
- this requires (at least) 1 unshared/private page per module. (bad if there are many small modules)
- BUT with "dispatcher cache" files, the pages could be mmap'ed files.
(but it still needs 1 additional mapping per module)
(and might be slower to load, due to more disk accesses)
- BUT if the "dispatcher cache" data is placed at the end of each module's file,
then loading would be fast, and it would no longer require any additional mappings.
- BUT with kernel support, the pages could be made shareable systemwide.
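The per-module layout of Solution 5 can be modelled in a few lines of Python (page/slot sizes and all names are illustrative): the dispatcher page(s) sit immediately before the code, so every import slot is reachable at a fixed, PC-relative negative offset from the start of the code.

```python
PAGE_SIZE = 4096
SLOT_SIZE = 8   # one jump slot per import

def layout_module(imports, code_size):
    """Place the private dispatcher page(s) directly before the code and
    return each import's constant offset relative to the code's start."""
    n_pages = max(1, (len(imports) * SLOT_SIZE + PAGE_SIZE - 1) // PAGE_SIZE)
    disp_size = n_pages * PAGE_SIZE
    offsets = {name: -disp_size + i * SLOT_SIZE
               for i, name in enumerate(imports)}
    return disp_size, offsets
```

Even an import-free module pays for one private page in this model, which is the "bad if there are many small modules" cost noted above.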