Thread synchronization
======================

Alternative solutions to traditional lock + memory ordering:

Use a memory area for each "channel endpoint"
---------------------------------------------

- Each channel endpoint has ONE sending/producing thread.
- Each channel endpoint has ONE receiving/consuming thread.

How does this scale?

With 4 KiB pages and ALL threads communicating with each other (worst case),
every ordered (sender, receiver) pair needs its own page:

    Threads   Memory usage = n*(n-1)*4KiB
          4        48 KiB
          8       224 KiB
         16       960 KiB
         32     3 968 KiB
         64    16 128 KiB
        128      63.5 MiB
        256     255.0 MiB   <-- here it starts getting really problematic
        512       1.0 GiB
       1024       4.0 GiB
       2048      16.0 GiB
       4096      64.0 GiB

With 32 byte queues (one cache line, assuming 32-byte lines):

    Threads   Memory usage = n*(n-1)*32
          4       384 B
          8      1792 B
         16      7680 B
         32        31 KiB
         64       126 KiB
        128       508 KiB
        256      2040 KiB
        512       8.0 MiB
       1024      32.0 MiB
       2048     127.9 MiB
       4096     511.9 MiB

Another problem is that each thread needs to check one queue per sender,
i.e. n-1 queues. That's a lot of queues to loop through.
(Unless it is possible to send "messages" between CPU threads)
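
The per-pair layout and the polling cost described above can be sketched roughly as follows. This is only an illustration under several assumptions that the notes do not state: the names `Slot`, `Mesh`, `try_send` and `poll` are hypothetical, each endpoint holds a single 64-bit message with 0 reserved for "empty", and the slot is published with acquire/release atomics (so it still relies on memory ordering, just without a lock).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One channel endpoint: a 32-byte slot with exactly one sender and one receiver.
/// 0 means "empty"; any other value is a pending message (hypothetical encoding).
#[repr(align(32))]
pub struct Slot(AtomicU64);

/// All n*(n-1) endpoints for n threads, one per ordered (sender, receiver) pair.
pub struct Mesh {
    n: usize,
    slots: Vec<Slot>,
}

impl Mesh {
    pub fn new(n: usize) -> Self {
        let count = n * n.saturating_sub(1);
        let slots = (0..count).map(|_| Slot(AtomicU64::new(0))).collect();
        Mesh { n, slots }
    }

    /// Maps (sender, receiver) to a slot index, skipping the sender == receiver diagonal.
    fn index(&self, sender: usize, receiver: usize) -> usize {
        assert!(sender != receiver && sender < self.n && receiver < self.n);
        let col = if receiver < sender { receiver } else { receiver - 1 };
        sender * (self.n - 1) + col
    }

    /// Memory used by the slots themselves, in bytes: n*(n-1)*32.
    pub fn bytes(&self) -> usize {
        self.slots.len() * std::mem::size_of::<Slot>()
    }

    /// Must only be called by thread `sender`. Returns false if the slot is still full.
    pub fn try_send(&self, sender: usize, receiver: usize, msg: u64) -> bool {
        assert!(msg != 0, "0 is reserved for 'empty'");
        let slot = &self.slots[self.index(sender, receiver)].0;
        if slot.load(Ordering::Acquire) != 0 {
            return false; // receiver has not consumed the previous message yet
        }
        slot.store(msg, Ordering::Release);
        true
    }

    /// Must only be called by thread `receiver`. Checks one slot per possible
    /// sender, so every call touches n-1 slots even when nothing was sent.
    pub fn poll(&self, receiver: usize) -> Vec<(usize, u64)> {
        let mut out = Vec::new();
        for sender in (0..self.n).filter(|&s| s != receiver) {
            let slot = &self.slots[self.index(sender, receiver)].0;
            let msg = slot.swap(0, Ordering::AcqRel); // take the message, mark empty
            if msg != 0 {
                out.push((sender, msg));
            }
        }
        out
    }
}
```

With `Mesh::new(4)`, `bytes()` gives the 384 B from the table above. The loop in `poll` is the cost the last paragraph worries about: it grows linearly with the number of threads regardless of how many of them actually sent anything.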