I guess the code samples inside the post are under https://david.alvarezrosa.com/LICENSE
But feel free to ping me if you need a different license; I'm quite open about it.
A lot of people focus on the code and then assume the device in question is only there to run it. There's so much you can tweak. I don't always measure it, but last time I saw at least a 20% improvement in network throughput just by tweaking a few things on the machine.
That specific advice isn't terribly transferable (you might choose to hack up SystemD or some other components instead, maybe even the problem definition itself), but the general idea of measuring and tuning the system running your code is solid.
If you're experiencing NMIs, the solution is simple if you don't care about the consequences; find them and remove them (ideally starting by finding what's generating them and verifying you don't need it). Disable the NMI watchdog, disable the PMU, disable PCIe Error Reporting (probably check dmesg and friends first to ensure your hardware is behaving correctly and fix that if not), disable anything related to NMIs at the BIOS/UEFI/IPMI/BMC layers, register a kernel module to swallow any you missed in your crusade, and patch the do_nmi() implementation with something sane for your use case in your custom kernel (there be dragons here, those NMIs obviously exist for a reason). It's probably easier to start from the ground up adding a minimal set of software for your system to run than to trim it back down, but either option is fine.
Are you experiencing NMIs though? You might want to take a peek at hwlatdetect and check for SMIs or other driver/firmware issues, fixing those as you find them.
It's probably also worth double-checking that you don't have any hard or soft IRQs being scheduled on your "isolated" core, that no RCU housekeeping is happening, etc. Make sure you pre-fault all the memory your software uses, no other core maps memory or changes page tables, power scaling is disabled (at least the deep C-states), you're not running workloads prone to thermal issues (1000W+ in a single chip is a lot of power, and it doesn't take much full-throttle AVX512 to heat it up), you don't have automatic updates of anything (especially not microcode or timekeeping), etc.
Also, generally speaking, your hardware can't actually multiplex most workloads without side effects. Abstractions letting you pretend otherwise are making compromises somewhere. Are devices you don't care about creating interrupts? That's a problem. Are programs you don't care about causing cache flushes? That's a problem. And so on. Strip the system back down to the bare minimum necessary to do whatever it is you want to do.
As to what SystemD is doing in particular? I dunno, probably something with timer updates, microcode updates, configuring thermals and power management some way I don't like, etc. I took the easy route and just installed something sufficiently minimalish and washed my hands of it. We went from major problems to zero problems instantly and never had to worry about DMA latency again.
What else could be improved? Would like to learn :)
Maybe using huge pages?
Disabling C-states, pinning network interfaces to dedicated cores (and isolating your application from those cores), and `SCHED_FIFO` (`chrt -f 99 <prog>`) help a lot.
Transparent hugepages increase latency without you being aware of when it happens, so I usually disable them.
Idk, there's a bunch, but they all depend on your use case. For example, I always disable hyperthreading because I care more about latency than processing power, and I don't want it randomly stealing cache from my workload. But some people have more I/O-bound workloads, and hyperthreading is just a strict improvement in those situations.
In prod most trading companies do disable it, not sure about generic benchmarks best practices
It seems that in this case, as you get contention, the faster end will slow down (as it is consuming what the other end just read), and this will naturally create a small buffer and run at good speeds.
The hard part is probably that sentinel and ensuring that it can be set/cleared atomically. In Rust you can use `Option<T>` to get a sentinel for any type (and it very often doesn't take any extra space), but I don't think there is an API to atomically set/clear that flag. (Technically I think this is always possible, because the sentinel that `Option` picks will always be small even if the `T` is very large, but I don't think there is an API for this.)
This was already the case with the cached index design at the end of the article, though. (Which doesn't require extra space or extra atomic stores.)
Would you mind expanding on the correctness guarantees enforced by the atomic semantics used? Are they ensuring two threads can't push to the same slot or pop the same value from the ring? This type of atomic coordination usually comes from CAS or atomic-increment calls, which I'm not seeing, so I'm interested in hearing your take on it.
> note that there are only one consumer and one producer
That clarifies things: you don't need multi-thread coordination on reads or writes if you assume a single producer and a single consumer.
- One single producer thread
- One single consumer thread
- Fixed buffer capacity
So, to answer:
> Are they ensuring two threads can't push to the same slot nor pop the same value from the ring?
No need for this use case :)
It seems relatively rare to have a single producer thread and a single consumer thread, and for polling a ring buffer to be worth it.
https://github.com/concurrencykit/ck/blob/master/include/ck_...
The whole point of lock-free data structures and algorithms is that sometimes you can do better by using these atomic operations inside your own code, rather than using a one-size-fits-all mutex based on those same atomic operations.
(Note that I say "sometimes". Too many people believe that lock-free structures are always faster; as always, your mileage may vary. In this case it's a huge win, to the point where I would bet it almost always moves the bottleneck to the code actually using the ring buffer.)
It's precisely the way we teach people how to build thread-safe systems. And we teach them to do it that way because we've learned from experience that letting them code up their own custom synchronization primitives leads to immense woe and suffering.
(and it's not slow because of the C++ mutex implementation, either - I tested a C/pthreads version, and it was the same speed as the C++ version)
I really don't understand what you are saying about not using custom primitives. The whole article is "YOLO your own synchronization" and it fails to grapple with the subtleties. An example of the unaddressed complexity: use of acquire-release semantics for head_ and tail_ atomics imposes no ordering whatsoever between observations of head_ and tail_. The final solution has four atomics that use acquire-release and does not discuss the fact that threads may observe the values of these four things in very surprising order. The issue is so complex that I consider this 50-page academic paper to be the bare minimum survey of the problem that a programmer should thoroughly understand before they even consider using atomics.
Which is a slight shame, since Load-Linked/Store-Conditional is pretty cool, but I guess that's limited to ARM anyway, and now they've added extensions for CAS due to speed.
Things get interesting when you're working with a CPU that lacks the ldrex/strex assembly instructions that make this all work. I think your only options at that point are disabling/enabling interrupts. If anyone has any insights into this constraint, I'd love to hear it.
In reality, multiple threads on a single core don't make much sense, right?
Not necessarily, I think -- depends what you're doing.
https://en.cppreference.com/w/cpp/atomic/atomic/compare_exch...
The exact syntax and naming will of course differ, but any language that exposes low-level atomics at all is going to provide a pretty similar set of operations.
Sure, C++ has a particular way of describing atomics in a cross-platform way, but the actual hardware operations are not specific to the language.
But at the hardware level they're all kind of the same.
Also, it doesn't have fences on the store, has extra branches that shouldn't be there, and is written in really stylistically weird C++.
Maybe an LLM that likes a different language more, copying a broken implementation off GitHub? Mostly commenting because the initial replies are "best" and "lol", though I sympathise with one of those.
Are we reading the same code? The stores are clearly after value accesses.
> Also doesn't have fences on the store
?? It uses acquire/release semantics seemingly correctly. Explicit fences are not required.
    buffer_[head] = value;
    head_.store(next_head, std::memory_order_release);
    return true;
There's no relationship between the two written variables. Stores to the two are independent and can be reordered. The aq/rel applies to the index, not to the unrelated non-atomic buffer located near the index.
No, this is incorrect. If you think there's no relationship, you don't understand "release" semantics.
https://en.cppreference.com/w/cpp/atomic/memory_order.html
> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.
> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below).
Relaxed atomic writes can be reordered in any way.
To quibble a little bit: later program-order writes CAN be reordered before release writes. But earlier program-order writes may not be reordered after release writes.
> Relaxed atomic writes can be reordered in any way.
To quibble a little bit: they can't be reordered with other operations on the same variable.
I stand corrected.
Regarding the style, it follows the "almost always auto" idea from Herb Sutter.
    if (next_head == buffer.size())
        next_head = 0;
https://github.com/concurrencykit/ck/blob/master/include/ck_...
It's mentioned in the post, but worth reiterating!
In this post, I walk you step by step through implementing a single-producer single-consumer queue from scratch.
This pattern is widely used to share data between threads in the lowest-latency environments.