I guess the code samples inside the post are under https://david.alvarezrosa.com/LICENSE
But feel free to ping me if you need a different license; I'm quite open about it.
A lot of people focus on the code and then assume the device in question is only there to run it. There's so much you can tweak. I don't always measure it, but last time I saw at least a 20% improvement in network throughput just by tweaking a few things on the machine.
That specific advice isn't terribly transferable (you might choose to hack up SystemD or some other components instead, maybe even the problem definition itself), but the general idea of measuring and tuning the system running your code is solid.
If you're experiencing NMIs, the solution is simple if you don't care about the consequences; find them and remove them (ideally starting by finding what's generating them and verifying you don't need it). Disable the NMI watchdog, disable the PMU, disable PCIe Error Reporting (probably check dmesg and friends first to ensure your hardware is behaving correctly and fix that if not), disable anything related to NMIs at the BIOS/UEFI/IPMI/BMC layers, register a kernel module to swallow any you missed in your crusade, and patch the do_nmi() implementation with something sane for your use case in your custom kernel (there be dragons here, those NMIs obviously exist for a reason). It's probably easier to start from the ground up adding a minimal set of software for your system to run than to trim it back down, but either option is fine.
Are you experiencing NMIs though? You might want to take a peek at hwlatdetect and check for SMIs or other driver/firmware issues, fixing those as you find them.
It's probably also worth double-checking that you don't have any hard or soft IRQs being scheduled on your "isolated" core, that no RCU housekeeping is happening, etc. Make sure you pre-fault all the memory your software uses, no other core maps memory or changes page tables, power scaling is disabled (at least the deep C-states), you're not running workloads prone to thermal issues (1000W+ in a single chip is a lot of power, and it doesn't take much full-throttle AVX512 to heat it up), you don't have automatic updates of anything (especially not microcode or timekeeping), etc.
Also, generally speaking, your hardware can't actually multiplex most workloads without side effects. Abstractions letting you pretend otherwise are making compromises somewhere. Are devices you don't care about creating interrupts? That's a problem. Are programs you don't care about causing cache flushes? That's a problem. And so on. Strip the system back down to the bare minimum necessary to do whatever it is you want to do.
As to what SystemD is doing in particular? I dunno, probably something with timer updates, microcode updates, configuring thermals and power management some way I don't like, etc. I took the easy route and just installed something sufficiently minimalish and washed my hands of it. We went from major problems to zero problems instantly and never had to worry about DMA latency again.
What else could be improved? Would like to learn :)
Maybe using huge pages?
Disabling C-states, pinning network interfaces to dedicated cores (and isolating your application from those cores), and `SCHED_FIFO` (`chrt -f 99 <prog>`) help a lot.
Transparent hugepages increase latency without you being aware of when it happens, so I usually disable them.
Idk, there's a bunch, but they all depend on your use case. For example, I always disable hyperthreading because I care more about latency than processing power, and I don't want it randomly stealing cache from my workload. But some people have more I/O-bound workloads, and hyperthreading is just a strict improvement in those situations.
In prod most trading companies do disable it, not sure about generic benchmarks best practices
It seems that in this case, as you get contention, the faster end will slow down (as it is consuming what the other end just read), and this will naturally create a small buffer and run at good speeds.
The hard part is probably that sentinel and ensuring that it can be set/cleared atomically. In Rust you can use `Option<T>` to get a sentinel for any type (and it very often doesn't take any extra space), but I don't think there is an API to atomically set/clear that flag. (Technically I think this is always possible, because the sentinel that `Option` picks will always be small even if the `T` is very large, but I don't think there is an API for this.)
This was already the case with the cached index design at the end of the article, though. (Which doesn't require extra space or extra atomic stores.)
Would you mind expanding on the correctness guarantees enforced by the atomic semantics used? Are they ensuring two threads can't push to the same slot or pop the same value from the ring? This type of atomic coordination usually comes from CAS or atomic-increment calls, which I'm not seeing, so I'm interested in hearing your take on it.
> note that there are only one consumer and one producer
That clarifies things: you don't need multi-thread coordination on reads or writes if you assume a single producer and a single consumer.
- One single producer thread
- One single consumer thread
- Fixed buffer capacity
So, to answer:
> Are they ensuring two threads can't push to the same slot nor pop the same value from the ring?
No need for this use case :)
It seems relatively rare to have a single producer thread and a single consumer thread, and for polling a ring buffer to be worth it.
https://github.com/concurrencykit/ck/blob/master/include/ck_...
The whole point of lock-free data structures and algorithms is that sometimes you can do better by using these atomic operations inside your own code, rather than using a one-size-fits-all mutex based on those same atomic operations.
(Note that I say "sometimes". Too many people believe that lock-free structures are always faster; as always, your mileage may vary. In this case it's a huge win, to the point where I would bet it almost always moves the bottleneck to the code actually using the ring buffer.)
It's precisely the way we teach people how to build thread-safe systems. And we teach them to do it that way because we've learned from experience that letting them code up their own custom synchronization primitives leads to immense woe and suffering.
(and it's not slow because of the C++ mutex implementation, either - I tested a C/pthreads version, and it was the same speed as the C++ version)
I really don't understand what you are saying about not using custom primitives. The whole article is "YOLO your own synchronization" and it fails to grapple with the subtleties. An example of the unaddressed complexity: use of acquire-release semantics for head_ and tail_ atomics imposes no ordering whatsoever between observations of head_ and tail_. The final solution has four atomics that use acquire-release and does not discuss the fact that threads may observe the values of these four things in very surprising order. The issue is so complex that I consider this 50-page academic paper to be the bare minimum survey of the problem that a programmer should thoroughly understand before they even consider using atomics.
Which is a slight shame, since Load-Linked/Store-Conditional is pretty cool, but I guess that's limited to ARM anyway, and now they've added extensions for CAS due to speed.
Things get interesting when you're working with a CPU that lacks the ldrex/strex assembly instructions that make this all work. I think your only options at that point are disabling/enabling interrupts. If anyone has any insights into this constraint, I'd love to hear it.
In reality, multiple threads on a single core don't make much sense, right?
Not necessarily, I think -- depends what you're doing.
https://en.cppreference.com/w/cpp/atomic/atomic/compare_exch...
The exact syntax and naming will of course differ, but any language that exposes low-level atomics at all is going to provide a pretty similar set of operations.
Sure, C++ has a particular way of describing atomics in a cross-platform way, but the actual hardware operations are not specific to the language.
But at the hardware level they're all kind of the same.
Also, it doesn't have fences on the store, has extra branches that shouldn't be there, and is written in really stylistically weird C++.
Maybe an LLM that likes a different language more, copying a broken implementation off GitHub? Mostly commenting because the initial replies are "best" and "lol", though I sympathise with one of those.
Are we reading the same code? The stores are clearly after value accesses.
> Also doesn't have fences on the store
?? It uses acquire/release semantics seemingly correctly. Explicit fences are not required.
    buffer_[head] = value;
    head_.store(next_head, std::memory_order_release);
    return true;
There's no relationship between the two written variables. Stores to the two are independent and can be reordered. The aq/rel applies to the index, not to the unrelated non-atomic buffer located near the index.
No, this is incorrect. If you think there's no relationship, you don't understand "release" semantics.
https://en.cppreference.com/w/cpp/atomic/memory_order.html
> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.
> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below).
Relaxed atomic writes can be reordered in any way.
To quibble a little bit: later program-order writes CAN be reordered before release writes. But earlier program-order writes may not be reordered after release writes.
> Relaxed atomic writes can be reordered in any way.
To quibble a little bit: they can't be reordered with other operations on the same variable.
I stand corrected.
Regarding the style, it follows the "almost always auto" idea from Herb Sutter.
    if (next_head == buffer.size())
        next_head = 0;
https://github.com/concurrencykit/ck/blob/master/include/ck_...
It's mentioned in the post, but worth reiterating!
In this post, I walk you step by step through implementing a single-producer single-consumer queue from scratch.
This pattern is widely used to share data between threads in the lowest-latency environments.