There's also an order of magnitude more events when doing event-based processing.
This seems like a perfectly reasonable starting point and gateway that keeps things organized for when the time comes.
Most things don’t scale that big.
The idea behind a DLQ is that it will retry (with some backoff) eventually, and if it fails enough times, the message stays there. You need monitoring to catch messages that can't escape the DLQ. Ideally, nothing should ever stay in the DLQ, and if something does, it's something that should be fixed.
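A minimal, toy sketch of that retry-then-park behaviour (in-memory Python standing in for whatever broker redelivery mechanism you actually use; the attempt limit and backoff numbers are made up for illustration):

    import time
    from collections import deque

    MAX_ATTEMPTS = 5      # after this many failed attempts the message parks in the DLQ
    BASE_BACKOFF_S = 0.1  # exponential backoff between retries

    queue = deque([{"id": 1, "body": "ok"}, {"id": 2, "body": "always-fails"}])
    dlq = []              # anything landing here should page a human

    def process(msg):
        if msg["body"] == "always-fails":
            raise RuntimeError("handler bug")

    while queue:
        msg = queue.popleft()
        attempt = msg.get("attempt", 0) + 1
        try:
            process(msg)
        except Exception:
            if attempt >= MAX_ATTEMPTS:
                dlq.append(msg)                      # parked until someone fixes the cause
            else:
                time.sleep(BASE_BACKOFF_S * 2 ** (attempt - 1))
                queue.append({**msg, "attempt": attempt})

    print("stuck in DLQ:", dlq)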
Sure, it's unavailability of course, but it's not data loss.
But if your DLQ is overloaded, you probably want to slow down or stop, since sending a large fraction of your traffic to the DLQ is counterproductive. E.g. if you are sending 100% of messages to the DLQ due to a bug, you should stop processing, fix the bug, and then resume from your normal queue.
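One way to make that concrete is a simple guard on the recent DLQ rate: if too large a fraction of a rolling window got DLQ'd, pause consumption instead of continuing to shovel traffic at the DLQ. The window size and threshold below are made-up illustrative numbers:

    from collections import deque

    WINDOW = 200              # how many recent outcomes to consider
    MAX_DLQ_FRACTION = 0.25   # above this, stop and investigate

    recent = deque(maxlen=WINDOW)   # True = DLQ'd, False = processed fine

    def record_outcome(dlqd: bool) -> None:
        recent.append(dlqd)

    def should_pause_consumption() -> bool:
        if len(recent) < WINDOW:
            return False            # not enough signal yet
        return sum(recent) / len(recent) > MAX_DLQ_FRACTION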
If the problem is that the consumers themselves cannot write to the DLQ, then that feels like either Kafka is dying (no more writes allowed) or the consumers have been misconfigured.
Edit: In fact there seems to be a self-inflicted problem being created here: having the DLQ on a different system, whether it be another instance of Kafka, or Postgres, or what have you, is really just creating another point of failure.
There's a balance. Do you want your Kafka cluster provisioned for double your normal event intake rate, just in case the worst-case failure to produce elsewhere sends 100% of events to the DLQ? (That doubles your writes to the shared cluster, which could itself cause failures to produce to the original topic.)
In that sort of system, failing to produce to the original topic is probably what you want to avoid most. As long as your retention period isn't shorter than your time to recover from an incident like that, priority 1 is often "make sure the events are recorded so they can be processed later."
IMO a good architecture here cleanly separates transient failures (don't DLQ; retry with backoff and don't advance the consumer group) from "permanently cannot process" (DLQ only these), unlike in the linked article. That greatly reduces the odds of "everything is being DLQ'd!" causing cascading failures by overloading seldom-stressed parts of the system. It makes it much easier to keep your DLQ in one place, and you can solve some of the visibility problems from the article with a consumer that puts summary info elsewhere or similar. There's still a chance for a bug that results in everything being wrongly rejected, but it makes you potentially much more robust against transient downstream deps having a high blast radius. (One nasty case: if different messages have wildly different sets of downstream deps, do you want some of them blocking all the others? IMO they should then be partitioned so that you can still move forward on the others.)
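A rough sketch of that split, with toy exception classes standing in for "downstream dep hiccuped" vs. "this message can never be processed"; the list index stands in for the consumer offset, which only advances on success or on a deliberate DLQ:

    import time

    class TransientError(Exception): ...   # e.g. downstream timeout; retry in place
    class PermanentError(Exception): ...   # e.g. fails validation; DLQ it

    MAX_BACKOFF_S = 30

    def run(messages, process, send_to_dlq):
        i = 0                               # stands in for the consumer offset
        while i < len(messages):
            msg, backoff = messages[i], 1
            while True:
                try:
                    process(msg)
                    break                   # success: advance
                except PermanentError:
                    send_to_dlq(msg)        # only these ever reach the DLQ
                    break                   # advance past the bad message
                except TransientError:
                    time.sleep(backoff)     # hold position and retry
                    backoff = min(backoff * 2, MAX_BACKOFF_S)
            i += 1                          # commit / advance the offset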
BUT, you are 100% right to point to what I think is the proper solution, and that is to treat the DLQ with some respect, not as a bit bucket where things get dumped because the wind isn't blowing in the right direction.
(The queue probably isn't down if you've just pulled a message off it.)
No need to look down on PG just because it's more approachable and no longer a specialized skill.
Learned something new today. I knew what FOR UPDATE did, but somehow I've never RTFM'd hard enough to know about the SKIP LOCKED directive. That's pretty cool.
Something I've still not mastered is how to prevent lock escalation into table-locks, which could torpedo all of this.
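For reference, a minimal sketch of the claim query that usually goes with it (psycopg2, with a hypothetical jobs(id, status, payload) table and DSN). SKIP LOCKED is what lets concurrent workers each grab a different row without blocking on, or double-claiming, rows another transaction has locked:

    import psycopg2

    CLAIM_SQL = """
        UPDATE jobs
           SET status = 'running'
         WHERE id = (
                SELECT id
                  FROM jobs
                 WHERE status = 'queued'
                 ORDER BY id
                 LIMIT 1
                 FOR UPDATE SKIP LOCKED
               )
        RETURNING id, payload;
    """

    def claim_one(conn):
        with conn.cursor() as cur:
            cur.execute(CLAIM_SQL)
            row = cur.fetchone()     # None means nothing claimable right now
        conn.commit()                # committing releases the row lock
        return row

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=app")   # hypothetical DSN
        print(claim_one(conn))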
The idea is that if your DLQ has consistently high volume, there is something wrong with your upstream data, or data handling logic, not the architecture.
Furthermore, why have two indexes with the same leading field (status)?
https://web.archive.org/web/20240309030618/https://www.2ndqu...
corresponding HN discussion thread from 2016 https://news.ycombinator.com/item?id=14676859
[†] It seems that all the old 2ndquadrant.com blog post links have been broken since their acquisition by EnterpriseDB.
PostgreSQL FOR UPDATE SKIP LOCKED: The One-Liner Job Queue https://www.dbpro.app/blog/postgresql-skip-locked
It covers the race condition, the atomic claim behaviour, worker crashes, and how priorities and retries are usually layered on top. Very much the same approach described in the old 2ndQuadrant post, but with a modern end-to-end example.
DuckDB is on our radar. In practice each database still needs some engine-specific work to feel good, so a fully generic plugin system is harder than it sounds. We are thinking about how to do this in a scalable way.
I have a simple flow: tasks are on the order of thousands per hour. I just use PostgreSQL. High visibility, easy requeue, durable store. With an appropriate index, it's perfectly fine. An LLM will write the SKIP LOCKED code right the first time. Easy local dev. I always reach for Postgres as the event bus in low-volume systems.
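For what it's worth, a sketch of the kind of table and index that keeps this comfortable at thousands of tasks an hour; the partial index means the "find next queued task" lookup only ever touches unprocessed rows. Table, column names, and DSN are made up for illustration:

    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS tasks (
        id         bigserial    PRIMARY KEY,
        status     text         NOT NULL DEFAULT 'queued',
        payload    jsonb        NOT NULL,
        created_at timestamptz  NOT NULL DEFAULT now()
    );
    -- Partial index: only queued rows, so it stays tiny even if you keep
    -- completed tasks around for reporting or requeueing.
    CREATE INDEX IF NOT EXISTS tasks_queued_idx
        ON tasks (created_at)
        WHERE status = 'queued';
    """

    conn = psycopg2.connect("dbname=app")   # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute(DDL)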
Using Postgres too long is probably less harmful than adding unnecessary complexity too early
Both are pretty bad.
Having SEPARATE DLQ and event/message broker systems is not (IMO) valid, because it introduces a new point of failure into the architecture.
Many developers overcomplicate systems. In the pursuit of 100% uptime, if you're not extremely careful, you remove more 9s with complexity than you add with redundancy. And although hyperscalers pride themselves on their uptime (Amazon even achieved three nines last year!), in reality most customers of most businesses are fine if your system is down for ten minutes a month. It's not ideal and you should probably fix that, but it's not catastrophic either.
The centralization push creates a situation where, if I have a task that needs three tools to accomplish and one of them goes down, they're all down. So all I can do is go for coffee or an early lunch, because I can't sub another task into this time slot. They're all blocked by The System being down, instead of a system being down.
If CI is borked I can work on docs and catch up on emails. If the network is down or NAS is down and everything is on that NAS, then things are dire.
Only if DC gets nuked.
Many developers overcomplicate systems and throw a database at the problem.
Challenge: Design a fault tolerant event-driven architecture. Only rule, you aren’t allowed to use a database. At all. This is actually an interview question for a top employer. Answer this right and you get a salary that will change your life.
DC!=Washington, DC
It's not usable for high-scale processing, but most applications just need a simple queue with low depth and low complexity. If you're already managing PSQL and don't want to add more management to your stack (and managed services aren't an option), this pattern works just fine. Go back 10-15 years and it was more common, especially in Ruby shops, as teams willing to adopt Kafka/Cassandra/etc. were rarer.
How would you do this instead, and why?
Postgres has a query interface, replication, backup and many other great utilities. And it’s well supported, so it will work for low-demand applications.
Regardless, you're using the wrong data structure with the wrong performance profile, and at the margins you will spend a lot more money and time than necessary running it. And service will suffer.
I work on apps that use such a PG-based queue system, and it provides indispensable features for us that we couldn't achieve easily/cleanly with a normal queue system, such as being able to dynamically adjust the priority/order of tasks being processed and easily query/report on the contents of the queue. We have many other interesting features built into it that are more specific to our needs, which I'm more hesitant to describe in detail here.