DBOS: Durable Workflow Orchestration with Go and PostgreSQL
(github.com) | 52 points by Bogdanp 4 days ago
Yes, in any durability framework there's still the possibility that a process crashes mid-step, in which case you have no choice but to restart the step.
Where DBOS really shines (vs. Temporal and other workflow systems) is a radically simpler operational model--it's just a library you can install in your app instead of a big heavyweight cluster you have to rearchitect your app to work with. This blog post goes into more detail: https://www.dbos.dev/blog/durable-execution-coding-compariso...
> Yes, in any durability framework there's still the possibility that a process crashes mid-step, in which case you have no choice but to restart the step.
Golem [1] is an interesting counterexample to this. They run your code in a WASM runtime and essentially checkpoint execution state at every interaction with the outside world.
But it seems they are having trouble selling into the workflow orchestration market. Perhaps due to the preconception above? Or are there other drawbacks with this model that I’m not aware of?
1. https://www.golem.cloud/post/durable-execution-is-not-just-f...
I think one potential concern with "checkpoint execution state at every interaction with the outside world" is the size of the checkpoints. Allowing users to control the granularity by explicitly specifying the scope of each step seems like a more flexible model. For example, you can group multiple external interactions into a single step and only checkpoint the final result, avoiding the overhead of saving intermediate data. If you want finer granularity, you can instead declare each external interaction as its own step.
Plus, if the crash happens in the outside world (where you have no control), then checkpointing at finer granularity won't help.
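The granularity tradeoff described above can be sketched with a toy checkpointing helper. The `runStep` function and in-memory map here are illustrative stand-ins, not the actual DBOS API (which checkpoints step outputs to Postgres):

```go
package main

import "fmt"

// checkpoints simulates a durable step-output store (Postgres in DBOS).
var checkpoints = map[string]string{}

// runStep is a hypothetical helper: if the step's output is already
// checkpointed, return it without re-executing; otherwise run the
// function and record its final result.
func runStep(name string, fn func() string) string {
	if out, ok := checkpoints[name]; ok {
		return out // step already completed; skip re-execution
	}
	out := fn()
	checkpoints[name] = out // checkpoint only the final result
	return out
}

func fetchA() string { return "a" }
func fetchB() string { return "b" }

func main() {
	// Coarse granularity: two external calls grouped into one step,
	// so only the combined result is checkpointed.
	combined := runStep("fetch-both", func() string {
		return fetchA() + fetchB()
	})
	fmt.Println(combined) // ab

	// Fine granularity: each external call is its own step, so each
	// intermediate result is checkpointed separately.
	a := runStep("fetch-a", fetchA)
	b := runStep("fetch-b", fetchB)
	fmt.Println(a, b) // a b
}
```

The point is that the step boundary, not the runtime, decides how much state gets persisted and when.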
Oh I see. It seems Nextflow is a strong contender in the serverless orchestrator market (serverless sounds better than embedded).
From what I can tell though, NF just runs a single workflow at a time, no queue or database. It relies on filesystem caching for "durability". That's changing recently with some optional add-ons.
> Exactly-Once Event Processing
This sounds...impossible? For any step in your workflow, either you 1) record it as completed when you start, but then you can crash halfway through, and when you restore, the step is marked done even though it never actually ran, or 2) record it as completed after you're done, but then you can crash between completing and recording, and on restore you run the step twice.
#2 sounds like the obvious right thing to do, and what I assume is happening, but it is not exactly once, and you'd still need to be careful that all of your steps are idempotent.
The specific claim is that workflows are started exactly-once in response to an event. This is possible because starting a workflow is a database transaction, so we can guarantee that exactly one workflow is started per (for example) Kafka message.
For step processing, what you say is true--steps are restarted if they crash mid-execution, so they should be idempotent.
"Exactly-Once Event Processing" is the headline claim - I actually missed the workflow starting bit. So what happens if the workflow fails? Does it get restarted (and so we have twice-started) or does the entire workflow just fail ? Which is probably better described as "at-most once event processing"
I think a clearer way to think about this is "at least once" message delivery plus idempotent workflow execution is effectively exactly-once event processing.
The DBOS workflow execution itself is idempotent (assuming each step is idempotent). When DBOS starts a workflow, the "start" (workflow inputs) is durably logged first. If the app crashes, on restart DBOS reloads state from Postgres and resumes from the last completed step. Steps are checkpointed so they don't re-run once recorded.
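The crash-and-resume behavior described here can be illustrated with a toy model, where in-memory maps stand in for the Postgres checkpoint store; `step` and `workflow` are illustrative, not the DBOS API:

```go
package main

import "fmt"

var completed = map[string]int{} // durable step outputs (Postgres in DBOS)
var execCount = map[string]int{} // how many times each step body actually ran

// step returns the checkpointed output if the step already completed,
// otherwise runs the body and records its result.
func step(name string, fn func() int) int {
	if v, ok := completed[name]; ok {
		return v
	}
	execCount[name]++
	v := fn()
	completed[name] = v
	return v
}

// workflow runs two steps; crashAfterFirst simulates a process crash
// between step 1 and step 2.
func workflow(crashAfterFirst bool) (int, bool) {
	x := step("step1", func() int { return 10 })
	if crashAfterFirst {
		return 0, false // "crash": step2 never runs, but step1 is checkpointed
	}
	y := step("step2", func() int { return x * 2 })
	return y, true
}

func main() {
	workflow(true)                  // first attempt crashes mid-workflow
	result, ok := workflow(false)   // recovery resumes from the checkpoint
	fmt.Println(result, ok)         // 20 true
	fmt.Println(execCount["step1"]) // 1: step1 ran exactly once across both attempts
}
```

Under this model, each step executes at least once, but its *recorded effect* on the workflow is exactly once, which is why the individual step bodies still need to be idempotent.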
We may be a small startup, but we're growing fast with no shortage of production users who love our tech: https://www.dbos.dev/customer-stories
Not really.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... - Not even 3 full pages' worth over the past 5 years, though the first page is entirely from this year. It's maybe 2-3 a month on average this year, and a lot are dupes.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... - Nim, for comparison, which doesn't really make a dent in the programming world but shows up a lot. The first 15 pages cover the same time period.
Thanks for posting! I am one of the authors, happy to answer any questions!
How does DBOS scale in a cluster? With Temporal or Dapr Workflows, applications register the workflow types or activities they support, and the workflow orchestration framework balances work across applications. How does this work in the library approach?
Also, how is DBOS handling workflow versioning?
Looking forward to your Java implementation. Thanks.
Good questions!
DBOS naturally scales to distributed environments, with many processes/servers per application and many applications running together. The key idea is to use database concurrency control to coordinate multiple processes. [1]
When a DBOS workflow starts, it’s tagged with the version of the application process that launched it. This way, you can safely change workflow code without breaking existing ones. They'll continue running on the older version. As a result, rolling updates become easy and safe. [2]
[1] https://docs.dbos.dev/architecture#using-dbos-in-a-distribut...
[2] https://docs.dbos.dev/architecture#application-and-workflow-...
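A rough sketch of the version-pinned recovery described in [2]; the types and function here are illustrative of the idea, not DBOS's internals:

```go
package main

import "fmt"

// workflowRecord models a pending workflow tagged with the code
// version of the process that launched it.
type workflowRecord struct {
	id      string
	version string
}

var pending = []workflowRecord{
	{"wf-1", "v1"},
	{"wf-2", "v2"},
}

// recoverFor sketches version-pinned recovery: a process only resumes
// workflows tagged with its own code version, so v1 workflows keep
// running on v1 processes during a rolling update.
func recoverFor(processVersion string) []string {
	var ids []string
	for _, wf := range pending {
		if wf.version == processVersion {
			ids = append(ids, wf.id)
		}
	}
	return ids
}

func main() {
	fmt.Println(recoverFor("v1")) // [wf-1]
	fmt.Println(recoverFor("v2")) // [wf-2]
}
```

During a rolling update, old and new processes can run side by side; each recovers only the workflows its code version can correctly execute.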
There's a clear text password in one of your GitHub Action workflows: https://github.com/dbos-inc/dbos-transact-golang/blob/main/....
That password is only used by the GHA to start a local Postgres Docker container (https://github.com/dbos-inc/dbos-transact-golang/blob/main/c...), which is not accessible from outside.
Great project! Love the library+db approach. Some questions:
1. How much work is it to add bindings for new languages?
2. I know you provide Conductor as a service. What are my options for workflow recovery if I don't have outbound network access?
3. Considering this came out of https://dbos-project.github.io/, do you guys have plans beyond durable workflows?
1. We also have support for Python and TypeScript, with Java coming soon: https://github.com/dbos-inc
2. There are built-in APIs for managing workflow recovery, documented here: https://docs.dbos.dev/production/self-hosting/workflow-recov...
3. We'll see! :)
Elixir? Or does Oban hew close enough, that it’s not worth it?
Yeah, queue priority is natively supported: https://docs.dbos.dev/golang/tutorials/queue-tutorial#priori...
The durability guarantees are similar--each workflow step is checkpointed, so if a workflow fails, it can recover from the last completed step.
The big difference, like that blog post (https://www.dbos.dev/blog/durable-execution-coding-compariso...) describes, is the operational model. DBOS is a library you can install into your app, whereas Temporal et al. require you to rearchitect your app to run on their workers and external orchestrator.
There are DBOS libraries in multiple languages--Python, TS, and Go so far, with Java coming soon: https://github.com/dbos-inc
No Rust yet, but we'll see!
Sounds exactly like how Temporal markets itself. I find that the burden of creating idempotent sub-steps in the workflow falls on the developer, regardless of checkpoints and state management at the workflow level.