jiggunjer 3 hours ago

Sounds exactly like how Temporal markets itself. I find that the burden of creating idempotent sub-steps in the workflow falls on the developer, regardless of checkpoints and state management at the workflow level.

  • KraftyOne 3 hours ago

    Yes, in any durability framework there's still the possibility that a process crashes mid-step, in which case you have no choice but to restart the step.

    Where DBOS really shines (vs. Temporal and other workflow systems) is a radically simpler operational model--it's just a library you can install in your app instead of a big heavyweight cluster you have to rearchitect your app to work with. This blog post goes into more detail: https://www.dbos.dev/blog/durable-execution-coding-compariso...

    • bjornsing 26 minutes ago

      > Yes, in any durability framework there's still the possibility that a process crashes mid-step, in which case you have no choice but to restart the step.

      Golem [1] is an interesting counterexample to this. They run your code in a WASM runtime and essentially checkpoint execution state at every interaction with the outside world.

      But it seems they are having trouble selling into the workflow orchestration market. Perhaps due to the preconception above? Or are there other drawbacks with this model that I’m not aware of?

      1. https://www.golem.cloud/post/durable-execution-is-not-just-f...

      • qianli_cs 12 minutes ago

        I think one potential concern with "checkpoint execution state at every interaction with the outside world" is the size of the checkpoints. Allowing users to control the granularity by explicitly specifying the scope of each step seems like a more flexible model. For example, you can group multiple external interactions into a single step and only checkpoint the final result, avoiding the overhead of saving intermediate data. If you want finer granularity, you can instead declare each external interaction as its own step.

        Plus, if the crash happens in the outside world (where you have no control), then checkpointing at finer granularity won't help.
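
        A minimal sketch of the granularity trade-off described above. `CheckpointStore`, `fetch_rows`, and `summarize` are hypothetical stand-ins invented for illustration; this is not the DBOS or Golem API, only a toy that checkpoints one serialized result per declared step.

```python
import json

class CheckpointStore:
    """Hypothetical in-memory stand-in for a durable checkpoint table."""

    def __init__(self):
        self.records = {}

    def run_step(self, name, fn):
        # Return the saved result if this step already completed.
        if name in self.records:
            return json.loads(self.records[name])
        result = fn()
        self.records[name] = json.dumps(result)  # one checkpoint per step
        return result

    def bytes_stored(self):
        return sum(len(v) for v in self.records.values())

def fetch_rows():
    # Stand-in for a large read from an external system.
    return list(range(1000))

def summarize(rows):
    # Stand-in for a second external call that returns a small result.
    return {"count": len(rows), "total": sum(rows)}

# Fine-grained: each interaction is its own step, so the large
# intermediate list is checkpointed too.
fine = CheckpointStore()
rows = fine.run_step("fetch", fetch_rows)
fine_result = fine.run_step("summarize", lambda: summarize(rows))

# Coarse-grained: both interactions are grouped into one step, so only
# the small final result is checkpointed.
coarse = CheckpointStore()
coarse_result = coarse.run_step(
    "fetch_and_summarize", lambda: summarize(fetch_rows())
)
```

        Both variants produce the same result; the coarse one stores far less, at the cost of redoing both calls if the process dies partway through the step.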

    • jiggunjer 2 hours ago

      Oh I see. Seems Nextflow is a strong contender in the serverless orchestrator market (serverless sounds better than embedded).

      From what I can tell though, NF just runs a single workflow at a time, with no queue or database. It relies on filesystem caching for "durability". That has been changing recently with some optional add-ons.

chc4 3 hours ago

> Exactly-Once Event Processing

This sounds...impossible? If you have some step in your workflow, you either 1) record it as completed when you start, but then you can crash halfway through, and when you restore the workflow the step never gets processed; or 2) record it as completed after you're done, but then you can crash between completing and recording, and when you restore you run the step twice.

#2 sounds like the obvious right thing to do, and is what I assume is happening, but it is not exactly-once, and you'd still need to be careful that all of your steps are idempotent.
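
A toy simulation of option 2, assuming a crash lands between the side effect and the completion record. All names here (`completed`, `charge`, `run_step`) are made up for illustration and do not come from any real framework. The append-only list shows the duplicate effect; the keyed dict shows how an idempotent write absorbs it.

```python
class Crash(Exception):
    """Simulated process crash."""

completed = set()      # hypothetical durable record of finished steps
charges = []           # non-idempotent effect: appends on every run
charges_by_key = {}    # idempotent effect: keyed by a request ID

def charge(step_id):
    charges.append(step_id)
    charges_by_key[step_id] = "charged"

def run_step(step_id, crash_before_record=False):
    if step_id in completed:
        return  # already recorded, skip
    charge(step_id)
    if crash_before_record:
        # The effect happened, but completion was never recorded.
        raise Crash("died after the side effect, before recording")
    completed.add(step_id)

try:
    run_step("order-42", crash_before_record=True)
except Crash:
    pass

run_step("order-42")  # recovery re-runs the step
```

After recovery, `charges` holds two entries for the same order while `charges_by_key` holds one, which is why the step itself has to be idempotent.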

  • KraftyOne 3 hours ago

    The specific claim is that workflows are started exactly-once in response to an event. This is possible because starting a workflow is a database transaction, so we can guarantee that exactly one workflow is started per (for example) Kafka message.

    For step processing, what you say is true--steps are restarted if they crash mid-execution, so they should be idempotent.
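
    A sketch of how a transactional insert can deduplicate workflow starts, using SQLite from the Python standard library. The `workflows` table, its columns, and the `orders-0-17` message ID (topic, partition, offset) are assumptions for illustration; this is not the actual DBOS schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflows ("
    "message_id TEXT PRIMARY KEY, input TEXT, status TEXT)"
)

def start_workflow(message_id, payload):
    """Return True if this call started the workflow, False if it already existed."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO workflows VALUES (?, ?, 'PENDING')",
            (message_id, payload),
        )
        # The primary key makes a redelivered message a no-op.
        return cur.rowcount == 1

first = start_workflow("orders-0-17", '{"item": "book"}')
second = start_workflow("orders-0-17", '{"item": "book"}')  # redelivery
```

    The uniqueness constraint does the work here: however many times the message is delivered, at most one row (and so one workflow) exists for it.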

    • reillyse an hour ago

      "Exactly-Once Event Processing" is the headline claim; I actually missed the workflow-starting bit. So what happens if the workflow fails? Does it get restarted (so we have twice-started), or does the entire workflow just fail? The latter is probably better described as "at-most-once event processing".

      • bjornsing 23 minutes ago

        "Exactly-Once Event Processing" is possible if (all!) the processing results go into a transactional database along with the stream position marker in a single transaction. That’s probably the mechanism they are relying on.
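
        A sketch of that mechanism using SQLite from the Python standard library: the result row and the stream position marker are written in the same transaction, so either both are visible or neither is. Table names, the `process` function, and the doubling "processing" are hypothetical, and whether DBOS works this way is not confirmed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (msg_offset INTEGER PRIMARY KEY, value INTEGER)")
conn.execute(
    "CREATE TABLE stream_position ("
    "id INTEGER PRIMARY KEY CHECK (id = 1), next_offset INTEGER)"
)
conn.execute("INSERT INTO stream_position VALUES (1, 0)")
conn.commit()

def process(offset, payload, fail=False):
    """Process one message; return False if it was already processed."""
    with conn:  # one transaction: commit on success, roll back on exception
        (next_offset,) = conn.execute(
            "SELECT next_offset FROM stream_position WHERE id = 1"
        ).fetchone()
        if offset < next_offset:
            return False  # redelivery of an already-committed message
        conn.execute("INSERT INTO results VALUES (?, ?)", (offset, payload * 2))
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "UPDATE stream_position SET next_offset = ? WHERE id = 1",
            (offset + 1,),
        )
    return True

try:
    process(0, 21, fail=True)   # crash: the result row is rolled back
except RuntimeError:
    pass
retried = process(0, 21)        # redelivery after the crash succeeds
duplicate = process(0, 21)      # later duplicate delivery is skipped
```

        This only holds if every effect of processing lives in that database; anything written elsewhere falls outside the transaction.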

      • qianli_cs an hour ago

        I think a clearer way to think about this is that "at-least-once" message delivery plus idempotent workflow execution is effectively exactly-once event processing.

        The DBOS workflow execution itself is idempotent (assuming each step is idempotent). When DBOS starts a workflow, the "start" (workflow inputs) is durably logged first. If the app crashes, then on restart DBOS reloads from Postgres and resumes from the last completed step. Steps are checkpointed so they don't re-run once recorded.
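
        A toy model of resume-from-last-completed-step, with a dict standing in for the Postgres checkpoint table. The `step` and `workflow` helpers and the `wf-1` ID are invented for illustration and are not the DBOS API.

```python
checkpoints = {}   # hypothetical stand-in for the durable checkpoint table
calls = []         # records which step bodies actually executed

def step(workflow_id, name, fn):
    key = (workflow_id, name)
    if key in checkpoints:
        return checkpoints[key]   # recorded steps are not re-run
    result = fn()
    checkpoints[key] = result     # checkpoint only after the step finishes
    return result

def workflow(workflow_id, x, crash_in_second_step=False):
    def double():
        calls.append("double")
        return x * 2

    def add_one(v):
        calls.append("add_one")
        if crash_in_second_step:
            raise RuntimeError("simulated crash mid-step")
        return v + 1

    a = step(workflow_id, "double", double)
    return step(workflow_id, "add_one", lambda: add_one(a))

try:
    workflow("wf-1", 20, crash_in_second_step=True)
except RuntimeError:
    pass

# Recovery: re-run the workflow with the same ID and logged input.
result = workflow("wf-1", 20)
```

        On recovery the first step is served from its checkpoint, while the step that crashed mid-execution runs again from the start, so it is the one that needs to be idempotent.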

odie5533 3 hours ago

For a project with minimal users, we get a lot of DBOS posts.

hmaxdml 4 hours ago

Thanks for posting! I am one of the authors, happy to answer any questions!