Comment by mike_hearn
Comment by mike_hearn 2 days ago
tl;dr same reason other services go offline at night: concurrency is hard and many computations aren't thread safe, so need to run serially against stable snapshots of the data. If you don't have a database that can provide that efficiently you have no choice but to stop the flow of inbound transactions entirely.
Sounds like Dafydd did the right thing in pushing them to deliver some value now and not try to rebuild everything right away. A common mistake I've seen some people make is assuming that overnight batch jobs that have to shut down the service are some side effect of using mainframes, and any new system that uses newer tech won't have that problem.
In reality getting rid of those kinds of batch jobs is often a hard engineering project that requires a redesign of the algorithms or changes to business processes. A classic example is in banking where the ordering of these jobs can change real world outcomes (e.g. are interest payments made first and then cheques processed, or vice-versa?).
In other cases it's often easier for users to understand a system that shuts down overnight. If the rule is "things submitted by 9pm will be processed by the next day" then it's easy to explain. If the rule is "you can submit at any time and it might be processed by the next day", depending on whether or not it happens to intersect the snapshot taken at the start of that particular batch job, then that can be more frustrating than helpful.
Sometimes the jobs are batch just because of mainframe limitations and not for any other reason, those can be made incremental more easily if you can get off the mainframe platform to begin with. But that requires rewriting huge amounts of code, hence the popularity of emulators and code transpilers.
Getting rid of batch jobs shouldn't be a goal; batch processing is generally more efficient as things get amortized, caches get better hit ratios, etc.
What software engineers should understand is there's no reason a batch can't take 3 ms to process and run every 20 ms. "Batch" and "real-time" aren't antonyms. In a language/framework with promises and thread-safe queues it's easy to turn a real time API into a batch one, possibly giving an order of magnitude increase in throughput.