Comment by mike_hearn 2 days ago

tl;dr same reason other services go offline at night: concurrency is hard and many computations aren't thread safe, so they need to run serially against stable snapshots of the data. If you don't have a database that can provide such snapshots efficiently, you have no choice but to stop the flow of inbound transactions entirely.
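
To make "stable snapshot" concrete, here's a minimal sketch, assuming a JDBC connection and a made-up licences table (not how the DVLA actually does it): on an MVCC engine like PostgreSQL, REPEATABLE READ gives the whole batch one consistent view of the data while writers keep committing; without something like that, the only way to get an equally stable view is to stop the writers.

```java
// Minimal sketch: a long-running batch read over a stable snapshot.
// Connection string and table name are hypothetical.
import java.sql.*;

public class NightlyBatch {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/dvla", "batch", "secret")) {
            conn.setAutoCommit(false);
            // One consistent snapshot for the whole transaction; concurrent
            // writes by other sessions don't become visible mid-batch.
            conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);

            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT licence_id, points FROM licences")) {
                while (rs.next()) {
                    process(rs.getLong("licence_id"), rs.getInt("points"));
                }
            }
            conn.commit(); // every row above came from the same snapshot
        }
    }

    static void process(long id, int points) { /* the actual batch logic */ }
}
```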

Sounds like Dafydd did the right thing in pushing them to deliver some value now rather than trying to rebuild everything right away. A common mistake I've seen some people make is assuming that overnight batch jobs that have to shut down the service are some side effect of using mainframes, and that any new system using newer tech won't have that problem.

In reality getting rid of those kinds of batch jobs is often a hard engineering project that requires a redesign of the algorithms or changes to business processes. A classic example is in banking where the ordering of these jobs can change real world outcomes (e.g. are interest payments made first and then cheques processed, or vice-versa?).

In other cases it's often easier for users to understand a system that shuts down overnight. If the rule is "things submitted by 9pm will be processed by the next day" then it's easy to explain. If the rule is "you can submit at any time and it might be processed by the next day", depending on whether or not it happens to intersect the snapshot taken at the start of that particular batch job, then that can be more frustrating than helpful.

Sometimes the jobs are batch purely because of mainframe limitations and not for any other reason; those can be made incremental more easily, if you can get off the mainframe platform to begin with. But that requires rewriting huge amounts of code, hence the popularity of emulators and code transpilers.

ndriscoll 2 days ago

Getting rid of batch jobs shouldn't be a goal; batch processing is generally more efficient as things get amortized, caches get better hit ratios, etc.

What software engineers should understand is that there's no reason a batch can't take 3 ms to process and run every 20 ms. "Batch" and "real-time" aren't antonyms. In a language/framework with promises and thread-safe queues it's easy to turn a real-time API into a batched one, possibly giving an order of magnitude increase in throughput.
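
As a minimal sketch of what that can look like, assuming Java with CompletableFuture and a thread-safe queue (all names here are made up): callers still submit one item and get a promise back, but a worker drains the queue every 20 ms and hands everything off as one batch.

```java
// Micro-batching sketch: a "real-time" looking API backed by small,
// frequent batches. Per-item overhead is amortised across each batch.
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

class MicroBatcher<T, R> {
    private record Item<A, B>(A request, CompletableFuture<B> result) {}

    private final ConcurrentLinkedQueue<Item<T, R>> queue = new ConcurrentLinkedQueue<>();
    private final Function<List<T>, List<R>> batchFn; // must preserve input order

    MicroBatcher(Function<List<T>, List<R>> batchFn,
                 ScheduledExecutorService scheduler, long periodMillis) {
        this.batchFn = batchFn;
        scheduler.scheduleAtFixedRate(this::drain, periodMillis, periodMillis,
                                      TimeUnit.MILLISECONDS);
    }

    /** Looks like a real-time API: submit one request, get a promise back. */
    CompletableFuture<R> submit(T request) {
        Item<T, R> item = new Item<>(request, new CompletableFuture<>());
        queue.add(item);
        return item.result();
    }

    /** Runs every period: everything queued since the last tick is one batch. */
    private void drain() {
        List<Item<T, R>> batch = new ArrayList<>();
        for (Item<T, R> i; (i = queue.poll()) != null; ) batch.add(i);
        if (batch.isEmpty()) return;
        try {
            List<R> results = batchFn.apply(batch.stream().map(it -> it.request()).toList());
            for (int i = 0; i < batch.size(); i++) batch.get(i).result().complete(results.get(i));
        } catch (Exception e) {
            batch.forEach(it -> it.result().completeExceptionally(e));
        }
    }
}
```

A caller just does batcher.submit(request).thenAccept(...) and never knows the work was batched; the batch function sees a whole list at once, which is where the amortisation comes from.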

  • mike_hearn 2 days ago

    Batch size is usually fixed by the business problem in these scenarios; I doubt you can process them in 3 ms if the job requires reading in every driving license in the country and doing some work on them, for instance.

    • ndriscoll 2 days ago

      This particular thing might be difficult to change because it's 50 year old COBOL or whatever, but my point was more that I've encountered pushes from architects to "eliminate batches" and it makes no sense. It just means that now I have to re-batch things in my code. The correct way to think about it is that you want smaller, more frequent batches.

      Do they really need to do work on all records every night? Probably not. Most people aren't changing their license or vehicle info most days. So the problem is that somewhere they're (conceptually) doing a table scan instead of using an index. That might still be hard to fix, but at least identify the correct problem. Otherwise as you say moving to different tech won't fix it.
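
      A rough sketch of the "index instead of table scan" shape, with entirely hypothetical table and column names - the real fix obviously depends on the actual schema:

      ```java
      // Only reprocess records that changed since the last run.
      import java.sql.*;
      import java.time.Instant;

      class IncrementalRun {
          void run(Connection conn, Instant lastRun) throws SQLException {
              // Assumes an index on last_updated, so this reads the delta
              // rather than every licence in the country.
              String sql = "SELECT licence_id FROM licences WHERE last_updated > ?";
              try (PreparedStatement ps = conn.prepareStatement(sql)) {
                  ps.setTimestamp(1, Timestamp.from(lastRun));
                  try (ResultSet rs = ps.executeQuery()) {
                      while (rs.next()) {
                          reprocess(rs.getLong("licence_id"));
                      }
                  }
              }
          }

          void reprocess(long licenceId) { /* the actual per-record work */ }
      }
      ```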

abigail95 2 days ago

Do you know why the downtime window hasn't been decreasing as the system has been deployed onto faster hardware over the years?

Nobody would care or notice if this thing had 99.5% availability and went read only for a few minutes per day.

  • roryirvine 20 hours ago

    Most likely because it's not just a single batch job, but a whole series which have been scheduled based on a rough estimate of how long the jobs around them will take.

    For example, imagine it's 1997 and you're creating a job which produces a summary report of the total number of cars registered, grouped by manufacturer and model.

    Licensed car dealers can submit updates to the list of available models by uploading an EDIFACT file using FTP or AS1. Those uploads are processed nightly by a job which runs at 0247. You check the logs for the past year, and find that this usually takes less than 5 minutes to run, but has on two occasions taken closer to 20 minutes.

    Since you want to have the updated list of models available before you run your summary job, you therefore schedule it to run at 0312 - leaving a gap of 25 minutes just in case. You document your reasoning as a comment in the production control file used to schedule this sequence of jobs.

    Ten years later, and manufacturers can now upload using SFTP or AS2, and you start thinking about ditching EDIFACT altogether and providing a SOAP interface instead. In another ten years you switch off the FTP facility, but still accept EDIFACT uploads via AS2 as a courtesy to the one dealership that still does that.

    Another eight years have passed. The job which ingests the updated model data is now a no-op and reliably runs in less than a millisecond every night. But your summary report is still scheduled for 0312.

    And there might well be tens of thousands of jobs, each with hundreds of dependencies. Altering that schedule is going to be a major piece of work in itself.
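
    As a toy illustration (hypothetical job names): a dependency-driven trigger lets the gap shrink automatically as the upstream job speeds up, whereas the fixed-time schedule freezes 1997's estimate forever. The per-job change is trivial - the hard part is that for tens of thousands of jobs the dependency graph often exists only as those fixed times and comments in the production control files.

    ```java
    // Contrast with "ingest at 0247, report at 0312": here the report runs
    // whenever the ingest actually finishes, however long that takes.
    import java.util.concurrent.CompletableFuture;

    class ScheduleSketch {
        static void ingestModelUpdates() { /* now effectively a no-op */ }
        static void runSummaryReport()   { /* still useful */ }

        public static void main(String[] args) {
            CompletableFuture
                .runAsync(ScheduleSketch::ingestModelUpdates)
                .thenRun(ScheduleSketch::runSummaryReport)
                .join();
        }
    }
    ```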

  • pjc50 2 days ago

    Maybe it isn't running on faster hardware? These systems are often horrifyingly outdated.

    • pwg a day ago

      Or maybe it is running on faster hardware, but the UK budget office decided not to pay IBM's fees required to make use of the extra speed, so it has been "throttled" to run at the same speed that it ran on the old hardware.

  • kalleboo a day ago

    Why would they spend the money to deploy it on faster hardware when the new cloud-based system rewrite is just around the corner? It's just 3 months away, this time, for sure...

  • mike_hearn 2 days ago

    It doesn't get deployed onto faster hardware. Mainframes haven't really got faster.

    • ndriscoll 2 days ago

      Mainframes have absolutely gotten faster. They're basically small supercomputers.

    • throw16180339 2 days ago

      You're mistaken about this. IBM's z-series had 5GHz CPUs well over a decade ago and they haven't gotten any slower.

    • abigail95 2 days ago

      It must be. Maintaining the original hardware would be more expensive than upgrading to compatible but faster systems.

      • mike_hearn 2 days ago

        What compatible systems? Mainframes are maintained in more or less their original state by teams from IBM. They are designed to be single machines that scale vertically and never shut down; every component can be hot-swapped, including CPUs, but IBM charge a lot for CPU capacity if I recall correctly. Given that nighttime doesn't get shorter, the DVLA probably don't see much reason to pay a lot more for a slightly smaller window.

        And mainframes from the 80s are slow. It sounds like they're running on the original.

        • ndriscoll 2 days ago

          Newer mainframes are still faster than older mainframes, and can have hundreds of cores and 10s of TB of RAM. A big part of IBM's draw is that they make modern systems that will continue to run your software forever with no modifications. I had an older guy there tell me a story about them changing a default in some ISPF panel, and customers complained enough that they had to change it back. Their storage systems have a virtualization layer for old programs that send commands to move the heads of a drive that hasn't been manufactured for 55 years or whatever; it translates those into operations on storage backed by a modern RAID with normal disks. The engineers in the mainframe groups know who their customer base is and what they want.

          It's unlikely that they're literally using 40 year old hardware since the replacement parts for that would be a nightmare to find and almost certainly more expensive than a compatible new machine.

mschuster91 a day ago

> In reality getting rid of those kinds of batch jobs is often a hard engineering project that requires a redesign of the algorithms or changes to business processes.

That right there is the meat of so many issues in large IT projects: large corporations and governments are very, very skeptical about changing their "established processes", usually because "we have always done it this way". And no matter how often you try to explain to them "do it a tiny bit differently and you get the same end result but MUCH more efficiently", you'll always run into walls.