Comment by pankalog 6 hours ago

32 replies

I recently worked at a big home lighting company, working on the OS of the router device that communicates with the light bulbs themselves and the internet/user.

Our OTAU architecture used A/B system updates [1]. The core idea is that both the kernel and the (read-only) rootfs partitions had two bootslots in storage, and the OTAU would only write to the bootslot that was not in use. Hence, if something went wrong, the system would automatically fall back to the previous version by just switching the bootslot used. Over the many years that architecture was in use, I couldn't find a single post-mortem for an incident that resulted in devices being bricked. Something to note is that the rootfs partition was overlaid with a writable partition for persisting state data etc.
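A minimal sketch of the A/B idea described above, as a toy in-memory model rather than the actual device code. The `Slot` and `Device` names, the `boots` flag, and the version strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    version: str
    bootable: bool = True


@dataclass
class Device:
    # Hypothetical model: two bootslots, "a" and "b"; one of them is active.
    slots: dict = field(
        default_factory=lambda: {"a": Slot("1.0"), "b": Slot("1.0")}
    )
    active: str = "a"

    def inactive(self) -> str:
        return "b" if self.active == "a" else "a"

    def apply_update(self, version: str, boots: bool = True) -> None:
        # The update is only ever written to the slot that is not in use.
        target = self.inactive()
        self.slots[target] = Slot(version, bootable=boots)
        # Ask the bootloader to try the freshly written slot next.
        self.active = target

    def boot(self) -> str:
        # If the new slot does not boot, switch back to the previous one.
        if not self.slots[self.active].bootable:
            self.active = self.inactive()
        return self.slots[self.active].version


device = Device()
device.apply_update("2.0", boots=True)
good = device.boot()       # new image boots and stays active
device.apply_update("3.0", boots=False)
recovered = device.boot()  # broken image: device falls back to the old slot
```

The point of the model is that the running slot is never written to, so a bad image costs a reboot rather than the device.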

Now that was a $two-figure USD device, not a $5/6-figure USD electric SUV. Is this a cost-cutting measure? At those price levels, doubling your NAND size is not even half of a percent of the total cost of the vehicle.

Unless there was a serious issue where the active bootslot corrupted the unused one, I don't see how this could have happened.

It's saddening that car manufacturers are so unserious about the code they're deploying.

[1] https://source.android.com/docs/core/ota/ab

AlotOfReading 6 hours ago

I've worked in both IoT lighting and automotive, so I'm comfortable comparing the two. This also isn't offered as a defense.

The big auto OEMs are just as sensitive to absolute BOM cost optimization, regardless of the percentage increases. I don't think this was a bootslot issue though, regardless of the word "bricked". Even as backwards and ill-advised as auto software can be, generally accepted practice is that updates are impossible while the vehicle is in motion. This is usually enforced by systems shared across multiple OEMs through the tier system.

The situation sounds more like a disastrously buggy new firmware.

I wouldn't put either past Stellantis though. The auto industry already scrapes the bottom of the proverbial barrel sometimes, and Stellantis isn't exactly known for their top of market compensation.

potatolicious 5 hours ago

This is generally how other devices work as well - for example all Android devices and Android-derivatives (which many of these cars are!) use a similar A/B partition to prevent bricking.

It definitely reduces the risk of updates, but it absolutely doesn't eliminate it.

The A/B updater itself is a surface area - especially if the logic is complex and there are other child devices that are updated at the same time (likely for cars). In that case you're not just coordinating between two independent partitions, you're coordinating between 2 * N partitions, half of which have dependencies on each other.

Also, the key bit of the mechanism is that upon successful boot the new partition is flagged as "good", which causes flags to be set to assert that the update was successful and the backup partition is no longer needed. That logic can (and does) fail - if your failure point occurs after this checkpoint you're hosed still because you're past the point of no return.

Making that worse is that in most systems you want the "it's all good" checkpoint to occur early - you don't want to, for example, wait multiple minutes for all user services to come up. But that also means that if a critical failure happens in said services, you're past the checkpoint.
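To make the checkpoint problem concrete, here is a rough sketch of a boot controller with a retry counter and a "mark good" flag. It is a toy model under assumed names (`BootControl`, `on_boot_attempt`, `mark_good`), not any real bootloader's API.

```python
class BootControl:
    """Toy model of the 'it's all good' checkpoint (hypothetical names)."""

    def __init__(self, max_tries: int = 3):
        self.active = "b"        # freshly updated slot
        self.fallback = "a"      # previous known-good slot
        self.tries_left = max_tries
        self.marked_good = False

    def on_boot_attempt(self) -> str:
        """Called on every boot; returns the slot that will be booted."""
        if self.marked_good:
            # Past the checkpoint: no automatic rollback any more.
            return self.active
        if self.tries_left == 0:
            # Never reached the checkpoint: roll back to the old slot.
            self.active, self.fallback = self.fallback, self.active
            self.marked_good = True
            return self.active
        self.tries_left -= 1
        return self.active

    def mark_good(self) -> None:
        """The checkpoint itself, typically set early in boot."""
        self.marked_good = True


# Failure before the checkpoint: the counter runs out and we roll back.
early = BootControl(max_tries=2)
early_boots = [early.on_boot_attempt() for _ in range(3)]

# Failure after the checkpoint: every reboot lands on the new slot again.
late = BootControl(max_tries=2)
late.on_boot_attempt()
late.mark_good()  # set before user services are up
late_boots = [late.on_boot_attempt() for _ in range(3)]
```

In the second case nothing in the model ever returns to slot "a", which is the "past the point of no return" situation: a service that crashes after `mark_good()` is invisible to the rollback logic.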

palmotea 6 hours ago

> Now that was a $two-figure USD device, not a $5/6-figure USD electric SUV. Is this a cost-cutting measure? At those price levels, doubling your NAND size is not even half of a percent of the total cost of the vehicle.

Could just be a competence and priorities problem. If it's cost cutting, it feels way more likely that some PM cut some story from a sprint to hit a deadline (and objections were either not raised or ignored), than they did some engineering analysis and explicitly decided to save $3 per vehicle by cutting the NAND size.

Edit: Actually, I don't think that technique would have helped, the problem wasn't a botched update, but a seriously buggy one. From the OP:

> The buggy update doesn't appear to brick the car immediately. Instead, the failure appears to occur while driving—a far more serious problem.

  • general1465 6 hours ago

    > Edit: Actually, I don't think that technique would have helped, the problem wasn't a botched update, but a seriously buggy one. From the OP:

    That, combined with the general refusal of new automotive bootloaders to downgrade. You can only go up in versioning. So even if you had a working version on the second partition, it would never get loaded because it has a lower version than the one you are currently running.

shadowpho 6 hours ago

Two points to add:

1) Total cost of the vehicle does not matter. What does matter is the operating margin. Half a percent of the total cost of the vehicle will move them from 2% margin to 1.5% margin. (Ford has operating margin of 2% as an example)

In other words, a 0.5% increase in total vehicle cost will reduce their profits by 25%.

That’s a huge number now! Note also that car manufacturers are in a bad spot because their volumes are fairly low (smartphone = 1M/yr, car = 40k/yr) and have harsher requirements for chips, driving the cost way up.

2) A/B updates are great, but they can still fail or get soft-locked. The code that decides when to mark a slot as good or bad is especially important.
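The arithmetic in point 1 can be checked with a few lines. This is a back-of-the-envelope sketch that assumes the sale price stays fixed and that the extra cost equals 0.5% of that price; the function names are made up for illustration.

```python
def margin_after_extra_cost(margin: float, extra_cost_share: float) -> float:
    """Operating margin after adding a cost worth `extra_cost_share` of the price."""
    return margin - extra_cost_share


def relative_profit_drop(margin: float, extra_cost_share: float) -> float:
    """Fraction of the original operating profit that the extra cost eats."""
    return extra_cost_share / margin


# Figures from the comment: 2% operating margin, 0.5% added cost.
new_margin = margin_after_extra_cost(0.02, 0.005)
drop = relative_profit_drop(0.02, 0.005)
```

Under these assumptions the margin goes from 2% to 1.5%, a quarter of the profit. If part of the increase can be passed on in the price, the drop is smaller.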

  • maxerickson 20 minutes ago

    You are conflating gross and operating margin.

    It's also more dynamic than your presentation. They have a little bit of pricing power, so a small increase doesn't all come out of the margin.

avidiax 6 hours ago

I have heard anecdotally that auto manufacturers are sensitive to a price change less than $5/vehicle. This is better than some industries that are sensitive to $1.

What could easily have happened is that the negotiators didn't include A/B updates in their spec, or they only specced A/B updates at 1GB OTA size.

They do their usual hammering on price, and the head unit or ECU manufacturer gave them some savings by cutting storage space to the bone.

Maybe it was still enough for A/B updates, until the usual software bloat took the updates past the critical limit.

They could still do a safe update by doing an A/B/A update (where B is a shrunken, update-only OS), but that requires development time, and the engineers should already be working on the next vehicle.

  • thunfischbrot 6 hours ago

    Worked for them. Corporations with many brands in their portfolio might discuss for weeks over price differences of components of 0.20 Euro. That's twenty Euro cents difference for e.g. a USB connector. If you expect that a vehicle platform sells in the 10s of millions over its lifetime, you're talking real money very quickly!

    • joezydeco 5 hours ago

      However, the price of recalls and warranty rework is never computed into that number.

      • dylan604 5 hours ago

        yet another example of the flawed logic where "we don't have time/money to do it right now, yet we always seem to find the time/money to redo it later after the shit hits the fan"

jcalvinowens 5 hours ago

> the system would automatically fallback to the previous version by just switching the bootslot used.

That's the hard part though.

It's shockingly common in my experience to have an A/B boot setup, but no actual logic in the userspace application to switch back to the other partition if something goes wrong. It's just a defense against somebody pulling the plug during the OTA, it doesn't protect against software bugs at all.
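As a sketch of the missing piece, the userspace side could look something like the following: run health checks after boot, then either confirm the slot or ask for a switch back. The callbacks `mark_good` and `switch_slot` are placeholders for whatever the platform provides; nothing here is a real API.

```python
from typing import Callable, Iterable


def confirm_or_rollback(
    checks: Iterable[Callable[[], bool]],
    mark_good: Callable[[], None],
    switch_slot: Callable[[], None],
) -> bool:
    """Confirm the new slot only if every health check passes.

    Returns True if the slot was confirmed, False if a rollback was requested.
    """
    for check in checks:
        try:
            healthy = check()
        except Exception:
            # A crashing check counts as a failed check.
            healthy = False
        if not healthy:
            switch_slot()
            return False
    mark_good()
    return True


# Hypothetical usage: record which callback was invoked.
events = []
ok = confirm_or_rollback(
    checks=[lambda: True, lambda: True],
    mark_good=lambda: events.append("good"),
    switch_slot=lambda: events.append("rollback"),
)
bad = confirm_or_rollback(
    checks=[lambda: True, lambda: False],
    mark_good=lambda: events.append("good"),
    switch_slot=lambda: events.append("rollback"),
)
```

Without something like this, the second slot only protects against an interrupted write, which matches the point above.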

apex_sloth 6 hours ago

We used to do that with devices that were in difficult-to-reach places with harsh uptime requirements! Think industrial routers and protocol converters. I think it pays for itself very quickly. Sending someone for such a device can get expensive.

CoastalCoder 6 hours ago

That's a good point.

I'm curious if failing to do that opens Jeep up to legitimate lawsuits.

jacquesm 6 hours ago

Well, on the positive side, at least they were stationary unlike these vehicles. Don't get me started on botched OTA updates, there are so many ways companies get those wrong it's not even funny.

kijin 6 hours ago

I once managed to brick a PC motherboard that advertised "dual BIOS". It didn't fall back to the previous version after a botched BIOS update.

It's totally possible that the update corrupted the other bootslot as well. If those blocks aren't off-limits to the updater program, it's just an off-by-one error waiting to happen. Slot 0 or slot 1?

Another possibility is that the updated version booted up just enough not to trigger the automatic fallback, and then got stuck in a loop.
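One way to picture the "slot 0 or slot 1?" risk is an updater that computes its target slot and refuses to touch the running one. This is only an illustrative sketch; `SLOTS`, `target_slot`, and `write_image` are invented names, and real storage is stood in for by a dictionary.

```python
SLOTS = ("a", "b")


def target_slot(active: str) -> str:
    """Return the slot that is not currently running."""
    if active not in SLOTS:
        raise ValueError(f"unknown slot: {active!r}")
    return SLOTS[(SLOTS.index(active) + 1) % len(SLOTS)]


def write_image(storage: dict, active: str, target: str, image: bytes) -> None:
    """Write an image, but never over the slot we are running from."""
    if target not in SLOTS:
        raise ValueError(f"unknown slot: {target!r}")
    if target == active:
        raise PermissionError("refusing to overwrite the running slot")
    storage[target] = image


storage = {"a": b"old", "b": b"old"}
write_image(storage, active="a", target=target_slot("a"), image=b"new")
```

A guard like this does not help with the second failure mode in the comment, where the new image boots far enough to avoid the fallback and then loops.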

ThatMedicIsASpy 6 hours ago

I've had a bunch of updates break some stuff but since moving to Fedora Atomics/ublue I've never had a system I could not get back into.

stefan_ 6 hours ago

Nothing was bricked at all. That's just how clickbait journalists describe things that stop working in some way after an update nowadays.

(Most computers in a car don't need duplicate partitioning because they can be bootstrapped from a central computer)

  • stevenhubertron 6 hours ago

    I’m sorry, but you’re incorrect. The vehicle completely shutting down while driving, not working again until you put it into park, and then shutting down again five minutes later is effectively bricked and extremely dangerous. My family and I almost died just trying to get home from dinner. It was a complete loss of propulsion and power steering.

    • recursive 6 hours ago

      There are many things that are dangerous that aren't "brick"-ings. If it can be later restored to function, then it is not bricked.

      • stevenhubertron 4 hours ago

        Being unable to drive my vehicle due to a software update is bricking. It's also a pun: we Jeep owners call our Jeeps flying bricks.

      • sekh60 5 hours ago

        Thank you. I really hate how watered down the term "bricked" has become.

        • dylan604 5 hours ago

          I prefer the term borked in these situations

    • mannykannot 4 hours ago

      Then it would better be described as a life-threatening event rather than a bricking - especially as, in the hierarchy of concerns, the former is more serious than the latter.

    • stefan_ 4 hours ago

      And then it was fixed with another OTA, so it was not bricked. Why bring up this pedantic point you may ask? Because the grandparent raises a scenario that doesn't apply here. A/B updates or not were not at all the issue here.

  • upboundspiral 6 hours ago

    I for one am always grateful when things are engineered thoughtfully and with redundancy as it is symbolic of respect for the people who are your customers. Especially in something as important as a car, "can be bootstrapped from a central computer" - when? how easily? how reliably? - is not good enough because things will inevitably go wrong for some portion of the user base.

  • zoeysmithe 4 hours ago

    Brick is now slang for a lot of fail conditions that aren't classically 'bricked.' This has become really common, I've noticed. Honestly, this ship has sailed and isn't even worth fighting anymore. It's like Xerox asking people to stop calling copies Xeroxes.

    We just never bothered to develop a new term. Maybe 'soft-bricked?' 'Semi-bricked?' I would like journalists at least to start using more accurate terms, but 'bricked' I imagine gets a lot more engagement and ad impressions, so here we are.

monero-xmr 6 hours ago

All those words you are saying: it's quite possible the sub-contractor to the sub-contractor to the sub-contractor in a foreign low-cost country that actually did the work has absolutely no idea what any of that means, and they are doing the bare minimum to deliver.

  • zoeysmithe 4 hours ago

    Why wouldn't a foreigner know what this means? This seems very xenophobic. And if US/Euro management is hiring these groups and not giving them requirements for redundancy then guess who is at fault? Not the contractor.