Tell HN: Azure outage

881 points by tartieret 5 days ago

Azure is down for us, we can't even access the azure portal. Are other experiencing this? Our services are located in Canada/Central and US-East 2

https://downdetector.ca/status/windows-azure/

https://azure.status.microsoft/en-gb/status

croemer 5 days ago

Preliminary post incident review: https://azure.status.microsoft/en-gb/status/history/

Timeline

15:45 UTC on 29 October 2025 – Customer impact began.

16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.

16:15 UTC on 29 October 2025 – We began the investigation and started to examine configuration changes within AFD.

16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.

16:20 UTC on 29 October 2025 – Targeted communications to impacted customers sent to Azure Service Health.

17:26 UTC on 29 October 2025 – Azure portal failed away from Azure Front Door.

17:30 UTC on 29 October 2025 – We blocked all new customer configuration changes to prevent further impact.

17:40 UTC on 29 October 2025 – We initiated the deployment of our ‘last known good’ configuration.

18:30 UTC on 29 October 2025 – We started to push the fixed configuration globally.

18:45 UTC on 29 October 2025 – Manual recovery of nodes commenced while gradual routing of traffic to healthy nodes began after the fixed configuration was pushed globally.

23:15 UTC on 29 October 2025 - PowerApps mitigation of dependency, and customers confirm mitigation.

00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers.

Reply View 56 replies

xnorswap 5 days ago

33 minutes from impact to status page for a complete outage is a joke.

Reply View | 42 replies
- neya 5 days ago
  
  In Microsoft's defense, Azure has always been a complete joke. It's extremely developer unfriendly, buggy and overpriced.
  
  Reply View | 25 replies
  
  michaelt 5 days ago
  
  If you call that defending microsoft, I'd hate to see what attacking them looks like :)
  
  Reply View | 7 replies
  
  sfn42 4 days ago
  
  I've only used Azure, to me it seems fine ish. Some things are rather overcomplicated and it's far from perfect but I assumed the other providers were similarly complicated and imperfect.
  Can't say I've experienced many bugs in there either. It definitely is overpriced but I assume they all are?
  
  Reply View | 6 replies
  
  sofixa 4 days ago
  
  > In Microsoft's defense, Azure has always been a complete joke. It's extremely developer unfriendly, buggy and overpriced.
  Don't forget extremely insecure. There is a quarterly critical cross-tenant CVE with trivial exploitation for them, and it has been like that for years.
  
  Reply View | 1 reply
  
  hinkley 4 days ago
  
  Given how much time I spent on my first real multi-tenant project, dealing with the consequences of architecture decisions meant to prevent these sorts of issues, I can see clearly the temptation to avoid dealing with them.
  But what we do when things are easy is not who we are. That's a fiction. It's how we show up when we are in the shit that matters. It's discipline that tells you to voluntarily go into all of the multi-tenant mitigations instead of waiting for your boss to notice and move the goalposts you should have moved on your own.
  
  Reply View | 0 replies
  
  madjam002 4 days ago
  
  My favourite was the Azure CTO complaining that Git was unintuitive, clunky and difficult to use
  
  Reply View | 6 replies
  
  rk06 3 days ago
  
  Hmm, isn't that the same argument we use in defense of windows and ms teams?
  
  Reply View | 0 replies
- campbel 4 days ago
  
  As a technologist, you should always avoid MS. Even if they have a best-in-class solution for some domain, they will use that to leverage you into their absolute worst-in-class ecosystem.
  
  Reply View | 1 reply
  
  hinkley 4 days ago
  
  I see Amazon using a subset of the same sorts of obfuscations that Microsoft was infamous for. They just chopped off the crusts so it's less obvious that it's the same shit sandwich.
  
  Reply View | 0 replies
- imglorp 4 days ago
  
  That's about how long it took to bubble up three levels of management and then go past the PR and legal teams for approvals.
  
  Reply View | 0 replies
- infaloda 5 days ago
  
  More importantly `15:45 UTC on 29 October 2025 – Customer impact began.
  16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered. ` A 19-minute delay in alert is a joke.
  
  Reply View | 5 replies
  
  hinkley 4 days ago
  
  10 minutes to alert, to avoid flapping false positives. 10 minute response window for first responders. Or, 5 minute window before failing over to backup alerts, and 4 minutes to wake up, have coffee, and open the appropriate windows.
  
  Reply View | 2 replies
  
  Xss3 4 days ago
  
  That does not say it took 19 minutes for alerts to appear. Following could mean any amount of time.
  
  Reply View | 1 reply
  
  hinkley 4 days ago
  
  It's 19 minutes until active engagement by staff. And planned rolling restarts can trigger alerts if you don't set thresholds of time instead of just thresholds of count.
  It would be nice though if alert systems made it easy to wire up CD to turn down sensitivity during observed actions. Sort of like how the immune system turns down a bit while you're eating.
  
  Reply View | 0 replies
- thayne 4 days ago
  
  Unfortunately,that is also typical. I've seen it take longer than that for AWS to update their status page.
  The reason is probably because changes to the status page require executive approval, because false positives could lead to bad publicity, and potentially having to reimburse customers for failing to meet SLAs.
  
  Reply View | 1 reply
  
  ape4 4 days ago
  
  Perhaps they could set the time to when it really started after executive approval.
  
  Reply View | 0 replies
- sbergot 5 days ago
  
  and for a while the status was "there might be issues on azure portal".
  
  Reply View | 1 reply
  
  ambentzen 4 days ago
  
  There might have been, but they didn't know because they couldn't access it. Could have been something totally unrelated.
  
  Reply View | 0 replies
- schainks 4 days ago
  
  AWS either is “on it” or you they will say something somewhere between 60-90 minutes after impact.
  We should be lucky MSFT is so consistent!
  Hug ops to the Azure team, since management is shredding up talent over there.
  
  Reply View | 0 replies
- HeavyStorm 4 days ago
  
  I've been on bridges where people _forgot_ to send comms for dozens of minutes. Too many inexperienced people around these days.
  
  Reply View | 0 replies
- skeezyjefferson 4 days ago
  
  [flagged]
  
  Reply View | 0 replies
onionisafruit 5 days ago

At 16:04 “Investigation commenced”. Then at 16:15 “We began the investigation”. Which is it?

Reply View | 7 replies
- ssss11 5 days ago
  
  Quick coffee run before we get stuck in mate
  
  Reply View | 3 replies
  
  ozim 5 days ago
  
  Load some carbs with chocolate chip cookies as well, that’s what I would do.
  You don’t want to debug stuff with low sugar.
  
  Reply View | 1 reply
  
  normie3000 4 days ago
  
  One crash after another
  
  Reply View | 0 replies
  
  red-iron-pine 4 days ago
  
  burn a smoko and take a leak
  
  Reply View | 0 replies
- taco_emoji 4 days ago
  
  16:04 Started running around screaming 16:15 Sat down & looked at logs
  
  Reply View | 0 replies
- not_a_bot_4sho 5 days ago
  
  I read it as the second investigation being specific to AFD. The first more general.
  
  Reply View | 1 reply
  
  onionisafruit 4 days ago
  
  I think you’re right. I missed that subtlety on first reading.
  
  Reply View | 0 replies
neop1x 3 days ago

>> We began the investigation and started to examine configuration changes within AFD.
Troubleshooting has completed
Troubleshooting was unable to automatically fix all of the issues found. You can find more details below.
>> We initiated the deployment of our ‘last known good’ configuration.
System Restore can help fix problems that might be making your computer run slowly or stop responding.
System Restore does not affect any of your documents, pictures, or other personal data. Recently installed programs and drivers might be uninstalled.
Confirm your restore point
Your computer will be restored to the state it was in before the event in the Description field below.

Reply View | 0 replies
oofbey 4 days ago

“Our protection mechanisms, to validate and block any erroneous deployments, failed due to a software defect which allowed the deployment to bypass safety validations.”
Very circular way of saying “the validator didn’t do its job”. This is AFAICT a pretty fundamental root cause of the issue.
It’s never good enough to have a validator check the content and hope that finds all the issues. Validators are great and can speed a lot of things up. But because they are independent code paths they will always miss something. For critical services you have to assume the validator will be wrong, and be prepared to contain the damage WHEN it is wrong.

Reply View | 0 replies
notorandit 4 days ago

What puzzles me too is the time it took to recognize an outage.
Looks like there was no monitoring and no alerts.
Which is kinda weird.

Reply View | 2 replies
- hinkley 4 days ago
  
  I've seen sensitivity get tuned down to avoid false positives during deployments or rolling restarts for host updates. And to a lesser extent for autoscaling noise. It can be hard to get right.
  I think it's perhaps a gap in the tools. We apply the same alert criteria at 2 am that we do while someone is actively running deployment or admin tasks and there's a subset that should stay the same, like request failure rate, and others that should be tuned down, like overall error rate and median response times.
  And it means one thing if the failure rate for one machine is 90% and something else if the cluster failure rate is 5%, but if you've only got 18 boxes it's hard to discern the difference. And which is the higher priority error may change from one project to another.
  
  Reply View | 0 replies
- deadbolt 4 days ago
  
  Just what you want in a cloud provider, right?
  
  Reply View | 0 replies

mystcb 5 days ago

Update 16:57 UTC:

Azure Portal Access Issues

Starting at approximately 16:00 UTC, we began experiencing Azure Front Door issues resulting in a loss of availability of some services. In addition. customers may experience issues accessing the Azure Portal. Customers can attempt to use programmatic methods (PowerShell, CLI, etc.) to access/utilize resources if they are unable to access the portal directly. We have failed the portal away from Azure Front Door (AFD) to attempt to mitigate the portal access issues and are continuing to assess the situation.

We are actively assessing failover options of internal services from our AFD infrastructure. Our investigation into the contributing factors and additional recovery workstreams continues. More information will be provided within 60 minutes or sooner.

This message was last updated at 16:57 UTC on 29 October 2025

---

Update: 16:35 UTC:

Azure Portal Access Issues

Starting at approximately 16:00 UTC, we began experiencing DNS issues resulting in availability degradation of some services. Customers may experience issues accessing the Azure Portal. We have taken action that is expected to address the portal access issues here shortly. We are actively investigating the underlying issue and additional mitigation actions. More information will be provided within 60 minutes or sooner.

This message was last updated at 16:35 UTC on 29 October 2025

---

Azure Portal Access Issues

We are investigating an issue with the Azure Portal where customers may be experiencing issues accessing the portal. More information will be provided shortly.

This message was last updated at 16:18 UTC on 29 October 2025

---

Message from the Azure Status Page: https://azure.status.microsoft/en-gb/status

Reply View 82 replies

planewave 5 days ago

Azure Network Availability Issues
Starting at approximately 16:00 UTC, we began experiencing Azure Front Door issues resulting in a loss of availability of some services. We suspect that an inadvertent configuration change as the trigger event for this issue. We are taking two concurrent actions where we are blocking all changes to the AFD services and at the same time rolling back to our last known good state.
We have failed the portal away from Azure Front Door (AFD) to mitigate the portal access issues. Customers should be able to access the Azure management portal directly.
We do not have an ETA for when the rollback will be completed, but we will update this communication within 30 minutes or when we have an update.
This message was last updated at 17:17 UTC on 29 October 2025

Reply View | 2 replies
- croemer 5 days ago
  
  "We have initiated the deployment of our 'last known good' configuration. This is expected to be fully deployed in about 30 minutes from which point customers will start to see initial signs of recovery. Once this is completed, the next stage is to start to recover nodes while we route traffic through these healthy nodes."
  "This message was last updated at 18:11 UTC on 29 October 2025"
  
  Reply View | 1 reply
  
  croemer 5 days ago
  
  At this stage, we anticipate full mitigation within the next four hours as we continue to recover nodes. This means we expect recovery to happen by 23:20 UTC on 29 October 2025. We will provide another update on our progress within two hours, or sooner if warranted.
  This message was last updated at 19:57 UTC on 29 October 2025
  
  Reply View | 0 replies
cyptus 5 days ago

AFD is down quite often regionally in Europe for our services. In 50%+ the cases they just don‘t report it anywhere, even if its for 2h+.

Reply View | 36 replies
- RajT88 5 days ago
  
  Spam those Azure tickets. If you have a CSAM, build them a nice powerpoint telling the story of all your AFD issues (that's what they are there for).
  > In 50%+ the cases they just don‘t report it anywhere, even if its for 2h+.
  I assume you mean publicly. Are you getting the service health alerts?
  
  Reply View | 32 replies
  
  tomashubelbauer 5 days ago
  
  CSAM apparently also means Customer Success Account Manager for those who might have gotten startled by this message like me.
  
  Reply View | 5 replies
  
  psunavy03 5 days ago
  
  Some really unfortunate acronyms flying around the Microsoft ecosystem . . .
  
  Reply View | 4 replies
  
  nijave 5 days ago
  
  Back when we used Azure the only outcome was them trying to upsell us on Premium Support
  
  Reply View | 1 reply
  
  RajT88 4 days ago
  
  Do you recall the kind of premium support? Azure Rapid Response?
  
  Reply View | 0 replies
  
  cyptus 5 days ago
  
  in many cases: no service health alerts, no status page updates and no confirmations from the support team in tickets. still we can confirm these issues from different customers accross europe. Mostly the issues are regional dependent.
  
  Reply View | 0 replies
  
  cyberax 5 days ago
  
  > CSAM
  Child Sex-Abuse Material?!? Well, a nice case of acronym collision.
  
  Reply View | 14 replies
  
  alias_neo 4 days ago
  
  Where do these alerts supposedly come from? I started having issues around 4PM (GMT), couldn't access portal, and couldn't make AKV requests from the CLI, and initially asked our Ops guys but with no info and a vague "There may be issues with Portal" on their status page, that was me done for the day.
  
  Reply View | 0 replies
  
  llama052 5 days ago
  
  I got a service health alert an hour after it started, saying the portal was having issues. Pretty useless and misleading.
  
  Reply View | 1 reply
  
  RajT88 5 days ago
  
  That should go into the presentation you provide your CSAM with as well.
  Storytelling is how issues get addressed. Help the CSAM tell the story to the higher ups.
  
  Reply View | 0 replies
- nevf1 5 days ago
  
  This is the single most frustrating thing about these incidents. As you're harmstrung on what you can do or how you can react until Microsoft officially acknowledges a problem. Took nearly 90mins both today and when it happened on 9th October.
  
  Reply View | 1 reply
  
  cyptus 5 days ago
  
  so true. instead of getting a fast feedback we are wasting time searching for our own issues first.
  
  Reply View | 0 replies
- hallh 5 days ago
  
  Same experience. We've recently migrated fully away from AFD due to how unreliable it is.
  
  Reply View | 0 replies
jjp 5 days ago

Whilst the status message acknowledge's the issue with Front Door (AFD), it seems as though the rest of the actions are about how to get Portal/internal services working without relying on AFD. For those of us using Front Door does that mean we're in for a long haul?

Reply View | 4 replies
- llama052 5 days ago
  
  Please migrate off of front door. It's been a failure mode since it came out historically. Anything else is better at this point
  
  Reply View | 2 replies
  
  everfrustrated 5 days ago
  
  Didn't the underlying vendor they used for Azure Front Door go bankrupt? It's probably on life support.
  
  Reply View | 1 reply
  
  guptadagger 5 days ago
  
  i understood that to be a different third party that provided a CDN and was different than afd. https://learn.microsoft.com/en-us/azure/frontdoor/migrate-cd...
  
  Reply View | 0 replies
- progmetaldev 5 days ago
  
  Currently even the Front Door landing page is only partially loading.
  https://azure.microsoft.com/en-us/products/frontdoor
  
  Reply View | 0 replies
8cvor6j844qw_d6 5 days ago

I'll be interested in the incident writeup since DNS is mentioned. It will be interesting in a way if it is similar to what happened at AWS.

Reply View | 11 replies
- Insanity 5 days ago
  
  It's pretty unlikely. AWS published a public 'RCA' https://aws.amazon.com/message/101925/. A race condition in a DNS 'record allocator' causing all DNS records for DDB to be wiped out.
  I'm simplifying a bit, but I don't think it's likely that Azure has a similar race condition wiping out DNS records on _one_ system than then propagates to all others. The similarity might just end at "it was DNS".
  
  Reply View | 8 replies
  
  parliament32 5 days ago
  
  That RCA was fun. A distributed system with members that don't know about each other, don't bother with leader elections, and basically all stomp all over each other updating the records. It "worked fine" until one of the members had slightly increased latency and everything cascade-failed down from there. I'm sure there was missing (internal) context but it did not sound like a well-architected system at all.
  
  Reply View | 2 replies
  
  kyrra 5 days ago
  
  https://isitdns.com/
  
  Reply View | 0 replies
  
  cdr420 5 days ago
  
  It's always DNS
  
  Reply View | 3 replies
- layer8 5 days ago
  
  DNS has both naming and cache invalidation, so no surprise it’s among the hardest things to get right. ;)
  
  Reply View | 1 reply
  
  dotancohen 5 days ago
  
  That's three of the hardest problems in CS ))
  
  Reply View | 0 replies
NDizzle 5 days ago

They briefly had a statement about using Traffic Manager to work with your AFD to work around this issue, with a link to learn.microsoft.com/...traffic-manager, and the link didn't work. Due to the same issue affecting everyone right now.
They quickly updated the message to REMOVE the link. Comical at this point.

Reply View | 2 replies
- Aperocky 5 days ago
  
  The statement is still there though on the status page though
  
  Reply View | 1 reply
  
  NDizzle 5 days ago
  
  They re-added it once the site was accessible.
  
  Reply View | 0 replies
jdc0589 5 days ago

yea its not just the portal. microsoft.com is down too

Reply View | 11 replies
- mystcb 5 days ago
  
  Yeah, I am guessing it's just a placeholder till they get more info. I thought I saw somewhere that internally within Microsoft it's seen as a "Sev 1" with "all hands on deck" - Annoyingly I can't remember where I saw it, so if someone spots it before I do, please credit that person :D
  Edit: Typo!
  
  Reply View | 2 replies
  
  verst 5 days ago
  
  It's a Sev 0 actually (as one would expect - this isn't a big secret). I was on the engineering bridge call earlier for a bit. The Azure service I work on was minimally impacted (our customer facing dashboard could not load, but APIs and data layer were not impacted) but we found a workaround.
  
  Reply View | 0 replies
  
  chad_c 5 days ago
  
  It was here https://news.ycombinator.com/item?id=45749054 but that comment has been deleted.
  
  Reply View | 0 replies
- PeterCorless 5 days ago
  
  Seems all Microsoft-related domains are impacted in some way.
  • https://www.xbox.com/en-US also doesn't fully paint. Header comes up, but not the rest of the page.
  • https://www.minecraft.net/en-us is extremely slow, but eventually came up.
  
  Reply View | 0 replies
- bossyTeacher 5 days ago
  
  It sure must be embarrassing for the website of the second richest company in the world to be down.
  
  Reply View | 0 replies
- daxfohl 5 days ago
  
  Downdetector says aws and gcp are down too. Might be in for a fun day.
  
  Reply View | 4 replies
  
  rozenmd 5 days ago
  
  From what I can tell, Downdetector just tracks traffic to their pages without actually checking if the site is down.
  The other day during the AWS outage they "reported" OVH down too.
  
  Reply View | 0 replies
  
  jdc0589 5 days ago
  
  yea I saw that, but im not sure on how accurate that is. a few large apps/companies I know to be 100% on AWS in us-east-1 are cranking along just fine.
  
  Reply View | 0 replies
  
  linhns 5 days ago
  
  Not sure if this is true. I just login to the console with no glitch.
  
  Reply View | 0 replies
  
  NetMageSCW 5 days ago
  
  AWS was performance issues and I believe is resolved.
  
  Reply View | 0 replies
- planewave 5 days ago
  
  yes, and it seems that at least for some login.microsoftonline.com is down too, which is part of the Entra login / SSO flow.
  
  Reply View | 0 replies
jonathanlydall 5 days ago

Yet another reason to move away from Front Door.
We already had to do it for large files served from Blob Storage since they would cap out at 2MB/s when not in cache of the nearest PoP. If you’ve ever experienced slow Windows Store or Xbox downloads it’s probably the same problem.
I had a support ticket open for months about this and in the end the agent said “this is to be expected and we don’t plan on doing anything about it”.
We’ve moved to Cloudflare and not only is the performance great, but it costs less.
Only thing I need to move off Front Door is a static website for our docs served from Blob Storage, this incident will make us do it sooner rather than later.

Reply View | 6 replies
- out_sider 5 days ago
  
  we are considering the same but because our website uses APEX domain we would need to move all DNS resolver to cloudfront right ? Does it have as a nice "rule set builder" as azure ?
  
  Reply View | 4 replies
  
  jonathanlydall 5 days ago
  
  Unless you pay for CloudFlare’s Enterpise plan, you’re required to have them host your DNS zone, you can use a different registrar as long as you just point your NS records to Cloudflare.
  Be aware that if you’re using Azure as your registrar, it’s (probably still) impossible to change your NS records to point to CloudFlare’s DNS server, at least it was for me about 6 months ago.
  This also makes it impossible to transfer your domain to them either, as CloudFlare’s domain transfer flow requires you set your NS records to point to them before their interface shows a transfer option.
  In our case we had to transfer to a different registrar, we used Namecheap.
  However, transferring a domain from Azure was also a nightmare. Their UI doesn’t have any kind of transfer option, I eventually found an obscure document (not on their Learn website) which had an az command which would let you get a transfer code which I could give to Namecheap.
  Then I had to wait over a week for the transfer timeout to occur because there is no way on Azure side that I could find to accept the transfer immediately.
  I found CloudFlare’s way of building rules quite easy to use, different from Front Door but I’m not doing anything more complex than some redirects and reverse proxying.
  I will say that Cloudflare’s UI is super fast, with Front Door I always found it painfully slow when trying to do any kind of configuration.
  Cloudflare also doesn’t have the problem that Front Door has where it requires a manual process every 6 months or so to renew the APEX certificate.
  
  Reply View | 3 replies
- nosefrog 5 days ago
  
  Front Door is not good.
  
  Reply View | 0 replies
eddie_catflap 5 days ago

We saw issues before 16:00 UTC - approx 15:38

Reply View | 0 replies
ThatManulTheCat 5 days ago

DNS. Ofc.

Reply View | 0 replies
rconti 5 days ago

Sounds like they need to move their portal to a region with more capacity for the desired instance type. /s

Reply View | 0 replies

Uehreka 5 days ago

I noticed that Starbucks mobile ordering was down and thought “welp, I guess I’ll order a bagel and coffee on Grubhub”, then GrubHub was down. My next stop was HN to find the common denominator, and y’all did not disappoint.

Reply View 34 replies

pants2 5 days ago

Good thing HN is hosted on a couple servers in a basement. Much more reliable than cloud, it seems!

Reply View | 10 replies
- dang 5 days ago
  
  Just don't use genetically identical hardware:
  https://news.ycombinator.com/item?id=32031639
  https://news.ycombinator.com/item?id=32032235
  Edit: wow, I can't believe we hadn't put https://news.ycombinator.com/item?id=32031243 in https://news.ycombinator.com/highlights. Fixed now.
  
  Reply View | 6 replies
  
  hinkley 5 days ago
  
  I’ve seen this up close twice and I’m surprised it’s only twice. Between March and September one year, 6 people on one team had to get new hard drives in their thinkpads and rebuild their systems. All from the same PO but doled out over the course of a project rampup. That was the first project where the onboarding docs were really really good, since we got a lot of practice in a short period of time.
  Long before that, the first raid array anyone set up for my (teams’) usage, arrived from Sun with 2 dead drives out of 10. They RMA’d us 2 more drives and one of those was also DOA. That was a couple years after Sun stopped burning in hardware for cost savings, which maybe wasn’t that much of a savings all things considered.
  
  Reply View | 0 replies
  
  gogusrl 5 days ago
  
  I got burnt by this bug on freakin' Christmas Eve 2020 ( https://forum.hddguru.com/viewtopic.php?f=10&t=40766 ). There was some data loss and a lot of lessons learned.
  
  Reply View | 0 replies
  
  praccu 5 days ago
  
  Many years ago (13?), I was around when Amazon moved SABLE from RAM to SSDs. A whole rack came from a single batch, and something like 128 disks went out at once.
  I was an intern but everyone seemed very stressed.
  
  Reply View | 0 replies
  
  airstrike 5 days ago
  
  I love that "Ask HN: What'd you do while HN was down?" was a thing
  
  Reply View | 1 reply
  
  Cthulhu_ 4 days ago
  
  My plan B was going to the Stack Exchange homepage for some interesting threads but it got repetitive.
  
  Reply View | 0 replies
  
  Cthulhu_ 4 days ago
  
  Man I hit something like that once, a SSD had a firmware bug where it would stop working at an exact number of hours.
  
  Reply View | 0 replies
- lysace 5 days ago
  
  It was on AWS at least (for a while) in 2022.
  https://news.ycombinator.com/item?id=32030400
  
  Reply View | 1 reply
  
  jjice 5 days ago
  
  Yeah looks like they're back on M5.
  dang saying it's temporary: https://news.ycombinator.com/item?id=32031136
  $ dig news.ycombinator.com ; <<>> DiG 9.10.6 <<>> news.ycombinator.com ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54819 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;news.ycombinator.com. IN A ;; ANSWER SECTION: news.ycombinator.com. 1 IN A 209.216.230.207 ;; Query time: 79 msec ;; SERVER: 100.100.100.100#53(100.100.100.100) ;; WHEN: Wed Oct 29 13:59:29 EDT 2025 ;; MSG SIZE rcvd: 65
  And that IP says it's with M5 again.
  
  Reply View | 0 replies
- parliament32 5 days ago
  
  Always has been.
  
  Reply View | 0 replies
Havoc 5 days ago

The sysadmin subreddit tends to beat hn on outage reports by an hour+ in my experience.
Bunch of on-call peeps over there that definitely know the instant something major goes down

Reply View | 0 replies
sergiotapia 5 days ago

Wow I just left a Starbucks drivethru line because it was just not moving. I guess it was because of this.

Reply View | 7 replies
- iso1631 5 days ago
  
  You'd think that Starbucks execs would be held accountable for the fragile system they have put in place.
  But they won't be.
  
  Reply View | 6 replies
  
  peanut-walrus 5 days ago
  
  Why? Starbucks is not providing a critical service. Spending less money and resources and just accepting the risk that occasionally you won't be able to sell coffee for a few hours is a completely valid decision from both management and engineering pov.
  
  Reply View | 5 replies
hypeatei 5 days ago

Starbucks mobile was down during the AWS outage too...

Reply View | 4 replies
- SoftTalker 5 days ago
  
  They are multi-cloud --- vulnerable to all outages!
  
  Reply View | 1 reply
  
  mring33621 5 days ago
  
  you wouldn't believe some of the crap enterprise bigco mgmt put in place for disaster recovery.
  they think that they are 'eliminating a single point of failure', but in reality, they end up adding multiple, complicated points of mostly failure.
  
  Reply View | 0 replies
- Hamuko 5 days ago
  
  Gonna build my application to be multicloud so that it requires multiple cloud platforms to be online at the same time. The RAID 0 of cloud computing.
  
  Reply View | 0 replies
- andoma 5 days ago
  
  Go multi-cloud they said...
  
  Reply View | 0 replies
Theodores 5 days ago

My inner Nelson-from-the-Simpsons wishes I was on your team today, able to flaunt my flask of tea and homemade packed sandwiches. I would tease you by saying 'ha ha!' as your efforts to order coffee with IP packets failed.
I always go everywhere adequately prepared for beverages and food. Thanks to your comment, I have a new reason to do so. Take out coffees are actually far from guaranteed. Payment systems could go down, my bank account could be hacked or maybe the coffee shop could be randomly closed. Heck, I might even have an accident crossing the road. Anything could happen. Hence, my humble flask might not have the top beverage in it but at least it works.
We all design systems with redundancy, backups and whatnot, but few of us apply this thinking to our food and drink. Maybe get a kettle for the office and a backup kettle, in case the first one fails?

Reply View | 0 replies
01284a7e 5 days ago

Ha, maybe rethink the I AM NOTHING BUT A HUGE CLOUD CONSUMER thing on some fundamental levels? Like food?

Reply View | 0 replies
port11 5 days ago

I noticed it when my Netatmo rigamajig stopped notifying me of bad indoor air quality. Lovely. Why does it need to go through the cloud if the data is right there in the home network…

Reply View | 2 replies
- pasc1878 4 days ago
  
  Same here for netatmo - ironically I replied to an incident report with netatmo saying all was OK when the whole system was falling over.
  However netatmo does need to have a server to store data as you need to consolidate acreoss devices plus you can query gfor a year's data and that won't and can't be held locally.
  
  Reply View | 1 reply
  
  port11 3 days ago
  
  It could be local-first. I don't mind the cross-device sync being done centrally, of course, but the app specifically asks for access to Home and Local Network. I wonder if Home Assistant could deal with blackouts…
  
  Reply View | 0 replies
[removed] 5 days ago

[deleted]

Reply View | 0 replies
garbagewoman 5 days ago

Service culture is so hollow

Reply View | 0 replies
jeffrallen 5 days ago

You know you can talk to your barista and ask for a bagel, right? If you're lucky they still take cash... if you still _have_ cash. :)

Reply View | 1 reply
- 0_____0 5 days ago
  
  I was at a McDs a couple months back and I'm pretty sure you had to use the kiosk to order. Some places are deprecating the cashier entirely.
  
  Reply View | 0 replies