MangoCoffee 5 days ago

The Internet is supposed to be decentralized. The big three seem to have all the power now (Amazon, Microsoft, and Google) plus Cloudflare/Oracle.

How did we get here? Is it because of scale? Going to market in minutes by using someone else's computers instead of building out your own, like co-location or dedicated servers, like back in the day.

  • kube-system 5 days ago

    It still is very decentralized. We are discussing this via the internet right now.

    • kbelder 5 days ago

      I need to drop AWS and start passing data through encrypted HN posts.

    • chasd00 5 days ago

      When AWS was down we were talking about it here, now Azure is down and we're still talking about it here. Where does HN actually live?

  • mrinterweb 5 days ago

    A lot of money and years of marketing the cloud as the responsible business decision led us here. Now that the cloud providers have vendor lock-in, few will leave, and customers will continue to wildly overpay for cloud services.

    • gwbas1c 5 days ago

      Ahh, but you forget what it used to be like. Sites used to go down all the time.

      Now, they go down a lot less frequently, but when they do, it's more widespread.

      • bossyTeacher 5 days ago

        Not sure how the current situation is better. Being stranded with no way whatsoever to access most/all of your services sounds way more terrifying than regular issues limited to a couple of services at a time

      • JoBrad 5 days ago

        It’s the Heisenberg cloud principle.

  • deaux 5 days ago

    From today [0].

    > Big Tech lobbying is riding the EU’s deregulation wave by spending more, hiring more, and pushing more, according to a new report by NGO’s Corporate Europe Observatory and LobbyControl on Wednesday (29 October).

    > Based on data from the EU’s transparency register, the NGOs found that tech companies spend the most on lobbying of any sector, spending €151m a year on lobbying — a 33 percent increase from €113m in 2023.

    Gee whizz, I really do wonder how they end up having all the power!

    [0] https://news.ycombinator.com/item?id=45744973

  • alt227 5 days ago

    That's the whole point: big players like AWS and MS can go down, but here we are still talking on the internet.

    Decentralisation is winning it seems.

    • jslaby 5 days ago

      Not everyone has moved over, but I'm sure there have been thoughts or plans to.

  • nzach 5 days ago

    > How did we get here?

    I think the response lies in the surrounding ecosystem.

    If you have a company it's easier to scale your team if you use AWS (or any other established ecosystem). It's way easier to hire 10 engineers that are competent with AWS tools than it is to hire 10 engineers that are competent with the IBM tools.

    And from the individual's perspective it also makes sense to bet on larger platforms. If you want to increase your odds of getting a new job, learning the AWS tools gives you a better ROI than learning the IBM tools.

  • AndrewKemendo 5 days ago

    A natural monopoly is a monopoly in an industry in which high infrastructure costs and other barriers to entry relative to the size of the market give the largest supplier in an industry, often the first supplier in a market, an overwhelming advantage over potential competitors. Specifically, an industry is a natural monopoly if a single firm can supply the entire market at a lower long-run average cost than if multiple firms were to operate within it. In that case, it is very probable that a company (monopoly) or a minimal number of companies (oligopoly) will form, providing all or most of the relevant products and/or services.

    https://en.wikipedia.org/wiki/Natural_monopoly

  • pphysch 5 days ago

    Consolidation is the inevitable outcome of free unregulated markets.

    In our highly interconnected world, decentralization paradoxically requires a central authority to enforce decentralization by restricting M&A, cartels, etc.

    • SoKamil 5 days ago

      Is there a theorem that models this behavior? Capital feels like a mass that attracts more mass the larger it becomes, like gravity.

  • anonymars 5 days ago

    Efficiency (aka cost) <---> Resiliency/redundancy

    Pick your point on the scale

    • deathanatos 5 days ago

      Maybe in a perfect world, or in a free market.

      But the cloud compute market is basically centralized into 2.5 companies at this point. The point of paying companies like Azure here is that they've in theory centralized the knowledge and know-how of running multiple, distributed datacenters, so as to be resilient.

      But if we keep seeing outages that encompass more than a single failure domain, then it should be fair game for engineers / customers to ask "what am I paying for, again?"

      Moreover, this seems to be a classic case of large barriers to entry (the huge capital costs associated with building out a datacenter) barring new entrants into the market, coupled with "nobody ever got fired for buying IBM" level thinking. Are outages like these truly factored into the napkin math that says externalizing this is worth it?

Aldipower 4 days ago

Hetzner, Netcup, OVH, BunnyCDN, ClouDNS, Postmark

You name them. Other good providers you have experience with?

There is no reason for an expensive cloud. Never has been, but decision makers tried to keep their pants dry.

ApolloFortyNine 5 days ago

They admit in their update blurb that Azure Front Door is having issues, but still report Azure Front Door as having no issues on their status page.

And it's very clear from these updates that they're more focused on the portal than the product, their updates haven't even mentioned fixing it yet, just moving off of it, as if it's some third party service that's down.

  • consp 5 days ago

    > as having no issues on their status page

    Unsubstantiated idea: So the support contract likely says there is a window between each reporting step and the status page is the last one and the one in the legal documents giving them several more hours before the clauses trigger.

sedatk 5 days ago

The paradox of cloud provider crashes is that if the provider goes down and takes the whole world with it, it's actually good advertisement. Because, that means so many things rely on it, it's critically important, and has so many big customers. That might be why Amazon stock went up after AWS crash.

If Azure goes down and nobody feels it, does Azure really matter?

  • thewebguyd 5 days ago

    People feel it, but usually not general consumers like they do when AWS goes down.

    If Azure goes down, it's mostly affecting internal stuff at big old enterprises. Jane in accounting might notice, but the customers don't. Contrast with AWS which runs most of the world's SaaS products.

    People not being able to do their jobs internally for a day tends not to make headlines like "100 popular internet services down for everyone" does.

empath75 5 days ago

Friend of mine at MSFT says it's a Sev-0 outage and they can't even get to the ticket tracking system.

kure256 4 days ago

We’ve been experimenting with multi-cluster failover for Kubernetes workloads, and one open-source project that actually works really well is k8gb.

It acts as a GSLB controller inside Kubernetes — doing DNS-level health checks, region awareness, and automatic failover between clusters when one goes down.

It integrates with ExternalDNS and supports multiple DNS providers (Infoblox, Route53, Azure DNS, NS1, etc.), so it can handle failover across both on-prem and cloud clusters.

It’s not a silver bullet for every architecture, but it’s one of the few OSS projects that make multi-region failover actually manageable in practice.
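
For anyone wondering what DNS-level failover boils down to, here is a minimal Python sketch of the idea. To be clear, this is not k8gb's code or API: the `Cluster` type, the `answer_for` function, and the addresses (taken from documentation IP ranges) are all made up for illustration, and the health flag is assumed to come from some health check that runs elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str      # hypothetical cluster label
    ip: str        # address published in DNS for this cluster
    region: str    # region the cluster runs in
    healthy: bool  # result of the most recent health check (assumed to exist)


def answer_for(clusters, client_region):
    """Pick the IPs a GSLB-style resolver could return for a client."""
    healthy = [c for c in clusters if c.healthy]
    if not healthy:
        # Nothing passed its health check: return every address
        # rather than an empty answer.
        return [c.ip for c in clusters]
    local = [c for c in healthy if c.region == client_region]
    # Prefer healthy clusters in the client's region,
    # otherwise fail over to any healthy cluster.
    return [c.ip for c in (local or healthy)]


clusters = [
    Cluster("eu-cluster", "192.0.2.10", "eu", healthy=True),
    Cluster("us-cluster", "198.51.100.10", "us", healthy=True),
]
print(answer_for(clusters, "eu"))  # EU client gets the EU cluster

clusters[0].healthy = False        # simulate the EU cluster going down
print(answer_for(clusters, "eu"))  # EU client is now sent to the US cluster
```

In a real setup the DNS TTL decides how quickly clients notice the change, so a failover like this is only as fast as the records expire.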

kierenj 5 days ago

Sorry - my bad. I literally just connected an old XP VM to the internet to activate it.

blenderob 5 days ago

https://azure.status.microsoft/en-us/status says everything's fine! Any place I can read more about this outage?

  • reid 5 days ago

    You're looking at it. I couldn't find any discussion elsewhere yet...

  • sbergot 5 days ago

    official status pages are useless most of the time.

    • jeffrallen 5 days ago

      I work for a cloud provider which is serious about transparency. Our customers know they are going to get the straight story from our status page.

      When you find an honest vendor, cherish them. They are rare, and they work hard to earn and keep your confidence.

  • sbergot 5 days ago

    Now there is information about "Azure Portal Access Issues". No word about Front Door being down.

amaccuish 5 days ago

Seeing users having issues with the "Modern Outlook", specifically empty accounts. Switching back to the "Legacy Outlook" which functions largely without the help of the cloud fixes the issue. How ironic.

tyfon 5 days ago

Seems to be down in Norway.

Even the national digital id service is down.

  • hexbin010 5 days ago

    > Even the national digital id service is down.

    Can't help but smirk as my country is ramming through "Digital ID" right now

    • bombcar 4 days ago

      Someone somewhere thought that "national digital ID service" should absolutely rely on a cloud provider in and from another country.

      What a time to be alive.

FrostKiwi a day ago

Surprised to see the situation getting worse, what the hell.

Had some Frontdoor operations timing out, but now I'm straight up denied with "Message: All Changes to Azure Frondoor Configuration are blocked currently."

What a mess.

Steven_Vellon 5 days ago

For us, it looks like most services are still working (eastus and eastus2). Our AKS cluster is still running and taking requests. Failures seem limited to management portal.

jmspring 5 days ago

The outage was really weird. For me, parts of the portal worked, other parts didn't. I had access to a couple of resource groups, but no resources were visible in those groups. Azure DevOps Pipelines that needed to download from packages.microsoft.com didn't work.

The Microsoft status page mostly referenced the portal outage, but it was more than that.

  • bombcar 4 days ago

    I hate these failures because you end up with things that keep working fine because the login credentials are cached, etc; but if you restart or otherwise refresh, you're doomed.

mythz 5 days ago

High availability is touted as a reason for their high prices, but I swear I read about major cloud outages far more than I experience any outages at Hetzner.

  • prmoustache 5 days ago

    I think the biggest feature of the big cloud vendors is that when they are down, not only you but also your customers and your competitors usually have issues at the same time, so everybody just shrugs and has a lazy/off day together. Even on-call teams really just have to wait on standby because there is very little they can do. Doing a failover can be slower than waiting for the recovery, may not help at all if the outage spans several regions, or may bring additional risks.

    And more importantly, nobody loses any reputation except AWS/Azure/Google.

    • zavec 5 days ago

      It's like back in school when there was a snow day!

  • graemep 5 days ago

    Ostensible reason.

    The real reason is that outages are not your fault. It's the new version of "nobody ever got fired for buying IBM" - later it became MS, and now it's any big cloud provider.

  • jmaker 5 days ago

    For one it’s statistics - Hetzner simply runs far fewer major services than hyperscalers. And the services they run are also more affluent, with larger customer bases, so downtimes are systemically critical. Therefore it’s louder.

    On the merits though, I agree, haven’t had any serious issues with Hetzner.

  • bad_haircut72 5 days ago

    Same with DigitalOcean. I run one box and it hasn't gone down for like 2 years.

    • yabones 5 days ago

      DO has been shockingly reliable for me. I shut down a neglected box with almost 900 days of uptime the other day. In that time AWS has randomly dropped many of my boxes with no warning, requiring a manual stop/start action to recover them... But everybody keeps telling me that DO isn't "as reliable" as the big three are.

    • ipdashc 5 days ago

      To be fair, in the AWS/Azure outages, I don't think any individual (already created) boxes went down, either. In AWS' case you couldn't start up new EC2 instances, and presumably same for Azure (unless you bypass the management portal, I guess). And obviously services like DynamoDB and Front Door, respectively, went down. Hetzner/DO don't offer those, right? Or at least they're not very popular.

    • robotnikman 5 days ago

      Same here, I run a few droplets for personal projects and have never had any issues with them.

  • bongodongobob 5 days ago

    It's just the admin portal.

reid 5 days ago

This is impacting the Azure CDN at azureedge.net. DNS A records for azureedge.net tenants are taking 2-6 seconds and often return nothing.

AdmiralAsshat 5 days ago

Some exec at Microsoft told the Azure guys to ape everything Amazon does and they took it literally.

  • Telemakhos 5 days ago

    Or, the NSA needed to upgrade their access at both.

    • embedding-shape 5 days ago

      Do Microsoft still say "If the government has a broader voluntary national security program to gather customer data, we don't participate in it" today (which PRISM proved very false), or are they at least acknowledging they're participating in whatever NSA has deployed today?

      • terminalshort 5 days ago

        PRISM wasn't voluntary. Also there are 3 levels here:

        1. Mandatory

        2. "Voluntary"

        3. Voluntary

        And I suspect that very little of what the NSA does falls into category 3. As Sen Chuck Schumer put it "you take on the intelligence community, they have six ways from Sunday at getting back at you"

  • jrochkind1 5 days ago

    I was gonna say that obv AWS hacked em to even things up.

  • dboreham 5 days ago

    This is funny but also possibly true because: business/MBA types see these outages as a way to prove how critical some services are, leading to investors deciding to load up on the vendor's stock.

    • alt227 5 days ago

      I may or may not have been known to temporarily take a database down in the past to make a point to management about how unreliable some old software is.

jammo 4 days ago

We all need to move away from these big cloud providers. Two medium-sized, smaller providers are enough.

- Cloudflare for R2 (object storage) and CDN (Fastly + Backblaze is also an option).
- Two mid-size VPS/server providers with a decent reputation (use a comparison site like https://serversearcher.com or look directly at providers like Hetzner or Latitude).
- PlanetScale or Neon for the database if you don't co-locate it, though it's better to use someone like DigitalOcean, Vultr or Latitude, who offer databases too.

  • dspillett 4 days ago

    > We all need to move away from these big cloud providers.

    But then who do we blame when things are down? If we manage our own infrastructure we have to stay late to fix it when it breaks instead of saying “sorry, Microsoft, nothing we can do” and magically our clients accepting that…

  • blcknight 4 days ago

    Ah yes, let's put our multibillion dollar ecommerce site on... checks notes Hetzner.

    Lol

mystcb 5 days ago

Updated 16:35 UTC

Azure Portal Access Issues

Starting at approximately 16:00 UTC, we began experiencing DNS issues resulting in availability degradation of some services. Customers may experience issues accessing the Azure Portal. We have taken action that is expected to address the portal access issues here shortly. We are actively investigating the underlying issue and additional mitigation actions. More information will be provided within 60 minutes or sooner.

This message was last updated at 16:35 UTC on 29 October 2025

----

Azure Portal Access Issues

We are investigating an issue with the Azure Portal where customers may be experiencing issues accessing the portal. More information will be provided shortly.

This message was last updated at 16:18 UTC on 29 October 2025

-- From the Azure status page

hedayet 5 days ago

The sad thing is - $MSFT isn't even down by 1%. And IIRC, $AMZN actually went up during their previous outage.

So if we look at these companies' bottom lines, all those big wigs are actually doing something right. Sales and lobbying capacity is way more effective than reliability or good engineering (at least in the short term).

  • locusofself 5 days ago

    AMZN went up almost 4 percent between the day of the outage and the day after. Crazy market.

    • jlarocco 5 days ago

      Because it shows how much lock-in they have.

      You know nobody is migrating off of AWS or Azure because of these.

  • navane 5 days ago

    Look how important we are, is what these failures show

    • marcosdumay 5 days ago

      What do you mean? That IT isn't important for Microsoft and Amazon?

      That's certainly not the right conclusion.

      • alt227 5 days ago

        I think he was implying that those companies think they are so important that it doesn't matter when they are down; they won't lose any customers over it because they are too big and important.

    • cyberax 5 days ago

      So we can look forward to "accidental" cloud outages just to show their importance?

      I guess the GCP is next.

    • Arrath 5 days ago

      "They'll learn their lesson and be rock solid after this! I better invest now!"

  • bigstrat2003 5 days ago

    That's a good thing. Stock prices shouldn't go down because of rare incidents which don't accurately represent how successful a company is likely to be in the future.

  • AtNightWeCode 5 days ago

    I looked into this before and the stocks of these large corps simply do not move when outages happen. Maybe intra-day, I don't have that data, but in general there is no effect.

  • iamtheworstdev 5 days ago

    well, at this point, 90% of the market cap of FAANGS plus Microsoft is... OMG AI LLM hype

vincebowdren 5 days ago

UK, and other regions too; our APAC installation in Australia is affected.

progmetaldev 5 days ago

I was having issues a few hours ago. I'm now able to access the portal, although I get lots of errors in the browser console, and things are loading slowly. I have services in the US-East region.

I have been having issues with GitHub and the winget tool for updates throughout the day as well. I imagine things are pulling from the same locations on Azure for some of the software I needed to update (NPM dependencies, and some .NET tooling).

tartieret 5 days ago

Microsoft posted an update on X: https://x.com/AzureSupport/status/1983569891379835372?ref_sr...

"We’re investigating an issue impacting Azure Front Door services. Customers may experience intermittent request failures or latency. Updates will be provided shortly."

buttscicles 4 days ago

Interesting that everybody knows when AWS goes down but Azure needs a "Tell HN" :)

Best of luck to the teams responding to this incident.

  • tartieret 4 days ago

    I was a little puzzled as we got notified our apps were down, and then I tried to log in to the Azure portal with no success. But the Azure status page reported no incident, so I posted here and quickly confirmed that others were impacted! They did a pretty bad job with their status page, as the Front Door service was shown green all along.

borg16 5 days ago

i guess folks in azure wanted to show some solidarity with aws brethren

(couldn't resist adding it. i acknowledge this comment adds no value to the discussion)

  • aurumque 5 days ago

    Azure goes down all the time. On Friday we had an entire regional service down all day. Two weeks ago same thing different region. You only hear about it when it's something everyone uses like the portal, because in general nobody uses Azure unless they're held hostage.

    • Mr_Bees69 5 days ago

      Yeah, I'm regretting my decision to buy an Xbox now. Every once in a while, everything goes down.

glzone1 5 days ago

Wasn't the saying "It's always DNS" floating around somewhere?

Be interesting to understand cause here. Pretty big impact on services we use

  • mikestew 5 days ago

    Could be DNS, I'm seeing SERVFAIL trying to resolve what look to be MS servers when I'm hitting (just one example) mygoodtogo.com (trying to pay a road toll bill, and failing).

bragma 5 days ago

They suggest using Traffic Manager to route around the failing Front Door CDN, but DNS is failing too, which makes the suggestion another failure.

  • asciii 5 days ago

    Yeah they're suggesting to use CLI but then my Frontdoor deployment failed. Welp.

reid 5 days ago

Portal and Azure CDN are down here in the SF Bay Area. Tenant azureedge.net DNS A queries are taking 2-6 seconds and most often return nothing. I got a couple of successful A responses in the last 10 minutes.

Edit: As of 9:19 AM Pacific time, I'm now getting successful A responses but they can take several seconds. The web server at that address is not responding.
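
If you want to reproduce this kind of measurement yourself, a rough Python sketch is below. It times an IPv4 (A record) lookup through the system resolver via `socket.getaddrinfo`, so OS-level caching can hide slowness that a direct query to the authoritative servers would show. The function name `time_a_lookup`, its injectable `resolver` parameter, and the hostname in the usage note are my own assumptions, not anything from Azure tooling.

```python
import socket
import time


def time_a_lookup(hostname, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, addresses) for an IPv4 lookup of hostname."""
    start = time.perf_counter()
    try:
        infos = resolver(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # for AF_INET the sockaddr is an (address, port) pair.
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # Resolution failed (for example NXDOMAIN or SERVFAIL):
        # report an empty answer instead of raising.
        addresses = []
    elapsed = time.perf_counter() - start
    return elapsed, addresses
```

Calling `time_a_lookup("yourtenant.azureedge.net")` (placeholder hostname) in a loop every few seconds gives a crude picture of how often lookups are slow or come back empty.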

eeasss 5 days ago

Deglobalization in geopolitics should be followed by deglobalization in cloud providers as well. Viva la local vendors.

chrisgeleven 5 days ago

"Front Door" has to be the worst product name for a CDN I've ever heard of. I used to work for a CDN too.

  • unethical_ban 5 days ago

    I wonder if many Germans are eager to sign up for AFD.

    But seriously I thought it would be the console, not a CDN.

    • jeffrallen 5 days ago

      Front Door (tm), with Back Door access for the FBI included free with your subscription! ;)

  • oliyoung 5 days ago

    We should've never let marketing in the door honestly, all of the product names for the big three are awful.

    Microsoft CDN

    There, that's it. You're selling it to (hopefully) technical people

jacquesm 5 days ago

It is much more than azure. One of my kids needs a key for their laptop and can't reach that either. Great excuse though, 'Azure ate my homework'. What a ridiculous world we are building. Fuck MS and their account requirements for windows.

elFarto 5 days ago

We saw all incoming traffic to our app drop to zero at about 15:45. I wonder how long this one will take to fix.

NDizzle 5 days ago

My best guess at the moment is something global like the CDN is having problems affecting things everywhere. I'm able to use a legacy application we have that goes directly to resources in uswest3, but I'm not able to use our more modern application which uses APIM/CDN networks at all.

vs4vijay 5 days ago
  • ipsum2 5 days ago

    Status page (first link) is down for me. Second one works

    • charv 5 days ago

      oh the irony, the status link being down too

      • karateka01 5 days ago

        status page being affected by the same issue is so lame

andhuman 5 days ago

I bet it’s DNS.

  • andhuman 5 days ago

    “Starting at approximately 16:00 UTC, we began experiencing DNS issues resulting in availability degradation of some services. Customers may experience issues accessing the Azure Portal. We have taken action that is expected to address the portal access issues here shortly. We are actively investigating the underlying issue and additional mitigation actions. More information will be provided within 60 minutes or sooner.

    This message was last updated at 16:35 UTC on 29 October 2025”

  • pbhjpbhj 4 days ago

    That was my bet too, then I looked at ISC and noticed there were PoCs released for critical BIND9 vulns yesterday ... might be related?

vinyl7 5 days ago

Vibe coded internet keeps getting better

  • avgDev 5 days ago

    Quick find someone who can actually read documentation and code!

  • the_af 5 days ago

    You just paste the outage error codes back to the LLM and pray it's still working and can fix whatever went wrong!

    • m_fayer 5 days ago

      When all the people forget to code for themselves, every LLM will code itself out of existence with that one last bug. One, after another.

ApolloFortyNine 5 days ago

Two hours after the initial outage, they have finally updated the Front Door status on their status page.

tonymet 5 days ago

Any healthcare IT admins care to chime in? A predominantly MS industry with critical workloads.

udfalkso 5 days ago

The OpenAI CLIP Python library fails because the model download is a hardcoded Azure CDN URL :(

SoftTalker 5 days ago

We're on Office 365 and so far it's still responding. At least Outlook and Teams are.

  • jeffdn 5 days ago

    They don't run on Azure!

    • RajT88 5 days ago

      They definitely do run on Azure. Probably not 100%, but at least some footprint of those services do.

    • rcarmo 5 days ago

      Are you absolutely sure?

      • jansper39 5 days ago

        They don't, however authentication for those services relies on Entra ID which seems to be affected.

        • rcarmo 5 days ago

          I'd say DNS/Front Door (or some carrier interconnect) is the thing affected, since I can auth just fine in a few places. (I'm at MS, but not looped into anything operational these days, so I'm checking my personal subscription).

rvz 5 days ago

Looking forward to the post mortem.

  • internet_points 4 days ago

    > What went wrong and why?

    > An inadvertent tenant configuration change within Azure Front Door (AFD) triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on AFD for global content delivery. The change introduced an invalid or inconsistent configuration state that caused a significant number of AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors for downstream services.

    > As unhealthy nodes dropped out of the global pool, traffic distribution across healthy nodes became imbalanced, amplifying the impact and causing intermittent availability even for regions that were partially healthy. We immediately blocked all further configuration changes to prevent additional propagation of the faulty state and began deploying a ‘last known good’ configuration across the global fleet. Recovery required reloading configurations across a large number of nodes and rebalancing traffic gradually to avoid overload conditions as nodes returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.

    > The trigger was traced to a faulty tenant configuration deployment process. Our protection mechanisms, to validate and block any erroneous deployments, failed due to a software defect which allowed the deployment to bypass safety validations. Safeguards have since been reviewed and additional validation and rollback controls have been immediately implemented to prevent similar issues in the future.

    So, so far they're saying it's a combination of bad config + their config-validator had a bug. Would love more details.

    • Aldipower 4 days ago

      We have some trouble with the AFD in Germany too.

chemodax 5 days ago

It seems Azure FrontDoor is affected, because our private VM works fine in different regions.

bossyTeacher 5 days ago

I noticed issues on Azure so I went to the status page. It said everything was fine even though the Azure Portal was down. It took more than 10 minutes for that status page to update.

How can one of the richest companies in the world not offer a better service?

  • Ylpertnodi 5 days ago

    >How can one of the richest companies in the world not offer a better service?

    Better service costs money.

Jarwain 5 days ago

On our end, our VMs are still working, so our gitlab instance is still up. Our services using Azure App Services are available through their provided url. However, Front Door is failing to resolve any domains that it was responsible for.

irusensei 5 days ago

I was working when I saw the portal page showing only resource groups and lots of items missing. I thought it was a weird browser cache issue.

The actual stuff I was working on (App Insights, Function App) that was still open was operational.

baconbrand 5 days ago

All of our sites went down. This is my company’s busiest time of year. Hooray.

_pdp_ 5 days ago

With all the recent outages considered, it is time to move off the cloud.

a_f 5 days ago

Looks like MyGet is impacted too. Seems like they use Azure:

>What is required to be able to use MyGet? ... MyGet runs its operations from the Microsoft Azure in the West Europe region, near Amsterdam, the Netherlands.

hypeatei 5 days ago

All of my employer's things are hosted on Azure and are running just fine; they didn't go down at all. Portal access has been fixed.

Doesn't seem to be too bad of an outage unless you were relying on Azure Front Door.

  • randomsofr 5 days ago

    SSO is down, Azure Portal Down and more, seems like a major outage. Already a lot of services seem to be affected: banks, airlines, consumer apps, etc.

    • hypeatei 5 days ago

      The portal is up for me and their status page confirms they did a failover for it. Definitely not disputing that its reach is wide, but a lot of smaller setups probably aren't using Front Door.

    • rahkiin 5 days ago

      Both work for me in the Netherlands

rcarmo 5 days ago

Not seeing it. I have VMs in US East and Netherlands and they're up.

  • tgv 5 days ago

    I tried to look some things up on their support pages before 1600Z, and it timed-out. The Dutch railways are also affected (they're an MS shop, IIRC).

LaserToy 5 days ago

Azure portal still insists the issue is just with Console.

We had to bypass the Frontdoor

8cvor6j844qw_d6 5 days ago

Quite close to the recent AWS outage. Let me check whether it's a major one like the AWS outage.

Any guess on what's causing it?

In hindsight, I guess the foresight of some organizations to go multi-cloud was correct after all.

  • jcims 5 days ago

    We're multi-cloud and it really saved a few workloads last week with the AWS issue.

    It's not easy though.

    • sanskarix 5 days ago

      This is the eternal tension for early-stage builders, isn't it? Multi-cloud gives you resilience, but adds so much complexity that it can actually slow down shipping features and iterating.

      I'm curious—at what point did you decide the overhead was worth it? Was it after experiencing an outage, or did you architect for it from day one?

      As someone launching a product soon (more on the builder/product side than infra-engineer), I keep wrestling with this. The pragmatist in me says "start simple, prove the concept, then layer in resilience." But then you see events like this week and think "what if this happens during launch?"

      How did you handle the operational complexity? Did you need dedicated DevOps folks, or are there patterns/tools that made it manageable for a smaller team?

      • jcims 5 days ago

        I don't think I would recommend multi-cloud right out of the gate unless you already have a lot of experience in the space or there is a strong demand from your customers. There's a tremendous amount of overhead with security/compliance, incident management, billing, tooling, entitlements, etc. There are a number of external factors that drove our decision to do it, resiliency is just one of them. But we are a pretty big shop, spending ~$10M/mo on cloud infra and have ~100 people in the platform management department.

        I would recommend focusing on multi-region within a single CSP instead (both for workloads AND your tooling), which covers the vast majority of incidents and lays some of the architectural foundation for multi-cloud down the road. Develop failover plans for each service in your architecture (eg. planned/tested runbooks to migrate to Traffic Manager in the event AFD goes down)

        Also choose your provider wisely. We experience 3-5x the number of service-impacting incidents on Azure that we do on AWS. I'm sure others have different experiences, but I would never personally start a company on Azure. AWS has its own issues, of course, but reliability has not been a major one (relatively speaking) over the past 10 years. Last week's incident with DynamoDB in us-east-1 had zero impact on our AWS workloads in other regions.

  • iAMkenough 5 days ago

    Trusting AI without sufficient review and oversight of changes to production.

    • whynotminot 5 days ago

      Yeah, these things never happened when humans were trusted without sufficient review and oversight of changes to production.

    • shepherdjerred 5 days ago

      Do you have any insight or do you just dislike AI? Incidents like this happened long before AI-generated code.

      • Capricorn2481 5 days ago

        I don't think it's meant to be serious. It's a comment on Microsoft laying off their staff and stuffing their Azure and Dotnet teams with AI product managers.

avgDev 5 days ago

I am having a bunch of issues. It looks like their sites and Azure are both affected.

I also got a weird notification in VS2022 that my license key was upgraded to Enterprise, but we did not purchase anything.

  • Mr_Bees69 5 days ago

    Might be a failsafe: if you can't get a license status, and you're aware that MS is down, just default to the highest tier.

CKMo 5 days ago

Reasons to not use hyperscalers, exhibit 654

There's a lot of outages this month!

dlcarrier 5 days ago

Yesterday Amazon, today Microsoft. Are Google's cloud services going down tomorrow?

  • Insanity 5 days ago

    Maybe they are and no one realized yet.. :P

    That said, I don't hear about GCP outages all that often. I do think AWS might be leading in outages, but that's a gut feeling, I didn't look up numbers.

    • xenolithis 5 days ago

      fairly certain they had a significant multi region outage within the past few years. I'll try to find some details to link.

      Few customers....few voices to complain as well.

    • Mr_Bees69 5 days ago

      as a victim of xbox, azure is down 'bout as often as it's up

  • briffle 5 days ago

    here's hoping it's Oracle's cloud instead....

thimkerbell 5 days ago

Does (should, could) DownDetector also say what customer-facing services are down, when some infrastructure is unworking? Or is that the info that the malefactors are seeking?

tpl 5 days ago

Part of this outage involves Outlook hanging and then blaming random add-ins. Pretty terrible practice by Microsoft to blame random vendors for their own outage.

syntaxing 5 days ago

I absolutely love the utility aspect of LLMs, but part of me is curious whether moving faster by using AI is going to make these sorts of failures more and more frequent.

  • monkaiju 5 days ago

    If true then what "utility" is there?

    • 1718627440 5 days ago

      More visibility for the general person to see how brittle software is?

bronco21016 5 days ago

Unable to access the portal and any hit to SSO for other corporate accesses is also broken. Seems like there's something wrong in their Identity services.

user3939382 5 days ago

I know how to fix this, but this community is too closed-minded, argumentative, egocentric, sensitive, pedantic, threatened, angry, etc. to bother discussing it.

perks_12 5 days ago

Thank you. I was wondering what was going on at a company whose web app I need to access. I just checked with BuiltWith and it seems they are on Azure.

senderista 5 days ago

Even if the cloud providers have much better reliability than most on-prem infra, the failure correlation they induce negates much of the benefit.
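
The correlation point can be made concrete with a little arithmetic. The sketch below uses made-up availability figures (they are assumptions, not measurements of any provider) and the simplification that a provider outage takes down everything hosted on it. It compares the chance that several services you depend on are all down at the same moment when each is hosted independently versus when they share one provider.

```python
def prob_all_down_independent(availability: float, n_services: int) -> float:
    """Chance that n independently hosted services are all down at the same moment."""
    return (1.0 - availability) ** n_services


def prob_all_down_shared(availability: float) -> float:
    """Chance that every service is down when they all share one provider.

    Simplifying assumption: a provider outage takes down everything on it.
    """
    return 1.0 - availability


# Illustrative, made-up numbers (not measurements of any real provider).
ON_PREM_AVAILABILITY = 0.99   # each independent site is down 1% of the time
CLOUD_AVAILABILITY = 0.9999   # the shared provider is down 0.01% of the time
N_SERVICES = 5                # services one person depends on

if __name__ == "__main__":
    independent = prob_all_down_independent(ON_PREM_AVAILABILITY, N_SERVICES)
    shared = prob_all_down_shared(CLOUD_AVAILABILITY)
    print(f"all {N_SERVICES} down at once, independent hosting: {independent:.2e}")
    print(f"all {N_SERVICES} down at once, one shared provider: {shared:.2e}")
```

Under these assumptions each individual service is more reliable on the shared provider, yet the "everything is down at once" event becomes far more likely, which is the trade-off the comment describes.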

ThatManulTheCat 5 days ago

Azure portal currently mostly not working (UK)... Downdetector reporting various Microsoft linked services are out (Minecraft, Microsoft 365, Xbox...)

_oleksandr_ 5 days ago

Based on the delay in resolving the issue, it appears MS attempted to rehire some of the DevOps engineers whom AI had previously replaced.

  • jeffrallen 5 days ago

    They probably hired the ones AWS laid off, causing the AWS outage.

    Institutional knowledge matters. Just has to be the right institution is all.

zaoui_amine 4 days ago

Language models aren't perfect; they can still generate similar outputs. Invertibility is a stretch.

djeastm 5 days ago

I'm mid-deployment, but thankfully it seems to be running ok so far. Just the portal is not working so my visibility is not good.

bragma 5 days ago

They suggest using Traffic Manager to route around failing CDNs. But DNS isn't working either, which makes the suggestion another failure.
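
For readers unfamiliar with the suggestion, priority-based failover boils down to something like the sketch below. This is a generic illustration, not Traffic Manager's API: `pick_endpoint`, the hostnames, and the simulated health state are all hypothetical. Note that in a DNS-based setup the chosen endpoint still has to be handed to clients through DNS, which is exactly the dependency the comment points at.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all fail."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical hostnames: the CDN front end first, then the origin directly.
ENDPOINTS = ["cdn.example.com", "origin.example.com"]

if __name__ == "__main__":
    # Simulated health state; a real probe would issue an HTTP request instead.
    down = {"cdn.example.com"}
    chosen = pick_endpoint(ENDPOINTS, lambda host: host not in down)
    print(f"routing traffic to: {chosen}")
```

The health check is passed in as a function so the selection logic can be tested without any network access.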

tecleandor 5 days ago

LinkedIn has been acting funny for an hour or so, and some pages in the learn.microsoft.com domain have been failing for me too...

ZeroConcerns 5 days ago

Oh, well, I'm sure Azure will be given the same pass that AWS got here recently when they had their 12-hour outage...

  • taeric 5 days ago

    I didn't realize AWS got a pass?

    • graemep 5 days ago

      Have repeated outages lost them customers? Has it lost them any money in any way?

      That is a pass.

      • taeric 5 days ago

        Apologies, but this just reads like a low effort critique of big things.

        To be clear, they should get criticism. They should be held liable for any damage they cause.

        But the fact that they remain the biggest cloud offering out there isn't something you'd expect to change because of a few outages that, by almost all evidence, potential replacements have as well. Moreover, the outages those potential replacements have are often more global in nature.

      • philipallstar 5 days ago

        Have people left GitHub due to the multiple post-acquisition outages? That is a pass if you don't judge it the same way.

      • prmoustache 5 days ago

        Well, they have successfully kept their customers captive thanks to huge egress fees.

      • arccy 5 days ago

        customers like us are certainly looking at expanding from just multi region into instead being multi cloud...

everfrustrated 5 days ago

GitHub runners (specifically the "larger" runner types) are all down for us. These are known to be hosted on Azure.

martijnvds 5 days ago

This probably explains why paying for street parking in Cologne by phone/web didn't work (eternal spinner) then

zbowling 5 days ago

Alaska Airlines is redirecting folks to their slimmed down international site and you can't check in on mobile.

smithkl42 5 days ago

The iron law of uptime: "The mandatory single point of failure in every possible system is configuration."

ycombinatornews 5 days ago

So that’s why CapitalOne is out today. Even though their (incorrect) status page says all systems operational.

baconbrand 5 days ago

Our Azure DevOps site is still functioning and our Azure hosted databases are accessible. Everything else is cooked.

jimmyl02 5 days ago

pretty interesting how datadog's uptime tracker (https://updog.ai/) says all the sites are fully available.

if that's true then it's a sign that Azure's control / data plane separation is doing its job! at least for now

  • jonathanlydall 5 days ago

    Our Azure hosted dotnet App Service is working fine, but our docs site served via Front Door went down. Can’t access anything through the Portal.

  • layer8 5 days ago

    Maybe they need a downtime tracker. ;)

tartieret 4 days ago

It took a good half hour after we detected the problem to see a notification on the Azure status page. Thanks to those who responded to my question, as it validated that the issue was global and we contacted our users right away.

Mr_Bees69 5 days ago

MS website seems to be up but really slow. Think xbox might still be down, Bing works for some reason tho!?

udev4096 5 days ago

Luckily, no one uses azure and it's fully expected from azure to go down all the time! Keep it up!

ksec 5 days ago

>Last week AWS, now this.

This is not the first or second time this has happened: multiple hyperscalers failing one after another.

zaoui_amine 4 days ago

Yeah, Azure is a mess today. Can't do anything without the portal.

twodave 5 days ago

Appears to be an issue in Front Door. Our back end stuff is fine but FD is bouncing everything.

  • NDizzle 5 days ago

    Yeah, I have non prod environments that don't use FD that are functioning. Routing through FD does not work. And a different app, nonprod doesn't use FD (and is working) but loads assets from the CDN (which is not working).

    FD and CDN are global resources and are experiencing issues. Probably some other global resources as well.

    Hate to say it, but DNS is looking like it's still the undisputed champ.

qmr 5 days ago

Always in these large provider outages you see people who have forgotten the old ways.

AtNightWeCode 5 days ago

Earnings report today. A coincidence?

I can at least login to Azure. But several MS sites are down.

vanviegen 5 days ago

Many (all?) LinkedIn profiles are also down for me. Luckily the frontpage still works. ;-)

Go cloud!

DeathArrow 4 days ago

Buy cloud because you're always safe! Until you aren't.

  • mnau 4 days ago

    It doesn't matter whether you actually are safe or not. What matters is that you are in compliance.

major505 4 days ago

Somewhere, an ex-Microsoft engineer who was laid off last week is saying to himself “thank god, this shit is not my problem anymore”

amluto 5 days ago

vscode.dev appears to be down. I think this will be my excuse to find an alternative -- I never really liked vscode.dev anyway.

(Coder is currently at the top of the experiment list. Any other suggestions?)

redwood 5 days ago

Is it Cosmos DB? If so the symmetry with AWS/Dynamo would be very eerie.

macshome 5 days ago

I just tried to check the Xbox services status page and it never even loaded.

  • chokolad 5 days ago

    Majority of actual Xbox services are working fine, xbox.com itself is busted.

Shuddown 5 days ago

GitHub Codespaces (for the 5 people that use them) are also still down.

kryogen1c 5 days ago

downdetector reports coincident cloudflare outage. is microsoft using cloudflare for management plane, or is there common infra? data center problem somewhere, maybe fiber backbone? BGP?

xer0x 5 days ago

Wow, they are still down 12 hours later. :/

  • croemer 5 days ago

    Not officially - status page says all healthy

llimos 5 days ago

Yep, down from here too (in Israel).

Services too, not just the portal.

acd 5 days ago

Putting all your eggs (software) in one basket

shivenigma 14 hours ago

what's happening? self hosting advocate groups attacking all cloud to prove their point?

pred8er 5 days ago

On the line with MSFT; they said 4 hours is what they are thinking. A workaround they are suggesting is to use Traffic Manager.

kierenj 5 days ago

microsoft.com is back -

edit: it worked once, then died again. So I guess - some resolvers, or FD servers may be working!

zelias 5 days ago

Anyone have betting odds on when Google will go down next? Are we looking at all 3 providers having outages in the span of 3 weeks?

xuf 5 days ago

Down here too (region West Europe)

rluhar 5 days ago

Looks like AWS is also impacted?

  • zavec 5 days ago

    Yeah the graph for that one looks exactly the same shape. I wonder if they were depending on some azure component somehow, or maybe there were things hosted on both and the azure failure made enough things failover to AWS that AWS couldn't cope? If that was the case I'd expect to see something similar with GCP too though.

    Edit: nope looks like there's actually a spike on GCP as well

    • estel 5 days ago

      It's possibly more likely that people mis-attribute the cause of an outage to the wrong providers when they use downdetector.

      • zavec 5 days ago

        Definitely also a strong possibility. I wish I had paid more attention during the AWS one earlier to see what other things looked like on there at the time.

thewisenerd 5 days ago

they recently had an incident with front door reachability, wonder if it's back.

QNBQ-5W8

pred8er 5 days ago

looks like MS completed a failover and things are recovering slowly

giantg2 5 days ago

Compare the comments and news coverage on this compared to the AWS outage... pretty telling.

razodactyl 5 days ago

AWS, now Azure - wasn't this a plot point in Terminator where SkyNet was causing computer systems to have issues long before it finally became self-aware?

Funnily enough, AI has been training on its own data as generated by users writing AI conversations back to the internet - there's a feedback loop at play.

dlcarrier 5 days ago

We're quickly learning who's relying on a single cloud provider.

  • Insanity 5 days ago

    Multi cloud is really hard to get right at scale, and honestly not worth the effort for the majority of companies and use cases.

    • MiguelHudnandez 5 days ago

      When you look at the scale of the reports, you find they are much lower than Azure's. Seeing a bunch of 24-hour sparkline type graphs next to each other can make it look like they are equally impacted, but AWS has 500 reports and Azure has 20,000. The scale is hidden by the choice of graph.

      In other words, people reporting outages at AWS are probably having trouble with Microsoft-run DNS services or caching proxies. It's not that the issues aren't there, it's that the internet is full of intermingled complexity. Just that amount of organic false positives can make it look like an unrelated major service is impacted.
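
The "scale is hidden by the choice of graph" point can be shown with a small sketch. Only the peaks (500 and 20,000) come from the comment above; the other data points and the `normalize` helper are made up for illustration, and real sparkline widgets may scale differently.

```python
def normalize(series):
    """Scale a series by its own maximum, the way a per-graph sparkline does."""
    peak = max(series)
    if peak == 0:
        return [0.0 for _ in series]
    return [value / peak for value in series]


# Hypothetical report counts; only the peaks match the figures in the comment.
aws_reports = [5, 10, 500, 250, 10]
azure_reports = [200, 400, 20000, 10000, 400]

if __name__ == "__main__":
    # After per-graph scaling both curves have the same shape on screen...
    print(normalize(aws_reports) == normalize(azure_reports))
    # ...even though one peak is many times larger than the other.
    print(max(azure_reports) / max(aws_reports))
```

Plotting both series on a shared axis, or showing the peak count next to each sparkline, would make the difference visible.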

worik 5 days ago

An important quality of the cloud is that it is always available.

Except that it is not!

Interesting times...

journal 5 days ago

one day these outages will cause a starvation.

tonymet 5 days ago

Hello fellow boomers!

I noticed that winget is also down, e.g.:

  winget upgrade fabric
  Failed in attempting to update the source: winget
  An unexpected error occurred while executing the command:
  InternetOpenUrl() failed.
  0x80072ee7 : unknown error

patching-trowel 5 days ago

As of now Azure Status page still shows no incident. It must be manually updated, someone has to actively decide to acknowledge an issue, and they're just... not. It undermines confidence in that status page.

  • baconbrand 5 days ago

    I have never noticed that page being updated in a timely manner.

  • charles_f 5 days ago

    It shows that some people have issues accessing the portal.

m_a_g 5 days ago

It’s not DNS

There is no way it’s DNS

It was DNS

AtNightWeCode 5 days ago

From Azure status page: "Customers can consider implementing failover strategies with Azure Traffic Manager, to fail over from Azure Front Door to your origins".

What terrible advice.

rsolva 5 days ago

So that's why all of our municipality's digital services are down ... utter chaos at the political meeting I attended just now.

zzake 5 days ago

Portal is now accessible, bypassing FDN

rawgabbit 5 days ago
  • the_af 5 days ago

    I especially like how Nadella speaks of layoffs as some kind of uncontrollable natural disaster, like a hurricane, caused by no-one in particular. A kind of "God works in mysterious ways".

        > “Microsoft is being recognized and rewarded at levels never seen before,” Nadella wrote. “And yet, at the same time, we’ve undergone layoffs. This is the enigma of success in an industry that has no franchise value.”
         
        > Nadella explained the disconnect between thriving financials and layoffs by stating that “progress isn’t linear” and that it is “sometimes dissonant, and always demanding.”
    
    I've read the whole memo and it's actually worse than those excerpts. Nadella doesn't even claim these were low performers:

        > These decisions are among the most difficult we have to make. They affect people we’ve worked alongside, learned from, and shared countless moments with—our colleagues, teammates, and friends.
    
    Ok, so Microsoft is thriving, these were friends and people "we've learned from", but they must go because... uh... "progress isn't linear". Well, thanks Nadella! That explains so much!
  • FeteCommuniste 5 days ago

    > [Satya Nadella] said that the company’s future opportunity was to bring AI to all eight billion people on the planet.

    But what if I don't want AI brought to me?

almosthere 5 days ago

Reports of Azure and AWS down on the same day? Infrastructure terrorism?

  • reaperducer 5 days ago

    > Reports of Azure and AWS down on the same day? Infrastructure terrorism?

    > We have confirmed that an inadvertent configuration change as the trigger event for this issue.

    Save the speculation for Reddit. HN is better than that.

  • 12_throw_away 5 days ago

    > Infrastructure terrorism?

    Unless that's a euphemism for "vibe coding", no.