Comment by canyon289 15 days ago

106 replies

I work at Google on these systems every day (caveat: these are my own words, not my employer's). So I can simultaneously tell you that it's smart people really thinking about every facet of the problem, and that I can't tell you much more than that.

However, I can share this, written by my colleagues! You'll find great explanations of accelerator architectures and the considerations made to make things fast.

https://jax-ml.github.io/scaling-book/

In particular, your questions are around inference, which is the focus of this chapter: https://jax-ml.github.io/scaling-book/inference/

Edit: Another great resource to look at is the unsloth guides. These folks are incredibly good at getting deep into various models and finding optimizations, and they're very good at writing it up. Here's the Gemma 3n guide, and you'll find others as well.

https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...

KaiserPro 15 days ago

Same explanation but with less mysticism:

Inference is (mostly) stateless. So unlike training where you need to have memory coherence over something like 100k machines and somehow avoid the certainty of machine failure, you just need to route mostly small amounts of data to a bunch of big machines.

I don't know what the specs of their inference machines are, but where I worked the machines research used were all 8-GPU monsters. So long as your model fitted in (combined) VRAM, the job was a good'un.

To scale, the secret ingredient was industrial amounts of cash. Sure, we had DGXs (fun fact: Nvidia sent literal gold-plated DGX machines) but they weren't dense, and they were very expensive.

Most large companies have robust RPC and orchestration, which means the hard part isn't routing the message, it's making the model fit in the boxes you have. (That's not my area of expertise though.)

  • zozbot234 14 days ago

    > Inference is (mostly) stateless. ... you just need to route mostly small amounts of data to a bunch of big machines.

    I think this might just be the key insight. The key advantage of doing batched inference at huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge number of requests!); you "only" pay for the request-specific raw compute and the memory storage and bandwidth for the activations. And the proprietary models are now huge, highly quantized, extreme-MoE models where the former factor (model size) is enormous and the latter (request-specific compute) has been correspondingly minimized; where it hasn't, you're definitely paying "pro" pricing for it. I think this goes a long way towards explaining how inference at scale can work better than it does locally.
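
    As a rough sketch of why that works (all numbers below are made-up round figures, not real model specs): the bandwidth spent streaming weights is paid once per batch step, so the per-request share shrinks as the batch grows, while the per-request activation traffic stays roughly constant.

        # Back-of-envelope: how batching amortizes weight bandwidth.
        # All numbers are hypothetical round figures, not real model specs.
        WEIGHT_BYTES = 200e9      # assume ~200 GB of (quantized) weights
        ACT_BYTES_PER_REQ = 50e6  # assumed per-request activation/KV traffic per step

        for batch in (1, 8, 64, 512):
            weights_per_req = WEIGHT_BYTES / batch  # weights read once per batch step
            total_per_req = weights_per_req + ACT_BYTES_PER_REQ
            print(f"batch={batch:4d}  weight share per request = "
                  f"{weights_per_req / 1e9:7.2f} GB, total = {total_per_req / 1e9:7.2f} GB")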

    (There are "tricks" you could do locally to try and compete with this setup, such as storing model parameters on disk and accessing them via mmap, at least when doing token gen on CPU. But of course you're paying for that with increased latency, which you may or may not be okay with in that context.)

    • patrick451 14 days ago

      > The key advantage of doing batched inference at huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge number of requests!)

      Kind of unrelated, but this comment made me wonder when we will start seeing side channel attacks that force queries to leak into each other.

      • jeffrallen 14 days ago

        I asked a colleague about this recently and he explained it away with a wave of the hand, saying, "different streams of tokens and their context are on different ranks of the matrices." And I kinda believed him, based on the diagrams I see on the Welch Labs YouTube channel.

        On the other hand, I've learned that when I ask questions about security to experts in a field (who are not experts in security) I almost always get convincing hand waves, and they are almost always proven to be completely wrong.

        Sigh.

    • saagarjha 14 days ago

      mmap is not free. It just moves bandwidth around.

      • zozbot234 14 days ago

        Using mmap for model parameters allows you to run vastly larger models for any given amount of system RAM. It's especially worthwhile when you're running MoE models, where parameters for unused "experts" can just be evicted from RAM, leaving room for more relevant data. But of course the same applies more generally, e.g. to individual model layers.
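
        A minimal sketch of the idea (the file name, dtype, and shapes here are hypothetical; real weight formats differ): map the weights instead of loading them, and only the pages you actually touch get pulled into RAM.

            # Minimal sketch: lazily map expert weights from disk instead of loading them.
            # "experts.bin" and the shapes are made up for illustration.
            import numpy as np

            N_EXPERTS, ROWS, COLS = 64, 4096, 14336

            # mmap the file into the address space; nothing is read from disk yet.
            experts = np.memmap("experts.bin", dtype=np.float16, mode="r",
                                shape=(N_EXPERTS, ROWS, COLS))

            def run_expert(idx, x):
                # Only the pages backing expert `idx` are faulted into RAM here;
                # experts that are never routed to stay on disk and are easy to evict.
                return x @ experts[idx]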

  • abdullin 14 days ago

    > Inference is (mostly) stateless

    Quite the opposite. Context caching requires state (the K/V cache) kept in or near VRAM. Streaming requires state. Constrained decoding (known as Structured Outputs) also requires state.

    • KaiserPro 14 days ago

      > Quite the opposite.

      Unless something has dramatically changed, the model is stateless. The context cache needs to be injected before the new prompt, but from what I understand (and please do correct me if I'm wrong) the context cache isn't that big, on the order of a few tens of kilobytes. Plus the cache saves seconds of GPU time, so an extra 100ms of latency is nothing compared to a cache miss. So a broad cache is much, much better than a narrow local cache.

      But even if it's larger, your bottleneck isn't the network, it's waiting on the GPUs to be free [1]. So while keeping the cache really close, i.e. in the same rack or the same machine, will give the best performance, it will limit your scale (because the cache is only effective for a small number of users).

      [1] 100 MB of data shared over the same datacentre network every 2-3 seconds per node isn't that much, especially if you have a partitioned network (i.e. like AWS, where you have a block network and a "network" network).

      • spott 12 days ago

        The KV cache for dense models is on the order of 50% of the parameters. For sparse MoE models it can be significantly smaller, I believe, but I don't think it is measured in kilobytes.
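
        For a rough sense of scale (the layer/head counts below are illustrative, not any specific model's), the usual per-token formula is 2 (K and V) x layers x kv_heads x head_dim x bytes per element, which lands in the hundreds of kilobytes per token and gigabytes per long context:

            # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
            # Illustrative numbers for a mid-sized dense model with GQA and an fp16 cache.
            layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2

            per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
            context = 32_000  # tokens

            print(f"{per_token / 1024:.0f} KiB per token, "
                  f"{per_token * context / 1e9:.1f} GB for a {context}-token context")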

blibble 15 days ago

> So I can simultaneously tell you that it's smart people really thinking about every facet of the problem, and that I can't tell you much more than that.

"we do 1970s mainframe style timesharing"

there, that was easy

  • kstrauser 15 days ago

    For real. Say it takes 1 machine 5 seconds to reply, and that a machine can only possibly form 1 reply at a time (which I doubt, but for argument).

    If the requests were regularly spaced, and they certainly won’t be, but for the sake of argument, then 1 machine could serve about 17,000 requests per day, or roughly 120,000 per week. At that rate, you’d need about 5,800 machines to serve 700M weekly requests. That’s a lot to me, but not to someone who owns a data center.

    Yes, those 700M users will issue more than 1 query per week and they won’t be evenly spaced. However, I’d bet most of those queries will take well under 1 second to answer, and I’d also bet each machine can handle more than one at a time.

    It’s a large problem, to be sure, but that seems tractable.
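
    The same back-of-envelope in code form, keeping the deliberately pessimistic assumptions (5 seconds per reply, one reply at a time, one request per user per week):

        # Deliberately pessimistic capacity estimate: 5 s per reply, 1 reply at a time.
        seconds_per_reply = 5
        replies_per_machine_per_week = 7 * 24 * 3600 // seconds_per_reply  # ~121k
        weekly_requests = 700_000_000  # one request per weekly user, for argument

        machines = weekly_requests / replies_per_machine_per_week
        print(f"{replies_per_machine_per_week:,} replies/machine/week "
              f"-> ~{machines:,.0f} machines")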

  • brookst 14 days ago

    But that’s not accurate. There are all sorts of tricks around the KV cache: different users will have the same first X tokens because they share system prompts, entire inputs/outputs can be cached when the context and user data are identical, and more.

    Not sure if you were just joking or really believe that, but for other peoples’ sake, it’s wildly wrong.
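
    A toy sketch of the prefix-reuse idea (the names and cache keying here are hypothetical, not any particular serving stack): requests whose token sequences start with the same system prompt can reuse the K/V already computed for that prefix.

        # Toy prefix cache: identical leading tokens (e.g. a shared system prompt)
        # map to the same cached K/V entry, so the prefill work is done only once.
        # `compute_kv` stands in for the real prefill; everything here is illustrative.
        prefix_cache = {}

        def get_kv(tokens, compute_kv):
            key = tuple(tokens)  # real systems hash fixed-size token blocks instead
            if key not in prefix_cache:
                prefix_cache[key] = compute_kv(tokens)
            return prefix_cache[key]

        system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
        kv_a = get_kv(system_prompt, lambda t: f"KV({len(t)} tokens)")
        kv_b = get_kv(system_prompt, lambda t: f"KV({len(t)} tokens)")
        assert kv_a is kv_b  # the second request reuses the cached prefix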

    • kossTKR 13 days ago

      Really? So the system recognises someone asked the same question and serves the same answer? And who on earth shares the exact same context?

      I mean, I get the idea, but it sounds so incredibly rare that it would mean absolutely nothing optimisation-wise.

      • brookst 12 days ago

        Yes. It is not incredibly rare, it's incredibly common. A huge percentage of queries to retail LLMs are things like "hello" and "what can you do", with static system prompts that make the total context identical.

        It's worth maybe a 3% reduction in GPU usage. So call it a half billion dollars a year or so, for a medium to large service.

        • throwaway2037 9 days ago

              > It's worth maybe a 3% reduction in GPU usage. So call it a half billion dollars a year or so, for a medium to large service.
          
          So if 3% is 500M, then annual spend is ~16.6B. That is medium-sized these days?
      • fc417fc802 13 days ago

        Even if that were the case you wouldn't be wrong. Adding caching and deduplication (and clever routing and sharding, and ...) on top of timesharing doesn't somehow make it not timesharing anymore. The core observation about the raw numbers still applies.

  • claytongulick 14 days ago

    I'm pretty sure that's not right.

    They're definitely running cluster knoppix.

    :-)

  • rootsudo 15 days ago

    Makes perfect sense, completely understand now!

benreesman 15 days ago

I don't think it's either useful or particularly accurate to characterize modern disagg racks of inference gear, well-understood RDMA and other low-overhead networking techniques, aggressive MLA and related cache optimizations that are in the literature, and all the other stuff that goes into a system like this as being some kind of mystical thing attended to by a priesthood of people from a different tier of hacker.

This stuff is well understood in public, and where a big name has something highly custom going on? Often as not it's a liability around attachment to some legacy thing. You run this stuff at scale by having the correct institutions and processes in place that it takes to run any big non-trivial system: that's everything from procurement and SRE training to the RTL on the new TPU, and all of the stuff is interesting, but if anyone was 10x out in front of everyone else? You'd be able to tell.

Signed, Someone Who Also Did Megascale Inference for a TOP-5 For a Decade.

tough 15 days ago

Doesn't Google have TPUs that make inference of their own models much more profitable than, say, having to rent Nvidia cards?

Doesn't OpenAI depend mostly on its relationship/partnership with Microsoft to get GPUs to run inference on?

Thanks for the links, interesting book!

  • ActorNightly 15 days ago

    Yes. Google is probably gonna win the LLM game, tbh. They had a massive head start with TPUs, which are very energy-efficient compared to Nvidia cards.

    • baxtr 15 days ago

      The only one who can stop Google is Google.

      They’ll definitely have the best model, but there is a chance they will f*up the product / integration into their products.

      • scarface_74 15 days ago

        It would take talent for them to mess up hosting businesses who want to use their TPUs on GCP.

        But then again, even there, their reputation for abandoning products, lack of customer service, and condescension toward large enterprises’ “legacy tech” lets Microsoft, the king of hand-holding big enterprise, and even AWS run roughshod over them.

        When I was at AWS ProServe, we didn’t even bother coming up with talking points when competing with GCP except to point out how they abandon services. Was it partially FUD? Probably. But it worked.

      • adastra22 15 days ago

        There is plenty of time left to fumble the ball.

    • benreesman 14 days ago

      Google will win the LLM game if the LLM game is about compute, which is the common wisdom and maybe true, but not foreordained by God. There's an argument that if compute were the dominant term, Google would never have been anything but leading by a lot.

      Personally, right now I see one clear leader and one group going 0-99 like a five-sigma cosmic ray: Anthropic and the PRC. But this is because I believe/know that all the benchmarks are gamed as hell; it's like asking if a movie star had cosmetic surgery. On quality, Opus 4 is 15x the cost and sold out / backordered. Qwen 3 is arguably in second place.

      In both of those cases, extreme quality expert labeling at scale (assisted by the tool) seems to be the secret sauce.

      Which is how it would play out if history is any guide: when compute as a scaling lever starts to flatten, you expert-label like it's 1987 and claim it's compute and algorithms until the government wises up and stops treating your success personally as a national security priority. It's the easiest trillion Xi Jinping ever made: pretending to think LLMs are AGI too, fast-following for pennies on the dollar, and propping up a stock market bubble to go with the fentanyl crisis? 9-D chess. It's what I would do about AI if I were China.

      Time will tell.

      • 0_____0 14 days ago

        I believe Google might win the LLM game simply because they have the infrastructure to make it profitable - via ads.

        All the LLM vendors are going to have to cope with the fact that they're lighting money on fire, while Google has the paying customers (advertisers) and, with the user-specific context they get from their LLM products, one of the juiciest and most targetable ad audiences of all time.

      • ActorNightly 14 days ago

        Everyone seems to forget about MuZero, which was arguably more important than the transformer architecture.

    • fakedang 15 days ago

      Yeah, honestly. They could just try selling solutions and SLAs combining their TPU hardware with on-prem SOTA models and practically dominate enterprise. From what I understand, that's GCP's game plan too for most regulated enterprise clients.

      • ActorNightly 15 days ago

        Google's bread and butter is advertising, so they have a huge interest in keeping things in-house. Data is more valuable to them than money from hardware sales.

        Even then, I think their primary use case is going to be consumer-grade, good AI on phones. I dunno why the Gemma QAT models fly so low under the radar, but you can basically get full-scale Llama 3-like performance from a single 3090 now, at home.

      • Ericson2314 15 days ago

        Renting out hardware like that would be such a cleansing old-school revenue stream for Google... just imagine...

    • stogot 15 days ago

      Hasn’t the Inferentia chip been around long enough to make the same argument? AWS and Google probably have the same order of magnitude of their own custom chips

    • davedx 15 days ago

      But they’re ASICs so any big architecture changes will be painful for them right?

      • llm_nerd 15 days ago

        TPUs are accelerators for the common operations found in neural nets. A big part is simply a massive number of matrix FMA units to process enormous matrix operations, which make up the bulk of a forward pass through a model. Caching enhancements and massively growing memory were necessary to facilitate transformers, but on the hardware side not a huge amount has changed, and the fundamentals from years ago still power the latest models. The hardware is just getting faster, with more memory and more parallel processing units, and it later gained more data types to enable hardware-supported quantization.

        So it isn't like Google designed a TPU for a specific model or architecture. They're pretty general purpose in a narrow field (oxymoron, but you get the point).

        The set of operations Google designed into a TPU is very similar to what nvidia did, and it's about as broadly capable. But Google owns the IP and doesn't pay the premium and gets to design for their own specific needs.
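
        To make the "mostly matrix math" point concrete, here's a rough FLOP count for one transformer layer at a hypothetical size; the attention and MLP matmuls dwarf the elementwise work (softmax, norms, activations), which is why the hardware is built around matrix units:

            # Rough FLOP budget for one transformer layer; sizes are illustrative only.
            d_model, d_ff, seq = 8192, 4 * 8192, 2048

            qkv_proj = 3 * 2 * seq * d_model * d_model  # Q, K, V projections
            attn_mix = 2 * 2 * seq * seq * d_model      # QK^T and attention @ V
            out_proj = 2 * seq * d_model * d_model
            mlp      = 2 * 2 * seq * d_model * d_ff     # up- and down-projection
            matmul   = qkv_proj + attn_mix + out_proj + mlp

            elementwise = 10 * seq * d_model            # softmax, norms, activations (rough)
            print(f"matmul share of FLOPs ~= {matmul / (matmul + elementwise):.4%}")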

        • saagarjha 14 days ago

          There are plenty of matrix multiplies in the backward pass too. Obviously this is less useful when serving but it's useful for training.

      • edoceo 15 days ago

        I'd think no. They have the hardware and software experience, likely have next and next-next plans in place already. The big hurdle is money, which G has a bunch of.

      • [removed] 15 days ago
        [deleted]
  • canyon289 15 days ago

    I'm a research person building models, so I can't answer your questions well (save for one part).

    That is, as a research person using our GPUs and TPUs, I see firsthand how choices from the high-level Python code, through JAX, down to the TPU architecture all work together to make training and inference efficient. You can see a bit of that in the gif on the front page of the book. https://jax-ml.github.io/scaling-book/

    I also see how sometimes bad choices by me can make things inefficient. Luckily for me if my code/models are running slow I can ping colleagues who are able to debug at both a depth and speed that is quite incredible.

    And because we're on HN, I want to preemptively call out my positive bias for Google! It's a privilege to be able to see all this technology firsthand, work with great people, and do my best to ship this at scale across the globe.

ignoramous 15 days ago

> Another great resource to look at is the unsloth guides.

And folks at LMSys: https://lmsys.org/blog/

  Large Model Systems (LMSYS Corp.) is a 501(c)(3) non-profit focused on incubating open-source projects and research. Our mission is to make large AI models accessible to everyone by co-developing open models, datasets, systems, and evaluation tools. We conduct cutting-edge machine learning research, develop open-source software, train large language models for broad accessibility, and build distributed systems to optimize their training and inference.
hnpolicestate 14 days ago

This caught my attention: "But today even “small” models run so close to hardware limits".

Sounds analogous to the '60s and '70s, i.e. "even small programs run so close to hardware limits". If optimization and efficiency are dead in software engineering, they're certainly alive and well in LLM development.

jackhalford 15 days ago

Why does the unsloth guide for gemma 3n say:

> llama.cpp and other inference engines auto add a <bos> - DO NOT add TWO <bos> tokens! You should ignore the <bos> when prompting the model!

That makes me want to try exactly that? Weird.

nwhnwh 14 days ago

Nothing smart about making something that is not useful for humans.

LAC-Tech 15 days ago

If people at google are so smart why can't google.com get a 100% lighthouse score?

  • jeltz 14 days ago

    I have met a lot of people at Google; they have some really good engineers and some mediocre ones. But most importantly, they are just normal engineers dealing with normal office politics.

    I don't like how the grandparent mystifies this. This problem is just normal engineering. Any good engineer could learn how to do it.

  • usr1106 14 days ago

    Because most smart people are not generalists. My first boss was really smart and managed to found a university institute in computer science. The 3 other professors he hired were, ahem, strange choices. We 28-year-old assistants could only shake our heads. After fighting for a couple of years with his own hires, the founder left in frustration to found another institution.

    One of my colleagues was only 25, really smart in his field, and became a professor less than 10 years later. But he was incredibly naive about everyday chores. Buying groceries or filing taxes regularly resulted in major screw-ups.

    • jeltz 14 days ago

      I have met those supersmart specialists but in my experience there are also a lot of smart people who are more generalists.

      The real answer is likely internal company politics and priorities. Google certainly has people with the technical skills to solve it but do they care and if they care can they allocate those skilled people to the task?

      • gregorygoc 14 days ago

        My observation is that, in general, smart generalists are smarter than smart specialists. I work at Google, and it’s just that these generalist folks are extremely fast learners. They can cover the breadth and depth of an arbitrary topic in a matter of 15 minutes, just enough to solve the problem at hand.

        It’s quite intimidating how fast they can break down difficult concepts into first principles. I’ve witnessed this first hand and it’s beyond intimidating. Makes you wonder what you’re doing at this company… That being said, the caliber of folks I’m talking about is quite rare, like the top 10% of the top 1% of teams at Google.

        • jeltz 14 days ago

          That is my experience too. It sometimes seems the supersmart generalists are people whose strongest skill is learning.

  • ranger_danger 14 days ago

    Pro tip: they're just not. A lot of tech nerds really like to think they're a genius with all the answers ("why don't they just do XX"), but some eventually learn that the world is not so black and white.

    The Dunning-Kruger effect also applies to smart people. You don't stop when you are estimating your ability correctly. As you learn more, you gain more awareness of your ignorance and continue being conservative with your self estimates.

catigula 15 days ago

A lot of really smart people working on problems that don't even really need to be solved is an interesting aspect of market allocation.

  • YossarianFrPrez 15 days ago

    Can you explain what you mean about 'not needing to be solved'? There are versions of that kind of critique that would seem, at least on the surface, to better apply to finance or flash trading.

    I ask because scaling a system that a substantial chunk of the population finds incredibly useful, including for the more efficient production of public goods (scientific research, for example), does seem like a problem that a) needs to be solved from a business point of view, and b) should be solved from a civic-minded point of view.

    • windexh8er 15 days ago

      I think the problem I see with this type of response is that it doesn't take into context the waste of resources involved. If the 700M users per week is legitimate then my question to you is: how many of those invocations are worth the cost of resources that are spent, in the name of things that are truly productive?

      And if AI were truly the holy grail it's being sold as, then there wouldn't be 700M users per week wasting all of these resources as heavily as we are, because generative AI would have already solved for something better. It really does seem like these platforms aren't, and won't be, anywhere near as useful as they're continuously claimed to be.

      Just like Tesla FSD, we keep hearing about a "breakaway" model and the broken record of AGI. Instead of getting anything exceptionally better we seem to be getting models tuned for benchmarks and only marginal improvements.

      I really try to limit what I'm using an LLM for these days. And not simply because of the resource pigs they are, but because it's also often a time sink. I spent an hour today testing out GPT-5 and asking it about a specific problem I was solving for using only 2 well documented technologies. After that hour it had hallucinated about a half dozen assumptions that were completely incorrect. One so obvious that I couldn't understand how it had gotten it so wrong. This particular technology, by default, consumes raw SSE. But GPT-5, even after telling it that it was wrong, continued to give me examples that were in a lot of ways worse and kept resorting to telling me to validate my server responses were JSON formatted in a particularly odd way.

      Instead of continuing to waste my time correcting the model I just went back to reading the docs and GitHub issues to figure out the problem I was solving for. And that led me down a dark chain of thought: so what happens when the "teaching" mode rethinks history, or math fundamentals?

      I'm sure a lot of people think ChatGPT is incredibly useful. And a lot of people are bought in, not wanting to miss the boat, especially those who don't have any clue how it works and what it takes to execute any given prompt. I actually think LLMs have a trajectory that will be similar to social media's. The curve is different and, hopefully, I don't think we've seen the most useful aspects of it come to fruition yet. But I do think that if OpenAI is serving 700M users per week then, once again, we are the product. Because if AI could actually displace workers en masse today, you wouldn't have access to it for $20/month. And they wouldn't offer it to you at 50% off for the next 3 months when you go to hit the cancel button. In fact, if it could do most of the things executives are claiming, then you wouldn't have access to it at all. But, again, the users are the product, in very much the same way it played out with social media.

      Finally, I'd surmise that of those 700M weekly users, less than 10% of sessions are being used for anything productive like you've mentioned, and I'd place a high wager that even that 10% is wildly conservative. I could be wrong, but again, we'd know about it if that were the actual truth.

      • mlyle 15 days ago

        > If the 700M users per week is legitimate then my question to you is: how many of those invocations are worth the cost of resources that are spent, in the name of things that are truly productive?

        Is everything you spend resources on truly productive?

        Who determines whether something is worth it? Is price/willingness of both parties to transact not an important factor?

        I don't think ChatGPT can do most things I do. But it does eliminate drudgery.

      • hirvi74 15 days ago

        > so what happens when the "teaching" mode rethinks history, or math fundamentals?

        The person attempting to learn either (hopefully) figures out the AI model was wrong, or sadly learns the wrong material. The level of impact is probably relative to how useful the knowledge is in one's life.

        The good or bad news, depending on how you look at it, is that humans are already great at rewriting history and believing wrong facts, so I am not entirely sure an LLM can do that much worse.

        Maybe ChatGPT might just kill off the ignorant, like it already has? GPT already told a user to combine bleach and vinegar, which produces chlorine gas. [1]

        [1] https://futurism.com/chatgpt-bleach-vinegar

    • catigula 15 days ago

      [flagged]

      • hattmall 15 days ago

        The only solution to those people starving to death is to kill the people that benefit from them starving to death. It's a solved problem, the solution isn't palatable. No one is starving to death because of a lack of engineering prowess.

      • AdieuToLogic 15 days ago

        > People are starving to death and the world's brightest engineers are ...

        This is a political will, empathy, and leadership problem. Not an engineering problem.

      • seneca 15 days ago

        Famine in the modern world is almost entirely caused by dysfunctional governments and/or armed conflicts. Engineers have basically nothing to do with either of those.

        This sort of "there are bad things in the world, therefore focusing on anything else is bad" thinking is generally misguided.

      • trhway 15 days ago

        The existence of poor, hungry people feeds the fear of becoming poor and hungry, which drives those brightest engineers. I.e., things work as intended, unfortunately.

    • abletonlive 15 days ago

      They won’t be honest and explain it to you but I will. Takes like the one you’re responding to are from loathsome pessimistic anti-llm people that are so far detached from reality they can just confidently assert things that have no bearing on truth or evidence. It’s a coping mechanism and it’s basically a prolific mental illness at this point

      • ezst 14 days ago

        And what does that make you? A "loathsome clueless pro-llm zealot detached from reality"? LLMs are essentially next word predictors marketed as oracles. And people use them as that. And that's killing them. Because LLMs don't actually "know", they don't "know that they don't know", and won't tell you they are inadequate when they are. And that's a problem left completely unsolved. At the core of very legitimate concerns about the proliferation of LLMs. If someone here sounds irrational and "coping", it very much appears to be you.

      • jon-wood 14 days ago

        > so far detached from reality they can just confidently assert things that have no bearing on truth or evidence

        So not unlike an LLM then?

  • virgil_disgr4ce 15 days ago

    > working on problems that don't even really need to be solved

    Very, very few problems _need_ to be solved. Feeding yourself is a problem that needs to be solved in order for you to continue living. People solve problems for different reasons. If you don't think LLMs are valuable, you can just say that.

    • crawfordcomeaux 15 days ago

      The few problems humanity has that need to be solved:

      1. How to identify humanity's needs on all levels, including cosmic ones...(we're in the Space Age so we need to prepare ourselves for meeting beings from other places)

      2. How to meet all of humanity's needs

      Pointing this out regularly is probably necessary because the issue isn't why people are choosing what they're doing... it's that our systems actively disincentivize collectively addressing these two problems in a way that doesn't sacrifice people's wellbeing/lives... and most people don't even think about it like this.

    • catigula 15 days ago

      The notion that simply pretending not to understand that I was making a value judgment about worth counts as an argument is tiring.

  • vermilingua 15 days ago

    Well, we all thought advertising was the worst thing to come out of the tech industry, someone had to prove us wrong!