Comment by originalvichy 6 hours ago
Where in the world are you getting the numbers for how much energy video streaming uses? I am quite sure that, just as with LLMs, most of the energy goes into the initial encoding of the video, and nowadays any rational service encodes videos to several bitrates to avoid JIT transcoding.
Networking can’t take that much energy, unless perhaps we are talking about purely wireless networking with cell towers?
LLM inference is still quite power-hungry, while video decoding with hardware acceleration is much more efficient.
But we can do some estimates; heck, we can even ask GPT for some numbers.
Say you want to do 30 minutes of video playback (H.265) or 30 minutes of LLM inference on a generic consumer device. Ignoring where the model or the encoded video comes from, you get roughly a 4x difference:
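(A quick sketch of that arithmetic; the power figures are assumptions picked to land near that rough 4x, not measurements.)

    // Back-of-envelope only; all power figures below are assumptions.
    const minutes = 30;
    const hours = minutes / 60;

    // Assumed whole-device draw during hardware H.265 playback (screen + SoC + decode block).
    const videoPlaybackWatts = 15;

    // Assumed whole-device draw during local LLM inference on an RTX 3050-class accelerator.
    const llmInferenceWatts = 60;

    const videoWh = videoPlaybackWatts * hours; // ~7.5 Wh
    const llmWh = llmInferenceWatts * hours;    // ~30 Wh

    console.log(`30 min video playback: ~${videoWh.toFixed(1)} Wh`);
    console.log(`30 min LLM inference:  ~${llmWh.toFixed(1)} Wh`);
    console.log(`ratio: ~${(llmWh / videoWh).toFixed(1)}x`);  // ~4x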
This already assumes the optimised path: a working hardware H.265 decoder for playback, and for inference something on the level of an RTX 3050 (it could also be a TPU or NE).

While not the most scientific comparison, it's perhaps good to know that video decoding is practically always local, and a streaming service will use whatever decoder is available, possibly even switching codecs (AV1, H.265, H.264) depending on what the hardware supports and which licenses are in play. And if you have older hardware, some codecs won't exist in silicon at all, at which point you fall back to software decoding (very inefficient).
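The codec-switching part is something a web player can actually query. Here is a small sketch using the browser's MediaCapabilities API (the codec strings, resolution and bitrate are just example values); the powerEfficient flag roughly means "a hardware decoder is doing the work", and where it comes back false you're likely looking at software decoding:

    // Ask the browser which codecs it can decode efficiently, the way a
    // streaming player might before picking a rendition.
    const candidates: Record<string, string> = {
      "AV1":   'video/mp4; codecs="av01.0.05M.08"',
      "H.265": 'video/mp4; codecs="hvc1.1.6.L93.B0"',
      "H.264": 'video/mp4; codecs="avc1.640028"',
    };

    async function probeDecoders(): Promise<void> {
      for (const [name, contentType] of Object.entries(candidates)) {
        const info = await navigator.mediaCapabilities.decodingInfo({
          type: "file",
          video: {
            contentType,
            width: 1920,
            height: 1080,
            bitrate: 5_000_000, // 5 Mbit/s, example value
            framerate: 30,
          },
        });
        console.log(`${name}: supported=${info.supported}, powerEfficient=${info.powerEfficient}`);
      }
    }

    probeDecoders().catch(console.error);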
AI inferencing is mostly remote (at least the heavy loads), done in a datacenter, because local hardware availability is hit and miss, models are big, and spinning one up every time you just want to ask something is not very user friendly. Because in a datacenter you tend to pay for amperage per rack, you spec your AI inferencing hardware to eat that power: you're not saving any money or hardware life by leaving it idle. That means efficiency matters (more use out of a rack) but scaling down and idling isn't really a big deal (though it has slowly dawned on people that burning power 'because you can' is not a great model). So AI inferencing in a datacenter ends up more power-hungry as a result: because it can be, because it's faster, and because that's what attracts users.
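The "amperage per rack" point is really just a sizing exercise. A rough sketch with assumed numbers (circuit rating, derating and board power are all made up for illustration): you provision hardware to fill the budget, so it's specced to draw that power whether or not every watt is doing useful work.

    // Sketch of "spec to the rack budget"; every number here is an assumption.
    const rackVolts = 208;            // common datacenter circuit voltage
    const rackAmps = 30;              // assumed circuit rating
    const derate = 0.8;               // continuous-load derating
    const rackBudgetWatts = rackVolts * rackAmps * derate; // ~5 kW usable

    const accelBoardWatts = 350;      // assumed inference accelerator board power
    const serverOverheadWatts = 800;  // assumed CPUs, fans, NICs

    const accelsPerRack = Math.floor((rackBudgetWatts - serverOverheadWatts) / accelBoardWatts);
    console.log(`usable rack budget: ~${(rackBudgetWatts / 1000).toFixed(1)} kW`);
    console.log(`accelerators per rack: ~${accelsPerRack}`);
    // You pay for the circuit either way, so the incentive is to fill it.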
I would estimate that local llama3 inference uses less power than the same work done in a datacenter, simply because there is less power available locally (try finding a mass-market end-user device with that kind of power budget; you won't, only niche segments like gaming PCs and workstations will do).