ZML - High performance AI inference stack

ismailmaj 2 days ago

What would be the benefit of using ZML instead of relying on StableHLO/PJRT directly? The cost of porting models is certainly high.

gwenzek a day ago

ZML, the Zig library, is mostly a wrapper around StableHLO/PJRT. But it's a high-quality wrapper, and the tagged tensor syntax is really helpful for writing complex ops like dot or gather.
And ZML, the framework, also resolves issues with the complex dependency graph of StableHLO/PJRT.
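
For readers who haven't met tagged tensors: the idea is that each axis carries a name, so a contraction is specified by name rather than by position. Below is a minimal pure-Python sketch of that idea for 2-D data. It is not ZML's API; `tagged_dot` and the axis names are hypothetical, chosen only for illustration.

```python
def tagged_dot(a, a_axes, b, b_axes, contract):
    """Multiply two 2-D nested lists, contracting over the axis named `contract`.

    Hypothetical helper: it illustrates named-axis contraction only and
    is not how ZML implements or exposes the operation.
    """
    # Put the contracted axis last in `a` ...
    if a_axes.index(contract) == 0:
        a = [list(col) for col in zip(*a)]
        a_axes = a_axes[::-1]
    # ... and first in `b`, so a plain row-by-column product applies.
    if b_axes.index(contract) == 1:
        b = [list(col) for col in zip(*b)]
        b_axes = b_axes[::-1]
    out = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
    return out, (a_axes[0], b_axes[1])


x = [[1, 2, 3], [4, 5, 6]]    # axes: (batch, d)
w = [[1, 0, 1], [0, 1, 1]]    # axes: (out, d)
w_t = [[1, 0], [0, 1], [1, 1]]  # same weights stored as (d, out)

y1, y1_axes = tagged_dot(x, ("batch", "d"), w, ("out", "d"), contract="d")
y2, y2_axes = tagged_dot(x, ("batch", "d"), w_t, ("d", "out"), contract="d")
```

Both calls produce the same result even though the weights are laid out differently, which is the point: the caller names the axis to contract and never has to track which position it sits in.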

hsjdhdvsk 3 days ago

Hi ya! Want to say this looks awesome :) really interested in the sharded inference demo!!! You said it was experimental, is it in the examples folder at all?? (On phone atm, so apologies for not investigating further)

onurcel 3 days ago

First of all, great job! I think inference will become more and more important.

That being said, I have a question regarding ease of use. How difficult is it for someone with a Python/C++ background to get used to Zig and (re)write a model to use with ZML?

gwenzek 3 days ago

Hi, co-author here. Zig is way simpler than C++. So simple that in one afternoon I was able to pick up the language, rewrite the core of a C++ algorithm, and see speed gains (fastBPE, for reference).
Coming from Python, the hardest part is learning memory management. What helps with ZML is that the model code is mostly metaprogramming, so we can be a bit flexible there.
We have a high-level API that should feel familiar to PyTorch users (like myself), but improves on it in a few ways.

steeve 3 days ago

Pretty easy. Usually the hardest part is figuring out what the Python code is doing.

Palmik 3 days ago

Given that the focus is performance, do you have any benchmarks to compare against the likes of TensorRT-LLM?

gwenzek 2 days ago

It's a bit early to compare directly to TensorRT because we don't have a full-blown equivalent.
Note that our focus is on being platform-agnostic, easy to deploy and integrate, performant all around, and easy to tweak. We use the same compiler as JAX, so our performance is on par. But in general we believe we can gain on overall "tok/s/$" through shorter startup times, choosing the most efficient hardware available, and easily implementing new tricks like multi-token prediction.

koe123 2 days ago

I second this. It would help justify the time investment in a framework if it's clear how it stacks up!

montyanderson 3 days ago

my dreams have come true. hardware-agnostic ml primitives in a typed, compiled language.

my only question is: is zig stable enough to base such a project on?

gwenzek 2 days ago

Zig has been relatively stable for the past few years as far as the main Zig code goes. What has changed the most is the `build.zig` build system (which we aren't using).
We are also looking ahead at the Zig roadmap, trying to anticipate upcoming breaking changes and isolate our users from them.

dartos 3 days ago

Stable as in unchanging, no.
Stable as in reliable enough, I’d say so.
