Benjamin is an Elixir and functional programming enthusiast, and the maintainer of Erlang’s Apache Arrow implementation. Though not a core contributor, he frequently contributes to the Nx and Livebook projects, with occasional contributions to Hexpm and Hex. As a 16-year-old high schooler, he has no professional programming experience, but he contributes to open source whenever time permits.
In his free time he likes to read non-fiction and fantasy, listen to 70s and 80s music (while grumbling that nobody at school shares his taste), and crack terrible jokes. Beware: he’s a bit of a Linux and Emacs zealot.
The Nx project brings Elixir to the domain of AI and ML. It makes a bold promise: to leverage the BEAM in ML workflows, enabling us to run ML models concurrently, distributed over multiple nodes, and partitioned across several GPUs.
Closer inspection reveals that Nx is walking a tightrope: its computations are powered by NIFs, whose scheduling is a tricky business, and by GPUs, which can serve only a small number of concurrent callers. This raises the question: how does Nx deliver the high throughput it boasts?
Nx answers this question with “Servings” and “Batches”: mechanisms that group requests together, reducing and distributing the calls made to the NIF and the GPU. But how exactly does that work? What trade-offs do we make by batching requests?
In this talk we will examine Nx’s Servings: how they address the scheduling problem, the trade-offs and performance characteristics of their approach, and the other constraints that arise when running an ML service.