SambaNova Cloud serves Llama 3.1 405B with 100+ tokens/s

Not to be outdone by rival AI system upstarts, SambaNova has launched its own inference cloud that it says is ready to serve Meta's biggest models faster than the rest.

The cloud offering is one of several that have emerged during the AI boom to provide API access to popular open-weight models. Most of these are GPU-based, but for the more boutique vendors dealing in specialized hardware, such as Cerebras, Groq, and now SambaNova, it seems whoever can get the biggest model to spit out tokens the fastest has a leg up.

If you're not familiar, tokens are the units in which large language models encode words, word fragments, punctuation, and figures. So the faster your infrastructure can generate tokens, the less time you have to wait for a response.
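For illustration, here's a minimal Python sketch of what tokenization looks like and how a generation rate translates into wait time. It uses OpenAI's tiktoken library purely as an example tokenizer (Llama 3.1 ships its own, so exact token counts will differ), and the 500-token response length is a hypothetical figure, not anything SambaNova quoted.

```python
# Illustration only: tiktoken is an example tokenizer, not Llama 3.1's.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "SambaNova Cloud serves Llama 3.1 405B with 100+ tokens/s"
tokens = enc.encode(text)

print(f"{len(tokens)} tokens: {tokens}")
print([enc.decode([t]) for t in tokens])  # the individual fragments

# Wait time scales inversely with generation speed.
tokens_per_second = 132   # SambaNova's claimed rate
response_length = 500     # hypothetical response size in tokens
print(f"~{response_length / tokens_per_second:.1f} s to generate the response")
```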

According to CEO Rodrigo Liang, SambaNova has managed to get Meta's 405-billion-parameter Llama 3.1 model (more than twice the size of OpenAI's GPT-3.5) to churn out tokens at a rate of 132 per second, and at the full 16-bit precision it was trained at, no less.
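To put that in perspective, here's a back-of-envelope sketch of what serving 405 billion parameters at 16-bit precision implies for memory. It assumes two bytes per weight and ignores the KV cache and activations; the 80 GB figure is a hypothetical accelerator capacity for comparison, not anything SambaNova has disclosed.

```python
# Rough arithmetic: memory footprint of 405B weights at 16-bit precision.
import math

params = 405e9        # parameter count
bytes_per_param = 2   # 16-bit (bf16/fp16) weights

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~810 GB

# A hypothetical 80 GB accelerator would need this many devices
# just to hold the weights, before any KV cache or activations.
print(f"80 GB accelerators for weights: {math.ceil(weights_gb / 80)}")  # 11
```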
