Honey, I shrunk the LLM! A Beginner's Guide to Quantization


Hands on If you hop on Hugging Face and start scrolling through large language models, you'll quickly notice a trend: Most have been trained at 16-bit floating-point (FP16) or Brain-float (BF16) precision.


FP16 and BF16 have become quite popular in machine learning, not only because they strike a good balance between accuracy, throughput, and model size, but also because both data types are widely supported across the vast majority of hardware, be it CPUs, GPUs, or dedicated AI accelerators.

The problem comes when you try to run models, especially larger ones, with 16-bit tensors on a single chip. With two bytes per parameter, a model like Llama-3-70B requires at least 140 GB of very fast memory, and that doesn't include other overhead, such as the key-value cache.
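To see where that 140 GB figure comes from, here's a quick back-of-the-envelope sketch in Python. The parameter count and bytes-per-parameter values are the only inputs; activations, the key-value cache, and framework overhead are deliberately ignored.

```python
# Rough memory math for model weights at different precisions.
PARAMS = 70e9  # Llama-3-70B parameter count

BYTES_PER_PARAM = {
    "FP32": 4,
    "FP16/BF16": 2,
    "INT8": 1,
    "INT4": 0.5,
}

for dtype, nbytes in BYTES_PER_PARAM.items():
    gigabytes = PARAMS * nbytes / 1e9
    print(f"{dtype:>10}: {gigabytes:,.0f} GB just for the weights")

# FP16/BF16 works out to roughly 140 GB, which is why a single accelerator
# typically can't hold a 70B model at 16-bit precision.
```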

To get around this, you can either split the model across multiple chips – or even servers – or you can compress the model weights to a lower precision in a process called quantization.
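To get a feel for what quantization actually does to the numbers, here's a minimal sketch of symmetric round-to-nearest INT8 quantization in NumPy. Real quantization schemes are more sophisticated, typically using per-channel or group-wise scales and more careful rounding, and the helper names below are purely illustrative.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric round-to-nearest INT8 quantization of a weight tensor."""
    # One scale for the whole tensor: map the largest-magnitude weight to 127.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original values at compute time.
    return q.astype(np.float32) * scale

# Toy example: a 4x4 "weight matrix" in FP32.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)

print("original bytes: ", w.nbytes)   # 64 bytes at FP32
print("quantized bytes:", q.nbytes)   # 16 bytes at INT8
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
```

The storage saving is exactly the ratio of the bit widths: going from 16-bit weights to 8-bit halves the memory footprint, at the cost of a small rounding error on every weight.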
