Hey folks,

Last week we covered Temperature & Sampling and why your AI gives different answers every time. This week: model quantization, the trick that lets you run a 70-billion parameter model on a laptop.

You've probably seen those Reddit posts or Hacker News threads where someone runs Llama 70B on a MacBook with 64GB of RAM. And you think: wait, shouldn't that model be hundreds of gigabytes? How does it even fit? The answer is almost always quantization.

The Storage Problem

Every parameter in a neural network is a number. By default, most models store each parameter as a 16-bit floating-point number (FP16). That means each parameter takes 2 bytes of memory.

Let's do some quick math for popular models:

Model              Parameters     Memory at FP16
Llama 3 8B         8 billion      ~16 GB
Llama 3 70B        70 billion     ~140 GB
Llama 3.1 405B     405 billion    ~810 GB

That 405B model needs over 800 GB just to load the weights. Not to run inference. Just to sit in memory. Most servers don't have that kind of GPU memory, let alone your laptop.
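The arithmetic behind those numbers is just parameter count times bytes per parameter. A minimal sketch (weight_memory_gb is a hypothetical helper; this counts decimal gigabytes for the weights alone, ignoring activations and KV cache):

```python
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Weights-only memory: (billions of params) x (bits / 8) bytes each."""
    return params_billions * bits / 8

for name, params in [("Llama 3 8B", 8), ("Llama 3 70B", 70), ("Llama 3.1 405B", 405)]:
    print(f"{name}: ~{weight_memory_gb(params, 16):.0f} GB at FP16")
```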

What Is Quantization?

Quantization reduces the precision of those numbers. Instead of storing each parameter as a 16-bit float, you store it as an 8-bit integer, a 4-bit integer, or sometimes even lower.

Think of it like rounding. If a weight is 0.23481947, you don't actually need all those decimal places to get a good prediction. You can round it to something coarser and the model still works. Surprisingly well, in fact.

The common quantization levels:

  • FP16 (16-bit float): the baseline; the precision most modern models are trained and released in.
  • INT8 (8-bit integer): half the memory of FP16. Minimal quality loss for most models.
  • INT4 (4-bit integer): quarter the memory. Noticeable but often acceptable quality loss.
  • 2-bit and lower: experimental. Quality degrades fast, but people keep pushing the boundary.

Here's what that does to our models:

Model              FP16       INT8       INT4
Llama 3 8B         ~16 GB     ~8 GB      ~4 GB
Llama 3 70B        ~140 GB    ~70 GB     ~35 GB
Llama 3.1 405B     ~810 GB    ~405 GB    ~203 GB

That 70B model at 4-bit fits in 35 GB. Now your MacBook with 64 GB of unified memory can handle it. That's the magic.

How Quantization Actually Works

There are a few different approaches, and the differences matter.

Post-Training Quantization (PTQ) is the simplest. You take a fully trained FP16 model and convert the weights to lower precision after the fact. No retraining needed. You just map the floating-point values to a smaller set of integers using a scale factor.

Say a layer has weights ranging from -1.0 to 1.0. With INT8, you have 256 possible values. So you divide that range into 256 bins and snap each weight to the nearest bin. The model loses some nuance, but the overall behavior is preserved.
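That bin-snapping can be sketched in a few lines. This is a toy symmetric INT8 scheme (one scale per tensor, integers in [-127, 127]), not any specific library's implementation:

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: map floats in
    [-max_abs, max_abs] to integers in [-127, 127] via one scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127  # float value represented by one integer step
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats: each integer times the scale."""
    return [q * scale for q in quantized]

weights = [0.23481947, -0.91, 0.5, -0.002]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
```

Each recovered weight lands within half a bin (scale / 2) of the original, which is the "lost nuance" the model has to tolerate.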

Quantization-Aware Training (QAT) is more involved. You simulate the lower precision during training itself, so the model learns to be robust to rounding errors. QAT generally produces better results than PTQ at the same bit width, but it requires retraining. For very large models, that retraining cost can be significant.
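The core trick in QAT is "fake quantization": during the forward pass you quantize and immediately dequantize, so the model trains against the rounding error it will face at inference. A conceptual sketch (gradients flow through via a straight-through estimator in real implementations, which is omitted here):

```python
def fake_quant(weights, bits=4):
    """Simulate low precision in the forward pass: round each weight
    to the nearest representable level, then return it as a float."""
    levels = 2 ** (bits - 1) - 1      # e.g. 7 positive levels for 4-bit
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]

# The loss is computed on the rounded weights, so training "sees"
# the quantization noise and learns to compensate for it.
rounded = fake_quant([0.7, -0.33, 0.12], bits=4)
```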

GPTQ, AWQ, and GGUF are formats you'll see in practice. GPTQ and AWQ are quantization methods designed specifically for large language models. GGUF is a file format (used by llama.cpp) that supports various quantization levels and runs on CPUs. When you download a model file labeled something like Q4_K_M, that's a specific GGUF quantization scheme at roughly 4-bit precision.

What You Lose

Quantization isn't free. You're throwing away information, and that has consequences.

At INT8, the loss is genuinely hard to measure on most benchmarks. I've seen comparisons where INT8 quantized models score within 1% of the FP16 original on standard tasks. For most applications, this is a no-brainer.

At INT4, things get more interesting. The model still works well for general conversation and common tasks. But you start to see degradation on:

  • Complex reasoning chains
  • Precise numerical calculations (models were already bad at this, quantization makes it worse)
  • Tasks requiring recall of less common knowledge
  • Following very specific formatting instructions

At 2-bit and below, quality drops noticeably. Responses become more generic, the model hallucinates more, and it struggles with nuance. Usable for some applications, but you feel the difference.

The practical rule I've seen hold up: go INT8 if you can afford the memory, INT4 if you can't. Below 4-bit, test carefully for your specific use case.

Why This Matters for You

If you're using commercial APIs like GPT-4 or Claude, quantization is happening behind the scenes. Providers almost certainly serve quantized versions of their models in production to reduce costs and latency. You don't control it, but now you know why the same model might feel slightly different across providers or over time.

If you're running open-source models locally or self-hosting, quantization is one of the first decisions you'll make. The tools have gotten remarkably good:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize to 4-bit on load: NF4 is a 4-bit data type tuned for
# normally distributed weights; matmuls still compute in FP16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quantization_config,
    device_map="auto"  # spread layers across available GPUs/CPU
)

That's it. A handful of lines and you're running a 4-bit quantized Llama model. The bitsandbytes library handles the conversion on the fly.

If you're a product manager evaluating self-hosted vs. API options, quantization changes the cost equation dramatically. A model that needs four A100 GPUs at full precision might need one at INT4. That's a 4x reduction in GPU costs.
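The GPU math is simple division with a ceiling. A sketch assuming 40 GB A100s (A100s also come in an 80 GB variant, which changes the numbers):

```python
import math

def gpus_needed(model_gb: float, gpu_gb: float = 40) -> int:
    """GPUs required just to hold the weights in memory."""
    return math.ceil(model_gb / gpu_gb)

print(gpus_needed(140))  # Llama 3 70B at FP16
print(gpus_needed(35))   # Llama 3 70B at INT4
```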

Quantization vs. Distillation

In an earlier issue we covered distillation, and it's worth noting how these two approaches compare. They're both about making models smaller and faster, but they work differently.

Distillation trains a smaller model from scratch to mimic a larger one. You end up with a completely different model with fewer parameters. The architecture changes.

Quantization keeps the same model and same architecture, just reduces the precision of the numbers. The parameter count stays the same, but each parameter takes less memory.

You can also combine them. Distill a 70B model down to 8B, then quantize that 8B model to INT4. Now you've gone from 140 GB to roughly 4 GB. That's the kind of compression that puts capable models on a phone.

Key Takeaway

Quantization reduces the numerical precision of model weights to shrink memory usage and speed up inference, with surprisingly small accuracy trade-offs.

  • FP16 is the standard training precision: 2 bytes per parameter
  • INT8 cuts memory in half with near-zero quality loss for most tasks
  • INT4 cuts memory to a quarter but starts showing degradation on complex reasoning
  • Post-training quantization is fast and easy; quantization-aware training gives better quality but costs more
  • GGUF, GPTQ, and AWQ are the formats you'll actually encounter in practice
  • Quantization is why people can run 70B models on consumer hardware
  • Commercial API providers likely quantize their models in production too

What's Next

Next week: Transfer Learning (Standing on giants' shoulders).

Read the full AI Learning series -> Learn AI

Thanks for reading! Got questions, feedback, or want to chat about AI? Hit reply – I read and respond to every message. And if you found this valuable, feel free to forward it to a friend who'd benefit!

Pranay
Infolia.ai
