Small Language Models vs Large Language Models
How to choose between a small on-device model and a large general model for cost, latency, privacy, and narrow fine-tuned tasks.
“Small language model” has no fixed parameter cutoff. The leading survey defines small language models (SLMs) by capability, not size, and notes that definitions in the literature range from under 1 billion parameters to roughly 10 billion [10]. The practical question is not which model is biggest but which model fits your task, your budget, and your hardware.

A decision flow
Most builders pick a model by reflex and reach for the largest one available. Reverse that habit. Start from the shape of the task, then size the model to fit.
NVIDIA proposes exactly this split for agentic systems. Use SLMs for the specialized calls an agent repeats, and reach for an LLM only when broad ability is required [8].
SLMs vs LLMs at a glance
| | Small language models | Large language models |
|---|---|---|
| Typical size | Under 1B to about 10B [10] | Tens to hundreds of B |
| Cost | Low: runs on cheap hardware | High: GPU clusters or API fees |
| Latency | Fast, local inference | Slower, network round trips |
| Privacy | On-device, data stays put | Usually sent to a provider |
| Customizability | Easy to fine-tune | Costly to fine-tune |
| Best for | Narrow repeated tasks | Broad open-ended reasoning |
The rest of this article fills in the numbers behind each row.
What counts as small
No single threshold separates small from large. The comprehensive survey of SLMs defines them by what they can do on constrained hardware, not by a parameter count [10]. In practice, you will see three loose bands:
- Tiny: 135M to about 1.5B. Runs on a phone or laptop CPU.
- Small: roughly 2B to 9B. Runs on a single consumer GPU.
- Mid: 10B to 27B. Sits at the upper edge of “small” and needs a strong GPU.
Treat these as guidance, not law. A 3B model that solves your task beats a 70B model that solves it more slowly and at higher cost.
The main SLM families
The open-model field is crowded. Here are the families worth knowing, with sizes and verified benchmarks.
Phi (Microsoft)
Phi-3-mini has 3.8 billion parameters, was trained on 3.3 trillion tokens, and is small enough to run on a phone. It scores 69% on MMLU (a broad knowledge benchmark) and 8.38 on MT-bench, rivaling Mixtral 8x7B and GPT-3.5 [1]. Phi-3-small (7B) and Phi-3-medium (14B) reach 75% and 78% on MMLU, respectively [1]. Phi-4 is a 14B model that surpasses its teacher GPT-4 on STEM-focused question answering, going beyond distillation through heavy use of synthetic training data [2].
Gemma (Google)
The original Gemma shipped at 2B and 7B and beat similarly sized open models on 11 of 18 tasks [3]. Gemma 2 spans 2B, 9B, and 27B. Its 2B and 9B variants train with knowledge distillation instead of plain next-token prediction [4]. Gemma 3 spans 1B to 27B, adds vision, and supports at least 128K context. Its 4B instruct variant competes with the previous-generation 27B model [5].
Qwen2.5 (Alibaba)
Qwen2.5 was released in seven sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B, with quantized variants for each [6]. That spread lets you pick a size per task without changing model families.
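Picking a size per task can be sketched as a search from the smallest model upward. The snippet below is a minimal illustration, not a benchmark harness: `smallest_passing_size` and `stub_evaluate` are hypothetical names, and the stub returns placeholder scores, not measured results. In real use, `evaluate` would run your own task-specific test set against each model size.

```python
from typing import Callable, Optional, Sequence

# Qwen2.5 sizes listed in the text, in billions of parameters.
QWEN25_SIZES_B = (0.5, 1.5, 3, 7, 14, 32, 72)


def smallest_passing_size(
    evaluate: Callable[[float], float],
    target: float,
    sizes_b: Sequence[float] = QWEN25_SIZES_B,
) -> Optional[float]:
    """Return the smallest size whose score meets the target, or None if none does."""
    for size in sorted(sizes_b):
        if evaluate(size) >= target:
            return size
    return None


def stub_evaluate(size_b: float) -> float:
    """Placeholder evaluator (assumption): pretends sizes of 3B and up pass."""
    return 0.9 if size_b >= 3 else 0.5


print(smallest_passing_size(stub_evaluate, target=0.85))  # 3
```

Searching upward means you stop at the first size that clears your quality bar, so you never pay for a larger model than the task needs.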
Llama 3.2 lightweight (Meta)
Meta's Llama 3.2 lightweight models come in 1B and 3B, support 128K context, and target on-device summarization and rewriting. Meta trained them with pruning and distillation [11]. The 1B and 3B versions quantized with QLoRA and SpinQuant achieve an average 56% model-size reduction, 41% memory reduction, and a 2-4x speedup versus the BF16 baseline [12].
SmolLM (Hugging Face)
SmolLM ships at 135M, 360M, and 1.7B. The 1.7B variant outperforms other sub-2B models, including Phi-1.5 and Qwen2-1.5B [13]. These are the models to reach for when you need something that runs on almost any device.
TinyLlama
TinyLlama is a 1.1B model pretrained on about 3 trillion tokens, built on the Llama 2 architecture [7]. It shows how far a tiny model goes when you train it on a large token budget.
How they get small
Two techniques do most of the work: distillation and quantization.
Knowledge distillation originates with Hinton, Vinyals, and Dean in 2015. A large “teacher” model produces soft probability targets by raising the temperature of its softmax, and a smaller “student” model learns to match them, compressing the teacher’s knowledge into a deployable model [9]. Gemma 2’s 2B and 9B models train this way, citing that original work [4]. Phi-4 takes the idea further and beats its own teacher on STEM tasks through synthetic data [2].
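The soft-target idea can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not the training code of any model named above: the function names are hypothetical, the loss is one common formulation (KL divergence between temperature-softened distributions, scaled by the square of the temperature), and it handles a single example with small logits.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list:
    """Softmax over logits / T. A higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's softened prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
print(distillation_loss(teacher, teacher))          # zero when the student matches
print(distillation_loss([1.0, 1.0, 1.0], teacher))  # positive when it does not
```

In practice this term is usually combined with an ordinary cross-entropy loss on the true labels, and a framework computes it over batches of logits.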
Quantization shrinks the model, usually after training. It stores weights at lower precision, so a model that used 16 bits per weight might use 4 bits or fewer. Quantized Llama 3.2 models cut size by 56% and memory by 41% on average while running 2-4x faster than the BF16 baseline [12]. The llama.cpp project enables CPU-only local inference through the GGUF format and supports anything from 1.5-bit to 8-bit quantization on consumer hardware [14].
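A rough sketch of both halves of that idea follows: the storage arithmetic and the rounding itself. It assumes the simplest possible scheme (symmetric round-to-nearest with a single scale), which is not the method used by llama.cpp, QLoRA, or SpinQuant. The function names are hypothetical, and the footprint estimate counts weights only, ignoring scales, metadata, and any tensors kept at higher precision.

```python
from typing import List, Sequence, Tuple


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_absmax(weights: Sequence[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: Sequence[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# A 3.8B-parameter model (the Phi-3-mini size from the text) at two precisions.
print(weight_footprint_gb(3.8e9, 16))  # 7.6
print(weight_footprint_gb(3.8e9, 4))   # 1.9

codes, scale = quantize_absmax([0.7, -0.3, 0.1, 0.0], bits=4)
print(codes, dequantize(codes, scale))
```

Each restored weight differs from the original by at most half the scale. That rounding error is the accuracy cost that more careful schemes try to reduce.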
When SLMs win
Reach for a small model when one of these matters more than raw capability.
- Cost: An SLM runs on a single consumer GPU or a CPU. You avoid per-token API fees and GPU cluster rental.
- Latency: Local inference skips the network round trip. Quantized Llama 3.2 runs 2-4x faster than its BF16 baseline [12].
- Privacy: An on-device model keeps user data on the device. Nothing leaves the machine, which suits regulated or sensitive workloads.
- On-device deployment: Phi-3-mini fits on a phone [1]. SmolLM at 135M fits almost anywhere [13].
- Narrow fine-tuned tasks: A small model fine-tuned on your domain often beats a large general model on that domain, and it is far cheaper to fine-tune.
The agentic argument
Most agents repeat a handful of specialized actions: parse a request, call a tool, format a response. They rarely need open-ended genius on every step.
NVIDIA argues that SLMs are “sufficiently powerful, inherently more suitable, and necessarily more economical” for agentic systems built around a few repeated tasks [8]. The proposal is a heterogeneous architecture: route the repetitive, well-scoped calls to SLMs, and invoke an LLM only when a step genuinely needs broad ability [8]. You pay LLM prices for the few hard calls, not the thousands of routine ones.
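The heterogeneous architecture can be sketched as a small dispatch function. This is an illustrative assumption, not NVIDIA's implementation: the step names in `ROUTINE_STEPS` are taken from the examples in the text, and `fake_slm` and `fake_llm` are hypothetical stand-ins for real model clients.

```python
from typing import Callable

# Step types treated as routine (assumption, based on the examples in the text).
ROUTINE_STEPS = {"parse_request", "call_tool", "format_response"}


def route(
    step_type: str,
    prompt: str,
    slm: Callable[[str], str],
    llm: Callable[[str], str],
) -> str:
    """Send routine, well-scoped steps to the SLM and everything else to the LLM."""
    model = slm if step_type in ROUTINE_STEPS else llm
    return model(prompt)


def fake_slm(prompt: str) -> str:
    """Stand-in for a local small-model call."""
    return "slm:" + prompt


def fake_llm(prompt: str) -> str:
    """Stand-in for a hosted large-model call."""
    return "llm:" + prompt


print(route("format_response", "render the result as JSON", fake_slm, fake_llm))
print(route("open_ended_plan", "design a migration strategy", fake_slm, fake_llm))
```

A static allow-list is the simplest routing rule. A production system might instead fall back to the LLM when the SLM's output fails validation.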
When you still need an LLM
Small does not mean better for everything. Keep a large model when:
- The task needs broad world knowledge across many domains at once.
- The reasoning is multi-step, open-ended, or novel with no fixed shape.
- You synthesize across long, varied context where general ability pays off.
- You cannot predict the inputs, so you cannot fine-tune a narrow model.
The honest framing is a portfolio. Use SLMs where they fit and LLMs where they do not, and route between them.
Try a small model locally
You can run a small model on your own machine in a few minutes. With Ollama, pull a quantized Phi-3 and prompt it:
# Pull a quantized small model (under 3 GB) and run it locally
ollama pull phi3:mini
ollama run phi3:mini "Summarize this in one sentence: small models trade breadth for speed and privacy."
To call llama.cpp directly with a GGUF file, build the binary and point it at a downloaded model [14]:
# Clone and build llama.cpp for CPU-only inference
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Run a 4-bit quantized model on CPU, no GPU required
./build/bin/llama-cli \
  -m ./models/phi-3-mini-4k-instruct-q4.gguf \
  -p "List three tasks where a small model beats a large one."
Both run on a laptop with no API key and no data leaving the machine.
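To call the local model from a program instead of the terminal, Ollama also serves an HTTP API on the same machine. The sketch below uses only the Python standard library and assumes that Ollama is running on its default port (11434) and that `phi3:mini` has already been pulled. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes a default installation).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "phi3:mini") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "phi3:mini", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Example call (requires a running Ollama server):
# print(generate("List three tasks where a small model beats a large one."))
```

The request goes to `localhost`, so the prompt and the response stay on the machine, just as with the command-line examples.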
Further reading
- Phi-3 Technical Report (Microsoft): the 3.8B model that runs on a phone and rivals far larger models.
- Phi-4 Technical Report (Microsoft): a 14B model that surpasses its GPT-4 teacher on STEM through synthetic data.
- Gemma 3 Technical Report (Google DeepMind): 1B to 27B with vision and 128K context, where the 4B rivals the prior 27B.
- Small Language Models are the Future of Agentic AI (NVIDIA): the case for routing repetitive agent calls to SLMs.
- A Comprehensive Survey of Small Language Models (Wang et al.): the capability-based definition this article uses.
- Distilling the Knowledge in a Neural Network (Hinton et al.): the original distillation paper behind modern small models.
- llama.cpp: CPU-only local inference with GGUF and 1.5-bit to 8-bit quantization.
- Llama 3.2 quantized models (Meta AI): 56% smaller, 41% less memory, 2-4x faster.
- LLM landscape 2026: how the large models compare today.
- Multi-model routing: build the SLM-first, LLM-fallback architecture in practice.
- Ollama: the easiest way to pull and run small models locally.
- Fine-tuning: how to specialize a small model on your domain.
- AI subscription pricing 2026: what large-model API access costs.