Fine-Tuning LLMs - A Practical Guide

When and how to fine-tune large language models, covering data preparation, training approaches (full fine-tuning, LoRA, QLoRA), evaluation, and cost considerations.

Added 28 Mar 2026 10 min read Updated 30 May 2026

#fine-tuning #LLM #training #machine-learning #models

Learn this your way

Read Guided course

Hands weaving glowing red threads beside a lone figure conducting a luminous tower, a metaphor for fine-tuning a pre-trained model by threading in new data and steering its behaviour. — Fine-tuning is threading new data through a model you did not train from scratch: adapt the weights, then steer the behaviour.

Fine-tuning adapts a pre-trained language model to a specific task or domain by training it on additional data. It is one of the most misunderstood techniques in applied AI. Teams often fine-tune when prompting would suffice, or skip fine-tuning when it would provide significant improvements. This guide covers when fine-tuning is appropriate, how to do it effectively, and how to avoid common pitfalls.

Step 1 Prepare data Collect, review, and split high-quality examples. Quality beats volume.

→

Step 2 Choose method Full fine-tuning, LoRA, QLoRA, or DoRA, based on budget and GPU memory.

→

Step 3 Train Run the training loop, watching validation loss for overfitting.

→

Step 4 Evaluate Compare against the prompted base model on held-out test data.

→

Step 5 Deploy and serve Serve the adapter or merged model, then plan periodic retraining.

When to Fine-Tune (and When Not To)

Fine-Tune When

The task requires a specific output format that prompting cannot reliably produce. If you need the model to consistently output a particular JSON schema, classification label, or structured format, fine-tuning encodes this behavior more reliably than prompting.

Domain-specific language is critical. Legal, medical, financial, or technical domains have specialized terminology and reasoning patterns. Fine-tuning on domain data improves the model’s fluency and accuracy in these contexts.

You need consistent behavior at scale. A fine-tuned model produces more consistent outputs than a prompted model for the same task. When running thousands of predictions, this consistency matters.

Cost optimization. A fine-tuned smaller model can match the performance of a larger prompted model for specific tasks, at significantly lower inference cost.

Do Not Fine-Tune When

Prompting works well enough. If prompt engineering with few-shot examples achieves your quality targets, fine-tuning adds unnecessary complexity and cost.

You lack sufficient data. Fine-tuning requires hundreds to thousands of high-quality examples. If you have fewer than 100 examples, prompt engineering with few-shot examples is more practical.

The task changes frequently. Fine-tuning creates a fixed model. If requirements change weekly, you will need to retrain constantly. Prompting is more adaptable.

You need the model’s full general capability. Fine-tuning narrows the model’s capabilities to the fine-tuning task. A fine-tuned customer service model may lose capability on unrelated tasks.

Data Preparation

Data quality is the single most important factor in fine-tuning success.

Dataset Requirements

Volume. Minimum 100 examples for simple tasks, 500-1000 for complex tasks, 2000+ for best results. More data generally helps, but quality matters more than quantity.

Quality. Every example should be a perfect example of the desired behavior. If you would not accept the output from a human, do not include it in training data. One bad example can teach the model bad habits.

Diversity. Cover the full range of inputs and outputs the model will encounter. Include easy cases, hard cases, edge cases, and examples of every category or output type.

Format. Most fine-tuning approaches use conversation format: system prompt, user message, assistant response. Match the format to your inference use case.

Dataset Creation Process

Collect seed examples. Gather real examples from your application, or have domain experts create them.
Quality review. Have multiple reviewers verify each example. Remove or correct any that are ambiguous, incorrect, or inconsistent.
Augment if needed. Use an LLM to generate additional examples, but always have humans verify the generated examples. Never fine-tune on unreviewed synthetic data.
Split into train/validation/test. Use 80/10/10 or 90/5/5 splits. Never evaluate on training data.

Fine-Tuning Approaches

At a glance, the four main approaches trade quality against memory and cost:

	Full fine-tuning	LoRA	QLoRA	DoRA
Updates	All weights	Small adapters	Adapters on 4-bit base	Direction of decomposed weights
GPU memory	Highest	Low	Lowest	Low
Quality	Best	Near-full	Slightly below LoRA	Closest to full
Inference latency	None added	None when merged	None when merged	None when merged
Best for	Large data, deep domain shift	Most practical cases	Limited GPU memory	Quality-sensitive adapters

Full Fine-Tuning

Update all model parameters on your dataset. Produces the best results but requires significant compute (GPU hours) and stores a full copy of the model weights.

When to use: Large datasets (10K+ examples), significant domain adaptation needed, budget for compute.

LoRA (Low-Rank Adaptation)

Train small adapter matrices that modify the model’s behavior without changing the original weights. Much cheaper than full fine-tuning, with results that are often comparable.

The key insight (Hu et al., 2022): the weight updates ΔW needed to adapt a model to a new task are intrinsically low-rank. Rather than updating the full weight matrix W ∈ ℝ^(d×k), LoRA decomposes the update as ΔW = BA where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with rank r ≪ min(d, k). A 4096×4096 weight matrix has 16.8M parameters; at rank 8, the LoRA adapter has only 65K, a 256× reduction in trainable parameters.

When to use: Most fine-tuning use cases. LoRA has become the default approach for practical fine-tuning. Hugging Face’s peft library provides a standard implementation.

Key parameters: Rank (r) controls adapter capacity. Start with r=8 or r=16. Higher rank captures more complex adaptations but costs more and risks overfitting. The alpha scaling parameter is typically set to r or 2r.

QLoRA

Combines LoRA with model quantization (Dettmers et al., 2023). The base model is loaded in 4-bit NormalFloat (NF4) precision using bitsandbytes, reducing memory requirements dramatically. A 65B parameter model that requires ~130GB in full precision fits in ~48GB with QLoRA. Enables fine-tuning large models on consumer GPUs.

When to use: When GPU memory is limited. Quality is slightly lower than full LoRA but the cost reduction is substantial. The NF4 data type is specifically designed to minimize quantization error for normally-distributed weights.

Beyond QLoRA: the 2024-2026 PEFT frontier

LoRA started a research line, and several successors now close the small remaining gap to full fine-tuning. You do not need all of them, but it helps to know the landscape.

DoRA (Weight-Decomposed Low-Rank Adaptation). Splits each weight into a magnitude and a direction, then applies LoRA to the direction only. It tracks full fine-tuning quality more closely than plain LoRA and keeps the merge-back property, so it adds no inference latency. A common 2026 starting point is DoRA at r=16 targeting all linear layers.
PiSSA. Initialises the adapters from the principal singular vectors of the original weights rather than at random, which speeds and stabilises convergence.
VeRA. Freezes a shared pair of random matrices and trains only tiny scaling vectors, cutting trainable parameters even further than LoRA for large multi-adapter fleets.
GaLore. Projects the gradients (not the weights) into a low-rank subspace, which allows full-parameter training in far less memory. It is a different trade-off from LoRA: full weights update, but the optimiser footprint shrinks.

For the mechanism behind all of these, see the dedicated glossary entry on LoRA and QLoRA .

Supervised fine-tuning vs preference tuning

Everything above is supervised fine-tuning (SFT): you show the model input-output pairs. A second stage, preference tuning, teaches the model which of two responses is better. Direct Preference Optimization (DPO) has largely displaced full RLHF pipelines for this, because it optimises directly on preference pairs without training a separate reward model. A typical recipe is SFT first to fix behaviour and format, then DPO to align tone and preferences. Preference tuning needs preference data (pairs labelled better or worse), which is a different and often harder dataset to build than SFT examples.

The tooling you will actually use

You rarely implement any of this by hand. The Hugging Face peft library provides the adapters, transformers and trl provide the training and preference-tuning loops, and higher-level wrappers such as Axolotl, Unsloth, and torchtune package the whole flow with sensible defaults. All of them run on PyTorch . For serving, frameworks like vLLM can hot-swap many LoRA adapters against one base model, so a single deployment serves many fine-tuned variants economically.

Training Process

Hyperparameters

Learning rate. Start with 1e-5 to 2e-5 for full fine-tuning, 1e-4 to 3e-4 for LoRA. Too high causes catastrophic forgetting; too low produces minimal adaptation.

Epochs. 2-5 epochs for most datasets. Monitor validation loss to detect overfitting. Stop when validation loss starts increasing.

Batch size. Larger is generally better for training stability. Use the largest batch size that fits in GPU memory, with gradient accumulation if needed.

Monitoring Training

Track during training:

Training loss (should decrease steadily)
Validation loss (should decrease, then plateau; increasing indicates overfitting)
Learning rate schedule (warmup then decay is standard)

Common Training Problems

Catastrophic forgetting. The model loses general capabilities while learning the fine-tuning task. Reduce learning rate, reduce epochs, or mix general-purpose data into the fine-tuning dataset.

Overfitting. The model memorizes training examples but does not generalize. Reduce epochs, add more diverse training data, or increase regularization (dropout, weight decay).

Mode collapse. The model produces the same output regardless of input. Usually caused by insufficient data diversity or too many epochs. Add more diverse examples.

Evaluation

Evaluate fine-tuned models rigorously:

Compare to baseline. Always compare against the base model with good prompting. If fine-tuning does not significantly improve over prompting, it is not worth the ongoing maintenance cost.

Use held-out test data. Evaluate on examples the model never saw during training. Training set performance is meaningless.

Evaluate multiple dimensions. Check not just task accuracy but also output quality, format compliance, and edge case handling.

Check for regressions. Test capabilities that the model should retain from pre-training. Fine-tuning should not break general-purpose capabilities unless that is intentional.

Cost Considerations

Training cost. Fine-tuning via API (OpenAI, Bedrock) costs approximately $8–$25 per million training tokens as of early 2026, verify current pricing at provider documentation before budgeting, as these figures change. Self-hosted fine-tuning costs GPU hours ($1–$5/hour for single-GPU on cloud providers, more for multi-GPU).

Inference cost. Fine-tuned models are often the same cost to run as base models. The savings come from using a smaller fine-tuned model instead of a larger prompted model.

Maintenance cost. Fine-tuned models need periodic retraining as data and requirements change. Budget for quarterly or monthly retraining cycles.

Fine-tuning is a powerful technique when applied to the right problems. The decision to fine-tune should be driven by data: you have enough quality examples, prompting is demonstrably insufficient, and the improvement justifies the ongoing maintenance cost.

Sources

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR (2022). https://arxiv.org/abs/2106.09685 , The original LoRA paper. Demonstrates that intrinsic rank of weight updates is low, enabling efficient fine-tuning with adapter matrices.
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS (2023). https://arxiv.org/abs/2305.14314 , Introduces NF4 quantization and the double quantization technique that enables fine-tuning 65B models on a single 48GB GPU.
He, J. et al. “Towards a Unified View of Parameter-Efficient Transfer Learning.” ICLR (2022). https://arxiv.org/abs/2110.04366 , Unified framework comparing LoRA, adapters, prefix tuning, and prompt tuning. Useful for understanding when each PEFT method is appropriate.
Kirkpatrick, J. et al. “Overcoming Catastrophic Forgetting in Neural Networks.” PNAS 114, no. 13 (2017): 3521–3526. https://arxiv.org/abs/1612.00796 , Elastic Weight Consolidation (EWC), a regularization approach to catastrophic forgetting mentioned in the training problems section.
Hugging Face. “PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning.” https://github.com/huggingface/peft , The standard Python library for LoRA, QLoRA, prompt tuning, and other PEFT methods referenced in this guide.
Liu, S.-Y. et al. “DoRA: Weight-Decomposed Low-Rank Adaptation.” ICML (2024). https://arxiv.org/abs/2402.09353 , Decomposes weights into magnitude and direction, closing the quality gap to full fine-tuning with no added inference cost.
Zhao, J. et al. “GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.” ICML (2024). https://arxiv.org/abs/2403.03507 , Projects gradients into a low-rank subspace to enable full-parameter training in reduced memory.
Rafailov, R. et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS (2023). https://arxiv.org/abs/2305.18290 , The preference-tuning method that replaces reward-model RLHF for many alignment recipes.

Open source projects

Freelancer Templates Contracts, proposals, SOWs

Freelancer Automation Workflow recipes, AI playbooks

Work with Linda

Workshop Series €2,000/mo x 3

1:1 Consulting 60 min session