Quick Answer
Fine-tuning is the process of taking a pre-trained AI model and continuing to train it on a smaller, task-specific dataset so that it performs better on your particular use case. Instead of training a model from scratch (which can cost millions of euros), you take an existing model that already understands language or images and teach it the specific style, terminology, or decision patterns you need. Fine-tuning changes the model's weights and produces a new model version; prompt engineering changes only the input and leaves the model itself untouched.
Fine-tuning threads a specific dataset through a pre-existing model structure: the loom (the model) remains the same; the thread (your data) defines the final pattern.

The problem fine-tuning solves

A pre-trained model like GPT-4o or Llama 3 is trained on general internet data. It can write in many styles, discuss many topics, and answer many types of questions. But it does not know:

  • Your company’s internal terminology
  • The exact output format your system expects
  • The tone of voice your brand uses
  • The regulatory constraints specific to your industry
  • Domain-specific concepts that are underrepresented in public training data

You can address some of these with prompt engineering: add instructions to every prompt. But for complex domains, long instruction lists inflate costs and still produce inconsistent results.

Fine-tuning bakes the knowledge or behaviour into the model itself, so you do not need to explain it every time.

Prompt engineering vs fine-tuning

| | Prompt engineering | Fine-tuning |
|---|---|---|
| Setup time | Minutes to hours | Days to weeks |
| Setup cost | €0 | €2 to €10,000+ |
| Inference cost | Higher (long prompts cost more) | Lower (shorter prompts needed) |
| Consistency | Variable | High |
| Data required | None | 50 to 100,000+ examples |
| Model ownership | No | Yes (if self-hosted) |
| Best for | Exploratory, general tasks | High-volume, specialised tasks |

Rule of thumb: spend one week on prompt engineering first. If quality is still insufficient after exhausting prompt techniques, evaluate fine-tuning.
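
To make the "Inference cost" row concrete, the trade-off can be framed as a break-even calculation: a one-off training cost is recovered only if each request becomes cheaper afterwards. The sketch below is a minimal illustration; the function name `break_even_requests` and every price and token count in it are hypothetical, not real provider prices. Both per-request costs are parameters because providers may price fine-tuned and base models differently.

```python
from typing import Optional


def break_even_requests(
    training_cost: float,
    base_cost_per_request: float,
    tuned_cost_per_request: float,
) -> Optional[float]:
    """Return the number of requests after which fine-tuning pays for itself.

    Returns None when the fine-tuned setup is not cheaper per request,
    because the one-off training cost is then never recovered.
    """
    saving = base_cost_per_request - tuned_cost_per_request
    if saving <= 0:
        return None
    return training_cost / saving


# Hypothetical figures for illustration only (not real prices):
# a 2,000-token instruction prompt shrinks to 200 tokens after fine-tuning.
price_per_1k_tokens = 0.001  # EUR, assumed identical for both models
base_cost = 2000 / 1000 * price_per_1k_tokens
tuned_cost = 200 / 1000 * price_per_1k_tokens

requests_needed = break_even_requests(500.0, base_cost, tuned_cost)
print(f"Break-even after about {requests_needed:,.0f} requests")
```

Under these assumed numbers the training cost is recovered only after a few hundred thousand requests, which is why the table lists fine-tuning as best for high-volume tasks.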

Types of fine-tuning

Instruction fine-tuning
  • Format: prompt + ideal response pairs
  • Goal: teach the model a specific output style or behaviour
  • Example: 500 examples of customer emails paired with ideal support responses in your company's tone
Domain adaptation
  • Format: large corpus of domain text (continued pre-training)
  • Goal: add specialised vocabulary and concepts
  • Example: training a model on 50,000 medical case notes so it understands clinical language
LoRA / parameter-efficient fine-tuning
  • Trains a small adapter rather than the full model
  • Roughly 10-100x cheaper than full fine-tuning
  • Used for: style transfer, task-specific adapters, image generation with a custom subject (for example, combined with DreamBooth)
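
The reason the adapter is small can be shown with simple arithmetic. LoRA freezes a weight matrix W of shape d_out × d_in and learns a low-rank update B·A, where A is r × d_in and B is d_out × r. The sketch below compares parameter counts for a single matrix; the helper names and the 4096 × 4096 layer size are illustrative assumptions, and a smaller parameter count does not translate one-to-one into lower cost.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in a rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Illustrative layer size and rank (assumed, not taken from a specific model)
d_in = d_out = 4096
r = 16

full = full_params(d_in, d_out)
adapter = lora_params(d_in, d_out, r)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {full // adapter}x")
# full: 16,777,216  adapter: 131,072  ratio: 128x
```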
RLHF (reinforcement learning from human feedback)
  • Human raters rank model responses
  • A reward model is trained on those rankings
  • The base model is then trained to maximise the reward
  • Used by OpenAI, Anthropic, and Google to align base models to human preferences
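
To illustrate how a reward model can learn from rankings, one common formulation is a pairwise loss: for each pair of responses, the loss is small when the preferred response receives the higher reward. The sketch below is an assumption about the general technique, not a description of what any particular lab uses, and the reward values are made-up scalars.

```python
import math


def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) stably.

    The loss approaches 0 when the chosen response scores much higher
    than the rejected one, and grows when the ordering is wrong.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # Rewritten form avoids overflow in exp() for large negative margins
    return -margin + math.log1p(math.exp(margin))


# Made-up reward scores for one ranked pair
print(pairwise_ranking_loss(2.0, 0.0))  # correct ordering: small loss
print(pairwise_ranking_loss(0.0, 2.0))  # wrong ordering: larger loss
```

Minimising this loss over many ranked pairs pushes the reward model to score preferred responses higher; the later reinforcement learning step then optimises the language model against that score.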

The fine-tuning workflow

Step 1: Prepare training data
Curate high-quality examples: input/output pairs or domain documents. More time spent here directly improves model quality.

Step 2: Format the data
Convert to the required format. For instruction fine-tuning, this is typically JSONL with system/user/assistant messages.

Step 3: Run the training job
Upload data to the API or run on rented GPU infrastructure. Training takes minutes (API, small dataset) to days (large open-weight model).

Step 4: Evaluate and deploy
Test on a held-out set of examples. Compare accuracy, format consistency, and cost against the baseline prompt-engineered approach. Deploy if quality meets the bar.
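
The held-out evaluation in Step 4 can be sketched in a few lines. The example below assumes the target format is "a JSON object" and uses placeholder strings in place of real model responses; the function names are hypothetical helpers, not part of any library.

```python
import json
import random


def split_examples(examples, held_out_fraction=0.2, seed=0):
    """Shuffle deterministically and split into (train, held_out)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


def is_valid_json_object(text):
    """Assumed format rule: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def format_pass_rate(outputs, check=is_valid_json_object):
    """Fraction of outputs that satisfy the format check."""
    if not outputs:
        return 0.0
    return sum(1 for output in outputs if check(output)) / len(outputs)


examples = [{"id": i} for i in range(10)]
train, held_out = split_examples(examples)
print(len(train), len(held_out))  # 8 2

# Placeholder outputs standing in for real model responses
baseline_outputs = ['{"status": "ok"}', "Sure! Here is the JSON: {...}"]
tuned_outputs = ['{"status": "ok"}', '{"status": "refund"}']
print(format_pass_rate(baseline_outputs), format_pass_rate(tuned_outputs))  # 0.5 1.0
```

Keeping the held-out examples out of the training file is what makes the comparison against the prompt-engineered baseline meaningful.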

Fine-tuning with the OpenAI API

bash
pip install openai

python
import json
from openai import OpenAI

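# The client reads the OPENAI_API_KEY environment variable by default,
# which avoids hard-coding a secret in source code.
client = OpenAI()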

# Training data format: JSONL with messages arrays
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for an Austrian fintech company. Reply in formal German."},
            {"role": "user", "content": "Ich habe eine Frage zu meiner Rechnung."},
            {"role": "assistant", "content": "Guten Tag! Ich helfe Ihnen gerne bei Ihrer Frage zur Rechnung. Bitte teilen Sie mir Ihre Kundennummer mit, damit ich Ihr Konto prüfen kann."}
        ]
    },
    # ... add 99+ more examples
]

# Save as JSONL
with open("training.jsonl", "w") as f:
    for ex in training_examples:
        f.write(json.dumps(ex) + "\n")

# Upload training file
with open("training.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
    file_id = response.id

# Start fine-tuning job
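job = client.fine_tuning.jobs.create(
    training_file=file_id,
    # Fine-tuning usually expects a dated snapshot name rather than the bare
    # alias; check the provider's current list of fine-tunable models.
    model="gpt-4o-mini-2024-07-18",
)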
print(f"Fine-tuning job created: {job.id}")
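
Before uploading, it can help to check the training file locally, since a malformed line can cause a job to be rejected. The sketch below applies a few plausible sanity rules to one JSONL line; these rules are assumptions for illustration and are not the API's official validation logic, and `validate_chat_jsonl_line` is a hypothetical helper.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i} has empty content")

    # A training example should end with the response the model is meant to learn
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]})
bad = json.dumps({"messages": [{"role": "user", "content": "Hi"}]})
print(validate_chat_jsonl_line(good))  # []
print(validate_chat_jsonl_line(bad))   # ['last message is not from the assistant']
```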

LoRA fine-tuning with Hugging Face

For open-weight models (Llama 3, Mistral), LoRA is the standard approach:

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
import torch

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Configure LoRA: only train a small adapter
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank: higher = more capacity, more memory
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.1,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Approximate output for this configuration:
# trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848
# Fewer than 0.1% of the parameters are being trained

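The trained LoRA adapter is saved as a small file, separate from the base model weights (megabytes rather than gigabytes). At inference time, load it on top of the base model or merge it into the base weights.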

When not to fine-tune

You have not reached the prompt engineering ceiling: Most companies that think they need fine-tuning have not exhausted prompt engineering. Few-shot examples, structured output formats, and chain-of-thought prompting often get around 80% of the quality with none of the fine-tuning overhead.
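
As an illustration of the few-shot technique mentioned above, worked examples can be placed in the conversation as earlier user/assistant turns, so the model sees the desired behaviour without any training. This is a minimal sketch; `build_few_shot_messages` and the ticket-classification examples are hypothetical.

```python
def build_few_shot_messages(system_prompt, examples, user_input):
    """Build a chat message list with worked examples before the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


# Hypothetical examples for a ticket-classification task
examples = [
    ("My card was charged twice.", '{"category": "billing"}'),
    ("I cannot log in to the app.", '{"category": "access"}'),
]
messages = build_few_shot_messages(
    "Classify the support ticket. Reply with a JSON object only.",
    examples,
    "Where can I download my invoice?",
)
print(len(messages))  # 6
```

The resulting list has the same system/user/assistant shape as fine-tuning data, so examples collected at this stage can later be reused as training examples if fine-tuning turns out to be necessary.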

Your task changes frequently: Fine-tuned models encode the task into weights. Changing the task requires retraining. For rapidly evolving use cases, prompt engineering stays flexible.

Your dataset has quality problems: Fine-tuning amplifies patterns in training data. A model trained on inconsistent or low-quality examples learns to be inconsistent and low-quality.
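
One simple inconsistency check is to look for prompts that appear more than once with different target responses, since such pairs give the model contradictory signals. The sketch below is an illustrative heuristic with a hypothetical helper name and made-up data; it only normalises case and whitespace and will not catch paraphrased duplicates.

```python
from collections import defaultdict


def find_conflicting_examples(pairs):
    """Return prompts that are paired with more than one distinct response.

    Prompts are compared after lower-casing and collapsing whitespace.
    """
    responses_by_prompt = defaultdict(set)
    for prompt, response in pairs:
        key = " ".join(prompt.lower().split())
        responses_by_prompt[key].add(response.strip())
    return {
        prompt: sorted(responses)
        for prompt, responses in responses_by_prompt.items()
        if len(responses) > 1
    }


# Made-up training pairs: the first two disagree on the ideal response
pairs = [
    ("Reset my password", "Use the 'Forgot password' link."),
    ("reset  my password", "Please call our hotline."),
    ("Cancel my order", "Go to Orders and select Cancel."),
]
print(find_conflicting_examples(pairs))
```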

You are under EU AI Act obligations: If your use case falls under the EU AI Act’s high-risk categories, a fine-tuned model may trigger documentation, evaluation, and compliance obligations that a prompted model using a compliant third-party API does not.
