Large language models have fundamentally changed what developers can build. But out-of-the-box, a general-purpose model like Llama 3 or Mistral doesn't know your company's internal documentation, your customer support tone, or your domain-specific terminology. That's where fine-tuning comes in.
This guide walks you through the complete fine-tuning pipeline — from dataset preparation to model deployment — using open-source tools that are production-ready in 2026. Whether you're building a coding assistant, a legal document analyzer, or a customer support bot, these techniques apply directly.
What Is Fine-Tuning and Why Does It Matter?
Fine-tuning is the process of continuing the training of a pre-trained model on a smaller, task-specific dataset. Instead of training a model from scratch (which costs millions of dollars and months of compute), you leverage the vast general knowledge already baked into models like Llama 3.1, Mistral 7B, or Mixtral 8x7B, and then specialize it for your use case.
When Should You Fine-Tune vs. Prompt Engineer?
| Approach | Best For | Limitations |
|---|---|---|
| Prompt Engineering | Quick iterations, general tasks | No persistent knowledge, token limits |
| RAG (Retrieval-Augmented Generation) | Dynamic, large knowledge bases | Retrieval latency, hallucination risk |
| Fine-Tuning | Domain tone, format consistency, specialized tasks | Requires curated data, compute cost |
Fine-tuning wins when you need the model to internalize behavior — not just access data. For example, if you want every output to follow a strict JSON schema, or always respond in a clinical medical tone, fine-tuning gives you that consistency without injecting pages of instructions into every prompt.
Understanding the Fine-Tuning Landscape in 2026
Before writing a single line of code, understand the three major approaches:
1. Full Fine-Tuning
Updates all model weights. Requires enormous GPU memory (e.g., 80GB+ for a 7B parameter model in fp32). Rarely practical unless you have a dedicated GPU cluster.
2. LoRA (Low-Rank Adaptation)
Injects small trainable rank-decomposition matrices into attention layers. The base model weights are frozen — only the LoRA adapters are trained. This reduces trainable parameters by over 99%, making fine-tuning feasible on a single consumer GPU.
3. QLoRA (Quantized LoRA)
Combines 4-bit quantization of the base model with LoRA adapter training. A 7B model that normally requires ~28GB of VRAM can be fine-tuned on a single 16GB GPU (or even a T4 on Google Colab). This is the go-to method for most developers in 2026.
Setting Up Your Environment
Let's start by setting up a reproducible Python environment.
# Create and activate a virtual environment
python -m venv llm-finetune
source llm-finetune/bin/activate # On Windows: llm-finetune\Scripts\activate
# Install core dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.48.0
pip install datasets==3.2.0
pip install peft==0.14.0
pip install trl==0.13.0
pip install bitsandbytes==0.45.0
pip install accelerate==1.3.0
pip install wandb # for experiment tracking
Verify your CUDA setup:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Step 1 — Curating and Preparing Your Dataset
This is the most important and often most underestimated step. Garbage in, garbage out is especially true for LLM fine-tuning.
Choosing the Right Format
Most fine-tuning pipelines use one of these dataset formats:
Instruction Format (for instruction-following models):
{
"instruction": "Summarize the following customer complaint in one sentence.",
"input": "I ordered a laptop three weeks ago and it still hasn't arrived. The tracking says it's been in the warehouse for 10 days...",
"output": "Customer reports a delayed laptop order that has been stationary in a warehouse for 10 days."
}
Chat Format (for conversational fine-tuning):
{
"messages": [
{"role": "system", "content": "You are a helpful legal assistant specializing in contract law."},
{"role": "user", "content": "What is a force majeure clause?"},
{"role": "assistant", "content": "A force majeure clause is a contractual provision that excuses one or both parties from performance obligations when an extraordinary event beyond their control..."}
]
}
Loading and Preprocessing with Hugging Face Datasets
from datasets import load_dataset, Dataset
import json
# Load from a JSONL file
dataset = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})
# Or create from a Python list
data = [
{"instruction": "...", "input": "...", "output": "..."},
# ... more examples
]
dataset = Dataset.from_list(data)
# Apply a formatting function
def format_instruction(example):
"""Format into a single training string using Alpaca-style prompt."""
if example.get("input"):
prompt = f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
else:
prompt = f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
return {"text": prompt}
formatted_dataset = dataset.map(format_instruction, remove_columns=dataset["train"].column_names)
print(formatted_dataset["train"][0]["text"])
Data Quality Checklist
Before training, run these sanity checks on your dataset:
- Minimum 500 examples for meaningful fine-tuning; 2,000–10,000 is the sweet spot for most tasks
- Consistent formatting — check for missing fields, inconsistent casing, or truncated outputs
- Deduplication — use MinHash or exact dedup to remove duplicate examples
- Length distribution — plot token length histograms; trim outliers above your model's context window
- Quality over quantity — 500 high-quality, human-reviewed examples outperform 50,000 noisy scraped examples
# Check token length distribution
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lengths = [len(tokenizer.encode(ex["text"])) for ex in formatted_dataset["train"]]
print(f"Min: {min(lengths)}, Max: {max(lengths)}, Mean: {sum(lengths)/len(lengths):.0f}")
# Filter examples that exceed context window
MAX_LENGTH = 2048
filtered = formatted_dataset.filter(lambda ex: len(tokenizer.encode(ex["text"])) <= MAX_LENGTH)
print(f"Kept {len(filtered['train'])} / {len(formatted_dataset['train'])} examples")
Step 2 — Loading the Base Model with Quantization
With QLoRA, we load the model in 4-bit precision first, then attach trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "meta-llama/Llama-3.1-8B-Instruct" # or "mistralai/Mistral-7B-Instruct-v0.3"
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True, # Nested quantization for memory efficiency
bnb_4bit_quant_type="nf4", # NormalFloat4 — best for normally distributed weights
bnb_4bit_compute_dtype=torch.bfloat16 # bfloat16 for stable training
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Critical for causal LM training
# Load model
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto", # Automatically distributes across available GPUs
trust_remote_code=True,
torch_dtype=torch.bfloat16,
)
print(f"Model loaded. Parameters: {model.num_parameters():,}")
Step 3 — Configuring LoRA Adapters with PEFT
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
# Prepare model for k-bit training (freezes base weights, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)
# LoRA configuration
lora_config = LoraConfig(
r=16, # Rank — higher = more capacity, more memory. Start with 16.
lora_alpha=32, # Scaling factor. Typically 2x rank.
target_modules=[ # Which attention layers to target
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
],
lora_dropout=0.05, # Dropout for regularization
bias="none", # Don't train bias terms
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 41,943,040 || all params: 8,072,884,224 || trainable%: 0.5196
The key insight: you're only training ~0.5% of the model's parameters while getting meaningful specialization. This is the magic of LoRA.
Step 4 — Training with the TRL SFTTrainer
The trl library's SFTTrainer handles the heavy lifting of supervised fine-tuning.
from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments
training_args = SFTConfig(
output_dir="./llama-3-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
gradient_checkpointing=True, # Trade compute for memory
optim="paged_adamw_32bit", # Memory-efficient optimizer for QLoRA
save_steps=50,
logging_steps=10,
learning_rate=2e-4,
weight_decay=0.001,
fp16=False,
bf16=True, # Use bfloat16 on Ampere+ GPUs
max_grad_norm=0.3, # Gradient clipping
max_steps=-1,
warmup_ratio=0.03,
group_by_length=True, # Group similar-length sequences for efficiency
lr_scheduler_type="cosine",
report_to="wandb", # Log to Weights & Biases
dataset_text_field="text",
max_seq_length=2048,
packing=False, # Set True to pack multiple short examples
eval_strategy="steps",
eval_steps=50,
load_best_model_at_end=True,
)
trainer = SFTTrainer(
model=model,
train_dataset=formatted_dataset["train"],
eval_dataset=formatted_dataset["test"],
args=training_args,
processing_class=tokenizer,
)
# Start training
trainer.train()
# Save the fine-tuned adapters
trainer.save_model("./llama-3-finetuned-final")
tokenizer.save_pretrained("./llama-3-finetuned-final")
Understanding the Hyperparameters
learning_rate=2e-4: Higher than typical full fine-tuning (1e-5). LoRA adapters need a higher LR since they start from zero.gradient_accumulation_steps=4: Simulates a larger batch size without the memory cost. Keep effective batch size between 8 and 32.warmup_ratio=0.03: Prevents early training instability by linearly ramping up the LR.lr_scheduler_type="cosine": Gradually decays LR — works better than linear for LoRA.
Step 5 — Inference with Your Fine-Tuned Model
Once training completes, load and test your adapter:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
# Load base model (no quantization needed for inference if you have enough VRAM)
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# Load fine-tuned adapter
model = PeftModel.from_pretrained(base_model, "./llama-3-finetuned-final")
model = model.merge_and_unload() # Merge adapters into base weights for faster inference
# Create pipeline
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.1,
)
# Generate output
prompt = """### Instruction:
Classify the following customer review sentiment.
### Input:
The product arrived damaged and customer service was unhelpful.
### Response:"""
result = pipe(prompt)[0]["generated_text"]
print(result)
Step 6 — Evaluating Your Fine-Tuned Model
Never ship a fine-tuned model without evaluation. Use both automated metrics and human review.
Automated Evaluation
from datasets import load_metric
import numpy as np
# For classification tasks — use accuracy
# For generation tasks — use ROUGE or BERTScore
# ROUGE for summarization
rouge = load_metric("rouge")
predictions = ["The product was damaged and support was unhelpful."]
references = ["Customer received a damaged product and received poor support."]
results = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-L: {results['rougeL'].mid.fmeasure:.3f}")
# Perplexity — lower is better (measures how "surprised" the model is by the test set)
import math
def compute_perplexity(model, tokenizer, texts, device="cuda"):
model.eval()
total_loss = 0
total_tokens = 0
with torch.no_grad():
for text in texts:
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs, labels=inputs["input_ids"])
total_loss += outputs.loss.item() * inputs["input_ids"].shape[1]
total_tokens += inputs["input_ids"].shape[1]
return math.exp(total_loss / total_tokens)
perplexity = compute_perplexity(model, tokenizer, test_texts)
print(f"Test Perplexity: {perplexity:.2f}")
Evaluation Best Practices
- Hold-out test set: Never evaluate on training data. Reserve at least 10–20% of your dataset.
- Baseline comparison: Compare against the base model (without fine-tuning) and a prompt-engineered version.
- Human eval: For qualitative tasks (tone, style, helpfulness), have domain experts rate outputs on a 1–5 scale.
- Regression testing: Make sure fine-tuning didn't degrade general capabilities. Run your fine-tuned model on MMLU or HellaSwag benchmarks.
Step 7 — Saving and Deploying
Option A: Merge and Export to GGUF (for llama.cpp / Ollama)
# Convert merged model to GGUF for efficient CPU/GPU inference
pip install llama-cpp-python
python convert_hf_to_gguf.py ./llama-3-finetuned-merged \
--outtype q4_k_m \
--outfile llama3-custom-q4km.gguf
Option B: Deploy with vLLM
# Install vLLM
# pip install vllm
from vllm import LLM, SamplingParams
llm = LLM(
model="./llama-3-finetuned-merged",
dtype="bfloat16",
tensor_parallel_size=1, # Number of GPUs
max_model_len=4096,
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Summarize this complaint: ..."], sampling_params)
print(outputs[0].outputs[0].text)
Option C: Push to Hugging Face Hub
from huggingface_hub import HfApi
api = HfApi()
api.create_repo("your-username/llama3-custom-finetuned", private=True)
model.push_to_hub("your-username/llama3-custom-finetuned")
tokenizer.push_to_hub("your-username/llama3-custom-finetuned")
🚀 Pro Tips
1. Use Flash Attention 2 for Faster Training
model = AutoModelForCausalLM.from_pretrained(
model_id,
attn_implementation="flash_attention_2", # 2–4x faster on A100/H100
torch_dtype=torch.bfloat16,
)
2. Monitor Training Loss Curves
A healthy fine-tuning run shows training loss steadily decreasing over 1–3 epochs. If validation loss starts rising while training loss keeps falling, you're overfitting — reduce epochs or add more data.
3. Experiment with Rank Efficiently
Don't start with r=64. Use r=8 first, evaluate, then scale up. The gains from higher rank diminish quickly and the memory cost is linear.
4. Use packing=True for Short Instruction Datasets
If most of your training examples are under 512 tokens, packing concatenates multiple examples into one sequence, dramatically improving GPU utilization.
5. Save Adapters, Not Full Models
LoRA adapters are only a few MB (e.g., 84MB for a 7B model with r=16). Store them separately from the base model for easy version management and rollbacks.
6. Synthetic Data Augmentation
If you have fewer than 1,000 examples, use a capable model (e.g., GPT-4o or Claude 3.7) to generate synthetic training examples. Clearly label them and mix 20–30% synthetic with real examples for best results.
7. Checkpoint Every 50–100 Steps
Fine-tuning can become unstable (loss spike, NaN gradients). Always set save_steps=50 so you can roll back to a stable checkpoint.
Common Mistakes to Avoid
❌ Forgetting to Set pad_token
# Always do this before training:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
❌ Training on the Prompt (Input Masking)
For instruction fine-tuning, you should only compute the loss on the response tokens, not the instruction. SFTTrainer with the DataCollatorForCompletionOnlyLM handles this:
from trl import DataCollatorForCompletionOnlyLM
response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
trainer = SFTTrainer(
...
data_collator=collator,
)
❌ Using Too High a Learning Rate
Fine-tuning is sensitive to LR. Values above 5e-4 often cause catastrophic forgetting — the model "forgets" its pre-trained knowledge. Stick to 1e-4 to 3e-4 for LoRA.
❌ Ignoring Catastrophic Forgetting
Fine-tuning on a narrow domain can degrade performance on general tasks. Mitigate this by:
- Mixing a small percentage (~5%) of general instruction data into your training set
- Using a lower learning rate
- Training for fewer epochs
❌ Not Validating Your Dataset Format
Always inspect a few examples after formatting to confirm the prompt template is correct:
print(formatted_dataset["train"][0]["text"])
# Manually verify the output looks right before starting a multi-hour training run
❌ Running Full Evaluation Only at the End
Use eval_strategy="steps" and eval_steps=50 to catch problems early. It's painful to discover overfitting after 8 hours of training.
Real-World Example: Fine-Tuning Mistral for Legal Document Extraction
Here's a condensed, real-world scenario: fine-tuning Mistral 7B to extract structured information from legal contracts.
Dataset format:
{
"instruction": "Extract the key dates and parties from this contract excerpt.",
"input": "This Agreement is entered into as of January 15, 2026, between Acme Corp, a Delaware corporation ('Client'), and TechServ LLC, a California limited liability company ('Vendor')...",
"output": "{\"effective_date\": \"2026-01-15\", \"parties\": [{\"name\": \"Acme Corp\", \"role\": \"Client\", \"jurisdiction\": \"Delaware\"}, {\"name\": \"TechServ LLC\", \"role\": \"Vendor\", \"jurisdiction\": \"California\"}]}"
}
Results after fine-tuning on 2,000 contract examples:
- Extraction accuracy: 91.3% (vs. 67.8% with GPT-4o + prompt engineering)
- Latency: 340ms average (running on a single A10G GPU via vLLM)
- Cost: ~$0.0003 per document (vs. ~$0.04 with GPT-4o API)
The fine-tuned model also learned to always return valid JSON — something prompt engineering alone struggled to guarantee consistently.
📌 Key Takeaways
-
QLoRA is your default starting point — it makes fine-tuning accessible on a single consumer or cloud GPU. Full fine-tuning is rarely necessary.
-
Dataset quality trumps quantity — 1,000 clean, well-formatted examples will outperform 10,000 noisy ones every time.
-
Start with small experiments — test your pipeline on 100 examples first, then scale up. Catch formatting bugs before committing to an expensive training run.
-
Rank (
r) is a lever, not a fixed setting — start atr=8orr=16and only increase if evaluation shows the model lacks capacity. -
Always evaluate against a baseline — fine-tuning can regress general capability. Measure both task-specific and general performance.
-
LoRA adapters are lightweight assets — store them separately, version them like code, and share only the adapter weights (not the full model) for easy collaboration.
-
Inference efficiency matters — after fine-tuning, use
merge_and_unload()for production, and serve with vLLM or llama.cpp for high-throughput, low-latency deployments. -
Monitor training in real time — integrate Weights & Biases from day one. Blind training runs lead to wasted GPU hours.
Conclusion
Fine-tuning LLMs is no longer the exclusive domain of research labs. With QLoRA, Hugging Face PEFT, and the trl library, a developer with a single GPU can build a specialized model that outperforms much larger general-purpose models on a specific task — often at a fraction of the API cost.
The pipeline is straightforward: curate a high-quality dataset, load a quantized base model, attach LoRA adapters, train for a few epochs, evaluate rigorously, and deploy with an efficient serving framework. Each step has established best practices, and the open-source ecosystem has matured to the point where most sharp edges have been smoothed out.
The most important insight is this: the bottleneck is no longer compute — it's data. Invest your time in collecting and cleaning domain-specific examples, and the fine-tuning itself almost takes care of itself.
Start small, iterate quickly, measure everything, and ship.
References
- Hugging Face PEFT Documentation
- TRL — Transformer Reinforcement Learning
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
- Llama 3 Model Card — Meta AI
- Mistral 7B Technical Report
- vLLM: Easy, Fast, and Cheap LLM Serving
- BitsAndBytes Library
- Weights & Biases — Experiment Tracking
- Flash Attention 2 (Dao et al., 2023)