
Fine‑Tuning LLMs: Step‑by‑Step Guide for Developers

A hands-on, developer-focused guide to fine-tuning open-source large language models like Llama and Mistral on domain-specific data using PyTorch and Hugging Face Transformers — covering dataset prep, LoRA, QLoRA, training loops, evaluation, and deployment best practices.

May 6, 2026 · 14 min read · Niraj Kumar

Large language models have fundamentally changed what developers can build. But out-of-the-box, a general-purpose model like Llama 3 or Mistral doesn't know your company's internal documentation, your customer support tone, or your domain-specific terminology. That's where fine-tuning comes in.

This guide walks you through the complete fine-tuning pipeline — from dataset preparation to model deployment — using open-source tools that are production-ready in 2026. Whether you're building a coding assistant, a legal document analyzer, or a customer support bot, these techniques apply directly.


What Is Fine-Tuning and Why Does It Matter?

Fine-tuning is the process of continuing the training of a pre-trained model on a smaller, task-specific dataset. Instead of training a model from scratch (which costs millions of dollars and months of compute), you leverage the vast general knowledge already baked into models like Llama 3.1, Mistral 7B, or Mixtral 8x7B, and then specialize it for your use case.

When Should You Fine-Tune vs. Prompt Engineer?

| Approach | Best For | Limitations |
| --- | --- | --- |
| Prompt Engineering | Quick iterations, general tasks | No persistent knowledge, token limits |
| RAG (Retrieval-Augmented Generation) | Dynamic, large knowledge bases | Retrieval latency, hallucination risk |
| Fine-Tuning | Domain tone, format consistency, specialized tasks | Requires curated data, compute cost |

Fine-tuning wins when you need the model to internalize behavior — not just access data. For example, if you want every output to follow a strict JSON schema, or always respond in a clinical medical tone, fine-tuning gives you that consistency without injecting pages of instructions into every prompt.


Understanding the Fine-Tuning Landscape in 2026

Before writing a single line of code, understand the three major approaches:

1. Full Fine-Tuning

Updates all model weights. Requires enormous GPU memory (e.g., 80GB+ for a 7B parameter model in fp32). Rarely practical unless you have a dedicated GPU cluster.

2. LoRA (Low-Rank Adaptation)

Injects small trainable rank-decomposition matrices into attention layers. The base model weights are frozen — only the LoRA adapters are trained. This reduces trainable parameters by over 99%, making fine-tuning feasible on a single consumer GPU.

3. QLoRA (Quantized LoRA)

Combines 4-bit quantization of the base model with LoRA adapter training. A 7B model that normally requires ~28GB of VRAM can be fine-tuned on a single 16GB GPU (or even a T4 on Google Colab). This is the go-to method for most developers in 2026.
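The memory arithmetic behind these numbers is worth internalizing. A rough sketch, counting weights only (real usage adds activations, gradients, optimizer state, and framework overhead on top):

```python
# Back-of-the-envelope weight memory for a 7B-parameter model at various precisions.
# Weights only — activations, gradients, optimizer state, and CUDA overhead come on top.
def estimate_vram_gb(n_params_billions, bytes_per_param):
    return n_params_billions * bytes_per_param  # 1B params at 1 byte each ≈ 1 GB

for label, bpp in [("fp32", 4), ("fp16/bf16", 2), ("4-bit NF4", 0.5)]:
    print(f"7B model, {label}: ~{estimate_vram_gb(7, bpp):.1f} GB of weights")
```

Going from ~28GB of fp32 weights to ~3.5GB of 4-bit weights is what leaves a 16GB card enough headroom for bf16 LoRA adapters and activations.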


Setting Up Your Environment

Let's start by setting up a reproducible Python environment.

# Create and activate a virtual environment
python -m venv llm-finetune
source llm-finetune/bin/activate  # On Windows: llm-finetune\Scripts\activate

# Install core dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.48.0
pip install datasets==3.2.0
pip install peft==0.14.0
pip install trl==0.13.0
pip install bitsandbytes==0.45.0
pip install accelerate==1.3.0
pip install wandb  # for experiment tracking

Verify your CUDA setup:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Step 1 — Curating and Preparing Your Dataset

This is the most important and often most underestimated step. Garbage in, garbage out is especially true for LLM fine-tuning.

Choosing the Right Format

Most fine-tuning pipelines use one of these dataset formats:

Instruction Format (for instruction-following models):

{
  "instruction": "Summarize the following customer complaint in one sentence.",
  "input": "I ordered a laptop three weeks ago and it still hasn't arrived. The tracking says it's been in the warehouse for 10 days...",
  "output": "Customer reports a delayed laptop order that has been stationary in a warehouse for 10 days."
}

Chat Format (for conversational fine-tuning):

{
  "messages": [
    {"role": "system", "content": "You are a helpful legal assistant specializing in contract law."},
    {"role": "user", "content": "What is a force majeure clause?"},
    {"role": "assistant", "content": "A force majeure clause is a contractual provision that excuses one or both parties from performance obligations when an extraordinary event beyond their control..."}
  ]
}

Loading and Preprocessing with Hugging Face Datasets

from datasets import load_dataset, Dataset
import json

# Load from a JSONL file
dataset = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})

# Or create from a Python list
data = [
    {"instruction": "...", "input": "...", "output": "..."},
    # ... more examples
]
dataset = Dataset.from_list(data)

# Apply a formatting function
def format_instruction(example):
    """Format into a single training string using Alpaca-style prompt."""
    if example.get("input"):
        prompt = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        prompt = f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""
    return {"text": prompt}

formatted_dataset = dataset.map(format_instruction, remove_columns=dataset["train"].column_names)
print(formatted_dataset["train"][0]["text"])

Data Quality Checklist

Before training, run these sanity checks on your dataset:

  • Minimum 500 examples for meaningful fine-tuning; 2,000–10,000 is the sweet spot for most tasks
  • Consistent formatting — check for missing fields, inconsistent casing, or truncated outputs
  • Deduplication — use MinHash or exact dedup to remove duplicate examples
  • Length distribution — plot token length histograms; trim outliers above your model's context window
  • Quality over quantity — 500 high-quality, human-reviewed examples outperform 50,000 noisy scraped examples
# Check token length distribution
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lengths = [len(tokenizer.encode(ex["text"])) for ex in formatted_dataset["train"]]
print(f"Min: {min(lengths)}, Max: {max(lengths)}, Mean: {sum(lengths)/len(lengths):.0f}")

# Filter examples that exceed context window
MAX_LENGTH = 2048
filtered = formatted_dataset.filter(lambda ex: len(tokenizer.encode(ex["text"])) <= MAX_LENGTH)
print(f"Kept {len(filtered['train'])} / {len(formatted_dataset['train'])} examples")
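The exact-dedup item on the checklist can be as simple as hashing each formatted example. A sketch with plain Python lists (swap in your real dataset; MinHash is needed only if you also want to catch near-duplicates):

```python
import hashlib

# Hypothetical formatted examples — the second is an exact duplicate of the first.
examples = [
    {"text": "### Instruction:\nSummarize A\n\n### Response:\nB"},
    {"text": "### Instruction:\nSummarize A\n\n### Response:\nB"},
    {"text": "### Instruction:\nSummarize C\n\n### Response:\nD"},
]

seen = set()
deduped = []
for ex in examples:
    # Hash the full training string so any byte-level difference counts as distinct
    digest = hashlib.sha256(ex["text"].encode("utf-8")).hexdigest()
    if digest not in seen:
        seen.add(digest)
        deduped.append(ex)

print(f"Kept {len(deduped)} / {len(examples)} examples")
```

The same predicate plugs straight into `dataset.filter(...)` if you keep the `seen` set outside the closure.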

Step 2 — Loading the Base Model with Quantization

With QLoRA, we load the model in 4-bit precision first, then attach trainable LoRA adapters.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # or "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,       # Nested quantization for memory efficiency
    bnb_4bit_quant_type="nf4",            # NormalFloat4 — best for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16  # bfloat16 for stable training
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Critical for causal LM training

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",           # Automatically distributes across available GPUs
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

print(f"Model loaded. Parameters: {model.num_parameters():,}")

Step 3 — Configuring LoRA Adapters with PEFT

from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training

# Prepare model for k-bit training (freezes base weights, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                      # Rank — higher = more capacity, more memory. Start with 16.
    lora_alpha=32,             # Scaling factor. Typically 2x rank.
    target_modules=[           # Which attention layers to target
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,         # Dropout for regularization
    bias="none",               # Don't train bias terms
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 41,943,040 || all params: 8,072,884,224 || trainable%: 0.5196

The key insight: you're only training ~0.5% of the model's parameters while getting meaningful specialization. This is the magic of LoRA.
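You can verify that 41,943,040 figure by hand. A sketch, assuming Llama 3.1 8B's published dimensions (hidden size 4096, 32 layers, GQA KV dim 1024, FFN dim 14336): each targeted linear layer gains two low-rank matrices.

```python
# Reproducing "trainable params: 41,943,040" from Llama-3.1-8B's layer shapes.
# Each targeted linear layer (d_in -> d_out) gains matrices A (r x d_in) and
# B (d_out x r), i.e. r * (d_in + d_out) extra parameters.
r = 16
n_layers = 32
hidden, kv_dim, ffn = 4096, 1024, 14336  # kv_dim = 8 KV heads * head_dim 128 (GQA)

target_shapes = [           # (d_in, d_out) per targeted module
    (hidden, hidden),       # q_proj
    (hidden, kv_dim),       # k_proj
    (hidden, kv_dim),       # v_proj
    (hidden, hidden),       # o_proj
    (hidden, ffn),          # gate_proj
    (hidden, ffn),          # up_proj
    (ffn, hidden),          # down_proj
]

per_layer = sum(r * (d_in + d_out) for d_in, d_out in target_shapes)
total = per_layer * n_layers
print(f"{total:,}")  # 41,943,040
```

Doubling the rank doubles this count, which is why rank is the main capacity/memory lever.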


Step 4 — Training with the TRL SFTTrainer

The trl library's SFTTrainer handles the heavy lifting of supervised fine-tuning.

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./llama-3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,     # Effective batch size = 4 * 4 = 16
    gradient_checkpointing=True,       # Trade compute for memory
    optim="paged_adamw_32bit",         # Memory-efficient optimizer for QLoRA
    save_steps=50,
    logging_steps=10,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=True,                         # Use bfloat16 on Ampere+ GPUs
    max_grad_norm=0.3,                 # Gradient clipping
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,              # Group similar-length sequences for efficiency
    lr_scheduler_type="cosine",
    report_to="wandb",                 # Log to Weights & Biases
    dataset_text_field="text",
    max_seq_length=2048,
    packing=False,                     # Set True to pack multiple short examples
    eval_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    args=training_args,
    processing_class=tokenizer,
)

# Start training
trainer.train()

# Save the fine-tuned adapters
trainer.save_model("./llama-3-finetuned-final")
tokenizer.save_pretrained("./llama-3-finetuned-final")

Understanding the Hyperparameters

  • learning_rate=2e-4: Higher than typical full fine-tuning (1e-5). LoRA adapters need a higher LR since they start from zero.
  • gradient_accumulation_steps=4: Simulates a larger batch size without the memory cost. Keep effective batch size between 8 and 32.
  • warmup_ratio=0.03: Prevents early training instability by linearly ramping up the LR.
  • lr_scheduler_type="cosine": Gradually decays LR — works better than linear for LoRA.
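The trainer handles the schedule for you, but the warmup-plus-cosine shape is easy to reason about in isolation. A pure-Python sketch of the learning rate this configuration produces, just for intuition:

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup then cosine decay — mirrors warmup_ratio + lr_scheduler_type='cosine'."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 to peak_lr over the warmup window
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

for s in [0, 30, 500, 1000]:
    print(f"step {s:4d}: lr = {lr_at(s, 1000):.2e}")
```

The LR peaks right at the end of warmup (step 30 of 1000 here) and glides smoothly to zero, which is gentler on the adapters than a linear drop.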

Step 5 — Inference with Your Fine-Tuned Model

Once training completes, load and test your adapter:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Load base model (no quantization needed for inference if you have enough VRAM)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load fine-tuned adapter
model = PeftModel.from_pretrained(base_model, "./llama-3-finetuned-final")
model = model.merge_and_unload()  # Merge adapters into base weights for faster inference

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)

# Generate output
prompt = """### Instruction:
Classify the following customer review sentiment.

### Input:
The product arrived damaged and customer service was unhelpful.

### Response:"""

result = pipe(prompt)[0]["generated_text"]
print(result)

Step 6 — Evaluating Your Fine-Tuned Model

Never ship a fine-tuned model without evaluation. Use both automated metrics and human review.

Automated Evaluation

import evaluate  # pip install evaluate rouge_score

# For classification tasks — use accuracy
# For generation tasks — use ROUGE or BERTScore

# ROUGE for summarization (datasets.load_metric was removed; use the evaluate library)
rouge = evaluate.load("rouge")

predictions = ["The product was damaged and support was unhelpful."]
references = ["Customer received a damaged product and received poor support."]

results = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-L: {results['rougeL']:.3f}")

# Perplexity — lower is better (measures how "surprised" the model is by the test set)
import math

def compute_perplexity(model, tokenizer, texts, device="cuda"):
    model.eval()
    total_loss = 0
    total_tokens = 0
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt").to(device)
            outputs = model(**inputs, labels=inputs["input_ids"])
            total_loss += outputs.loss.item() * inputs["input_ids"].shape[1]
            total_tokens += inputs["input_ids"].shape[1]
    return math.exp(total_loss / total_tokens)

perplexity = compute_perplexity(model, tokenizer, test_texts)  # test_texts: list of held-out strings
print(f"Test Perplexity: {perplexity:.2f}")

Evaluation Best Practices

  • Hold-out test set: Never evaluate on training data. Reserve at least 10–20% of your dataset.
  • Baseline comparison: Compare against the base model (without fine-tuning) and a prompt-engineered version.
  • Human eval: For qualitative tasks (tone, style, helpfulness), have domain experts rate outputs on a 1–5 scale.
  • Regression testing: Make sure fine-tuning didn't degrade general capabilities. Run your fine-tuned model on MMLU or HellaSwag benchmarks.

Step 7 — Saving and Deploying

Option A: Merge and Export to GGUF (for llama.cpp / Ollama)

# convert_hf_to_gguf.py ships with the llama.cpp repository (not llama-cpp-python)
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt

# Convert the merged HF weights to GGUF at f16, then quantize to Q4_K_M
python llama.cpp/convert_hf_to_gguf.py ./llama-3-finetuned-merged \
    --outtype f16 \
    --outfile llama3-custom-f16.gguf

# llama-quantize is built from the llama.cpp sources (cmake -B build && cmake --build build)
./llama.cpp/build/bin/llama-quantize llama3-custom-f16.gguf llama3-custom-q4km.gguf Q4_K_M

Option B: Deploy with vLLM

# Install vLLM
# pip install vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="./llama-3-finetuned-merged",
    dtype="bfloat16",
    tensor_parallel_size=1,  # Number of GPUs
    max_model_len=4096,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Summarize this complaint: ..."], sampling_params)
print(outputs[0].outputs[0].text)

Option C: Push to Hugging Face Hub

from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/llama3-custom-finetuned", private=True)

model.push_to_hub("your-username/llama3-custom-finetuned")
tokenizer.push_to_hub("your-username/llama3-custom-finetuned")

🚀 Pro Tips

1. Use Flash Attention 2 for Faster Training

# Requires: pip install flash-attn (needs a compatible CUDA toolchain and GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # 2–4x faster on A100/H100
    torch_dtype=torch.bfloat16,
)

2. Monitor Training Loss Curves

A healthy fine-tuning run shows training loss steadily decreasing over 1–3 epochs. If validation loss starts rising while training loss keeps falling, you're overfitting — reduce epochs or add more data.

3. Experiment with Rank Efficiently

Don't start with r=64. Use r=8 first, evaluate, then scale up. The gains from higher rank diminish quickly and the memory cost is linear.

4. Use packing=True for Short Instruction Datasets

If most of your training examples are under 512 tokens, packing concatenates multiple examples into one sequence, dramatically improving GPU utilization.

5. Save Adapters, Not Full Models

LoRA adapters are only a few MB (e.g., 84MB for a 7B model with r=16). Store them separately from the base model for easy version management and rollbacks.
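The ~84MB figure checks out arithmetically: Mistral 7B and Llama 3.1 8B share the same hidden/FFN shapes, so the r=16 adapter parameter count from earlier applies to both.

```python
# Where the adapter-size figure comes from: adapter params stored in 16-bit precision.
trainable_params = 41_943_040  # r=16 count reported by print_trainable_parameters()
bytes_per_param = 2            # bf16 / fp16
size_mb = trainable_params * bytes_per_param / 1e6
print(f"~{size_mb:.0f} MB")
```

Compare that to ~16GB for the full bf16 base model — shipping only the adapter is what makes per-customer or per-task variants cheap to store and swap.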

6. Synthetic Data Augmentation

If you have fewer than 1,000 examples, use a capable model (e.g., GPT-4o or Claude 3.7) to generate synthetic training examples. Clearly label them and mix 20–30% synthetic with real examples for best results.

7. Checkpoint Every 50–100 Steps

Fine-tuning can become unstable (loss spike, NaN gradients). Always set save_steps=50 so you can roll back to a stable checkpoint.


Common Mistakes to Avoid

❌ Forgetting to Set pad_token

# Always do this before training:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

❌ Training on the Prompt (Input Masking)

For instruction fine-tuning, compute the loss only on the response tokens, not the instruction. SFTTrainer combined with DataCollatorForCompletionOnlyLM handles this:

from trl import DataCollatorForCompletionOnlyLM

response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    ...
    data_collator=collator,
)

❌ Using Too High a Learning Rate

Fine-tuning is sensitive to LR. Values above 5e-4 often cause catastrophic forgetting — the model "forgets" its pre-trained knowledge. Stick to 1e-4 to 3e-4 for LoRA.

❌ Ignoring Catastrophic Forgetting

Fine-tuning on a narrow domain can degrade performance on general tasks. Mitigate this by:

  • Mixing a small percentage (~5%) of general instruction data into your training set
  • Using a lower learning rate
  • Training for fewer epochs
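The ~5% mixing rule is simple to apply before training. A sketch with plain Python lists (swap in your real datasets; `datasets.interleave_datasets` with `probabilities` achieves the same thing at the library level):

```python
import random

random.seed(0)  # reproducible mixing

# Hypothetical pools: your curated domain data plus a general instruction set
domain = [{"text": f"domain example {i}"} for i in range(95)]
general = [{"text": f"general instruction example {i}"} for i in range(200)]

# Solve n / (len(domain) + n) = mix_ratio for the number of general examples
mix_ratio = 0.05
n_general = round(len(domain) * mix_ratio / (1 - mix_ratio))

mixed = domain + random.sample(general, n_general)
random.shuffle(mixed)  # shuffle so general examples are spread across batches
print(f"{len(mixed)} examples total, {n_general} general ({n_general / len(mixed):.0%})")
```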

❌ Not Validating Your Dataset Format

Always inspect a few examples after formatting to confirm the prompt template is correct:

print(formatted_dataset["train"][0]["text"])
# Manually verify the output looks right before starting a multi-hour training run

❌ Running Full Evaluation Only at the End

Use eval_strategy="steps" and eval_steps=50 to catch problems early. It's painful to discover overfitting after 8 hours of training.


Real-World Example: Structured Extraction from Legal Contracts

Here's a condensed, real-world scenario: fine-tuning Mistral 7B to extract structured information from legal contracts.

Dataset format:

{
  "instruction": "Extract the key dates and parties from this contract excerpt.",
  "input": "This Agreement is entered into as of January 15, 2026, between Acme Corp, a Delaware corporation ('Client'), and TechServ LLC, a California limited liability company ('Vendor')...",
  "output": "{\"effective_date\": \"2026-01-15\", \"parties\": [{\"name\": \"Acme Corp\", \"role\": \"Client\", \"jurisdiction\": \"Delaware\"}, {\"name\": \"TechServ LLC\", \"role\": \"Vendor\", \"jurisdiction\": \"California\"}]}"
}

Results after fine-tuning on 2,000 contract examples:

  • Extraction accuracy: 91.3% (vs. 67.8% with GPT-4o + prompt engineering)
  • Latency: 340ms average (running on a single A10G GPU via vLLM)
  • Cost: ~$0.0003 per document (vs. ~$0.04 with GPT-4o API)

The fine-tuned model also learned to always return valid JSON — something prompt engineering alone struggled to guarantee consistently.
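Valid-JSON rate is a cheap metric worth tracking for a task like this. A minimal check over a batch of hypothetical model outputs:

```python
import json

# Hypothetical outputs from an evaluation batch — the second wraps the JSON in prose
outputs = [
    '{"effective_date": "2026-01-15", "parties": []}',
    'Sure! Here is the JSON: {"effective_date": "2026-01-15"}',
    '{"effective_date": "2026-02-01", "parties": [{"name": "Acme Corp"}]}',
]

def is_valid_json(s):
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False

valid_rate = sum(is_valid_json(o) for o in outputs) / len(outputs)
print(f"Valid-JSON rate: {valid_rate:.0%}")
```

Run this on the base model, the prompt-engineered baseline, and the fine-tuned model to quantify exactly the consistency gain described above.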


📌 Key Takeaways

  • QLoRA is your default starting point — it makes fine-tuning accessible on a single consumer or cloud GPU. Full fine-tuning is rarely necessary.

  • Dataset quality trumps quantity — 1,000 clean, well-formatted examples will outperform 10,000 noisy ones every time.

  • Start with small experiments — test your pipeline on 100 examples first, then scale up. Catch formatting bugs before committing to an expensive training run.

  • Rank (r) is a lever, not a fixed setting — start at r=8 or r=16 and only increase if evaluation shows the model lacks capacity.

  • Always evaluate against a baseline — fine-tuning can regress general capability. Measure both task-specific and general performance.

  • LoRA adapters are lightweight assets — store them separately, version them like code, and share only the adapter weights (not the full model) for easy collaboration.

  • Inference efficiency matters — after fine-tuning, use merge_and_unload() for production, and serve with vLLM or llama.cpp for high-throughput, low-latency deployments.

  • Monitor training in real time — integrate Weights & Biases from day one. Blind training runs lead to wasted GPU hours.


Conclusion

Fine-tuning LLMs is no longer the exclusive domain of research labs. With QLoRA, Hugging Face PEFT, and the trl library, a developer with a single GPU can build a specialized model that outperforms much larger general-purpose models on a specific task — often at a fraction of the API cost.

The pipeline is straightforward: curate a high-quality dataset, load a quantized base model, attach LoRA adapters, train for a few epochs, evaluate rigorously, and deploy with an efficient serving framework. Each step has established best practices, and the open-source ecosystem has matured to the point where most sharp edges have been smoothed out.

The most important insight is this: the bottleneck is no longer compute — it's data. Invest your time in collecting and cleaning domain-specific examples, and the fine-tuning itself almost takes care of itself.

Start small, iterate quickly, measure everything, and ship.



Written by

Niraj Kumar

Software Developer — building scalable systems for businesses.