Fine-Tuning LLMs - A Practical Guide
Step-by-step guide to fine-tuning GPT, Llama, and Mistral models for your domain-specific use cases.
By Dr. Emily Watson
Large language models are powerful out-of-the-box, but domain-specific fine-tuning can dramatically improve performance for specialized applications. This guide shows you how to fine-tune models effectively.
When to Fine-Tune vs. Use RAG
Use RAG When:
- Facts change frequently (news, documentation)
- Need source attribution
- Have limited training data
- Want to avoid model retraining
Fine-Tune When:
- Need specific writing style or tone
- Domain has unique patterns (medical, legal, code)
- Want to reduce model size/cost
- Have 500+ high-quality examples
Fine-Tuning Methods
1. Full Fine-Tuning
Updates all model parameters. Most effective but most expensive.
Pros: Best performance, complete adaptation
Cons: Expensive, requires significant GPU memory, risk of overfitting
2. LoRA (Low-Rank Adaptation)
LoRA freezes the base model and trains small low-rank adapter matrices on top of selected layers. Recommended for most use cases.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model (any causal LM works here)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # Rank of the update matrices
    lora_alpha=32,                        # Scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # Attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: ~4M (~0.1% of total)
Pros: Only ~0.1% of parameters trainable, faster, cheaper
Cons: Slightly lower performance than full fine-tuning
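Once a LoRA run has finished, you can fold the adapter back into the base weights to get a standalone checkpoint that loads with plain AutoModelForCausalLM, which is handy for the evaluation and deployment steps later. A minimal sketch using PEFT's merge API (the output path is a hypothetical choice):

# Fold the LoRA deltas into the base weights and drop the PEFT wrappers
merged = model.merge_and_unload()
merged.save_pretrained("./llama-2-7b-merged")  # standalone checkpoint, no peft needed to load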
3. QLoRA (Quantized LoRA)
For consumer hardware (e.g., an RTX 3090; note that the bitsandbytes library used below requires a CUDA GPU)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16  # Compute in bf16 while weights stay 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
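Before attaching adapters, PEFT's prepare_model_for_kbit_training helper puts the quantized model into a trainable state (casting norms and enabling input gradients). A short continuation of the block above, reusing the adapter setup from the LoRA section:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Make the 4-bit model safe to train (norm/embedding casting, input grads)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)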
Data Preparation
Dataset Format
For Instruction Tuning:
{
  "instruction": "Explain this code snippet",
  "input": "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
  "output": "This function calculates the nth Fibonacci number using recursion..."
}
For Completion Tasks:
{
  "prompt": "SELECT * FROM users WHERE",
  "completion": " status='active' AND last_login > '2024-01-01'"
}
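SFTTrainer (used in Step 3 below) reads a single text column, so instruction records need to be flattened into one string first. A minimal sketch, assuming an Alpaca-style template and a Hugging Face datasets.Dataset; the template itself is a free choice:

def to_text(example):
    # Collapse one instruction record into the "text" field that
    # SFTTrainer reads via dataset_text_field="text"
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example.get("input"):
        prompt += f"### Input:\n{example['input']}\n"
    prompt += f"### Response:\n{example['output']}"
    return {"text": prompt}

train_data = train_data.map(to_text)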
Data Quality Checklist
✅ Minimum 500 examples for simple tasks
✅ 10K+ examples for complex reasoning
✅ Remove duplicates and near-duplicates (see the sketch after this list)
✅ Balance classes/categories
✅ Use domain-specific terminology
✅ Include edge cases
✅ Validate ground truth labels
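For the deduplication item, exact duplicates are cheap to catch with a normalized hash; genuine near-duplicate detection needs heavier machinery (e.g., MinHash), so treat this as a starting point. A sketch assuming the instruction-format records shown earlier:

import hashlib

def dedupe(records):
    # Drop exact duplicates after whitespace/case normalization;
    # keys on the "output" field from the instruction format above
    seen, out = set(), []
    for r in records:
        key = hashlib.md5(" ".join(r["output"].lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out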
Fine-Tuning Process
Step 1: Choose Base Model
Open Models (free to fine-tune):
- Llama 2 7B/13B/70B: Best open-source performance
- Mistral 7B: Excellent for code generation
- Falcon 7B/40B: Great for instruction following
Commercial Models (API fine-tuning):
- GPT-3.5: Easiest, via OpenAI API
- Claude: Via Anthropic API (coming soon)
Step 2: Training Configuration
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-2-7b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size = 4 x 4 = 16
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    fp16=True,                       # Mixed-precision training
    optim="paged_adamw_32bit"
)
Step 3: Training Loop
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
    packing=True   # Pack multiple short examples into each sequence
)
trainer.train()
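When the run completes, persist the adapter and tokenizer so the checkpoint is self-contained; a minimal sketch:

# Saves the LoRA adapter weights (a few MB) into the output directory
trainer.save_model("./llama-2-7b-finetuned")
tokenizer.save_pretrained("./llama-2-7b-finetuned")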
Evaluation
Automated Metrics
from evaluate import load

# Perplexity of the fine-tuned model on held-out text
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(
    model_id="./llama-2-7b-merged",               # path or hub id of the model to score
    predictions=[ex["text"] for ex in test_data]  # raw strings, not a Dataset object
)

# BLEU for translation-style tasks
bleu = load("bleu")
bleu_score = bleu.compute(predictions=preds, references=targets)
Human Evaluation
Score a held-out set of test examples along these axes:
- Factuality: Does it hallucinate?
- Relevance: Does it answer the question?
- Style: Is the tone appropriate?
- Safety: Does it produce harmful content?
Comparison Testing
| Model | Accuracy | Latency | Cost/1K tokens |
|---|---|---|---|
| GPT-5 Base | 82% | 2.3s | $0.03 |
| GPT-5 Fine-tuned | 94% | 2.3s | $0.03 |
| Llama 2 7B Base | 71% | 0.4s | $0.0001 |
| Llama 2 7B Fine-tuned | 89% | 0.4s | $0.0001 |
Deployment
Option 1: Hugging Face Inference
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="your-org/llama-2-finetuned", exist_ok=True)  # repo must exist first
commit_url = api.upload_folder(
    folder_path="./llama-2-7b-finetuned",
    repo_id="your-org/llama-2-finetuned",
    repo_type="model"
)
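Once uploaded, the model can be pulled straight from the Hub; a quick sanity check, assuming merged (not adapter-only) weights were uploaded:

from transformers import pipeline

generator = pipeline("text-generation", model="your-org/llama-2-finetuned")
print(generator("Explain this code snippet: x = [i**2 for i in range(5)]",
                max_new_tokens=64)[0]["generated_text"])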
Option 2: Self-Hosted with vLLM
# Often dramatically higher throughput than naive Transformers inference
vllm serve ./llama-2-7b-finetuned \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9
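vllm serve exposes an OpenAI-compatible API (port 8000 by default), so any OpenAI client can talk to it. A minimal sketch using the openai package; the dummy api_key assumes a default unauthenticated local setup:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.completions.create(
    model="./llama-2-7b-finetuned",   # must match the path passed to vllm serve
    prompt="SELECT * FROM users WHERE",
    max_tokens=64,
)
print(resp.choices[0].text)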
Option 3: Serverless
import modal

stub = modal.Stub("llama-finetuned")
image = modal.Image.debian_slim().pip_install("vllm")

@stub.function(
    image=image,
    gpu="A10G",
    timeout=300
)
def generate(prompt: str):
    from vllm import LLM, SamplingParams
    # Note: the model weights must be available inside the container,
    # e.g., baked into the image or mounted via a modal.Volume
    llm = LLM(model="llama-2-7b-finetuned")
    outputs = llm.generate([prompt], SamplingParams(temperature=0.7))
    return outputs[0].outputs[0].text
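Invoking the function then looks like a normal Python call; note that the exact invocation method has shifted across Modal releases (.call() in older versions, .remote() in newer ones), so check the version you have installed:

# Run the function on Modal's remote GPU from a local script
with stub.run():
    print(generate.remote("Explain this code snippet: def f(x): return x * 2"))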
Cost Optimization
Training Costs (Llama 2 7B, 10K examples):
| Method | GPU | Time | Cost |
|---|---|---|---|
| Full fine-tuning | A100 80GB | 4 hours | $40 |
| LoRA | A100 40GB | 2 hours | $20 |
| QLoRA (4-bit) | RTX 3090 | 6 hours | Free (if you have GPU) |
Inference Costs (1M tokens/month):
| Model | Hosting | Monthly Cost |
|---|---|---|
| GPT-5 | OpenAI API | $30,000 |
| Llama 2 7B (hosted) | AWS p3.2xlarge | $3,000 |
| Llama 2 7B (serverless) | Modal/Lambda | $1,500 |
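The hosted figure is dominated by keeping a GPU up around the clock, which you can sanity-check with back-of-the-envelope arithmetic; the hourly rate below is an assumed on-demand price, not a quote:

hours_per_month = 24 * 30
gpu_price_per_hour = 3.06   # assumed p3.2xlarge on-demand $/hour
print(f"Always-on hosting: ${hours_per_month * gpu_price_per_hour:,.0f}/month")
# ~ $2,200 before storage/egress, consistent with the ~$3,000 row above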
Common Issues & Solutions
Issue: Overfitting
Symptoms: Perfect training performance, poor test performance
Solutions:
- Reduce epochs
- Increase dropout
- Add more training data
- Use early stopping
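For the early-stopping item, transformers ships a ready-made callback; a minimal sketch (it requires load_best_model_at_end=True and metric_for_best_model set in TrainingArguments):

from transformers import EarlyStoppingCallback

# Stop when eval loss hasn't improved for 3 consecutive evaluations
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))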
Issue: Catastrophic Forgetting
Symptoms: Model loses general capabilities
Solutions:
- Mix general and domain-specific data (see the sketch after this list)
- Use lower learning rate
- Regular evaluation on general benchmarks
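For data mixing, the datasets library can interleave sources at fixed ratios; a sketch, where domain_data and general_data are placeholder Dataset objects:

from datasets import interleave_datasets

# Keep ~20% general-purpose text in the stream to preserve broad skills
mixed = interleave_datasets(
    [domain_data, general_data],
    probabilities=[0.8, 0.2],
    seed=42
)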
Issue: Training Instability
Symptoms: Loss spikes or NaN values
Solutions:
- Reduce learning rate
- Use gradient clipping (see the sketch after this list)
- Increase batch size
- Check data quality
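Gradient clipping is controlled by max_grad_norm in TrainingArguments (it defaults to 1.0), so tightening it is a one-line change:

training_args = TrainingArguments(
    output_dir="./llama-2-7b-finetuned",
    max_grad_norm=0.5,    # tighter than the 1.0 default; clips exploding gradients
    learning_rate=1e-4,   # halved from the earlier 2e-4
    # ...remaining arguments as in Step 2
)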
Best Practices
- Start Small: Fine-tune 7B model before 70B
- Monitor Loss: Track training and validation loss
- Save Checkpoints: Keep best model, not just last
- Test Early: Evaluate after each epoch
- Version Control: Track data, hyperparameters, and code
- Document: Record what works and what doesn’t
Fine-Tuning as a Service
If fine-tuning sounds complex, consider managed services:
- OpenAI Fine-Tuning API: Easiest, GPT-3.5 only
- MosaicML: Training infrastructure + MPT models
- Anyscale: Ray-based distributed training
- Custom ML teams: For specialized requirements
Conclusion
Fine-tuning dramatically improves LLM performance for domain-specific tasks. Start with LoRA for cost-effective training, use quality datasets, and evaluate rigorously. The investment in fine-tuning pays off in better accuracy, lower latency, and reduced costs compared to larger general models.