Fine-Tuning LLMs - A Practical Guide

Step-by-step guide to fine-tuning GPT, Llama, and Mistral models for your domain-specific use cases.


By Dr. Emily Watson

Large language models are powerful out-of-the-box, but domain-specific fine-tuning can dramatically improve performance for specialized applications. This guide shows you how to fine-tune models effectively.

When to Fine-Tune vs. Use RAG

Use RAG When:

  • Facts change frequently (news, documentation)
  • Need source attribution
  • Have limited training data
  • Want to avoid model retraining

Fine-Tune When:

  • Need specific writing style or tone
  • Domain has unique patterns (medical, legal, code)
  • Want to reduce model size/cost
  • Have 500+ high-quality examples

Fine-Tuning Methods

1. Full Fine-Tuning

Updates all model parameters. Most effective but most expensive.

Pros: Best performance, complete adaptation
Cons: Expensive, requires significant GPU memory, risk of overfitting

2. LoRA (Low-Rank Adaptation)

Recommended for most use cases

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# base_model is a causal LM loaded with AutoModelForCausalLM.from_pretrained(...)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: ~4M (~0.1% of total)

Pros: Only ~0.1% of parameters trainable, faster, cheaper
Cons: Slightly lower performance than full fine-tuning

3. QLoRA (Quantized LoRA)

For consumer hardware (e.g., a single RTX 3090/4090; bitsandbytes quantization requires a CUDA GPU)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
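QLoRA still trains LoRA adapters; after loading the 4-bit base model, you prepare it for training and attach the same lora_config from the previous section. A minimal sketch:

from peft import prepare_model_for_kbit_training, get_peft_model

# Casts norm layers to full precision and enables input gradients
model = prepare_model_for_kbit_training(model)

# Reuse the LoRA config defined in the LoRA section above
model = get_peft_model(model, lora_config)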

Data Preparation

Dataset Format

For Instruction Tuning:

{
  "instruction": "Explain this code snippet",
  "input": "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
  "output": "This function calculates the nth Fibonacci number using recursion..."
}

For Completion Tasks:

{
  "prompt": "SELECT * FROM users WHERE",
  "completion": " status='active' AND last_login > '2024-01-01'"
}
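Whichever format you pick, the SFTTrainer used later reads a single text field, so each record is usually rendered into one prompt string first. A minimal sketch with an illustrative Alpaca-style template (the exact template is your choice, but it must match what you use at inference time):

def format_example(example):
    # Render one instruction record into the single "text" field SFTTrainer reads
    if example.get("input"):
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}")
    else:
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}")
    return {"text": text}

train_data = train_data.map(format_example)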

Data Quality Checklist

✅ Minimum 500 examples for simple tasks
✅ 10K+ examples for complex reasoning
✅ Remove duplicates and near-duplicates (see the sketch after this list)
✅ Balance classes/categories
✅ Use domain-specific terminology
✅ Include edge cases
✅ Validate ground truth labels
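For the de-duplication step, exact duplicates are cheap to catch with a hash pass; a minimal sketch over a hypothetical train.jsonl (near-duplicate detection, e.g. MinHash, takes more work):

import hashlib
import json

seen, kept = set(), []
with open("train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Hash the canonicalized record to catch exact duplicates
        key = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(record)

with open("train_dedup.jsonl", "w") as f:
    for record in kept:
        f.write(json.dumps(record) + "\n")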

Fine-Tuning Process

Step 1: Choose Base Model

Open Models (free to fine-tune):

  • Llama 2 7B/13B/70B: Best open-source performance
  • Mistral 7B: Excellent for code generation
  • Falcon 7B/40B: Great for instruction following

Commercial Models (API fine-tuning):

  • GPT-3.5: Easiest, via OpenAI API
  • Claude: Via Anthropic API (coming soon)

Step 2: Training Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-2-7b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 16
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    fp16=True,  # Use mixed precision
    optim="paged_adamw_32bit"
)

Step 3: Training Loop

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
    packing=True  # Pack multiple examples per sequence
)

trainer.train()
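After training finishes, you can save just the small LoRA adapter, or merge it into the base weights for a standalone checkpoint (merging a QLoRA model requires reloading the base in higher precision first). A sketch, with illustrative paths:

# Save only the adapter weights (a few MB, applied on top of the base model)
trainer.model.save_pretrained("./llama-2-7b-finetuned/adapter")
tokenizer.save_pretrained("./llama-2-7b-finetuned/adapter")

# Or merge the adapter into the base model for a standalone checkpoint
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("./llama-2-7b-finetuned/merged")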

Evaluation

Automated Metrics

from evaluate import load

# Perplexity of the fine-tuned model over a list of held-out strings
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(
    model_id="./llama-2-7b-finetuned",
    predictions=test_texts
)

# BLEU for translation-style tasks (each entry in targets is a list of references)
bleu = load("bleu")
bleu_score = bleu.compute(predictions=preds, references=targets)

Human Evaluation

Review each test example against four criteria (a blind-review harness sketch follows this list):

  1. Factuality: Does it hallucinate?
  2. Relevance: Does it answer the question?
  3. Style: Is the tone appropriate?
  4. Safety: Does it produce harmful content?
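To keep these judgments unbiased, randomize which model produced each output before reviewers see it. A minimal sketch; generate_base and generate_finetuned are hypothetical stand-ins for your own inference calls:

import csv
import random

# Placeholder generators; swap in real inference calls (hypothetical names)
def generate_base(prompt):
    return "base model output for: " + prompt

def generate_finetuned(prompt):
    return "fine-tuned output for: " + prompt

prompts = ["Explain what a LoRA adapter is."]  # your held-out prompts

with open("blind_eval.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "output_a", "output_b", "a_is_finetuned"])
    for prompt in prompts:
        outputs = [generate_base(prompt), generate_finetuned(prompt)]
        a_is_finetuned = random.random() < 0.5
        if a_is_finetuned:
            outputs.reverse()  # fine-tuned output goes in column A
        writer.writerow([prompt, outputs[0], outputs[1], a_is_finetuned])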

Comparison Testing

Model                  | Accuracy | Latency | Cost/1K tokens
GPT-5 Base             | 82%      | 2.3s    | $0.03
GPT-5 Fine-tuned       | 94%      | 2.3s    | $0.03
Llama 2 7B Base        | 71%      | 0.4s    | $0.0001
Llama 2 7B Fine-tuned  | 89%      | 0.4s    | $0.0001

Deployment

Option 1: Hugging Face Inference

from huggingface_hub import HfApi

api = HfApi()
# Create the repo (idempotent), then push the fine-tuned weights
api.create_repo("your-org/llama-2-finetuned", exist_ok=True)
api.upload_folder(
    folder_path="./llama-2-7b-finetuned",
    repo_id="your-org/llama-2-finetuned",
    repo_type="model"
)
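Others can then pull the adapter and apply it to the base model locally; a sketch, assuming the repo contains LoRA adapter weights rather than a merged checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto"
)
model = PeftModel.from_pretrained(base, "your-org/llama-2-finetuned")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")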

Option 2: Self-Hosted with vLLM

# Continuous batching typically gives large throughput gains over naive inference
vllm serve ./llama-2-7b-finetuned \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9
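vllm serve exposes an OpenAI-compatible API (port 8000 by default), so the standard openai client works against it; a sketch:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.completions.create(
    model="./llama-2-7b-finetuned",  # must match the path passed to vllm serve
    prompt="SELECT * FROM users WHERE",
    max_tokens=64,
)
print(response.choices[0].text)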

Option 3: Serverless

import modal

stub = modal.Stub("llama-finetuned")
image = modal.Image.debian_slim().pip_install("vllm")

@stub.function(
    image=image,
    gpu="A10G",
    timeout=300
)
def generate(prompt: str):
    from vllm import LLM, SamplingParams
    # For a real deployment, bake the weights into the image or a modal.Volume;
    # loading on every call is shown here only for brevity
    llm = LLM(model="llama-2-7b-finetuned")
    outputs = llm.generate([prompt], SamplingParams(temperature=0.7))
    return outputs[0].outputs[0].text

Cost Optimization

Training Costs (Llama 2 7B, 10K examples):

Method            | GPU        | Time    | Cost
Full fine-tuning  | A100 80GB  | 4 hours | $40
LoRA              | A100 40GB  | 2 hours | $20
QLoRA (4-bit)     | RTX 3090   | 6 hours | Free (if you have the GPU)

Inference Costs (1M tokens/month):

Model                    | Hosting        | Monthly Cost
GPT-5                    | OpenAI API     | $30,000
Llama 2 7B (hosted)      | AWS p3.2xlarge | $3,000
Llama 2 7B (serverless)  | Modal/Lambda   | $1,500

Common Issues & Solutions

Issue: Overfitting

Symptoms: Perfect training performance, poor test performance

Solutions:

  • Reduce epochs
  • Increase dropout
  • Add more training data
  • Use early stopping (see the sketch below)
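transformers ships an early-stopping callback that plugs into the trainer from Step 3; a sketch (it requires load_best_model_at_end=True and metric_for_best_model="eval_loss" in the training arguments):

from transformers import EarlyStoppingCallback

# Stop training if eval loss fails to improve for 3 consecutive evaluations
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))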

Issue: Catastrophic Forgetting

Symptoms: Model loses general capabilities

Solutions:

  • Mix general and domain-specific data (see the sketch below)
  • Use lower learning rate
  • Regular evaluation on general benchmarks
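The datasets library can do this mixing directly; a sketch interleaving domain and general data at a fixed ratio (the 80/20 split is an illustrative starting point, and domain_dataset/general_dataset are assumed to be loaded Dataset objects):

from datasets import interleave_datasets

# Draw ~80% domain examples and ~20% general examples
mixed_dataset = interleave_datasets(
    [domain_dataset, general_dataset],
    probabilities=[0.8, 0.2],
    seed=42
)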

Issue: Training Instability

Symptoms: Loss spikes or NaN values

Solutions:

  • Reduce learning rate
  • Use gradient clipping (see the sketch below)
  • Increase batch size
  • Check data quality
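Gradient clipping and a lower learning rate are both single-line changes to the training arguments from Step 2; a sketch:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-2-7b-finetuned",
    max_grad_norm=1.0,   # clip gradients to a maximum global norm
    learning_rate=1e-4,  # half the earlier rate, if loss spikes persist
    # ...remaining arguments as in Step 2...
)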

Best Practices

  1. Start Small: Fine-tune 7B model before 70B
  2. Monitor Loss: Track training and validation loss
  3. Save Checkpoints: Keep best model, not just last
  4. Test Early: Evaluate after each epoch
  5. Version Control: Track data, hyperparameters, and code
  6. Document: Record what works and what doesn’t

Fine-Tuning as a Service

If fine-tuning sounds complex, consider managed services:

  • OpenAI Fine-Tuning API: Easiest, GPT-3.5 only
  • MosaicML: Training infrastructure + MPT models
  • Anyscale: Ray-based distributed training
  • Custom ML teams: For specialized requirements

Conclusion

Fine-tuning dramatically improves LLM performance for domain-specific tasks. Start with LoRA for cost-effective training, use quality datasets, and evaluate rigorously. The investment in fine-tuning pays off in better accuracy, lower latency, and reduced costs compared to larger general models.
