Fine-Tuning LLMs - A Practical Guide

Step-by-step guide to fine-tuning GPT, Llama, and Mistral models for your domain-specific use cases.


By Dr. Emily Watson

Large language models are powerful out-of-the-box, but domain-specific fine-tuning can dramatically improve performance for specialized applications. This guide shows you how to fine-tune models effectively.

When to Fine-Tune vs. Use RAG

Use RAG When:

  • Facts change frequently (news, documentation)
  • Need source attribution
  • Have limited training data
  • Want to avoid model retraining

Fine-Tune When:

  • Need specific writing style or tone
  • Domain has unique patterns (medical, legal, code)
  • Want to reduce model size/cost
  • Have 500+ high-quality examples

Fine-Tuning Methods

1. Full Fine-Tuning

Updates all model parameters. Most effective but most expensive.

Pros: Best performance, complete adaptation
Cons: Expensive, requires significant GPU memory, risk of overfitting

2. LoRA (Low-Rank Adaptation)

Recommended for most use cases

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# base_model is a causal LM loaded with AutoModelForCausalLM.from_pretrained(...)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: ~4M (~0.1% of total)

Pros: Only ~0.1% of parameters trainable, faster, cheaper
Cons: Slightly lower performance than full fine-tuning

3. QLoRA (Quantized LoRA)

For consumer hardware (e.g., a single RTX 3090/4090; bitsandbytes quantization requires a CUDA GPU)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
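QLoRA still trains LoRA adapters; after loading the 4-bit base model, you prepare it for training and attach the same lora_config from the previous section. A minimal sketch:

from peft import prepare_model_for_kbit_training, get_peft_model

# Casts norm layers to full precision and enables input gradients
model = prepare_model_for_kbit_training(model)

# Reuse the LoRA config defined in the LoRA section above
model = get_peft_model(model, lora_config)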

Data Preparation

Dataset Format

For Instruction Tuning:

{
  "instruction": "Explain this code snippet",
  "input": "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
  "output": "This function calculates the nth Fibonacci number using recursion..."
}

For Completion Tasks:

{
  "prompt": "SELECT * FROM users WHERE",
  "completion": " status='active' AND last_login > '2024-01-01'"
}
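Whichever format you pick, the SFTTrainer used later reads a single text field, so each record is usually rendered into one prompt string first. A minimal sketch with an illustrative Alpaca-style template (the exact template is your choice, but it must match what you use at inference time):

def format_example(example):
    # Render one instruction record into the single "text" field SFTTrainer reads
    if example.get("input"):
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}")
    else:
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}")
    return {"text": text}

train_data = train_data.map(format_example)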

Data Quality Checklist

✅ Minimum 500 examples for simple tasks
✅ 10K+ examples for complex reasoning
✅ Remove duplicates and near-duplicates (see the sketch after this list)
✅ Balance classes/categories
✅ Use domain-specific terminology
✅ Include edge cases
✅ Validate ground truth labels
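For the de-duplication step, exact duplicates are cheap to catch with a hash pass; a minimal sketch over a hypothetical train.jsonl (near-duplicate detection, e.g. MinHash, takes more work):

import hashlib
import json

seen, kept = set(), []
with open("train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Hash the canonicalized record to catch exact duplicates
        key = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(record)

with open("train_dedup.jsonl", "w") as f:
    for record in kept:
        f.write(json.dumps(record) + "\n")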

Fine-Tuning Process

Step 1: Choose Base Model

Open Models (free to fine-tune):

  • Llama 2 7B/13B/70B: Best open-source performance
  • Mistral 7B: Excellent for code generation
  • Falcon 7B/40B: Great for instruction following

Commercial Models (API fine-tuning):

  • GPT-3.5: Easiest, via OpenAI API
  • Claude: Via Anthropic API (coming soon)

Step 2: Training Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-2-7b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 16
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    fp16=True,  # Use mixed precision
    optim="paged_adamw_32bit"
)

Step 3: Training Loop

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
    packing=True  # Pack multiple examples per sequence
)

trainer.train()
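After training finishes, you can save just the small LoRA adapter, or merge it into the base weights for a standalone checkpoint (merging a QLoRA model requires reloading the base in higher precision first). A sketch, with illustrative paths:

# Save only the adapter weights (a few MB, applied on top of the base model)
trainer.model.save_pretrained("./llama-2-7b-finetuned/adapter")
tokenizer.save_pretrained("./llama-2-7b-finetuned/adapter")

# Or merge the adapter into the base model for a standalone checkpoint
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("./llama-2-7b-finetuned/merged")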

Evaluation

Automated Metrics

from evaluate import load

# Perplexity of the fine-tuned model over a list of held-out strings
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(
    model_id="./llama-2-7b-finetuned",
    predictions=test_texts
)

# BLEU for translation-style tasks (each entry in targets is a list of references)
bleu = load("bleu")
bleu_score = bleu.compute(predictions=preds, references=targets)

Human Evaluation

Review each test example against four criteria (a blind-review harness sketch follows this list):

  1. Factuality: Does it hallucinate?
  2. Relevance: Does it answer the question?
  3. Style: Is the tone appropriate?
  4. Safety: Does it produce harmful content?
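To keep these judgments unbiased, randomize which model produced each output before reviewers see it. A minimal sketch; generate_base and generate_finetuned are hypothetical stand-ins for your own inference calls:

import csv
import random

# Placeholder generators; swap in real inference calls (hypothetical names)
def generate_base(prompt):
    return "base model output for: " + prompt

def generate_finetuned(prompt):
    return "fine-tuned output for: " + prompt

prompts = ["Explain what a LoRA adapter is."]  # your held-out prompts

with open("blind_eval.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "output_a", "output_b", "a_is_finetuned"])
    for prompt in prompts:
        outputs = [generate_base(prompt), generate_finetuned(prompt)]
        a_is_finetuned = random.random() < 0.5
        if a_is_finetuned:
            outputs.reverse()  # fine-tuned output goes in column A
        writer.writerow([prompt, outputs[0], outputs[1], a_is_finetuned])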

Comparison Testing

Model                  | Accuracy | Latency | Cost/1K tokens
GPT-5 Base             | 82%      | 2.3s    | $0.03
GPT-5 Fine-tuned       | 94%      | 2.3s    | $0.03
Llama 2 7B Base        | 71%      | 0.4s    | $0.0001
Llama 2 7B Fine-tuned  | 89%      | 0.4s    | $0.0001

Deployment

Option 1: Hugging Face Inference

from huggingface_hub import HfApi

api = HfApi()
# Create the repo (idempotent), then push the fine-tuned weights
api.create_repo("your-org/llama-2-finetuned", exist_ok=True)
api.upload_folder(
    folder_path="./llama-2-7b-finetuned",
    repo_id="your-org/llama-2-finetuned",
    repo_type="model"
)
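Others can then pull the adapter and apply it to the base model locally; a sketch, assuming the repo contains LoRA adapter weights rather than a merged checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", device_map="auto"
)
model = PeftModel.from_pretrained(base, "your-org/llama-2-finetuned")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")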

Option 2: Self-Hosted with vLLM

# Continuous batching typically gives large throughput gains over naive inference
vllm serve ./llama-2-7b-finetuned \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9
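vllm serve exposes an OpenAI-compatible API (port 8000 by default), so the standard openai client works against it; a sketch:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.completions.create(
    model="./llama-2-7b-finetuned",  # must match the path passed to vllm serve
    prompt="SELECT * FROM users WHERE",
    max_tokens=64,
)
print(response.choices[0].text)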

Option 3: Serverless

import modal

stub = modal.Stub("llama-finetuned")
image = modal.Image.debian_slim().pip_install("vllm")

@stub.function(
    image=image,
    gpu="A10G",
    timeout=300
)
def generate(prompt: str):
    from vllm import LLM, SamplingParams
    # For a real deployment, bake the weights into the image or a modal.Volume;
    # loading on every call is shown here only for brevity
    llm = LLM(model="llama-2-7b-finetuned")
    outputs = llm.generate([prompt], SamplingParams(temperature=0.7))
    return outputs[0].outputs[0].text

Cost Optimization

Training Costs (Llama 2 7B, 10K examples):

Method            | GPU        | Time    | Cost
Full fine-tuning  | A100 80GB  | 4 hours | $40
LoRA              | A100 40GB  | 2 hours | $20
QLoRA (4-bit)     | RTX 3090   | 6 hours | Free (if you have the GPU)

Inference Costs (1M tokens/month):

Model                    | Hosting        | Monthly Cost
GPT-5                    | OpenAI API     | $30,000
Llama 2 7B (hosted)      | AWS p3.2xlarge | $3,000
Llama 2 7B (serverless)  | Modal/Lambda   | $1,500

Common Issues & Solutions

Issue: Overfitting

Symptoms: Perfect training performance, poor test performance

Solutions:

  • Reduce epochs
  • Increase dropout
  • Add more training data
  • Use early stopping (see the sketch below)
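transformers ships an early-stopping callback that plugs into the trainer from Step 3; a sketch (it requires load_best_model_at_end=True and metric_for_best_model="eval_loss" in the training arguments):

from transformers import EarlyStoppingCallback

# Stop training if eval loss fails to improve for 3 consecutive evaluations
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))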

Issue: Catastrophic Forgetting

Symptoms: Model loses general capabilities

Solutions:

  • Mix general and domain-specific data (see the sketch below)
  • Use lower learning rate
  • Regular evaluation on general benchmarks
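The datasets library can do this mixing directly; a sketch interleaving domain and general data at a fixed ratio (the 80/20 split is an illustrative starting point, and domain_dataset/general_dataset are assumed to be loaded Dataset objects):

from datasets import interleave_datasets

# Draw ~80% domain examples and ~20% general examples
mixed_dataset = interleave_datasets(
    [domain_dataset, general_dataset],
    probabilities=[0.8, 0.2],
    seed=42
)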

Issue: Training Instability

Symptoms: Loss spikes or NaN values

Solutions:

  • Reduce learning rate
  • Use gradient clipping (see the sketch below)
  • Increase batch size
  • Check data quality
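Gradient clipping and a lower learning rate are both single-line changes to the training arguments from Step 2; a sketch:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-2-7b-finetuned",
    max_grad_norm=1.0,   # clip gradients to a maximum global norm
    learning_rate=1e-4,  # half the earlier rate, if loss spikes persist
    # ...remaining arguments as in Step 2...
)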

Best Practices

  1. Start Small: Fine-tune 7B model before 70B
  2. Monitor Loss: Track training and validation loss
  3. Save Checkpoints: Keep best model, not just last
  4. Test Early: Evaluate after each epoch
  5. Version Control: Track data, hyperparameters, and code
  6. Document: Record what works and what doesn’t

Fine-Tuning as a Service

If fine-tuning sounds complex, consider managed services:

  • OpenAI Fine-Tuning API: Easiest, GPT-3.5 only
  • MosaicML: Training infrastructure + MPT models
  • Anyscale: Ray-based distributed training
  • Custom ML teams: For specialized requirements

Conclusion

Fine-tuning dramatically improves LLM performance for domain-specific tasks. Start with LoRA for cost-effective training, use quality datasets, and evaluate rigorously. The investment in fine-tuning pays off in better accuracy, lower latency, and reduced costs compared to larger general models.
