Fine-Tune AI Models on Your Own Hardware: The LoRA Guide for SMEs
“Fine-tuning” sounds like something that requires a GPU cluster and a machine learning team. In 2026, it requires a Mac Mini M4 and 90 minutes. The techniques that made this possible — LoRA and QLoRA — compress the training process so drastically that a model trained on your company’s specific data runs on the same hardware you’d use for inference.
This guide shows you exactly how it works, what it costs, and when it makes sense for your business.

What Fine-Tuning Actually Does
A pre-trained model like Qwen 2.5 7B or Gemma 3 4B knows a lot about everything. Fine-tuning teaches it to be exceptional at your specific task.
```mermaid
flowchart LR
    BASE["Base Model<br/>(General Knowledge)"] --> LORA["LoRA Training<br/>(Your Data, 90 min)"]
    LORA --> CUSTOM["Custom Model<br/>(Your Domain Expert)"]
    DATA["Your Training Data<br/>(500-5,000 examples)"] --> LORA
    style BASE fill:#1E293B,color:#FAFAFA
    style LORA fill:#F5A623,color:#0B1628
    style CUSTOM fill:#059669,color:#FAFAFA
```
Before fine-tuning: “Summarize this contract” → generic legal summary.
After fine-tuning on your firm’s contracts: “Summarize this contract” → a summary in your firm’s format, highlighting the clauses your lawyers care about, using your terminology.
LoRA vs QLoRA: The Techniques That Changed Everything
Traditional fine-tuning updates every parameter in the model — for a 7B model, that’s 7 billion numbers, which means ~14GB just to hold the weights in 16-bit precision, and well over 28GB once gradients and optimizer states are added. Impractical on consumer hardware.
LoRA (Low-Rank Adaptation) freezes the original model and trains only small “adapter” matrices — typically 0.1-1% of the total parameters. A 7B model’s LoRA adapter is ~10-50MB instead of 14GB.
QLoRA goes further by quantizing the base model to 4-bit precision during training, cutting memory requirements by another 50%.
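The adapter-size claim is easy to verify with back-of-the-envelope arithmetic. The sketch below uses illustrative numbers (a 4096×4096 projection matrix and rank 16 — typical values for a 7B-class model, not taken from any specific checkpoint):

```python
# LoRA replaces updates to a frozen d×k weight matrix W with two small
# trained matrices A (d×r) and B (r×k); only A and B get gradients.
d, k = 4096, 4096      # one attention projection (illustrative size)
r = 16                 # LoRA rank

full_params = d * k              # parameters touched by full fine-tuning
lora_params = d * r + r * k      # parameters trained by LoRA

print(full_params)   # 16777216 per matrix
print(lora_params)   # 131072 per matrix
print(round(lora_params / full_params * 100, 2))  # 0.78 (% of the full matrix)
```

At roughly 0.8% of the parameters per adapted matrix, the total adapter lands in the tens of megabytes — consistent with the 0.1-1% figure above.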
| Method | Quality vs Full | Memory Savings | Training Time | Best For |
|---|---|---|---|---|
| Full fine-tune | 100% | 0% | Hours-days | Research only |
| LoRA | 90-95% | ~70% | 60-90 min | Best quality on consumer HW |
| QLoRA | 80-90% | ~85% | 30-60 min | Production sweet spot |
| Prefix tuning | 70-80% | ~90% | 15-30 min | Very resource constrained |
For most business use cases, QLoRA’s 80-90% of full fine-tuning quality is hard to distinguish in practice, and the 85% memory savings mean you can train on hardware you already own.
What You Need: Hardware Requirements
| Your Hardware | Max Model Size | Training Time (5K examples) | Best Tool |
|---|---|---|---|
| Mac Mini M4 16GB | 7B (QLoRA) | ~90 min | MLX |
| Mac M3 Pro 32GB | 7B (LoRA) or 14B (QLoRA) | ~60-90 min | MLX |
| RTX 3080 10GB | 7B (QLoRA) | ~45 min | Unsloth |
| RTX 3090 24GB | 13B (QLoRA) | ~60 min | Unsloth |
| RTX 4090 24GB | 13B (LoRA) | ~30 min | Unsloth |
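A rough way to sanity-check this table yourself: 4-bit weights take ~0.5 bytes per parameter, plus a fixed budget for adapters, optimizer state, and activations. The 4GB overhead below is an assumed round number (real usage also depends on batch size and sequence length):

```python
# Rough QLoRA fit check: 4-bit base weights + an assumed overhead budget
# for LoRA adapters, gradients, optimizer state, and activations.
def qlora_fits(params_billion: float, ram_gb: float, overhead_gb: float = 4.0) -> bool:
    weights_gb = params_billion * 0.5   # 4 bits = 0.5 bytes per parameter
    return weights_gb + overhead_gb <= ram_gb

print(qlora_fits(7, 16))    # True  — matches the Mac Mini M4 16GB row
print(qlora_fits(13, 10))   # False — 13B won't fit an RTX 3080
print(qlora_fits(13, 24))   # True  — matches the RTX 3090 24GB row
```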
At VORLUX AI, we use our Mac M3 Pro (32GB) for client model customization and an RTX 3080 for GPU-accelerated training. Total hardware cost: what we already own.
Step-by-Step: Fine-Tune on Mac with MLX
Apple’s MLX framework makes fine-tuning native on Apple Silicon:
```bash
# Install MLX-LM
pip install mlx-lm

# Prepare your training data (JSONL format).
# mlx_lm.lora expects a data directory containing train.jsonl and valid.jsonl.
mkdir -p data
cat > data/train.jsonl << 'EOF'
{"prompt": "Summarize this contract clause:", "completion": "This clause establishes..."}
{"prompt": "Extract the payment terms:", "completion": "Payment is due within..."}
EOF
cp data/train.jsonl data/valid.jsonl  # use real held-out examples in practice

# Fine-tune with LoRA
python -m mlx_lm.lora \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --train \
  --data ./data \
  --batch-size 2 \
  --iters 500 \
  --adapter-path ./my-custom-adapter

# Test your fine-tuned model
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --adapter-path ./my-custom-adapter \
  --prompt "Summarize this contract clause: ..."
```
The adapter file is ~20-50MB. The base model stays unchanged. You can swap adapters for different tasks without downloading new models.
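Most failed training runs trace back to malformed data, so it can save time to validate the JSONL before launching. A minimal checker matching the prompt/completion format used above (the function name is our own, not part of MLX):

```python
import json

def validate_jsonl(path: str) -> int:
    """Check that each line parses as JSON with non-empty prompt/completion fields."""
    count = 0
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError on malformed JSON
            for field in ("prompt", "completion"):
                if not record.get(field, "").strip():
                    raise ValueError(f"line {i}: missing or empty {field!r}")
            count += 1
    return count
```

Run it as `validate_jsonl("data/train.jsonl")`; it returns the example count or raises with the offending line number.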
Step-by-Step: Fine-Tune on NVIDIA GPU with Unsloth
For GPU-accelerated training on Windows or Linux hardware:
```bash
# Install Unsloth (fast QLoRA library)
pip install unsloth
```

```python
# train.py — QLoRA fine-tuning with Unsloth
from unsloth import FastLanguageModel
from trl import SFTTrainer

# Load the base model pre-quantized to 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projection layers
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Train on your data (your_dataset: a Hugging Face Dataset of formatted examples)
trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=your_dataset)
trainer.train()

# Export to GGUF for Ollama
model.save_pretrained_gguf("./output", tokenizer, quantization_method="q4_k_m")
```
The final step — exporting to GGUF — means your fine-tuned model runs directly in Ollama. Same deployment, same infrastructure, just better at your specific task.
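Serving the exported model through Ollama takes a two-line Modelfile. The GGUF filename below is an assumption — use whatever `save_pretrained_gguf` actually wrote into `./output`:

```shell
# Modelfile pointing at the exported GGUF (adjust the path to the real file)
cat > Modelfile << 'EOF'
FROM ./output/unsloth.Q4_K_M.gguf
EOF

# Register the model with Ollama and query it
ollama create my-custom-model -f Modelfile
ollama run my-custom-model "Summarize this contract clause: ..."
```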
When Fine-Tuning Makes Sense (And When It Doesn’t)
| Scenario | Fine-Tune? | Why |
|---|---|---|
| “Answer questions about our product catalog” | No — use RAG | RAG retrieves current data; fine-tuning bakes in stale data |
| “Write emails in our brand voice” | Yes | Style and tone are learned through examples |
| “Classify support tickets into 12 categories” | Yes | Domain-specific classification improves dramatically |
| “Extract structured data from our invoice format” | Yes | Consistent extraction patterns are trainable |
| “Summarize contracts in our template format” | Yes | Output format is a fine-tuning strength |
| “Answer general questions” | No | Base models already handle this well |
Rule of thumb: Fine-tune when the format or style of the output matters. Use RAG when the data needs to be current.
The Economics
| Cost Item | Fine-Tuning | Cloud API Training |
|---|---|---|
| Hardware | EUR 0 (use existing) | N/A |
| Training compute | EUR 0.50 (electricity) | EUR 50-500 per run |
| Training time | 30-90 minutes | 1-4 hours |
| Per-inference cost | EUR 0 | EUR 0.01-0.10 per query |
| Data privacy | 100% local | Data sent to provider |
| Iterations | Unlimited, free | Each run costs money |
The ability to iterate freely is the hidden advantage. With cloud training, every experiment costs money. With local training, you can run 50 experiments in a day at zero marginal cost, finding the optimal dataset and parameters for your use case.
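The electricity figure is easy to reproduce. The numbers below are assumptions (a ~200W average draw during training and a EUR 0.30/kWh tariff) — even so, the result lands well under the table’s conservative EUR 0.50:

```python
# Marginal electricity cost of one local training run, and of 50 experiments
power_kw = 0.2          # assumed average system draw during training
hours = 1.5             # one ~90-minute run
price_per_kwh = 0.30    # assumed EUR tariff

cost_per_run = power_kw * hours * price_per_kwh
print(round(cost_per_run, 2))        # 0.09  — EUR per run
print(round(50 * cost_per_run, 2))   # 4.5   — EUR for 50 experiments
```

Fifty local experiments cost less than a single typical cloud training run.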
What We Offer
At VORLUX AI, fine-tuning is available as an add-on to our Edge AI deployment:
- Data preparation: We help structure your training examples (typically 500-5,000 samples)
- Model selection: Choose the right base model for your task and hardware
- Training: LoRA/QLoRA fine-tuning on our hardware or yours
- Evaluation: Test the fine-tuned model against your quality criteria
- Deployment: Export to Ollama and integrate with your existing workflows
The fine-tuned model runs on the same Mac Mini as your base deployment. No additional hardware needed.
Want a model that speaks your business language? Schedule a free 15-minute assessment to discuss whether fine-tuning makes sense for your use case.
Related: Quantization Guide | Best Local LLMs | Hardware Guide | n8n RAG Pipeline
Sources: LoRA on Apple Silicon (Towards Data Science) | LoRA & QLoRA 2026 Guide | MLX Apple Silicon Guide | MLX-LM Fine-Tuning
Related reading
- Edge AI Hardware Guide 2026: Jetson vs Mac Mini vs NUC — Real Specs, Real Costs
- Quantization Explained: How to Run 70B AI Models on a €700 Mac Mini
- NPU vs GPU: Why Neural Processing Units Are the Future of Edge AI
Ready to Get Started?
VORLUX AI helps Spanish and European businesses deploy AI solutions that stay on your hardware, under your control. Whether you need edge AI deployment, LMS integration, or EU AI Act compliance consulting — we can help.
Book a free discovery call to discuss your AI strategy, or explore our services to see how we work.