# DeepSeek V3: A 671B Open-Weight Giant That Beats GPT-4o on Code and Math
When DeepSeek V3 appeared, it changed the conversation about what open-weight models can achieve. Built by the Chinese AI lab DeepSeek and trained on 14.8 trillion tokens in 2.788 million H800 GPU hours, the model matches or beats GPT-4o on multiple hard benchmarks, particularly coding and mathematics. The catch? At 671 billion parameters, it is far too large for any local workstation. Let us be upfront about what this model delivers, what it costs to run, and what European businesses should consider before adopting it.

```mermaid
flowchart LR
    INPUT["Input Token"] --> ROUTER["Router Network"]
    ROUTER --> E1["Expert 1"]
    ROUTER --> E2["Expert 2"]
    ROUTER -.->|inactive| E3["Expert 3"]
    ROUTER -.->|inactive| E4["..."]
    ROUTER -.->|inactive| E256["Expert 256"]
    E1 --> COMBINE["Combine Outputs"]
    E2 --> COMBINE
    subgraph TOTAL ["671B Total Parameters"]
        ROUTER
        E3
        E4
        E256
        subgraph ACTIVE ["37B Active per Token"]
            E1
            E2
        end
    end
    COMBINE --> MTP["Multi-Token<br/>Prediction"]
    MTP --> OUTPUT["Output Tokens"]
    style INPUT fill:#DBEAFE,stroke:#2563EB,color:#000
    style ROUTER fill:#FEF3C7,stroke:#F5A623,color:#000
    style E1 fill:#D1FAE5,stroke:#059669,color:#000
    style E2 fill:#D1FAE5,stroke:#059669,color:#000
    style E3 fill:#FECACA,stroke:#B91C1C,color:#000
    style E4 fill:#FECACA,stroke:#B91C1C,color:#000
    style E256 fill:#FECACA,stroke:#B91C1C,color:#000
    style COMBINE fill:#FEF3C7,stroke:#F5A623,color:#000
    style MTP fill:#DBEAFE,stroke:#2563EB,color:#000
    style OUTPUT fill:#D1FAE5,stroke:#059669,color:#000
```
## How Mixture-of-Experts works (in plain terms)
Most language models are “dense” — every parameter participates in every computation. DeepSeek V3 uses a different approach called Mixture-of-Experts (MoE). Think of it like a company with 671 billion employees where, for any given task, only 37 billion of them actually work on it. A routing network decides which “expert” sub-networks handle each token (in DeepSeek V3, 8 out of 256 routed experts per MoE layer, plus a shared expert that always runs).
The result: you get the knowledge depth of a 671B model with inference cost closer to that of a 37B dense model. DeepSeek also introduced Multi-Token Prediction (MTP) for faster generation and FP8 mixed-precision training to keep costs down. The model supports a 128K context window, which is generous for long-document tasks.
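The routing idea can be sketched in a few lines of Python. This is a toy illustration, not DeepSeek's actual router: real routers are learned layers over the token's hidden state, and V3 selects 8 of 256 experts per layer rather than 2 of 8 as here.

```python
import math
import random

random.seed(0)
NUM_EXPERTS = 8  # illustrative only; DeepSeek V3 has far more routed experts
TOP_K = 2        # how many experts actually run per token

def softmax(xs):
    """Turn raw router scores into gating weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy experts: each scalar function stands in for a feed-forward sub-network.
experts = [lambda x, w=w: w * x for w in range(1, NUM_EXPERTS + 1)]

def route(token_repr):
    # The router scores every expert, but only the top-k are executed.
    scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in for a learned router
    gates = softmax(scores)
    top = sorted(range(NUM_EXPERTS), key=lambda i: gates[i], reverse=True)[:TOP_K]
    # Combine the selected experts' outputs, weighted by renormalised gates.
    norm = sum(gates[i] for i in top)
    return sum(gates[i] / norm * experts[i](token_repr) for i in top)

print(route(1.0))
```

The key efficiency property is visible even in the toy: all eight experts exist in memory, but only two run per token, so compute scales with TOP_K rather than NUM_EXPERTS.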
## Real benchmarks — honest numbers
These numbers come directly from the official HuggingFace model card and the DeepSeek technical report.
| Benchmark | DeepSeek V3 | GPT-4o | Claude 3.5 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|
| MMLU (Chat) | 88.5% | ~88% | ~89% | ~85% |
| MMLU (Base, 5-shot) | 87.1% | — | — | ~84% |
| MMLU-Pro | 75.9% | ~73% | ~75% | ~68% |
| GPQA Diamond | 59.1% | ~53% | ~60% | ~50% |
| MATH-500 | 90.2% | ~76% | ~78% | ~73% |
| AIME 2024 | 39.2% | ~30% | ~35% | ~25% |
| HumanEval-Mul (code) | 82.6% | ~80% | ~88% | ~77% |
| LiveCodeBench | 40.5% | 33.4% | — | — |
| Codeforces (percentile) | 51.6 | 23.6 | — | — |
| Arena Hard | 85.5% | ~82% | ~85% | — |
Sources: DeepSeek V3 on HuggingFace, DeepSeek technical report (arXiv).
The standout results: DeepSeek V3 scores 90.2% on MATH-500 (well ahead of GPT-4o’s ~76%), and 51.6 percentile on Codeforces versus GPT-4o’s 23.6. On LiveCodeBench, it hits 40.5% compared to GPT-4o’s 33.4%. For coding and math-heavy workloads, this model genuinely outperforms the most widely used commercial models.
On general knowledge (MMLU) and science reasoning (GPQA Diamond), it is competitive but not dramatically ahead — Claude 3.5 Sonnet still edges it on GPQA, and MMLU scores are within noise across all frontier models.
## Hardware reality — this is NOT a local model
Let us be direct. DeepSeek V3 has 671 billion parameters. Even with MoE reducing active parameters to 37B per token, the full model weights must still be loaded into memory.
| Setup | VRAM Required | Feasibility |
|---|---|---|
| Full FP16 | ~1.3 TB | Server cluster only (16x A100 80GB minimum) |
| Q4 quantized | ~350 GB | High-end multi-GPU server |
| Q2/Q3 aggressive quant | ~200 GB | Possible but significant quality loss |
| Cloud API | N/A | Most practical option for almost everyone |
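The VRAM figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below computes weight storage only; KV cache and activation overhead push real requirements higher, which is why the table's numbers sit above these floors.

```python
def weight_gb(params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the model weights, in gigabytes."""
    return params * bits_per_param / 8 / 1e9

PARAMS = 671e9  # DeepSeek V3 total parameter count

print(f"FP16: {weight_gb(PARAMS, 16):,.0f} GB")  # ~1,342 GB, i.e. ~1.3 TB
print(f"Q4:   {weight_gb(PARAMS, 4):,.0f} GB")   # ~336 GB before overhead
print(f"Q2:   {weight_gb(PARAMS, 2):,.0f} GB")   # ~168 GB before overhead
```

Note that MoE does not help here: only 37B parameters are *active* per token, but all 671B must be resident in memory for the router to choose among them.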
If you have heard of running models locally via Ollama, DeepSeek V3 does have community-contributed quantizations, but you would need hardware that most businesses simply do not have:
```bash
# Technically available in Ollama, but requires 200GB+ VRAM
ollama pull deepseek-v3

# For practical local coding work, use the smaller DeepSeek Coder V2 instead
ollama pull deepseek-coder-v2:16b
```
For most teams, the realistic path is the DeepSeek API, which uses an OpenAI-compatible format and costs significantly less than GPT-4o per token.
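Because the format is OpenAI-compatible, calling it looks like any OpenAI-style chat completion. Below is a minimal request-building sketch using only the Python standard library; the endpoint URL and model name reflect DeepSeek's public API docs at the time of writing, so verify both before use.

```python
import json
import os
import urllib.request

# Assumed values: check DeepSeek's current API documentation before relying on them.
API_URL = "https://api.deepseek.com/chat/completions"
MODEL = "deepseek-chat"  # the DeepSeek V3 chat model at the time of writing

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the DeepSeek API."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = build_request(
    "Implement binary search in Python.",
    os.environ.get("DEEPSEEK_API_KEY", "sk-placeholder"),
)
print(json.loads(req.data)["model"])
# To actually send it: urllib.request.urlopen(req) (requires a valid key).
```

In practice most teams will use the official `openai` Python client instead, pointing its `base_url` at DeepSeek's endpoint — the payload shape is identical.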
## Geopolitical considerations for European businesses
DeepSeek is a Chinese AI company. For European businesses operating under GDPR, this raises legitimate questions:
- Data sovereignty: Queries sent to DeepSeek’s API travel to Chinese infrastructure. If you process personal data or confidential business information, this may conflict with your GDPR compliance posture.
- Regulatory uncertainty: The EU-China data transfer landscape is less settled than EU-US arrangements. There is no adequacy decision for China under GDPR.
- Licensing: The DeepSeek V3 code is freely available under the MIT license, but the model weights are released under DeepSeek’s own Model License. If you self-host on European infrastructure, you control where data goes — but self-hosting requires the massive hardware described above.
The pragmatic approach for European SMEs: use DeepSeek V3 via API for non-sensitive tasks (public data analysis, coding assistance, research synthesis), and keep sensitive workloads on local models or European-hosted alternatives. For a deeper look at local deployment economics, see our cloud vs local cost analysis.
## When to use DeepSeek V3 (and when not to)
Strong use cases:
- Complex coding and competitive programming tasks
- Advanced mathematics, data science, and quantitative analysis
- Research synthesis and scientific reasoning
- Batch processing of non-sensitive analytical tasks via API
Think twice when:
- Processing personal data subject to GDPR
- You need guaranteed uptime independent of Chinese infrastructure
- Local deployment is a hard requirement (the model is simply too large)
- You need the absolute best code generation (Claude 3.5 Sonnet still leads on HumanEval)
## Related reading
- DeepSeek R1: The Best Open-Source Reasoning Model You Can Run Locally
- AESIA: What Every Spanish Business Deploying AI Must Know in 2026
- AI Evaluations: How to Test Your RAG Pipeline Before Going Live
## Conclusion
DeepSeek V3 is a genuinely impressive achievement. It proves that open-weight models from outside the US-UK axis can compete at the frontier, and its MoE architecture is a masterclass in efficiency. The math and coding benchmarks speak for themselves — 90.2% on MATH-500 and 51.6 percentile on Codeforces are numbers that no other open model touches.
But for European SMEs, it is not a drop-in replacement for local AI. The 671B parameter count puts it firmly in cloud-or-cluster territory, and the Chinese origin adds a data governance layer that businesses must evaluate honestly. The smartest approach is hybrid: use DeepSeek V3 via API where it excels and data sensitivity allows, while keeping more practical models like Qwen 2.5 72B or Llama 3.3 70B for local deployment.
If you need help designing that hybrid architecture — balancing performance, privacy, and cost — talk to our team. That is exactly what we do.
Sources: DeepSeek V3 HuggingFace model card, DeepSeek technical report, DeepSeek official site.
## Ready to Get Started?
VORLUX AI helps Spanish and European businesses deploy AI solutions that stay on your hardware, under your control. Whether you need edge AI deployment, LMS integration, or EU AI Act compliance consulting — we can help.
Book a free discovery call to discuss your AI strategy, or explore our services to see how we work.