# DeepSeek V3: A 671B Open-Weight Giant That Beats GPT-4o on Code and Math
When DeepSeek V3 appeared, it changed the conversation about what open-weight models can achieve. Built by the Chinese AI lab DeepSeek and trained on 14.8 trillion tokens in 2.788 million H800 GPU hours, the model matches or beats GPT-4o on multiple hard benchmarks, particularly coding and mathematics. The catch? At 671 billion parameters, it is far too large for any local workstation. Let us be upfront about what this model delivers, what it costs to run, and what European businesses should consider before adopting it.

```mermaid
flowchart LR
    INPUT["Input Token"] --> ROUTER["Router Network"]
    ROUTER --> E1["Expert 1"]
    ROUTER --> E2["Expert 2"]
    ROUTER -.->|inactive| E3["Expert 3"]
    ROUTER -.->|inactive| E4["..."]
    ROUTER -.->|inactive| E256["Expert 256"]
    E1 --> COMBINE["Combine Outputs"]
    E2 --> COMBINE
    subgraph TOTAL ["671B Total Parameters"]
        ROUTER
        E3
        E4
        E256
        subgraph ACTIVE ["37B Active per Token"]
            E1
            E2
        end
    end
    COMBINE --> MTP["Multi-Token<br/>Prediction"]
    MTP --> OUTPUT["Output Tokens"]
    style INPUT fill:#DBEAFE,stroke:#2563EB,color:#000
    style ROUTER fill:#FEF3C7,stroke:#F5A623,color:#000
    style E1 fill:#D1FAE5,stroke:#059669,color:#000
    style E2 fill:#D1FAE5,stroke:#059669,color:#000
    style E3 fill:#FECACA,stroke:#B91C1C,color:#000
    style E4 fill:#FECACA,stroke:#B91C1C,color:#000
    style E256 fill:#FECACA,stroke:#B91C1C,color:#000
    style COMBINE fill:#FEF3C7,stroke:#F5A623,color:#000
    style MTP fill:#DBEAFE,stroke:#2563EB,color:#000
    style OUTPUT fill:#D1FAE5,stroke:#059669,color:#000
```
## How Mixture-of-Experts works (in plain terms)
Most language models are “dense” — every parameter participates in every computation. DeepSeek V3 uses a different approach called Mixture-of-Experts (MoE). Think of it like a company with 671 billion employees where, for any given task, only 37 billion of them actually work on it. A routing network decides which “expert” sub-networks handle each token (in DeepSeek V3, 8 out of 256 routed experts per MoE layer, plus a shared expert that always runs).
The result: you get the knowledge depth of a 671B model with inference cost closer to that of a 37B dense model. DeepSeek also introduced Multi-Token Prediction (MTP) for faster generation and FP8 mixed-precision training to keep costs down. The model supports a 128K context window, which is generous for long-document tasks.
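The routing idea can be sketched in a few lines of Python. This is a toy illustration, not DeepSeek's actual router: real routers are learned layers over the token's hidden state, and V3 selects 8 of 256 experts per layer rather than 2 of 8 as here.

```python
import math
import random

random.seed(0)
NUM_EXPERTS = 8  # illustrative only; DeepSeek V3 has far more routed experts
TOP_K = 2        # how many experts actually run per token

def softmax(xs):
    """Turn raw router scores into gating weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy experts: each scalar function stands in for a feed-forward sub-network.
experts = [lambda x, w=w: w * x for w in range(1, NUM_EXPERTS + 1)]

def route(token_repr):
    # The router scores every expert, but only the top-k are executed.
    scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in for a learned router
    gates = softmax(scores)
    top = sorted(range(NUM_EXPERTS), key=lambda i: gates[i], reverse=True)[:TOP_K]
    # Combine the selected experts' outputs, weighted by renormalised gates.
    norm = sum(gates[i] for i in top)
    return sum(gates[i] / norm * experts[i](token_repr) for i in top)

print(route(1.0))
```

The key efficiency property is visible even in the toy: all eight experts exist in memory, but only two run per token, so compute scales with TOP_K rather than NUM_EXPERTS.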
## Real benchmarks — honest numbers
These numbers come directly from the official HuggingFace model card and the DeepSeek technical report.
| Benchmark | DeepSeek V3 | GPT-4o | Claude 3.5 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|
| MMLU (Chat) | 88.5% | ~88% | ~89% | ~85% |
| MMLU (Base, 5-shot) | 87.1% | — | — | ~84% |
| MMLU-Pro | 75.9% | ~73% | ~75% | ~68% |
| GPQA Diamond | 59.1% | ~53% | ~60% | ~50% |
| MATH-500 | 90.2% | ~76% | ~78% | ~73% |
| AIME 2024 | 39.2% | ~30% | ~35% | ~25% |
| HumanEval-Mul (code) | 82.6% | ~80% | ~88% | ~77% |
| LiveCodeBench | 40.5% | 33.4% | — | — |
| Codeforces (percentile) | 51.6 | 23.6 | — | — |
| Arena Hard | 85.5% | ~82% | ~85% | — |
Sources: DeepSeek V3 on HuggingFace, DeepSeek technical report (arXiv).
The standout results: DeepSeek V3 scores 90.2% on MATH-500 (well ahead of GPT-4o’s ~76%), and 51.6 percentile on Codeforces versus GPT-4o’s 23.6. On LiveCodeBench, it hits 40.5% compared to GPT-4o’s 33.4%. For coding and math-heavy workloads, this model genuinely outperforms the most widely used commercial models.
On general knowledge (MMLU) and science reasoning (GPQA Diamond), it is competitive but not dramatically ahead — Claude 3.5 Sonnet still edges it on GPQA, and MMLU scores are within noise across all frontier models.
## Hardware reality — this is NOT a local model
Let us be direct. DeepSeek V3 has 671 billion parameters. Even with MoE reducing active parameters to 37B per token, the full model weights must still be loaded into memory.
| Setup | VRAM Required | Feasibility |
|---|---|---|
| Full FP16 | ~1.3 TB | Server cluster only (16x A100 80GB minimum) |
| Q4 quantized | ~350 GB | High-end multi-GPU server |
| Q2/Q3 aggressive quant | ~200 GB | Possible but significant quality loss |
| Cloud API | N/A | Most practical option for almost everyone |
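The VRAM figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below computes weight storage only; KV cache and activation overhead push real requirements higher, which is why the table's numbers sit above these floors.

```python
def weight_gb(params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the model weights, in gigabytes."""
    return params * bits_per_param / 8 / 1e9

PARAMS = 671e9  # DeepSeek V3 total parameter count

print(f"FP16: {weight_gb(PARAMS, 16):,.0f} GB")  # ~1,342 GB, i.e. ~1.3 TB
print(f"Q4:   {weight_gb(PARAMS, 4):,.0f} GB")   # ~336 GB before overhead
print(f"Q2:   {weight_gb(PARAMS, 2):,.0f} GB")   # ~168 GB before overhead
```

Note that MoE does not help here: only 37B parameters are *active* per token, but all 671B must be resident in memory for the router to choose among them.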
If you have heard of running models locally via Ollama, DeepSeek V3 does have community-contributed quantizations, but you would need hardware that most businesses simply do not have:
```bash
# Technically available in Ollama, but requires 200GB+ VRAM
ollama pull deepseek-v3

# For practical local coding work, use the smaller DeepSeek Coder V2 instead
ollama pull deepseek-coder-v2:16b
```
For most teams, the realistic path is the DeepSeek API, which uses an OpenAI-compatible format and costs significantly less than GPT-4o per token.
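Because the format is OpenAI-compatible, calling it looks like any OpenAI-style chat completion. Below is a minimal request-building sketch using only the Python standard library; the endpoint URL and model name reflect DeepSeek's public API docs at the time of writing, so verify both before use.

```python
import json
import os
import urllib.request

# Assumed values: check DeepSeek's current API documentation before relying on them.
API_URL = "https://api.deepseek.com/chat/completions"
MODEL = "deepseek-chat"  # the DeepSeek V3 chat model at the time of writing

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the DeepSeek API."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = build_request(
    "Implement binary search in Python.",
    os.environ.get("DEEPSEEK_API_KEY", "sk-placeholder"),
)
print(json.loads(req.data)["model"])
# To actually send it: urllib.request.urlopen(req) (requires a valid key).
```

In practice most teams will use the official `openai` Python client instead, pointing its `base_url` at DeepSeek's endpoint — the payload shape is identical.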
## Geopolitical considerations for European businesses
DeepSeek is a Chinese AI company. For European businesses operating under GDPR, this raises legitimate questions:
- Data sovereignty: Queries sent to DeepSeek’s API travel to Chinese infrastructure. If you process personal data or confidential business information, this may conflict with your GDPR compliance posture.
- Regulatory uncertainty: The EU-China data transfer landscape is less settled than EU-US arrangements. There is no adequacy decision for China under GDPR.
- Licensing: The DeepSeek V3 code is freely available under the MIT license, but the model weights are released under DeepSeek’s own Model License. If you self-host on European infrastructure, you control where data goes — but self-hosting requires the massive hardware described above.
The pragmatic approach for European SMEs: use DeepSeek V3 via API for non-sensitive tasks (public data analysis, coding assistance, research synthesis), and keep sensitive workloads on local models or European-hosted alternatives. For a deeper look at local deployment economics, see our cloud vs local cost analysis.
## When to use DeepSeek V3 (and when not to)
Strong use cases:
- Complex coding and competitive programming tasks
- Advanced mathematics, data science, and quantitative analysis
- Research synthesis and scientific reasoning
- Batch processing of non-sensitive analytical tasks via API
Think twice when:
- Processing personal data subject to GDPR
- You need guaranteed uptime independent of Chinese infrastructure
- Local deployment is a hard requirement (the model is simply too large)
- You need the absolute best code generation (Claude 3.5 Sonnet still leads on HumanEval)
## Related reading
- DeepSeek R1: The Best Open-Source Reasoning Model You Can Run Locally
- AESIA: What Every Spanish Business Deploying AI Must Know in 2026
- AI Evaluations: How to Test Your RAG Pipeline Before Going Live
## Conclusion
DeepSeek V3 is a genuinely impressive achievement. It proves that open-weight models from outside the US-UK axis can compete at the frontier, and its MoE architecture is a masterclass in efficiency. The math and coding benchmarks speak for themselves — 90.2% on MATH-500 and 51.6 percentile on Codeforces are numbers that no other open model touches.
But for European SMEs, it is not a drop-in replacement for local AI. The 671B parameter count puts it firmly in cloud-or-cluster territory, and the Chinese origin adds a data governance layer that businesses must evaluate honestly. The smartest approach is hybrid: use DeepSeek V3 via API where it excels and data sensitivity allows, while keeping more practical models like Qwen 2.5 72B or Llama 3.3 70B for local deployment.
If you need help designing that hybrid architecture — balancing performance, privacy, and cost — talk to our team. That is exactly what we do.
Sources: DeepSeek V3 HuggingFace model card, DeepSeek technical report, DeepSeek official site.
## Ready to Get Started?
VORLUX AI helps Spanish and European businesses deploy AI solutions that stay on your hardware, under your control. Whether you need edge AI deployment, LMS integration, or EU AI Act compliance consulting — we can help.
Book a free discovery call to discuss your AI strategy, or explore our services to see how we work.