Translation Accuracy Leaderboard by Language Pair
Which translation AI is most accurate for your language pair? Our leaderboard ranks Google Translate, DeepL, GPT-4, Claude, and NLLB-200 across 50+ language pairs using multiple quality metrics.
Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.
How We Score
Each system is evaluated using three metrics (see Translation Quality Metrics: BLEU, COMET, and Human Evaluation Explained for a full breakdown):
- BLEU Score: Automated n-gram overlap with reference translations (SacreBLEU implementation)
- COMET Score: Neural-based quality estimation correlating with human judgment
- Editorial Rating: Human evaluation by native speakers on a 1-10 scale
Scores are updated quarterly using standardized test sets.
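For readers who want to reproduce the automated portion of the scoring, the sketch below shows one way to compute corpus-level BLEU with the sacrebleu package and a COMET-22 score with the unbabel-comet package. It is a minimal illustration under assumptions, not our evaluation harness: the sample sentences are placeholders, and the "Unbabel/wmt22-comet-da" checkpoint name and package choice are our reading of the COMET-22 setup described above.

```python
# Minimal scoring sketch (not the production harness). Assumes the
# `sacrebleu` and `unbabel-comet` packages are installed; the sentences
# below stand in for a 1,000-sentence test set.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["Das ist ein Test."]          # source sentences
hypotheses = ["This is a test."]         # system output
references = ["This is a test."]         # professional human reference

# BLEU via SacreBLEU (default tokenization, as in the methodology below).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMET-22: neural quality estimation from source, hypothesis, and reference.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
prediction = model.predict(data, batch_size=8, gpus=0)
print(f"COMET: {prediction.system_score:.3f}")
```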
Overall Rankings (Averaged Across All Tested Pairs)
| Rank | System | Avg BLEU | Avg COMET | Avg Editorial | Best For |
|---|---|---|---|---|---|
| 1 | GPT-4 | 37.2 | 0.861 | 8.2 | Asian languages, nuanced content |
| 2 | DeepL | 38.1 | 0.865 | 8.4 | European languages (limited set) |
| 3 | Google Translate | 36.5 | 0.853 | 7.9 | Broad coverage, speed |
| 4 | Claude | 36.1 | 0.856 | 8.0 | Long-form, consistency |
| 5 | NLLB-200 | 33.4 | 0.836 | 7.3 | Low-resource languages |
Note: DeepL's average is inflated because it covers a smaller set of language pairs, concentrated in European languages where every system scores well. GPT-4 ranks first because it leads once Asian and other non-European pairs are included. For a head-to-head breakdown, see Google Translate vs DeepL vs AI Models: Which Is Most Accurate?
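To make the averaging effect concrete, here is a toy calculation using editorial scores taken from the per-pair tables below. The subset of pairs is chosen purely for illustration; it is not the full test matrix.

```python
# Toy illustration of how a restricted pair set inflates an average.
# Scores are editorial ratings from the tables below (simplified subset).
deepl_pairs = {"EN-ES": 8.7, "EN-FR": 8.9, "EN-DE": 8.8}           # European pairs only
gpt4_pairs = {"EN-ES": 8.5, "EN-FR": 8.6, "EN-DE": 8.3,
              "EN-ZH": 8.1, "EN-JA": 8.2, "EN-AR": 7.5}            # broader coverage


def average(scores: dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)


print(f"DeepL over its European pairs: {average(deepl_pairs):.2f}")  # 8.80
print(f"GPT-4 over a broader mix:      {average(gpt4_pairs):.2f}")   # 8.20
```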
Rankings by Language Pair
Tier 1: European High-Resource
| Language Pair | #1 | #2 | #3 | #4 | #5 |
|---|---|---|---|---|---|
| EN → ES | DeepL (8.7) | GPT-4 (8.5) | Claude (8.4) | Google (8.2) | NLLB (7.6) |
| EN → FR | DeepL (8.9) | GPT-4 (8.6) | Claude (8.5) | Google (8.3) | NLLB (7.7) |
| EN → DE | DeepL (8.8) | GPT-4 (8.3) | Claude (8.1) | Google (7.9) | NLLB (7.2) |
| EN → PT | DeepL (8.6) | GPT-4 (8.4) | Claude (8.3) | Google (8.1) | NLLB (7.5) |
| EN → IT | DeepL (8.7) | GPT-4 (8.3) | Claude (8.2) | Google (8.0) | NLLB (7.4) |
Detailed comparisons:
- English to Spanish: AI Translation Comparison
- English to French: AI Translation Comparison
- English to German: AI Translation Comparison
- English to Portuguese: AI Translation Comparison
Tier 2: Asian High-Resource
| Language Pair | #1 | #2 | #3 | #4 | #5 |
|---|---|---|---|---|---|
| EN → ZH | GPT-4 (8.1) | Claude (7.9) | Google (7.8) | DeepL (7.5) | NLLB (7.0) |
| EN → JA | GPT-4 (8.2) | Claude (7.9) | DeepL (7.8) | Google (7.5) | NLLB (6.9) |
| EN → KO | GPT-4 (8.0) | Claude (7.8) | DeepL (7.6) | Google (7.4) | NLLB (6.8) |
Detailed comparisons:
- English to Chinese (Simplified): AI Translation Comparison
- English to Japanese: AI Translation Comparison
- English to Korean: AI Translation Comparison
Tier 3: Other Major Languages
| Language Pair | #1 | #2 | #3 | #4 | #5 |
|---|---|---|---|---|---|
| EN → AR | GPT-4 (7.5) | Claude (7.3) | Google (7.2) | DeepL (6.8) | NLLB (6.7) |
| EN → HI | GPT-4 (7.7) | Claude (7.4) | Google (7.3) | DeepL (6.9) | NLLB (6.8) |
| EN → RU | DeepL (8.1) | GPT-4 (8.0) | Claude (7.8) | Google (7.7) | NLLB (7.2) |
Detailed comparisons:
- English to Arabic: AI Translation Comparison
- English to Hindi: AI Translation Comparison
- English to Russian: AI Translation Comparison
Reverse Pairs (X → EN)
| Language Pair | #1 | #2 | #3 | #4 | #5 |
|---|---|---|---|---|---|
| ES → EN | DeepL (8.9) | GPT-4 (8.8) | Claude (8.6) | Google (8.5) | NLLB (7.9) |
| FR → EN | DeepL (9.0) | GPT-4 (8.8) | Claude (8.7) | Google (8.5) | NLLB (7.8) |
| ZH → EN | GPT-4 (8.4) | Claude (8.1) | Google (8.0) | DeepL (7.7) | NLLB (7.2) |
| JA → EN | GPT-4 (8.5) | Claude (8.2) | DeepL (8.1) | Google (7.8) | NLLB (7.0) |
| DE → EN | DeepL (9.0) | GPT-4 (8.7) | Claude (8.5) | Google (8.3) | NLLB (7.6) |
Detailed comparisons:
- Spanish to English: AI Translation Comparison
- French to English: AI Translation Comparison
- Chinese to English: AI Translation Comparison
- Japanese to English: AI Translation Comparison
- German to English: AI Translation Comparison
Low-Resource Languages
| Language Pair | #1 | #2 | #3 |
|---|---|---|---|
| EN → Yoruba | NLLB (6.5) | Google (5.8) | GPT-4 (5.5) |
| EN → Igbo | NLLB (6.2) | Google (5.5) | GPT-4 (5.2) |
| EN → Swahili | Google (7.0) | NLLB (6.8) | GPT-4 (6.5) |
Related reading:
- Best Translation AI for Rare/Low-Resource Languages
- Low-Resource Languages: How NLLB and Aya Are Closing the Gap
Methodology
- Test sets: 1,000 sentences per language pair from diverse domains (news, conversation, technical, literary)
- Reference translations: Professional human translations
- Update frequency: Quarterly
- Systems tested: Latest publicly available versions
- BLEU: SacreBLEU with default tokenization
- COMET: Latest COMET-22 model
- Editorial: 3 native-speaker evaluators per language, scores averaged
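As a rough sketch of the aggregation step, the snippet below averages three evaluators' ratings for each language pair and then averages each system's pair scores into an overall editorial score. The ratings shown are hypothetical; only the two-step averaging mirrors the methodology above.

```python
from statistics import mean

# Hypothetical 1-10 ratings from three native-speaker evaluators per pair;
# only the aggregation procedure reflects the methodology above.
raw_ratings = {
    ("GPT-4", "EN-JA"): [8.0, 8.5, 8.1],
    ("GPT-4", "EN-ZH"): [8.2, 8.0, 8.1],
    ("DeepL", "EN-FR"): [9.0, 8.8, 8.9],
}

# Step 1: average the three evaluator scores for each (system, pair).
pair_scores = {key: mean(scores) for key, scores in raw_ratings.items()}

# Step 2: average each system's pair scores into its overall editorial score.
per_system: dict[str, list[float]] = {}
for (system, _pair), score in pair_scores.items():
    per_system.setdefault(system, []).append(score)
overall = {system: round(mean(scores), 2) for system, scores in per_system.items()}

print({key: round(score, 2) for key, score in pair_scores.items()})
print(overall)  # e.g. {'GPT-4': 8.15, 'DeepL': 8.9}
```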
Key Takeaways
- DeepL leads for European languages. GPT-4 leads for Asian languages and when averaged across all pairs.
- Translation into English is consistently higher quality than translation from English, across all systems.
- NLLB-200 leads for low-resource languages where other systems have weak or no coverage.
- The quality gap between the top systems is smaller than most people expect — usually 0.5-1.5 points on our 10-point scale.
Next Steps
- Test on your own text: Use the Translation AI Playground: Compare Models Side-by-Side.
- Read detailed comparisons: See specific language pair pages for in-depth analysis.
- Full model comparison: Read Best Translation AI in 2026: Complete Model Comparison.
- Understand our metrics: See Translation Quality Metrics: BLEU, COMET, and Human Evaluation Explained.