Fertility Leaderboard
Tokens per word for 16 languages, measured on identical FLORES-200 sentences so the comparison is apples-to-apples. Higher fertility means higher cost, fewer in-context examples, and a faster-draining context budget.
On Llama-3.1 / SEA-LION v3, the heaviest language is Malayalam at 14.61 tok/word, which is 11.5× English. The same text on GPT-4o is 3.29 tok/word, a 4.4× saving from the tokenizer alone.
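As a rough check on these figures, the ratios and their effect on a context window can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the 8,192-token budget and the names `relativeCost` and `wordsThatFit` are assumptions, not part of the leaderboard.

```typescript
// Fertility values (tokens per word) copied from the leaderboard text.
const fertility = {
  malayalamLlama: 14.61, // Malayalam on Llama-3.1 / SEA-LION v3
  malayalamGpt4o: 3.29,  // same text on GPT-4o
  englishLlama: 1.27,    // English on Llama-3.1 / SEA-LION v3
};

// How many times more tokens one setting needs than a baseline.
function relativeCost(tokPerWord: number, baseline: number): number {
  return tokPerWord / baseline;
}

// Approximate number of words that fit into a fixed token budget.
function wordsThatFit(tokenBudget: number, tokPerWord: number): number {
  return Math.floor(tokenBudget / tokPerWord);
}

// Assumed context budget, chosen only for illustration.
const CONTEXT_TOKENS = 8192;

const vsEnglish = relativeCost(fertility.malayalamLlama, fertility.englishLlama);
const tokenizerSaving = relativeCost(fertility.malayalamLlama, fertility.malayalamGpt4o);

console.log(`Malayalam vs English: ${vsEnglish.toFixed(1)}x`);
console.log(`Llama-3.1 vs GPT-4o on Malayalam: ${tokenizerSaving.toFixed(1)}x`);
console.log(`Malayalam words in budget: ${wordsThatFit(CONTEXT_TOKENS, fertility.malayalamLlama)}`);
console.log(`English words in budget: ${wordsThatFit(CONTEXT_TOKENS, fertility.englishLlama)}`);
```

Under these assumptions, a budget that holds about 6,450 English words holds only about 560 Malayalam words on the same tokenizer.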
Fertility (tok / word)
| Language | Fertility | ×Eng | c/tok |
|---|---|---|---|
| Malayalam | 14.61 | 11.5× | 0.62 |
| Kannada | 13.25 | 10.5× | 0.59 |
| Telugu | 12.49 | 9.9× | 0.60 |
| Tamil | 11.25 | 8.9× | 0.76 |
| Sinhala | 10.67 | 8.4× | 0.57 |
| Burmese | 8.43 | 6.7× | 0.52 |
| Lao | 7.93 | 6.3× | 0.60 |
| Bengali | 7.81 | 6.2× | 0.83 |
| Khmer | 7.42 | 5.9× | 0.70 |
| Hindi | 2.65 | 2.1× | 1.89 |
| Malay | 2.07 | 1.6× | 3.33 |
| Indonesian | 2.04 | 1.6× | 3.33 |
| Filipino | 1.96 | 1.6× | 3.04 |
| Thai | 1.90 | 1.5× | 2.10 |
| English | 1.27 | 1.0× | 4.56 |
| Vietnamese | 1.26 | 1.0× | 3.67 |
Corpus: FLORES-200 dev split, 50 parallel sentences, licensed CC-BY-SA 4.0 (FLORES-200 / NLLB, Meta AI). Words are counted with Intl.Segmenter so that scripts written without spaces (Thai, Khmer, Burmese, Lao) are segmented into words rather than split on whitespace. ×Eng = fertility relative to English. c/tok = characters per token (lower = more fragmented).
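The measurement described in this note could be sketched roughly as follows. This is not the leaderboard's actual code: `Tokenize`, `countWords`, `corpusStats`, and the toy four-character tokenizer are hypothetical stand-ins. Aggregating over the whole corpus (total tokens over total words) and counting characters as Unicode code points are also assumptions. Only `Intl.Segmenter` with `granularity: "word"` and the `isWordLike` flag come from the standard JavaScript API.

```typescript
// Hypothetical tokenizer signature: text in, list of tokens out.
// A real run would plug in the model's own tokenizer here.
type Tokenize = (text: string) => unknown[];

// Count words with Intl.Segmenter, keeping only word-like segments,
// which drops whitespace and punctuation.
function countWords(text: string, locale: string): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    if (segment.isWordLike) count++;
  }
  return count;
}

interface CorpusStats {
  fertility: number;     // tokens per word
  charsPerToken: number; // characters per token (c/tok)
}

// Assumption: totals are summed over all sentences before dividing.
function corpusStats(sentences: string[], locale: string, tokenize: Tokenize): CorpusStats {
  let tokens = 0;
  let words = 0;
  let chars = 0;
  for (const sentence of sentences) {
    tokens += tokenize(sentence).length;
    words += countWords(sentence, locale);
    chars += Array.from(sentence).length; // code points, an assumption
  }
  return { fertility: tokens / words, charsPerToken: chars / tokens };
}

// Toy stand-in tokenizer: one token per four code points. Not a real model tokenizer.
const toyTokenize: Tokenize = (text) => {
  const codePoints = Array.from(text);
  const pieces: string[] = [];
  for (let i = 0; i < codePoints.length; i += 4) {
    pieces.push(codePoints.slice(i, i + 4).join(""));
  }
  return pieces;
};

const demo = corpusStats(["The cat sat on the mat."], "en", toyTokenize);
console.log(demo.fertility.toFixed(2), demo.charsPerToken.toFixed(2));
```

Passing a locale such as `"th"` or `"km"` lets the segmenter apply its dictionary-based word breaking for those scripts, which is why whitespace splitting is not used here.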