FertiScope

Fertility Leaderboard

Tokens per word for 16 languages, measured on identical FLORES-200 sentences for an apples-to-apples comparison. Higher fertility means higher cost per word, fewer in-context examples, and faster context-budget burn.
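To make the context-budget point concrete, here is a minimal sketch of how many words fit in a fixed token budget at a given fertility. The 8,192-token budget and the helper name `wordsThatFit` are assumptions for illustration. The two fertility values are the Llama-3.1 / SEA-LION v3 figures from the table.

```javascript
// Hypothetical helper: whole words that fit in a token budget
// when each word costs `fertility` tokens on average.
function wordsThatFit(tokenBudget, fertility) {
  if (fertility <= 0) throw new RangeError("fertility must be positive");
  return Math.floor(tokenBudget / fertility);
}

const tokenBudget = 8192; // assumed budget, not a figure from this page

const englishWords = wordsThatFit(tokenBudget, 1.27);    // English fertility
const malayalamWords = wordsThatFit(tokenBudget, 14.61); // Malayalam fertility

console.log(englishWords, malayalamWords); // 6450 560
```

Under these assumptions the same budget holds roughly eleven times fewer Malayalam words than English words.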

On Llama-3.1 / SEA-LION v3, the heaviest language is Malayalam at 14.61 tok/word, 11.5× English. The same text on GPT-4o is just 3.29 tok/word: a 4.4× saving from the tokenizer alone.
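Both multipliers are plain ratios of fertility values. The sketch below reproduces them from the rounded figures quoted above. The variable names are illustrative, and the page itself presumably works from unrounded values.

```javascript
// Fertility figures quoted on this page (tokens per word).
const malayalamLlama = 14.61; // Malayalam on Llama-3.1 / SEA-LION v3
const englishLlama = 1.27;    // English on Llama-3.1 / SEA-LION v3
const malayalamGpt4o = 3.29;  // Malayalam on GPT-4o

// "×Eng": how many times more tokens per word than English on the same tokenizer.
const timesEnglish = malayalamLlama / englishLlama;

// Tokenizer saving: same text, same language, different tokenizer.
const tokenizerSaving = malayalamLlama / malayalamGpt4o;

console.log(timesEnglish.toFixed(1), tokenizerSaving.toFixed(1)); // 11.5 4.4
```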

Fertility (tok / word)

Language      Fert.    ×Eng     c/tok
Malayalam     14.61    11.5×    0.62
Kannada       13.25    10.5×    0.59
Telugu        12.49     9.9×    0.60
Tamil         11.25     8.9×    0.76
Sinhala       10.67     8.4×    0.57
Burmese        8.43     6.7×    0.52
Lao            7.93     6.3×    0.60
Bengali        7.81     6.2×    0.83
Khmer          7.42     5.9×    0.70
Hindi          2.65     2.1×    1.89
Malay          2.07     1.6×    3.33
Indonesian     2.04     1.6×    3.33
Filipino       1.96     1.6×    3.04
Thai           1.90     1.5×    2.10
English        1.27     1.0×    4.56
Vietnamese     1.26     1.0×    3.67

Corpus: FLORES-200 dev split, 50 parallel sentences. CC-BY-SA 4.0 (FLORES-200 / NLLB, Meta AI). Words counted via Intl.Segmenter so spaceless scripts (Thai, Khmer, Burmese, Lao) are handled correctly. c/tok = characters per token (lower = more fragmented).
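The counting described above can be sketched as follows. This is a minimal illustration, not the page's actual code. It assumes that only segments flagged `isWordLike` count as words and that "characters" means Unicode code points, neither of which the page states. The function names and the token count of 10 are hypothetical, since a real count would come from the model's tokenizer.

```javascript
// Count words with Intl.Segmenter at word granularity, so scripts written
// without spaces are still split into words.
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let words = 0;
  for (const segment of segmenter.segment(text)) {
    // Assumption: spaces and punctuation (not word-like) are skipped.
    if (segment.isWordLike) words += 1;
  }
  return words;
}

// Fertility = tokens per word.
function fertility(tokenCount, wordCount) {
  return tokenCount / wordCount;
}

// c/tok = characters per token.
// Assumption: characters are counted as Unicode code points.
function charsPerToken(text, tokenCount) {
  return Array.from(text).length / tokenCount;
}

const sentence = "The quick brown fox jumps over the lazy dog.";
const wordCount = countWords(sentence, "en"); // 9 word-like segments
const tokenCount = 10; // hypothetical; obtain from the tokenizer under test

console.log(wordCount);
console.log(fertility(tokenCount, wordCount).toFixed(2));
console.log(charsPerToken(sentence, tokenCount).toFixed(2));

// A Thai string with no spaces: splitting on whitespace would report one word.
console.log(countWords("สวัสดีครับ", "th"));
```

Segmentation of spaceless scripts depends on the ICU data shipped with the JavaScript runtime, so word counts for Thai, Khmer, Burmese, and Lao may vary slightly between environments.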