Fertility Leaderboard

Tokens per word for 16 languages, measured on the same parallel FLORES-200 sentences so the comparison is like-for-like. Higher fertility means higher cost, fewer in-context examples that fit, and a faster-burning context budget.

On Llama-3.1 / SEA-LION v3, the heaviest language is Malayalam at 14.61 tok/word, 11.5× English. The same text on GPT-4o comes to 3.29 tok/word, about 4.4× fewer tokens from the tokenizer alone.
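The two multipliers quoted here are plain ratios of fertility values. A minimal sketch of the arithmetic, using the Malayalam and GPT-4o figures from the text and the English figure from the table (the variable names are illustrative only):

```python
# Fertility values (tokens per word) as reported on this page.
malayalam_llama = 14.61   # Malayalam, Llama-3.1 / SEA-LION v3
english_llama = 1.27      # English, same tokenizer (from the table)
malayalam_gpt4o = 3.29    # Malayalam, GPT-4o

# How much heavier Malayalam is than English under the same tokenizer.
ratio_vs_english = malayalam_llama / english_llama

# How many times fewer tokens GPT-4o needs for the same Malayalam text.
tokenizer_saving = malayalam_llama / malayalam_gpt4o

print(f"{ratio_vs_english:.1f}x English")
print(f"{tokenizer_saving:.1f}x fewer tokens on GPT-4o")
```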

Fertility (tok / word)

Language	Fertility	× English	c/tok
Malayalam	14.61	11.5×	0.62
Kannada	13.25	10.5×	0.59
Telugu	12.49	9.9×	0.60
Tamil	11.25	8.9×	0.76
Sinhala	10.67	8.4×	0.57
Burmese	8.43	6.7×	0.52
Lao	7.93	6.3×	0.60
Bengali	7.81	6.2×	0.83
Khmer	7.42	5.9×	0.70
Hindi	2.65	2.1×	1.89
Malay	2.07	1.6×	3.33
Indonesian	2.04	1.6×	3.33
Filipino	1.96	1.6×	3.04
Thai	1.90	1.5×	2.10
English	1.27	1.0×	4.56
Vietnamese	1.26	1.0×	3.67
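To make the context-budget effect concrete, the sketch below estimates how many words fit in a fixed token budget at a given fertility. The 8,192-token budget and the function name `words_that_fit` are assumptions chosen for illustration; the fertility values come from the table above.

```python
def words_that_fit(token_budget: int, fertility: float) -> int:
    """Approximate number of words that fit in a token budget.

    fertility is tokens per word, so budget / fertility gives words.
    """
    if fertility <= 0:
        raise ValueError("fertility must be positive")
    return int(token_budget // fertility)


# Assumed budget for illustration; not a figure from this page.
BUDGET = 8192

# Fertility values taken from the table above.
fertility = {"Malayalam": 14.61, "Hindi": 2.65, "English": 1.27}

for language, tok_per_word in fertility.items():
    print(f"{language}: ~{words_that_fit(BUDGET, tok_per_word)} words")
```

Under this assumed budget, Malayalam text gets roughly one eleventh of the word capacity that English does, which is the same 11.5× gap seen from the other side.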

Corpus: FLORES-200 dev split, 50 parallel sentences. CC-BY-SA 4.0 (FLORES-200 / NLLB, Meta AI). Words counted via Intl.Segmenter so spaceless scripts (Thai, Khmer, Burmese, Lao) are handled correctly. c/tok = characters per token (lower = more fragmented).
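The two metrics can be sketched as follows. This is a rough illustration, not the page's actual pipeline: it assumes totals are pooled over the whole corpus before dividing (per-sentence averaging would give slightly different numbers) and that characters are counted as Unicode code points including spaces. The `tokenize` and `count_words` callables are hypothetical stand-ins for a real tokenizer and for the Intl.Segmenter word counter.

```python
from typing import Callable, Dict, Iterable, List


def fertility_stats(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    count_words: Callable[[str], int],
) -> Dict[str, float]:
    """Return corpus-level fertility (tok/word) and characters per token."""
    total_tokens = total_words = total_chars = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += count_words(sentence)
        total_chars += len(sentence)  # code points, spaces included (assumption)
    if total_tokens == 0 or total_words == 0:
        raise ValueError("corpus produced no tokens or no words")
    return {
        "fertility": total_tokens / total_words,
        "chars_per_token": total_chars / total_tokens,
    }


# Toy stand-ins so the sketch runs without external dependencies.
def toy_tokenize(text: str) -> List[str]:
    # Hypothetical tokenizer: fixed two-character chunks.
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def whitespace_word_count(text: str) -> int:
    # Only valid for space-delimited scripts; spaceless scripts such as
    # Thai or Khmer need a real word segmenter instead.
    return len(text.split())


stats = fertility_stats(["the cat sat", "a dog"], toy_tokenize, whitespace_word_count)
print(stats)
```

Passing the tokenizer and word counter in as arguments keeps the metric definition separate from the segmentation choice, which is the part that differs most between scripts.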