Methodology & Honesty

FertiScope draws a hard line between numbers that are exact (deterministic token math) and the one that is an estimate (multi-turn accuracy risk). The estimate is labelled everywhere it appears. This split is the whole point — a tool that confidently predicted accuracy loss would be selling a claim the underlying research could not support.

Exact (badge: ● exact)

Fertility = tokens ÷ words, from real tokenizers.
Cost multiplier = fertility ÷ English fertility (same tokenizer). Within one API the price per token is fixed for a given model and token type (input or output), so the token ratio is the cost ratio for equivalent content.
In-context capacity = ⌊(window − reply reserve) ÷ tokens-per-example⌋.
Context budget = cumulative tokens across turns vs. the window.
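
The four exact formulas above can be sketched as plain functions. This is a minimal illustration, not FertiScope's actual code: every function and parameter name is hypothetical, and clamping the capacity at zero is an added assumption for the case where the reply reserve exceeds the window.

```typescript
// Hypothetical helper names; only the formulas come from the text above.

/** Fertility = tokens ÷ words. */
function fertility(tokens: number, words: number): number {
  if (words <= 0) throw new RangeError("words must be positive");
  return tokens / words;
}

/** Cost multiplier = fertility ÷ English fertility, both from the same tokenizer. */
function costMultiplier(langFertility: number, englishFertility: number): number {
  if (englishFertility <= 0) throw new RangeError("englishFertility must be positive");
  return langFertility / englishFertility;
}

/** In-context capacity = floor((window − reply reserve) ÷ tokens-per-example). */
function inContextCapacity(
  windowTokens: number,
  replyReserve: number,
  tokensPerExample: number,
): number {
  if (tokensPerExample <= 0) throw new RangeError("tokensPerExample must be positive");
  // Assumption: clamp at 0 when the reserve is larger than the window.
  return Math.max(0, Math.floor((windowTokens - replyReserve) / tokensPerExample));
}

/** Context budget: running cumulative tokens per turn, as a fraction of the window. */
function contextBudget(
  tokensPerTurn: number[],
  windowTokens: number,
): { cumulative: number; fractionOfWindow: number }[] {
  let cumulative = 0;
  return tokensPerTurn.map((t) => {
    cumulative += t;
    return { cumulative, fractionOfWindow: cumulative / windowTokens };
  });
}
```

Because each function is pure arithmetic on token and word counts, the same inputs always give the same outputs, which is what lets these numbers carry the exact label.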

Estimate (badge: ◐ estimate)

Multi-turn degradation risk is a transparent 0–4 score. Budget pressure — how fast cumulative tokens cross 75% of the window — scores 0–2 and is the primary driver. Fertility scores 0–2 but only counts as an amplifier once the budget is under pressure, so a roomy window stays Low regardless of fertility. Totals map 0–1 → Low, 2 → Medium, ≥3 → High. It is not a measured accuracy drop.
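
The structure of the score can be sketched as follows. The text above fixes the 75% threshold, the 0–2 ranges, the amplifier rule, and the mapping to Low/Medium/High; it does not state the cut-offs inside each 0–2 range. The cut-offs below (crossing in the first half of the conversation, fertility of 2 and 4 tokens/word) are therefore placeholder assumptions, and all names are hypothetical.

```typescript
type RiskLevel = "Low" | "Medium" | "High";

/** Budget pressure, 0–2: how early cumulative tokens cross 75% of the window. */
function budgetPressureScore(tokensPerTurn: number[], windowTokens: number): 0 | 1 | 2 {
  const threshold = 0.75 * windowTokens;
  let cumulative = 0;
  const crossingTurn = tokensPerTurn.findIndex((t) => (cumulative += t) >= threshold);
  if (crossingTurn === -1) return 0; // never crosses: no pressure
  // ASSUMPTION: crossing in the first half of the turns counts as "fast".
  return crossingTurn < tokensPerTurn.length / 2 ? 2 : 1;
}

/** Fertility, 0–2. ASSUMPTION: the cut-offs 2 and 4 tokens/word are illustrative only. */
function fertilityScore(fertility: number): 0 | 1 | 2 {
  if (fertility < 2) return 0;
  if (fertility < 4) return 1;
  return 2;
}

/** Total 0–4, mapped 0–1 → Low, 2 → Medium, ≥3 → High. */
function riskLevel(tokensPerTurn: number[], windowTokens: number, fertility: number): RiskLevel {
  const pressure = budgetPressureScore(tokensPerTurn, windowTokens);
  // Fertility is only an amplifier: it is ignored while the budget is not under pressure.
  const total = pressure === 0 ? 0 : pressure + fertilityScore(fertility);
  if (total <= 1) return "Low";
  if (total === 2) return "Medium";
  return "High";
}
```

With these placeholder cut-offs, ten 500-token turns on a 128,000-token window never reach the threshold, so the result is Low even at a fertility of 12, matching the rule that a roomy window stays Low regardless of fertility.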

Honest caveat: The deep-research report found that the magnitude of the accuracy loss is unestablished and regime-dependent; it mainly bites as cumulative tokens approach the context-window limit. On a 128K window a short 10-turn chat barely dents the budget, so the risk stays Low. That is why FertiScope shows risk per window rather than as a single scary number.

How words are counted

Word counts use Intl.Segmenter (available in browsers and Node) with granularity: "word", counting only the segments it flags as word-like (isWordLike), so punctuation and whitespace are excluded. This handles scripts written without spaces between words (Thai, Khmer, Burmese, Lao), where naïve whitespace splitting would count a whole sentence as one giant “word” and badly distort fertility.
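
A minimal sketch of this counting approach, set against the naïve whitespace split. The function names are hypothetical, and the Thai sample string is just an illustration; the exact number of words found in spaceless scripts depends on the ICU data shipped with the runtime.

```typescript
/** Count word-like segments using Intl.Segmenter with granularity "word". */
function countWords(text: string, locale?: string): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // isWordLike is false for punctuation and whitespace segments.
    if (segment.isWordLike) count++;
  }
  return count;
}

/** Naïve baseline: split on runs of whitespace. */
function countWhitespaceTokens(text: string): number {
  return text.trim().split(/\s+/).filter((s) => s.length > 0).length;
}

// A short Thai sentence written without spaces between words.
const thaiSample = "ฉันรักภาษาไทย";

console.log(countWords("Hello, world!"));        // 2
console.log(countWhitespaceTokens(thaiSample));  // 1
console.log(countWords(thaiSample));             // more than 1 on runtimes with full ICU data
```

Dividing a token count by the whitespace-based count of 1 would inflate fertility by roughly the number of real words in the sentence, which is the distortion the segmenter avoids.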

Tokenizers & data

Three real tokenizer families run in-app: GPT-4o (o200k_base) and GPT-4 / GPT-3.5 (cl100k_base) via js-tiktoken, and Llama-3.1 / SEA-LION v3 via llama3-tokenizer-js. SEA-LION v3 is continue-pretrained from Llama-3.1 and shares its tokenizer, so a single encoder covers both and reproduces the research’s reference numbers (Tamil ≈ 11–12 tokens/word).
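
The model-to-tokenizer sharing described above can be sketched as a lookup table plus a per-family cache. This is an illustration under assumptions: the names are hypothetical, and the encoders are injected as plain functions so the sketch stays self-contained. In the app they would presumably wrap js-tiktoken and llama3-tokenizer-js, but that wiring is not shown here.

```typescript
type TokenizerFamily = "o200k_base" | "cl100k_base" | "llama3";
type Encoder = (text: string) => number[];

// Five models, three tokenizer families (pairs taken from the text above).
const MODEL_FAMILIES: ReadonlyArray<[string, TokenizerFamily]> = [
  ["GPT-4o", "o200k_base"],
  ["GPT-4", "cl100k_base"],
  ["GPT-3.5", "cl100k_base"],
  ["Llama-3.1", "llama3"],
  ["SEA-LION v3", "llama3"],
];

/** Token count per model, encoding the text only once per tokenizer family. */
function tokenCountsByModel(
  text: string,
  encoders: Record<TokenizerFamily, Encoder>,
): Record<string, number> {
  const perFamily = new Map<TokenizerFamily, number>();
  const result: Record<string, number> = {};
  for (const [model, family] of MODEL_FAMILIES) {
    let count = perFamily.get(family);
    if (count === undefined) {
      count = encoders[family](text).length;
      perFamily.set(family, count);
    }
    result[model] = count;
  }
  return result;
}
```

Models that share a family always report identical token counts for the same text, so Llama-3.1 and SEA-LION v3 cannot diverge in this design unless one of them is given its own encoder.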

The leaderboard is computed on 50 parallel sentences from FLORES-200 (NLLB, Meta AI — CC-BY-SA 4.0). Because every language expresses the same meaning, fertility ratios are genuinely apples-to-apples.
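
One way such a leaderboard could be assembled from per-sentence counts is sketched below. It assumes fertility is pooled over the corpus (total tokens ÷ total words) rather than averaged per sentence; the text above does not say which is used. The type and function names are hypothetical.

```typescript
interface SentenceCount {
  tokens: number;
  words: number;
}

interface LeaderboardRow {
  language: string;
  fertility: number;
  costMultiplier: number;
}

/** Pooled fertility: total tokens ÷ total words across all sentences. */
function pooledFertility(rows: SentenceCount[]): number {
  const tokens = rows.reduce((sum, r) => sum + r.tokens, 0);
  const words = rows.reduce((sum, r) => sum + r.words, 0);
  if (words === 0) throw new RangeError("no words counted");
  return tokens / words;
}

/** Rank languages by fertility, with cost relative to a baseline language. */
function buildLeaderboard(
  counts: Record<string, SentenceCount[]>,
  baseline: string = "English",
): LeaderboardRow[] {
  const baselineRows = counts[baseline];
  if (!baselineRows) throw new Error(`missing baseline language: ${baseline}`);
  const baselineFertility = pooledFertility(baselineRows);
  return Object.keys(counts)
    .map((language) => {
      const f = pooledFertility(counts[language] ?? []);
      return { language, fertility: f, costMultiplier: f / baselineFertility };
    })
    .sort((a, b) => b.fertility - a.fertility); // highest fertility first
}
```

Because the input sentences are parallel, each language's row describes the same content, so the multiplier column can be read directly as relative cost under the chosen tokenizer.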

Known limitations

Fertility varies by domain; the leaderboard reflects FLORES news/wiki text, not your exact corpus — use the Analyzer on your own text.
The cost multiplier assumes word count is comparable across languages for equivalent content (true for FLORES, approximate for free text).
SEA-LION / SeaLLMs are approximated by the Llama-3.1 tokenizer they are built on; a custom-extended vocab could differ slightly.
Multi-turn risk is heuristic by design — see above.

Grounded in the deep-research report “Tokenizer Fertility and Multi-Turn Degradation” (2026), which confirmed the fertility tax is real and structural while flagging the multi-turn accuracy magnitude as an open question.