Methodology & Honesty
FertiScope draws a hard line between numbers that are exact (deterministic token math) and the one that is an estimate (multi-turn accuracy risk). The estimate is labelled everywhere it appears. This split is the whole point — a tool that confidently predicted accuracy loss would be selling a claim the underlying research could not support.
- Fertility = tokens ÷ words, from real tokenizers.
- Cost multiplier = fertility ÷ English fertility (same tokenizer). Within one API, $/token is constant, so the token ratio is the cost ratio for equivalent content.
- In-context capacity = ⌊(window − reply reserve) ÷ tokens-per-example⌋.
- Context budget = cumulative tokens across turns vs. the window.
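The four deterministic quantities above are plain arithmetic on token and word counts. The following is a minimal sketch of that arithmetic, not FertiScope's actual source: the function names and the sample numbers in the usage note are illustrative assumptions, and the counts are assumed to come from a real tokenizer and word counter.

```javascript
// Deterministic token math. Function names are illustrative, not FertiScope's own.

// Fertility = tokens ÷ words.
function fertility(tokenCount, wordCount) {
  if (wordCount <= 0) throw new RangeError("wordCount must be positive");
  return tokenCount / wordCount;
}

// Cost multiplier = fertility ÷ English fertility, both from the same tokenizer.
function costMultiplier(langFertility, englishFertility) {
  if (englishFertility <= 0) throw new RangeError("englishFertility must be positive");
  return langFertility / englishFertility;
}

// In-context capacity = floor((window − reply reserve) ÷ tokens-per-example).
// Clamped at 0 in case the reserve exceeds the window.
function inContextCapacity(windowTokens, replyReserve, tokensPerExample) {
  if (tokensPerExample <= 0) throw new RangeError("tokensPerExample must be positive");
  return Math.max(0, Math.floor((windowTokens - replyReserve) / tokensPerExample));
}

// Context budget: cumulative tokens after each turn, as a fraction of the window.
function contextBudget(tokensPerTurn, windowTokens) {
  let cumulative = 0;
  return tokensPerTurn.map((tokens, index) => {
    cumulative += tokens;
    return { turn: index + 1, cumulative, fractionUsed: cumulative / windowTokens };
  });
}
```

For example, a text of 10 words that encodes to 120 tokens has a fertility of 12. Against a hypothetical English fertility of 1.5 on the same tokenizer, the cost multiplier is 8. With an 8192-token window, a 1024-token reply reserve, and 700 tokens per example, the capacity is ⌊7168 ÷ 700⌋ = 10 examples.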
Multi-turn degradation risk is a transparent 0–4 score. Budget pressure — how fast cumulative tokens cross 75% of the window — scores 0–2 and is the primary driver. Fertility scores 0–2 but only counts as an amplifier once the budget is under pressure, so a roomy window stays Low regardless of fertility. Totals map 0–1 → Low, 2 → Medium, ≥3 → High. It is not a measured accuracy drop.
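The scoring rule above can be sketched as follows. Only the structure comes from the text: the 75% threshold, the two 0–2 components, fertility counting only under budget pressure, and the Low / Medium / High mapping. The specific cut-offs inside each component are assumptions made for illustration (how early the threshold must be crossed to score 2, and which cost multipliers score 1 or 2), and all function names are hypothetical.

```javascript
const PRESSURE_THRESHOLD = 0.75; // from the text: 75% of the window

// Budget pressure (0–2): how early cumulative tokens reach 75% of the window.
// ASSUMED cut-offs: reached within the first half of the turns -> 2,
// reached later -> 1, never reached -> 0.
function budgetPressureScore(tokensPerTurn, windowTokens) {
  let cumulative = 0;
  for (let i = 0; i < tokensPerTurn.length; i++) {
    cumulative += tokensPerTurn[i];
    if (cumulative >= PRESSURE_THRESHOLD * windowTokens) {
      return i + 1 <= tokensPerTurn.length / 2 ? 2 : 1;
    }
  }
  return 0;
}

// Fertility amplifier (0–2).
// ASSUMED cut-offs on the cost multiplier: >= 4 -> 2, >= 2 -> 1, otherwise 0.
function fertilityScore(costMultiplier) {
  if (costMultiplier >= 4) return 2;
  if (costMultiplier >= 2) return 1;
  return 0;
}

// Total risk: fertility only counts once the budget is under pressure.
function multiTurnRisk(tokensPerTurn, windowTokens, costMultiplier) {
  const budget = budgetPressureScore(tokensPerTurn, windowTokens);
  const amplifier = budget > 0 ? fertilityScore(costMultiplier) : 0;
  const total = budget + amplifier;
  const label = total >= 3 ? "High" : total === 2 ? "Medium" : "Low";
  return { budget, amplifier, total, label };
}
```

Under these assumed cut-offs, a conversation that never approaches the window stays Low even at a cost multiplier of 8, because the amplifier is gated on budget pressure. The same multiplier pushes a conversation that fills the window early to High.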
Words are counted with Intl.Segmenter using granularity:"word", keeping only word-like segments. This handles spaceless scripts (Thai, Khmer, Burmese, Lao) correctly, where naïve whitespace splitting would count one giant “word” and badly distort fertility.
Three real tokenizer families run in-app: GPT-4o (o200k_base) and GPT-4 / GPT-3.5 (cl100k_base) via js-tiktoken, and Llama-3.1 / SEA-LION v3 via llama3-tokenizer-js. Llama-3.1 and the continue-trained SEA-LION v3 share one tokenizer, so a single encoder covers both and reproduces the research’s reference numbers (Tamil ≈ 11–12 tokens/word).
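The word-counting step can be sketched with the standard Intl.Segmenter API, as a minimal illustration rather than the app's actual code. The function names and the Thai sample sentence are assumptions chosen for this example. The whitespace-based counter is included only to show the failure mode on spaceless scripts.

```javascript
// Count words with Intl.Segmenter, keeping only segments flagged isWordLike.
// Whitespace and punctuation segments have isWordLike === false and are skipped.
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    if (segment.isWordLike) count++;
  }
  return count;
}

// Naive baseline for comparison: split on runs of whitespace.
function countWhitespaceWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// English: both approaches agree on a simple sentence.
console.log(countWords("Hello, world! How are you?", "en"));
console.log(countWhitespaceWords("Hello, world! How are you?"));

// Thai (no spaces between words): the whitespace split sees one "word",
// while the segmenter finds several.
const thai = "ฉันรักภาษาไทยมาก";
console.log(countWhitespaceWords(thai), countWords(thai, "th"));
```

Segmentation of spaceless scripts relies on dictionary data shipped with the JavaScript runtime, so exact word counts may vary slightly between engines and versions.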
The leaderboard is computed on 50 parallel sentences from FLORES-200 (NLLB, Meta AI — CC-BY-SA 4.0). Because every language expresses the same meaning, fertility ratios are genuinely apples-to-apples.
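A leaderboard over a parallel corpus might be computed along the following lines. This is a sketch under stated assumptions: the text does not say whether counts are pooled across sentences or averaged per sentence, so pooling (total tokens ÷ total words) is assumed here. The `leaderboard` function, the corpus shape, and the caller-supplied `countTokens` and `countWords` functions are all hypothetical.

```javascript
// Rank languages by fertility over a parallel corpus.
// parallelCorpus: { [lang]: string[] }, the same sentences translated per language.
// countTokens(sentence) and countWords(sentence, lang) are supplied by the caller
// (for example, a real tokenizer and an Intl.Segmenter-based word counter).
// ASSUMPTION: counts are pooled over the corpus before dividing.
function leaderboard(parallelCorpus, countTokens, countWords, baseline = "en") {
  const fertilityByLang = {};
  for (const [lang, sentences] of Object.entries(parallelCorpus)) {
    let tokens = 0;
    let words = 0;
    for (const sentence of sentences) {
      tokens += countTokens(sentence);
      words += countWords(sentence, lang);
    }
    fertilityByLang[lang] = tokens / words;
  }
  const baselineFertility = fertilityByLang[baseline];
  return Object.entries(fertilityByLang)
    .map(([lang, fertility]) => ({
      lang,
      fertility,
      multiplier: fertility / baselineFertility, // cost multiplier vs. the baseline
    }))
    .sort((a, b) => b.fertility - a.fertility); // highest fertility first
}
```

Because every language's sentences carry the same meaning, the multiplier column compares token cost for equivalent content rather than for texts of arbitrary length.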
- Fertility varies by domain; the leaderboard reflects FLORES news/wiki text, not your exact corpus — use the Analyzer on your own text.
- The cost multiplier assumes word count is comparable across languages for equivalent content (true for FLORES, approximate for free text).
- SEA-LION / SeaLLMs are approximated by the Llama-3.1 tokenizer they are built on; a custom-extended vocab could differ slightly.
- Multi-turn risk is heuristic by design — see above.
This methodology is grounded in the deep-research report “Tokenizer Fertility and Multi-Turn Degradation” (2026), which confirmed that the fertility tax is real and structural while flagging the magnitude of multi-turn accuracy loss as an open question.