Moving from "Translated AI" to "Native AI"
Open Source
Linguistic Integrity
If Gemini 3.0 / GPT-5.2 is already good at Tamil, why do we need to build our own?
Cost & Efficiency Disparity
Impact: Inference costs 3x more. Generation is 3x slower.
Standard Tokenizers on English Text
Tokens
Standard Tokenizers on Tamil Text
Fragmentation
Notice the fragmentation. Tamil words are chopped into meaningless bytes, destroying semantic context.
Even the best models struggle
Tokens
Observation: A short paragraph of ~25 words consumes 72 tokens (Ratio: ~2.9 tokens/word).
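One low-level reason for this inflation: every Tamil code point occupies 3 bytes in UTF-8, so byte-fallback tokenizers (the standard BPE fallback) can emit several tokens per character when merges are missing. A minimal sketch of the disparity, using the word உறவுகளும் from the tables below (an illustration, not the slide's exact measurement):

```python
# Why byte-level tokenizers inflate Tamil token counts:
# each Tamil code point needs 3 bytes in UTF-8, so byte-fallback
# BPE can emit up to 3 tokens per character when merges are missing.
english = "relations"      # 9 characters
tamil = "உறவுகளும்"          # also 9 code points, from the table below

def utf8_bytes(text: str) -> int:
    """Number of raw bytes a byte-level tokenizer must cover."""
    return len(text.encode("utf-8"))

print(english, "->", len(english), "chars,", utf8_bytes(english), "bytes")
print(tamil, "->", len(tamil), "chars,", utf8_bytes(tamil), "bytes")
```

Both words are 9 code points long, but the Tamil word is 27 bytes versus 9 for the English one: three times the raw material for the tokenizer to chop up.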
English Model
Reads 50 pages
Tamil (Standard Tokenizer)
Reads only 15 pages
Problem: an LLM's context window (its working memory) is fixed (e.g., 4096 tokens).
Result: Because Tamil text is "bloated" with tokens, the model's effective memory shrinks. It "forgets" the start of long documents.
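The shrinkage is simple arithmetic. A back-of-envelope sketch (the page counts on this slide are illustrative; the figures below assume ~300 words per page and the ~2.9 tokens/word ratio measured earlier, versus a typical ~1.3 for English):

```python
# Hedged back-of-envelope estimate of the "shrinking memory" effect.
# Assumed numbers: 4096-token context, ~300 words/page,
# ~1.3 tokens/word for English vs ~2.9 for Tamil (measured above).
CONTEXT = 4096
WORDS_PER_PAGE = 300  # assumption for illustration only

def pages_in_context(tokens_per_word: float) -> float:
    """How many pages of text fit in the context window."""
    return CONTEXT / tokens_per_word / WORDS_PER_PAGE

print(f"English: ~{pages_in_context(1.3):.1f} pages")
print(f"Tamil:   ~{pages_in_context(2.9):.1f} pages")
```

Whatever the absolute numbers, the ratio is the point: at ~2.9 vs ~1.3 tokens/word, Tamil fits roughly 2.2x fewer pages into the same window.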
| Word | உறவுகளும் | உறவுகளையும் | உறவுகளை |
|---|---|---|---|
| Tamil Llama tiny | _உறவ-களும்- | _உறவ-களையும்- | _உறவ-களை |
| Tamil Llama | _உ-ற-வு-களும்- | _உ-ற-வு-களையும்- | _உ-ற-வு-களை |
| Word | கணிதமானது | எண்கணக்கியலில் |
|---|---|---|
| Tamil Llama tiny | _கணித-மானது- | _எண-க்-கண-க்கிய-லில்- |
| Tamil Llama | _கணித-மானது- | _எண்-கண-க்கிய-லில்- |
| Word | இடவெளிகளும் | முக்கோணம் |
|---|---|---|
| Tamil Llama tiny | _இட-வெ-ள-ிகளும்- | _முக்கோணம்- |
| Tamil Llama | _இட-வெளி-களும்- | _மு-க்கோ-ணம்- |
| Word | முத்திரட்சி | நால் திரட்சி | பல்திரட்சி |
|---|---|---|---|
| Preferred* | முத்-திரட்சி | நால்-திரட்சி | பல்-திரட்சி |
| Tamil Llama tiny | முத்திர-ட- ்சி | _ந- ால்- _திர- ட- ்சி | பல- ்த- ிர-ட- ்சி |
| Tamil Llama | மு-த்திர-ட்சி | _ந- ால்- _தி-ர-ட்சி | பல்-திர-ட்சி |
Why "Search" algorithms fail Tamil
| Feature | English | Tamil |
|---|---|---|
| Structure | Distinct Words | Agglutinative (Root + Suffixes) |
| Example | "Interdisciplinary" | "முத்திரட்சி" (Muthiratchi) |
| Standard Split | Inter + disciplinary (Logical) | மு + த்திர + ட + சி (Nonsense Noise) |
| Consequence | Learns Meaning | Learns Character Patterns only |
Solution: A custom tokenizer respects Tamil grammar, keeping roots and suffixes intact.
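To make the idea concrete, here is a minimal sketch of a morphology-aware split: strip known suffixes longest-first so the root survives intact. This is not the project's actual algorithm (real Tamil morphology also involves sandhi changes at the joins), and the suffix list is a tiny illustrative sample, not a real lexicon:

```python
# Minimal sketch (NOT the project's algorithm): greedy longest-suffix
# stripping. The suffix list is a toy sample for illustration.
SUFFIXES = ["களையும்", "களும்", "களை", "மானது", "லில்"]

def split_word(word: str) -> list[str]:
    """Peel suffixes off the end, longest first, keeping the root whole."""
    parts: list[str] = []
    changed = True
    while changed:
        changed = False
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf):
                word = word[: -len(suf)]
                parts.insert(0, suf)
                changed = True
                break
    return [word] + parts

print(split_word("உறவுகளும்"))    # root + suffix instead of byte noise
print(split_word("உறவுகளையும்"))
```

Instead of the character shards in the tables above, the word splits into a reusable root (உறவு) plus a grammatical suffix, so the model can learn the root's meaning once and the suffix's function once.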
Open Source Morphological Tokenizer - IN PROGRESS
Vanangamudi (Selvakumar Murugan)
Global LLM Training Data (e.g., Llama 3 / GPT-5.2)
Result: Models treat Tamil as a translation task, not a native thinking task.
Solving Combinatorial Explosion
Why? Wikipedia is too static. To learn Tamil grammar, the model must see roots combined with every possible suffix.
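The combinatorial idea can be sketched as pairing each root with every suffix so the model sees full inflection tables, not just whichever forms happen to appear on Wikipedia. The roots and suffixes below are a toy sample, and the naive concatenation is an assumption for illustration; a real generator must apply Tamil sandhi rules at the joins:

```python
# Sketch of combinatorial data generation: every root x every suffix.
# Toy sample only; naive concatenation ignores sandhi, which a real
# generator would have to handle at each join.
from itertools import product

roots = ["உறவு", "கணிதம்", "முக்கோணம்"]
suffixes = ["", "களும்", "களை", "களையும்"]

forms = [root + suffix for root, suffix in product(roots, suffixes)]
print(len(forms), "surface forms from",
      len(roots), "roots x", len(suffixes), "suffixes")
```

Three roots and four suffixes already yield twelve surface forms; with thousands of roots and dozens of case, number, and clitic suffixes, the space explodes, which is exactly why static corpora under-cover it.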
"The Knowledge"
Raw text from books, web, and archives.
Goal: Learn Grammar & Facts.
"The Behavior"
Native instructions (not translated).
Ex: "Write a petition to the VAO."
"The Reasoning"
Using Gemini to generate logic puzzles in Tamil.
Goal: Advanced Reasoning.
We are not just building a model.
We are preserving Digital Sovereignty for the Tamil language.
Thank You