Center for Tamil Natural Language Processing Research

Building a Transformer-Based Tamil Relation Extraction Pipeline for Knowledge Graph Construction

shaseevan@gmail.com — Tue, 23 Jun 2026 07:02:27 +0000

Introduction

Knowledge Graph construction from unstructured text remains one of the fundamental challenges in Natural Language Processing (NLP). The process requires identifying entities, understanding semantic relationships between them, and converting textual information into structured triples that can be stored, queried, and reasoned over.

For low-resource languages such as Sri Lankan Tamil, this challenge is even more significant due to the limited availability of annotated datasets, domain-specific resources, and production-ready information extraction systems.

This work presents the design and implementation of a Transformer-based Tamil Relation Extraction pipeline that converts raw Tamil documents into structured knowledge triples.

The primary objective is to establish a modular, extensible architecture capable of supporting future components such as Coreference Resolution, Entity Linking, Ontology Validation, and Knowledge Graph Construction.

Problem Statement

Consider the following Tamil sentence:

சாம்.ஏ.சபாபதி யாழ்ப்பாணத்தில் வாழ்ந்தார்.

Humans can easily infer the relationship:

(சாம்.ஏ.சபாபதி, LIVED_IN, யாழ்ப்பாணம்)

However, machines must solve several sub-problems before reaching this representation:

Detect sentence boundaries.
Identify named entities.
Determine entity types.
Generate valid entity pairs.
Understand the semantic relationship.
Convert the result into a structured representation.

This sequence of tasks forms the basis of Information Extraction systems and Knowledge Graph construction pipelines.

System Architecture

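The proposed architecture follows a modular pipeline design in which a document passes through sentence splitting, named entity recognition, pair generation, entity marking, relation extraction, and triple construction.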

A pipeline-based design was selected over end-to-end joint extraction architectures because of its modularity, interpretability, ease of debugging, and suitability for low-resource environments.

Each stage can be independently evaluated, improved, or replaced without affecting the remaining components.

Phase 1: Sentence Splitting

The first stage converts raw documents into individual Tamil sentences.

Input:

சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.
அவர் ஆசிரியராக பணியாற்றினார்.

Output:

[
    "சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.",
    "அவர் ஆசிரியராக பணியாற்றினார்."
]

Sentence segmentation reduces computational complexity and allows downstream models to process smaller semantic units.
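The splitting step can be sketched with a single regular expression. This is a minimal illustration, not the splitter used in the pipeline; the function name `split_sentences` and the rule "split only when whitespace follows the punctuation" are assumptions made for the example.

```python
import re

# Split after sentence-final punctuation only when whitespace follows,
# so the periods inside initials such as "சாம்.ஏ.சபாபதி" are left alone.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(document: str) -> list[str]:
    """Return the non-empty sentences of a raw Tamil document."""
    parts = _SENTENCE_END.split(document.strip())
    return [part.strip() for part in parts if part.strip()]


doc = "சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.\nஅவர் ஆசிரியராக பணியாற்றினார்."
sentences = split_sentences(doc)
```

A rule this simple still breaks on initials that are followed by a space, so a production splitter would likely need an abbreviation list or a trained model.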

Phase 2: Named Entity Recognition

The Named Entity Recognition stage identifies entities and assigns semantic labels.

Entity Categories:

PERSON
LOCATION
ORGANIZATION

Example:

Input:
டாக்டர் கந்தசாமி யாழ்ப்பாண போதனா வைத்தியசாலையில் பணியாற்றினார்.

Output:

PERSON
    கந்தசாமி

ORGANIZATION
    யாழ்ப்பாண போதனா வைத்தியசாலை

A fine-tuned Sri Lankan Tamil NER model is currently used for entity identification.

During experimentation, several tokenizer-related boundary issues were observed, particularly involving Tamil Unicode combining characters. To mitigate these issues, an entity post-processing layer was introduced.
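One way such a post-processing layer could repair boundaries is to widen a predicted span until it no longer cuts a base letter off from its combining marks. The sketch below assumes character offsets as input; `snap_to_graphemes` is a hypothetical helper, and the post does not describe the actual rules used.

```python
import unicodedata


def snap_to_graphemes(text: str, start: int, end: int) -> tuple[int, int]:
    """Widen [start, end) so it never separates a base letter from its marks."""
    # A span must not begin on a combining mark: move back to the base letter.
    while 0 < start < len(text) and unicodedata.category(text[start]).startswith("M"):
        start -= 1
    # A span must not end just before a combining mark: absorb trailing marks.
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return start, end


text = "டாக்டர் கந்தசாமி யாழ்ப்பாண போதனா வைத்தியசாலையில் பணியாற்றினார்."
name = "கந்தசாமி"
begin = text.index(name)
# Simulate a prediction that lost the final vowel sign of the name.
start, end = snap_to_graphemes(text, begin, begin + len(name) - 1)
repaired = text[start:end]
```

Tamil vowel signs and the pulli all fall in the Unicode mark categories, which is why checking the general category is enough for this illustration.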

Phase 3: Pair Generation

Named entities alone do not represent knowledge.

Relations exist between pairs of entities.

For example:

PERSON
    சாம்பசிவம்

LOCATION
    யாழ்ப்பாணம்

becomes:

(சாம்பசிவம், யாழ்ப்பாணம்)

The Pair Generator creates candidate entity pairs based on predefined semantic constraints.

Allowed combinations include:

{
    ("PERSON", "LOCATION"),
    ("PERSON", "ORGANIZATION"),
    ("ORGANIZATION", "LOCATION")
}

This significantly reduces the search space for relation classification.
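A minimal sketch of the Pair Generator, using the allowed combinations listed above. The `Entity` type and the `generate_pairs` name are assumptions for illustration.

```python
from itertools import permutations
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    label: str


ALLOWED_PAIRS = {
    ("PERSON", "LOCATION"),
    ("PERSON", "ORGANIZATION"),
    ("ORGANIZATION", "LOCATION"),
}


def generate_pairs(entities):
    """Return ordered (head, tail) pairs whose type combination is allowed."""
    return [
        (head, tail)
        for head, tail in permutations(entities, 2)
        if (head.label, tail.label) in ALLOWED_PAIRS
    ]


entities = [Entity("சாம்பசிவம்", "PERSON"), Entity("யாழ்ப்பாணம்", "LOCATION")]
pairs = generate_pairs(entities)
```

With one PERSON, one ORGANIZATION, and one LOCATION in a sentence, the filter keeps three of the six possible ordered pairs.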

Phase 4: Entity Marking

Before relation classification, entity mentions are explicitly marked.

Example:

Original:

சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.

Transformed into:

[E1] சாம்பசிவம் [/E1]
[E2] யாழ்ப்பாணத்தில் [/E2]
வாழ்ந்தார்.

Entity marking allows transformer-based models to focus on the target entities when predicting relationships.
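The marking step can be sketched as plain string insertion. The sketch assumes each entity arrives as a (start, end) character span and that the two spans do not overlap; `mark_entities` is a hypothetical name.

```python
def mark_entities(sentence: str, e1: tuple[int, int], e2: tuple[int, int]) -> str:
    """Wrap two non-overlapping character spans in [E1]/[E2] markers."""
    tagged = [(e1, "E1"), (e2, "E2")]
    # Insert from the rightmost span first so earlier offsets stay valid.
    for (start, end), tag in sorted(tagged, key=lambda item: item[0][0], reverse=True):
        sentence = (
            f"{sentence[:start]}[{tag}] {sentence[start:end]} [/{tag}]{sentence[end:]}"
        )
    return sentence


sentence = "சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்."
person, place = "சாம்பசிவம்", "யாழ்ப்பாணத்தில்"
p_start, l_start = sentence.index(person), sentence.index(place)
marked = mark_entities(
    sentence,
    (p_start, p_start + len(person)),
    (l_start, l_start + len(place)),
)
```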

Phase 5: Transformer-Based Relation Extraction

Relation Extraction is formulated as a classification problem.

Given:

[E1] சாம்பசிவம் [/E1]
[E2] யாழ்ப்பாணத்தில் [/E2]
வாழ்ந்தார்.

the model predicts:

LIVED_IN

Current relation labels include:

LIVED_IN
WORKED_AT
CONTRIBUTED_TO
PART_OF
BORN_IN

The current implementation uses a multilingual Natural Language Inference (NLI) transformer model in a zero-shot setting.

This approach enables rapid prototyping without requiring a dedicated Tamil relation extraction dataset.
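The zero-shot formulation can be sketched as follows: each relation label is turned into a hypothesis, the marked sentence is the premise, and the label whose hypothesis is most strongly entailed wins. The hypothesis templates, the threshold, and the `entailment_score` callable are all assumptions; the post does not name the NLI model or its prompts. In practice the callable would wrap whichever multilingual NLI model is in use.

```python
from typing import Callable, Optional

# Hypothetical hypothesis templates, one per relation label from the post.
HYPOTHESIS_TEMPLATES = {
    "LIVED_IN": "{e1} lived in {e2}.",
    "WORKED_AT": "{e1} worked at {e2}.",
    "CONTRIBUTED_TO": "{e1} contributed to {e2}.",
    "PART_OF": "{e1} is part of {e2}.",
    "BORN_IN": "{e1} was born in {e2}.",
}


def classify_relation(
    premise: str,
    e1: str,
    e2: str,
    entailment_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Optional[tuple[str, float]]:
    """Return (label, score) for the best-entailed hypothesis, or None."""
    scored = {
        label: entailment_score(premise, template.format(e1=e1, e2=e2))
        for label, template in HYPOTHESIS_TEMPLATES.items()
    }
    best = max(scored, key=scored.get)
    # Below the threshold, no relation is asserted for this pair.
    return (best, scored[best]) if scored[best] >= threshold else None
```

Passing the scorer in as a parameter keeps the label set and decision rule independent of the model, which fits the modular design described earlier.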

Phase 6: Triple Construction

After relation classification, extracted information is converted into a structured knowledge representation.

Input:

Entity 1:
சாம்பசிவம்

Relation:
LIVED_IN

Entity 2:
யாழ்ப்பாணம்

Output:

(
    சாம்பசிவம்,
    LIVED_IN,
    யாழ்ப்பாணம்
)

These triples form the foundational building blocks of a Knowledge Graph.

Phase 7: End-to-End Pipeline Integration

The final stage orchestrates all components into a unified extraction workflow.

Pseudo-flow:

document
    → sentence splitter
    → NER
    → pair generator
    → entity marker
    → relation extraction
    → triple builder

This design enables document-level processing while preserving modularity between components.
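The pseudo-flow above could be wired together roughly as shown here. Every stage is passed in as a callable, so the sketch makes no claim about how the real components are implemented; the parameter names are illustrative.

```python
def extract_triples(document, split, recognize, pair, mark, classify):
    """Run every stage in order and collect (head, relation, tail) triples."""
    triples = []
    for sentence in split(document):
        entities = recognize(sentence)
        for head, tail in pair(entities):
            relation = classify(mark(sentence, head, tail))
            # A classifier may decline to assert a relation for a pair.
            if relation is not None:
                triples.append((head, relation, tail))
    return triples
```

Because each stage is only a function argument, any one of them can be swapped or stubbed out during evaluation without touching the others.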

Experimental Results

The pipeline was evaluated on multiple Tamil sentence structures including:

Personal biographies
Academic affiliations
Institutional relationships
Historical narratives
Sri Lankan Tamil named entities

Example:

Input:

சாம்.ஏ.சபாபதி யாழ்ப்பாணத்தில் வாழ்ந்தார்.

Output:

(
    சாம்.ஏ.சபாபதி,
    LIVED_IN,
    யாழ்ப்பாணத்தில்
)

Example:

Input:

டாக்டர் கந்தசாமி யாழ்ப்பாண போதனா வைத்தியசாலையில் பணியாற்றினார்.

Output:

(
    கந்தசாமி,
    WORKED_AT,
    யாழ்ப்பாண போதனா வைத்தியசாலை
)

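In the examples above, the pipeline extracts meaningful knowledge triples from Tamil text. Note that the first triple keeps the inflected surface form யாழ்ப்பாணத்தில் rather than the canonical யாழ்ப்பாணம், so entity normalization is still needed before such triples enter a Knowledge Graph.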

Future Work

The current system serves as the foundation for a larger Tamil Knowledge Graph ecosystem.

Planned extensions include:

Coreference Resolution
    ↓
Entity Canonicalization
    ↓
Ontology Validation
    ↓
Relation Filtering
    ↓
Knowledge Graph Construction

Additional research directions include:

MuRIL-based NER models
XLM-R based NER models
Fine-tuned Tamil Relation Extraction models
Document-level Relation Extraction
Ontology-aware triple validation
Knowledge Graph population from Noolaham archives

Conclusion

This work demonstrates the successful implementation of a complete Transformer-based Tamil Relation Extraction pipeline capable of converting unstructured Tamil documents into structured knowledge triples.

The system currently supports:

✓ Sentence Segmentation

✓ Named Entity Recognition

✓ Entity Pair Generation

✓ Transformer-Based Relation Extraction

✓ Triple Construction

✓ End-to-End Knowledge Extraction

While several challenges remain in entity boundary detection and relation classification, the architecture establishes a scalable foundation for future Tamil Knowledge Graph construction and low-resource language information extraction research.

The next stage of development will focus on evaluation, coreference resolution, ontology integration, and large-scale knowledge graph generation from Sri Lankan Tamil textual resources.

Building a Tamil Coreference Resolution System for Information Extraction

shaseevan@gmail.com — Mon, 15 Jun 2026 12:19:39 +0000

Modern Information Extraction systems rely on more than Named Entity Recognition (NER). While NER can identify entities such as people, locations, and organizations, it does not explain how references to those entities evolve throughout a document.

This is where Coreference Resolution (CR) becomes essential.

Coreference Resolution is the task of determining when multiple mentions within a document refer to the same real-world entity. It serves as a critical bridge between entity recognition and structured knowledge extraction.

For applications such as:

Knowledge Graph Construction
Relation Extraction
Semantic Search
Document Intelligence
Retrieval-Augmented Generation (RAG)
Conversational AI

accurate coreference resolution is often the difference between fragmented information and coherent knowledge.

The Information Extraction Problem

During our work at CTNLPR, we observed a common issue in Tamil document processing.

Documents rarely repeat the full entity name in every sentence. Instead, they rely on:

Pronouns
Possessive references
Descriptive noun phrases
Location references

Humans resolve these references naturally using context. Machines do not.

Consider a simple information extraction pipeline without coreference resolution.

சாம்.ஏ.சபாபதி 1898 ஆம் ஆண்டு செப்டெம்பர் 8 ஆம் திகதி பிறந்தார்.
இவர் யாழ்ப்பாண மாவட்டத்தின் முன்னாள் மேயர் ஆவார்.
இவர் யாழ்ப்பாண நூலகத்தின் உருவாக்கத்தில் மிக முக்கிய பங்காற்றினார்.

A relation extraction system may produce:

(இவர், ஆவார், முன்னாள் மேயர்)

(இவர், பங்காற்றினார், யாழ்ப்பாண நூலகம்)

Although the relations are syntactically correct, the subject is an unresolved pronoun (இவர்), so neither triple is attached to a named entity.

After coreference resolution:

(சாம்.ஏ.சபாபதி, ஆவார், முன்னாள் மேயர்)

(சாம்.ஏ.சபாபதி, பங்காற்றினார், யாழ்ப்பாண நூலகம்)

The extracted knowledge now becomes meaningful and directly usable within downstream systems.

This observation motivated the development of a dedicated Tamil coreference layer.

Why We Did Not Start with Neural Coreference Models

Most modern coreference systems are built using:

Transformer-based architectures
Mention-ranking models
End-to-end neural coreference systems
Span-ranking approaches

While these approaches achieve strong performance for English and other high-resource languages, they require:

Large annotated datasets
Extensive model training
Significant computational resources
Language-specific supervision

Tamil currently lacks large-scale publicly available coreference corpora.

For a low-resource language, reproducing state-of-the-art neural coreference systems becomes difficult without substantial annotation effort.

Instead of waiting for large benchmark datasets, we explored a different direction:

Build a deterministic, explainable, and task-oriented coreference framework optimized for Information Extraction.

The objective was not to compete with neural coreference benchmarks.

The objective was to improve relation extraction quality in real-world Tamil document processing.

System Architecture

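The system is organized into layers: mention detection, entity normalization, entity memory, coreference resolution, and coreference chain construction. Each layer contributes a specific capability toward discourse-level entity understanding.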

Mention Detection

Before references can be resolved, the system must identify candidate mentions.

The mention detection layer combines several complementary strategies.

Named Entity Recognition

A Tamil NER model identifies:

PERSON
LOCATION
ORGANIZATION

These entities become candidates for memory updates and future reference resolution.

Pronoun Detection

The system maintains Tamil pronoun lexicons and detects references such as:

இவர்
அவர்
அவருக்கு
அவருடைய
இவர்கள்

These mentions are forwarded to the resolution layer.

Location References

Tamil documents frequently refer to locations indirectly using expressions such as:

அங்கு
இங்கு
அங்கே
இங்கே

These references must be linked back to previously observed location entities.

Noun Phrase References

Many documents use descriptive phrases instead of explicit names.

Examples include:

இந்த தலைவர்
இந்த மருத்துவர்
இந்த சமூகசேவகர்

These expressions often function as entity references and therefore require resolution.

Entity Normalization

Tamil’s rich morphology introduces another challenge.

The same entity can appear in multiple grammatical forms:

யாழ்ப்பாணம்
யாழ்ப்பாணத்தில்
யாழ்ப்பாணத்தின்
யாழ்ப்பாணத்திற்கு

From an information extraction perspective, these should all map to a single canonical entity.

To reduce fragmentation, the system applies lemmatization-based normalization using Stanza.

யாழ்ப்பாணத்தில்
      ↓
யாழ்ப்பாணம்

This normalization step significantly improves memory consistency and entity linking.
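The effect of normalization can be sketched as grouping surface forms under their canonical form. The `lemmatize` argument is an assumption: in practice it would wrap the Stanza lemmatizer mentioned above, while here a lookup table stands in for it so the example stays self-contained.

```python
from collections import defaultdict


def group_by_canonical(mentions, lemmatize):
    """Map each canonical (lemmatized) form to the surface forms seen for it."""
    groups = defaultdict(list)
    for surface in mentions:
        groups[lemmatize(surface)].append(surface)
    return dict(groups)


# Stand-in lemma table for illustration only; unknown forms map to themselves.
_LEMMAS = {
    "யாழ்ப்பாணத்தில்": "யாழ்ப்பாணம்",
    "யாழ்ப்பாணத்தின்": "யாழ்ப்பாணம்",
    "யாழ்ப்பாணத்திற்கு": "யாழ்ப்பாணம்",
}
forms = ["யாழ்ப்பாணம்", "யாழ்ப்பாணத்தில்", "யாழ்ப்பாணத்தின்", "யாழ்ப்பாணத்திற்கு"]
groups = group_by_canonical(forms, lambda s: _LEMMAS.get(s, s))
```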

The Entity Memory Layer

One of the most important design decisions was introducing a lightweight discourse memory.

Instead of neural antecedent scoring, the system explicitly tracks recently observed entities.

Current memory structure:

last_person
last_location
last_org

Whenever a new entity is detected, the memory is updated accordingly.

This memory serves as the contextual state of the document and provides the foundation for reference resolution.
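The three-slot memory can be sketched as a small dataclass. The field names follow the structure listed above; the `update` method and its label-to-slot mapping are assumptions about how the update might look.

```python
from dataclasses import dataclass
from typing import Optional

_SLOT_FOR_LABEL = {
    "PERSON": "last_person",
    "LOCATION": "last_location",
    "ORGANIZATION": "last_org",
}


@dataclass
class EntityMemory:
    last_person: Optional[str] = None
    last_location: Optional[str] = None
    last_org: Optional[str] = None

    def update(self, text: str, label: str) -> None:
        """Overwrite the slot for this entity type; ignore unknown labels."""
        slot = _SLOT_FOR_LABEL.get(label)
        if slot is not None:
            setattr(self, slot, text)


memory = EntityMemory()
memory.update("சாம்.ஏ.சபாபதி", "PERSON")
memory.update("யாழ்ப்பாணம்", "LOCATION")
```

Keeping only the most recent entity per type is what makes the memory lightweight, and it is also why documents with several people in play need the multi-entity memory listed under future work.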

Coreference Resolution

Once memory has been populated, reference resolution becomes straightforward.

When the resolver encounters:

இவர்

it retrieves the current value of:

memory.last_person

and resolves the reference to the corresponding entity.

The same mechanism applies to:

Possessive references
Location references
Descriptive noun phrases

This approach provides a simple yet highly interpretable resolution strategy.
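A minimal sketch of the lookup, assuming the memory is exposed as a mapping with the three slots described earlier. The lexicons reuse the singular forms listed in this post; the plural இவர்கள் is left out because a single-slot memory cannot represent a group.

```python
PERSON_REFERENCES = {"இவர்", "அவர்", "அவருக்கு", "அவருடைய"}
LOCATION_REFERENCES = {"அங்கு", "இங்கு", "அங்கே", "இங்கே"}


def resolve(mention: str, memory: dict):
    """Return the antecedent stored in memory for a reference, or None."""
    if mention in PERSON_REFERENCES:
        return memory.get("last_person")
    if mention in LOCATION_REFERENCES:
        return memory.get("last_location")
    return None  # not a known reference, or nothing seen yet


memory_state = {"last_person": "சாம்.ஏ.சபாபதி", "last_location": "யாழ்ப்பாணம்"}
antecedent = resolve("இவர்", memory_state)
```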

Coreference Chain Construction

Rather than merely replacing mentions, the system groups related references into entity clusters.

Example:

சாம்.ஏ.சபாபதி ஒரு பிரபல சமூக சேவகர். அவர் யாழ்ப்பாணத்தில் பல கல்வித் திட்டங்களை முன்னெடுத்தார். இந்த சமூகசேவகர் பல விருதுகளை பெற்றுள்ளார். அவர் யாழ்ப்பாணத்தில் பிறந்தார். அங்கு அவருக்கு பெரும் மதிப்பு இருந்தது. இவர் யாழ்ப்பாண நூலகத்தின் உருவாக்கத்தில் முக்கிய பங்காற்றினார்.

Entity: சாம்.ஏ.சபாபதி

├── சாம்.ஏ.சபாபதி
├── அவர் 
├── இந்த சமூகசேவகர்
├── அவர் 
└── இவர் 

Entity: யாழ்ப்பாணம்

├── யாழ்ப்பாணம்
└── அங்கு 

Entity: யாழ்ப்பாண நூலகம்

These chains provide a document-level view of entity references and become valuable for:

Debugging
Evaluation
Knowledge Graph Construction
Relation Extraction
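Chain construction can be sketched as grouping resolved mentions by their entity while preserving document order. The input format, a list of (mention, entity) pairs, is an assumption made for the example.

```python
def build_chains(resolved_mentions):
    """Group (mention, entity) pairs into per-entity chains in document order."""
    chains = {}
    for mention, entity in resolved_mentions:
        if entity is not None:  # unresolved mentions join no chain
            chains.setdefault(entity, []).append(mention)
    return chains


resolved = [
    ("சாம்.ஏ.சபாபதி", "சாம்.ஏ.சபாபதி"),
    ("அவர்", "சாம்.ஏ.சபாபதி"),
    ("யாழ்ப்பாணம்", "யாழ்ப்பாணம்"),
    ("இந்த சமூகசேவகர்", "சாம்.ஏ.சபாபதி"),
    ("அங்கு", "யாழ்ப்பாணம்"),
]
chains = build_chains(resolved)
```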

Visualization and Analysis

To support experimentation and validation, we developed a Streamlit-based visualization layer.

Users can submit Tamil documents and inspect generated coreference chains in real time.

This interface provides transparency into resolution decisions and helps identify weaknesses in rule design.

Key Insight

One of the most important lessons from this work is: Coreference Resolution is not merely a pronoun-resolution task. It is an entity consistency layer that connects NER, relation extraction, and knowledge graph construction.

Without coreference:

(இவர், பங்காற்றினார், யாழ்ப்பாண நூலகம்)

With coreference:

(சாம்.ஏ.சபாபதி, பங்காற்றினார், யாழ்ப்பாண நூலகம்)

The second representation is immediately usable within structured knowledge systems.

System Implementation at CTNLPR

At CTNLPR, this architecture is being developed as part of a broader Tamil Information Extraction ecosystem.

Current capabilities include:

Named Entity Recognition
Entity Normalization
Pronoun Resolution
Possessive Resolution
Location Resolution
Rule-Based Noun Phrase Resolution
Coreference Chain Construction
Streamlit-Based Visualization

The system acts as a foundational layer between entity extraction and knowledge graph generation.

Future Work

The next phase of development will focus on:

Multi-entity discourse memory
Entity salience tracking
Advanced noun phrase resolution
Relation-aware coreference resolution
Knowledge graph integration
Hybrid neural-rule architectures

These improvements will move the system toward more sophisticated document-level semantic understanding.

Conclusion

Building effective Tamil Information Extraction systems requires more than Named Entity Recognition.

By introducing a dedicated coreference resolution layer, we can maintain entity consistency across documents, improve relation extraction quality, and generate more reliable structured knowledge.

For low-resource languages such as Tamil, carefully designed rule-based systems remain a practical and effective pathway toward document-level semantic understanding while larger neural approaches continue to mature.

Fine-Tuning IndicNER for Sri Lankan Tamil Named Entity Recognition

shaseevan@gmail.com — Tue, 26 May 2026 07:55:59 +0000

The rapid evolution of transformer-based multilingual NLP systems has significantly improved Named Entity Recognition (NER) performance across many high-resource languages. However, low-resource language variants such as Sri Lankan Tamil still face substantial challenges due to limited domain-specific datasets and linguistic underrepresentation in existing multilingual training corpora.

At CTNLPR (Center for Tamil Natural Language Processing Research), we fine-tuned the IndicNER model specifically for Sri Lankan Tamil using a custom annotated NER corpus.

The primary objective of this work was to improve entity recognition performance for Sri Lankan Tamil linguistic patterns, local entities, and morphology-aware contextual variations.

Why Sri Lankan Tamil NER is Challenging

Most multilingual NER systems are trained primarily on:

general web corpora
Indian Tamil datasets
multilingual benchmark datasets
formal textual sources

When applied to Sri Lankan Tamil, these systems often struggle due to regional linguistic differences and contextual entity variations.

Some of the key challenges include:

Sri Lankan regional names
local organization terminology
morphological suffix complexity
OCR-induced token inconsistencies
subword tokenization fragmentation
contextual ambiguity in entity boundaries

These limitations significantly affect downstream NLP applications such as:

semantic search
document intelligence
knowledge graph generation
Tamil chatbots
Retrieval-Augmented Generation (RAG)
government document processing

Model Fine-Tuning Overview

Base Model

The fine-tuning process was built on top of ai4bharat/IndicNER.

IndicNER is a multilingual transformer-based Named Entity Recognition model developed for Indic languages and serves as a strong baseline for Tamil NER tasks.

Training Dataset

The model was fine-tuned on a corpus of Sri Lankan Tamil NER samples.

The dataset was manually curated and annotated under the CTNLPR Tamil NLP research pipeline.

The corpus included annotations for:

Entity Type	Description
PERSON	Human names
LOCATION	Geographic entities
ORGANIZATION	Institutional entities

The dataset preparation pipeline incorporated:

OCR-aware preprocessing
Unicode normalization
Tamil-safe tokenization
BIO tagging
subword label alignment
manual annotation validation

Technical Challenges During Fine-Tuning

Fine-tuning multilingual transformer architectures for Tamil requires addressing several language-specific tokenization and alignment problems.

1. Tamil Tokenization Complexity

Tamil is morphologically rich, meaning that suffixes and grammatical structures often alter token boundaries.

Improper tokenization can lead to:

broken entity spans
incorrect BIO labels
fragmented entity predictions

To address this, the pipeline implemented:

Tamil-safe tokenization
script-preserving preprocessing
Unicode normalization

2. Subword Label Alignment

Transformer tokenizers frequently split Tamil words into multiple subword tokens.

For example, யாழ்ப்பாணப் பல்கலைக்கழகம் may be split into multiple subword fragments internally. Without correct label alignment:

entity spans become corrupted
BIO labels mismatch
training instability increases

The training pipeline therefore included proper subword label propagation strategies to ensure accurate entity learning.
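One common propagation strategy is sketched below: the first subword of a word keeps the word's label, later subwords of the same word receive the matching I- label, and special tokens are excluded. The post does not say which strategy was used, so this is an assumption. The `word_ids` list is assumed to have the shape that Hugging Face fast tokenizers return from `word_ids()`, with `None` for special tokens.

```python
def align_labels(word_labels, word_ids):
    """Propagate word-level BIO labels onto subword tokens."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(None)  # special token: excluded from the loss
        else:
            label = word_labels[word_id]
            # A continuation subword must not open a new entity.
            if word_id == previous and label.startswith("B-"):
                label = "I-" + label[2:]
            aligned.append(label)
        previous = word_id
    return aligned


# Two words ("யாழ்ப்பாணப்", "பல்கலைக்கழகம்") split into 3 + 2 subwords.
word_labels = ["B-ORG", "I-ORG"]
word_ids = [None, 0, 0, 0, 1, 1, None]
aligned = align_labels(word_labels, word_ids)
```

The alternative is to label only the first subword and mask the rest, which trades some training signal for simpler evaluation.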

3. OCR Noise Handling

Tamil OCR systems still generate:

grapheme inconsistencies
merged tokens
invalid Unicode combinations
punctuation corruption

OCR-aware normalization and cleaning stages were integrated before fine-tuning to reduce noise propagation into the NER model.

Fine-Tuning Configuration

Training Setup

Component	Value
Base Model	IndicNER
Language	Sri Lankan Tamil
Task	Token Classification
Labels	PERSON · LOCATION · ORGANIZATION
Architecture	Transformer-based NER
Annotation Format	BIO Tagging

Training Improvements Implemented

Several improvements were introduced during training to better adapt the model for Sri Lankan Tamil.

Key Optimizations

Tamil-safe Tokenization

Preserved Tamil script integrity during preprocessing and tokenization.

Proper Subword Label Alignment

Ensured complete entity learning across fragmented transformer tokens.

Morphology-aware Training

Focused on handling Tamil suffix patterns and contextual entity boundaries.

OCR-aware Preprocessing

Reduced noisy OCR artifacts before training.

Model Evaluation Results

The fine-tuned model achieved the following performance metrics.

Overall Performance

Metric	Score
F1 Score	0.650
Precision	0.602
Recall	0.707
Accuracy	96.04%
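As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean formula.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = f1_score(0.602, 0.707)
```

The large gap between accuracy and F1 is expected in NER: token-level accuracy is dominated by the many non-entity tokens, while F1 is computed over entities only.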

Entity-wise Performance

Entity Type	F1 Score
PERSON	0.721
LOCATION	0.698
ORGANIZATION	0.484

The PERSON and LOCATION categories achieved relatively strong performance, while ORGANIZATION entities remained significantly more challenging.

Inference Examples (Model Predictions)

Example 1

Input Sentence

“பாரதிதாசன் எழுதிய நூலை பாரதி பதிப்பகம் வெளியிட்டது.”

Model Output: PERSON → பாரதிதாசன், ORGANIZATION → பாரதி பதிப்பகம்

Example 2

Input Sentence

“வடமராட்சி தொழில்நுட்ப நிறுவனம் மாணவர்களை சேர்த்தது.”

Model Output: ORGANIZATION → வடமராட்சி தொழில்நுட்ப நிறுவனம்

Example 3

Input Sentence

“நவமணி கிராமம் வெள்ளத்தால் பாதிக்கப்பட்டது.”

Model Output: LOCATION → நவமணி

Example 4

Input Sentence

“ஜே.ஏ.எஸ்.பீ. ஜயசிங்க நவமணி கிராமத்திற்கு சென்றார்.”

Model Output: PERSON → ஜே.ஏ.எஸ்.பீ. ஜயசிங்க, LOCATION → நவமணி

Why Organization Entities Are Difficult

Organization entities in Sri Lankan Tamil often exhibit:

inconsistent naming patterns
long contextual spans
abbreviation variations
mixed-language terminology
domain-specific structures

Examples include:

universities
ministries
NGOs
educational institutions
government departments

This creates substantial variability that affects generalization.

Future improvements will therefore focus heavily on:

organization-specific corpus expansion
domain-balanced sampling
contextual augmentation
larger annotation coverage

Key Observations

One of the most important findings from this work is that:

Better preprocessing and domain-specific data can be as important as model architecture.

For low-resource languages like Sri Lankan Tamil:

high-quality annotations matter
OCR normalization matters
tokenizer alignment matters
linguistic preprocessing matters

Large transformer architectures alone are not sufficient without carefully prepared language-specific datasets.

Applications of the Fine-Tuned Model

The fine-tuned Sri Lankan Tamil IndicNER model can support:

Application	Use Case
Tamil NER	Entity extraction
Semantic Search	Context-aware retrieval
RAG Systems	Entity-aware retrieval pipelines
OCR Post-processing	Structured extraction
Knowledge Graphs	Entity linking
Tamil Chatbots	Context understanding

Future Work

The next phase of research under CTNLPR includes:

expanding organization entity annotations
improving contextual entity modeling
OCR-noisy benchmark evaluations
multilingual Tamil-Sinhala entity alignment
relation extraction
low-resource transformer optimization
Tamil document intelligence systems

Conclusion

The fine-tuning of IndicNER for Sri Lankan Tamil represents an important step toward building stronger NLP infrastructure for low-resource Tamil language technologies.

By combining:

domain-specific Tamil datasets
OCR-aware preprocessing
morphology-sensitive tokenization
transformer fine-tuning

CTNLPR aims to improve the quality and inclusivity of multilingual AI systems for Sri Lankan Tamil.

As AI systems increasingly become language-dependent, foundational NLP research for low-resource languages remains critical for ensuring broader linguistic representation in future AI technologies.

Building a Sri Lankan Tamil Named Entity Recognition Dataset for Low-Resource NLP

shaseevan@gmail.com — Mon, 25 May 2026 15:25:21 +0000

The growth of Large Language Models (LLMs) and multilingual NLP systems has significantly improved language technologies across major global languages. However, low-resource languages such as Sri Lankan Tamil still face a severe lack of high-quality annotated datasets—especially for foundational tasks like Named Entity Recognition (NER).

To address this gap, we developed the Srilankan-Tamil-NER Dataset, a Tamil NER dataset designed specifically for Sri Lankan Tamil linguistic and contextual usage.

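This dataset is intended to support the fine-tuning of transformer-based NER models and the downstream applications that depend on them.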

Why Sri Lankan Tamil NER Matters

Named Entity Recognition (NER) is a core NLP task that identifies and classifies entities such as:

Person names
Locations
Organizations
Dates
Miscellaneous entities

NER acts as a foundational layer for many downstream NLP systems including:

Question answering
Search systems
Chatbots
Document intelligence
Machine translation
Knowledge graph generation

For Tamil — particularly Sri Lankan Tamil — publicly available annotated corpora remain extremely limited. Existing multilingual datasets often underrepresent regional linguistic variations, local named entities, and culturally contextual terminology.

Most existing NER systems for Tamil are trained on datasets originating from Indian Tamil corpora, leaving significant gaps in handling:

Sri Lankan Tamil vocabulary
Local organization names
Sri Lankan place names
Government and institutional terminology

Our dataset aims to bridge this gap.

About the Dataset

Srilankan-Tamil-NER Dataset

The primary goal of this dataset is to create a high-quality manually curated Named Entity Recognition corpus for Sri Lankan Tamil.

The dataset is structured to support fine-tuning transformer-based multilingual models such as:

IndicNER
mBERT
XLM-RoBERTa
MuRIL
IndicBERT

IndicNER itself was trained for multiple Indian languages including Tamil and has become a strong baseline for multilingual NER systems.

Dataset Preparation Pipeline

Creating a Tamil NER dataset involves significantly more than simple annotation.

The preparation workflow included multiple stages:

1. Data Collection

The raw Tamil text corpus was collected from the Noolaham corpus and other relevant publicly available Sri Lankan Tamil textual sources. Special attention was given to:

local linguistic relevance
entity diversity
sentence quality
contextual richness

The objective was to capture realistic Sri Lankan Tamil usage patterns rather than synthetic or translated text.

2. OCR and Text Normalization

Tamil NLP pipelines often begin with scanned or image-based documents.

As part of our broader Tamil document intelligence workflow, OCR-extracted Tamil text underwent:

Unicode normalization
punctuation cleaning
whitespace normalization
invalid character filtering
OCR noise reduction

OCR-related preprocessing becomes extremely important because Tamil script errors can propagate heavily into token classification systems.
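The normalization steps listed above can be sketched with the standard library alone. This is an illustrative cleaner, not the CTNLPR implementation; in particular, dropping every control and format character assumes that zero-width joiners carry no meaning in the corpus.

```python
import re
import unicodedata


def normalize_ocr_text(text: str) -> str:
    """Apply Unicode, invisible-character, whitespace, and punctuation cleanup."""
    # NFC composes split sequences, e.g. the two-part Tamil vowel sign O.
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\t", " ")
    # Drop control and format characters (categories C*), keeping newlines.
    text = "".join(
        ch for ch in text
        if ch == "\n" or not unicodedata.category(ch).startswith("C")
    )
    text = re.sub(r"[^\S\n]+", " ", text)        # collapse runs of spaces
    text = re.sub(r" +([.,;:!?])", r"\1", text)  # no space before punctuation
    return text.strip()


cleaned = normalize_ocr_text("யாழ்ப்பாணம்  \u200b நகரம் .")
```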

3. Named Entity Annotation

The dataset was manually annotated using BIO tagging format:

Tag Type	Description
B-PER	Beginning of person entity
I-PER	Inside person entity
B-LOC	Beginning of location entity
I-LOC	Inside location entity
B-ORG	Beginning of organization entity
I-ORG	Inside organization entity
O	Non-entity token

BIO tagging remains one of the most widely adopted standards for NER annotation pipelines.

Example:

Token	Label
இராமநாதன்	B-PER
யாழ்ப்பாணம்	B-LOC
பல்கலைக்கழகம்	B-ORG
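To show how BIO tags map back to entities, the sketch below merges a B- tag and the I- tags that follow it into a single span. The function name and the choice to drop an I- tag that has no opening B- tag are assumptions made for the example.

```python
def bio_to_entities(tokens, labels):
    """Merge BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None

    def flush():
        if current:
            entities.append((" ".join(current), current_type))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            current.append(token)
        else:  # "O", or an I- tag that does not continue the open entity
            flush()
            current, current_type = [], None
    flush()
    return entities


tokens = ["கந்தசாமி", "யாழ்ப்பாண", "போதனா", "வைத்தியசாலையில்", "பணியாற்றினார்"]
labels = ["B-PER", "B-ORG", "I-ORG", "I-ORG", "O"]
entities = bio_to_entities(tokens, labels)
```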

Challenges in Sri Lankan Tamil NER

Building a Tamil NER dataset introduced several language-specific challenges.

Morphological Complexity

Tamil is morphologically rich, where suffixes and grammatical inflections can alter token boundaries significantly.

This creates difficulties for:

tokenizer alignment
entity span detection
subword classification

OCR Noise

Tamil OCR systems still produce:

broken grapheme clusters
merged tokens
Unicode inconsistencies
punctuation corruption

NER systems trained on clean datasets often fail under noisy OCR conditions.

Limited Existing Corpora

Compared to English, Tamil lacks:

large benchmark corpora
standardized annotation frameworks
domain-diverse datasets
region-specific entity resources

Several studies have highlighted the low-resource limitations of Tamil and Sinhala NER ecosystems.

Dataset Design Considerations

During preparation, the following design principles were prioritized.

Entity Diversity

The corpus includes diverse entity categories relevant to Sri Lankan contexts.

Examples include:

government institutions
educational organizations
local geographic locations
personal names
cultural entities

Fine-Tuning Use Cases

This dataset can be used to fine-tune transformer architectures for:

Use Case	Description
Tamil NER	Entity extraction
OCR Post-processing	Structured information extraction
Search Systems	Semantic indexing
RAG Pipelines	Entity-aware retrieval
Tamil Chatbots	Context understanding
Government Document AI	Information extraction
Knowledge Graphs	Entity linking

Importance for Low-Resource NLP

The broader importance of this dataset extends beyond NER itself.

Low-resource language ecosystems require foundational datasets before advanced systems such as LLMs, multilingual agents, and reasoning pipelines can perform effectively.

Datasets like this contribute toward:

Sri Lankan Tamil digital preservation
multilingual AI inclusivity
regional NLP research
culturally aware AI systems

Recent multilingual NER research also highlights the importance of publicly available Tamil and Sinhala annotated corpora for improving low-resource NLP systems.

Future Directions

Planned future improvements include:

expanding entity categories
larger corpus coverage
nested entity support
OCR-noisy benchmark subsets
cross-domain annotations
multilingual Tamil-Sinhala-English alignment
relation extraction extensions

Conclusion

The Srilankan-Tamil-NER Dataset, developed under CTNLPR, represents an important step toward strengthening the Sri Lankan Tamil NLP ecosystem through high-quality entity annotation and linguistically relevant corpus preparation.

By focusing on realistic Sri Lankan Tamil usage, OCR-aware preprocessing, and transformer-ready annotation formats, the dataset aims to support future advancements in:

Tamil NLP
multilingual transformers
document intelligence
AI for low-resource languages

As multilingual AI systems continue evolving rapidly, foundational datasets such as this remain critical for improving language inclusivity and advancing Tamil-focused NLP research.

Building a Tamil Coreference Resolution System: Zero-Shot Contextual Span Modeling with MuRIL for Low-Resource NLP

shaseevan@gmail.com — Fri, 15 May 2026 15:36:33 +0000

Coreference Resolution (CR) is one of the most important tasks in Natural Language Processing (NLP). It focuses on identifying whether multiple expressions within a document refer to the same real-world entity.

Resolving these semantic references is critical for document-level understanding and directly impacts downstream NLP systems such as:

Information Extraction
Knowledge Graph Construction
Question Answering
Relation Extraction
Semantic Search
Conversational AI
Retrieval-Augmented Generation (RAG)

While English NLP already has mature coreference systems and libraries, Tamil remains a highly challenging low-resource language for discourse-level semantic modeling.

As part of ongoing Tamil NLP research at CTNLPR, we explored how modern multilingual coreference architectures can be adapted for low-resource Tamil language understanding.

Why Tamil Coreference Resolution Is Challenging

Most existing coreference systems are heavily optimized for English. Architectures such as:

spaCy-based pipelines
NeuralCoref
AllenNLP coreference models
Transformer-based antecedent ranking systems

rely on assumptions and linguistic patterns that do not transfer cleanly to Tamil.

Tamil introduces several structural and linguistic complexities:

Agglutinative morphology
Free word order
Pronoun dropping
Implicit subject references
Rich inflectional structures
Long-distance discourse dependencies
Noun-to-noun semantic references

These characteristics make direct adaptation of English-centric architectures highly unreliable.

Low-Resource Constraints

Another major challenge is the lack of Tamil-specific resources.

Currently:

No dedicated Tamil coreference libraries exist publicly
No production-ready Tamil CR models are available
Large annotated Tamil coreference corpora are unavailable
Most multilingual systems remain heavily English-biased

This makes supervised transformer training approaches difficult to scale for Tamil.

Limitations of Traditional Supervised Approaches

Many multilingual coreference systems rely on supervised transformer architectures combined with linguistic feature engineering.

These systems often incorporate features such as:

Entity type
Word category
Number agreement
Gender agreement
Antecedent indicators

Although effective for high-resource languages, these approaches depend heavily on:

Manually annotated datasets
Task-specific supervision
Expensive training pipelines
Large computational resources

For Tamil, the absence of benchmark-scale annotated corpora becomes a critical bottleneck.

A Different Direction: Zero-Shot Contextual Modeling

After extensive research and evaluation, we found that zero-shot multilingual contextual modeling offers a more scalable direction for Tamil coreference resolution.

Instead of relying on supervised antecedent ranking architectures, the CTNLPR approach focuses on:

Contextual span modeling
Semantic similarity learning
Unsupervised clustering
Multilingual transformer representations

This reduces dependency on large annotated datasets while still leveraging strong contextual semantic understanding.

System Architecture

The architecture currently being explored at CTNLPR consists of:

MuRIL-based contextual embeddings
Span-based mention detection
Contextual span representations
Cosine similarity-based semantic linking
Agglomerative clustering

MuRIL-Based Contextual Encoding

The system uses MuRIL as the contextual encoder.

MuRIL is particularly suitable because it was pretrained specifically for Indian languages and multilingual semantic understanding.

The encoder generates contextual token embeddings that capture:

Semantic context
Syntactic dependencies
Cross-sentence relationships
Multilingual language patterns

This provides a stronger foundation for Tamil discourse modeling than English-centric multilingual models.

Span Candidate Generation

Instead of manually defining antecedents, the system automatically generates candidate spans from Tamil text.

These spans may represent:

Pronouns
Named entities
Noun phrases
Semantic references

This span-based strategy allows the architecture to remain flexible and language-independent.
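
As a rough illustration of span candidate generation, the sketch below enumerates every contiguous token span up to a maximum width. The function name `enumerate_spans`, the `max_width` parameter, and whitespace tokenization are assumptions made for this example; the source does not specify how the CTNLPR prototype generates or limits its spans.

```python
from typing import List, Tuple


def enumerate_spans(tokens: List[str], max_width: int = 3) -> List[Tuple[int, int]]:
    """Return every (start, end) token span, end exclusive, up to max_width tokens."""
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_width, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


# Whitespace tokenization is an assumption; a real system would use a Tamil tokenizer.
tokens = "சாம்.ஏ.சபாபதி யாழ்ப்பாணத்தில் வாழ்ந்தார்".split()
spans = enumerate_spans(tokens, max_width=2)
for start, end in spans:
    print(start, end, " ".join(tokens[start:end]))
```

The number of candidates grows with both document length and `max_width`, which is why a pruning step is usually needed before similarity computation.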

Contextual Span Representation

Once candidate spans are generated, contextual span embeddings are constructed using token-level representations from MuRIL.

The system applies:

Mean pooling over token embeddings
Span-level semantic encoding
Context-aware representation construction

This enables semantically similar mentions to occupy nearby regions within the embedding space.
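
Mean pooling itself is simple to sketch. The example below averages token vectors over a span using plain Python lists; the two-dimensional toy vectors stand in for MuRIL token embeddings, and the name `mean_pool` is hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector], start: int, end: int) -> Vector:
    """Average the token vectors in [start, end) into a single span vector."""
    if not 0 <= start < end <= len(token_vectors):
        raise ValueError("span must be non-empty and inside the sequence")
    span = token_vectors[start:end]
    dim = len(span[0])
    return [sum(vec[i] for vec in span) / len(span) for i in range(dim)]


# Toy 2-dimensional vectors standing in for contextual token embeddings.
toy_tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
span_vector = mean_pool(toy_tokens, 0, 2)
print(span_vector)
```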

Semantic Similarity and Clustering

After span representation construction, the system performs:

Heuristic span pruning
Cosine similarity computation
Similarity-driven clustering

Agglomerative clustering is then used to group semantically related mentions into discourse-level entity chains.

This removes the need for explicit antecedent ranking while still enabling coherent coreference grouping.
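
The sketch below shows one way cosine similarity and agglomerative clustering can fit together. It uses average linkage and a fixed similarity threshold; both choices, along with the names `cosine` and `agglomerate` and the toy vectors, are assumptions for illustration, since the source does not state the linkage criterion or stopping rule used.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def agglomerate(vectors: List[Vector], threshold: float = 0.8) -> List[List[int]]:
    """Average-linkage agglomerative clustering on cosine similarity.

    Repeatedly merges the two most similar clusters and stops when the best
    remaining pair falls below the threshold.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = linkage(clusters[a], clusters[b])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break
        a, b = best_pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy span vectors: mentions 0 and 1 point one way, mentions 2 and 3 another.
span_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chains = agglomerate(span_vectors, threshold=0.8)
print(chains)
```

Each resulting list of indices plays the role of one entity chain. The threshold controls how aggressively mentions are merged, so it would need tuning on real Tamil text.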

Why This Approach Works Well for Tamil

This direction is particularly effective for Tamil because it avoids dependency on large manually annotated datasets while leveraging multilingual contextual knowledge learned during transformer pretraining.

Key advantages include:

Zero-shot learning capability
Scalability for low-resource environments
Reduced annotation dependency
Contextual semantic understanding
Better adaptation to Tamil morphology
Lightweight unsupervised inference

This makes the architecture more practical for real-world Tamil NLP deployment scenarios.

Key Insight

One of the most important findings from this research is:

Coreference resolution for Tamil is not primarily a rule-engineering problem — it is a contextual semantic modeling problem.

Strong discourse understanding requires:

Context-aware embeddings
Span-level semantic reasoning
Cross-sentence entity linking
Language-aware discourse modeling

Without these components, downstream systems struggle to maintain semantic consistency across documents.

System Implementation at CTNLPR

At CTNLPR, we are actively exploring and prototyping this zero-shot multilingual coreference architecture as part of the ongoing Noolaham GPT initiative.

Our work currently includes:

Researching multilingual CR architectures for Tamil
Evaluating MuRIL-based contextual encoding
Designing span-based semantic linking pipelines
Exploring unsupervised clustering strategies
Building scalable low-resource discourse understanding systems

The long-term objective is to develop production-grade Tamil discourse understanding infrastructure for multilingual AI systems.

Conclusion

Building effective Tamil coreference resolution systems requires significantly more than adapting English NLP pipelines.

By combining:

Multilingual contextual transformers
Span-based semantic representations
Similarity-driven clustering
Zero-shot multilingual inference

we are developing a scalable architecture for document-level semantic understanding in Tamil.

This work represents an important step toward building modern, context-aware Tamil NLP systems for real-world low-resource AI applications.

Building a Morphology-Aware Tamil NER Pipeline: From Transformer Extraction to Canonical Entity Resolution

shaseevan@gmail.com — Fri, 15 May 2026 12:34:43 +0000

Named Entity Recognition (NER) is a foundational layer in modern NLP systems. It directly impacts downstream applications such as search, indexing, knowledge graph construction, entity linking, and Retrieval-Augmented Generation (RAG).

However, for Tamil and other morphologically rich languages, entity extraction alone is insufficient. The larger challenge lies in canonicalizing entity variants into stable root forms that can be consistently indexed and linked across documents.

As part of the ongoing Tamil NLP and Noolaham GPT initiatives at CTNLPR, we designed and implemented a morphology-aware Tamil NER pipeline that combines transformer-based extraction with morphological normalization and canonical entity resolution.

Why Tamil NER Is Challenging

Tamil is a highly agglutinative language. A single entity can appear in multiple inflected surface forms depending on grammatical suffixes, case markers, and contextual usage.

For example:

இலங்கையில்
இலங்கையிலும்
இலங்கையிலே

Although all refer to the same canonical entity (“இலங்கை”), transformer-based NER systems typically extract them as separate entities.

This creates several downstream problems:

Entity duplication
Inconsistent indexing
Fragmented knowledge graphs
Reduced retrieval quality in RAG systems

As a result, canonicalization becomes a critical requirement in Tamil NLP pipelines.

System Architecture

At CTNLPR, we designed a Tamil-aware NER architecture consisting of:

IndicNER → transformer-based entity extraction
Custom span merging → IOB tag consolidation
Prefix-based grouping → heuristic variant clustering
Morphological normalization layer → canonical entity resolution

The system extends standard transformer-based NER pipelines with language-aware normalization strategies specifically designed for Tamil morphology.

Transformer-Based Entity Extraction

We used IndicNER as the baseline entity extraction model.

Advantages:

Strong multilingual transformer backbone
High recall across entity categories
Effective extraction on noisy multilingual corpora

However, a major limitation became immediately apparent:

Transformer NER models do not enforce canonical forms.

As a result, multiple inflected variants of the same entity were treated as separate entities.

Method Exploration and Evaluation

We evaluated several normalization approaches to resolve this issue.

Prefix-Based Grouping

A lightweight heuristic approach based on shared token prefixes.

Advantages:

Fast execution
Simple clustering mechanism

Limitations:

Not linguistically grounded
Unstable under complex suffix chains

Although useful as an intermediate clustering layer, it lacked sufficient linguistic robustness.
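
A minimal sketch of prefix-based grouping follows. It keys each mention on its first few Unicode code points; the function name `group_by_prefix` and the default prefix length of 5 are assumptions chosen for this example, not values from the CTNLPR pipeline.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_prefix(mentions: List[str], prefix_len: int = 5) -> Dict[str, List[str]]:
    """Group mentions that share their first prefix_len code points."""
    groups = defaultdict(list)
    for mention in mentions:
        # Slicing counts code points, not Tamil grapheme clusters, so the key
        # can cut through a consonant + vowel-sign combination.
        groups[mention[:prefix_len]].append(mention)
    return dict(groups)


mentions = ["இலங்கையில்", "இலங்கையிலும்", "இலங்கையிலே", "இந்தியாவில்"]
groups = group_by_prefix(mentions)
for key, members in groups.items():
    print(key, members)
```

The fixed prefix length is exactly what makes the heuristic fragile: too short and unrelated entities merge, too long and variants of short names split apart.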

UoM Thamizhi Morphological Normalizer

We evaluated the morphological normalization approach developed by the University of Moratuwa.

Advantages:

Linguistically motivated
Rule-based morphological normalization

However, several practical limitations emerged:

Reduced robustness on noisy OCR text
Difficulty handling complex suffix combinations
Poor generalization to unseen word forms

While theoretically sound, the system struggled under real-world document conditions.

Tamil Lemmatizer (Final Approach)

The strongest results were achieved using a Tamil lemmatization-based normalization layer.

Advantages:

Consistent root-form extraction
Better handling of inflected variants
Improved robustness across document types
Strong empirical performance under noisy conditions

This became the final canonicalization layer integrated into our production pipeline.

Canonical Entity Resolution

The final architecture performs many-to-one normalization:

Surface Form	Canonical Entity
இலங்கையில்	இலங்கை
இலங்கையிலும்	இலங்கை
இந்தியாவில்	இந்தியா

This canonical mapping layer significantly improved consistency across downstream NLP systems.
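
The effect of many-to-one normalization on entity counts can be sketched as follows. The `CANONICAL` dictionary is a hypothetical stand-in for the output of the lemmatization layer, filled with the rows from the table above; a real system would call the lemmatizer rather than a fixed lookup.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical lookup standing in for the lemmatization layer's output.
CANONICAL: Dict[str, str] = {
    "இலங்கையில்": "இலங்கை",
    "இலங்கையிலும்": "இலங்கை",
    "இந்தியாவில்": "இந்தியா",
}


def canonicalize(surface: str) -> str:
    """Map a surface form to its canonical entity; unknown forms pass through."""
    return CANONICAL.get(surface, surface)


def count_entities(mentions: Iterable[str]) -> Counter:
    """Count mentions per canonical entity instead of per surface form."""
    return Counter(canonicalize(m) for m in mentions)


extracted = ["இலங்கையில்", "இலங்கையிலும்", "இந்தியாவில்", "இலங்கை"]
counts = count_entities(extracted)
print(counts)
```

Four distinct surface forms collapse into two index entries, which is the consistency gain the table describes.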

Pipeline Design

The final production pipeline operates as follows:

Document ingestion and chunking
Transformer inference using IndicNER
IOB span merging and filtering
Prefix-based variant aggregation
Morphological normalization (lemmatization layer)
Canonical entity re-indexing

This hybrid architecture combines transformer-based recall with morphology-aware normalization.
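
The IOB span merging step can be sketched as below. The tag names (`B-ORG`, `I-ORG`, `B-LOC`, `O`) and the example sentence are illustrative assumptions, and the lenient handling of a stray `I-` tag is one possible policy rather than the pipeline's documented behavior.

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags (B-X, I-X, O) into (entity_text, label) spans."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            # An I- tag whose label does not continue the open span is
            # treated leniently as the start of a new entity.
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # "O" closes any open span
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


tokens = ["யாழ்ப்பாணப்", "பல்கலைக்கழகம்", "இலங்கையில்", "உள்ளது"]
tags = ["B-ORG", "I-ORG", "B-LOC", "O"]
entities = merge_iob(tokens, tags)
print(entities)
```

The merged spans still carry inflected surface forms, which is why the normalization steps follow this stage.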

System-Level Challenges

Several engineering challenges emerged during implementation:

Agglutinative Morphology

Tamil suffix chains can become extremely complex, making direct canonicalization difficult without normalization.

Variant Explosion

A single entity appearing across multiple grammatical forms produced fragmented entity representations across documents.

OCR and Noisy Input

Historical and scanned documents introduced spelling inconsistencies and OCR artifacts, reducing normalization stability.

Cross-Document Consistency

Canonical entity representation had to remain stable across multi-document pipelines to support reliable indexing and retrieval.

Observations and Results

Across experiments, several patterns became clear:

Transformer NER → high recall, weak canonical consistency
Prefix grouping → useful heuristic but linguistically limited
UoM normalizer → theoretically strong but operationally fragile
Lemmatization → strongest practical normalization performance

The final production system combined IndicNER with morphological lemmatization to achieve robust multilingual entity extraction and canonicalization.

Key Insight

One of the most important findings from this work was:

In Tamil NER, the primary challenge is not entity detection — it is morphological normalization.

NER output should not be treated as the final entity representation.

Canonicalization is essential for:

Indexing
Entity linking
Knowledge graph construction
Retrieval-Augmented Generation (RAG)
Cross-document semantic consistency

Without normalization, downstream systems become fragmented and unreliable.

System Implementation at CTNLPR

At CTNLPR, this morphology-aware NER architecture was integrated into our broader Tamil NLP infrastructure as part of the ongoing Noolaham GPT initiative.

Our work included:

Evaluating transformer-based Tamil NER models
Designing canonical entity resolution pipelines
Benchmarking normalization strategies
Integrating morphology-aware lemmatization layers
Improving cross-document entity consistency
Optimizing downstream indexing and retrieval performance

The final system now serves as a foundational entity extraction layer for multilingual search, knowledge graphs, and RAG systems being developed at CTNLPR.

Outcome

The final production-ready Tamil NER system:

Resolves inflected entity variants
Produces stable canonical entity forms
Improves downstream retrieval quality
Supports scalable multi-document processing
Enables consistent entity-centric indexing and analytics

Most importantly, the system demonstrates that reliable Tamil NER requires both transformer-based extraction and morphology-aware canonicalization.

Conclusion

Building effective NER systems for Tamil requires significantly more than transformer-based entity extraction.

By combining:

Transformer NER models
IOB span consolidation
Morphological normalization
Canonical entity resolution

we developed a scalable morphology-aware Tamil NER architecture optimized for multilingual NLP systems and large-scale document processing.

Building a Tamil Keyword Extraction Pipeline: From Statistical Methods to Embedding-Based Semantic Extraction

shaseevan@gmail.com — Fri, 15 May 2026 11:05:36 +0000

Keyword extraction is a foundational component in modern NLP systems. It directly impacts search quality, indexing efficiency, document summarization, topic discovery, and Retrieval-Augmented Generation (RAG) pipelines.

However, for low-resource languages such as Tamil, keyword extraction introduces several challenges that extend beyond simply applying existing algorithms. In practice, achieving reliable extraction required careful system-level engineering involving preprocessing, tokenization, normalization, and multilingual semantic modeling.

As part of the ongoing Noolaham GPT initiative at CTNLPR, we designed and implemented a Tamil-aware keyword extraction pipeline that combines statistical methods, graph-based ranking, and embedding-based semantic extraction.

Why Keyword Extraction Is Challenging for Tamil

Many existing keyword extraction libraries are primarily designed for English and other high-resource languages. Tamil introduces additional complexities:

Agglutinative morphology
Spelling and orthographic variations
Limited language-specific NLP tooling
Inconsistent tokenization behavior
Lack of native Tamil stopword integration

As a result, directly applying standard extraction libraries often produces noisy or semantically weak keywords.

System Architecture

Our pipeline integrates multiple extraction strategies:

scikit-learn → TF-IDF-based statistical extraction
Gensim → TextRank graph-based ranking
KeyBERT → embedding-based semantic extraction

To support Tamil processing, we integrated Indic NLP preprocessing techniques, including:

Tokenization
Text normalization
Stopword filtering
Diacritic handling

This preprocessing layer was critical for stabilizing downstream extraction quality.

Method Evaluation

We evaluated multiple extraction approaches to understand their strengths and limitations under Tamil NLP conditions.

TF-IDF

TF-IDF provided a strong statistical baseline for identifying globally important terms across corpora.

Advantages:

Lightweight
Fast computation
Effective for corpus-level frequency analysis

Limitations:

No semantic understanding
Frequency-driven ranking only
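
For reference, the textbook TF-IDF computation can be sketched in a few lines. This version uses raw term frequency and an unsmoothed `log(N / df)` inverse document frequency; scikit-learn's implementation applies its own smoothing and normalization, so its numbers will differ. The toy documents are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Score each term in each tokenized document with tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [["தமிழ்", "மொழி", "தமிழ்"], ["ஆங்கிலம்", "மொழி"]]
scores = tf_idf(docs)
print(scores[0])
```

A term that appears in every document scores zero, which shows why the ranking is purely frequency-driven and carries no semantic signal.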

TextRank

TextRank adapts the PageRank algorithm to textual data by constructing graph relationships between words.

Advantages:

Unsupervised
No training required

Limitations:

Highly sensitive to tokenization quality
Graph instability under noisy Tamil segmentation
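
The sensitivity to tokenization is easier to see in a sketch. The example below builds a word co-occurrence graph and runs a simplified, unnormalized PageRank iteration over it. It is an illustrative reimplementation, not Gensim's code; the `window`, `damping`, and `iterations` defaults are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def textrank(tokens: List[str], window: int = 2, damping: float = 0.85,
             iterations: int = 30) -> Dict[str, float]:
    """Rank words by PageRank over an undirected co-occurrence graph.

    Each word is linked to the next (window - 1) tokens that follow it.
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores


tokens = ["தமிழ்", "மொழி", "தமிழ்", "இலக்கியம்", "தமிழ்", "வரலாறு"]
ranks = textrank(tokens)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Because every edge comes directly from token adjacency, a segmenter that splits or merges words differently produces a different graph and therefore different rankings.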

YAKE

YAKE performed well as a lightweight per-document keyword extraction baseline.

Advantages:

Fast inference
No external models required
Reliable for smaller documents

However, semantic understanding remained limited.

Embedding-Based Extraction with KeyBERT

The strongest results were achieved using KeyBERT combined with Tamil-capable embedding models.

A critical realization during experimentation was: KeyBERT itself is not language-aware. Its effectiveness depends entirely on the embedding model.

Using default English embeddings produced poor semantic extraction quality for Tamil documents. Replacing them with Tamil and multilingual Indic embedding models dramatically improved performance.

Embedding Models Evaluated

We integrated several multilingual and Tamil-specific embedding models:

l3cube-pune/tamil-sentence-bert-nli
ai4bharat/indic-bert
paraphrase-multilingual-mpnet-base-v2

The extraction pipeline operates as follows:

Document embedding generation
Candidate n-gram extraction
Semantic similarity computation
Relevance-based ranking

This allowed semantically meaningful phrases to be identified even when they were not frequency-dominant.
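
The four steps above can be sketched independently of any particular library. In the example below, `embed` is a pluggable function; the toy bag-of-words embedding built by `make_toy_embed` is only a stand-in so the code runs without a model, and it carries no real semantics. In practice a Tamil-capable sentence embedding model would take its place. All function names here are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ngrams(tokens: List[str], max_n: int = 2) -> List[str]:
    """Unique candidate phrases of 1..max_n tokens, in sorted order."""
    return sorted({
        " ".join(tokens[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    })


def rank_keywords(text: str, embed: Callable[[str], Vector],
                  top_k: int = 3, max_n: int = 2) -> List[Tuple[str, float]]:
    """Rank candidate n-grams by cosine similarity to the document embedding."""
    doc_vec = embed(text)
    scored = [(cand, cosine(embed(cand), doc_vec)) for cand in ngrams(text.split(), max_n)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def make_toy_embed(vocabulary: List[str]) -> Callable[[str], Vector]:
    """Toy bag-of-words embedding; a placeholder for a real embedding model."""
    def embed(text: str) -> Vector:
        words = text.split()
        return [float(words.count(w)) for w in vocabulary]
    return embed


text = "தமிழ் தமிழ் மொழி"
embed = make_toy_embed(sorted(set(text.split())))
ranked = rank_keywords(text, embed)
print(ranked)
```

Since ranking depends only on the vectors returned by `embed`, swapping the embedding model changes the results without touching the extraction logic, which matches the observation that the embedding model dominates quality.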

System-Level Challenges

Several engineering challenges emerged during implementation:

Tamil Stopword Handling

Tamil stopwords are not automatically supported in most libraries. We implemented custom stopword filtering layers to improve extraction precision.

Text Normalization

Tamil text frequently contains:

Orthographic variations
Diacritic inconsistencies
OCR-induced noise

Normalization significantly improved embedding consistency and keyword quality.

Tokenization Stability

Graph-based and embedding-based extraction methods were highly sensitive to tokenization behavior. Consistent segmentation became essential for reliable extraction.

Observations and Results

Across experiments, several patterns became clear:

TF-IDF → effective for corpus-level topic signals
YAKE → strong lightweight baseline
TextRank → unstable under noisy tokenization
KeyBERT + Tamil embeddings → strongest semantic keyword quality

The final embedding-based pipeline consistently produced more contextually meaningful keywords compared to purely statistical methods.

Key Insight

One of the most important findings from this work was:

In Tamil NLP, keyword extraction is not constrained primarily by the extraction algorithm itself.

The dominant factors are:

Embedding quality
Tokenization consistency
Text normalization quality

Without solving these foundational issues, even advanced extraction algorithms produce weak results.

System Implementation at CTNLPR

At CTNLPR, this keyword extraction pipeline was integrated into the broader Noolaham GPT ecosystem as part of our multilingual NLP infrastructure.

Our work included:

Evaluating extraction algorithms for Tamil corpora
Integrating Tamil-aware preprocessing pipelines
Benchmarking embedding models for semantic extraction
Building embedding-based keyword ranking systems
Optimizing extraction quality for downstream RAG and search pipelines

The final system now supports semantically meaningful keyword extraction across multilingual document collections.

Outcome

The final production pipeline:

Produces semantically relevant Tamil keywords
Handles multilingual document collections
Integrates into downstream search and RAG systems
Supports scalable NLP workflows for low-resource languages

Most importantly, the work demonstrates that high-quality keyword extraction for Tamil is achievable when embedding quality and language-specific preprocessing are treated as first-class system components.

Conclusion

Keyword extraction for low-resource languages requires significantly more than applying standard NLP libraries out of the box.

By combining:

Statistical extraction methods
Graph-based ranking
Embedding-driven semantic extraction
Tamil-aware preprocessing pipelines

we developed a scalable keyword extraction system optimized for Tamil NLP applications and multilingual retrieval systems.

Designing an Efficient Reranking Layer: Multilingual Cross-Encoder Optimization for Tamil–English RAG

shaseevan@gmail.com — Fri, 15 May 2026 10:58:54 +0000

Retrieval-Augmented Generation (RAG) systems are fundamentally dependent on retrieval quality. In multilingual environments, dense retrieval can surface relevant passages, but retrieval alone is often insufficient for producing reliable downstream responses.

In practice, not all retrieved chunks are equally relevant. Passing large sets of loosely related candidates directly into the LLM increases token usage, introduces noisy context, and degrades overall response quality.

For Tamil–English multilingual systems, this problem becomes significantly more pronounced due to cross-lingual ranking inconsistencies.

The Ranking Problem in Multilingual RAG

Dense retrievers are generally effective at identifying semantically related documents. However, they frequently struggle with ranking precision, especially for mixed-language queries involving Tamil and English.

This creates several downstream issues:

Relevant documents ranked lower than less relevant passages
Cross-lingual ranking inconsistencies
Reduced answer quality during generation
Increased context noise inside the LLM

As a result, retrieval quality cannot be measured solely by recall. Precise ranking becomes equally critical.

Core Approach

At CTNLPR, we designed and integrated a cross-encoder reranking layer into our Tamil–English RAG pipeline to refine multilingual retrieval results before generation.

Unlike traditional bi-encoder retrieval systems, cross-encoders jointly process the query and candidate document during inference. This allows the reranker to capture fine-grained semantic relationships between the query and retrieved passages.

The reranking layer provides several advantages:

Joint query–document semantic encoding
Improved ranking precision
Better multilingual alignment
More stable cross-lingual relevance scoring

This enables accurate ordering of multilingual retrieval candidates prior to LLM inference.

Model Evaluation

We evaluated multiple multilingual reranking models under CPU-constrained deployment environments:

BGE-v2-m3 → strong ranking quality, higher CPU latency
jina-v3-multi → strong cross-lingual consistency
jina-v2-cpu-opt → best latency–quality balance
gte-multilingual → stable multilingual performance

During evaluation, several limitations became apparent when reranking was absent:

Correct documents retrieved but incorrectly ordered
Ranking instability for mixed-language queries
Rank fusion methods (e.g., RRF) introducing cross-script lexical noise

These observations highlighted that multilingual retrieval requires ranking-aware optimization, not retrieval alone.

Two-Stage Retrieval Architecture

The final system uses a two-stage retrieval pipeline:

Dense retrieval generates Top-K candidate passages
Cross-encoder reranker evaluates semantic relevance
Candidate passages are rescored and reordered
Top-ranked chunks are passed to the LLM

This architecture improves retrieval precision while controlling downstream context size.
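
The control flow of this two-stage design can be sketched with pluggable scoring functions. The two scorers below (`term_overlap` and `jaccard`) are deliberately crude stand-ins for a dense retriever and a cross-encoder so the example runs without models; the function names and the `top_k` / `top_n` parameters are assumptions.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def two_stage_retrieve(query: str, passages: List[str], retrieve_score: Scorer,
                       rerank_score: Scorer, top_k: int = 20, top_n: int = 5) -> List[str]:
    """Stage 1 keeps the top_k passages; stage 2 rescores and returns the top_n."""
    candidates = sorted(passages, key=lambda p: retrieve_score(query, p), reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)
    return reranked[:top_n]


def term_overlap(query: str, passage: str) -> float:
    """Stand-in for dense retrieval: counts passage tokens that occur in the query."""
    query_terms = set(query.split())
    return float(sum(1 for word in passage.split() if word in query_terms))


def jaccard(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: Jaccard overlap of the two token sets."""
    q, p = set(query.split()), set(passage.split())
    return len(q & p) / len(q | p)


passages = ["tamil tamil tamil news", "tamil grammar", "cooking recipes"]
first_stage = sorted(passages, key=lambda p: term_overlap("tamil grammar", p), reverse=True)
final = two_stage_retrieve("tamil grammar", passages, term_overlap, jaccard, top_k=2, top_n=1)
print(first_stage[0], "->", final[0])
```

In this toy run the first stage ranks the repetitive passage highest, and the second stage promotes the passage that actually matches the query, which is the reordering effect the architecture relies on.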

CPU Optimization Strategy

Cross-encoder rerankers are computationally expensive, especially in CPU-only environments. Since our deployment environment prioritizes CPU inference efficiency, optimization became a central engineering challenge.

The primary objective was to maximize ranking quality while minimizing inference latency.

Candidate Reduction (Highest Impact Optimization)

The largest performance improvement came from reducing reranker workload.

Instead of reranking large candidate sets:

Top-K was reduced aggressively (e.g., 100 → 20)
Forward-pass count was minimized
Overall latency dropped significantly

This reduced unnecessary reranker computation while preserving ranking quality.

ONNX Runtime + INT8 Quantization

To improve CPU inference efficiency, we optimized reranker deployment using:

PyTorch → ONNX conversion
INT8 dynamic quantization

This provided several benefits:

Faster inference latency
Reduced memory consumption
Better CPU throughput
Minimal degradation in ranking quality

The optimized runtime significantly improved practical deployment feasibility for multilingual reranking.

Token and Runtime Optimization

Additional optimizations focused on attention complexity and runtime efficiency:

Reduced maximum token length (512 → 256)
Optimized CPU threading (OMP / MKL)
Efficient tokenization and batching strategies

Since transformer self-attention scales quadratically with sequence length (O(n²)), halving the maximum token length from 512 to 256 cuts the attention computation per layer by roughly a factor of four, which had a major impact on latency.

Performance Signals

Across internal evaluations, the optimized reranking pipeline achieved:

Latency reduction from seconds → sub-second inference
Strong multilingual ranking quality (MRR / nDCG maintained)
Stable Tamil–English ranking consistency
Improved downstream LLM response quality

The final architecture achieved production-ready CPU inference performance while maintaining strong ranking precision.
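
For readers unfamiliar with the two metrics, the sketch below computes MRR and nDCG from graded relevance labels listed in ranked order. It uses the common `rel / log2(rank + 1)` gain and derives the ideal ordering from the labels of the returned list only; both are assumptions, since the source does not state which variant was used in the internal evaluations. The sample labels are invented for illustration.

```python
import math
from typing import List


def reciprocal_rank(relevances: List[float]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mrr(runs: List[List[float]]) -> float:
    """Mean reciprocal rank over several queries."""
    return sum(reciprocal_rank(run) for run in runs) / len(runs)


def dcg(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top k positions."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg(relevances: List[float], k: int) -> float:
    """DCG divided by the DCG of the same labels in ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0


# Relevance labels per query, in the order the system ranked the passages.
runs = [[0, 1, 0], [1, 0, 0]]
print(mrr(runs), ndcg([0, 1], k=2))
```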

What Didn’t Work

Several approaches produced unstable results in multilingual environments:

Similarity-threshold filtering → inconsistent across scripts
Reciprocal Rank Fusion (RRF) → introduced lexical noise between Tamil and English

These methods lacked sufficient semantic robustness for multilingual ranking tasks.

Key Insight

One of the most important findings from this work is: Multilingual RAG is not only a retrieval problem — it is also a ranking precision problem.

In practical systems:

Retrieval ensures coverage
Reranking ensures correctness

Without effective reranking, even strong retrievers can produce noisy or poorly ordered context for generation.

System Implementation at CTNLPR

At CTNLPR, this reranking architecture was integrated into our broader Tamil–English multilingual RAG pipeline as part of the ongoing Noolaham GPT initiative.

Our work included:

Evaluating multilingual cross-encoder rerankers
Benchmarking ranking quality using MRR and nDCG
Designing CPU-efficient reranking pipelines
Implementing ONNX-based optimized inference
Reducing latency for real-world multilingual deployments
Improving ranking stability across Tamil, English, and mixed-language queries

The final system now supports efficient, scalable, cross-lingual reranking for large multilingual document collections.

Conclusion

Reliable multilingual RAG systems require more than strong dense retrieval models. They require ranking-aware architectures capable of refining multilingual relevance before generation.

By combining:

Dense retrieval
Cross-encoder reranking
CPU optimization techniques
Quantized inference pipelines

we developed a scalable multilingual reranking system optimized for Tamil–English retrieval environments.

Document AI in Production: Cost, Language, and Infrastructure Constraints

shaseevan@gmail.com — Fri, 15 May 2026 10:45:58 +0000

Document extraction is often straightforward at demo scale. However, under real production conditions—large document volumes, multilingual corpora, and limited hardware resources—the problem becomes significantly more complex.

Modern Document AI systems must optimize not only for accuracy, but also for scalability, latency, infrastructure cost, and multilingual robustness.

Constraint 1: Cost (Operational Expenditure)

Cloud-based OCR and parsing systems such as Google Cloud Vision and LLM-driven document parsers dramatically simplify development and experimentation. However, these systems introduce a linear operational cost model:

Cost increases proportionally with document volume
Every inference depends on external API calls
Pipelines incur continuous recurring expenses

At small scale, these costs may appear manageable. At production scale—especially for large archival digitization projects involving millions of pages—the operational overhead becomes economically unsustainable.

As a result, real-world document processing systems cannot rely exclusively on API-driven architectures.

Constraint 2: Hardware-Aware Inference

When shifting toward open-source OCR and document understanding models, the bottleneck moves from API cost to infrastructure efficiency.

GPU-based deployments provide high throughput, but scaling GPU infrastructure introduces significant capital and operational expense. CPU-based deployments are considerably more affordable, but require extensive optimization in order to remain practical.

This introduces several engineering requirements:

Model optimization techniques (quantization, pruning)
Optimized inference runtimes such as ONNX Runtime and OpenVINO
Efficient pipeline orchestration and batching strategies

Under these constraints, the objective is no longer “maximum accuracy at any cost.” Instead, the challenge becomes:

Optimizing accuracy, latency, and throughput simultaneously under CPU constraints.

Document AI Is a Pipeline Problem

A common misconception is that Document AI can be solved using a single OCR model. In practice, production systems are composed of multiple tightly coupled stages:

Layout detection and reading-order analysis
Region segmentation (text, tables, images)
Table structure reconstruction
Text detection and recognition
Structural parsing and metadata extraction

Overall system performance depends not only on model quality, but also on orchestration efficiency between these stages. In many deployments, the slowest pipeline component determines total throughput.
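
The bottleneck effect can be made concrete with a small sketch. It assumes the stages run concurrently as a pipeline, so steady-state throughput equals the rate of the slowest stage. The stage names follow the list above, but the pages-per-second figures are purely hypothetical and are not measurements from the CTNLPR system.

```python
from typing import Dict, Tuple


def pipeline_throughput(stage_rates: Dict[str, float]) -> Tuple[str, float]:
    """Return the bottleneck stage and the resulting pages-per-second rate."""
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]


# Hypothetical per-stage rates in pages per second.
stage_rates = {
    "layout_detection": 12.0,
    "region_segmentation": 20.0,
    "table_reconstruction": 3.5,
    "text_recognition": 6.0,
    "structural_parsing": 40.0,
}
stage, rate = pipeline_throughput(stage_rates)
print(stage, rate)
```

Under this assumption, speeding up any stage other than the bottleneck leaves total throughput unchanged.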

Language Constraints in Low-Resource Settings

Extending Document AI systems beyond English introduces another layer of complexity, particularly for low-resource languages such as Tamil.

Several limitations become immediately apparent:

Limited OCR support for Tamil scripts
Weak handling of agglutinative morphology and complex ligatures
Lack of CPU-optimized multilingual models
Poor layout and table generalization across Indic documents

As a result, many pipelines that perform well on English documents degrade significantly in multilingual production environments.

Our Focus at CTNLPR (Noolaham Corpus)

At CTNLPR, through the Noolaham corpus initiative, we are building a CPU-first multilingual document intelligence pipeline focused on scalable, production-grade processing.

Our current focus includes:

Tamil + English multilingual support
Layout-aware extraction for text, tables, and images
CPU-optimized inference pipelines
Reduced dependency on external APIs
Scalable processing for large archival collections

The broader objective is not simply accurate OCR, but sustainable and efficient multilingual document understanding at scale.

Research Direction

The central challenge in production-grade Document AI is not accuracy alone.

The real optimization target is: accuracy × throughput × cost

Balancing these three dimensions is essential for building scalable systems for low-resource languages and large archival corpora.

Designing a Bilingual RAG System: Cross-Lingual Dense Retrieval for Tamil–English

shaseevan@gmail.com — Fri, 15 May 2026 10:22:51 +0000

Retrieval-Augmented Generation (RAG) systems are highly dependent on retrieval quality. In multilingual environments, the problem becomes significantly more complex because the retrieval layer must operate across languages while maintaining semantic consistency.

For Tamil–English systems, one of the key challenges is cross-lingual retrieval — enabling a query in Tamil to retrieve semantically relevant Tamil and English passages from a unified index (and vice versa), without relying on translation pipelines or language-specific partitioning.

Cross-Lingual Retrieval in Multilingual RAG

Traditional retrieval systems often assume that the query language and document language are identical. This assumption breaks down in multilingual knowledge systems where relevant information may exist across multiple languages.

To solve this, we rely on multilingual dense embedding models that project Tamil and English text into a shared semantic vector space. Within this space, semantically similar content across languages becomes geometrically aligned, enabling standard vector similarity search to work across languages.

This allows:

Tamil queries → Tamil + English retrieval
English queries → Tamil + English retrieval

without maintaining separate retrieval systems.

Core Architecture

The retrieval architecture is based on three principles:

Unified multilingual embeddings
Single shared vector index
Cross-lingual semantic similarity search

The system pipeline consists of:

Document chunking
Multilingual embedding generation
Unified vector indexing
Approximate nearest neighbor (ANN) retrieval
LLM-based response synthesis
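
The chunking stage is not specified further in this post. As a minimal sketch, assuming a simple policy of overlapping word windows (the function name chunk_words and the window sizes are hypothetical, not the CTNLPR configuration), the first pipeline stage could look like this. Whitespace splitting is used because both Tamil and English separate words with spaces.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows.

    Assumption: a fixed-size window with overlap. The source does not state
    which chunking policy the real system uses.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the document.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.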

Model Evaluation

We evaluated several multilingual embedding approaches:

Sentence Transformers (SBERT variants)
Indic-specific models (IndicBERT, MuRIL)

During experimentation, these models showed several limitations for Tamil–English retrieval:

Weak Tamil–English semantic alignment
Inconsistent similarity distributions across scripts
Reduced recall in mixed-language retrieval scenarios

These limitations significantly affected ranking stability and retrieval quality.
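
One way to make "inconsistent similarity distributions" concrete is to group similarity scores by language pair and compare their summary statistics. The sketch below is a hypothetical diagnostic, not the evaluation code used in this work; the function name similarity_stats and the "ta" / "en" language tags are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def similarity_stats(
    scored_pairs: Iterable[Tuple[str, str, float]],
) -> Dict[str, Tuple[float, float, int]]:
    """Summarize similarity scores per language pair.

    Each input item is (query_lang, doc_lang, similarity_score).
    Returns {"ta->en": (mean, population_std, count), ...}.
    """
    buckets: Dict[str, List[float]] = {}
    for query_lang, doc_lang, score in scored_pairs:
        buckets.setdefault(f"{query_lang}->{doc_lang}", []).append(score)
    return {pair: (mean(s), pstdev(s), len(s)) for pair, s in buckets.items()}
```

If the mean for relevant ta->en pairs sits far below the mean for relevant ta->ta pairs, a single similarity threshold or a single ranked list will systematically favor same-language results.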

Selected Embedding Model

After extensive evaluation, we selected:

→ intfloat/multilingual-e5-large

The model demonstrated significantly stronger cross-lingual retrieval behavior due to several architectural and training characteristics:

Built on XLM-RoBERTa-large
Large-scale multilingual pretraining
Trained with contrastive learning objectives (>1B training pairs)
Fine-tuned on retrieval datasets such as MS MARCO, Mr. TyDi, and MIRACL
Asymmetric query and passage embeddings, selected with the "query: " and "passage: " input prefixes

This resulted in improved semantic alignment between Tamil and English representations, especially for low-resource retrieval tasks.
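
The prefix convention is easy to get wrong, so a small sketch may help. The helper names below (as_query, as_passage, embed_query, embed_passages) are hypothetical; the encoder is passed in as a plain callable so the example does not depend on any particular library.

```python
from typing import Callable, List, Sequence

# Prefixes expected by the E5 model family for retrieval-style inputs.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def as_query(text: str) -> str:
    """Mark a user question (Tamil or English) as a query."""
    return QUERY_PREFIX + text.strip()


def as_passage(text: str) -> str:
    """Mark a document chunk (Tamil or English) as a passage."""
    return PASSAGE_PREFIX + text.strip()


def embed_passages(chunks: Sequence[str], encode: Encoder) -> Sequence[Sequence[float]]:
    """Embed document chunks; every chunk gets the passage prefix."""
    return encode([as_passage(chunk) for chunk in chunks])


def embed_query(query: str, encode: Encoder) -> Sequence[float]:
    """Embed one query; the query prefix is applied before encoding."""
    return encode([as_query(query)])[0]


# Assumed wiring with the sentence-transformers library (not run here):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("intfloat/multilingual-e5-large")
#   encode = lambda texts: model.encode(texts, normalize_embeddings=True)
```

The same prefixes apply regardless of language: a Tamil question and an English question are both queries, and chunks in either language are both passages.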

Unified Indexing Strategy

Instead of maintaining separate indices for each language, we implemented a single unified vector index.

Pipeline:

Chunk Tamil and English documents
Generate embeddings using the same multilingual encoder
Store all vectors in a shared ANN index

This design eliminates language-specific partitioning and simplifies retrieval orchestration.
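
A minimal in-memory sketch of the shared index follows. It assumes brute-force inner-product search over L2-normalized vectors as a stand-in for a real ANN index, and the class name UnifiedIndex is hypothetical. The point it illustrates is that language is stored only as metadata and never used to partition the vectors.

```python
import math
from typing import List, Sequence, Tuple


class UnifiedIndex:
    """Single vector store for Tamil and English chunks (illustrative only)."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._records: List[Tuple[str, str]] = []  # (lang, text)

    @staticmethod
    def _normalize(vector: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0.0:
            raise ValueError("zero vector cannot be normalized")
        return [x / norm for x in vector]

    def _check_dim(self, vector: Sequence[float]) -> None:
        if self._vectors and len(vector) != len(self._vectors[0]):
            raise ValueError("vector dimension does not match the index")

    def add(self, vector: Sequence[float], text: str, lang: str) -> None:
        """Store a chunk; lang ('ta' or 'en') is metadata, not a partition key."""
        self._check_dim(vector)
        self._vectors.append(self._normalize(vector))
        self._records.append((lang, text))

    def search(self, query_vector: Sequence[float], k: int = 5) -> List[Tuple[float, str, str]]:
        """Return up to k (score, lang, text) tuples, best first."""
        self._check_dim(query_vector)
        q = self._normalize(query_vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), lang, text)
            for v, (lang, text) in zip(self._vectors, self._records)
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]
```

Because every vector is normalized on insertion, the inner product computed in search equals cosine similarity. In a production setting the exhaustive scan would be replaced by an ANN structure.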

The final architecture supports:

Tamil → English retrieval
English → Tamil retrieval
Mixed-language retrieval

within the same semantic search space.

Retrieval Flow

The retrieval pipeline operates as follows:

Encode query (Tamil or English)
Perform ANN similarity search
Retrieve top-k semantically relevant chunks
Pass retrieved context to the LLM for generation

Similarity search is based on dense vector retrieval using cosine similarity.
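
The four steps above can be sketched end to end as follows. This is an illustration under stated assumptions: the exhaustive cosine scan stands in for ANN search, the prompt template is hypothetical, and encode_query is any callable that maps a query string to a vector.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def retrieve(
    query: str,
    encode_query: Callable[[str], Sequence[float]],
    corpus: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Encode the query, rank all (text, vector) pairs, return the top-k texts."""
    q = encode_query(query)
    ranked = sorted(corpus, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, contexts: Sequence[str]) -> str:
    """Assemble retrieved chunks into an LLM prompt (hypothetical template)."""
    context_block = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"
```

Note that nothing in retrieve inspects the language of the query or of the chunks; mixed Tamil and English results fall out of the shared vector space alone.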

Benchmark Signals (MRR / nDCG)

Across multilingual benchmarks and internal evaluations, multilingual-e5-large showed measurable improvements in retrieval quality over the other models we evaluated:

Higher MRR@10 → improved early precision
Higher nDCG@10 → better ranking quality
Improved Recall@10 → stronger multilingual coverage
More stable similarity distributions across Tamil and English scripts

These improvements were primarily driven by:

Large-scale contrastive training
Retrieval-specific fine-tuning
Better multilingual semantic alignment
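
For reference, the metrics named in this section can be computed as in the sketch below. It assumes binary relevance judgments and ranked lists without duplicate document IDs; the function names are illustrative and this is not the evaluation harness used for the results above.

```python
import math
from typing import Sequence, Set, Tuple


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1/rank of the first relevant document in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs: Sequence[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For cross-lingual evaluation, the relevant set for a Tamil query would include relevant English chunks as well as Tamil ones, so that Recall@10 reflects coverage across both languages.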

Key Insight

A critical realization from this work is that multilingual RAG is not fundamentally a database problem.

It is primarily an embedding alignment problem solved during representation learning.

When both Tamil and English occupy the same semantic vector space:

Cross-lingual retrieval becomes reliable
Translation layers become unnecessary
Retrieval architectures become significantly simpler

Outcome

The final system achieved:

Stronger cross-lingual ranking quality
Improved MRR/nDCG performance
Reduced architectural complexity
Unified multilingual retrieval
Better knowledge coverage across languages

Most importantly, the system demonstrates that multilingual retrieval for low-resource languages can be achieved effectively without maintaining separate retrieval pipelines or translation layers.

System Implementation at CTNLPR

At CTNLPR, as part of the ongoing Noolaham GPT initiative, we designed and implemented a bilingual Tamil–English RAG retrieval architecture focused on low-resource multilingual retrieval.

Our work included:

Evaluating multilingual embedding models for Tamil–English semantic alignment
Benchmarking retrieval quality using MRR, nDCG, and Recall metrics
Designing a unified vector indexing strategy without language partitioning
Implementing ANN-based cross-lingual dense retrieval pipelines
Optimizing retrieval consistency across Tamil, English, and mixed-language queries

The final system enables:

Tamil queries retrieving both Tamil and English knowledge
English queries retrieving both Tamil and English content
Shared semantic retrieval without explicit translation layers

This architecture now serves as a foundational retrieval layer for downstream Tamil NLP systems, including multilingual RAG, semantic search, and knowledge-centric applications being developed at CTNLPR.

Conclusion

Building multilingual RAG systems for Tamil requires more than multilingual datasets — it requires semantically aligned embeddings capable of supporting reliable cross-lingual retrieval.

By combining:

Multilingual dense encoders
Unified vector indexing
Cross-lingual semantic retrieval

we designed a scalable bilingual RAG architecture capable of retrieving Tamil and English knowledge from a shared semantic space.