Fine-Tuning IndicNER for Sri Lankan Tamil Named Entity Recognition

The rapid evolution of transformer-based multilingual NLP systems has significantly improved Named Entity Recognition (NER) performance across many high-resource languages. However, low-resource language variants such as Sri Lankan Tamil still face substantial challenges due to limited domain-specific datasets and linguistic underrepresentation in existing multilingual training corpora.

At CTNLPR (Center for Tamil Natural Language Processing Research), we fine-tuned the IndicNER model specifically for Sri Lankan Tamil using a custom annotated NER corpus.

The primary objective of this work was to improve entity recognition for Sri Lankan Tamil, covering its linguistic patterns, local entities, and the morphological variation of entity forms in context.

Why Sri Lankan Tamil NER is Challenging

Most multilingual NER systems are trained primarily on:

  • general web corpora
  • Indian Tamil datasets
  • multilingual benchmark datasets
  • formal textual sources

When applied to Sri Lankan Tamil, these systems often struggle due to regional linguistic differences and contextual entity variations.

Some of the key challenges include:

  • Sri Lankan regional names
  • local organization terminology
  • morphological suffix complexity
  • OCR-induced token inconsistencies
  • subword tokenization fragmentation
  • contextual ambiguity in entity boundaries

These limitations significantly affect downstream NLP applications such as:

  • semantic search
  • document intelligence
  • knowledge graph generation
  • Tamil chatbots
  • Retrieval-Augmented Generation (RAG)
  • government document processing

Model Fine-Tuning Overview

Base Model

The fine-tuning process was built on top of ai4bharat/IndicNER.

IndicNER is a multilingual transformer-based Named Entity Recognition model developed for Indic languages and serves as a strong baseline for Tamil NER tasks.
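
As a minimal sketch of how the base checkpoint could be loaded for token classification, the snippet below uses the Hugging Face transformers library. The use of that library and the helper name load_base_model are assumptions for illustration; the article does not describe the actual loading code.

```python
# Sketch only: load the base checkpoint with Hugging Face transformers.
# The helper name and the choice of library are assumptions, not the authors' code.

MODEL_ID = "ai4bharat/IndicNER"  # model id as named in the text


def load_base_model(model_id: str = MODEL_ID):
    """Return (tokenizer, model) for the given token-classification checkpoint.

    transformers is imported lazily, so defining this function does not
    require the library; calling it downloads the checkpoint.
    """
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    return tokenizer, model
```

Calling load_base_model() and inspecting model.config.id2label would show the tag set the checkpoint was trained with, which is worth checking before mapping it to a custom label scheme.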


Training Dataset

The model was fine-tuned using approximately:

Sri Lankan Tamil NER Samples

The dataset was manually curated and annotated under the CTNLPR Tamil NLP research pipeline.

The corpus included annotations for:

Entity Type | Description
PERSON | Human names
LOCATION | Geographic entities
ORGANIZATION | Institutional entities

The dataset preparation pipeline incorporated:

  • OCR-aware preprocessing
  • Unicode normalization
  • Tamil-safe tokenization
  • BIO tagging
  • subword label alignment
  • manual annotation validation
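
The BIO tagging step listed above can be sketched as follows. The annotation format (token-index spans with an exclusive end) and the function name spans_to_bio are hypothetical; the article does not specify how the CTNLPR annotations are stored. The sample sentence is assembled from words that appear elsewhere in this article.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_token, end_token_exclusive, entity_type)
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert token-level entity spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if not (0 <= start < end <= num_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = "யாழ்ப்பாணப் பல்கலைக்கழகம் மாணவர்களை சேர்த்தது".split()
tags = spans_to_bio(len(tokens), [(0, 2, "ORGANIZATION")])
for token, tag in zip(tokens, tags):
    print(tag, token)
```

Rejecting out-of-range and overlapping spans at this stage is one way to support the manual annotation validation step, since such errors are otherwise silent until training.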

Technical Challenges During Fine-Tuning

Fine-tuning multilingual transformer architectures for Tamil requires addressing several language-specific tokenization and alignment problems.

1. Tamil Tokenization Complexity

Tamil is morphologically rich, meaning that suffixes and grammatical structures often alter token boundaries.

Improper tokenization can lead to:

  • broken entity spans
  • incorrect BIO labels
  • fragmented entity predictions

To address this, the pipeline implemented:

  • Tamil-safe tokenization
  • script-preserving preprocessing
  • Unicode normalization
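
One possible reading of "Tamil-safe" handling is sketched below: normalize to NFC, then make sure any character-level splitting keeps vowel signs and the pulli attached to their base letter. The function names are illustrative, and the grouping rule is a simplification of full Unicode grapheme segmentation (it does not treat conjuncts such as க்ஷ specially); the article does not document the actual implementation.

```python
import unicodedata
from typing import List


def normalize_tamil(text: str) -> str:
    """Apply NFC so that two-part vowel signs are stored in composed form."""
    return unicodedata.normalize("NFC", text)


def grapheme_clusters(word: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Simplification: every code point in a Unicode Mark category (Mn, Mc, Me)
    is attached to the preceding cluster.
    """
    clusters: List[str] = []
    for ch in word:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = normalize_tamil("தமிழ்")
print(len(word), grapheme_clusters(word))
```

Here the word has five code points but three clusters, so splitting by code point would separate a vowel sign or pulli from its consonant.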

2. Subword Label Alignment

Transformer tokenizers frequently split Tamil words into multiple subword tokens.

Example: யாழ்ப்பாணப் பல்கலைக்கழகம்

may become multiple subword fragments internally. Without correct label alignment:

  • entity spans become corrupted
  • BIO labels mismatch
  • training instability increases

The training pipeline therefore included proper subword label propagation strategies to ensure accurate entity learning.
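
A minimal sketch of subword label propagation is shown below. It assumes word-level BIO labels and a word_ids list of the kind returned by a Hugging Face fast tokenizer (BatchEncoding.word_ids(), with None for special tokens). The subword split in the example is made up for illustration, and the article does not state which propagation strategy was actually used.

```python
from typing import Dict, List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[str],
    label2id: Dict[str, int],
    label_all_subwords: bool = True,
) -> List[int]:
    """Map word-level BIO labels onto subword tokens."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)              # special token
        elif wid != previous:
            aligned.append(label2id[word_labels[wid]])  # first subword of a word
        elif label_all_subwords:
            label = word_labels[wid]
            if label.startswith("B-"):
                label = "I-" + label[2:]              # continuation stays inside the entity
            aligned.append(label2id[label])
        else:
            aligned.append(IGNORE_INDEX)              # score only the first subword
        previous = wid
    return aligned


label2id = {"O": 0, "B-ORGANIZATION": 1, "I-ORGANIZATION": 2}
word_labels = ["B-ORGANIZATION", "I-ORGANIZATION"]  # யாழ்ப்பாணப் பல்கலைக்கழகம்
word_ids = [None, 0, 0, 0, 1, 1, None]               # hypothetical subword split

propagated = align_labels(word_ids, word_labels, label2id)
first_only = align_labels(word_ids, word_labels, label2id, label_all_subwords=False)
print(propagated)
print(first_only)
```

Propagating labels to every subword gives the model a training signal on each fragment of a long Tamil word, while the first-subword-only variant is the common alternative; which one works better is an empirical question.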


3. OCR Noise Handling

Tamil OCR systems still generate:

  • grapheme inconsistencies
  • merged tokens
  • invalid Unicode combinations
  • punctuation corruption

OCR-aware normalization and cleaning stages were integrated before fine-tuning to reduce noise propagation into the NER model.
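
The kind of cleaning stage described here might look like the sketch below. The specific rules (deleting zero-width characters, dropping combining marks that have no base letter, collapsing repeated punctuation) are assumptions chosen to match the noise types listed above, not the documented CTNLPR pipeline. Removing zero-width joiners in particular is a judgment call that may not suit every corpus.

```python
import re
import unicodedata

# Zero-width characters mapped to None are deleted by str.translate.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def drop_orphan_marks(text: str) -> str:
    """Remove combining marks that do not follow a letter or another mark."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        has_base = bool(out) and unicodedata.category(out[-1])[0] in ("L", "M")
        if is_mark and not has_base:
            continue  # e.g. a vowel sign at the start of a token
        out.append(ch)
    return "".join(out)


def clean_ocr_line(line: str) -> str:
    """Apply a small, ordered set of OCR clean-up rules to one line."""
    line = unicodedata.normalize("NFC", line)
    line = line.translate(ZERO_WIDTH)              # strip zero-width characters
    line = drop_orphan_marks(line)                 # invalid mark placement
    line = re.sub(r"([.,;:!?])\1+", r"\1", line)   # repeated punctuation
    line = re.sub(r"\s+", " ", line).strip()       # whitespace runs
    return line


noisy = " \u0bbfதமிழ்\u200b  .."
print(repr(clean_ocr_line(noisy)))
```

Zero-width characters are removed before the orphan-mark check so that a mark separated from its base letter only by an invisible character is rejoined rather than dropped.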


Fine-Tuning Configuration

Training Setup

Component | Value
Base Model | IndicNER
Language | Sri Lankan Tamil
Task | Token Classification
Labels | PERSON · LOCATION · ORGANIZATION
Architecture | Transformer-based NER
Annotation Format | BIO Tagging
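
With three entity types and BIO tagging, the label space has seven tags. The sketch below builds the label maps under that assumption. The exact tag strings (for example B-PERSON rather than an abbreviated form) are illustrative and may differ from those used in the actual training run.

```python
from typing import Dict, List, Tuple

ENTITY_TYPES = ["PERSON", "LOCATION", "ORGANIZATION"]


def build_label_maps(entity_types: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build label2id / id2label for a BIO scheme: O plus B-/I- per entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(ENTITY_TYPES)
print(len(label2id), label2id)
```

Maps of this shape are what token-classification models in the Hugging Face transformers library accept as the id2label and label2id configuration values.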

Training Improvements Implemented

Several improvements were introduced during training to better adapt the model for Sri Lankan Tamil.

Key Optimizations

Tamil-safe Tokenization

Preserved Tamil script integrity during preprocessing and tokenization.

Proper Subword Label Alignment

Ensured complete entity learning across fragmented transformer tokens.

Morphology-aware Training

Focused on handling Tamil suffix patterns and contextual entity boundaries.

OCR-aware Preprocessing

Reduced noisy OCR artifacts before training.


Model Evaluation Results

The fine-tuned model achieved the following performance metrics.

Overall Performance

Metric | Score
F1 Score | 0.650
Precision | 0.602
Recall | 0.707
Accuracy | 96.04%
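
As a consistency check, the reported F1 score can be recomputed from the reported precision (P) and recall (R), assuming F1 is the usual harmonic mean of the two:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.602 \times 0.707}{0.602 + 0.707}
    = \frac{0.8512}{1.309}
    \approx 0.650
```

The result matches the table. The accuracy figure is much higher than the F1 score, which is typical when accuracy is computed per token and most tokens carry no entity label; whether it was computed that way here is an assumption.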

Entity-wise Performance

Entity Type | F1 Score
PERSON | 0.721
LOCATION | 0.698
ORGANIZATION | 0.484

The PERSON (0.721) and LOCATION (0.698) categories achieved the highest F1 scores, while ORGANIZATION entities (0.484) remained significantly more challenging.


Inference Examples (Model Predictions)

Example 1

Input Sentence

“பாரதிதாசன் எழுதிய நூலை பாரதி பதிப்பகம் வெளியிட்டது.”

Model Output: PERSON → பாரதிதாசன், ORGANIZATION → பாரதி பதிப்பகம்


Example 2

Input Sentence

“வடமராட்சி தொழில்நுட்ப நிறுவனம் மாணவர்களை சேர்த்தது.”

Model Output: ORGANIZATION → வடமராட்சி தொழில்நுட்ப நிறுவனம்


Example 3

Input Sentence

“நவமணி கிராமம் வெள்ளத்தால் பாதிக்கப்பட்டது.”

Model Output: LOCATION → நவமணி


Example 4

Input Sentence

“ஜே.ஏ.எஸ்.பீ. ஜயசிங்க நவமணி கிராமத்திற்கு சென்றார்.”

Model Output: PERSON → ஜே.ஏ.எஸ்.பீ. ஜயசிங்க, LOCATION → நவமணி
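
Outputs in the form shown in these examples can be produced from per-token BIO predictions with a small decoding step. The sketch below uses the sentence from Example 1. The tag sequence is constructed by hand to match the reported output and is not taken from real model logs, and the function name bio_to_entities is illustrative.

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        nonlocal current_type, current_tokens
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()                                  # close any open entity
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)             # extend the open entity
        else:
            flush()  # "O", or a stray I- tag, which this sketch drops
    flush()
    return entities


tokens = ["பாரதிதாசன்", "எழுதிய", "நூலை", "பாரதி", "பதிப்பகம்", "வெளியிட்டது", "."]
tags = ["B-PERSON", "O", "O", "B-ORGANIZATION", "I-ORGANIZATION", "O", "O"]

entities = bio_to_entities(tokens, tags)
for entity_type, text in entities:
    print(f"{entity_type} → {text}")
```

Entity-level scores such as those in the evaluation tables are normally computed on spans decoded this way, so a single wrong boundary tag counts as a missed entity.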


Why Organization Entities Are Difficult

Organization entities in Sri Lankan Tamil often exhibit:

  • inconsistent naming patterns
  • long contextual spans
  • abbreviation variations
  • mixed-language terminology
  • domain-specific structures

Examples include:

  • universities
  • ministries
  • NGOs
  • educational institutions
  • government departments

This creates substantial variability that affects generalization.

Future improvements will therefore focus heavily on:

  • organization-specific corpus expansion
  • domain-balanced sampling
  • contextual augmentation
  • larger annotation coverage

Key Observations

One of the most important findings from this work is that:

Better preprocessing and domain-specific data can be as important as model architecture.

For low-resource languages like Sri Lankan Tamil:

  • high-quality annotations matter
  • OCR normalization matters
  • tokenizer alignment matters
  • linguistic preprocessing matters

Large transformer architectures alone are not sufficient without carefully prepared language-specific datasets.


Applications of the Fine-Tuned Model

The fine-tuned Sri Lankan Tamil IndicNER model can support:

Application | Use Case
Tamil NER | Entity extraction
Semantic Search | Context-aware retrieval
RAG Systems | Entity-aware retrieval pipelines
OCR Post-processing | Structured extraction
Knowledge Graphs | Entity linking
Tamil Chatbots | Context understanding

Future Work

The next phase of research under CTNLPR includes:

  • expanding organization entity annotations
  • improving contextual entity modeling
  • OCR-noisy benchmark evaluations
  • multilingual Tamil-Sinhala entity alignment
  • relation extraction
  • low-resource transformer optimization
  • Tamil document intelligence systems

Conclusion

The fine-tuning of IndicNER for Sri Lankan Tamil represents an important step toward building stronger NLP infrastructure for low-resource Tamil language technologies.

By combining:

  • domain-specific Tamil datasets
  • OCR-aware preprocessing
  • morphology-sensitive tokenization
  • transformer fine-tuning

CTNLPR aims to improve the quality and inclusivity of multilingual AI systems for Sri Lankan Tamil.

As AI systems grow more dependent on language-specific data, foundational NLP research for low-resource languages remains critical for ensuring broader linguistic representation in future AI technologies.