Blog ← Center for Tamil Natural Language Processing Research

Building a Transformer-Based Tamil Relation Extraction Pipeline for Knowledge Graph Construction

Jun 23, 2026

Knowledge Graph construction from unstructured text remains one of the fundamental challenges in Natural Language Processing (NLP). The process requires identifying entities, understanding semantic relationships between them, and converting textual information into structured triples that can be stored, queried, and reasoned over. For low-resource languages such as Sri Lankan Tamil, this challenge is even ...

Building a Tamil Coreference Resolution System for Information Extraction

Jun 15, 2026

Modern Information Extraction systems rely on more than Named Entity Recognition (NER). While NER can identify entities such as people, locations, and organizations, it does not explain how references to those entities evolve throughout a document. This is where Coreference Resolution (CR) becomes essential. Coreference Resolution is the task of determining when multiple mentions within ...

Fine-Tuning IndicNER for Sri Lankan Tamil Named Entity Recognition

May 26, 2026

The rapid evolution of transformer-based multilingual NLP systems has significantly improved Named Entity Recognition (NER) performance across many high-resource languages. However, low-resource language variants such as Sri Lankan Tamil still face substantial challenges due to limited domain-specific datasets and linguistic underrepresentation in existing multilingual training corpora. At CTNLPR (Center for Tamil Natural Language Processing Research), ...

Building a Sri Lankan Tamil Named Entity Recognition Dataset for Low-Resource NLP

May 25, 2026

The growth of Large Language Models (LLMs) and multilingual NLP systems has significantly improved language technologies across major global languages. However, low-resource languages such as Sri Lankan Tamil still face a severe lack of high-quality annotated datasets—especially for foundational tasks like Named Entity Recognition (NER). To address this gap, we developed the Srilankan-Tamil-NER Dataset, a ...

Building a Tamil Coreference Resolution System: Zero-Shot Contextual Span Modeling with MuRIL for Low-Resource NLP

May 15, 2026

Coreference Resolution (CR) is one of the most important tasks in Natural Language Processing (NLP). It focuses on identifying whether multiple expressions within a document refer to the same real-world entity. Resolving these semantic references is critical for document-level understanding and directly impacts downstream NLP systems such as Information Extraction, Knowledge Graph Construction, Question Answering ...

Building a Morphology-Aware Tamil NER Pipeline: From Transformer Extraction to Canonical Entity Resolution

May 15, 2026

Named Entity Recognition (NER) is a foundational layer in modern NLP systems. It directly impacts downstream applications such as search, indexing, knowledge graph construction, entity linking, and Retrieval-Augmented Generation (RAG). However, for Tamil and other morphologically rich languages, entity extraction alone is insufficient. The larger challenge lies in canonicalizing entity variants into stable root forms ...

Building a Tamil Keyword Extraction Pipeline: From Statistical Methods to Embedding-Based Semantic Extraction

May 15, 2026

Keyword extraction is a foundational component in modern NLP systems. It directly impacts search quality, indexing efficiency, document summarization, topic discovery, and Retrieval-Augmented Generation (RAG) pipelines. However, for low-resource languages such as Tamil, keyword extraction introduces several challenges that extend beyond simply applying existing algorithms. In practice, achieving reliable extraction required careful system-level engineering involving ...

Designing an Efficient Reranking Layer: Multilingual Cross-Encoder Optimization for Tamil–English RAG

May 15, 2026

Retrieval-Augmented Generation (RAG) systems are fundamentally dependent on retrieval quality. In multilingual environments, dense retrieval can surface relevant passages, but retrieval alone is often insufficient for producing reliable downstream responses. In practice, not all retrieved chunks are equally relevant. Passing large sets of loosely related candidates directly into the LLM increases token usage, introduces noisy ...

Document AI in Production: Cost, Language, and Infrastructure Constraints

May 15, 2026

Document extraction is often straightforward at demo scale. However, under real production conditions—large document volumes, multilingual corpora, and limited hardware resources—the problem becomes significantly more complex. Modern Document AI systems must optimize not only for accuracy, but also for scalability, latency, infrastructure cost, and multilingual robustness. Constraint 1: Cost (Operational Expenditure). Cloud-based OCR and parsing ...

Designing a Bilingual RAG System: Cross-Lingual Dense Retrieval for Tamil–English

May 15, 2026

Retrieval-Augmented Generation (RAG) systems are highly dependent on retrieval quality. In multilingual environments, the problem becomes significantly more complex because the retrieval layer must operate across languages while maintaining semantic consistency. For Tamil–English systems, one of the key challenges is cross-lingual retrieval — enabling a query in Tamil to retrieve semantically relevant Tamil and English ...