Center for Tamil Natural Language Processing Research
https://www.ctnlpr.com

Building a Tamil Coreference Resolution System: Zero-Shot Contextual Span Modeling with MuRIL for Low-Resource NLP
https://www.ctnlpr.com/2026/05/15/building-a-tamil-coreference-resolution-system-zero-shot-contextual-span-modeling-with-muril-for-low-resource-nlp/
Fri, 15 May 2026 15:36:33 +0000

Coreference Resolution (CR) is one of the most important tasks in Natural Language Processing (NLP). It focuses on identifying whether multiple expressions within a document refer to the same real-world entity.

Resolving these semantic references is critical for document-level understanding and directly impacts downstream NLP systems such as:

  • Information Extraction
  • Knowledge Graph Construction
  • Question Answering
  • Relation Extraction
  • Semantic Search
  • Conversational AI
  • Retrieval-Augmented Generation (RAG)

While English NLP already has mature coreference systems and libraries, Tamil remains a highly challenging low-resource language for discourse-level semantic modeling.

As part of ongoing Tamil NLP research at CTNLPR, we explored how modern multilingual coreference architectures can be adapted for low-resource Tamil language understanding.

Why Tamil Coreference Resolution Is Challenging

Most existing coreference systems are heavily optimized for English. Architectures such as:

  • spaCy-based pipelines
  • NeuralCoref
  • AllenNLP coreference models
  • Transformer-based antecedent ranking systems

rely on assumptions and linguistic patterns that do not transfer cleanly to Tamil.

Tamil introduces several structural and linguistic complexities:

  • Agglutinative morphology
  • Free word order
  • Pronoun dropping
  • Implicit subject references
  • Rich inflectional structures
  • Long-distance discourse dependencies
  • Noun-to-noun semantic references

These characteristics make direct adaptation of English-centric architectures highly unreliable.


Low-Resource Constraints

Another major challenge is the lack of Tamil-specific resources.

To our knowledge, at the time of writing:

  • No dedicated Tamil coreference libraries exist publicly
  • No production-ready Tamil CR models are available
  • Large annotated Tamil coreference corpora are unavailable
  • Most multilingual systems remain heavily English-biased

This makes supervised transformer training approaches difficult to scale for Tamil.


Limitations of Traditional Supervised Approaches

Many multilingual coreference systems rely on supervised transformer architectures combined with linguistic feature engineering.

These systems often incorporate features such as:

  • Entity type
  • Word category
  • Number agreement
  • Gender agreement
  • Antecedent indicators

Although effective for high-resource languages, these approaches depend heavily on:

  • Manually annotated datasets
  • Task-specific supervision
  • Expensive training pipelines
  • Large computational resources

For Tamil, the absence of benchmark-scale annotated corpora becomes a critical bottleneck.


A Different Direction: Zero-Shot Contextual Modeling

After evaluating the options above, we found that zero-shot multilingual contextual modeling is a more scalable direction for Tamil coreference resolution.

Instead of relying on supervised antecedent ranking architectures, the CTNLPR approach focuses on:

  • Contextual span modeling
  • Semantic similarity learning
  • Unsupervised clustering
  • Multilingual transformer representations

This reduces dependency on large annotated datasets while still leveraging strong contextual semantic understanding.


System Architecture

The architecture currently being explored at CTNLPR consists of:

  • MuRIL-based contextual embeddings
  • Span-based mention detection
  • Contextual span representations
  • Cosine similarity-based semantic linking
  • Agglomerative clustering

MuRIL-Based Contextual Encoding

The system uses MuRIL as the contextual encoder.

MuRIL (Multilingual Representations for Indian Languages) is particularly suitable because it was pretrained specifically on Indian-language text, including Tamil, alongside English.

The encoder generates contextual token embeddings that capture:

  • Semantic context
  • Syntactic dependencies
  • Cross-sentence relationships
  • Multilingual language patterns

This provides a stronger foundation for Tamil discourse modeling compared to English-centric multilingual models.


Span Candidate Generation

Instead of manually defining antecedents, the system automatically generates candidate spans from Tamil text.

These spans may represent:

  • Pronouns
  • Named entities
  • Noun phrases
  • Semantic references

This span-based strategy allows the architecture to remain flexible and language-independent.
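
As a minimal sketch of span candidate generation, the snippet below enumerates every contiguous token span up to a fixed width. The function name `enumerate_spans` and the `max_width` parameter are illustrative assumptions, not names from the CTNLPR system, and the three-token sentence is only an example input.

```python
def enumerate_spans(tokens, max_width=3):
    """Return (start, end) index pairs for every span of up to max_width tokens.

    `end` is exclusive, so tokens[start:end] recovers the span's tokens.
    """
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_width, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


# Hypothetical pre-tokenized sentence; tokenization itself is out of scope here.
tokens = ["அவன்", "பள்ளிக்கு", "சென்றான்"]
spans = enumerate_spans(tokens, max_width=2)
```

Exhaustive enumeration grows quickly with document length, which is one reason a pruning step is usually applied before similarity scoring.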


Contextual Span Representation

Once candidate spans are generated, contextual span embeddings are constructed using token-level representations from MuRIL.

The system applies:

  • Mean pooling over token embeddings
  • Span-level semantic encoding
  • Context-aware representation construction

This enables semantically similar mentions to occupy nearby regions within the embedding space.
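
Mean pooling over a span can be sketched as follows. The toy two-dimensional vectors stand in for encoder output (a BERT-base-sized encoder such as MuRIL emits 768-dimensional token vectors), and the helper name `mean_pool` is an assumption made for illustration.

```python
def mean_pool(token_vectors, start, end):
    """Average the token vectors in [start, end) into a single span vector."""
    window = token_vectors[start:end]
    if not window:
        raise ValueError("span must cover at least one token")
    dim = len(window[0])
    return [sum(vec[i] for vec in window) / len(window) for i in range(dim)]


# Toy token embeddings: one vector per token, in document order.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
span_vector = mean_pool(token_vectors, 0, 2)  # pools the first two tokens
```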


Semantic Similarity and Clustering

After span representation construction, the system performs:

  • Heuristic span pruning
  • Cosine similarity computation
  • Similarity-driven clustering

Agglomerative clustering is then used to group semantically related mentions into discourse-level entity chains.

This removes the need for explicit antecedent ranking while still enabling coherent coreference grouping.
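
The grouping step can be illustrated with a small average-linkage agglomerative clustering over cosine similarity. This is a sketch under stated assumptions: the linkage criterion, the `threshold` value of 0.8, and the toy span vectors are all illustrative choices, since the source does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agglomerative_clusters(vectors, threshold=0.8):
    """Merge the two most similar clusters until no pair reaches `threshold`.

    Similarity between clusters is the average pairwise cosine similarity
    (average linkage). Returns lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_score, best_pair = float("-inf"), None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = linkage(clusters[x], clusters[y])
                if score > best_score:
                    best_score, best_pair = score, (x, y)
        if best_score < threshold:
            break  # remaining clusters are too dissimilar to merge
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy span vectors: the first two point one way, the last two another.
span_vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]]
chains = agglomerative_clusters(span_vectors, threshold=0.8)
```

Each resulting cluster plays the role of one entity chain. The threshold controls how aggressively mentions are merged, so it would need tuning on real data.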


Why This Approach Works Well for Tamil

This direction is particularly effective for Tamil because it avoids dependency on large manually annotated datasets while leveraging multilingual contextual knowledge learned during transformer pretraining.

Key advantages include:

  • Zero-shot learning capability
  • Scalability for low-resource environments
  • Reduced annotation dependency
  • Contextual semantic understanding
  • Better adaptation to Tamil morphology
  • Lightweight unsupervised inference

This makes the architecture more practical for real-world Tamil NLP deployment scenarios.


Key Insight

One of the most important findings from this research is:

Coreference resolution for Tamil is not primarily a rule-engineering problem — it is a contextual semantic modeling problem.

Strong discourse understanding requires:

  • Context-aware embeddings
  • Span-level semantic reasoning
  • Cross-sentence entity linking
  • Language-aware discourse modeling

Without these components, downstream systems struggle to maintain semantic consistency across documents.


System Implementation at CTNLPR

At CTNLPR, we are actively exploring and prototyping this zero-shot multilingual coreference architecture as part of the ongoing Noolaham GPT initiative.

Our work currently includes:

  • Researching multilingual CR architectures for Tamil
  • Evaluating MuRIL-based contextual encoding
  • Designing span-based semantic linking pipelines
  • Exploring unsupervised clustering strategies
  • Building scalable low-resource discourse understanding systems

The long-term objective is to develop production-grade Tamil discourse understanding infrastructure for multilingual AI systems.


Conclusion

Building effective Tamil coreference resolution systems requires significantly more than adapting English NLP pipelines.

By combining:

  • Multilingual contextual transformers
  • Span-based semantic representations
  • Similarity-driven clustering
  • Zero-shot multilingual inference

we are developing a scalable architecture for document-level semantic understanding in Tamil.

This work represents an important step toward building modern, context-aware Tamil NLP systems for real-world low-resource AI applications.

#TamilNLP #CoreferenceResolution #NLP #MuRIL #LowResourceAI #KnowledgeGraphs #SemanticSearch #RAG #AIEngineering #CTNLPR


Building a Morphology-Aware Tamil NER Pipeline: From Transformer Extraction to Canonical Entity Resolution
https://www.ctnlpr.com/2026/05/15/building-a-morphology-aware-tamil-ner-pipeline-from-transformer-extraction-to-canonical-entity-resolution/
Fri, 15 May 2026 12:34:43 +0000

Named Entity Recognition (NER) is a foundational layer in modern NLP systems. It directly impacts downstream applications such as search, indexing, knowledge graph construction, entity linking, and Retrieval-Augmented Generation (RAG).

However, for Tamil and other morphologically rich languages, entity extraction alone is insufficient. The larger challenge lies in canonicalizing entity variants into stable root forms that can be consistently indexed and linked across documents.

As part of the ongoing Tamil NLP and Noolaham GPT initiatives at CTNLPR, we designed and implemented a morphology-aware Tamil NER pipeline that combines transformer-based extraction with morphological normalization and canonical entity resolution.

Why Tamil NER Is Challenging

Tamil is a highly agglutinative language. A single entity can appear in multiple inflected surface forms depending on grammatical suffixes, case markers, and contextual usage.

For example:

  • இலங்கையில்
  • இலங்கையிலும்
  • இலங்கையிலே

Although all refer to the same canonical entity (“இலங்கை”), transformer-based NER systems typically extract them as separate entities.

This creates several downstream problems:

  • Entity duplication
  • Inconsistent indexing
  • Fragmented knowledge graphs
  • Reduced retrieval quality in RAG systems

As a result, canonicalization becomes a critical requirement in Tamil NLP pipelines.


System Architecture

At CTNLPR, we designed a Tamil-aware NER architecture consisting of:

  • IndicNER → transformer-based entity extraction
  • Custom span merging → IOB tag consolidation
  • Prefix-based grouping → heuristic variant clustering
  • Morphological normalization layer → canonical entity resolution

The system extends standard transformer-based NER pipelines with language-aware normalization strategies specifically designed for Tamil morphology.


Transformer-Based Entity Extraction

We used IndicNER as the baseline entity extraction model.

Advantages:

  • Strong multilingual transformer backbone
  • High recall across entity categories
  • Effective extraction on noisy multilingual corpora

However, a major limitation became immediately apparent:

Transformer NER models do not enforce canonical forms.

As a result, multiple inflected variants of the same entity were treated as separate entities.


Method Exploration and Evaluation

We evaluated several normalization approaches to resolve this issue.

Prefix-Based Grouping

A lightweight heuristic approach based on shared token prefixes.

Advantages:

  • Fast execution
  • Simple clustering mechanism

Limitations:

  • Not linguistically grounded
  • Unstable under complex suffix chains

Although useful as an intermediate clustering layer, it lacked sufficient linguistic robustness.
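
A minimal sketch of prefix-based grouping, assuming a fixed prefix length measured in Unicode code points. The function name `group_by_prefix` and the `prefix_len` value are illustrative. Python slices by code point, so a prefix can cut through a Tamil grapheme cluster, which is one concrete way the heuristic fails to be linguistically grounded.

```python
from collections import defaultdict


def group_by_prefix(entities, prefix_len=6):
    """Bucket entity strings that share their first `prefix_len` code points."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity[:prefix_len]].append(entity)
    return dict(groups)


entities = ["இலங்கையில்", "இலங்கையிலும்", "இந்தியாவில்"]
groups = group_by_prefix(entities, prefix_len=6)
```

Here the two inflected forms of இலங்கை land in one bucket, but the result depends entirely on the chosen prefix length: a shorter prefix would start merging unrelated entities, and a longer one would split variants apart.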

UoM Thamizhi Morphological Normalizer

We evaluated the morphological normalization approach developed by the University of Moratuwa.

Advantages:

  • Linguistically motivated
  • Rule-based morphological normalization

However, several practical limitations emerged:

  • Reduced robustness on noisy OCR text
  • Difficulty handling complex suffix combinations
  • Poor generalization to unseen word forms

While theoretically sound, the system struggled under real-world document conditions.

Tamil Lemmatizer (Final Approach)

The strongest results were achieved using a Tamil lemmatization-based normalization layer.

Advantages:

  • Consistent root-form extraction
  • Better handling of inflected variants
  • Improved robustness across document types
  • Strong empirical performance under noisy conditions

This became the final canonicalization layer integrated into our production pipeline.

Canonical Entity Resolution

The final architecture performs many-to-one normalization:

Surface Form → Canonical Entity

  • இலங்கையில் → இலங்கை
  • இலங்கையிலும் → இலங்கை
  • இந்தியாவில் → இந்தியா

This canonical mapping layer significantly improved consistency across downstream NLP systems.
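
The many-to-one mapping can be sketched as a lookup followed by re-indexing. The hand-written `CANONICAL` dict only mirrors the three example pairs listed above. In the pipeline described here that mapping would come from the lemmatization layer, whose API is not shown in the source, so the dict is a stand-in and the document IDs are hypothetical.

```python
from collections import defaultdict

# Stand-in for lemmatizer output: surface form -> canonical entity.
CANONICAL = {
    "இலங்கையில்": "இலங்கை",
    "இலங்கையிலும்": "இலங்கை",
    "இந்தியாவில்": "இந்தியா",
}


def reindex(mentions):
    """Group (doc_id, surface_form) mentions under their canonical entity."""
    index = defaultdict(set)
    for doc_id, surface in mentions:
        canonical = CANONICAL.get(surface, surface)  # unknown forms map to themselves
        index[canonical].add(doc_id)
    return {entity: sorted(docs) for entity, docs in index.items()}


mentions = [
    ("doc1", "இலங்கையில்"),
    ("doc2", "இலங்கையிலும்"),
    ("doc2", "இந்தியாவில்"),
]
entity_index = reindex(mentions)
```

After re-indexing, both documents are retrievable under the single key இலங்கை instead of under two separate surface forms.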


Pipeline Design

The final production pipeline operates as follows:

  1. Document ingestion and chunking
  2. Transformer inference using IndicNER
  3. IOB span merging and filtering
  4. Prefix-based variant aggregation
  5. Morphological normalization (lemmatization layer)
  6. Canonical entity re-indexing

This hybrid architecture combines transformer-based recall with morphology-aware normalization.
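
Step 3 of the pipeline, IOB span merging, can be sketched as below. The tag names (`B-ORG`, `I-ORG`, `B-LOC`, `O`) and the example tokens are illustrative assumptions. A real pipeline would also have to rejoin subword pieces produced by the transformer tokenizer, which this sketch leaves out.

```python
def merge_iob(tokens, tags):
    """Collapse token-level IOB tags into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts; a stray I- tag is treated like a B- tag.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], label
        elif prefix == "I":
            current.append(token)  # continue the open entity
        else:
            # "O" closes any open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["யாழ்ப்பாணப்", "பல்கலைக்கழகம்", "இலங்கையில்", "உள்ளது"]
tags = ["B-ORG", "I-ORG", "B-LOC", "O"]
merged = merge_iob(tokens, tags)
```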


System-Level Challenges

Several engineering challenges emerged during implementation:

Agglutinative Morphology

Tamil suffix chains can become extremely complex, making direct canonicalization difficult without normalization.

Variant Explosion

A single entity appearing across multiple grammatical forms produced fragmented entity representations across documents.

OCR and Noisy Input

Historical and scanned documents introduced spelling inconsistencies and OCR artifacts, reducing normalization stability.

Cross-Document Consistency

Canonical entity representation had to remain stable across multi-document pipelines to support reliable indexing and retrieval.


Observations and Results

Across experiments, several patterns became clear:

  • Transformer NER → high recall, weak canonical consistency
  • Prefix grouping → useful heuristic but linguistically limited
  • UoM normalizer → theoretically strong but operationally fragile
  • Lemmatization → strongest practical normalization performance

The final production system combined IndicNER with morphological lemmatization to achieve robust multilingual entity extraction and canonicalization.


Key Insight

One of the most important findings from this work was:

In Tamil NER, the primary challenge is not entity detection — it is morphological normalization.

NER output should not be treated as the final entity representation.

Canonicalization is essential for:

  • Indexing
  • Entity linking
  • Knowledge graph construction
  • Retrieval-Augmented Generation (RAG)
  • Cross-document semantic consistency

Without normalization, downstream systems become fragmented and unreliable.


System Implementation at CTNLPR

At CTNLPR, this morphology-aware NER architecture was integrated into our broader Tamil NLP infrastructure as part of the ongoing Noolaham GPT initiative.

Our work included:

  • Evaluating transformer-based Tamil NER models
  • Designing canonical entity resolution pipelines
  • Benchmarking normalization strategies
  • Integrating morphology-aware lemmatization layers
  • Improving cross-document entity consistency
  • Optimizing downstream indexing and retrieval performance

The final system now serves as a foundational entity extraction layer for multilingual search, knowledge graphs, and RAG systems being developed at CTNLPR.


Outcome

The final production-ready Tamil NER system:

  • Resolves inflected entity variants
  • Produces stable canonical entity forms
  • Improves downstream retrieval quality
  • Supports scalable multi-document processing
  • Enables consistent entity-centric indexing and analytics

Most importantly, the system demonstrates that reliable Tamil NER requires both transformer-based extraction and morphology-aware canonicalization.


Conclusion

Building effective NER systems for Tamil requires significantly more than transformer-based entity extraction.

By combining:

  • Transformer NER models
  • IOB span consolidation
  • Morphological normalization
  • Canonical entity resolution

we developed a scalable morphology-aware Tamil NER architecture optimized for multilingual NLP systems and large-scale document processing.

#TamilNLP #NER #Morphology #IndicNLP #Transformers #AIEngineering #LowResourceLanguages #RAG #CTNLPR


Building a Tamil Keyword Extraction Pipeline: From Statistical Methods to Embedding-Based Semantic Extraction
https://www.ctnlpr.com/2026/05/15/building-a-tamil-keyword-extraction-pipeline-from-statistical-methods-to-embedding-based-semantic-extraction/
Fri, 15 May 2026 11:05:36 +0000

Keyword extraction is a foundational component in modern NLP systems. It directly impacts search quality, indexing efficiency, document summarization, topic discovery, and Retrieval-Augmented Generation (RAG) pipelines.

However, for low-resource languages such as Tamil, keyword extraction introduces several challenges that extend beyond simply applying existing algorithms. In practice, achieving reliable extraction required careful system-level engineering involving preprocessing, tokenization, normalization, and multilingual semantic modeling.

As part of the ongoing Noolaham GPT initiative at CTNLPR, we designed and implemented a Tamil-aware keyword extraction pipeline that combines statistical methods, graph-based ranking, and embedding-based semantic extraction.

Why Keyword Extraction Is Challenging for Tamil

Many existing keyword extraction libraries are primarily designed for English and other high-resource languages. Tamil introduces additional complexities:

  • Agglutinative morphology
  • Spelling and orthographic variations
  • Limited language-specific NLP tooling
  • Inconsistent tokenization behavior
  • Lack of native Tamil stopword integration

As a result, directly applying standard extraction libraries often produces noisy or semantically weak keywords.


System Architecture

Our pipeline integrates multiple extraction strategies:

  • scikit-learn → TF-IDF-based statistical extraction
  • Gensim → TextRank graph-based ranking
  • KeyBERT → embedding-based semantic extraction

To support Tamil processing, we integrated Indic NLP preprocessing techniques, including:

  • Tokenization
  • Text normalization
  • Stopword filtering
  • Diacritic handling

This preprocessing layer was critical for stabilizing downstream extraction quality.


Method Evaluation

We evaluated multiple extraction approaches to understand their strengths and limitations under Tamil NLP conditions.

TF-IDF

TF-IDF provided a strong statistical baseline for identifying globally important terms across corpora.

Advantages:

  • Lightweight
  • Fast computation
  • Effective for corpus-level frequency analysis

Limitations:

  • No semantic understanding
  • Frequency-driven ranking only
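
To make the frequency-driven behaviour concrete, here is a minimal sketch of the textbook weighting tf × log(N / df) in plain Python. It is not the scikit-learn implementation, which applies smoothing and normalization by default, and the three tiny tokenized documents are invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Score each term per document with tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["தமிழ்", "மொழி", "ஆய்வு"],
    ["தமிழ்", "இலக்கியம்"],
    ["தமிழ்", "மொழி", "வரலாறு"],
]
scores = tfidf(docs)
```

A term that occurs in every document, such as தமிழ் here, scores zero regardless of its meaning, which is the "no semantic understanding" limitation in practice.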

TextRank

TextRank adapts the PageRank algorithm to textual data by constructing graph relationships between words.

Advantages:

  • Unsupervised
  • No training required

Limitations:

  • Highly sensitive to tokenization quality
  • Graph instability under noisy Tamil segmentation
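
The graph construction behind TextRank can be sketched in a few lines. This is a simplified version (undirected co-occurrence edges, unweighted PageRank, fixed iteration count) rather than the Gensim implementation, and the `window`, `damping`, and `iterations` defaults are illustrative assumptions.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=30):
    """Rank tokens by PageRank over an undirected co-occurrence graph."""
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each token to the tokens that follow it inside the window.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = ["மொழி", "ஆய்வு", "மொழி", "வரலாறு", "மொழி", "இலக்கியம்"]
ranked = textrank(tokens)
```

Because edges are created purely from token adjacency, a different segmentation of the same text yields a different graph, which is where the sensitivity to tokenization comes from.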

YAKE

YAKE performed well as a lightweight per-document keyword extraction baseline.

Advantages:

  • Fast inference
  • No external models required
  • Reliable for smaller documents

However, semantic understanding remained limited.


Embedding-Based Extraction with KeyBERT

The strongest results were achieved using KeyBERT combined with Tamil-capable embedding models.

A critical realization during experimentation was: KeyBERT itself is not language-aware. Its effectiveness depends entirely on the embedding model.

Using default English embeddings produced poor semantic extraction quality for Tamil documents. Replacing them with Tamil and multilingual Indic embedding models dramatically improved performance.


Embedding Models Evaluated

We integrated several multilingual and Tamil-specific embedding models:

  • l3cube-pune/tamil-sentence-bert-nli
  • ai4bharat/indic-bert
  • paraphrase-multilingual-mpnet-base-v2

The extraction pipeline operates as follows:

  1. Document embedding generation
  2. Candidate n-gram extraction
  3. Semantic similarity computation
  4. Relevance-based ranking

This allowed semantically meaningful phrases to be identified even when they were not frequency-dominant.
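
The four steps above can be sketched without any model dependency by making the embedding function pluggable. Everything here is an assumption for illustration: `char_ngram_embed` is a character n-gram counter that stands in for a sentence-embedding model and carries no semantic knowledge, and `rank_keywords` is not the KeyBERT API.

```python
import math
from collections import Counter


def char_ngram_embed(text, n=3):
    """Stand-in embedding: character n-gram counts. NOT a semantic model."""
    return Counter(text[i : i + n] for i in range(max(len(text) - n + 1, 1)))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_ngrams(tokens, max_n=2):
    """All contiguous word n-grams of length 1..max_n."""
    return {
        " ".join(tokens[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def rank_keywords(tokens, embed=char_ngram_embed, top_n=3, max_n=2):
    """Embed the document and each candidate, then rank by cosine similarity."""
    doc_vec = embed(" ".join(tokens))
    scored = [
        (cand, cosine(embed(cand), doc_vec))
        for cand in candidate_ngrams(tokens, max_n)
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


tokens = ["தமிழ்", "மொழி", "ஆய்வு", "மையம்"]
keywords = rank_keywords(tokens)
```

Swapping `embed` for a function backed by a Tamil-capable sentence-embedding model is what turns this from surface overlap into semantic ranking, which matches the observation that the embedding model, not the ranking loop, determines quality.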


System-Level Challenges

Several engineering challenges emerged during implementation:

Tamil Stopword Handling

Tamil stopwords are not automatically supported in most libraries. We implemented custom stopword filtering layers to improve extraction precision.
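
A custom filtering layer can be as small as a set lookup. The five-word list below is a tiny illustrative sample of common Tamil function words, not the list used at CTNLPR. A production list would be much longer and would ideally be applied after normalization.

```python
# Illustrative sample only; not an exhaustive Tamil stopword list.
TAMIL_STOPWORDS = {"ஒரு", "என்று", "மற்றும்", "இந்த", "அந்த"}


def remove_stopwords(tokens, stopwords=TAMIL_STOPWORDS):
    """Drop tokens that appear in the stopword set, preserving order."""
    return [token for token in tokens if token not in stopwords]


filtered = remove_stopwords(["இந்த", "மொழி", "மற்றும்", "இலக்கியம்"])
```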

Text Normalization

Tamil text frequently contains:

  • Orthographic variations
  • Diacritic inconsistencies
  • OCR-induced noise

Normalization significantly improved embedding consistency and keyword quality.
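
One low-cost normalization step is Unicode canonical composition plus removal of invisible characters. The sketch below uses only the Python standard library. The choice of NFC and the decision to strip zero-width characters are assumptions about what a normalizer might do, not a description of the CTNLPR layer, and stripping zero-width joiners may not be appropriate for every corpus.

```python
import re
import unicodedata

# Zero-width space, non-joiner, joiner, and byte-order mark.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")


def normalize_tamil(text):
    """Compose split vowel signs, drop zero-width characters, collapse spaces."""
    text = unicodedata.normalize("NFC", text)
    text = ZERO_WIDTH.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


# The vowel sign O typed as two code points (U+0BC6 + U+0BBE) becomes U+0BCA.
normalized = normalize_tamil("க\u0bc6\u0bbe  டி\u200c")
```

Two strings that render identically but differ in code points would otherwise be treated as different tokens by both the stopword filter and the embedding model's tokenizer.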

Tokenization Stability

Graph-based and embedding-based extraction methods were highly sensitive to tokenization behavior. Consistent segmentation became essential for reliable extraction.


Observations and Results

Across experiments, several patterns became clear:

  • TF-IDF → effective for corpus-level topic signals
  • YAKE → strong lightweight baseline
  • TextRank → unstable under noisy tokenization
  • KeyBERT + Tamil embeddings → strongest semantic keyword quality

The final embedding-based pipeline consistently produced more contextually meaningful keywords compared to purely statistical methods.


Key Insight

One of the most important findings from this work was:

In Tamil NLP, keyword extraction is not constrained primarily by the extraction algorithm itself.

The dominant factors are:

  • Embedding quality
  • Tokenization consistency
  • Text normalization quality

Without solving these foundational issues, even advanced extraction algorithms produce weak results.


System Implementation at CTNLPR

At CTNLPR, this keyword extraction pipeline was integrated into the broader Noolaham GPT ecosystem as part of our multilingual NLP infrastructure.

Our work included:

  • Evaluating extraction algorithms for Tamil corpora
  • Integrating Tamil-aware preprocessing pipelines
  • Benchmarking embedding models for semantic extraction
  • Building embedding-based keyword ranking systems
  • Optimizing extraction quality for downstream RAG and search pipelines

The final system now supports semantically meaningful keyword extraction across multilingual document collections.


Outcome

The final production pipeline:

  • Produces semantically relevant Tamil keywords
  • Handles multilingual document collections
  • Integrates into downstream search and RAG systems
  • Supports scalable NLP workflows for low-resource languages

Most importantly, the work demonstrates that high-quality keyword extraction for Tamil is achievable when embedding quality and language-specific preprocessing are treated as first-class system components.


Conclusion

Keyword extraction for low-resource languages requires significantly more than applying standard NLP libraries out of the box.

By combining:

  • Statistical extraction methods
  • Graph-based ranking
  • Embedding-driven semantic extraction
  • Tamil-aware preprocessing pipelines

we developed a scalable keyword extraction system optimized for Tamil NLP applications and multilingual retrieval systems.


#NLP #TamilNLP #KeywordExtraction #MultilingualAI #Embeddings #KeyBERT #TextMining #AI #CTNLPR


Designing an Efficient Reranking Layer: Multilingual Cross-Encoder Optimization for Tamil–English RAG
https://www.ctnlpr.com/2026/05/15/designing-an-efficient-reranking-layer-multilingual-cross-encoder-optimization-for-tamil-english-rag/
Fri, 15 May 2026 10:58:54 +0000

Retrieval-Augmented Generation (RAG) systems are fundamentally dependent on retrieval quality. In multilingual environments, dense retrieval can surface relevant passages, but retrieval alone is often insufficient for producing reliable downstream responses.

In practice, not all retrieved chunks are equally relevant. Passing large sets of loosely related candidates directly into the LLM increases token usage, introduces noisy context, and degrades overall response quality.

For Tamil–English multilingual systems, this problem becomes significantly more pronounced due to cross-lingual ranking inconsistencies.

The Ranking Problem in Multilingual RAG

Dense retrievers are generally effective at identifying semantically related documents. However, they frequently struggle with ranking precision, especially for mixed-language queries involving Tamil and English.

This creates several downstream issues:

  • Relevant documents ranked lower than less relevant passages
  • Cross-lingual ranking inconsistencies
  • Reduced answer quality during generation
  • Increased context noise inside the LLM

As a result, retrieval quality cannot be measured solely by recall. Precise ranking becomes equally critical.


Core Approach

At CTNLPR, we designed and integrated a cross-encoder reranking layer into our Tamil–English RAG pipeline to refine multilingual retrieval results before generation.

Unlike traditional bi-encoder retrieval systems, cross-encoders jointly process the query and candidate document during inference. This allows the reranker to capture fine-grained semantic relationships between the query and retrieved passages.

The reranking layer provides several advantages:

  • Joint query–document semantic encoding
  • Improved ranking precision
  • Better multilingual alignment
  • More stable cross-lingual relevance scoring

This enables accurate ordering of multilingual retrieval candidates prior to LLM inference.


Model Evaluation

We evaluated multiple multilingual reranking models under CPU-constrained deployment environments:

  • BGE-v2-m3 → strong ranking quality, higher CPU latency
  • jina-v3-multi → strong cross-lingual consistency
  • jina-v2-cpu-opt → best latency–quality balance
  • gte-multilingual → stable multilingual performance

During evaluation, several limitations became apparent when reranking was absent:

  • Correct documents retrieved but incorrectly ordered
  • Ranking instability for mixed-language queries
  • Lexical fusion methods (e.g., RRF) introducing cross-script noise

These observations highlighted that multilingual retrieval requires ranking-aware optimization, not retrieval alone.


Two-Stage Retrieval Architecture

The final system uses a two-stage retrieval pipeline:

  1. Dense retrieval generates Top-K candidate passages
  2. Cross-encoder reranker evaluates semantic relevance
  3. Candidate passages are rescored and reordered
  4. Top-ranked chunks are passed to the LLM

This architecture improves retrieval precision while controlling downstream context size.
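
The control flow of the two stages can be sketched independently of any particular model. In the snippet below `retrieve` and `rerank_score` are injected callables. The toy versions are stand-ins (a slice of a fixed list and a word-overlap count), where a real system would plug in a dense retriever and a cross-encoder's pairwise score.

```python
def two_stage_retrieve(query, retrieve, rerank_score, top_k=20, top_n=5):
    """Retrieve top_k candidates, rescore each against the query, keep top_n."""
    candidates = retrieve(query, top_k)  # stage 1: cheap, recall-oriented
    scored = [(rerank_score(query, doc), doc) for doc in candidates]  # stage 2
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


# Toy stand-ins so the sketch runs without any model.
corpus = ["tamil grammar notes", "tamil history archive", "english grammar guide"]


def toy_retrieve(query, k):
    return corpus[:k]


def toy_score(query, doc):
    return len(set(query.split()) & set(doc.split()))


top = two_stage_retrieve("tamil grammar", toy_retrieve, toy_score, top_k=3, top_n=1)
```

The reranker is called once per candidate, so `top_k` directly sets the number of forward passes, while `top_n` bounds how much context reaches the LLM.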


CPU Optimization Strategy

Cross-encoder rerankers are computationally expensive, especially in CPU-only environments. Since our deployment environment prioritizes CPU inference efficiency, optimization became a central engineering challenge.

The primary objective was to maximize ranking quality while minimizing inference latency.


Candidate Reduction (Highest Impact Optimization)

The largest performance improvement came from reducing reranker workload.

Instead of reranking large candidate sets:

  • Top-K was reduced aggressively (e.g., 100 → 20)
  • Forward-pass count was minimized
  • Overall latency dropped significantly

This reduced unnecessary reranker computation while preserving ranking quality.


ONNX Runtime + INT8 Quantization

To improve CPU inference efficiency, we optimized reranker deployment using:

  • PyTorch → ONNX conversion
  • INT8 dynamic quantization

This provided several benefits:

  • Faster inference latency
  • Reduced memory consumption
  • Better CPU throughput
  • Minimal degradation in ranking quality

The optimized runtime significantly improved practical deployment feasibility for multilingual reranking.
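
A hedged sketch of the quantization and loading steps, assuming the reranker has already been exported to an ONNX file. It uses `quantize_dynamic` from `onnxruntime.quantization` and an ONNX Runtime `InferenceSession`. The function names, file paths, and thread count are hypothetical, and the PyTorch-to-ONNX export itself is not shown.

```python
def quantize_reranker(fp32_path, int8_path):
    """Write an INT8 dynamically quantized copy of an exported ONNX model."""
    # Imported lazily so the module loads even where onnxruntime is absent.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(fp32_path, int8_path, weight_type=QuantType.QInt8)
    return int8_path


def load_cpu_session(model_path, num_threads=4):
    """Open the model on the CPU execution provider with a fixed thread count."""
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.intra_op_num_threads = num_threads
    return ort.InferenceSession(
        model_path, sess_options=options, providers=["CPUExecutionProvider"]
    )
```

A typical call sequence would be `quantize_reranker("reranker.onnx", "reranker.int8.onnx")` followed by `load_cpu_session("reranker.int8.onnx")`, with both paths being placeholders. Ranking quality should be re-measured after quantization, since the size of the degradation depends on the model.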


Token and Runtime Optimization

Additional optimizations focused on attention complexity and runtime efficiency:

  • Reduced maximum token length (512 → 256)
  • Optimized CPU threading (OMP / MKL)
  • Efficient tokenization and batching strategies

Since transformer self-attention scales quadratically with sequence length (O(n²)), reducing the maximum token length had a major impact on latency.
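
As a quick worked example of the quadratic term, the snippet below counts attention-score entries per head for the two sequence lengths mentioned above. The helper name is illustrative.

```python
def attention_score_entries(seq_len):
    """Entries in one attention head's score matrix: one per token pair."""
    return seq_len * seq_len


# Halving the sequence length from 512 to 256 quarters the score matrix.
ratio = attention_score_entries(512) / attention_score_entries(256)
```

This covers only the attention-score computation. Other parts of the network, such as the feed-forward layers, scale linearly with sequence length, so the end-to-end speedup is normally smaller than this ratio suggests.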


Performance Signals

Across internal evaluations, the optimized reranking pipeline achieved:

  • Latency reduction from seconds → sub-second inference
  • Strong multilingual ranking quality (MRR / nDCG maintained)
  • Stable Tamil ↔ English ranking consistency
  • Improved downstream LLM response quality

The final architecture achieved production-ready CPU inference performance while maintaining strong ranking precision.
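
For reference, the two ranking metrics named above can be computed as follows. This is a minimal sketch using the common gain / log2(rank + 1) form of DCG. The input formats (lists of relevance labels in ranked order) are an assumption, and the example labels are invented rather than taken from the internal evaluation.

```python
import math


def mrr(ranked_relevance):
    """Mean reciprocal rank; each item is a list of 0/1 labels in ranked order."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def ndcg(gains, k=None):
    """nDCG@k for one query, with gains listed in ranked order."""
    k = k or len(gains)

    def dcg(values):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values[:k], start=1))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


example_mrr = mrr([[0, 1, 0], [1, 0, 0]])
example_ndcg = ndcg([0, 1])
```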


What Didn’t Work

Several approaches produced unstable results in multilingual environments:

  • Similarity-threshold filtering → inconsistent across scripts
  • Reciprocal Rank Fusion (RRF) → introduced lexical noise between Tamil and English

These methods lacked sufficient semantic robustness for multilingual ranking tasks.
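
For context, Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the input rankings. The sketch below uses k = 60, the value commonly used as a default. The two input lists are invented examples, and the source does not say which rankers were fused.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["d1", "d2", "d3"]
lexical = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense, lexical])
```

The fused score depends only on rank positions and never on the query or document text, so a document ranked highly in one list for superficial reasons keeps that advantage after fusion.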


Key Insight

One of the most important findings from this work is: Multilingual RAG is not only a retrieval problem — it is also a ranking precision problem.

In practical systems:

  • Retrieval ensures coverage
  • Reranking ensures correctness

Without effective reranking, even strong retrievers can produce noisy or poorly ordered context for generation.


System Implementation at CTNLPR

At CTNLPR, this reranking architecture was integrated into our broader Tamil–English multilingual RAG pipeline as part of the ongoing Noolaham GPT initiative.

Our work included:

  • Evaluating multilingual cross-encoder rerankers
  • Benchmarking ranking quality using MRR and nDCG
  • Designing CPU-efficient reranking pipelines
  • Implementing ONNX-based optimized inference
  • Reducing latency for real-world multilingual deployments
  • Improving ranking stability across Tamil, English, and mixed-language queries

The final system now supports efficient, scalable, cross-lingual reranking for large multilingual document collections.


Conclusion

Reliable multilingual RAG systems require more than strong dense retrieval models. They require ranking-aware architectures capable of refining multilingual relevance before generation.

By combining:

  • Dense retrieval
  • Cross-encoder reranking
  • CPU optimization techniques
  • Quantized inference pipelines

we developed a scalable multilingual reranking system optimized for Tamil–English retrieval environments.

#RAG #MultilingualAI #CrossLingualRetrieval #TamilNLP #Reranking #DenseRetrieval #VectorSearch #AIInfrastructure #CTNLPR


Document AI in Production: Cost, Language, and Infrastructure Constraints
https://www.ctnlpr.com/2026/05/15/document-ai-in-production-cost-language-and-infrastructure-constraints/
Fri, 15 May 2026 10:45:58 +0000

Document extraction is often straightforward at demo scale. However, under real production conditions—large document volumes, multilingual corpora, and limited hardware resources—the problem becomes significantly more complex.

Modern Document AI systems must optimize not only for accuracy, but also for scalability, latency, infrastructure cost, and multilingual robustness.

Constraint 1: Cost (Operational Expenditure)

Cloud-based OCR and parsing systems such as Google Cloud Vision and LLM-driven document parsers dramatically simplify development and experimentation. However, these systems introduce a linear operational cost model:

  • Cost increases proportionally with document volume
  • Every inference depends on external API calls
  • Pipelines incur continuous recurring expenses

At small scale, these costs may appear manageable. At production scale—especially for large archival digitization projects involving millions of pages—the operational overhead becomes economically unsustainable.

As a result, real-world document processing systems cannot rely exclusively on API-driven architectures.
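To make the linear cost model concrete, here is a small worked calculation. The per-page price is a hypothetical placeholder chosen only for illustration; it is not a quoted price for Google Cloud Vision or any other provider.

```python
def api_cost(pages: int, price_per_1000_pages: float) -> float:
    """Linear cost model: total cost grows in direct proportion to page volume."""
    return pages / 1000 * price_per_1000_pages


# HYPOTHETICAL price, for illustration only; real pricing varies by provider and tier.
PRICE_PER_1000 = 1.50

for pages in (10_000, 1_000_000, 50_000_000):
    print(f"{pages:>12,} pages -> ${api_cost(pages, PRICE_PER_1000):>12,.2f}")
```

Because the model has no fixed component, doubling the archive doubles the bill, and reprocessing a collection after a pipeline change incurs the full cost again.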


Constraint 2: Hardware-Aware Inference

When shifting toward open-source OCR and document understanding models, the bottleneck moves from API cost to infrastructure efficiency.

GPU-based deployments provide high throughput, but scaling GPU infrastructure introduces significant capital and operational expense. CPU-based deployments are considerably more affordable, but require extensive optimization in order to remain practical.

This introduces several engineering requirements:

  • Model optimization techniques (quantization, pruning)
  • Optimized inference runtimes such as ONNX Runtime and OpenVINO
  • Efficient pipeline orchestration and batching strategies

Under these constraints, the objective is no longer “maximum accuracy at any cost.” Instead, the challenge becomes:

Optimizing accuracy, latency, and throughput simultaneously under CPU constraints.
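Of the optimization techniques listed above, quantization is the easiest to show in isolation. The following is a toy sketch of symmetric per-tensor int8 quantization in pure Python. It illustrates the idea only and is not how ONNX Runtime or OpenVINO implement it; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # The scale maps the largest magnitude onto 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in quantized]
```

Each weight is stored in one byte instead of four, at the price of a rounding error of at most half the scale per weight. Whether that error is acceptable for a given OCR or layout model has to be measured, not assumed.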


Document AI Is a Pipeline Problem

A common misconception is that Document AI can be solved using a single OCR model. In practice, production systems are composed of multiple tightly coupled stages:

  • Layout detection and reading-order analysis
  • Region segmentation (text, tables, images)
  • Table structure reconstruction
  • Text detection and recognition
  • Structural parsing and metadata extraction

Overall system performance depends not only on model quality, but also on orchestration efficiency between these stages. In many deployments, the slowest pipeline component determines total throughput.
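The bottleneck claim can be expressed directly. The sketch below assumes the stages run concurrently as a pipeline, so steady-state throughput is capped by the slowest stage. The function name and the stage rates are hypothetical.

```python
from typing import Dict, Tuple


def pipeline_throughput(stage_pages_per_sec: Dict[str, float]) -> Tuple[str, float]:
    """Return the bottleneck stage and the pipeline's steady-state throughput."""
    bottleneck = min(stage_pages_per_sec, key=stage_pages_per_sec.get)
    return bottleneck, stage_pages_per_sec[bottleneck]


# HYPOTHETICAL per-stage rates in pages per second, for illustration only.
stages = {"layout": 12.0, "table_structure": 3.0, "text_recognition": 5.0}
print(pipeline_throughput(stages))
```

In this made-up example, speeding up layout detection would change nothing; only work on table structure reconstruction raises end-to-end throughput.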


Language Constraints in Low-Resource Settings

Extending Document AI systems beyond English introduces another layer of complexity, particularly for low-resource languages such as Tamil.

Several limitations become immediately apparent:

  • Limited OCR support for the Tamil script
  • Weak handling of agglutinative morphology and complex ligatures
  • Lack of CPU-optimized multilingual models
  • Poor layout and table generalization across Indic documents

As a result, many pipelines that perform well on English documents degrade significantly in multilingual production environments.


Our Focus at CTNLPR (Noolaham Corpus)

At CTNLPR, through the Noolaham corpus initiative, we are building a CPU-first multilingual document intelligence pipeline focused on scalable, production-grade processing.

Our current focus includes:

  • Tamil + English multilingual support
  • Layout-aware extraction for text, tables, and images
  • CPU-optimized inference pipelines
  • Reduced dependency on external APIs
  • Scalable processing for large archival collections

The broader objective is not simply accurate OCR, but sustainable and efficient multilingual document understanding at scale.


Research Direction

The central challenge in production-grade Document AI is not accuracy alone.

The real optimization target is the joint trade-off between accuracy, throughput, and cost (accuracy × throughput × cost).

Balancing these three dimensions is essential for building scalable systems for low-resource languages and large archival corpora.

#DocumentAI #OCR #TamilNLP #LowResourceAI #MachineLearning #AIInfrastructure #CTNLPR

]]>
Designing a Bilingual RAG System: Cross-Lingual Dense Retrieval for Tamil–English https://www.ctnlpr.com/2026/05/15/designing-a-bilingual-rag-system-cross-lingual-dense-retrieval-for-tamil-english/ https://www.ctnlpr.com/2026/05/15/designing-a-bilingual-rag-system-cross-lingual-dense-retrieval-for-tamil-english/#respond Fri, 15 May 2026 10:22:51 +0000 https://ctnlpr.codelantic-staging.club/?p=602 read more]]>

Retrieval-Augmented Generation (RAG) systems are highly dependent on retrieval quality. In multilingual environments, the problem becomes significantly more complex because the retrieval layer must operate across languages while maintaining semantic consistency.

For Tamil–English systems, one of the key challenges is cross-lingual retrieval — enabling a query in Tamil to retrieve semantically relevant Tamil and English passages from a unified index (and vice versa), without relying on translation pipelines or language-specific partitioning.

Cross-Lingual Retrieval in Multilingual RAG

Traditional retrieval systems often assume that the query language and document language are identical. This assumption breaks down in multilingual knowledge systems where relevant information may exist across multiple languages.

To solve this, we rely on multilingual dense embedding models that project Tamil and English text into a shared semantic vector space. Within this space, semantically similar content across languages becomes geometrically aligned, enabling standard vector similarity search to work across languages.

This allows:

  • Tamil queries → Tamil + English retrieval
  • English queries → Tamil + English retrieval

without maintaining separate retrieval systems.


Core Architecture

The retrieval architecture is based on three principles:

  • Unified multilingual embeddings
  • Single shared vector index
  • Cross-lingual semantic similarity search

The system pipeline consists of:

  • Document chunking
  • Multilingual embedding generation
  • Unified vector indexing
  • Approximate nearest neighbor (ANN) retrieval
  • LLM-based response synthesis

Model Evaluation

We evaluated several multilingual embedding approaches:

  • Sentence Transformers (SBERT variants)
  • Indic-specific models (IndicBERT, MuRIL)

During experimentation, several limitations became apparent:

  • Weak Tamil–English semantic alignment
  • Inconsistent similarity distributions across scripts
  • Reduced recall in mixed-language retrieval scenarios

These limitations significantly affected ranking stability and retrieval quality.


Selected Embedding Model

After extensive evaluation, we selected:

→ intfloat/multilingual-e5-large

The model demonstrated significantly stronger cross-lingual retrieval behavior due to several architectural and training characteristics:

  • Built on XLM-RoBERTa-large
  • Large-scale multilingual pretraining
  • Trained with contrastive learning objectives (>1B training pairs)
  • Fine-tuned on retrieval datasets such as MS MARCO, Mr. TyDi, and MIRACL
  • Role-aware embeddings using query: and passage: input prefixes

This resulted in improved semantic alignment between Tamil and English representations, especially for low-resource retrieval tasks.
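The query: and passage: prefixes are plain strings prepended to the input text before encoding. A minimal sketch of that formatting step is shown below. The helper names `as_query` and `as_passage` are hypothetical, and the encoder call itself is omitted.

```python
def as_query(text: str) -> str:
    """Format a user question the way E5-style models expect search queries."""
    return f"query: {text.strip()}"


def as_passage(text: str) -> str:
    """Format a document chunk the way E5-style models expect indexed passages."""
    return f"passage: {text.strip()}"


print(as_query("நூலகம் என்றால் என்ன?"))
print(as_passage("A library is a curated collection of sources of information."))
```

The same prefix rule applies to both languages. Omitting the prefixes, or mixing them up between indexing and querying, is a common cause of degraded retrieval quality with this model family.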


Unified Indexing Strategy

Instead of maintaining separate indices for each language, we implemented a single unified vector index.

Pipeline:

  • Chunk Tamil and English documents
  • Generate embeddings using the same multilingual encoder
  • Store all vectors in a shared ANN index

This design eliminates language-specific partitioning and simplifies retrieval orchestration.

The final architecture supports:

  • Tamil → English retrieval
  • English → Tamil retrieval
  • Mixed-language retrieval

within the same semantic search space.
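The indexing steps above can be sketched as follows. This is a simplified illustration under stated assumptions: chunking is by whitespace-separated words rather than model tokens, `embed` is a caller-supplied function standing in for the multilingual encoder, and a plain Python list stands in for the ANN index. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass
class IndexedChunk:
    text: str
    lang: str  # kept as metadata only; never used to partition the index
    vector: Sequence[float]


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]


def build_unified_index(
    docs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Sequence[float]],
    size: int = 200,
    overlap: int = 40,
) -> List[IndexedChunk]:
    """Embed every chunk with the same encoder and store it in one shared list."""
    index: List[IndexedChunk] = []
    for text, lang in docs:
        for chunk in chunk_words(text, size, overlap):
            index.append(IndexedChunk(chunk, lang, embed(chunk)))
    return index
```

Tamil and English chunks end up side by side in the same structure, which is what allows a single similarity search to return results in either language.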


Retrieval Flow

The retrieval pipeline operates as follows:

  1. Encode query (Tamil or English)
  2. Perform ANN similarity search
  3. Retrieve top-k semantically relevant chunks
  4. Pass retrieved context to the LLM for generation

Similarity search is based on dense vector retrieval using cosine similarity.
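Steps 2 and 3 can be illustrated with an exact, brute-force cosine search. Note the substitution: the system described here uses approximate nearest neighbor search, while this sketch scans every vector, which is only practical for small collections. The vectors below are made-up toy values, not real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored chunk against the query and keep the k best."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors, for illustration only.
toy_index = [
    ("en: library", [1.0, 0.0]),
    ("ta: நூலகம்", [0.9, 0.1]),
    ("en: weather", [0.0, 1.0]),
]
print(top_k([1.0, 0.05], toy_index, k=2))
```

In the toy index the Tamil and English entries for the same concept sit close together, so a single query surfaces both, which is the behavior the shared embedding space is meant to provide.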


Benchmark Signals (MRR / nDCG)

Across multilingual benchmarks and internal evaluations, we observed measurable improvements in retrieval quality relative to the other embedding models we evaluated:

  • Higher MRR@10 → improved early precision
  • Higher nDCG@10 → better ranking quality
  • Improved Recall@10 → stronger multilingual coverage
  • More stable similarity distributions across Tamil and English scripts

These improvements were primarily driven by:

  • Large-scale contrastive training
  • Retrieval-specific fine-tuning
  • Better multilingual semantic alignment
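For reference, the three metrics named above can be computed as in the sketch below. It assumes binary relevance labels and unique document IDs in each ranked list; the function names are illustrative and this is not the evaluation harness used in the experiments.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Mean reciprocal rank over (ranked_list, relevant_set) pairs."""
    scores = [reciprocal_rank(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

MRR rewards placing the first relevant chunk early, nDCG rewards the ordering of all relevant chunks, and Recall measures whether they were retrieved at all, which is why the three are reported together.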

Key Insight

A critical realization from this work is that multilingual RAG is not fundamentally a database problem.

It is primarily an embedding alignment problem solved during representation learning.

When both Tamil and English occupy the same semantic vector space:

  • Cross-lingual retrieval becomes reliable
  • Translation overhead becomes unnecessary
  • Retrieval architectures become significantly simpler

Outcome

The final system achieved:

  • Stronger cross-lingual ranking quality
  • Improved MRR/nDCG performance
  • Reduced architectural complexity
  • Unified multilingual retrieval
  • Better knowledge coverage across languages

Most importantly, the system demonstrates that multilingual retrieval for low-resource languages can be achieved effectively without maintaining separate retrieval pipelines or translation layers.


System Implementation at CTNLPR

At CTNLPR, as part of the ongoing Noolaham GPT initiative, we designed and implemented a bilingual Tamil–English RAG retrieval architecture focused on low-resource multilingual retrieval.

Our work included:

  • Evaluating multilingual embedding models for Tamil–English semantic alignment
  • Benchmarking retrieval quality using MRR, nDCG, and Recall metrics
  • Designing a unified vector indexing strategy without language partitioning
  • Implementing ANN-based cross-lingual dense retrieval pipelines
  • Optimizing retrieval consistency across Tamil, English, and mixed-language queries

The final system enables:

  • Tamil queries retrieving both Tamil and English knowledge
  • English queries retrieving Tamil content
  • Shared semantic retrieval without explicit translation layers

This architecture now serves as a foundational retrieval layer for downstream Tamil NLP systems, including multilingual RAG, semantic search, and knowledge-centric applications being developed at CTNLPR.


Conclusion

Building multilingual RAG systems for Tamil requires more than multilingual datasets — it requires semantically aligned embeddings capable of supporting reliable cross-lingual retrieval.

By combining:

  • Multilingual dense encoders
  • Unified vector indexing
  • Cross-lingual semantic retrieval

we designed a scalable bilingual RAG architecture capable of retrieving Tamil and English knowledge from a shared semantic space.

#RAG #MultilingualEmbeddings #CrossLingualRetrieval #TamilNLP #DenseRetrieval #VectorSearch #AIInfrastructure #CTNLPR

]]>
Building Noolaham Ontology https://www.ctnlpr.com/2026/01/27/building-noolaham-ontology/ Tue, 27 Jan 2026 22:25:55 +0000 https://ctnlpr.codelantic-staging.club/?p=478 read more]]>

Summary: This project aims to create a high-level semantic framework that defines the canonical concepts, classes, properties, and relationships used across the Noolaham ecosystem. It provides a shared conceptual model against which extracted knowledge from the corpus is mapped, ensuring semantic consistency and structural coherence across digitization, NLP enrichment, knowledge graph construction, and application layers. By aligning entities, events, and concepts to a common ontology, the framework enables interoperability between components, supports reasoning and inference, and facilitates integration with external standards and linked data initiatives. This ontology serves as the semantic backbone of the ecosystem, enabling scalable knowledge expansion while preserving conceptual clarity and scholarly rigor.

Partner Institution: Noolaham Foundation
Supervisors: Dr. Saatviga Sudhahar
Researchers
]]>
Building Noolaham knowledge graph https://www.ctnlpr.com/2026/01/27/building-noolaham-knowledge-graph/ Tue, 27 Jan 2026 22:24:23 +0000 https://ctnlpr.codelantic-staging.club/?p=475 read more]]>

Summary: This project involves the creation of a structured, ontology-aligned graph that represents entities and their relationships derived from NLP-enriched corpus data. The knowledge graph models people, places, works, events, and concepts, enabling rich semantic linking across diverse materials in the Noolaham collection. By integrating temporal and contextual attributes, the graph supports advanced reasoning, inference, and complex queries that span documents, time periods, and thematic domains. This foundational layer enables interoperability across the Noolaham ecosystem, powering intelligent applications such as semantic search, contextual discovery, analytics, and AI-driven assistants while maintaining traceability to original source materials.

Partner Institution: Noolaham Foundation
Supervisors: Dr. Saatviga Sudhahar
Researchers
]]>
Infrastructure, storage and content conversion https://www.ctnlpr.com/2026/01/27/infrastructure-storage-and-content-conversion/ Tue, 27 Jan 2026 22:18:32 +0000 https://ctnlpr.codelantic-staging.club/?p=470 read more]]>

Summary: This project focuses on establishing a scalable and resilient technical foundation to support large-scale digitization and processing of Noolaham’s collections. The project provisions secure compute infrastructure and object storage to manage raw inputs, intermediate artifacts, and processed outputs across the content lifecycle. On top of this infrastructure, a robust digital content conversion pipeline is deployed to systematically process Noolaham’s Tamil and English materials. The pipeline integrates OCR, image enhancement, layout analysis, segmentation, and text normalization workflows tailored to bilingual content, ensuring high-fidelity conversion from scanned sources to machine-readable text. The resulting outputs are versioned, traceable, and aligned with downstream NLP enrichment and knowledge engineering processes, enabling consistent reuse across search, analytics, and AI-driven applications.

Partner Institution: Noolaham Foundation
Supervisors: Dr. Saatviga Sudhahar
Researchers: Sathiyalokeswaran Lingeswaran
]]>
Build and evaluate Noolaham GPT https://www.ctnlpr.com/2026/01/27/build-and-evaluate-noolaham-gpt/ Tue, 27 Jan 2026 22:14:33 +0000 https://ctnlpr.codelantic-staging.club/?p=466 read more]]>

Summary: This project focuses on the design, implementation, and assessment of an AI-powered conversational assistant that enables interactive question answering, exploration, and knowledge synthesis across the Noolaham ecosystem. Noolaham GPT leverages the authoritative Noolaham corpus, NLP-enriched structures, domain ontologies, and the underlying knowledge graph to deliver accurate, contextual, and explainable responses. The system is designed to support complex information needs, including thematic exploration, cross-document synthesis, and temporally aware reasoning over historical and cultural materials. Responses are grounded in source documents, providing transparent citations and traceability to original texts, thereby ensuring reliability and scholarly trust. The project also includes systematic evaluation of accuracy, relevance, temporal consistency, and user experience to ensure the assistant meets research, educational, and public access requirements.

Partner Institution: Noolaham Foundation
Supervisors: Dr. Saatviga Sudhahar
Researchers: Sathiyalokeswaran Lingeswaran

]]>