Building a Tamil Keyword Extraction Pipeline: From Statistical Methods to Embedding-Based Semantic Extraction

Keyword extraction is a foundational component in modern NLP systems. It directly impacts search quality, indexing efficiency, document summarization, topic discovery, and Retrieval-Augmented Generation (RAG) pipelines.

However, for low-resource languages such as Tamil, keyword extraction introduces several challenges that extend beyond simply applying existing algorithms. In practice, achieving reliable extraction required careful system-level engineering involving preprocessing, tokenization, normalization, and multilingual semantic modeling.

As part of the ongoing Noolaham GPT initiative at CTNLPR, we designed and implemented a Tamil-aware keyword extraction pipeline that combines statistical methods, graph-based ranking, and embedding-based semantic extraction.

Why Keyword Extraction Is Challenging for Tamil

Many existing keyword extraction libraries are primarily designed for English and other high-resource languages. Tamil introduces additional complexities:

  • Agglutinative morphology
  • Spelling and orthographic variations
  • Limited language-specific NLP tooling
  • Inconsistent tokenization behavior
  • Lack of native Tamil stopword integration

As a result, directly applying standard extraction libraries often produces noisy or semantically weak keywords.


System Architecture

Our pipeline integrates multiple extraction strategies:

  • scikit-learn → TF-IDF-based statistical extraction
  • Gensim → TextRank graph-based ranking
  • KeyBERT → embedding-based semantic extraction

To support Tamil processing, we integrated Indic NLP preprocessing techniques, including:

  • Tokenization
  • Text normalization
  • Stopword filtering
  • Diacritic handling

This preprocessing layer was critical for stabilizing downstream extraction quality.


Method Evaluation

We evaluated multiple extraction approaches to understand their strengths and limitations under Tamil NLP conditions.

TF-IDF

TF-IDF provided a strong statistical baseline for identifying globally important terms across corpora.

Advantages:

  • Lightweight
  • Fast computation
  • Effective for corpus-level frequency analysis

Limitations:

  • No semantic understanding
  • Frequency-driven ranking only

TextRank

TextRank adapts the PageRank algorithm to textual data by constructing graph relationships between words.

Advantages:

  • Unsupervised
  • No training required

Limitations:

  • Highly sensitive to tokenization quality
  • Graph instability under noisy Tamil segmentation

YAKE

YAKE performed well as a lightweight per-document keyword extraction baseline.

Advantages:

  • Fast inference
  • No external models required
  • Reliable for smaller documents

However, semantic understanding remained limited.


Embedding-Based Extraction with KeyBERT

The strongest results were achieved using KeyBERT combined with Tamil-capable embedding models.

A critical realization during experimentation was: KeyBERT itself is not language-aware. Its effectiveness depends entirely on the embedding model.

Using default English embeddings produced poor semantic extraction quality for Tamil documents. Replacing them with Tamil and multilingual Indic embedding models dramatically improved performance.


Embedding Models Evaluated

We integrated several multilingual and Tamil-specific embedding models:

  • l3cube-pune/tamil-sentence-bert-nli
  • ai4bharat/indic-bert
  • paraphrase-multilingual-mpnet-base-v2

The extraction pipeline operates as follows:

  1. Document embedding generation
  2. Candidate n-gram extraction
  3. Semantic similarity computation
  4. Relevance-based ranking

This allowed semantically meaningful phrases to be identified even when they were not frequency-dominant.


System-Level Challenges

Several engineering challenges emerged during implementation:

Tamil Stopword Handling

Tamil stopwords are not automatically supported in most libraries. We implemented custom stopword filtering layers to improve extraction precision.

Text Normalization

Tamil text frequently contains:

  • Orthographic variations
  • Diacritic inconsistencies
  • OCR-induced noise

Normalization significantly improved embedding consistency and keyword quality.

Tokenization Stability

Graph-based and embedding-based extraction methods were highly sensitive to tokenization behavior. Consistent segmentation became essential for reliable extraction.


Observations and Results

Across experiments, several patterns became clear:

  • TF-IDF → effective for corpus-level topic signals
  • YAKE → strong lightweight baseline
  • TextRank → unstable under noisy tokenization
  • KeyBERT + Tamil embeddings → strongest semantic keyword quality

The final embedding-based pipeline consistently produced more contextually meaningful keywords compared to purely statistical methods.


Key Insight

One of the most important findings from this work was:

In Tamil NLP, keyword extraction is not constrained primarily by the extraction algorithm itself.

The dominant factors are:

  • Embedding quality
  • Tokenization consistency
  • Text normalization quality

Without solving these foundational issues, even advanced extraction algorithms produce weak results.


System Implementation at CTNLPR

At CTNLPR, this keyword extraction pipeline was integrated into the broader Noolaham GPT ecosystem as part of our multilingual NLP infrastructure.

Our work included:

  • Evaluating extraction algorithms for Tamil corpora
  • Integrating Tamil-aware preprocessing pipelines
  • Benchmarking embedding models for semantic extraction
  • Building embedding-based keyword ranking systems
  • Optimizing extraction quality for downstream RAG and search pipelines

The final system now supports semantically meaningful keyword extraction across multilingual document collections.


Outcome

The final production pipeline:

  • Produces semantically relevant Tamil keywords
  • Handles multilingual document collections
  • Integrates into downstream search and RAG systems
  • Supports scalable NLP workflows for low-resource languages

Most importantly, the work demonstrates that high-quality keyword extraction for Tamil is achievable when embedding quality and language-specific preprocessing are treated as first-class system components.


Conclusion

Keyword extraction for low-resource languages requires significantly more than applying standard NLP libraries out of the box.

By combining:

  • Statistical extraction methods
  • Graph-based ranking
  • Embedding-driven semantic extraction
  • Tamil-aware preprocessing pipelines

we developed a scalable keyword extraction system optimized for Tamil NLP applications and multilingual retrieval systems.


#NLP #TamilNLP #KeywordExtraction #MultilingualAI #Embeddings #KeyBERT #TextMining #AI #CTNLPR

Leave a Reply