Building a Tamil Coreference Resolution System for Information Extraction

Modern Information Extraction systems rely on more than Named Entity Recognition (NER). While NER can identify entities such as people, locations, and organizations, it does not track how later mentions in a document refer back to those entities.

This is where Coreference Resolution (CR) becomes essential.

Coreference Resolution is the task of determining when multiple mentions within a document refer to the same real-world entity. It serves as a critical bridge between entity recognition and structured knowledge extraction.

For applications such as:

Knowledge Graph Construction
Relation Extraction
Semantic Search
Document Intelligence
Retrieval-Augmented Generation (RAG)
Conversational AI

accurate coreference resolution is often the difference between fragmented information and coherent knowledge.

The Information Extraction Problem

During our work at CTNLPR, we observed a common issue in Tamil document processing.

Documents rarely repeat the full entity name in every sentence. Instead, they rely on:

Pronouns
Possessive references
Descriptive noun phrases
Location references

Humans resolve these references naturally using context. Machines do not.

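Consider a simple information extraction pipeline without coreference resolution, applied to the following passage (in English: "Sam A. Sabapathy was born on 8 September 1898. He is a former mayor of Jaffna district. He played a very important role in the creation of the Jaffna Library."):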

சாம்.ஏ.சபாபதி 1898 ஆம் ஆண்டு செப்டெம்பர் 8 ஆம் திகதி பிறந்தார்.
இவர் யாழ்ப்பாண மாவட்டத்தின் முன்னாள் மேயர் ஆவார்.
இவர் யாழ்ப்பாண நூலகத்தின் உருவாக்கத்தில் மிக முக்கிய பங்காற்றினார்.

A relation extraction system may produce:

(இவர், ஆவார், முன்னாள் மேயர்)

(இவர், பங்காற்றினார், யாழ்ப்பாண நூலகம்)

Although the relations are syntactically well formed, the subject of each is the unresolved pronoun இவர் ("he"), so neither relation can be attached to a named entity.

After coreference resolution:

(சாம்.ஏ.சபாபதி, ஆவார், முன்னாள் மேயர்)

(சாம்.ஏ.சபாபதி, பங்காற்றினார், யாழ்ப்பாண நூலகம்)

Each relation now names its subject explicitly and can be used directly by downstream systems.
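The substitution shown in the triples above can be sketched in a few lines of Python. This is a minimal illustration rather than the CTNLPR implementation: the function name `resolve_subjects` and the `antecedents` mapping are hypothetical, and a real resolver would choose the antecedent per mention position rather than per pronoun string.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def resolve_subjects(triples: List[Triple], antecedents: Dict[str, str]) -> List[Triple]:
    """Replace each subject that has a known antecedent; leave the rest unchanged."""
    return [(antecedents.get(subj, subj), pred, obj) for subj, pred, obj in triples]


# Triples as produced by relation extraction without coreference resolution.
raw_triples = [
    ("இவர்", "ஆவார்", "முன்னாள் மேயர்"),
    ("இவர்", "பங்காற்றினார்", "யாழ்ப்பாண நூலகம்"),
]

# Assumed output of the coreference layer for this passage.
antecedents = {"இவர்": "சாம்.ஏ.சபாபதி"}

resolved_triples = resolve_subjects(raw_triples, antecedents)
for triple in resolved_triples:
    print(triple)
```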

This observation motivated the development of a dedicated Tamil coreference layer.

Why We Did Not Start with Neural Coreference Models

Most modern coreference systems are built using:

Transformer-based architectures
Mention-ranking models
End-to-end neural coreference systems
Span-ranking approaches

While these approaches achieve strong performance for English and other high-resource languages, they require:

Large annotated datasets
Extensive model training
Significant computational resources
Language-specific supervision

Tamil currently lacks large-scale publicly available coreference corpora.

For a low-resource language, reproducing state-of-the-art neural coreference systems becomes difficult without substantial annotation effort.

Instead of waiting for large benchmark datasets, we explored a different direction:

Build a deterministic, explainable, and task-oriented coreference framework optimized for Information Extraction.

The objective was not to compete with neural coreference benchmarks.

The objective was to improve relation extraction quality in real-world Tamil document processing.

System Architecture

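The system is organized as a sequence of layers: mention detection, entity normalization, entity memory, coreference resolution, and coreference chain construction, with a visualization layer on top. Each layer contributes a specific capability toward discourse-level entity understanding.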

Mention Detection

Before references can be resolved, the system must identify candidate mentions.

The mention detection layer combines several complementary strategies.

Named Entity Recognition

A Tamil NER model identifies:

PERSON
LOCATION
ORGANIZATION

These entities become candidates for memory updates and future reference resolution.

Pronoun Detection

The system maintains Tamil pronoun lexicons and detects references such as:

இவர்
அவர்
அவருக்கு
அவருடைய
இவர்கள்

These mentions are forwarded to the resolution layer.

Location References

Tamil documents frequently refer to locations indirectly using expressions such as:

அங்கு
இங்கு
அங்கே
இங்கே

These references must be linked back to previously observed location entities.

Noun Phrase References

Many documents use descriptive phrases instead of explicit names.

Examples include:

இந்த தலைவர்
இந்த மருத்துவர்
இந்த சமூகசேவகர்

These expressions often function as entity references and therefore require resolution.
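The lexicon-driven parts of mention detection (pronouns, location references, and demonstrative noun phrases) could look roughly like the sketch below. It assumes whitespace tokenization and a small hand-written set of role nouns; the names `Mention`, `detect_mentions`, and `ROLE_NOUNS` are hypothetical, and the NER step is left out because it depends on the model in use.

```python
from dataclasses import dataclass
from typing import List

# Lexicons taken from the examples in this article; a real system would use larger lists.
PERSON_PRONOUNS = {"இவர்", "அவர்", "அவருக்கு", "அவருடைய", "இவர்கள்"}
LOCATION_REFS = {"அங்கு", "இங்கு", "அங்கே", "இங்கே"}
DEMONSTRATIVE = "இந்த"
ROLE_NOUNS = {"தலைவர்", "மருத்துவர்", "சமூகசேவகர்"}  # assumed role-noun list


@dataclass
class Mention:
    text: str
    kind: str   # "pronoun", "location_ref", or "noun_phrase"
    index: int  # position of the first token in the sentence


def detect_mentions(sentence: str) -> List[Mention]:
    """Find lexicon-based mentions using naive whitespace tokenization."""
    tokens = [tok.strip(".,;:!?") for tok in sentence.split()]
    mentions: List[Mention] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Demonstrative followed by a role noun forms a two-token noun phrase.
        if tok == DEMONSTRATIVE and i + 1 < len(tokens) and tokens[i + 1] in ROLE_NOUNS:
            mentions.append(Mention(f"{tok} {tokens[i + 1]}", "noun_phrase", i))
            i += 2
            continue
        if tok in PERSON_PRONOUNS:
            mentions.append(Mention(tok, "pronoun", i))
        elif tok in LOCATION_REFS:
            mentions.append(Mention(tok, "location_ref", i))
        i += 1
    return mentions


for mention in detect_mentions("அங்கு அவருக்கு பெரும் மதிப்பு இருந்தது."):
    print(mention)
```

Keeping the mention kind on each record lets the resolution layer pick the right memory slot without re-checking the lexicons.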

Entity Normalization

Tamil’s rich morphology introduces another challenge.

The same entity can appear in multiple grammatical forms:

யாழ்ப்பாணம்
யாழ்ப்பாணத்தில்
யாழ்ப்பாணத்தின்
யாழ்ப்பாணத்திற்கு

From an information extraction perspective, these should all map to a single canonical entity.

To reduce fragmentation, the system applies lemmatization-based normalization using Stanza.

யாழ்ப்பாணத்தில்
      ↓
யாழ்ப்பாணம்

This normalization step keeps the memory consistent: every inflected form updates and retrieves the same canonical entity instead of creating a separate entry.
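The mapping above can be illustrated with a small normalizer that takes any lemmatizer as a callable. To keep the sketch runnable without model downloads, it uses a toy suffix table as a stand-in; this is not Stanza and not a general Tamil lemmatizer, and it only covers the three case endings shown in this section. In the actual system the callable would presumably wrap the Stanza lemmatizer. The names `toy_lemmatize` and `EntityNormalizer` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Toy rules: (inflected ending, canonical ending). Covers only the examples above.
TOY_SUFFIX_RULES: List[Tuple[str, str]] = [
    ("த்திற்கு", "ம்"),  # dative
    ("த்தில்", "ம்"),    # locative
    ("த்தின்", "ம்"),    # genitive
]


def toy_lemmatize(token: str) -> str:
    """Stand-in lemmatizer: rewrite a known case ending, else return the token as-is."""
    for suffix, replacement in TOY_SUFFIX_RULES:
        if token.endswith(suffix):
            return token[: -len(suffix)] + replacement
    return token


class EntityNormalizer:
    """Maps surface forms to a canonical form, caching results per surface string."""

    def __init__(self, lemmatize: Callable[[str], str] = toy_lemmatize) -> None:
        self._lemmatize = lemmatize
        self._cache: Dict[str, str] = {}

    def canonical(self, surface: str) -> str:
        if surface not in self._cache:
            self._cache[surface] = self._lemmatize(surface)
        return self._cache[surface]


normalizer = EntityNormalizer()
forms = ["யாழ்ப்பாணம்", "யாழ்ப்பாணத்தில்", "யாழ்ப்பாணத்தின்", "யாழ்ப்பாணத்திற்கு"]
canonical_forms = {normalizer.canonical(form) for form in forms}
print(canonical_forms)
```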

The Entity Memory Layer

One of the most important design decisions was introducing a lightweight discourse memory.

Instead of neural antecedent scoring, the system explicitly tracks the most recently observed entity of each type.

Current memory structure:

last_person
last_location
last_org

Whenever a new entity is detected, the slot matching its type is overwritten with that entity.

This memory serves as the contextual state of the document and provides the foundation for reference resolution.
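A memory with these three slots can be written as a small dataclass. The sketch below is one possible shape, assuming the NER labels listed earlier (PERSON, LOCATION, ORGANIZATION); the class name `EntityMemory` and the `update` method are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Which memory slot each NER label writes to.
SLOT_FOR_LABEL = {
    "PERSON": "last_person",
    "LOCATION": "last_location",
    "ORGANIZATION": "last_org",
}


@dataclass
class EntityMemory:
    last_person: Optional[str] = None
    last_location: Optional[str] = None
    last_org: Optional[str] = None

    def update(self, label: str, entity: str) -> None:
        """Overwrite the slot for this label; labels without a slot are ignored."""
        slot = SLOT_FOR_LABEL.get(label)
        if slot is not None:
            setattr(self, slot, entity)


memory = EntityMemory()
memory.update("PERSON", "சாம்.ஏ.சபாபதி")
memory.update("LOCATION", "யாழ்ப்பாணம்")
print(memory)
```

Because each slot holds a single value, a newly detected person replaces the previous one.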

Coreference Resolution

Once memory has been populated, reference resolution becomes straightforward.

When the resolver encounters:

இவர்

it retrieves the current value of:

memory.last_person

and resolves the reference to the corresponding entity.

The same mechanism applies to:

Possessive references
Location references
Descriptive noun phrases

Because every resolution is a single memory lookup, each decision can be traced back to the rule and the memory slot that produced it.
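The lookup described in this section can be sketched as follows. The reference sets are taken from the examples in this article, the memory is reduced to a plain dataclass, and the function name `resolve` is hypothetical; the real resolver may apply additional rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityMemory:
    last_person: Optional[str] = None
    last_location: Optional[str] = None
    last_org: Optional[str] = None


# Pronouns, possessives, and descriptive phrases that point to a person.
PERSON_REFS = {"இவர்", "அவர்", "அவருக்கு", "அவருடைய",
               "இந்த தலைவர்", "இந்த மருத்துவர்", "இந்த சமூகசேவகர்"}
# Expressions that point to a location.
LOCATION_REFS = {"அங்கு", "இங்கு", "அங்கே", "இங்கே"}


def resolve(mention: str, memory: EntityMemory) -> Optional[str]:
    """Return the antecedent for a mention, or None if no rule or entity applies."""
    if mention in PERSON_REFS:
        return memory.last_person
    if mention in LOCATION_REFS:
        return memory.last_location
    return None


memory = EntityMemory(last_person="சாம்.ஏ.சபாபதி", last_location="யாழ்ப்பாணம்")
print(resolve("இவர்", memory))
print(resolve("அங்கு", memory))
```

With one slot per type, a reference always resolves to the most recent entity of that type, so a document that alternates between two people would be resolved incorrectly.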

Coreference Chain Construction

Rather than merely replacing mentions, the system groups related references into entity clusters.

Example:

சாம்.ஏ.சபாபதி ஒரு பிரபல சமூக சேவகர். அவர் யாழ்ப்பாணத்தில் பல கல்வித் திட்டங்களை முன்னெடுத்தார். இந்த சமூகசேவகர் பல விருதுகளை பெற்றுள்ளார். அவர் யாழ்ப்பாணத்தில் பிறந்தார். அங்கு அவருக்கு பெரும் மதிப்பு இருந்தது. இவர் யாழ்ப்பாண நூலகத்தின் உருவாக்கத்தில் முக்கிய பங்காற்றினார்.

Entity: சாம்.ஏ.சபாபதி

├── சாம்.ஏ.சபாபதி
├── அவர் 
├── இந்த சமூகசேவகர்
├── அவர் 
└── இவர் 

Entity: யாழ்ப்பாணம்

├── யாழ்ப்பாணம்
└── அங்கு 

Entity: யாழ்ப்பாண நூலகம்

These chains provide a document-level view of entity references and become valuable for:

Debugging
Evaluation
Knowledge Graph Construction
Relation Extraction
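Grouping resolved mentions into chains amounts to collecting surface forms under their canonical entity in document order. The sketch below assumes the resolver emits (surface form, entity) pairs; the function names `build_chains` and `render_chain` are hypothetical, and the input pairs mirror the person and location chains shown above.

```python
from typing import Dict, List, Tuple


def build_chains(resolved: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group surface mentions by entity, preserving document order."""
    chains: Dict[str, List[str]] = {}
    for surface, entity in resolved:
        chains.setdefault(entity, []).append(surface)
    return chains


def render_chain(entity: str, mentions: List[str]) -> str:
    """Format one chain as a small text tree."""
    lines = [f"Entity: {entity}"]
    for i, mention in enumerate(mentions):
        branch = "└──" if i == len(mentions) - 1 else "├──"
        lines.append(f"{branch} {mention}")
    return "\n".join(lines)


# Assumed resolver output: (surface form, canonical entity) in document order.
resolved_mentions = [
    ("சாம்.ஏ.சபாபதி", "சாம்.ஏ.சபாபதி"),
    ("அவர்", "சாம்.ஏ.சபாபதி"),
    ("யாழ்ப்பாணம்", "யாழ்ப்பாணம்"),
    ("இந்த சமூகசேவகர்", "சாம்.ஏ.சபாபதி"),
    ("அவர்", "சாம்.ஏ.சபாபதி"),
    ("அங்கு", "யாழ்ப்பாணம்"),
    ("இவர்", "சாம்.ஏ.சபாபதி"),
]

chains = build_chains(resolved_mentions)
for entity, mentions in chains.items():
    print(render_chain(entity, mentions))
```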

Visualization and Analysis

To support experimentation and validation, we developed a Streamlit-based visualization layer.

Users can submit Tamil documents and inspect generated coreference chains in real time.

This interface provides transparency into resolution decisions and helps identify weaknesses in rule design.

Key Insight

One of the most important lessons from this work is: Coreference Resolution is not merely a pronoun-resolution task. It is an entity consistency layer that connects NER, relation extraction, and knowledge graph construction.

Without coreference:

(இவர், பங்காற்றினார், யாழ்ப்பாண நூலகம்)

With coreference:

(சாம்.ஏ.சபாபதி, பங்காற்றினார், யாழ்ப்பாண நூலகம்)

The second representation is immediately usable within structured knowledge systems.

System Implementation at CTNLPR

At CTNLPR, this architecture is being developed as part of a broader Tamil Information Extraction ecosystem.

Current capabilities include:

Named Entity Recognition
Entity Normalization
Pronoun Resolution
Possessive Resolution
Location Resolution
Rule-Based Noun Phrase Resolution
Coreference Chain Construction
Streamlit-Based Visualization

The system acts as a foundational layer between entity extraction and knowledge graph generation.

Future Work

The next phase of development will focus on:

Multi-entity discourse memory
Entity salience tracking
Advanced noun phrase resolution
Relation-aware coreference resolution
Knowledge graph integration
Hybrid neural-rule architectures

These improvements will move the system toward more sophisticated document-level semantic understanding.

Conclusion

Building effective Tamil Information Extraction systems requires more than Named Entity Recognition.

By introducing a dedicated coreference resolution layer, we can maintain entity consistency across a document, improve relation extraction quality, and generate more reliable structured knowledge.

For low-resource languages such as Tamil, carefully designed rule-based systems remain a practical and effective pathway toward document-level semantic understanding while larger neural approaches continue to mature.