Introduction
Knowledge Graph construction from unstructured text remains one of the fundamental challenges in Natural Language Processing (NLP). The process requires identifying entities, understanding semantic relationships between them, and converting textual information into structured triples that can be stored, queried, and reasoned over.
For low-resource languages such as Sri Lankan Tamil, this challenge is even more significant due to the limited availability of annotated datasets, domain-specific resources, and production-ready information extraction systems.
This work presents the design and implementation of a Transformer-based Tamil Relation Extraction pipeline that converts raw Tamil documents into structured knowledge triples.
The primary objective is to establish a modular, extensible architecture capable of supporting future components such as Coreference Resolution, Entity Linking, Ontology Validation, and Knowledge Graph Construction.
Problem Statement
Consider the following Tamil sentence:
சாம்.ஏ.சபாபதி யாழ்ப்பாணத்தில் வாழ்ந்தார்.
Humans can easily infer the relationship:
(சாம்.ஏ.சபாபதி, LIVED_IN, யாழ்ப்பாணம்)
However, machines must solve several sub-problems before reaching this representation:
- Detect sentence boundaries.
- Identify named entities.
- Determine entity types.
- Generate valid entity pairs.
- Understand the semantic relationship.
- Convert the result into a structured representation.
This sequence of tasks forms the basis of Information Extraction systems and Knowledge Graph construction pipelines.
System Architecture
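The proposed architecture follows a modular pipeline design:
document → sentence splitter → NER → pair generator → entity marker → relation extraction → triple builder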
A pipeline-based design was chosen over an end-to-end joint extraction architecture for its modularity, interpretability, ease of debugging, and suitability for low-resource environments.
Each stage can be independently evaluated, improved, or replaced without affecting the remaining components.
Phase 1: Sentence Splitting
The first stage converts raw documents into individual Tamil sentences.
Input:
சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.
அவர் ஆசிரியராக பணியாற்றினார்.
Output:
[
"சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.",
"அவர் ஆசிரியராக பணியாற்றினார்."
]
Sentence segmentation shortens the inputs passed to downstream models, which lowers their computational cost and lets them process smaller semantic units.
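A minimal sketch of such a splitter, assuming sentences end in ".", "?", or "!" followed by whitespace. The function name split_sentences is hypothetical and is not taken from the pipeline's actual code. Because a split requires trailing whitespace, the periods inside initials such as சாம்.ஏ.சபாபதி do not end a sentence.

```python
import re

# Split on the whitespace that follows ., ?, or !, so periods inside
# initials (e.g. "சாம்.ஏ.சபாபதி") are not treated as sentence ends.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def split_sentences(document: str) -> list[str]:
    """Return the non-empty sentences of a raw document."""
    return [s.strip() for s in _SENTENCE_END.split(document) if s.strip()]


doc = "சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.\nஅவர் ஆசிரியராக பணியாற்றினார்."
sentences = split_sentences(doc)
```

A rule of this kind still fails on abbreviations that are followed by a space, so it should be read as a baseline rather than a complete segmenter.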
Phase 2: Named Entity Recognition
The Named Entity Recognition stage identifies entities and assigns semantic labels.
Entity Categories:
PERSON
LOCATION
ORGANIZATION
Example:
Input:
டாக்டர் கந்தசாமி யாழ்ப்பாண போதனா வைத்தியசாலையில் பணியாற்றினார்.
Output:
PERSON: கந்தசாமி
ORGANIZATION: யாழ்ப்பாண போதனா வைத்தியசாலை
A fine-tuned Sri Lankan Tamil NER model is currently used for entity identification.
During experimentation, several tokenizer-related boundary issues were observed, particularly involving Tamil Unicode combining characters. To mitigate these issues, an entity post-processing layer was introduced.
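The text does not describe what the post-processing layer does internally. One plausible form, sketched here purely as an assumption, is to widen a predicted character span whenever it cuts a base letter off from its combining marks (Unicode categories Mn, Mc, Me, which cover Tamil vowel signs and the pulli). The names is_combining and repair_span are hypothetical.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """True for Unicode marks (Mn, Mc, Me), e.g. Tamil vowel signs and pulli."""
    return unicodedata.category(ch).startswith("M")


def repair_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Widen [start, end) so it never separates a base letter from its marks."""
    # If the span starts on a combining mark, move back to its base letter.
    while 0 < start < len(text) and is_combining(text[start]):
        start -= 1
    # If combining marks follow the span, pull them in.
    while end < len(text) and is_combining(text[end]):
        end += 1
    return start, end


text = "கந்தசாமி வந்தார்."
# Hypothetical model output that drops the final vowel sign of the name.
cut = len("கந்தசாமி") - 1
start, end = repair_span(text, 0, cut)
entity = text[start:end]
```

Working on code points with unicodedata keeps the sketch dependency-free; a grapheme-cluster library would be a stricter alternative.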
Phase 3: Pair Generation
Named entities alone do not represent knowledge.
Relations exist between pairs of entities.
For example:
PERSON: சாம்பசிவம்
LOCATION: யாழ்ப்பாணம்
becomes:
(சாம்பசிவம், யாழ்ப்பாணம்)
The Pair Generator creates candidate entity pairs based on predefined semantic constraints.
Allowed combinations include:
{
("PERSON", "LOCATION"),
("PERSON", "ORGANIZATION"),
("ORGANIZATION", "LOCATION")
}
This significantly reduces the search space for relation classification.
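The filtering step can be sketched as follows. The allowed type combinations come from the list above; the Entity type and the generate_pairs function are illustrative names, and treating each pair as ordered (head type first) is an assumption.

```python
from itertools import permutations
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    label: str


# Allowed (head type, tail type) combinations.
ALLOWED_PAIRS = {
    ("PERSON", "LOCATION"),
    ("PERSON", "ORGANIZATION"),
    ("ORGANIZATION", "LOCATION"),
}


def generate_pairs(entities: list[Entity]) -> list[tuple[Entity, Entity]]:
    """Return ordered entity pairs whose type combination is allowed."""
    return [
        (head, tail)
        for head, tail in permutations(entities, 2)
        if (head.label, tail.label) in ALLOWED_PAIRS
    ]


entities = [Entity("சாம்பசிவம்", "PERSON"), Entity("யாழ்ப்பாணம்", "LOCATION")]
pairs = generate_pairs(entities)
```

With n entities in a sentence there are n(n-1) ordered pairs before filtering, so the type constraint directly limits how many candidates reach the relation classifier.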
Phase 4: Entity Marking
Before relation classification, entity mentions are explicitly marked.
Example:
Original:
சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.
Transformed into:
[E1] சாம்பசிவம் [/E1] [E2] யாழ்ப்பாணத்தில் [/E2] வாழ்ந்தார்.
Entity marking allows transformer-based models to focus on the target entities when predicting relationships.
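A minimal sketch of the marking step, assuming each entity arrives as a character span (start, end) in the sentence and that the two spans do not overlap. The function name mark_entities is hypothetical.

```python
def mark_entities(sentence: str, e1: tuple[int, int], e2: tuple[int, int]) -> str:
    """Wrap two non-overlapping character spans in [E1] and [E2] markers."""
    spans = [(e1, "E1"), (e2, "E2")]
    # Insert markers from the rightmost span first so earlier offsets stay valid.
    for (start, end), tag in sorted(spans, reverse=True):
        sentence = (
            sentence[:start]
            + f"[{tag}] {sentence[start:end]} [/{tag}]"
            + sentence[end:]
        )
    return sentence


sentence = "சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்."
head, tail = "சாம்பசிவம்", "யாழ்ப்பாணத்தில்"
h = sentence.index(head)
t = sentence.index(tail)
marked = mark_entities(sentence, (h, h + len(head)), (t, t + len(tail)))
```

Processing spans right to left avoids recomputing offsets after each insertion, and it works whether E1 appears before or after E2 in the sentence.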
Phase 5: Transformer-Based Relation Extraction
Relation Extraction is formulated as a classification problem.
Given:
[E1] சாம்பசிவம் [/E1] [E2] யாழ்ப்பாணத்தில் [/E2] வாழ்ந்தார்.
the model predicts:
LIVED_IN
Current relation labels include:
LIVED_IN
WORKED_AT
CONTRIBUTED_TO
PART_OF
BORN_IN
The current implementation uses a multilingual Natural Language Inference (NLI) transformer model in a zero-shot setting.
This approach enables rapid prototyping without requiring a dedicated Tamil relation extraction dataset.
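In zero-shot NLI classification, each candidate relation is rewritten as a natural-language hypothesis, and the relation whose hypothesis the model scores as most entailed by the sentence is selected. The sketch below shows only that selection logic. The hypothesis templates, the threshold, and the entail_score callable are all assumptions, since the text names neither the model nor the prompts used. A stub scorer stands in for the NLI model so the example runs on its own.

```python
from typing import Callable, Optional

# Hypothetical hypothesis templates, one per relation label.
HYPOTHESES = {
    "LIVED_IN": "{e1} lived in {e2}.",
    "WORKED_AT": "{e1} worked at {e2}.",
    "CONTRIBUTED_TO": "{e1} contributed to {e2}.",
    "PART_OF": "{e1} is part of {e2}.",
    "BORN_IN": "{e1} was born in {e2}.",
}


def classify_relation(
    premise: str,
    e1: str,
    e2: str,
    entail_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the relation whose hypothesis is most entailed, or None."""
    scores = {
        label: entail_score(premise, template.format(e1=e1, e2=e2))
        for label, template in HYPOTHESES.items()
    }
    best = max(scores, key=scores.get)
    # Below the threshold, treat the pair as having no known relation.
    return best if scores[best] >= threshold else None


def stub_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model's entailment probability."""
    return 0.9 if "lived in" in hypothesis else 0.1


relation = classify_relation(
    "சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.", "சாம்பசிவம்", "யாழ்ப்பாணம்", stub_score
)
```

In a real run, entail_score would wrap the multilingual NLI model. Whether English or Tamil hypothesis templates work better is an open question that would need to be tested.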
Phase 6: Triple Construction
After relation classification, extracted information is converted into a structured knowledge representation.
Input:
Entity 1: சாம்பசிவம்
Relation: LIVED_IN
Entity 2: யாழ்ப்பாணம்
Output:
(சாம்பசிவம், LIVED_IN, யாழ்ப்பாணம்)
These triples form the foundational building blocks of a Knowledge Graph.
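One simple way to represent such a triple is a named tuple, sketched below. The Triple type and the build_triple helper are illustrative names, not the pipeline's actual data model.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def build_triple(entity1: str, relation: str, entity2: str) -> Triple:
    """Package one classified entity pair as a (head, relation, tail) triple."""
    return Triple(entity1.strip(), relation, entity2.strip())


triple = build_triple("சாம்பசிவம்", "LIVED_IN", "யாழ்ப்பாணம்")
```

Because named tuples are hashable, triples built this way can be collected in a set to drop duplicates extracted from different sentences.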
Phase 7: End-to-End Pipeline Integration
The final stage orchestrates all components into a unified extraction workflow.
Pseudo-flow:
document
→ sentence splitter
→ NER
→ pair generator
→ entity marker
→ relation extraction
→ triple builder
This design enables document-level processing while preserving modularity between components.
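The flow above can be sketched as a function that receives each stage as a callable, so any stage can be swapped without touching the others. All names and signatures here are assumptions made for illustration, and the lambdas are toy stand-ins that only show the wiring.

```python
from typing import Callable, Iterable, Optional

Entity = tuple[str, str]        # (surface text, entity type)
Triple = tuple[str, str, str]   # (head, relation, tail)


def run_pipeline(
    document: str,
    split: Callable[[str], Iterable[str]],
    ner: Callable[[str], list[Entity]],
    pair: Callable[[list[Entity]], Iterable[tuple[Entity, Entity]]],
    mark: Callable[[str, Entity, Entity], str],
    classify: Callable[[str], Optional[str]],
) -> list[Triple]:
    """Run every stage in order and collect the extracted triples."""
    triples: list[Triple] = []
    for sentence in split(document):
        for head, tail in pair(ner(sentence)):
            relation = classify(mark(sentence, head, tail))
            if relation is not None:  # skip pairs with no predicted relation
                triples.append((head[0], relation, tail[0]))
    return triples


# Toy stand-ins for the real stages.
triples = run_pipeline(
    "சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.",
    split=lambda doc: [doc],
    ner=lambda s: [("சாம்பசிவம்", "PERSON"), ("யாழ்ப்பாணத்தில்", "LOCATION")],
    pair=lambda ents: [(ents[0], ents[1])],
    mark=lambda s, h, t: f"[E1] {h[0]} [/E1] [E2] {t[0]} [/E2]",
    classify=lambda marked: "LIVED_IN",
)
```

Passing stages as parameters mirrors the modularity argument: replacing the NER model or the relation classifier means passing a different callable.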
Experimental Results
The pipeline was tested on several kinds of Tamil text, including:
- Personal biographies
- Academic affiliations
- Institutional relationships
- Historical narratives
- Sri Lankan Tamil named entities
Example:
Input:
சாம்.ஏ.சபாபதி யாழ்ப்பாணத்தில் வாழ்ந்தார்.
Output:
(சாம்.ஏ.சபாபதி, LIVED_IN, யாழ்ப்பாணத்தில்)
Example:
Input:
டாக்டர் கந்தசாமி யாழ்ப்பாண போதனா வைத்தியசாலையில் பணியாற்றினார்.
Output:
(கந்தசாமி, WORKED_AT, யாழ்ப்பாண போதனா வைத்தியசாலை)
These examples show that the pipeline can extract meaningful knowledge triples from Tamil text. Quantitative evaluation is left to the next stage of development.
Future Work
The current system serves as the foundation for a larger Tamil Knowledge Graph ecosystem.
Planned extensions include:
Coreference Resolution
↓
Entity Canonicalization
↓
Ontology Validation
↓
Relation Filtering
↓
Knowledge Graph Construction
Additional research directions include:
- MuRIL-based NER models
- XLM-R-based NER models
- Fine-tuned Tamil Relation Extraction models
- Document-level Relation Extraction
- Ontology-aware triple validation
- Knowledge Graph population from Noolaham archives
Conclusion
This work presents a complete Transformer-based Tamil Relation Extraction pipeline that converts unstructured Tamil documents into structured knowledge triples.
The system currently supports:
✓ Sentence Segmentation
✓ Named Entity Recognition
✓ Entity Pair Generation
✓ Transformer-Based Relation Extraction
✓ Triple Construction
✓ End-to-End Knowledge Extraction
While several challenges remain in entity boundary detection and relation classification, the architecture establishes a scalable foundation for future Tamil Knowledge Graph construction and low-resource language information extraction research.
The next stage of development will focus on evaluation, coreference resolution, ontology integration, and large-scale knowledge graph generation from Sri Lankan Tamil textual resources.