Building a Transformer-Based Tamil Relation Extraction Pipeline for Knowledge Graph Construction

Introduction

Knowledge Graph construction from unstructured text remains one of the fundamental challenges in Natural Language Processing (NLP). The process requires identifying entities, understanding semantic relationships between them, and converting textual information into structured triples that can be stored, queried, and reasoned over.

For low-resource languages such as Sri Lankan Tamil, this challenge is even more significant due to the limited availability of annotated datasets, domain-specific resources, and production-ready information extraction systems.

This work presents the design and implementation of a Transformer-based Tamil Relation Extraction pipeline that converts raw Tamil documents into structured knowledge triples.

The primary objective is to establish a modular, extensible architecture capable of supporting future components such as Coreference Resolution, Entity Linking, Ontology Validation, and Knowledge Graph Construction.

Problem Statement

Consider the following Tamil sentence:

சாம்.ஏ.சபாபதி யாழ்ப்பாணத்தில் வாழ்ந்தார்.

Humans can easily infer the relationship:

(சாம்.ஏ.சபாபதி, LIVED_IN, யாழ்ப்பாணம்)

However, machines must solve several sub-problems before reaching this representation:

Detect sentence boundaries.
Identify named entities.
Determine entity types.
Generate valid entity pairs.
Understand the semantic relationship.
Convert the result into a structured representation.

This sequence of tasks forms the basis of Information Extraction systems and Knowledge Graph construction pipelines.

System Architecture

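The proposed architecture follows a modular pipeline design: sentence splitting, named entity recognition, pair generation, entity marking, relation extraction, and triple construction.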

A pipeline-based design was chosen over end-to-end joint extraction architectures for its modularity, interpretability, ease of debugging, and suitability for low-resource environments.

Each stage can be independently evaluated, improved, or replaced without affecting the remaining components.

Phase 1: Sentence Splitting

The first stage converts raw documents into individual Tamil sentences.

Input:

சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.
அவர் ஆசிரியராக பணியாற்றினார்.

Output:

[
    "சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.",
    "அவர் ஆசிரியராக பணியாற்றினார்."
]

Sentence segmentation reduces computational complexity and allows downstream models to process smaller semantic units.
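A minimal sketch of this stage, assuming that sentences end with `.`, `?`, or `!` followed by whitespace. The function name `split_sentences` is illustrative and is not taken from the original implementation.

```python
import re
from typing import List

# Split after sentence-final punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(document: str) -> List[str]:
    """Return the non-empty sentences found in a raw document."""
    parts = _SENTENCE_END.split(document.strip())
    return [part.strip() for part in parts if part.strip()]
```

Periods inside a name written without spaces, such as சாம்.ஏ.சபாபதி, are not followed by whitespace and therefore do not trigger a split. An abbreviation followed by a space would still be split incorrectly, so a production splitter would likely need an exception list.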

Phase 2: Named Entity Recognition

The Named Entity Recognition stage identifies entities and assigns semantic labels.

Entity Categories:

PERSON
LOCATION
ORGANIZATION

Example:

Input:
டாக்டர் கந்தசாமி யாழ்ப்பாண போதனா வைத்தியசாலையில் பணியாற்றினார்.

Output:

PERSON
    கந்தசாமி

ORGANIZATION
    யாழ்ப்பாண போதனா வைத்தியசாலை

A fine-tuned Sri Lankan Tamil NER model is currently used for entity identification.

During experimentation, several tokenizer-related boundary issues were observed, particularly involving Tamil Unicode combining characters. To mitigate these issues, an entity post-processing layer was introduced.
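The original post-processing rules are not described, so the following is only one plausible sketch of such a layer. It assumes the NER model returns character offsets and that a typical failure is a span that ends or starts in the middle of a base letter plus its combining marks (vowel signs or pulli). The helper name `snap_to_cluster` is hypothetical.

```python
import unicodedata
from typing import Tuple

def snap_to_cluster(text: str, start: int, end: int) -> Tuple[int, int]:
    """Widen a character span so it does not separate a letter from its marks.

    Assumes 0 <= start < end <= len(text).
    """
    # Combining marks have a Unicode general category beginning with "M".
    # Pull in any marks that immediately follow the span.
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    # If the span starts on a mark, move back to the base letter it belongs to.
    while start > 0 and unicodedata.category(text[start]).startswith("M"):
        start -= 1
    return start, end
```

For example, a predicted span that stops one code point short of the final vowel sign in கந்தசாமி would be extended to cover the whole word.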

Phase 3: Pair Generation

Named entities alone do not represent knowledge; knowledge is expressed by the relations that hold between pairs of entities.

For example:

PERSON
    சாம்பசிவம்

LOCATION
    யாழ்ப்பாணம்

becomes:

(சாம்பசிவம், யாழ்ப்பாணம்)

The Pair Generator creates candidate entity pairs based on predefined semantic constraints.

Allowed combinations include:

{
    ("PERSON", "LOCATION"),
    ("PERSON", "ORGANIZATION"),
    ("ORGANIZATION", "LOCATION")
}

Restricting candidates to these type combinations reduces the number of pairs that must be passed to the relation classifier.
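The constraint check can be sketched as follows. The set of allowed combinations comes from the text above; the `Entity` representation and the function name `generate_pairs` are assumptions made for illustration.

```python
from itertools import permutations
from typing import List, Tuple

Entity = Tuple[str, str]  # (surface text, entity label)

ALLOWED_PAIRS = {
    ("PERSON", "LOCATION"),
    ("PERSON", "ORGANIZATION"),
    ("ORGANIZATION", "LOCATION"),
}

def generate_pairs(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Return ordered (head, tail) pairs whose label combination is allowed."""
    return [
        (head, tail)
        for head, tail in permutations(entities, 2)
        if (head[1], tail[1]) in ALLOWED_PAIRS
    ]
```

Because the pairs are ordered, a PERSON can be the head of a LOCATION, but a LOCATION is never proposed as the head of a PERSON.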

Phase 4: Entity Marking

Before relation classification, entity mentions are explicitly marked.

Example:

Original:

சாம்பசிவம் யாழ்ப்பாணத்தில் வாழ்ந்தார்.

Transformed into:

[E1] சாம்பசிவம் [/E1] [E2] யாழ்ப்பாணத்தில் [/E2] வாழ்ந்தார்.

Entity marking allows transformer-based models to focus on the target entities when predicting relationships.
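A minimal sketch of the marking step, assuming each entity is given as a non-overlapping `(start, end)` character span within the sentence. The function name `mark_entities` is illustrative.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

def mark_entities(sentence: str, e1: Span, e2: Span) -> str:
    """Wrap two non-overlapping spans in [E1]...[/E1] and [E2]...[/E2] markers."""
    spans = [(e1[0], e1[1], "E1"), (e2[0], e2[1], "E2")]
    # Insert from right to left so earlier offsets stay valid.
    for start, end, tag in sorted(spans, reverse=True):
        sentence = (
            sentence[:start]
            + f"[{tag}] " + sentence[start:end] + f" [/{tag}]"
            + sentence[end:]
        )
    return sentence
```

Working from the rightmost span first avoids having to recompute offsets after each insertion.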

Phase 5: Transformer-Based Relation Extraction

Relation Extraction is formulated as a classification problem.

Given:

[E1] சாம்பசிவம் [/E1] [E2] யாழ்ப்பாணத்தில் [/E2] வாழ்ந்தார்.

the model predicts:

LIVED_IN

Current relation labels include:

LIVED_IN
WORKED_AT
CONTRIBUTED_TO
PART_OF
BORN_IN

The current implementation uses a multilingual Natural Language Inference (NLI) transformer model in a zero-shot setting.

This approach enables rapid prototyping without requiring a dedicated Tamil relation extraction dataset.
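The article does not name the model or the hypothesis wording, so the sketch below makes several assumptions: the English phrases in `RELATION_PHRASES`, the hypothesis template, and the score threshold are all illustrative. The `classifier` argument is expected to behave like the Hugging Face `zero-shot-classification` pipeline, which returns a dictionary whose `labels` and `scores` lists are sorted by descending score.

```python
from typing import Callable, Optional

# Assumed mapping from relation labels to natural-language phrases.
RELATION_PHRASES = {
    "LIVED_IN": "lived in",
    "WORKED_AT": "worked at",
    "CONTRIBUTED_TO": "contributed to",
    "PART_OF": "is part of",
    "BORN_IN": "was born in",
}

def classify_relation(
    marked_sentence: str,
    classifier: Callable[..., dict],
    threshold: float = 0.5,  # assumed cut-off, not from the original system
) -> Optional[str]:
    """Return the best-scoring relation label, or None if below the threshold."""
    result = classifier(
        marked_sentence,
        candidate_labels=list(RELATION_PHRASES.values()),
        hypothesis_template="The first entity {} the second entity.",
    )
    best_phrase, best_score = result["labels"][0], result["scores"][0]
    if best_score < threshold:
        return None
    phrase_to_label = {phrase: label for label, phrase in RELATION_PHRASES.items()}
    return phrase_to_label[best_phrase]

# Possible usage (requires the `transformers` package and a model download;
# the model name is only an example of a multilingual NLI checkpoint):
#   from transformers import pipeline
#   clf = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
#   classify_relation("[E1] ... [/E1] [E2] ... [/E2] ...", clf)
```

Passing the classifier in as an argument keeps the function testable without loading a model and makes it easy to swap in a fine-tuned relation classifier later.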

Phase 6: Triple Construction

After relation classification, extracted information is converted into a structured knowledge representation.

Input:

Entity 1:
சாம்பசிவம்

Relation:
LIVED_IN

Entity 2:
யாழ்ப்பாணம்

Output:

(
    சாம்பசிவம்,
    LIVED_IN,
    யாழ்ப்பாணம்
)

These triples form the foundational building blocks of a Knowledge Graph.
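One way to represent the output is a small immutable record. In the sketch below, the `Triple` type and `build_triples` function are illustrative names, and skipping pairs without a predicted relation and dropping duplicates are assumptions rather than documented behaviour of the original system.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple

class Triple(NamedTuple):
    head: str
    relation: str
    tail: str

def build_triples(
    predictions: Iterable[Tuple[str, Optional[str], str]]
) -> List[Triple]:
    """Convert (head, relation, tail) predictions into de-duplicated triples."""
    seen = set()
    triples = []
    for head, relation, tail in predictions:
        if relation is None:  # no relation predicted for this pair
            continue
        triple = Triple(head.strip(), relation, tail.strip())
        if triple not in seen:
            seen.add(triple)
            triples.append(triple)
    return triples
```

Since `Triple` is a tuple subclass, it can be compared directly with a plain `(head, relation, tail)` tuple and used as a set member or dictionary key.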

Phase 7: End-to-End Pipeline Integration

The final stage orchestrates all components into a unified extraction workflow.

Pseudo-flow:

document
    → sentence splitter
    → NER
    → pair generator
    → entity marker
    → relation extraction
    → triple builder

This design enables document-level processing while preserving modularity between components.
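The flow above can be sketched as a function that receives each stage as a callable. All parameter names and the `Entity` layout are assumptions for illustration; the actual interfaces between stages are not specified in this article.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Entity = Tuple[str, int, int, str]  # (text, start, end, label)

def run_pipeline(
    document: str,
    split: Callable[[str], Iterable[str]],
    ner: Callable[[str], List[Entity]],
    pair: Callable[[List[Entity]], Iterable[Tuple[Entity, Entity]]],
    mark: Callable[[str, Entity, Entity], str],
    classify: Callable[[str], Optional[str]],
) -> List[Tuple[str, str, str]]:
    """Run every stage in order and collect (head, relation, tail) triples."""
    triples = []
    for sentence in split(document):
        entities = ner(sentence)
        for head, tail in pair(entities):
            marked = mark(sentence, head, tail)
            relation = classify(marked)
            if relation is not None:  # keep only pairs with a predicted relation
                triples.append((head[0], relation, tail[0]))
    return triples
```

Injecting the stages as arguments mirrors the modularity goal: any single stage can be replaced or stubbed out in tests without touching the others.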

Experimental Results

The pipeline was tested on Tamil sentences covering several kinds of content, including:

Personal biographies
Academic affiliations
Institutional relationships
Historical narratives
Sri Lankan Tamil named entities

Example:

Input:

சாம்.ஏ.சபாபதி யாழ்ப்பாணத்தில் வாழ்ந்தார்.

Output:

(
    சாம்.ஏ.சபாபதி,
    LIVED_IN,
    யாழ்ப்பாணத்தில்
)

Example:

Input:

டாக்டர் கந்தசாமி யாழ்ப்பாண போதனா வைத்தியசாலையில் பணியாற்றினார்.

Output:

(
    கந்தசாமி,
    WORKED_AT,
    யாழ்ப்பாண போதனா வைத்தியசாலை
)

In these examples, the pipeline extracts the expected knowledge triples from Tamil text; a quantitative evaluation is left to the next stage of development.

Future Work

The current system serves as the foundation for a larger Tamil Knowledge Graph ecosystem.

Planned extensions include:

Coreference Resolution
    ↓
Entity Canonicalization
    ↓
Ontology Validation
    ↓
Relation Filtering
    ↓
Knowledge Graph Construction

Additional research directions include:

MuRIL-based NER models
XLM-R based NER models
Fine-tuned Tamil Relation Extraction models
Document-level Relation Extraction
Ontology-aware triple validation
Knowledge Graph population from Noolaham archives

Conclusion

This work demonstrates the successful implementation of a complete Transformer-based Tamil Relation Extraction pipeline capable of converting unstructured Tamil documents into structured knowledge triples.

The system currently supports:

✓ Sentence Segmentation

✓ Named Entity Recognition

✓ Entity Pair Generation

✓ Transformer-Based Relation Extraction

✓ Triple Construction

✓ End-to-End Knowledge Extraction

While several challenges remain in entity boundary detection and relation classification, the architecture establishes a scalable foundation for future Tamil Knowledge Graph construction and low-resource language information extraction research.

The next stage of development will focus on evaluation, coreference resolution, ontology integration, and large-scale knowledge graph generation from Sri Lankan Tamil textual resources.
