Projects ← Center for Tamil Natural Language Processing Research

Taxonomy/Ontology Creation

Jan 27, 2026

Summary: This project aims to create a high-level semantic framework that defines the canonical concepts, classes, properties, and relationships used across the Noolaham ecosystem. It provides a shared conceptual model against which extracted knowledge from the corpus is mapped, ensuring semantic consistency and structural coherence across digitization, NLP enrichment, knowledge graph construction, and application layers. ...

Building a Knowledge Graph

Jan 27, 2026

Summary: This project involves the creation of a structured, ontology-aligned graph that represents entities and their relationships derived from NLP-enriched corpus data. The knowledge graph models people, places, works, events, and concepts, enabling rich semantic linking across diverse materials in the Noolaham collection. By integrating temporal and contextual attributes, the graph supports advanced reasoning, inference, ...
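As a rough illustration of the kind of structure described above, the following is a minimal sketch of a graph that stores typed relationships between entities with an optional temporal attribute. The class names, predicate names, and entity identifiers (`Edge`, `Graph`, `authored`, `person:A`, and so on) are hypothetical and are not taken from the project itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    """One relationship: subject --predicate--> obj, optionally dated."""
    subject: str
    predicate: str
    obj: str
    year: Optional[int] = None  # hypothetical temporal attribute


class Graph:
    """A tiny in-memory edge list; a real system would use a graph store."""

    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def add(self, subject: str, predicate: str, obj: str,
            year: Optional[int] = None) -> None:
        self.edges.append(Edge(subject, predicate, obj, year))

    def neighbors(self, subject: str,
                  predicate: Optional[str] = None) -> List[Edge]:
        """Return outgoing edges of `subject`, optionally filtered by predicate."""
        return [
            e for e in self.edges
            if e.subject == subject
            and (predicate is None or e.predicate == predicate)
        ]


# Made-up placeholder entities, for illustration only.
g = Graph()
g.add("person:A", "authored", "work:X", year=1950)
g.add("work:X", "published_in", "place:P", year=1951)
g.add("person:A", "born_in", "place:P")

authored = g.neighbors("person:A", "authored")
```

Keeping the year on the edge rather than on the entity is one possible design choice: it lets the same pair of entities be linked by several dated relationships.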

Infrastructure, storage and content conversion

Jan 27, 2026

Summary: This project focuses on establishing a scalable and resilient technical foundation to support large-scale digitization and processing of Noolaham’s collections. The project provisions secure compute infrastructure and object storage to manage raw inputs, intermediate artifacts, and processed outputs across the content lifecycle. On top of this infrastructure, a robust digital content conversion pipeline is ...

Build and evaluate Noolaham GPT

Jan 27, 2026

Summary: This project focuses on the design, implementation, and assessment of an AI-powered conversational assistant that enables interactive question answering, exploration, and knowledge synthesis across the Noolaham ecosystem. Noolaham GPT leverages the authoritative Noolaham corpus, NLP-enriched structures, domain ontologies, and the underlying knowledge graph to deliver accurate, contextual, and explainable responses. The system is designed ...

Gap analysis with development of NLP tools and knowledge engineering

Jan 27, 2026

Summary: This project conducts a comprehensive gap analysis of existing digitization, natural language processing, and knowledge engineering capabilities for Sri Lankan Tamil, followed by the development of targeted NLP tools to address identified limitations. It delivers an end-to-end AI system that ingests, processes, and enriches digitized texts and documents to produce structured, reliable, and trustworthy ...

Custom GPT tool for question answering

Jan 26, 2026

Summary: The project aims to develop a Retrieval-Augmented Generation (RAG)–based question-answering system using custom, domain-specific English content. The system will retrieve relevant passages from a processed, metadata-enriched text corpus and use them as grounding context for a GPT model to generate accurate and context-aware responses. By combining information retrieval with generative AI, the system is ...
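The retrieve-then-generate flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes a simple bag-of-words cosine similarity for retrieval (a real system would more likely use embeddings or a search index), it stops at building the prompt rather than calling a GPT model, and the passages, the `source` field, and all function names are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query (score > 0 only)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p["text"]))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(query, hits):
    """Assemble retrieved passages into grounding context for a generator."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in hits)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


# Made-up passages with a hypothetical metadata field ("source").
PASSAGES = [
    {"source": "doc-1", "text": "The library opened a new reading room in 2019."},
    {"source": "doc-2", "text": "Rice farming depends on the monsoon season."},
    {"source": "doc-3", "text": "The reading room holds rare newspapers and books."},
]

question = "What does the reading room hold?"
hits = retrieve(question, PASSAGES)
prompt = build_prompt(question, hits)
```

The resulting `prompt` string is what would be passed to the generative model. Carrying the source identifier alongside each passage makes it possible to cite where an answer came from.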

Development of digital content conversion pipeline

Nov 25, 2023

Summary: The content conversion pipeline transforms Noolaham’s digitized documents into structured, machine-readable text. It begins with the ingestion of digital documents such as scanned newspapers, books, magazines, and pamphlets. These documents are first preprocessed to improve quality and consistency. The pipeline then performs layout analysis to identify document structure, followed by article segmentation to separate logical ...
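The stage ordering named in the visible part of this summary (ingestion, preprocessing, layout analysis, article segmentation) could be sketched as a chain of functions. This is a toy illustration under heavy assumptions: plain text stands in for a scanned page, each stage body is a placeholder for what would in practice be image processing and layout models, and every function name here is hypothetical. Stages beyond the truncated summary are not modelled.

```python
def preprocess(doc):
    """Placeholder for quality cleanup: normalise whitespace on each line."""
    doc["lines"] = [" ".join(line.split()) for line in doc["raw"].splitlines()]
    return doc


def analyze_layout(doc):
    """Placeholder layout analysis: group consecutive non-empty lines into blocks."""
    blocks, current = [], []
    for line in doc["lines"]:
        if line:
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    doc["blocks"] = blocks
    return doc


def segment_articles(doc):
    """Placeholder segmentation: first line of a block is the headline."""
    doc["articles"] = [
        {"headline": block[0], "body": " ".join(block[1:])}
        for block in doc["blocks"]
    ]
    return doc


# Stages run in the order the summary lists them.
STAGES = [preprocess, analyze_layout, segment_articles]


def run_pipeline(raw_text):
    """Ingest raw text and pass it through every stage in order."""
    doc = {"raw": raw_text}
    for stage in STAGES:
        doc = stage(doc)
    return doc


# Made-up input standing in for one digitized page.
page = "HEADLINE ONE\nfirst   body line\n\nHEADLINE TWO\nsecond body\nmore text\n"
result = run_pipeline(page)
```

Passing a single dictionary through the stages keeps intermediate artifacts (lines, blocks, articles) available for inspection after the run.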

Entity Extraction in Tamil Tweets

Feb 22, 2017

Summary: Social media text, such as posts on Twitter, carries information on many important topics. Extracting that information is the goal of entity extraction, one of the most basic tasks in Natural Language Processing. Entities are real-world elements or objects such as person names, organization names, product names, and location names. Entities are often referred to as Named ...

Paraphrase Identification System for Tamil

Feb 22, 2017

Summary: A paraphrase can be defined as “the same meaning of a sentence is expressed in another sentence using different words”. Paraphrases can be identified, generated, or extracted. This project focuses on sentence-level paraphrase identification for Tamil. Identifying paraphrases in Tamil is difficult because evaluating the semantic similarity of the underlying content and ...

Language Resource Development for Tamil

Jan 13, 2017

Summary: The prerequisites for developing NLP applications in any language are lexical resources, corpora, and computational models. The scarcity of these resources for Tamil is one of the major reasons for the slow growth of NLP work in the Tamil language. Through this project we aim to reduce the gap by creating required language ...