AI-Powered Ecosystem for Noolaham Foundation

The project aims to do large scale annotation, storage and analysis of Sri Lankan Tamil content. This is related to the field of semantic culturomics in which researchers data mine large digital archives to investigate cultural phenomena reflected in language and word usage. It is a form of computational lexicology that studies human behaviour and cultural trends through the quantitative analysis of digitised texts. The underlying data is from Noolaham Foundation, a Digital Archive and a Digital Library undertaking the critical work of documenting, digitally preserving and providing free and open access to knowledge bases and cultural heritage of Sri Lankan Tamil speaking communities. The archive contains digitised text from Sri Lankan newspapers, books, magazines, pamphlets etc from various sources totalling  up to approximately 180,000+ documents. It also includes a web archive and born-digital data in text format which would be included in our pipeline. 


Goals


The goal of the project is to build an ecosystem that analyses large scale Sri Lankan Tamil content using natural language processing methodologies and generative AI. 


Objectives


  • Converting digitized content from Noolaham project to text with meta-data tagging. This includes converting newspapers, books, magazines, pamphlets from pdfs and images to text. We do Layout analysis of newspapers, Optical character recognition, Building document type storage, XML conversion of text.
  • Building language processing resources using natural language processing resources and building Knowledge engineering capabilities. 
  • Building a generative AI based tool to query Noolaham’s content, similar to ChatGPT. 


The Ecosystem



Architecture



Projects


Corpus Creation

The authoritative repository of digitized texts and documents produced through upstream ingestion pipelines (OCR, image processing, segmentation, and normalization). It serves as the single source of truth from which all structured, semantic, and temporal knowledge is derived.

NLP Enrichment

An analytical layer that processes corpus content to extract higher-level knowledge such as entities, topics, events, relationships, and summaries. It transforms raw text into structured representations suitable for ontology mapping and downstream knowledge construction.

Taxonomy/Ontology Creation

A high-level semantic framework that defines canonical concepts, classes, and relationships used across the ecosystem. It provides a shared conceptual model to which extracted knowledge is mapped, ensuring consistency, interoperability, and semantic alignment across all components. This will also cover a taxonomy which includes a structured classification system that organizes things into categories and subcategories.

Knowledge Graph

A structured graph of entities and relationships generated from ontology-aligned enriched data. It enables semantic linking, reasoning, inference, and complex queries across people, places, works, events, and concepts represented in the corpus.

A Sri Lankan Tamil Foundation Model 

This project aims to build a Sri Lankan Tamil foundation model using large-scale Tamil text from archives, literature, newspapers, and digital content. The goal is to create an AI system that understands the linguistic and cultural nuances of Sri Lankan Tamil, enabling applications such as semantic search, knowledge retrieval, translation, digital preservation, and conversational AI while making Tamil knowledge more accessible for future generations.

A Sri Lankan Tamil Text-to-Speech Model

This project focuses on developing a Sri Lankan Tamil text-to-speech model capable of generating natural and regionally accurate Tamil speech. The goal is to support applications such as voice assistants, accessibility tools, educational platforms, audiobook generation, and spoken conversational AI tailored to Sri Lankan Tamil speakers.


Products


Noolaham GPT

An AI-powered assistant that provides interactive question answering, exploration, and synthesis over the Noolaham ecosystem. It draws on corpus content, enriched structures, ontology, and the knowledge graph to deliver contextual answers with citations and temporal awareness.

Noolaham WebArchive 

Noolaham WebArchive enables the ingestion, preservation, and analysis of born-digital Tamil web content such as blogs, online news portals, and community forums, preventing digital loss. Archived data is processed through the same NLP enrichment pipeline as digitized texts, allowing unified indexing, metadata tagging, and semantic linking within the Noolaham Knowledge Graph.

Noolaham Entity Explorer

Noolaham Entity Explorer is an AI-powered knowledge exploration platform designed to identify, organize, and connect entities such as people, places, organizations, events, and concepts from Sri Lankan Tamil archival and digital content. Using ontology-aware retrieval and semantic search, the system enables users to explore relationships across historical documents, literature, newspapers, and cultural records, making Tamil knowledge more searchable, discoverable, and accessible for research, education, and digital preservation.

Noolaham Trends

The system detects shifts in discourse related to political movements, migration, education, social reforms, literature, religion, and identity formation using topic modelling, named entity frequency tracking, sentiment analysis, and event co-occurrence mapping. These insights are presented through interactive dashboards that enable researchers to explore long-term trends in terminology, ideological narratives, cultural practices, and community concerns across decades of archived newspapers, books, and born-digital content.

Noolaham MultiLingual

Noolaham MultiLingual is a cross-lingual access and translation layer that enhances the discoverability of Tamil heritage content for global audiences. Using neural machine translation and multilingual embeddings, it enables bidirectional translation and supports cross-lingual information retrieval while maintaining semantic fidelity.

Noolaham Pedia

A curated knowledge presentation layer that generates human-readable summaries of entities and concepts. It organizes information derived from the knowledge graph and corpus into accessible entries, with transparent references back to original Noolaham sources.

Noolaham Voice AI

Noolaham Voice AI is a multilingual AI-powered voice platform that enables natural Tamil voice interactions, speech-based search, conversational AI assistance, and audio access to digital knowledge, archives, and educational content through advanced speech recognition and text-to-speech technologies.

Noolaham Plagarism

Noolaham Plagiarism is an AI-powered system for detecting textual similarity and verifying document provenance in low-resource Tamil corpora, identifying direct copying, paraphrasing, translation plagiarism, and content reuse using semantic and stylometric analysis.

Noolaham FactCheck 

Noolaham FactCheck is an automated claim verification system that cross-references statements from digitized documents against trusted archival sources within the Noolaham ecosystem. Using named entity recognition, event extraction, and knowledge graph-based reasoning, it evaluates claims related to historical events, public figures, institutions, and socio-political developments with contextual evidence from primary sources.


Conclusion


As part of the language preprocessing layer of the ecosystem we have completed the development of the digital content conversion pipeline. As next steps layout analysis, ocr and xml conversion has to be run on the whole of Noolaham’s content to convert it to raw text. This requires significant resources and budget related to server cost and running time. I will submit a separate proposal specifying those requirements. Separate efforts have to be initiated to build the processing resources layer and the knowledge engineering layer.