Development of a digital content conversion pipeline

Summary: The content conversion pipeline transforms Noolaham’s digitized documents into structured, machine-readable text. It begins by ingesting digital documents such as scanned newspapers, books, magazines, and pamphlets, which are first preprocessed to improve quality and consistency. The pipeline then performs layout analysis to identify document structure, followed by article segmentation to separate logical content units. Optical Character Recognition (OCR) is applied to extract text from the page images, and the extracted content is converted into standardized text formats. A post-correction stage then corrects remaining errors to improve the accuracy and consistency of the extracted text. The final output is clean, structured text enriched with metadata, ready for storage and downstream processing.
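
The stages described in the summary can be pictured as a chain of functions that each take a document record and return it with more structure. The sketch below is only an illustration under stated assumptions: the `Document` class, every stage function, and the sample input are hypothetical names invented for this example, and each stage body is a trivial stand-in (plain strings play the role of page images, and no real OCR engine is called). It is not the project's actual implementation.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    # Hypothetical record passed from stage to stage.
    source_id: str
    pages: List[str]  # stand-in for scanned page images
    blocks: List[str] = field(default_factory=list)
    articles: List[Dict[str, str]] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def preprocess(doc: Document) -> Document:
    # Stand-in for image clean-up; here it only trims the page strings.
    doc.pages = [page.strip() for page in doc.pages]
    return doc


def analyze_layout(doc: Document) -> Document:
    # Stand-in: treat chunks separated by blank lines as layout blocks.
    doc.blocks = [b for page in doc.pages for b in page.split("\n\n") if b.strip()]
    return doc


def segment_articles(doc: Document) -> Document:
    # Stand-in: assume one article per layout block.
    doc.articles = [{"raw": block} for block in doc.blocks]
    return doc


def run_ocr(doc: Document) -> Document:
    # Stand-in: a real stage would call an OCR engine on each region.
    for article in doc.articles:
        article["text"] = article["raw"]
    return doc


def to_standard_text(doc: Document) -> Document:
    # Assumption: "standardized text" is modeled here as Unicode NFC form.
    for article in doc.articles:
        article["text"] = unicodedata.normalize("NFC", article["text"])
    return doc


def post_correct(doc: Document) -> Document:
    # Stand-in: collapse runs of whitespace left over from recognition.
    for article in doc.articles:
        article["text"] = " ".join(article["text"].split())
    return doc


STAGES: List[Stage] = [
    preprocess,
    analyze_layout,
    segment_articles,
    run_ocr,
    to_standard_text,
    post_correct,
]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Apply each stage in order and record which stages have run.
    for stage in stages:
        doc = stage(doc)
        doc.completed.append(stage.__name__)
    return doc


sample = Document(
    source_id="sample-001",
    pages=["  Headline one\nbody   text\n\nHeadline two\nmore text  "],
    metadata={"type": "newspaper"},
)
result = run_pipeline(sample, STAGES)
```

Keeping each stage as a separate function with the same signature means a stage can be replaced or reordered without touching the others, which is one plausible way to let OCR or post-correction components evolve independently.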

Partner Institution	Noolaham Foundation
Supervisors	Dr. Saatviga Sudhahar
Researchers	Charangan Vasantharajan, Nilakshan Kunananthaseelan