Text/Speech Processing and Analytics Platform

One of the main goals of CTNLPR is to build an open-source Text/Speech Processing and Analytics Platform for the Tamil language. The core of the system therefore needs to be built on a component model with clearly differentiated layers. All system processes are placed into separate components so that the data and functions inside each component are semantically related. Fig 1 shows the component model of the platform.

Fig 1: CTNLPR Text/Speech Processing and Analytics Platform Component Model

Our platform comprises the following layers: the Language Resource layer, the Processing Resource layer and the Product/Application layer.

Language Resource Layer

Language resources are data-only resources, such as corpora, lexicons, tag sets, models, dictionaries and thesauri, that the processing resources require in order to build a natural language processing application or product. Corpus resources include monolingual corpora (Tamil, English, Sinhala and other Dravidian languages), parallel corpora (Tamil-English, Tamil-Sinhala and Tamil paired with other Dravidian languages), annotated corpora (Tamil corpora annotated with POS tags, chunk tags, sense tags, named entity tags, etc.), speech corpora (required for building Tamil text-to-speech systems) and lexicons.

Tag sets are predefined sets of grammatical tags that linguists use to annotate a corpus; these include POS tags and chunk tags. Models include language models for text and acoustic models for speech. Dictionaries include monolingual and bilingual dictionaries for Tamil, a Tamil thesaurus and other types of dictionaries, such as pronunciation dictionaries for speech.

Processing Resource Layer

Processing resources refer to resources whose character is principally programmatic or algorithmic, such as tokenisers, chunkers or parsers. For example, a part-of-speech tagger is best characterised by reference to the process it performs on text. Processing resources typically use Language resources, e.g. a tagger often uses a lexicon; a word sense disambiguator uses a dictionary or thesaurus.
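The separation between the two kinds of resource can be illustrated with a minimal sketch. The class names (Lexicon, LexiconTagger), the three-word toy lexicon and the tag labels below are hypothetical and chosen only for illustration; they are not part of the CTNLPR platform, and a real tagger would use a statistical or neural model rather than a plain lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Lexicon:
    """Language resource: holds data only, no processing logic."""
    entries: Dict[str, str] = field(default_factory=dict)

    def lookup(self, word: str, default: str = "UNK") -> str:
        # Return the stored tag, or a fallback tag for unknown words.
        return self.entries.get(word, default)


class LexiconTagger:
    """Processing resource: an algorithm that consumes a language resource."""

    def __init__(self, lexicon: Lexicon) -> None:
        self.lexicon = lexicon

    def tag(self, tokens: List[str]) -> List[Tuple[str, str]]:
        # Pair every token with the tag found in the lexicon.
        return [(token, self.lexicon.lookup(token)) for token in tokens]


# Toy data (assumed for illustration): "I", "book", "read (past)".
toy_lexicon = Lexicon({"நான்": "PRON", "புத்தகம்": "NOUN", "படித்தேன்": "VERB"})
tagger = LexiconTagger(toy_lexicon)
tagged = tagger.tag("நான் புத்தகம் படித்தேன்".split())
```

Because the tagger receives the lexicon as a constructor argument, the same processing resource can be reused with a different language resource without changing its code.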

Our processing resource layer would include all the basic tools required for language processing, such as a sentence splitter, tokeniser, POS tagger, dependency parser, morphological analyser/generator, named entity recogniser, coreference resolver, pronominal resolver and word sense disambiguator.

Product/Application Layer

This layer would consist of the products and applications built using the language and processing resources in the platform. We aim to build a set of products and applications that would be useful for the community. These include, but are not limited to, machine translation systems in both directions between English and Tamil, Tamil and Sinhala, and Tamil and other Dravidian languages; a machine transliteration system from English to Tamil; text-to-speech and speech-to-text systems for Tamil; a language detection system; intelligent input methods for Tamil on devices; sentiment analysis applications; question answering systems; spell checkers; a Tamil optical character recognition system; and a Tamil WordNet.

Knowledge Engineering

Knowledge engineering would be the ultimate point we reach once all the previous resources are in place. Using the language and processing resources, together with the products and applications built for Tamil natural language processing, it becomes possible to analyse large amounts of Tamil text data for new insights. This would include large-scale analysis of media content such as current news, social media and historical news, as well as information retrieval and pattern recognition.

A typical pipeline architecture for knowledge engineering tasks is shown in Fig 2. It shows how the language and processing resources in the platform could be used to perform pattern recognition on a large corpus.

Fig 2: Text Processing Architecture Pipeline
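A pipeline of this kind can be sketched as a chain of stages that each enrich a shared document. The sketch below is an assumption made for illustration and does not reproduce the stages in Fig 2: the function names, the dictionary-based document and the use of bigram counting as a stand-in for pattern recognition are all hypothetical.

```python
import re
from collections import Counter
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def split_sentences(doc: Document) -> Document:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [p.strip() for p in parts if p.strip()]
    return doc


def tokenise(doc: Document) -> Document:
    # Keep runs of characters that are neither whitespace nor punctuation,
    # so Tamil vowel signs stay attached to their base letters.
    doc["tokens"] = [re.findall(r"[^\s.!?,]+", s) for s in doc["sentences"]]
    return doc


def count_bigrams(doc: Document) -> Document:
    # Stand-in for pattern recognition: count adjacent token pairs.
    doc["bigrams"] = Counter(
        pair for sent in doc["tokens"] for pair in zip(sent, sent[1:])
    )
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Document:
    doc: Document = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


# Toy input (assumed): "I read a book. I bought a book."
text = "நான் புத்தகம் படித்தேன். நான் புத்தகம் வாங்கினேன்."
result = run_pipeline(text, [split_sentences, tokenise, count_bigrams])
most_common_pair, frequency = result["bigrams"].most_common(1)[0]
```

Keeping each stage as an independent function with the same signature mirrors the component model: a stage such as the tokeniser can be swapped for a better one without touching the rest of the pipeline.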