Projects Road Map

We have identified Tamil natural language processing research projects that should be initiated either by building upon previous work, towards creating systems with better performance and evaluation measures, or by building systems using new machine learning approaches. We also want to carry out a complete linguistic study of the evolution of the Tamil language. The projects are organised into six main research themes and cover all areas related to natural language processing for Tamil.


Language Resources


Monolingual/Parallel/Annotated Text/Speech Corpora
The need for a corpus for a language is multifarious. From the preparation of a dictionary or lexicon to machine translation, the corpus has become an indispensable resource for the technological development of languages. A corpus is a large body of text incorporating various types of material, including newspapers, weeklies, fiction, scientific writing, literary writing, and so on. A corpus must be very large in size, as it will be used for many language applications such as the preparation of lexicons of different sizes, purposes and types, NLP tools, machine translation programs and so on. We intend to build monolingual corpora for Tamil, Sinhala and Dravidian languages; parallel corpora for Tamil-English, Tamil-Sinhala and Tamil-Dravidian language pairs; annotated corpora tagged for part of speech, morphology, lemma, phrases, senses, named entities, etc.; and speech corpora required for automatic speech recognition applications. While we focus on Tamil as a whole, there is a need to differentiate between Sri Lankan and Indian Tamil when building the corpus resources. Building separate lexical resources for Sri Lankan Tamil is an important task, since Sri Lankan Tamil differs in the way it is written and spoken. We therefore also intend to build the same lexical resources for Sri Lankan Tamil, which would be useful in creating NLP applications involving Sri Lankan Tamil.

Language/Speech Models
Language Models are an important resource for various NLP applications that require the generation of text, including Machine Translation, Speech Recognition, etc. There are different approaches to building a Language Model, of which the most popular is the statistical approach. There is also the morpheme-based approach, which could be useful especially in applications such as speech recognition. We aim to build efficient Language Models for Tamil using both approaches. In a statistical model, typical sequences of words are given high probabilities, whereas atypical word sequences are given low probabilities. The quality of the language model is evaluated by measuring the probability of new test word sequences. To capture linguistic information, morpheme-based language models might be required for highly inflectional languages like Tamil. In that case we might need to decompose words into stems and endings and treat these units/morphemes separately, as if they were independent words. An acoustic model is used in Automatic Speech Recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. Acoustic models for Tamil could be trained using toolkits such as CMUSphinx with spoken Tamil sentences and their corresponding transcripts. We aim to create proper acoustic models for Tamil speech which would be useful for ASR systems.
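The statistical approach described above can be illustrated with a minimal bigram model. This is a sketch only: the romanised "corpus" below is a toy placeholder, not real Tamil training data, and real systems add more sophisticated smoothing than the add-one scheme used here.

```python
# Minimal bigram language model with add-one (Laplace) smoothing.
# The toy corpus is an illustrative placeholder, not real Tamil data.
from collections import Counter

corpus = [
    ["naan", "veettukku", "sendren"],
    ["naan", "palli", "sendren"],
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sent):
    """Probability of a whole sentence under the bigram model."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p
```

As the text notes, a typical word order receives a higher probability than an atypical one; here `sentence_prob(["naan", "palli", "sendren"])` exceeds the probability of the scrambled sequence.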

Dictionaries
For many natural languages, generic dictionaries giving lexical information on the linguistic characteristics of lexical units, such as pronunciation, definition, etymology, grammatical category, etc., are created manually by linguistic experts. Computationally, such dictionaries are of great value for many natural language processing (NLP) tasks. Dictionaries include monolingual or bilingual dictionaries for Tamil, a Tamil thesaurus and other types of dictionaries such as pronunciation dictionaries for speech. We aim to collect such available dictionary resources from institutions and, where possible, collaborate with an institution to create more sophisticated dictionaries that semantically represent knowledge in the form of concepts and relations. Such attempts have already been initiated for the Tamil language.

WordNet for Tamil
WordNet for Tamil is a lexical network tool for the Tamil language. In a wordnet, which is a semantic network, the different lexical categories of words (nouns, verbs) are organised into ‘synsets’ (sets of synonyms). Each synset represents a lexical concept, and synsets can be linked by different types of relations (hypernymy, antonymy, etc.). Tamil WordNet could be used as a basic resource for computational linguistics purposes and language engineering applications such as machine translation, information extraction, word sense disambiguation, knowledge representation, etc. In this project, we aim to capture the relationships for root words in Tamil along with the appropriate number of senses and concepts and build a database for them. This should be accessible through an online interface. We aim to build this product by collaborating with partner institutions who have already worked on creating the Tamil WordNet in the past and build upon it using other resources.
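A minimal sketch of how synsets and their relations might be stored in such a database. The synset IDs, romanised words and glosses below are illustrative placeholders, not entries from an actual Tamil WordNet.

```python
# Toy synset store: each synset groups synonymous lemmas under a gloss
# and links to other synsets by named relations (hypernymy, antonymy, ...).
class Synset:
    def __init__(self, synset_id, lemmas, gloss):
        self.id = synset_id
        self.lemmas = lemmas          # synonymous words in the synset
        self.gloss = gloss            # short definition of the concept
        self.relations = {}           # relation name -> list of synset ids

    def add_relation(self, name, target_id):
        self.relations.setdefault(name, []).append(target_id)

synsets = {}
tree = Synset("ta.n.0001", ["maram"], "a woody perennial plant")
plant = Synset("ta.n.0002", ["thavaram"], "a living organism that grows in soil")
tree.add_relation("hypernym", plant.id)
synsets[tree.id] = tree
synsets[plant.id] = plant

def hypernyms(synset_id):
    """Return the gloss of each hypernym synset linked from the given synset."""
    return [synsets[t].gloss
            for t in synsets[synset_id].relations.get("hypernym", [])]
```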


Language Parsing and Resolution 


Tokenizer
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tamil uses spaces to mark word boundaries, but many Tamil word forms are agglutinative in nature, meaning they glue together at least two words. Those cases can be identified as determiner+noun, noun+postposition, verb+particle, noun+particle, etc. Except for the first pattern (determiner+noun), in all other cases the second part of the word form is restricted and can be listed. So it is possible to split certain Tamil agglutinative word forms into separate tokens. This project would aim to build a module that splits Tamil sentences into words, carefully studying the different word forms appearing in sentences.
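The "restricted and can be listed" observation above suggests a simple longest-suffix approach, sketched below. The romanised particle list is a toy placeholder, not a real Tamil lexicon.

```python
# Split agglutinative word forms using a closed list of second-part units
# (particles/postpositions). The list here is an illustrative placeholder.
SECOND_PARTS = ["kooda", "illai", "thaan"]   # hypothetical particle list

def split_token(word):
    """Split word into [stem, particle] when it ends in a listed unit."""
    for part in SECOND_PARTS:
        if word.endswith(part) and len(word) > len(part):
            return [word[: -len(part)], part]
    return [word]

def tokenize(sentence):
    """Whitespace-split, then split off listed particles from each word."""
    tokens = []
    for word in sentence.split():
        tokens.extend(split_token(word))
    return tokens
```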

Chunker
Chunking is the task of identifying and segmenting text into syntactically correlated word groups. It divides a sentence into its major non-overlapping phrases and attaches a label to each chunk. Tamil, being an agglutinative language, has a complex morphological and syntactic structure. It is a relatively free word order language, but in phrasal and clausal construction it behaves like a fixed word order language. For Tamil, Transformation Based Learning and machine learning techniques using SVMs have been used in building chunkers. Chunks could be Noun Phrases, Verb Phrases, Adjectival Phrases, Adverbial Phrases, Conjunctions and Complementizers. This project would aim to build a module that performs chunking in the above-mentioned categories when given Tamil text.
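A minimal rule-based sketch of noun-phrase chunking over POS-tagged input; the learned approaches cited above (e.g. SVM-based) would replace these rules in practice. The tag names and romanised words are illustrative placeholders.

```python
# Group maximal runs of adjectives/nouns into NP chunks; every other
# token becomes a single-word chunk carrying its own tag.
def chunk_np(tagged):
    """tagged: list of (word, tag). Returns list of (label, words)."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("JJ", "NN"):
            current.append(word)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [word]))
    if current:
        chunks.append(("NP", current))
    return chunks
```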

Sentence Splitter
Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences for a number of reasons. However, sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, a decimal point, an ellipsis, or part of an email address rather than the end of a sentence. Also, question marks and exclamation marks may appear in embedded quotations, emoticons and slang. The task of sentence splitting is quite important for natural language processing tasks such as machine translation, speech recognition and language modelling. This project would aim to build an efficient sentence splitter module that could be used to break Tamil text into meaningful sentences.
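The abbreviation ambiguity mentioned above can be handled with a simple rule, sketched below: break on sentence-final punctuation, but skip a period that directly follows a listed abbreviation. The abbreviation set is an illustrative placeholder.

```python
# Rule-based sentence splitter: break on . ? ! unless the period
# follows a known abbreviation. The abbreviation list is a toy placeholder.
import re

ABBREVIATIONS = {"Dr", "Mr", "No"}

def split_sentences(text):
    sentences = []
    start = 0
    for m in re.finditer(r"[.?!]", text):
        before = text[start:m.start()].split()
        if m.group() == "." and before and before[-1] in ABBREVIATIONS:
            continue                      # abbreviation, not a boundary
        sentence = text[start:m.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)            # text without final punctuation
    return sentences
```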

Part-of-Speech Tagger
Part-of-speech (POS) tagging is the process of assigning a part of speech or other lexical class marker to each and every word in a sentence. Tamil, being a Dravidian language, has a very rich morphological structure which is agglutinative. Tamil words are made up of lexical roots followed by one or more affixes, and therefore tagging a word in Tamil is quite complex. A morphologically rich language such as Tamil has many grammatical categories, which leads to ambiguity. Hence lexical ambiguity must be taken into consideration during the development of an efficient POS tagger. In the past, rule-based methods, hidden Markov models and support vector machines have been used for developing POS taggers for Tamil. This project will aim to build a Tamil POS tagger module that addresses existing challenges and performs with high accuracy.
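A minimal most-frequent-tag baseline, sketched below, is the usual starting point before the HMM or SVM taggers mentioned above. The tiny tagged corpus and tag set are illustrative placeholders, not real annotated Tamil data.

```python
# Most-frequent-tag baseline: each word gets its most common training
# tag; unseen words fall back to NN. Corpus and tags are toy placeholders.
from collections import Counter, defaultdict

tagged_corpus = [
    [("avan", "PRP"), ("palli", "NN"), ("sendraan", "VB")],
    [("aval", "PRP"), ("pustakam", "NN"), ("padithaal", "VB")],
]

tag_counts = defaultdict(Counter)
for sent in tagged_corpus:
    for word, t in sent:
        tag_counts[word][t] += 1

def tag(sentence):
    """Assign each word its most frequent training tag, else NN."""
    return [(w, tag_counts[w].most_common(1)[0][0] if w in tag_counts else "NN")
            for w in sentence]
```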

Grammatical Parser / Dependency Parser
Dependency grammar is based on the idea that the syntactic structure of a sentence consists of binary asymmetrical relations between the words of the sentence. Dependency parsing uses linguistic information to give the relationships between words. In the past, only a few attempts at dependency parsing for Tamil have been reported; they used rule-based and corpus-based approaches in building a dependency parser. An annotation scheme needs to be designed in order to annotate Tamil data with dependency relations to train a dependency parser, and this is a very challenging task. A series of linguistic rules has to be implemented to build dependency structures for Tamil sentences. This project aims to build a dependency parsing module for Tamil which could identify the different grammatical dependencies in Tamil sentences and tag them with the appropriate grammatical structure.

Morphological Analysis
Morphological analysis of a word is the process of segmenting the word into component morphemes and assigning the correct morphosyntactic information to it. For a given word, a morphological analyser (MA) will return its root word and word class, along with other grammatical information depending upon its word class. Tamil is a verb-final, relatively free word order language which has rich inflectional and derivational morphology. Therefore building an efficient morphological analyser is challenging. Finite-state automata and SVM-based approaches have been used in the past to build Tamil morphological analysers. This project would aim to build a Tamil Morphological Analyser module that could efficiently retrieve the root form of a given Tamil word from its inflected form, and this should be applicable to any domain.
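The core idea can be sketched as longest-suffix stripping against a table of suffixes and their morphosyntactic features; real analysers use finite-state transducers as noted above. The romanised suffix table is a toy placeholder, not a real Tamil paradigm.

```python
# Suffix-stripping morphological analysis sketch: return (root, features)
# by removing the longest matching suffix. Table entries are placeholders.
SUFFIX_TABLE = [   # ordered longest first so compound suffixes win
    ("kalukku", {"number": "plural", "case": "dative"}),
    ("kal", {"number": "plural"}),
    ("ukku", {"case": "dative"}),
]

def analyse(word):
    """Strip the longest matching suffix and report its features."""
    for suffix, feats in SUFFIX_TABLE:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], feats
    return word, {}                      # uninflected or unknown form
```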

Morphological Generation
The Morphological Generator takes a lemma and a morpho-lexical description as input and gives a word form as output; it is the reverse process of the Morphological Analyser. A morphological generator is an essential tool in Natural Language Processing (NLP) applications for the generation of final fully-inflected words from a root and a set of morphological properties. It plays an indispensable role in target language sentence generation in Machine Translation (MT) systems. It is also used in Information Retrieval (IR) systems in the front end for query expansion. Finite-state morphological generators have proved to work efficiently for the Tamil language in the past. In this project we aim to build a Tamil Morphological Generator module, which would be built alongside the morphological analysis module.
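The inverse direction can be sketched the same way: map the requested feature combination to a suffix and attach it to the lemma. A production generator would be finite-state; the romanised suffixes here are placeholders mirroring the analysis sketch.

```python
# Lemma + morphological features -> surface form, via a toy suffix map.
FEATURE_SUFFIXES = {
    ("plural", None): "kal",
    (None, "dative"): "ukku",
    ("plural", "dative"): "kalukku",
}

def generate(lemma, number=None, case=None):
    """Attach the suffix selected by the requested features."""
    if number is None and case is None:
        return lemma                      # bare lemma, no inflection
    return lemma + FEATURE_SUFFIXES[(number, case)]
```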

Coreference Resolution
Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing. In many domains we find multiple views, descriptions or names for the same underlying object. Correctly resolving these references is necessary for further processing and understanding of the data. This analysis finds the nouns and phrases that refer to the same entity, enabling the extraction of relations among entities as well as more complex propositions. Simple pattern-matching heuristics have been used in developing named entity coreference resolvers for other languages. Research in this area is still at an early stage for Tamil, and therefore in this project we aim to study this problem and build an efficient named entity coreference analyser module.

Anaphora Resolution
Anaphora is a relation of interpretational dependency between an antecedent referring expression and an anaphoric referring expression. In general, one considers that the referent of the latter is determined by knowledge inferred from the former. The Tamil pronominals ‘avan’, ‘aval’ and ‘atu’ are the third person singular pronouns, with ‘atu’ being the third person singular neuter pronoun. The relation that is established between a pronoun and its antecedent helps to provide more information that can be extracted. Conditional random fields and decision tree learning have been used before in building automatic anaphora resolution systems for Tamil. In this project we aim to build an Anaphora Resolution module that could resolve anaphora in Tamil text and output the referring expressions.

Word sense disambiguation
Word sense disambiguation is the process of identifying the right sense of a word when the word has two or more meanings. This is vital for problems such as machine translation, question answering and information retrieval. The need to disambiguate competing senses of ambiguous words is a crucial issue for all natural language processing activities, including machine translation. Support vector machines are efficient in the classification process of discriminating contexts, thereby selecting the correct sense of the target word. Unsupervised approaches based on clustering methods have been used as well. In this project we aim to build a module for Tamil word sense disambiguation in which, given Tamil text, the correct sense of each word is identified.
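A minimal Lesk-style overlap sketch illustrates the task: choose the sense whose gloss words overlap most with the sentence context. The sense inventory below (a romanised ambiguous word with English sense labels) is an illustrative placeholder; the supervised and clustering methods named above would replace this heuristic.

```python
# Simplified Lesk: pick the sense with the largest context/gloss overlap.
# The sense inventory is a toy placeholder, not a real Tamil resource.
SENSES = {
    "aadu": [
        ("goat", {"animal", "farm", "milk"}),
        ("dance", {"music", "stage", "song"}),
    ],
}

def disambiguate(word, context_words):
    """Return the sense label whose gloss shares the most context words."""
    best_label, best_overlap = None, -1
    for label, gloss_words in SENSES[word]:
        overlap = len(gloss_words & set(context_words))
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label
```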

Named Entity Recognition
Named Entity Recognition is the process of identifying and classifying named entities such as persons, organizations, locations, dates, times and monetary amounts in text documents. Named Entity Recognition is a subtask of Information Extraction, which is the process of extracting useful information from data. Unlike English, there is no concept of capital letters in Tamil, and hence no capitalization information is available for named entities in Tamil. All named entities are nouns and hence are noun phrases, but not all noun phrases are named entities. Conditional random fields and hidden Markov models have been used in the past for named entity recognition in Tamil, but only on a very small scale. In this project we aim to build a Named Entity Recogniser module for Tamil which could identify and tag named entities in a given text.
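Since capitalization cues are unavailable in Tamil, a gazetteer lookup is the simplest baseline, sketched below; the statistical approaches above (CRFs, HMMs) learn from context instead. The romanised name lists are illustrative placeholders.

```python
# Gazetteer-based named entity tagging baseline; entries are placeholders.
GAZETTEER = {
    "yaalpaanam": "LOCATION",
    "kolumbu": "LOCATION",
    "kumar": "PERSON",
}

def tag_entities(tokens):
    """Tag each token with its gazetteer class, or O for non-entities."""
    return [(t, GAZETTEER.get(t.lower(), "O")) for t in tokens]
```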


Machine Translation and Transliteration


Translation between Tamil and English
Machine Translation of natural (human) languages has a long tradition, benefiting from decades of manual and semi-automatic analysis by linguists, sociologists, psychologists and computer scientists, among others. Much of the work in the literature, however, reports on translation between languages within the European family of languages. Recently, though, Statistical Machine Translation approaches have been used for translating Tamil. In particular, rule-based and phrase-based machine translation approaches have been used in previous work to build bilingual translation systems both for English to Tamil and Tamil to English. A large English-Tamil parallel corpus needs to be created and used to train such a translation system. This project aims to build a scalable and accurate translation system for Tamil and English, so that it could be used in many applications useful to the general public.

Translation between Tamil and Sinhala
The goal of this research is to develop a machine translation system for the Sinhala-Tamil language pair to reduce the language barrier between the Sinhala and Tamil communities. We are aiming to develop a modern machine translation system for both Sinhala to Tamil and Tamil to Sinhala directions. Since Sinhala and Tamil are morphologically rich languages, translating between them is a very challenging task. This project needs an efficient Tamil morphological analyser, n-gram language models trained on large amounts of Sinhala and Tamil corpora, and translation models trained on a large Sinhala-Tamil parallel corpus. With the help of statistical machine translation approaches it is possible to build an efficient translation system. This project aims to build a scalable and accurate translation system product for Sinhala and Tamil which could be used in applications that benefit the community.

Translation between Tamil and other Dravidian Languages
Translating between two related languages is a simpler task than translating between totally unrelated languages. This area focuses on languages of the Dravidian family such as Telugu, Malayalam and Kannada. There have been attempts in India to translate between Tamil and Malayalam (a language that is linguistically very similar to Tamil) and between Tamil and Telugu in the past. Again, constructing parallel corpora for training such systems would be challenging. In this project we look forward to collaborating with Indian institutions to build translation systems as products between Tamil and the other Dravidian languages. The systems should be scalable and accurate enough to be considered useful in applications helpful to the relevant communities.

Machine Transliteration from English to Tamil
Transliteration is a process that takes a character string in a source language as input and generates a character string in the target language as output. The process can be seen conceptually as two levels of decoding: segmentation of the source string into transliteration units, and relating the source language transliteration units to units in the target language by resolving different combinations of alignments and unit mappings. Machine transliteration has gained prime importance as a supporting tool for machine translation and cross-language information retrieval, especially when proper names and technical terms are involved. The performance of machine translation and cross-language information retrieval depends heavily on accurate transliteration of named entities. Hence the transliteration model must aim to preserve the phonetic structure of words as closely as possible. There are quite a lot of challenges in developing an English to Tamil transliteration system, since vowels in English may correspond to long or short vowels in Tamil, and some consonants like ‘r’, ‘l’ and ‘n’ have multiple transliterations in Tamil. This project aims to build an efficient English to Tamil transliteration system as a product.
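The two-level view described above can be sketched directly: greedily segment the source string into units (preferring longer matches), then map each unit to a target unit. Both tables are toy placeholders using romanised target units; a real system learns alignments and mappings from data.

```python
# Segment-and-map transliteration sketch; the unit table is a placeholder.
UNIT_MAP = {          # source unit -> romanised target unit
    "ch": "c",
    "aa": "A",        # long vowel
    "a": "a",         # short vowel
    "r": "r",
    "m": "m",
}

def transliterate(word):
    """Greedy longest-match segmentation, then unit-by-unit mapping."""
    out, i = [], 0
    while i < len(word):
        for size in (2, 1):                  # prefer two-character units
            unit = word[i : i + size]
            if unit in UNIT_MAP:
                out.append(UNIT_MAP[unit])
                i += size
                break
        else:
            out.append(word[i])              # pass through unknown characters
            i += 1
    return "".join(out)
```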


Human-Machine Interface


Tamil Text to Speech Generation
An unrestricted text-to-speech system in Tamil is expected to produce a speech signal corresponding to the given text that is highly intelligible to a human listener. The intelligibility and comprehensibility of synthesized speech should be relatively good in naturalistic environments. Furthermore, listeners should be able to clearly perceive the message with little attention, and act on a synthesized command correctly and without perceptible delay in noisy environments. Presently, unit selection-based synthesis (USS) and statistical parametric synthesis techniques are the state-of-the-art techniques for this task. Recently, hidden Markov model (HMM) based speech synthesis has been shown to be successful for Tamil. This project aims to build a Tamil Text to Speech Generation system as a product that efficiently generates Tamil speech given Tamil text.

Tamil Speech to Text Generation
Speech Recognition enables us to convert speech to text, which broadens the user interface. A speech recognition system includes two major components: the Feature Extractor (FE) and the Recogniser. The FE block generates a sequence of feature vectors that represent the input speech signal. It is based on a priori knowledge that is always true and does not change with time. The Recogniser performs the trajectory recognition and generates the correct output word. Artificial neural networks have been used in Tamil speech recognition to classify the feature vectors of the input signal. This project aims to build a Tamil Speech to Text Generation system as a product that efficiently generates text given Tamil speech.

Speaker Recognition Systems to identify Tamil Speakers
As both physical and data security become of greater concern, new methods of uniquely identifying people are emerging. These technologies rely largely upon the physical differences that make us individuals. Because every human being has a unique voice, voice can be used as a form of biometric user verification to physically secure an area, limit access to personnel files or verify a claimant’s identity over the telephone. Each speaker recognition system has two phases: enrollment and verification. During enrollment, the speaker’s voice is recorded and typically a number of features are extracted to form a voice print, template, or model. In the verification phase, a speech sample or “utterance” is compared against a previously created voice print. This project would focus on creating a Speaker Recognition system as a product to identify Tamil speakers.
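The verification phase described above can be sketched as a similarity comparison between an enrolled voice print and an utterance's feature vector. Real systems extract MFCC-style features and use statistical models; the vectors and threshold below are toy placeholders.

```python
# Verification sketch: accept the claimed identity when the cosine
# similarity between voice print and utterance features clears a threshold.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify(voice_print, utterance_features, threshold=0.9):
    """Return True when the utterance matches the enrolled voice print."""
    return cosine(voice_print, utterance_features) >= threshold
```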

Prepare Speaker Recognition database for Tamil Speakers
The availability of good speech databases is crucial for development and assessment in all branches of speech research. This database would be a useful resource to aid in the assessment of speaker recognition systems in general, and in comparing systems in particular. Intra-speaker and inter-speaker variability are important parameters for a speech database, and speech databases are most commonly classified into single-session and multi-session databases. The aim of this project would be to create a new, reliable speaker recognition database for Tamil speakers.

Deep Learning Approaches to improve accuracy of Speech Recognition Systems
Automatic speech recognition, the translation of spoken words into text, is still a challenging task due to the high variability in speech signals. Deep learning, sometimes referred to as representation learning or unsupervised feature learning, is a new area of machine learning. Deep learning is becoming a mainstream technology for speech recognition and has successfully replaced Gaussian mixtures for speech recognition and feature coding at an increasingly large scale. The main target of this project is to apply typical deep learning algorithms, including deep neural networks (DNNs) and deep belief networks (DBNs), to automatic continuous speech recognition, in order to improve and build upon the existing speech recognition systems for Tamil.

Language Identification System for Tamil
In natural language processing, language identification or language guessing is the problem of determining which natural language a given piece of content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods. One of the great bottlenecks of language identification systems is distinguishing between closely related languages. Similar languages present significant lexical and structural overlap, making it challenging for systems to discriminate between them. This project would aim at building a language identification system for Tamil.
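A standard statistical method for this task is character n-gram profiling, sketched below: build a bigram frequency profile per language and pick the closest profile for new text. The tiny romanised training snippets are toy placeholders; real systems use large corpora and Unicode script cues.

```python
# Character bigram profile language identification sketch.
from collections import Counter

def profile(text, n=2):
    """Character n-gram frequency profile of a text."""
    text = text.lower()
    return Counter(text[i : i + n] for i in range(len(text) - n + 1))

TRAIN = {   # toy romanised training snippets, one per language
    "tamil": profile("vanakkam nandri vaanga"),
    "english": profile("hello thank you welcome"),
}

def identify(text):
    """Pick the language whose profile shares the most bigram mass."""
    p = profile(text)
    def score(lang):
        return sum(min(p[g], TRAIN[lang][g]) for g in p)
    return max(TRAIN, key=score)
```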

Tamil Optical Character Recognition System
The process of character recognition involves extraction of defined characteristics called features to classify an unknown character into one of the known classes. Therefore, OCR involves two processes: feature extraction and classification. Character recognition is very tough in Tamil, since many inter-class dependencies exist: many Tamil letters look alike, so classification becomes a big challenge. Rule-based classifiers and SVMs have been used recently for classification in Tamil OCR. The large collection of online textbooks and other relevant material already captured in image formats could be of great use once a Tamil OCR system can read such images, so that the data could be used for research purposes. This project would build an efficient OCR system for Tamil that could be made available as a product.

Modern Input methods for Tamil Language
Touch and gesture interfaces are becoming popular in modern user interfaces. This research theme is centred on developing modern text input methods for the Tamil language. Moving away from conventional keyboard-based input methods, this research focuses on developing standardised cross-platform tools specifically designed for Tamil text input. Dynamic keyboards, handwriting and gesture inputs are a few examples that need to be developed for the Tamil language. Particular emphasis will be placed on mobile text entry. Tap-a-tap, multi-tap and predictive text input are some of the popular text input mechanisms used in mobile computing devices such as PDAs and mobile phones. For messaging, the limitations of the telephone keypad pose a great challenge in keying in text. The common solution is the multi-tap approach, where the characters are spread across the keypad and the user has to press a key an appropriate number of times for a specific character. A better and more efficient alternative is to use predictive text input algorithms. The idea here is to use an optimised language model for intelligent texting, which improves input speed tremendously. This project would aim to build an intelligent texting system for Tamil based on predictive input which could be made available on mobile devices.
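The predictive input idea can be sketched as ranking word completions for a typed prefix by corpus frequency. The romanised frequency list below is an illustrative placeholder standing in for a learned language model.

```python
# Prefix-based predictive text sketch: rank completions by frequency.
from collections import Counter

word_freq = Counter({   # placeholder frequencies from an assumed corpus
    "vanakkam": 50,
    "vandhaan": 30,
    "vanam": 10,
    "nandri": 40,
})

def predict(prefix, k=2):
    """Top-k most frequent words starting with the typed prefix."""
    candidates = [(w, f) for w, f in word_freq.items() if w.startswith(prefix)]
    candidates.sort(key=lambda wf: -wf[1])
    return [w for w, _ in candidates[:k]]
```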


Evolutionary Study of Language


Evolution of Tamil
Evidence from archaeology and literature demonstrates that Tamil is one of the most ancient languages in the world. Geological, cultural and synthetic influences caused the language to evolve in the past. Besides preserving the language, it is important to understand the process of its evolution in order to develop an in-depth understanding of the current form of Tamil. This novel research concerns developing a mathematical model for representing the trends of the Tamil language. Opportunities for exploiting cutting-edge parallel computing and simulation techniques are considered in this task. The fundamental research in this task includes finding interconnections with languages in close proximity, such as Sanskrit, and comparing the present language forms of the geographical regions where Tamil is believed to be the root language.

Supporting old Tamil scripts
The current Tamil script has evolved over many centuries. Important paleographic information is preserved in old Tamil scripts. This area focuses on creating appropriate fonts that enable the faithful digitisation and representation of these paleographic documents for research and dissemination. We would also focus on encoding these old scripts and any rare symbols into the Unicode standard.

Digital Paleography for Tamil
Digital Paleography is an area concerned with the dating of texts and writer recognition based on the scribal handwriting present in paleographic texts. Tamil being one of the earliest attested languages in India, there are thousands of paleographic artefacts, such as epigraphs, that span two millennia. In many cases it is very hard to date these documents to ascertain their provenance. In such cases, digital paleographic methods can be extremely useful. Similar to OCR, the characters are converted into feature vectors and compared against a set of known characters, in order to place a text under a certain period of time or scribe.

Supporting Tamil philology
Philology is the study of literary texts. Philologists are frequently involved in ascertaining the authenticity of texts, studying textual variation and producing critical editions of texts based on multiple sources. The bulk of this work is currently done manually; this project aims to introduce computational techniques to enable Tamil philologists to produce critical texts more effectively.


Knowledge Engineering


Knowledge Retrieval
Digital and electronic media provide unprecedented ways of communication and information handling. Vast quantities of information have been exchanged through, digital documents, social media and internet. Traditionally the information has been articulated thorough textual formats and natural human language. Data retrieval is the basic form of information access, while information and knowledge retrievals are respectively higher levels of information access. Technology for simple data access is generally saturated and does not vary largely across languages. However, information and knowledge access methods are more complex and heavily depended on languages. Knowledge retrieval techniques and tools when applied to, Tamil digital archives could provide new historical insights into the structured data. One such example could be possible generation of historical timelines.