Language Resource Development for Tamil

Summary: Developing NLP applications in any language requires Lexical Resources, Corpora and Computational Models. The scarcity of these resources for Tamil is one of the major reasons for the slow growth of NLP work in the language. Through this project we aim to narrow that gap by creating the language resources needed for Tamil NLP research and making them available to the research community. Language Resources are data-only resources such as tag sets, corpora, lexicons, language models, dictionaries and thesauri, which processing resources require in order to build an application or product related to natural language processing.

We aim to create a corpus for Tamil from digitized resources such as textbooks, magazines, newspapers and Government documents. Where appropriate, we will also collect Tamil lexical resources that are already available from existing institutional bodies. The corpus will then be organised into sentences so that it can be used extensively in Tamil NLP research areas such as Language Modelling, Resolution, Parsing and Machine Translation. As part of the corpus collection, we also intend to develop a web scraper that extracts Tamil sentences from websites. Parallel corpora and annotated corpora are also important for NLP research, and we intend to create these resources in collaboration with institutions and people who can work on them.
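The sentence-organisation step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the punctuation-based splitting rule, and the `min_ratio` and `min_chars` thresholds are all assumptions made for the example. It keeps only those sentences whose characters fall mostly within the Tamil Unicode block (U+0B80 to U+0BFF).

```python
import re

# Characters in the Tamil Unicode block (U+0B80 to U+0BFF).
TAMIL_CHAR = re.compile(r"[\u0B80-\u0BFF]")

# Assumption: sentences end with '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def tamil_ratio(text):
    """Return the fraction of non-space characters that are Tamil."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for ch in chars if TAMIL_CHAR.match(ch))
    return tamil / len(chars)


def extract_tamil_sentences(raw_text, min_ratio=0.6, min_chars=5):
    """Split raw text into sentences and keep the mostly-Tamil ones.

    min_ratio and min_chars are hypothetical thresholds chosen for
    illustration; a real corpus pipeline would tune them.
    """
    # Collapse runs of whitespace (including newlines) into single spaces.
    normalised = re.sub(r"\s+", " ", raw_text).strip()
    sentences = []
    for candidate in SENTENCE_END.split(normalised):
        candidate = candidate.strip()
        if len(candidate) >= min_chars and tamil_ratio(candidate) >= min_ratio:
            sentences.append(candidate)
    return sentences
```

In a scraper, `raw_text` would be the visible text already extracted from a page; fetching and HTML parsing are deliberately left out of this sketch. A purely punctuation-based splitter will mis-split abbreviations and numbers, so a production pipeline would likely need a more careful sentence segmenter.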

Building separate lexical resources for Sri Lankan Tamil is an important task, since Sri Lankan Tamil differs in the way it is written and spoken. Its grammatical structures follow the ancient Tamil script, which is quite complex to analyse with existing Morphological analysers. Through this project we also intend to build lexical resources for Sri Lankan Tamil that will be useful in creating NLP applications involving it.

Researchers Ms. Riyafa Abdul Hameed, Ms. Maryam Ziyad
Mentors Dr. Saatviga Sudhahar
Source Code https://github.com/ctnlpr/Language-Resource-Dev