Entity Extraction in Tamil Tweets

Summary: Social media text, such as Twitter posts, holds a wide range of useful information. Extracting that information is one of the most basic tasks in Natural Language Processing and is known as entity extraction. Entities are real-world objects such as person names, organization names, product names, and location names, and they are often referred to as named entities. Entity extraction is the automatic identification of named entities in a text document. It is an important component of several higher-level language technology systems, such as information extraction, machine translation, and cross-lingual information access systems. This project aims to develop an entity extraction system for Tamil Twitter data using word2vec word embedding models. Results are compared across different embedding sizes, and n-gram and gazetteer features are to be included.

The FIRE2015 dataset and a custom collection of Tamil tweets are used to train and test the system. The approach combines Word2vec embeddings, Sent2vec embeddings, and Support Vector Machines.
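
As a rough illustration of how embedding and gazetteer features might be combined for each token before classification, here is a minimal Python sketch. It is not taken from the project's source code. The toy vectors, the gazetteer entries, and the names `TOY_EMBEDDINGS`, `LOCATION_GAZETTEER`, `char_ngrams`, `token_features`, and `tweet_features` are all assumptions made for illustration. The n-grams are shown at the character level, which is only one possible reading of the n-gram features mentioned above.

```python
from typing import Dict, List, Set

# Example tokens (Tamil for "Chennai", "rain", "Coimbatore").
CHENNAI = "சென்னை"
RAIN = "மழை"
COIMBATORE = "கோவை"

# Toy 3-dimensional vectors standing in for trained word2vec output.
# The values are invented; a real system would load learned vectors.
EMBEDDING_SIZE = 3
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    CHENNAI: [0.9, 0.1, 0.0],
    RAIN: [0.1, 0.8, 0.2],
}

# Hypothetical gazetteer of known location names.
LOCATION_GAZETTEER: Set[str] = {CHENNAI, COIMBATORE}


def char_ngrams(token: str, n: int) -> List[str]:
    """Return all contiguous character n-grams of a token.

    Python slices by code point, so Tamil vowel signs may be split
    from their base consonants; a real system may need to segment
    by grapheme cluster instead.
    """
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def token_features(token: str) -> List[float]:
    """Embedding vector (zeros if unknown) followed by a gazetteer flag."""
    vector = TOY_EMBEDDINGS.get(token, [0.0] * EMBEDDING_SIZE)
    in_gazetteer = 1.0 if token in LOCATION_GAZETTEER else 0.0
    return list(vector) + [in_gazetteer]


def tweet_features(tokens: List[str]) -> List[List[float]]:
    """Build one feature vector per token of a tokenized tweet."""
    return [token_features(token) for token in tokens]
```

Each resulting vector has `EMBEDDING_SIZE + 1` entries, so changing the embedding size changes the input dimension seen by the classifier. Vectors of this kind, paired with entity labels, could then be passed to a Support Vector Machine implementation for training.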

Partner Institution: Center for Computational Engineering and Networking (CEN), Amrita Vishwa Vidyapeetham, Coimbatore, India
Supervisor: Dr. M. Anand Kumar
Research Students: Remmiya Devi G.
Source Code: https://github.com/ctnlpr/Entityextractor-Tweets