- NET Web Desk
A researcher at Assam Don Bosco University (ADBU), Dr Medari Janai Tham, has developed a Natural Language Processing (NLP) resource, the “Tham Khasi Annotated Corpus”, for computational processing of the Khasi language.
NLP is a set of computational techniques for analyzing and synthesizing human language, including speech and text.
Furthermore, building a corpus, a collection of machine-readable text, is an important step in developing NLP systems for a language.
The British National Corpus (BNC), for example, is among the most extensively used English corpora and is popular with scholars because of its accessibility.
Because no publicly available corpus has existed for Khasi, the language is classified as resource-poor.
However, the release of the “Tham Khasi Annotated Corpus”, which is accessible through the European Language Resources Association (ELRA), is a significant contribution in this area.
To keep the tagging standardized and consistent with other Indian languages, the corpus has been manually annotated using the BIS (Bureau of Indian Standards) PoS (Part-of-Speech) tagset.
Tham was awarded a doctorate by the Computer Science and Engineering Department, ADBU, for her thesis ‘Shallow Parsing for Khasi’, written under the guidance of Prof. Pushpak Bhattacharyya of IIT Bombay.
Details of the corpus, including the annotation scheme and the development of Khasi NLP tools, are given in research papers published as part of her PhD and at www.grammarkhasi.in, which also serves as the companion website of the book “Ka Grammar Khasi Da Ka Jingdro”, published by Macmillan Education, India.
Tham's other contributions include the BIS Khasi tagset, a hybrid Khasi PoS tagger, an HMM (hidden Markov model) Khasi PoS tagger, an NLTK Khasi PoS tagger, an HMM Khasi shallow parser, and a Khasi shallow parser based on a bi-directional gated recurrent unit.
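For readers unfamiliar with HMM-based PoS tagging, the general idea can be sketched in a few lines of Python. This is only an illustrative sketch of the generic technique (count-based training with add-one smoothing, then Viterbi decoding), not Tham's implementation. The training sentences are made-up English placeholders with simplified tags; they are not taken from the Tham Khasi Annotated Corpus and do not use the BIS tagset. All function and variable names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up toy data: English placeholder words with simplified tags.
# NOT drawn from the Tham Khasi Annotated Corpus or the BIS tagset.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag] -> count
    emit = defaultdict(Counter)   # emit[tag][word] -> count
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    trans, emit, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


MODEL = train_hmm(TRAIN)
print(viterbi(["a", "cat", "runs"], MODEL))  # ['DET', 'NOUN', 'VERB']
```

A real tagger would be trained on an annotated corpus rather than three toy sentences, and would typically need a better strategy for unseen words than add-one smoothing.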