Deep Learning-based Sequence Labeling Tools for Nepali

Rai, Pooja; Chatterji, Sanjay; Kim, Byung-Gyu

doi:10.1145/3606696

Authors: Rai, Pooja; Chatterji, Sanjay; Kim, Byung-Gyu

Issue Date: Aug-2023

Publisher: Association for Computing Machinery

Keywords: Deep learning-based Nepali tools; Nepali sequence labeling tools; Nepali chunker; BI-LSTM-CRF neural network; Nepali text feature selection; Nepali optimum feature set

Citation: ACM Transactions on Asian and Low-Resource Language Information Processing, v.22, no.8

Journal Title: ACM Transactions on Asian and Low-Resource Language Information Processing

Volume: 22

Number: 8

URI: https://scholarworks.sookmyung.ac.kr/handle/2020.sw.sookmyung/151614

DOI: 10.1145/3606696

ISSN: 2375-4699; 2375-4702

Abstract: Part-of-Speech (POS) taggers and chunkers (shallow parsers) are sequence labeling tools that are crucial for improving the accuracy of Natural Language Processing (NLP) tasks such as parsing, named entity recognition, sentiment analysis, and information extraction. Developing such tools for a low-resource language is an arduous task. Nepali is a relatively resource-poor Indian language and has seen little development from a computational perspective. We therefore present effective POS tagging and chunking tools for Nepali text using sequential deep learning models: a Bidirectional Long Short-Term Memory network with a Conditional Random Field layer (BI-LSTM-CRF) and other LSTM-based models, exploring both character and word embeddings of Nepali text. Word embeddings are used to capture syntactic and semantic information, whereas character embeddings are used to capture the morphological and shape information of words and to handle the out-of-vocabulary problem. The developed chunker is the first statistical chunker for the Nepali language. A baseline Conditional Random Field model has also been developed to identify the optimum feature set for these tasks. The BI-LSTM-CRF model achieved accuracies of 99.20% and 98.40% for Nepali POS tagging and chunking, respectively. These are the highest accuracies reported for Nepali so far. A thorough error analysis and observations are also reported with examples. The developed tools can help advance research in Nepali language processing, improve the accuracy of language technology applications, and contribute to the preservation and promotion of the Nepali language.
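To illustrate what the CRF layer in a BI-LSTM-CRF adds on top of per-token scores, the following is a minimal sketch of Viterbi decoding in plain Python. It is not the authors' implementation. The function name viterbi_decode, the three chunk tags, and every score below are invented for illustration; in a real model the emission scores would come from the BI-LSTM and the transition scores would be learned.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the tag sequence with the highest total score.

    emissions[t][tag]        -- score of `tag` for token t (hypothetical values)
    transitions[(prev, cur)] -- score for moving from prev to cur; missing pairs count as 0
    """
    if not emissions:
        return []
    # best[t][tag] holds the best score of any path that ends in `tag` at token t.
    best = [{tag: emissions[0].get(tag, NEG_INF) for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t].get(cur, NEG_INF)
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Illustrative chunk tags in BIO style; all scores are made up.
tags = ["B-NP", "I-NP", "O"]
emissions = [
    {"O": 1.0, "B-NP": 0.8, "I-NP": 0.1},
    {"O": 0.5, "B-NP": 0.2, "I-NP": 1.0},
    {"O": 1.0, "B-NP": 0.1, "I-NP": 0.3},
]
# Strongly penalise the invalid move O -> I-NP (a chunk cannot continue after O).
transitions = {("O", "I-NP"): -10.0}

# Picking the best tag per token independently yields an invalid sequence.
greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]
decoded = viterbi_decode(emissions, transitions, tags)
print(greedy)
print(decoded)
```

With these made-up scores, choosing the best tag for each token independently gives O, I-NP, O, which is not a valid chunk sequence. Decoding with the transition penalty gives B-NP, I-NP, O instead. This kind of sequence-level constraint is the usual motivation for placing a CRF layer on top of an LSTM tagger.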


Related Researcher: Kim, Byung Gyu, College of Engineering (Division of Artificial Intelligence Engineering)
