Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

G Ramesh, S Doddapaneni, A Bheemaraj… - Transactions of the …, 2022 - direct.mit.edu
We present Samanantar, the largest publicly available parallel corpora collection for Indic
languages. The collection contains a total of 49.7 million sentence pairs between English …

Overview of the 8th Workshop on Asian Translation

T Nakazawa, H Nakayama, C Ding… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the shared tasks from the 8th workshop on Asian
translation (WAT2021). For the WAT2021, 28 teams participated in the shared tasks and 24 …

Part-of-speech tagging of Odia language using statistical and deep learning based approaches

T Dalai, TK Mishra, PK Sa - ACM Transactions on Asian and Low …, 2023 - dl.acm.org
Automatic part-of-speech (POS) tagging is a preprocessing step of many natural language
processing tasks, such as named entity recognition, speech processing, information …

The LTRC Hindi-Telugu parallel corpus

V Mujadia, DM Sharma - Proceedings of the Thirteenth Language …, 2022 - aclanthology.org
We present the Hindi-Telugu Parallel Corpus of different technical domains such as
Natural Science, Computer Science, Law and Healthcare along with the General domain …

Building a Llama2-finetuned LLM for Odia language utilizing domain knowledge instruction set

GS Kohli, S Parida, S Sekhar, S Saha, NB Nair… - Proceedings of the …, 2023 - dl.acm.org
Building LLMs for languages other than English is in great demand due to the unavailability
and performance of multilingual LLMs, such as understanding the local context. The …

Improving Access to Justice for the Indian Population: A Benchmark for Evaluating Translation of Legal Text to Indian Languages

S Mahapatra, D Datta, S Soni, A Goswami… - arXiv preprint arXiv …, 2023 - arxiv.org
Most legal text in the Indian judiciary is written in complex English due to historical reasons.
However, only about 10% of the Indian population is comfortable in reading English. Hence …

Language technologies for low resource languages: Sociolinguistic and multilingual insights

AS Doğruöz, S Sitaram - Proceedings of the 1st Annual Meeting of …, 2022 - aclanthology.org
There is a growing interest in building language technologies (LTs) for low resource
languages (LRLs). However, there are flaws in the planning, data collection and …

Addressing the data gap: building a parallel corpus for Kashmiri language

SMU Qumar, M Azim, SMK Quadri - International Journal of Information …, 2024 - Springer
This paper marks a significant step forward in language technology for low-resource
languages by developing the first parallel corpus for the Kashmiri language, which …

Exploring pair-wise NMT for Indian languages

K Akella, SH Allu, SS Ragupathi, A Singhal… - arXiv preprint arXiv …, 2020 - arxiv.org
In this paper, we address the task of improving pair-wise machine translation for specific low
resource Indian languages. Multilingual NMT models have demonstrated a reasonable …

Universal Dependency Treebank for Odia Language

S Parida, K Sahoo, AK Ojha, S Sahoo, SR Dash… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper presents the first publicly available treebank of Odia, a morphologically rich low
resource Indian language. The treebank contains approx. 1082 tokens (100 sentences) in …