The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset

H Laurençon, L Saulnier, T Wang… - Advances in …, 2022 - proceedings.neurips.cc
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …

Overview of the 8th Workshop on Asian Translation

T Nakazawa, H Nakayama, C Ding… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the shared tasks from the 8th workshop on Asian
translation (WAT2021). For the WAT2021, 28 teams participated in the shared tasks and 24 …

A multilingual parallel corpora collection effort for Indian languages

S Siripragada, J Philip, VP Namboodiri… - arXiv preprint arXiv …, 2020 - arxiv.org
We present sentence aligned parallel corpora across 10 Indian Languages - Hindi, Telugu,
Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English - many of …

Revisiting low resource status of Indian languages in machine translation

J Philip, S Siripragada, VP Namboodiri… - Proceedings of the 3rd …, 2021 - dl.acm.org
Indian language machine translation performance is hampered due to the lack of large scale
multi-lingual sentence aligned corpora and robust benchmarks. Through this paper, we …

Part-of-speech tagging of Odia language using statistical and deep learning based approaches

T Dalai, TK Mishra, PK Sa - ACM Transactions on Asian and Low …, 2023 - dl.acm.org
Automatic part-of-speech (POS) tagging is a preprocessing step of many natural language
processing tasks, such as named entity recognition, speech processing, information …

A large-scale evaluation of neural machine transliteration for Indic languages

A Kunchukuttan, S Jain, R Kejriwal - … of the 16th Conference of the …, 2021 - aclanthology.org
We take up the task of large-scale evaluation of neural machine transliteration between
English and Indic languages, with a focus on multilingual transliteration to utilize …

Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform

R Jaworski, S Seljan, I Dunđer - Information, 2023 - mdpi.com
Parallel corpora have been widely used in the fields of natural language processing and
translation as they provide crucial multilingual information. They are used to train machine …

Efficiently reusing old models across languages via transfer learning

T Kocmi, O Bojar - arXiv preprint arXiv:1909.10955, 2019 - arxiv.org
Recent progress in neural machine translation is directed towards larger neural networks
trained on an increasing amount of hardware resources. As a result, NMT models are costly …

Open machine translation for low resource South American languages (AmericasNLP 2021 shared task contribution)

S Parida, S Panda, A Dash… - First Workshop on …, 2021 - biblio.ugent.be
This paper describes the team (“Tamalli”)'s submission to AmericasNLP2021 shared task on
Open Machine Translation for low resource South American languages. Our goal was to …

The reality of multi-lingual machine translation

T Kocmi, D Macháček, O Bojar - arXiv preprint arXiv:2202.12814, 2022 - arxiv.org
Our book "The Reality of Multi-Lingual Machine Translation" discusses the benefits and
perils of using more than two languages in machine translation systems. While focused on …