The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …
Overview of the 8th workshop on Asian translation
T Nakazawa, H Nakayama, C Ding… - Proceedings of the …, 2021 - aclanthology.org
This paper presents the results of the shared tasks from the 8th workshop on Asian
translation (WAT2021). For the WAT2021, 28 teams participated in the shared tasks and 24 …
A multilingual parallel corpora collection effort for Indian languages
We present sentence aligned parallel corpora across 10 Indian Languages (Hindi, Telugu,
Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi) and English, many of …
Revisiting low resource status of Indian languages in machine translation
Indian language machine translation performance is hampered due to the lack of large scale
multi-lingual sentence aligned corpora and robust benchmarks. Through this paper, we …
Part-of-speech tagging of Odia language using statistical and deep learning based approaches
Automatic part-of-speech (POS) tagging is a preprocessing step of many natural language
processing tasks, such as named entity recognition, speech processing, information …
A large-scale evaluation of neural machine transliteration for Indic languages
A Kunchukuttan, S Jain, R Kejriwal - … of the 16th Conference of the …, 2021 - aclanthology.org
We take up the task of large-scale evaluation of neural machine transliteration between
English and Indic languages, with a focus on multilingual transliteration to utilize …
Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform
Parallel corpora have been widely used in the fields of natural language processing and
translation as they provide crucial multilingual information. They are used to train machine …
Efficiently reusing old models across languages via transfer learning
Recent progress in neural machine translation is directed towards larger neural networks
trained on an increasing amount of hardware resources. As a result, NMT models are costly …
Open machine translation for low resource South American languages (AmericasNLP 2021 shared task contribution)
This paper describes the team (“Tamalli”)'s submission to AmericasNLP2021 shared task on
Open Machine Translation for low resource South American languages. Our goal was to …
The reality of multi-lingual machine translation
Our book "The Reality of Multi-Lingual Machine Translation" discusses the benefits and
perils of using more than two languages in machine translation systems. While focused on …