A survey of multilingual neural machine translation

R Dabre, C Chu, A Kunchukuttan - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
We present a survey on multilingual neural machine translation (MNMT), which has gained
a lot of traction in recent years. MNMT has been useful in improving translation quality as a …

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

T Kudo, J Richardson - arXiv preprint arXiv:1808.06226, 2018 - arxiv.org
This paper describes SentencePiece, a language-independent subword tokenizer and
detokenizer designed for Neural-based text processing, including Neural Machine …

Survey of low-resource machine translation

B Haddow, R Bawden, AVM Barone, J Helcl… - Computational …, 2022 - direct.mit.edu
We present a survey covering the state of the art in low-resource machine translation (MT)
research. There are currently around 7,000 languages spoken in the world and almost all …

Subword regularization: Improving neural network translation models with multiple subword candidates

T Kudo - arXiv preprint arXiv:1804.10959, 2018 - arxiv.org
Subword units are an effective way to alleviate the open vocabulary problems in neural
machine translation (NMT). While sentences are usually converted into unique subword …

Levenshtein transformer

J Gu, C Wang, J Zhao - Advances in neural information …, 2019 - proceedings.neurips.cc
Modern neural sequence generation models are built to either generate tokens step-by-step
from scratch or (iteratively) modify a sequence of tokens bounded by a fixed length. In this …

Multilingual translation from denoising pre-training

Y Tang, C Tran, X Li, PJ Chen, N Goyal… - Findings of the …, 2021 - aclanthology.org
Recent work demonstrates the potential of training one model for multilingual machine
translation. In parallel, denoising pretraining using unlabeled monolingual data as a starting …

CCMatrix: Mining billions of high-quality parallel sentences on the web

H Schwenk, G Wenzek, S Edunov, E Grave… - arXiv preprint arXiv …, 2019 - arxiv.org
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

G Ramesh, S Doddapaneni, A Bheemaraj… - Transactions of the …, 2022 - direct.mit.edu
We present Samanantar, the largest publicly available parallel corpora collection for Indic
languages. The collection contains a total of 49.7 million sentence pairs between English …

A survey of domain adaptation for neural machine translation

C Chu, R Wang - arXiv preprint arXiv:1806.00258, 2018 - arxiv.org
Neural machine translation (NMT) is a deep learning based approach for machine
translation, which yields the state-of-the-art translation performance in scenarios where …

A voyage on neural machine translation for Indic languages

SK Sheshadri, D Gupta, MR Costa-Jussà - Procedia Computer Science, 2023 - Elsevier
With the invention of deep learning concepts, Machine Translation (MT) migrated towards
Neural Machine Translation (NMT) architectures, eventually from Statistical Machine …