A survey of multilingual neural machine translation
We present a survey on multilingual neural machine translation (MNMT), which has gained
a lot of traction in recent years. MNMT has been useful in improving translation quality as a …
Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing
T Kudo, J Richardson - arXiv preprint arXiv:1808.06226, 2018 - arxiv.org
This paper describes SentencePiece, a language-independent subword tokenizer and
detokenizer designed for Neural-based text processing, including Neural Machine …
Survey of low-resource machine translation
We present a survey covering the state of the art in low-resource machine translation (MT)
research. There are currently around 7,000 languages spoken in the world and almost all …
Subword regularization: Improving neural network translation models with multiple subword candidates
T Kudo - arXiv preprint arXiv:1804.10959, 2018 - arxiv.org
Subword units are an effective way to alleviate the open vocabulary problems in neural
machine translation (NMT). While sentences are usually converted into unique subword …
Levenshtein transformer
Modern neural sequence generation models are built to either generate tokens step-by-step
from scratch or (iteratively) modify a sequence of tokens bounded by a fixed length. In this …
Multilingual translation from denoising pre-training
Recent work demonstrates the potential of training one model for multilingual machine
translation. In parallel, denoising pretraining using unlabeled monolingual data as a starting …
CCMatrix: Mining billions of high-quality parallel sentences on the web
We show that margin-based bitext mining in a multilingual sentence space can be applied to
monolingual corpora of billions of sentences. We are using ten snapshots of a curated …
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
G Ramesh, S Doddapaneni, A Bheemaraj… - Transactions of the …, 2022 - direct.mit.edu
We present Samanantar, the largest publicly available parallel corpora collection for Indic
languages. The collection contains a total of 49.7 million sentence pairs between English …
A survey of domain adaptation for neural machine translation
Neural machine translation (NMT) is a deep learning based approach for machine
translation, which yields the state-of-the-art translation performance in scenarios where …
A voyage on neural machine translation for Indic languages
With the invention of deep learning concepts, Machine Translation (MT) migrated towards
Neural Machine Translation (NMT) architectures, eventually from Statistical Machine …