Ammus: A survey of transformer-based pretrained models in natural language processing
KS Kalyan, A Rajasekharan, S Sangeetha - arXiv preprint arXiv …, 2021 - arxiv.org
Transformer-based pretrained language models (T-PTLMs) have achieved great success in
almost every NLP task. The evolution of these models started with GPT and BERT. These …
Neural machine translation for low-resource languages: A survey
S Ranathunga, ESA Lee, M Prifti Skenduli… - ACM Computing …, 2023 - dl.acm.org
Neural Machine Translation (NMT) has seen tremendous growth in the last ten years since
the early 2000s and has already entered a mature phase. While considered the most widely …
A metaverse: Taxonomy, components, applications, and open challenges
SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different …
Prompting large language model for machine translation: A case study
Research on prompting has shown excellent performance with little or even no supervised
training across many tasks. However, prompting for machine translation is still under …
The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
One of the biggest challenges hindering progress in low-resource and multilingual machine
translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either …
Madlad-400: A multilingual and document-level large audited dataset
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …
Documenting large webtext corpora: A case study on the colossal clean crawled corpus
Large language models have led to remarkable progress on many NLP tasks, and
researchers are turning to ever-larger text corpora to train them. Some of the largest corpora …
Chatgpt perpetuates gender bias in machine translation and ignores non-gendered pronouns: Findings across bengali and five other low-resource languages
S Ghosh, A Caliskan - Proceedings of the 2023 AAAI/ACM Conference …, 2023 - dl.acm.org
In this multicultural age, language translation is one of the most performed tasks, and it is
becoming increasingly AI-moderated and automated. As a novel AI system, ChatGPT claims …
InfoXLM: An information-theoretic framework for cross-lingual language model pre-training
In this work, we present an information-theoretic framework that formulates cross-lingual
language model pre-training as maximizing mutual information between multilingual-multi …
MLQA: Evaluating cross-lingual extractive question answering
Question answering (QA) models have shown rapid progress enabled by the availability of
large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to …