AMMUS: A survey of transformer-based pretrained models in natural language processing

KS Kalyan, A Rajasekharan, S Sangeetha - arXiv preprint arXiv …, 2021 - arxiv.org
Transformer-based pretrained language models (T-PTLMs) have achieved great success in
almost every NLP task. The evolution of these models started with GPT and BERT. These …

Neural machine translation for low-resource languages: A survey

S Ranathunga, ESA Lee, M Prifti Skenduli… - ACM Computing …, 2023 - dl.acm.org
Neural Machine Translation (NMT) has seen tremendous growth since the early 2000s and has
already entered a mature phase. While considered the most widely …

A metaverse: Taxonomy, components, applications, and open challenges

SM Park, YG Kim - IEEE Access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different …

Prompting large language model for machine translation: A case study

B Zhang, B Haddow, A Birch - International Conference on …, 2023 - proceedings.mlr.press
Research on prompting has shown excellent performance with little or even no supervised
training across many tasks. However, prompting for machine translation is still under …

The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

N Goyal, C Gao, V Chaudhary, PJ Chen… - Transactions of the …, 2022 - direct.mit.edu
One of the biggest challenges hindering progress in low-resource and multilingual machine
translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either …

MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

Documenting large webtext corpora: A case study on the colossal clean crawled corpus

J Dodge, M Sap, A Marasović, W Agnew… - arXiv preprint arXiv …, 2021 - arxiv.org
Large language models have led to remarkable progress on many NLP tasks, and
researchers are turning to ever-larger text corpora to train them. Some of the largest corpora …

ChatGPT perpetuates gender bias in machine translation and ignores non-gendered pronouns: Findings across Bengali and five other low-resource languages

S Ghosh, A Caliskan - Proceedings of the 2023 AAAI/ACM Conference …, 2023 - dl.acm.org
In this multicultural age, language translation is one of the most performed tasks, and it is
becoming increasingly AI-moderated and automated. As a novel AI system, ChatGPT claims …

InfoXLM: An information-theoretic framework for cross-lingual language model pre-training

Z Chi, L Dong, F Wei, N Yang, S Singhal… - arXiv preprint arXiv …, 2020 - arxiv.org
In this work, we present an information-theoretic framework that formulates cross-lingual
language model pre-training as maximizing mutual information between multilingual-multi …

MLQA: Evaluating cross-lingual extractive question answering

P Lewis, B Oğuz, R Rinott, S Riedel… - arXiv preprint arXiv …, 2019 - arxiv.org
Question answering (QA) models have shown rapid progress enabled by the availability of
large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to …