MIRACL: A multilingual retrieval dataset covering 18 diverse languages

X Zhang, N Thakur, O Ogundepo… - Transactions of the …, 2023 - direct.mit.edu
MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively
encompass over three billion native speakers around the world. This resource is designed to …
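
To make the resource concrete, here is a minimal sketch of loading one language split through the Hugging Face datasets library; the dataset id, config name ("sw" for Swahili), split, and field names are assumptions to verify against the dataset card, not details taken from the abstract.

```python
# Hypothetical loading sketch for MIRACL via Hugging Face datasets.
# Dataset id, config, split, and field names below are assumptions.
from datasets import load_dataset

dev = load_dataset("miracl/miracl", "sw", split="dev")  # Swahili, one of 18 languages
for example in dev.select(range(3)):
    print(example["query_id"], example["query"])
    # each topic carries its relevance-judged passages
    print(len(example["positive_passages"]), "positive passages")
```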

Making a MIRACL: Multilingual information retrieval across a continuum of languages

X Zhang, N Thakur, O Ogundepo, E Kamalloo… - arXiv preprint arXiv …, 2022 - arxiv.org
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a
multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc …

Cross-language information retrieval

P Galuščáková, DW Oard, S Nair - arXiv preprint arXiv:2111.05988, 2021 - arxiv.org
Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can
choose words for their query that might appear in the documents that they wish to see, and …

Text embedding inversion security for multilingual language models

Y Chen, H Lent, J Bjerva - … of the 62nd Annual Meeting of the …, 2024 - aclanthology.org
Textual data is often represented as real-numbered embeddings in NLP, particularly with the
popularity of large language models (LLMs) and Embeddings as a Service (EaaS) …
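
The threat model behind this line of work can be illustrated with a toy sketch: if an attacker can embed candidate texts with the same (or a similar) EaaS model, even a nearest-neighbor lookup partially recovers the content of a leaked embedding. The random vectors below stand in for real embeddings; this illustrates the setting, not the paper's inversion method.

```python
# Toy illustration of the embedding-inversion threat model (not the paper's
# method): random vectors stand in for EaaS embeddings of a known corpus.
import numpy as np

rng = np.random.default_rng(0)
corpus = ["transfer $500 to account 7", "meeting at 10am", "reset my password"]
corpus_emb = rng.normal(size=(len(corpus), 8))           # stand-in embeddings
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

leaked = corpus_emb[0] + 0.05 * rng.normal(size=8)       # intercepted, slightly noisy
leaked /= np.linalg.norm(leaked)

scores = corpus_emb @ leaked                             # cosine similarities
print("recovered candidate:", corpus[int(np.argmax(scores))])
```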

Overview of the TREC 2023 NeuCLIR Track

D Lawrie, S MacAvaney, J Mayfield… - arXiv preprint arXiv …, 2024 - arxiv.org
The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR)
track is to study the impact of neural approaches to cross-language information retrieval. The …

Toward best practices for training multilingual dense retrieval models

X Zhang, K Ogueji, X Ma, J Lin - ACM Transactions on Information …, 2023 - dl.acm.org
Dense retrieval models using a transformer-based bi-encoder architecture have emerged as
an active area of research. In this article, we focus on the task of monolingual retrieval in a …
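
For readers unfamiliar with the architecture, a minimal bi-encoder sketch: one shared encoder embeds queries and passages independently, and relevance is scored by dot product. The model name and mean-pooling choice are illustrative assumptions, not the training recipe studied in the article.

```python
# Minimal bi-encoder sketch: shared multilingual encoder, mean pooling,
# dot-product scoring. Model choice is an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding
    return (hidden * mask).sum(1) / mask.sum(1)          # mean pooling

q = embed(["What is the capital of Kenya?"])
p = embed(["Nairobi is the capital and largest city of Kenya.",
           "Mombasa is a coastal city in Kenya."])
print(q @ p.T)                                           # higher = more relevant
```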

XRICL: Cross-lingual retrieval-augmented in-context learning for cross-lingual text-to-SQL semantic parsing

P Shi, R Zhang, H Bai, J Lin - arXiv preprint arXiv:2210.13693, 2022 - arxiv.org
In-context learning using large language models has recently shown surprising results for
semantic parsing tasks such as Text-to-SQL translation. Prompting GPT-3 or Codex using …
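
The core mechanics of retrieval-augmented in-context learning are simple to sketch: retrieved exemplars are prepended to the prompt ahead of the target question. The exemplars here are hard-coded and the prompt format is an assumption for illustration, not XRICL's exact template.

```python
# Sketch of retrieval-augmented prompting for Text-to-SQL. In a real system
# the exemplars would be retrieved by cross-lingual similarity; here they are
# hard-coded, and the prompt format is an illustrative assumption.
exemplars = [
    ("How many singers are there?", "SELECT count(*) FROM singer"),
    ("List all stadium names.",     "SELECT name FROM stadium"),
]

def build_prompt(question, schema):
    parts = [f"-- Schema: {schema}"]
    for q, sql in exemplars:
        parts.append(f"-- Q: {q}\n{sql}")
    parts.append(f"-- Q: {question}\nSELECT")             # model completes the SQL
    return "\n\n".join(parts)

print(build_prompt("How many stadiums hold over 5000 people?",
                   "stadium(id, name, capacity)"))
```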

C3: Continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval

E Yang, S Nair, R Chandradevan… - Proceedings of the 45th …, 2022 - dl.acm.org
Pretrained language models have improved effectiveness on numerous tasks, including
ad-hoc retrieval. Recent work has shown that continuing to pretrain a language model with …
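
Contrastive weak supervision of this kind typically reduces to an in-batch InfoNCE objective over weakly aligned text pairs; the sketch below shows that loss in isolation, with the pairing source (e.g., aligned document sections) assumed rather than reproduced from the paper.

```python
# In-batch contrastive (InfoNCE) loss over weakly supervised pairs:
# q_emb[i] and d_emb[i] are positives; other in-batch rows act as negatives.
import torch
import torch.nn.functional as F

def info_nce(q_emb, d_emb, temperature=0.05):
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(q.size(0))             # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(4, 768), torch.randn(4, 768))
print(loss.item())
```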

AfriCLIRMatrix: Enabling cross-lingual information retrieval for African languages

O Ogundepo, X Zhang, S Sun, K Duh… - Proceedings of the 2022 …, 2022 - aclanthology.org
Language diversity in NLP is critical in enabling the development of tools for a wide
range of users. However, there are limited resources for building such tools for many …

BLADE: Combining vocabulary pruning and intermediate pretraining for scalable neural CLIR

S Nair, E Yang, D Lawrie, J Mayfield… - Proceedings of the 46th …, 2023 - dl.acm.org
Learning sparse representations using pretrained language models enhances the
monolingual ranking effectiveness. Such representations are sparse vectors in the …
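
As background for the sparse-representation family BLADE builds on, a SPLADE-style sketch: token states are projected through the masked-language-modeling head, saturated with log(1 + ReLU), and max-pooled into one vocabulary-sized sparse vector per text. The model choice is an assumption; BLADE's vocabulary pruning and intermediate pretraining are not shown.

```python
# SPLADE-style learned sparse representation sketch (background for BLADE;
# vocabulary pruning and intermediate pretraining are not shown here).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def sparse_rep(text):
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**batch).logits                     # (1, T, |V|)
    weights = torch.log1p(torch.relu(logits))            # log-saturated activations
    return weights.max(dim=1).values.squeeze(0)          # max-pool to (|V|,)

vec = sparse_rep("cross-language information retrieval")
top = vec.topk(5)
print(tok.convert_ids_to_tokens(top.indices.tolist()))
```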