MIRACL: A multilingual retrieval dataset covering 18 diverse languages
MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively
encompass over three billion native speakers around the world. This resource is designed to …
Making a MIRACL: Multilingual information retrieval across a continuum of languages
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a
multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc …
Cross-language information retrieval
Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can
choose words for their query that might appear in the documents that they wish to see, and …
Text embedding inversion security for multilingual language models
Textual data is often represented as real-numbered embeddings in NLP, particularly with the
popularity of large language models (LLMs) and Embeddings as a Service (EaaS) …
Overview of the TREC 2023 NeuCLIR Track
The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR)
track is to study the impact of neural approaches to cross-language information retrieval. The …
Toward best practices for training multilingual dense retrieval models
Dense retrieval models using a transformer-based bi-encoder architecture have emerged as
an active area of research. In this article, we focus on the task of monolingual retrieval in a …
XRICL: Cross-lingual retrieval-augmented in-context learning for cross-lingual Text-to-SQL semantic parsing
In-context learning using large language models has recently shown surprising results for
semantic parsing tasks such as Text-to-SQL translation. Prompting GPT-3 or Codex using …
C3: Continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval
Pretrained language models have improved effectiveness on numerous tasks, including ad hoc
retrieval. Recent work has shown that continuing to pretrain a language model with …
AfriCLIRMatrix: Enabling cross-lingual information retrieval for African languages
Language diversity in NLP is critical in enabling the development of tools for a wide
range of users. However, there are limited resources for building such tools for many …
BLADE: combining vocabulary pruning and intermediate pretraining for scaleable neural CLIR
Learning sparse representations using pretrained language models enhances the
monolingual ranking effectiveness. Such representations are sparse vectors in the …