MIRACL: A multilingual retrieval dataset covering 18 diverse languages

X Zhang, N Thakur, O Ogundepo… - Transactions of the …, 2023 - direct.mit.edu
MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively
encompass over three billion native speakers around the world. This resource is designed to …
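
To make the resource concrete, here is a minimal sketch of loading one language split through the Hugging Face datasets library; the dataset id, config name ("sw" for Swahili), split, and field names are assumptions to verify against the dataset card, not details taken from the abstract.

```python
# Hypothetical loading sketch for MIRACL via Hugging Face datasets.
# Dataset id, config, split, and field names below are assumptions.
from datasets import load_dataset

dev = load_dataset("miracl/miracl", "sw", split="dev")  # Swahili, one of 18 languages
for example in dev.select(range(3)):
    print(example["query_id"], example["query"])
    # each topic carries its relevance-judged passages
    print(len(example["positive_passages"]), "positive passages")
```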

Making a MIRACL: Multilingual information retrieval across a continuum of languages

X Zhang, N Thakur, O Ogundepo, E Kamalloo… - arXiv preprint arXiv …, 2022 - arxiv.org
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a
multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc …

Cross-language information retrieval

P Galuščáková, DW Oard, S Nair - arXiv preprint arXiv:2111.05988, 2021 - arxiv.org
Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can
choose words for their query that might appear in the documents that they wish to see, and …

Text embedding inversion security for multilingual language models

Y Chen, H Lent, J Bjerva - … of the 62nd Annual Meeting of the …, 2024 - aclanthology.org
Textual data is often represented as real-numbered embeddings in NLP, particularly with the
popularity of large language models (LLMs) and Embeddings as a Service (EaaS) …
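
The threat model behind this line of work can be illustrated with a toy sketch: if an attacker can embed candidate texts with the same (or a similar) EaaS model, even a nearest-neighbor lookup partially recovers the content of a leaked embedding. The random vectors below stand in for real embeddings; this illustrates the setting, not the paper's inversion method.

```python
# Toy illustration of the embedding-inversion threat model (not the paper's
# method): random vectors stand in for EaaS embeddings of a known corpus.
import numpy as np

rng = np.random.default_rng(0)
corpus = ["transfer $500 to account 7", "meeting at 10am", "reset my password"]
corpus_emb = rng.normal(size=(len(corpus), 8))           # stand-in embeddings
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

leaked = corpus_emb[0] + 0.05 * rng.normal(size=8)       # intercepted, slightly noisy
leaked /= np.linalg.norm(leaked)

scores = corpus_emb @ leaked                             # cosine similarities
print("recovered candidate:", corpus[int(np.argmax(scores))])
```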

Overview of the TREC 2023 NeuCLIR Track

D Lawrie, S MacAvaney, J Mayfield… - arXiv preprint arXiv …, 2024 - arxiv.org
The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR)
track is to study the impact of neural approaches to cross-language information retrieval. The …

Toward best practices for training multilingual dense retrieval models

X Zhang, K Ogueji, X Ma, J Lin - ACM Transactions on Information …, 2023 - dl.acm.org
Dense retrieval models using a transformer-based bi-encoder architecture have emerged as
an active area of research. In this article, we focus on the task of monolingual retrieval in a …
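
For readers unfamiliar with the architecture, a minimal bi-encoder sketch: one shared encoder embeds queries and passages independently, and relevance is scored by dot product. The model name and mean-pooling choice are illustrative assumptions, not the training recipe studied in the article.

```python
# Minimal bi-encoder sketch: shared multilingual encoder, mean pooling,
# dot-product scoring. Model choice is an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding
    return (hidden * mask).sum(1) / mask.sum(1)          # mean pooling

q = embed(["What is the capital of Kenya?"])
p = embed(["Nairobi is the capital and largest city of Kenya.",
           "Mombasa is a coastal city in Kenya."])
print(q @ p.T)                                           # higher = more relevant
```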

XRICL: Cross-lingual retrieval-augmented in-context learning for cross-lingual text-to-SQL semantic parsing

P Shi, R Zhang, H Bai, J Lin - arXiv preprint arXiv:2210.13693, 2022 - arxiv.org
In-context learning using large language models has recently shown surprising results for
semantic parsing tasks such as Text-to-SQL translation. Prompting GPT-3 or Codex using …
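
The core mechanics of retrieval-augmented in-context learning are simple to sketch: retrieved exemplars are prepended to the prompt ahead of the target question. The exemplars here are hard-coded and the prompt format is an assumption for illustration, not XRICL's exact template.

```python
# Sketch of retrieval-augmented prompting for Text-to-SQL. In a real system
# the exemplars would be retrieved by cross-lingual similarity; here they are
# hard-coded, and the prompt format is an illustrative assumption.
exemplars = [
    ("How many singers are there?", "SELECT count(*) FROM singer"),
    ("List all stadium names.",     "SELECT name FROM stadium"),
]

def build_prompt(question, schema):
    parts = [f"-- Schema: {schema}"]
    for q, sql in exemplars:
        parts.append(f"-- Q: {q}\n{sql}")
    parts.append(f"-- Q: {question}\nSELECT")             # model completes the SQL
    return "\n\n".join(parts)

print(build_prompt("How many stadiums hold over 5000 people?",
                   "stadium(id, name, capacity)"))
```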

C3: Continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval

E Yang, S Nair, R Chandradevan… - Proceedings of the 45th …, 2022 - dl.acm.org
Pretrained language models have improved effectiveness on numerous tasks, including
ad-hoc retrieval. Recent work has shown that continuing to pretrain a language model with …
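
Contrastive weak supervision of this kind typically reduces to an in-batch InfoNCE objective over weakly aligned text pairs; the sketch below shows that loss in isolation, with the pairing source (e.g., aligned document sections) assumed rather than reproduced from the paper.

```python
# In-batch contrastive (InfoNCE) loss over weakly supervised pairs:
# q_emb[i] and d_emb[i] are positives; other in-batch rows act as negatives.
import torch
import torch.nn.functional as F

def info_nce(q_emb, d_emb, temperature=0.05):
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(q.size(0))             # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(4, 768), torch.randn(4, 768))
print(loss.item())
```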

AfriCLIRMatrix: Enabling cross-lingual information retrieval for African languages

O Ogundepo, X Zhang, S Sun, K Duh… - Proceedings of the 2022 …, 2022 - aclanthology.org
Language diversity in NLP is critical in enabling the development of tools for a wide
range of users. However, there are limited resources for building such tools for many …

BLADE: Combining vocabulary pruning and intermediate pretraining for scalable neural CLIR

S Nair, E Yang, D Lawrie, J Mayfield… - Proceedings of the 46th …, 2023 - dl.acm.org
Learning sparse representations using pretrained language models enhances the
monolingual ranking effectiveness. Such representations are sparse vectors in the …
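
As background for the sparse-representation family BLADE builds on, a SPLADE-style sketch: token states are projected through the masked-language-modeling head, saturated with log(1 + ReLU), and max-pooled into one vocabulary-sized sparse vector per text. The model choice is an assumption; BLADE's vocabulary pruning and intermediate pretraining are not shown.

```python
# SPLADE-style learned sparse representation sketch (background for BLADE;
# vocabulary pruning and intermediate pretraining are not shown here).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def sparse_rep(text):
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**batch).logits                     # (1, T, |V|)
    weights = torch.log1p(torch.relu(logits))            # log-saturated activations
    return weights.max(dim=1).values.squeeze(0)          # max-pool to (|V|,)

vec = sparse_rep("cross-language information retrieval")
top = vec.topk(5)
print(tok.convert_ids_to_tokens(top.indices.tolist()))
```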