Vector search with OpenAI embeddings: Lucene is all you need

J Xian, T Teofili, R Pradeep, J Lin - … Conference on Web Search and Data …, 2024 - dl.acm.org
We provide a reproducible, end-to-end demonstration of vector search with OpenAI
embeddings using Lucene on the popular MS MARCO passage ranking test collection. The …

Salient phrase aware dense retrieval: can a dense retriever imitate a sparse one?

X Chen, K Lakhotia, B Oğuz, A Gupta, P Lewis… - arXiv preprint arXiv …, 2021 - arxiv.org
Despite their recent popularity and well-known advantages, dense retrievers still lag behind
sparse methods such as BM25 in their ability to reliably match salient phrases and rare …

Tevatron: An efficient and flexible toolkit for neural retrieval

L Gao, X Ma, J Lin, J Callan - Proceedings of the 46th International ACM …, 2023 - dl.acm.org
Recent rapid advances in deep pre-trained language models and the introduction of large
datasets have powered research in embedding-based neural retrieval. While many …

Simple yet effective neural ranking and reranking baselines for cross-lingual information retrieval

J Lin, D Alfonso-Hermelo, V Jeronymo… - arXiv preprint arXiv …, 2023 - arxiv.org
The advent of multilingual language models has generated a resurgence of interest in cross-
lingual information retrieval (CLIR), which is the task of searching documents in one …

Resources for brewing BEIR: Reproducible reference models and statistical analyses

E Kamalloo, N Thakur, C Lassance, X Ma… - Proceedings of the 47th …, 2024 - dl.acm.org
BEIR is a benchmark dataset originally designed for zero-shot evaluation of retrieval models
across 18 different domain/task combinations. In recent years, we have witnessed the …

Resources for brewing BEIR: reproducible reference models and an official leaderboard

E Kamalloo, N Thakur, C Lassance, X Ma… - arXiv preprint arXiv …, 2023 - arxiv.org
BEIR is a benchmark dataset for zero-shot evaluation of information retrieval models across
18 different domain/task combinations. In recent years, we have witnessed the growing …

Pre-processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering

MS Tamber, R Pradeep, J Lin - European Conference on Information …, 2023 - Springer
One of the contributions of the landmark Dense Passage Retriever (DPR) work is the
curation of a corpus of passages generated from Wikipedia articles that have been …

Enhancing Biomedical Question Answering with Large Language Models

H Yang, S Li, T Gonçalves - Information, 2024 - mdpi.com
In the field of Information Retrieval, biomedical question answering is a specialized task that
focuses on answering questions related to medical and healthcare domains. The goal is to …

Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes

X Ma, T Teofili, J Lin - Proceedings of the 32nd ACM International …, 2023 - dl.acm.org
Anserini is a Lucene-based toolkit for reproducible information retrieval research in Java that
has been gaining traction in the community. It provides retrieval capabilities for both …

Multi-stage Literature Retrieval System Trained by PubMed Search Logs for Biomedical Question Answering

A Shin, Q Jin, Z Lu - CLEF (Working Notes), 2023 - ceur-ws.org
This paper discusses our submission to the 2023 BioASQ challenge, document retrieval
subtask (subtask B, phase A). In the subtask, systems must return top 10 most relevant …