Improving text embeddings with large language models
In this paper, we introduce a novel and simple method for obtaining high-quality text
embeddings using only synthetic data and less than 1k training steps. Unlike existing …
BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation
In this paper, we present a new embedding model, called M3-Embedding, which is
distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It …
Multilingual E5 text embeddings: A technical report
This technical report presents the training methodology and evaluation results of the open-
source multilingual E5 text embedding models, released in mid-2023. Three embedding …
Gecko: Versatile text embeddings distilled from large language models
We present Gecko, a compact and versatile text embedding model. Gecko achieves strong
retrieval performance by leveraging a key idea: distilling knowledge from large language …
mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval
We present systematic efforts in building long-context multilingual text representation model
(TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base …
Repetition improves language model embeddings
Recent approaches to improving the extraction of text embeddings from autoregressive
large language models (LLMs) have largely focused on improvements to data, backbone …
NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) grounds large language model (LLM) output by
leveraging external knowledge sources to reduce factual hallucinations. However, prior …
Making text embedders few-shot learners
Large language models (LLMs) with decoder-only architectures demonstrate remarkable in-
context learning (ICL) capabilities. This feature enables them to effectively handle both …
JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources
B Clavié - arXiv preprint arXiv:2407.20750, 2024 - arxiv.org
Neural Information Retrieval has advanced rapidly in high-resource languages, but progress
in lower-resource ones such as Japanese has been hindered by data scarcity, among other …
Towards Better Monolingual Japanese Retrievers with Multi-Vector Models
B Clavié - arXiv preprint arXiv:2312.16144, 2023 - ben.clavie.eu
Document retrieval in many languages has largely relied on multi-lingual models,
leveraging the vast wealth of English training data. In Japanese, the best performing …