Improving text embeddings with large language models
In this paper, we introduce a novel and simple method for obtaining high-quality text
embeddings using only synthetic data and less than 1k training steps. Unlike existing …
BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation
In this paper, we present a new embedding model, called M3-Embedding, which is
distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It …
Multilingual E5 text embeddings: A technical report
This technical report presents the training methodology and evaluation results of the open-
source multilingual E5 text embedding models, released in mid-2023. Three embedding …
Gecko: Versatile text embeddings distilled from large language models
We present Gecko, a compact and versatile text embedding model. Gecko achieves strong
retrieval performance by leveraging a key idea: distilling knowledge from large language …
mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval
We present systematic efforts in building long-context multilingual text representation model
(TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base …
Repetition improves language model embeddings
Recent approaches to improving the extraction of text embeddings from autoregressive
large language models (LLMs) have largely focused on improvements to data, backbone …
NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) grounds large language model (LLM) output by
leveraging external knowledge sources to reduce factual hallucinations. However, prior …
Making text embedders few-shot learners
Large language models (LLMs) with decoder-only architectures demonstrate remarkable in-
context learning (ICL) capabilities. This feature enables them to effectively handle both …
JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources
B Clavié - arXiv preprint arXiv:2407.20750, 2024 - arxiv.org
Neural Information Retrieval has advanced rapidly in high-resource languages, but progress
in lower-resource ones such as Japanese has been hindered by data scarcity, among other …
Towards Better Monolingual Japanese Retrievers with Multi-Vector Models
B Clavié - arXiv preprint arXiv:2312.16144, 2023 - ben.clavie.eu
Document retrieval in many languages has largely relied on multi-lingual models,
leveraging the vast wealth of English training data. In Japanese, the best performing …