Improving text embeddings with large language models

L Wang, N Yang, X Huang, L Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we introduce a novel and simple method for obtaining high-quality text
embeddings using only synthetic data and less than 1k training steps. Unlike existing …
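
A minimal sketch of the pooling scheme this line of work popularised: embed text with a decoder-only LLM by taking the hidden state of the last token, with a task instruction prepended to queries. The model name is the checkpoint released with this paper; the template and EOS handling follow its model card, but treat the details as illustrative.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "intfloat/e5-mistral-7b-instruct"  # checkpoint released with the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
if tokenizer.pad_token is None:            # Llama-style tokenizers ship no pad token
    tokenizer.pad_token = tokenizer.eos_token

def embed(texts):
    # The model card appends EOS and takes the final hidden state of the
    # last non-padding token as the sentence embedding.
    texts = [t + tokenizer.eos_token for t in texts]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    last = batch["attention_mask"].sum(dim=1) - 1          # last real token per row
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)

# Queries carry an instruction prefix; documents are embedded as-is.
q = "Instruct: Given a web search query, retrieve relevant passages\nQuery: what are text embeddings?"
print(embed([q]).shape)
```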

BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

J Chen, S Xiao, P Zhang, K Luo, D Lian… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present a new embedding model, called M3-Embedding, which is
distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It …
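
The three retrieval modes named in the title can be fused at scoring time. A sketch of that fusion, assuming precomputed dense vectors, lexical weight dictionaries, and per-token multi-vector matrices; the equal weights are an illustrative assumption, not the paper's learned setting.

```python
import numpy as np

def m3_style_score(q_dense, d_dense, q_lex, d_lex, Q_multi, D_multi,
                   w=(1/3, 1/3, 1/3)):
    s_dense = float(q_dense @ d_dense)                        # single-vector dot product
    shared = q_lex.keys() & d_lex.keys()                      # lexical overlap
    s_sparse = sum(q_lex[t] * d_lex[t] for t in shared)
    s_multi = float((Q_multi @ D_multi.T).max(axis=1).sum())  # late interaction (MaxSim)
    return w[0] * s_dense + w[1] * s_sparse + w[2] * s_multi

rng = np.random.default_rng(0)
score = m3_style_score(
    rng.normal(size=64), rng.normal(size=64),
    {"embedding": 0.9, "text": 0.4}, {"embedding": 0.7, "model": 0.5},
    rng.normal(size=(6, 64)), rng.normal(size=(40, 64)),
)
print(score)
```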

Multilingual E5 text embeddings: A technical report

L Wang, N Yang, X Huang, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
This technical report presents the training methodology and evaluation results of the open-
source multilingual E5 text embedding models, released in mid-2023. Three embedding …
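
Usage is straightforward; per the released model cards, queries and passages must carry "query: " and "passage: " prefixes. A minimal sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # small/base/large released
q = model.encode(["query: how do multilingual embeddings work?"],
                 normalize_embeddings=True)
p = model.encode(["passage: E5 models map text in many languages into one vector space."],
                 normalize_embeddings=True)
print(q @ p.T)  # cosine similarity, since both sides are L2-normalised
```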

Gecko: Versatile text embeddings distilled from large language models

J Lee, Z Dai, X Ren, B Chen, D Cer, JR Cole… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Gecko, a compact and versatile text embedding model. Gecko achieves strong
retrieval performance by leveraging a key idea: distilling knowledge from large language …
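
A schematic of the two-step distillation loop the abstract alludes to, with the LLM and the first-stage retriever passed in as callables. Both the prompts and the callables are illustrative stand-ins, not Gecko's actual pipeline.

```python
def make_training_example(passage, corpus, llm, retrieve, k=20):
    # Step 1: the LLM reads a sampled passage and proposes a retrieval task
    # plus a query that the passage could answer.
    task = llm(f"Propose a retrieval task this passage could serve:\n{passage}")
    query = llm(f"Write a '{task}' query answered by this passage:\n{passage}")
    # Step 2: retrieve neighbours with an existing embedder and let the LLM
    # relabel them, so the final positive may differ from the seed passage
    # and the negatives are hard rather than random.
    candidates = retrieve(query, corpus, k=k)
    positive = llm(f"Pick the passage that best answers '{query}':\n{candidates}")
    negative = llm(f"Pick a plausible but wrong passage for '{query}':\n{candidates}")
    return {"task": task, "query": query, "positive": positive, "negative": negative}
```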

mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval

X Zhang, Y Zhang, D Long, W Xie, Z Dai, J Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present systematic efforts in building a long-context multilingual text representation model
(TRM) and a reranker from scratch for text retrieval. We first introduce a text encoder (base …
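
The two deliverables compose into the usual retrieve-then-rerank pipeline. A sketch assuming the released GTE checkpoints and a recent sentence-transformers; the names and arguments are best-effort assumptions.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("Alibaba-NLP/gte-multilingual-base",
                               trust_remote_code=True)
reranker = CrossEncoder("Alibaba-NLP/gte-multilingual-reranker-base",
                        trust_remote_code=True)

query = "what is late interaction retrieval?"
docs = ["Late interaction keeps one vector per token ...",
        "RoPE lets encoders extrapolate to longer contexts ..."]

# Stage 1: dense retrieval with the long-context text representation model.
scores = embedder.encode([query]) @ embedder.encode(docs).T
shortlist = [docs[i] for i in scores[0].argsort()[::-1][:10]]
# Stage 2: cross-encoder reranking of the shortlist.
pairs = [(query, d) for d in shortlist]
reranked = sorted(zip(reranker.predict(pairs), shortlist), reverse=True)
print(reranked[0][1])
```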

Repetition improves language model embeddings

JM Springer, S Kotha, D Fried, G Neubig… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent approaches to improving the extraction of text embeddings from autoregressive
large language models (LLMs) have largely focused on improvements to data, backbone …
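
The trick ("echo embeddings") is to feed the text twice, so that tokens in the second copy can attend to the entire input despite the causal mask, and to pool only over the second occurrence. A minimal sketch with an off-the-shelf causal LM; the prompt wording and the token-boundary bookkeeping are approximations.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # any causal LM works for the sketch
lm = AutoModel.from_pretrained("gpt2")

def echo_embed(text):
    prefix = f"Rewrite the sentence: {text}\nRewritten sentence:"
    full = prefix + " " + text
    # Approximate the boundary by tokenising the prefix alone; exact
    # alignment depends on the tokenizer's merges at the seam.
    n_prefix = len(tok(prefix)["input_ids"])
    ids = tok(full, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**ids).last_hidden_state[0]   # (T, H)
    return hidden[n_prefix:].mean(dim=0)          # pool only the second copy

print(echo_embed("Text embeddings benefit from repetition.").shape)
```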

NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation

N Thakur, L Bonifacio, X Zhang, O Ogundepo… - arXiv preprint arXiv …, 2023 - arxiv.org
Retrieval-augmented generation (RAG) grounds large language model (LLM) output by
leveraging external knowledge sources to reduce factual hallucinations. However, prior …
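
The benchmark measures two failure modes: answering from irrelevant passages (hallucination) and abstaining when the answer is present (error). A sketch of those rates, with definitions paraphrased from the abstract's framing rather than the paper's exact protocol:

```python
def robustness_rates(runs):
    """runs: list of (relevant_passage_present: bool, model_abstained: bool)."""
    non_answerable = [(p, a) for p, a in runs if not p]
    answerable = [(p, a) for p, a in runs if p]
    # Hallucination: the model answers even though no relevant passage exists.
    hallucination_rate = sum(not a for _, a in non_answerable) / max(len(non_answerable), 1)
    # Error: the model says "I don't know" although the answer is retrievable.
    error_rate = sum(a for _, a in answerable) / max(len(answerable), 1)
    return hallucination_rate, error_rate

print(robustness_rates([(False, False), (False, True), (True, False), (True, True)]))
```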

Making text embedders few-shot learners

C Li, MH Qin, S Xiao, J Chen, K Luo, Y Shao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) with decoder-only architectures demonstrate remarkable in-
context learning (ICL) capabilities. This feature enables them to effectively handle both …
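
The idea is to exploit that ICL capability directly: prepend a few worked (query, response) demonstrations of the task before the text to embed, then encode the whole string with the usual last-token pooling. The template below is an illustrative assumption, not the paper's exact format.

```python
def build_icl_input(task, examples, query):
    # Each demonstration shows the task once with a solved query/response pair.
    demos = "\n\n".join(
        f"Instruct: {task}\nQuery: {ex_query}\nResponse: {ex_response}"
        for ex_query, ex_response in examples
    )
    return f"{demos}\n\nInstruct: {task}\nQuery: {query}"

examples = [("what is dense retrieval?",
             "Dense retrieval encodes queries and passages as vectors.")]
print(build_icl_input("Given a question, retrieve passages that answer it",
                      examples, "how do embeddings work?"))
```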

JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

B Clavié - arXiv preprint arXiv:2407.20750, 2024 - arxiv.org
Neural Information Retrieval has advanced rapidly in high-resource languages, but progress
in lower-resource ones such as Japanese has been hindered by data scarcity, among other …
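
Multi-vector (ColBERT-style) retrievers like JaColBERT score with late interaction: each query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed (MaxSim). A minimal sketch:

```python
import numpy as np

def maxsim(Q, D):
    """Q: (query_tokens, dim), D: (doc_tokens, dim); rows L2-normalised."""
    return float((Q @ D.T).max(axis=1).sum())

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 128));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D = rng.normal(size=(80, 128)); D /= np.linalg.norm(D, axis=1, keepdims=True)
print(maxsim(Q, D))
```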

Towards Better Monolingual Japanese Retrievers with Multi-Vector Models

B Clavié - arXiv preprint arXiv:2312.16144, 2023 - ben.clavie.eu
Document retrieval in many languages has largely relied on multilingual models,
leveraging the vast wealth of English training data. In Japanese, the best performing …