TLDR
SFR-Embedding-Mistral marks a significant advancement in text-embedding models, building on the foundations of E5-mistral-7b-instruct and Mistral-7B-v0.1.
The main takeaways are:
- SFR-Embedding-Mistral ranks first on the MTEB benchmark with an average score of 67.6 across 56 datasets, achieving state-of-the-art text-embedding performance.
- The model demonstrates a substantial improvement in retrieval performance, leaping from a score of 56.9 for E5-mistral-7b-instruct to an impressive 59.0.
- In clustering tasks, the SFR-Embedding-Mistral model exhibits a notable absolute improvement of +1.4 compared to E5-mistral-7b-instruct.
Model checkpoint: https://huggingface.co/Salesforce/SFR-Embedding-Mistral
Training Details
The SFR-Embedding-Mistral model is trained on data from a broad array of tasks across multiple domains. For retrieval tasks, it leverages data from question answering, fact-checking, and general information retrieval sources. Clustering tasks use research-article data from the scientific and medical domains (excluding any data used in MTEB to prevent contamination). For classification tasks, the model incorporates datasets covering online reviews, sentiment analysis, intent classification, and toxic-conversation detection. For Semantic Textual Similarity (STS), it is trained on multiple benchmark datasets, while for reranking tasks it uses data from scientific literature and technical forums.
In clustering and classification, labels are treated as documents: the contrastive loss is applied only to each example's own negatives, without in-batch negatives. For the other tasks, the contrastive loss also uses in-batch negatives. This report primarily presents results on the development sets of the MTEB benchmark.
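A minimal sketch of the two loss variants described above is shown below. It assumes L2-normalized embeddings and an illustrative temperature of 0.05 (not specified in this report); clustering and classification data would use `use_in_batch_negatives=False`, the remaining tasks `True`.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, negs, use_in_batch_negatives=True, temperature=0.05):
    """InfoNCE-style contrastive loss (illustrative sketch).

    q:    (B, D)    query embeddings
    pos:  (B, D)    positive document embeddings
    negs: (B, K, D) K hard-negative embeddings per query
    """
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos, dim=-1)
    negs = F.normalize(negs, dim=-1)

    # Similarity of each query to its own positive and hard negatives: (B, 1 + K)
    pos_sim = (q * pos).sum(-1, keepdim=True)
    neg_sim = torch.einsum("bd,bkd->bk", q, negs)
    logits = torch.cat([pos_sim, neg_sim], dim=1)

    if use_in_batch_negatives:
        # Other queries' positives in the batch also act as negatives: (B, B)
        in_batch = q @ pos.t()
        in_batch.fill_diagonal_(float("-inf"))  # own positive already counted above
        logits = torch.cat([logits, in_batch], dim=1)

    # The positive sits at index 0 of every row
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits / temperature, labels)
```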
We fine-tune e5-mistral-7b-instruct using a batch size of 2,048, a learning rate of 1e-5, and a learning-rate warmup of 100 steps followed by linear decay. Each query-document pair is batched with 7 hard negatives. We employ a maximum sequence length of 128 for queries and 256 for documents. This fine-tuning process took approximately 15 hours on 8 A100 GPUs. LoRA adapters with rank r=8 are added to all linear layers, resulting in 21M trainable parameters. Our implementation is based on the HuggingFace PEFT library at https://github.com/huggingface/peft.
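As a rough illustration, the LoRA setup could look like the following sketch with the HuggingFace PEFT library. The target-module list, `lora_alpha`, and dropout values are assumptions; the report only specifies rank r=8 applied to all linear layers.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Base model being fine-tuned
base = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct")

# Rank-8 adapters on the linear layers of each transformer block
# (alpha/dropout values here are illustrative, not from the report).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # roughly 21M trainable parameters
```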
Multi-task Training Benefits Generalization
We observe that embedding models gain substantially in retrieval performance when trained jointly with clustering tasks, and that their effectiveness improves further with knowledge transfer from multiple tasks. By explicitly pulling documents towards high-level tags, clustering data teaches embedding models to organize and retrieve information more effectively. Although all three clustering datasets originate from the scientific domain, incorporating additional clustering training yields significant improvements across all tasks. We hypothesize that the clustering labels encourage models to regularize embeddings around high-level concepts, resulting in better separation of data across different domains.
Moreover, generalization can be strengthened by employing multi-task training and adapting the models to specific tasks. This approach not only improves the accuracy of search results but also ensures the adaptability of models to diverse domains and tasks, which is crucial for real-world application scenarios.
Task-Homogeneous Batching
Task-homogeneous batching is another factor contributing to the performance of our models. This technique constructs batches consisting exclusively of samples from a single task. As a result, in-batch negatives become more challenging, because the other examples in the batch closely resemble the query's own task. As shown in the table below, task-homogeneous batching yields a notable gain on retrieval tasks, an increase of 0.8 points.
| Interleaved Batching | Off | On |
| --- | --- | --- |
| Clustering_dev | 75.29 | 75.04 |
| Reranking_dev | 86.57 | 86.54 |
| Retrieval_dev | 68.20 | 69.02 |
| STS_dev | 89.81 | 90.30 |
| AVG | 79.97 | 80.23 |
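As a concrete illustration of this batching strategy, here is a minimal sketch of a task-homogeneous batch sampler; the `task` field and other names are hypothetical, as the report does not describe the implementation.

```python
import random
from collections import defaultdict

def task_homogeneous_batches(examples, batch_size, seed=0):
    """Yield batches whose examples all come from the same task,
    so in-batch negatives stay within that task and are harder."""
    rng = random.Random(seed)

    # Group examples by their task tag
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task"]].append(ex)

    batches = []
    for task_examples in by_task.values():
        rng.shuffle(task_examples)
        for i in range(0, len(task_examples), batch_size):
            batch = task_examples[i:i + batch_size]
            if len(batch) == batch_size:  # drop incomplete batches
                batches.append(batch)

    rng.shuffle(batches)  # mix tasks across batches, never within a batch
    return batches
```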
Impact of Hard Negatives
An effective technique for training embedding models is the use of "hard negatives": data points that are difficult for the model to distinguish from the positives. By default, we use the BGE-base model to mine hard negatives.
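Below is a minimal sketch of how such mining could be done with BGE-base via sentence-transformers; the function and variable names are illustrative, and the report does not specify the exact tooling.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative miner model; the report only names BGE-base.
miner = SentenceTransformer("BAAI/bge-base-en-v1.5")

def mine_hard_negatives(queries, corpus, positive_ids, top_k=100):
    """For each query, return the ids of the top-k most similar corpus
    documents that are not its annotated positive."""
    q_emb = miner.encode(queries, normalize_embeddings=True)
    d_emb = miner.encode(corpus, normalize_embeddings=True)
    sims = np.asarray(q_emb) @ np.asarray(d_emb).T  # cosine similarity

    hard_negatives = []
    for qi, pos_id in enumerate(positive_ids):
        ranked = np.argsort(-sims[qi])              # most similar first
        negs = [d for d in ranked if d != pos_id][:top_k]
        hard_negatives.append(negs)
    return hard_negatives
```

A ranked candidate list of this kind is also what the rank-range selection discussed in the next subsection operates on.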
Strategy to Eliminate False Negatives
Upon inspecting the mined negatives, we find that a considerable portion are false negatives: documents that are semantically equivalent to the corresponding positives but mistakenly treated as negatives. A strategy to accurately and efficiently select hard negatives is therefore crucial for embedding training, as it helps the model identify the documents most relevant to a query.
The results indicate that selecting negatives from the 30-100 rank range yields the best performance. This implies that the very top-ranked documents (roughly the top 30) include some false negatives, while documents ranked further down (the 50-100 range) lack sufficient challenge. Finding the right trade-off between these two factors is therefore important in contrastive training.
| Range of Selected Negatives | 0-100 | 30-100 | 50-100 |
| --- | --- | --- | --- |
| Clustering_dev | 74.60 | 75.04 | 74.91 |
| Reranking_dev | 86.57 | 86.54 | 86.45 |
| Retrieval_dev | 68.99 | 69.02 | 69.00 |
| STS_dev | 90.17 | 90.30 | 90.24 |
| AVG | 80.08 | 80.23 | 80.15 |
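Continuing the mining sketch above, the rank-window selection could look like the following; the window bounds mirror the 30-100 setting in the table, and the names are hypothetical.

```python
import random

def sample_negatives_from_range(ranked_neg_ids, num_negatives=7,
                                low=30, high=100, seed=0):
    """Sample hard negatives from a rank window of the mined candidates,
    skipping the very top ranks, which are the most likely false negatives."""
    window = ranked_neg_ids[low:high]
    return random.Random(seed).sample(window, min(num_negatives, len(window)))
```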
Number of Hard Negatives
The quantity of hard negatives used in contrastive learning can significantly impact the model's learning dynamics. Including more hard negatives per query can help the model learn subtler distinctions, potentially enhancing its generalization. Nevertheless, our findings suggest that the training process remains relatively stable regardless of the number of hard negatives used.
| Number of Negatives | #HN=1 | #HN=7 | #HN=15 | #HN=31 |
| --- | --- | --- | --- | --- |
| Clustering_dev | 74.96 | 75.04 | 75.11 | 75.07 |
| Reranking_dev | 86.61 | 86.54 | 86.43 | 86.45 |
| Retrieval_dev | 68.96 | 69.02 | 68.96 | 68.95 |
| STS_dev | 90.26 | 90.30 | 90.21 | 90.17 |
| AVG | 80.20 | 80.23 | 80.18 | 80.16 |
Impact of Batch Size
Larger batch sizes are advantageous, primarily because they supply more challenging in-batch negatives. We use GradCache to make training with large batches feasible, and we conducted experiments with batch sizes of 128, 2,048, and 8,192. Moving to large batches (2K+) leads to a considerable improvement over the smaller batch sizes (e.g., 128) conventionally used for fine-tuning; however, enlarging the batch size from 2,048 to 8,192 does not result in any significant further change in performance.
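The sketch below illustrates the gradient-caching idea in simplified form; it is not the GradCache library's actual API, and the `encoder`/`loss_fn` interfaces are assumptions.

```python
import torch

def grad_cached_step(encoder, loss_fn, query_batch, doc_batch, chunk_size=16):
    """One large-batch contrastive step with gradient caching (simplified).

    1) Encode the full batch chunk-by-chunk with no_grad to get embeddings.
    2) Compute the contrastive loss over the full batch and take gradients
       with respect to the embeddings only.
    3) Re-encode each chunk with gradients enabled and backpropagate the
       cached embedding gradients through the encoder.
    """
    def encode_no_grad(batch):
        with torch.no_grad():
            chunks = [encoder(batch[i:i + chunk_size])
                      for i in range(0, len(batch), chunk_size)]
        return torch.cat(chunks).requires_grad_()

    q_emb, d_emb = encode_no_grad(query_batch), encode_no_grad(doc_batch)

    # Full-batch loss; gradients here flow only into the embedding tensors.
    loss = loss_fn(q_emb, d_emb)
    loss.backward()
    q_grad, d_grad = q_emb.grad, d_emb.grad

    # Re-encode each chunk with gradients and chain the cached gradients.
    for emb_grad, batch in ((q_grad, query_batch), (d_grad, doc_batch)):
        for i in range(0, len(batch), chunk_size):
            reps = encoder(batch[i:i + chunk_size])
            reps.backward(emb_grad[i:i + chunk_size])

    return loss.detach()
```

The point is that the contrastive loss still sees the full batch of in-batch negatives, while the encoder's forward and backward passes only ever hold one small chunk in memory.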
Teacher Models for Hard Negative Mining
It is common to use stronger models to mine challenging hard negatives. In our study, we employ four models to investigate the impact of teacher models on hard negative mining, ranging from the classic lexical model BM25 to advanced dense models such as our own SFR-Embedding-Mistral. The results indicate that the dense models serve as better teachers than BM25, and that, in general, more powerful models yield more effective hard negatives (SFR-Embedding-Mistral > E5-Mistral > BGE-base). In the future, it will be interesting to explore multi-round training on two fronts: (a) mining hard negatives (HNs) with SFR-Embedding-Mistral, and (b) using the identified HNs to further refine SFR-Embedding-Mistral.
| Hard Negative Mining | BM25 | BGE-base | E5-mistral | SFR-Embedding-Mistral |
| --- | --- | --- | --- | --- |
| Clustering_dev | 75.07 | 75.04 | 75.04 | 75.45 |
| Reranking_dev | 86.43 | 86.54 | 86.52 | 86.60 |
| Retrieval_dev | 68.42 | 69.02 | 69.06 | 69.15 |
| STS_dev | 90.11 | 90.30 | 90.30 | 90.24 |
| AVG | 80.01 | 80.23 | 80.23 | 80.36 |
Impact of Context Length
We compare how BGE-large and SFR-Embedding-Mistral rank the positive (gold) documents as a function of query length (left figure) and document length (right figure). More precisely, the y-axis shows rank(gold document | BGE-large) minus rank(gold document | SFR-Embedding-Mistral), so the larger the absolute value, the greater the contrast between the two models. In both figures, SFR-Embedding-Mistral ranks positive documents better than BGE overall. More importantly, beyond a certain length threshold (around 25 for queries and 700 for documents), BGE becomes significantly less likely to rank the gold document as high as SFR-Embedding-Mistral, owing to the inherent ability of LLMs to represent long context. This is particularly appealing for downstream RAG applications where keeping the document structure intact is indispensable. For example, a RAG system can preserve the structure of long legal documents during summarization by retrieving the relevant sections, ensuring the summary accurately captures the case's essence and legal reasoning.
Full Evaluation on MTEB
MTEB (Massive Text Embedding Benchmark) is by far the most comprehensive benchmark for evaluating embedding models, encompassing 56 datasets across seven task types: classification, clustering, pair classification, reranking, retrieval, STS, and summarization.
As evidenced by the MTEB leaderboard (as of February 27, 2024), SFR-Embedding-Mistral claims the top spot among over 150 embedding models, including several proprietary ones such as voyage-lite-02-instruct, OpenAI text-embedding-3-large, and Cohere-embed-english-v3.0. Particularly noteworthy is its performance on retrieval tasks, which are considered the most pivotal of the MTEB task types: SFR-Embedding-Mistral achieves an average score of 59.0, surpassing the second-place model (57.4) by a substantial margin. This outcome underscores the model's exceptional performance across diverse tasks and domains.
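For reference, a run of the public MTEB harness could look like the sketch below; the task subset shown is illustrative, and loading the checkpoint directly through sentence-transformers is an assumption based on common usage rather than a documented recipe from this report.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Load the released checkpoint (assumes sentence-transformers support).
model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")

# Illustrative retrieval subset; the full benchmark covers 56 datasets.
evaluation = MTEB(tasks=["SciFact", "NFCorpus", "ArguAna"])
results = evaluation.run(model, output_folder="results/sfr-embedding-mistral")
```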
Conclusion
SFR-Embedding-Mistral is the top-ranked model on the MTEB benchmark, attributed to several pivotal innovations and strategies:
- Transfer learning from multiple tasks, notably clustering, enhances the model’s adaptability and performance.
- Task-homogeneous batching increases the difficulty of the contrastive objective for the model, promoting enhanced generalization.
- Constructing better hard negatives further sharpens the model's ability to distinguish relevant documents from misleading ones.
Ethical Considerations
This release is for research purposes only in support of an academic paper. Users are encouraged to carefully consider common limitations of embeddings, applicable laws, and best practices when selecting use cases. We strongly recommend that users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model, particularly in high-risk applications such as healthcare, legal decision-making, and public safety, or in applications that could result in human rights violations. For guidance on what we consider appropriate use cases, please refer to our AUP and AI AUP. While compliance with these policies is not mandated for this release, we urge users to align their applications with these principles to ensure responsible and ethical use.
Explore more
Salesforce AI invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.
- SFR Embedding Deep Dive
- Hugging Face
- Salesforce AI Research Website
- Follow us on X (Previously Twitter): @SFResearch, @Salesforce