As the development and deployment of large language models (LLMs) accelerates, evaluating model outputs has become increasingly important. The established method of evaluating responses typically involves recruiting and training human evaluators, having them evaluate the model responses, and then auditing the quality of the evaluations. Unfortunately, this process does not scale to keep up with the frantic release pace of new LLMs. As a result, practitioners have begun using LLMs themselves as evaluators. Here, an LLM (also called a judge model) is provided with a user’s input and outputs from a model. The judge model is then asked to rate the outputs based on certain evaluation criteria.
We introduce SFR-Judge, a family of three judge models at 8B, 12B, and 70B parameter sizes, built on Meta Llama 3 and Mistral NeMo. These judge models are trained to perform three different types of evaluation tasks: pairwise comparisons (“Is output A better than output B?”), single ratings (“Rate the output on a Likert scale of 1-5”), and binary classification (“Does the output meet the specified criteria?”). Here at Salesforce AI Research, we understand that trust is a foundational pillar of evaluation. Our judges therefore do not just produce judgments: they are specifically trained to explain their judgments, avoiding the black-box nature of other types of judge models.
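To make the three task formats concrete, the sketch below shows how a user of a judge model might construct prompts for each. These templates are illustrative only, not the exact prompt formats used to train SFR-Judge (see the paper for those); the function names and wording are our own.

```python
# Illustrative prompt templates for the three judge-model task types.
# These are NOT the exact SFR-Judge prompts; see the paper for those.

def pairwise_prompt(user_input: str, output_a: str, output_b: str) -> str:
    """Pairwise comparison: 'Is output A better than output B?'"""
    return (
        "You are an impartial judge. Given the user's request and two responses, "
        "decide which response is better.\n\n"
        f"Request:\n{user_input}\n\n"
        f"Response A:\n{output_a}\n\n"
        f"Response B:\n{output_b}\n\n"
        "First explain your reasoning, then answer with 'A' or 'B'."
    )

def rating_prompt(user_input: str, output: str) -> str:
    """Single rating: 'Rate the output on a Likert scale of 1-5.'"""
    return (
        "You are an impartial judge. Rate how well the response satisfies the "
        "request on a scale of 1 (poor) to 5 (excellent).\n\n"
        f"Request:\n{user_input}\n\nResponse:\n{output}\n\n"
        "First explain your reasoning, then give a rating from 1 to 5."
    )

def classification_prompt(user_input: str, output: str, criteria: str) -> str:
    """Binary classification: 'Does the output meet the specified criteria?'"""
    return (
        "You are an impartial judge. Decide whether the response meets the "
        "criteria below.\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Request:\n{user_input}\n\nResponse:\n{output}\n\n"
        "First explain your reasoning, then answer 'Yes' or 'No'."
    )

if __name__ == "__main__":
    print(pairwise_prompt("Summarize the article in one sentence.",
                          "A terse, accurate summary.",
                          "A long, rambling paraphrase."))
```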
To evaluate SFR-Judge, we conducted extensive experiments on thirteen benchmarks spanning the three evaluation tasks. These experiments assessed our models’ ability to handle the different jobs a judge model may be asked to perform: reward modeling, determining the safety of outputs, assessing how well other models follow instructions, and even evaluating outputs against fine-grained rubrics. Our models consistently outperformed other open-source judge models as well as powerful proprietary models such as GPT-4o, achieving the best performance on 10 of 13 benchmarks.
While judge models offer a flexible alternative to human evaluation, they can also be sensitive to certain evaluation biases. In pairwise comparisons, some judge models are sensitive to the order in which responses are presented, changing their answer when Response A and Response B are swapped. Judge models may also prefer longer, pleasant-sounding responses that do not satisfy the user’s request over concise responses that do. To probe for these biases, we (a) evaluate our models on EvalBiasBench [https://github.com/ncsoft/offsetbias], which tests judge models against six types of bias (e.g., length bias, familiar knowledge bias), and (b) measure pairwise order consistency (does the model return the same judgment if we swap the order of the responses?) across seven pairwise comparison benchmarks; a sketch of this check is shown below. Our models are less biased on EvalBiasBench while exhibiting far higher consistency than many competitive judge models.
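The sketch below illustrates how pairwise order consistency can be measured: each pair is judged twice, once in each order, and counts as consistent only if the same underlying response wins both times. The `judge_pair` function is a hypothetical stand-in for whatever judge model is actually being called.

```python
from typing import Callable, Dict, List

# judge_pair(user_input, response_a, response_b) should return "A" or "B",
# indicating which response the judge prefers. This is a hypothetical
# stand-in for an actual call to a judge model.
JudgeFn = Callable[[str, str, str], str]

def order_consistency(examples: List[Dict[str, str]], judge_pair: JudgeFn) -> float:
    """Fraction of pairs where the judge prefers the same response in both orders."""
    consistent = 0
    for ex in examples:
        first = judge_pair(ex["input"], ex["response_a"], ex["response_b"])
        swapped = judge_pair(ex["input"], ex["response_b"], ex["response_a"])
        # Consistent if the same underlying response wins under both orderings:
        # ("A", "B") both point to response_a; ("B", "A") both point to response_b.
        if (first, swapped) in {("A", "B"), ("B", "A")}:
            consistent += 1
    return consistent / len(examples) if examples else 0.0

if __name__ == "__main__":
    # A deliberately position-biased mock judge that always picks the first slot:
    biased_judge: JudgeFn = lambda q, a, b: "A"
    data = [{"input": "q", "response_a": "good answer", "response_b": "bad answer"}]
    print(order_consistency(data, biased_judge))  # 0.0 -- fully inconsistent
```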
Judge models are used not only for auto-evaluation but also for improving other LLMs by scoring their outputs during reinforcement learning from human feedback (RLHF) fine-tuning. Here, the model to be improved (the downstream model) produces outputs, and a judge model scores each response. The downstream model is then further trained to encourage highly rated responses and discourage poorly rated ones. The judge model in this setting is often referred to as a reward model.
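The sketch below shows one common way a generative judge can play this reward-model role: each candidate response is scored, and the highest- and lowest-rated responses form a preference pair for downstream training. The `rate_response` function is a hypothetical placeholder for a single-rating judge call; the actual training pipeline used in our experiments is described in the paper.

```python
from typing import Callable, Dict, List, Tuple

# rate_response(user_input, response) should return a scalar score, e.g. a
# 1-5 Likert rating parsed from the judge's output. Hypothetical placeholder.
RateFn = Callable[[str, str], float]

def build_preference_pair(
    user_input: str, candidates: List[str], rate_response: RateFn
) -> Dict[str, str]:
    """Score candidate responses and keep the best/worst as a preference pair."""
    scored: List[Tuple[float, str]] = [
        (rate_response(user_input, c), c) for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0])
    return {
        "prompt": user_input,
        "chosen": scored[-1][1],   # highest-rated response is encouraged
        "rejected": scored[0][1],  # lowest-rated response is discouraged
    }

if __name__ == "__main__":
    # Mock rating that favors longer responses (for demonstration only).
    mock_rate: RateFn = lambda q, r: float(len(r))
    print(build_preference_pair("Explain DNS.", ["short", "a longer answer"], mock_rate))
```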
Notably, our models rank first, second, and fourth on the RewardBench leaderboard for generative judge models as of September 20, 2024. RewardBench is a benchmark for assessing the reward-modeling abilities of judge models, and our models are the first and second generative judges to cross the 90% accuracy threshold on it. We also put our models to the test, showing that SFR-Judge not only acts as a powerful reward model but also produces explanations that can be used to further improve downstream model outputs. We compared our models against two black-box (i.e., no explanations are generated) classifier-based reward models and showed that downstream models improved with SFR-Judge achieve better performance on AlpacaEval-2, an instruction-following benchmark.
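As a rough illustration of how a judge's explanation could feed back into a downstream model, the sketch below implements a simple critique-and-revise loop: when the judge rates an output poorly, its explanation is passed back to the downstream model as revision guidance. This is only one plausible use of explanations under our own assumptions (`generate` and `judge_with_explanation` are hypothetical helpers); the exact setup behind our AlpacaEval-2 results is detailed in the paper.

```python
from typing import Callable, Tuple

# generate(prompt) -> response text from the downstream model (hypothetical).
# judge_with_explanation(prompt, response) -> (rating on a 1-5 scale, explanation
# text), e.g. parsed from a single-rating judge call (hypothetical).
GenerateFn = Callable[[str], str]
JudgeFn = Callable[[str, str], Tuple[float, str]]

def refine_with_judge_feedback(
    prompt: str,
    generate: GenerateFn,
    judge_with_explanation: JudgeFn,
    threshold: float = 4.0,
    max_rounds: int = 2,
) -> str:
    """Regenerate a response using the judge's explanation as revision guidance."""
    response = generate(prompt)
    for _ in range(max_rounds):
        rating, explanation = judge_with_explanation(prompt, response)
        if rating >= threshold:
            break
        # Fold the judge's critique back into the downstream model's prompt.
        revision_prompt = (
            f"{prompt}\n\nYour previous answer:\n{response}\n\n"
            f"A reviewer gave this feedback:\n{explanation}\n\n"
            "Rewrite your answer to address the feedback."
        )
        response = generate(revision_prompt)
    return response
```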
SFR-Judge marks an exciting step in auto-evaluation and reward modeling. For more information, please check out our paper!
Paper: https://arxiv.org/abs/2409.14664
Code (coming soon): https://github.com/SalesforceAIResearch/SFRJudge