Large Language Models (LLMs) have recently been shown to be effective automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs spanning four size ranges and evaluate their output responses through preference rankings produced by the other LLMs acting as evaluators (e.g., "System Star is better than System Square"). We then assess the quality of these ranking outputs by introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark that measures six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias, where a model prefers to rank its own outputs highly. We find that LLMs are biased text-quality evaluators: on average, 40% of comparisons across all models exhibit strong indications of bias on our benchmark, calling their robustness as evaluators into question. Furthermore, we examine the correlation between human and machine preferences and find an average Rank-Biased Overlap (RBO) score of 49.6%, indicating that machine preferences are misaligned with human preferences. Our findings suggest that LLMs may still be unsuitable for automatic annotation aligned with human preferences.
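For reference, Rank-Biased Overlap is a top-weighted measure of agreement between two ranked lists. The sketch below is a minimal truncated implementation; the persistence parameter p and the example rankings are illustrative choices, not values or results from the paper.

# A minimal sketch of truncated Rank-Biased Overlap (RBO), the top-weighted
# rank-similarity measure referenced above. The persistence parameter p and
# the example rankings are illustrative, not values from the paper.

def rank_biased_overlap(ranking_a, ranking_b, p=0.9):
    """Truncated RBO between two ranked lists (higher = more similar)."""
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    weighted_agreement = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap = len(seen_a & seen_b) / d      # agreement at depth d
        weighted_agreement += (p ** (d - 1)) * overlap
    return (1 - p) * weighted_agreement

# Hypothetical machine vs. human preference rankings over five systems.
machine_ranking = ["A", "B", "C", "D", "E"]
human_ranking = ["A", "C", "B", "E", "D"]
print(f"RBO = {rank_biased_overlap(machine_ranking, human_ranking, p=0.9):.3f}")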
We propose CoBBLEr, the COgnitive Bias Benchmark for LLMs as EvaluatoRs, which measures the quality and reliability of LLMs as evaluators. Here is the pipeline:
We collect a set of 50 question-answering instructions from BigBench (Srivastava et al., 2023) and ELI5 (Fan et al., 2019).
Then, we assemble the top 15 open- and closed-source LLMs (GPT-4, ChatGPT, InstructGPT, LLaMAv2, LLaMA, Cohere, Falcon, Alpaca, Vicuna, OpenAssist, DollyV2, Baize, Koala, WizardLM, MPT, RedPajama).
We then generate responses from these LLMs and conduct a round-robin over every unique pair of model responses, prompting each model to evaluate both its own and the other models' responses (a sketch of this round-robin step follows the pipeline below).
We then test six different biases to benchmark their evaluation quality, categorizing the biases into two groups: (1) Implicit Biases, which can be extracted implicitly from each model's evaluation via a vanilla prompt, and (2) Induced Biases, which modify the original prompt in ways designed to induce negative behaviors.
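The sketch below illustrates the round-robin step under a vanilla prompt with anonymized system names; the prompt template, the query_model helper, and the evaluation loop are hypothetical placeholders rather than the exact prompts and APIs used in CoBBLEr.

# A minimal sketch of the round-robin pairwise evaluation, assuming a
# hypothetical vanilla prompt template and a query_model(model, prompt)
# helper; neither is the exact prompt or API used in CoBBLEr.
from itertools import combinations, permutations

VANILLA_PROMPT = (
    "Instruction: {instruction}\n"
    "Response from System Star: {first}\n"
    "Response from System Square: {second}\n"
    "Which response is better? Answer 'System Star' or 'System Square'."
)

def round_robin_evaluations(instruction, responses, evaluators, query_model):
    """responses: dict mapping model name -> its response to the instruction.
    Every unique pair of responses is judged by every evaluator (including
    the two models that produced the pair), in both presentation orders."""
    judgments = []
    for model_a, model_b in combinations(responses, 2):
        # Show each pair in both orders so order bias can be measured later.
        for first, second in permutations((model_a, model_b)):
            prompt = VANILLA_PROMPT.format(
                instruction=instruction,
                first=responses[first],
                second=responses[second],
            )
            for evaluator in evaluators:
                verdict = query_model(evaluator, prompt)  # hypothetical LLM call
                judgments.append((evaluator, first, second, verdict))
    return judgments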
Definition and examples of each bias type in CoBBLEr. The example for each bias type shows its characteristic format, with answers indicative of bias-influenced behavior in bold. For example, the ORDER bias example shows both orderings of responses x and y, with an inconsistent answer that chooses only the first-ordered system. Furthermore, we pair the COMPASSION example with ORDER (System Star/System Square vs. Alpaca/Vicuna) to demonstrate how behavior differs when real model names are used.
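As a concrete illustration of the ORDER-bias check described in the caption, the sketch below flags a comparison as order-biased when the evaluator's answer tracks the presentation order rather than the content; the verdict labels reuse the anonymized system names from the example, and the check itself is a simplified reading of the caption, not the paper's exact scoring code.

# A minimal sketch of the ORDER-bias check: the same pair (x, y) is shown in
# both orders, and the evaluator is flagged as order-biased if it picks
# whichever system is listed first (or always whichever is listed second).

def is_order_biased(verdict_xy: str, verdict_yx: str) -> bool:
    """verdict_xy: answer when x is shown first; verdict_yx: when y is shown first."""
    always_first = verdict_xy == "System Star" and verdict_yx == "System Star"
    always_second = verdict_xy == "System Square" and verdict_yx == "System Square"
    return always_first or always_second

print(is_order_biased("System Star", "System Star"))    # True: picks the first-listed system both times
print(is_order_biased("System Star", "System Square"))  # False: consistently prefers x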
We show the proportion of responses labeled as biased for each bias benchmark. Each point is one of the 15 tested models, whose distribution varies along the y-axis. The red dashed line indicates the RANDOM threshold for each bias benchmark, which serves as a litmus test separating biased from unbiased LLMs-as-evaluators. The spread along the x-axis is random and only for visual clarity.
(Left) We show the proportion of biased responses for each evaluator relative to the random threshold, scaling each axis to the score of the most biased model. The polygon plot indicates that the majority of models strongly exhibit several of the biases, which could compromise their credibility as evaluators.
(Right) Overview of performance across all bias benchmarks, with models grouped into four size categories. The red dotted line denotes the average RANDOM threshold, obtained by summing the per-benchmark RANDOM thresholds and averaging them; model scores are averaged the same way. The plot shows that the induced biases (Bandwagon effect and Attentional bias) contribute almost half of each group's average bias score, while the implicit biases contribute similarly across groups, indicating that scaling up model size does not reduce the biases of LLM evaluators.
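To make the aggregation in the right-hand plot concrete, the sketch below averages a model's per-benchmark bias proportions and compares the result with the averaged RANDOM threshold; all numbers are illustrative placeholders, not values reported in the paper, and only five of the six benchmarks are shown for brevity.

# A minimal sketch of the aggregation behind the right-hand plot: per-benchmark
# bias proportions are summed and averaged, then compared against the averaged
# RANDOM threshold. All numbers below are illustrative placeholders.

def average_score(per_benchmark: dict) -> float:
    """Sum the per-benchmark scores and take their mean."""
    return sum(per_benchmark.values()) / len(per_benchmark)

model_bias = {"order": 0.35, "compassion": 0.40, "egocentric": 0.30,
              "bandwagon": 0.55, "attentional": 0.45}
random_threshold = {"order": 0.25, "compassion": 0.25, "egocentric": 0.25,
                    "bandwagon": 0.25, "attentional": 0.25}

print(f"avg bias score:   {average_score(model_bias):.2f}")
print(f"avg random level: {average_score(random_threshold):.2f}")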
@misc{koo2023benchmarking,
title={Benchmarking Cognitive Biases in Large Language Models as Evaluators},
author={Ryan Koo and Minhwa Lee and Vipul Raheja and Jong Inn Park and Zae Myung Kim and Dongyeop Kang},
year={2023},
eprint={2309.17012},
archivePrefix={arXiv},
primaryClass={cs.CL}
}