Benchmarking Cognitive Biases in Large Language Models as Evaluators

University of Minnesota Twin Cities, Grammarly
arXiv, 2023

Abstract

Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs spanning four different size ranges and evaluate their output responses via preference rankings produced by the other LLMs acting as evaluators (e.g., "System Star is better than System Square"). We then evaluate the quality of these ranking outputs by introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark that measures six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias, where a model prefers to rank its own outputs highly. We find that LLMs are biased text-quality evaluators, exhibiting strong indications of bias on our benchmark (on average, in 40% of comparisons across all models), which calls into question their robustness as evaluators. Furthermore, we examine the correlation between human and machine preferences and find an average Rank-Biased Overlap (RBO) score of 49.6%, indicating that machine preferences are misaligned with human ones. Our findings suggest that LLMs may still be unsuitable for automatic annotation aligned with human preferences.
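
For reference, Rank-Biased Overlap (Webber et al., 2010) compares two ranked lists with geometrically decaying, top-weighted contributions controlled by a persistence parameter p. Below is a minimal sketch of a truncated RBO computation; the function name, the choice of p = 0.9, and the placeholder system names are illustrative assumptions, not details taken from the paper.

    # Minimal sketch: truncated Rank-Biased Overlap between two ranked lists.
    # Assumes both lists rank roughly the same set of systems.
    def rbo(ranking_a, ranking_b, p=0.9):
        depth = min(len(ranking_a), len(ranking_b))
        seen_a, seen_b = set(), set()
        weighted_agreement = 0.0
        for d in range(1, depth + 1):
            seen_a.add(ranking_a[d - 1])
            seen_b.add(ranking_b[d - 1])
            overlap = len(seen_a & seen_b)  # items shared in the top-d prefixes
            weighted_agreement += (p ** (d - 1)) * (overlap / d)
        # (1 - p) normalizes the geometric weights; truncating at a finite depth
        # makes this a lower bound on the full (extrapolated) RBO.
        return (1 - p) * weighted_agreement

    # Placeholder rankings for illustration only.
    human_ranking = ["A", "B", "C", "D"]
    model_ranking = ["B", "A", "D", "C"]
    print(round(rbo(human_ranking, model_ranking), 3))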

Our CoBBLEr Pipeline


[Figure: Overview of the CoBBLEr pipeline]

We propose CoBBLEr, the COgnitive Bias Benchmark for evaluating the quality and reliability of LLMs as EvaluatoRs. Here is the pipeline:

  1. Dataset and Models

    We collect a set of 50 question-answering instructions from BigBench (Srivastava et al., 2023) and ELI5 (Fan et al., 2019).

    Then, we assemble 15 top-performing open- and closed-source LLMs (GPT-4, ChatGPT, InstructGPT, LLaMAv2, LLaMA, Cohere, Falcon, Alpaca, Vicuna, OpenAssist, DollyV2, Baize, Koala, WizardLM, MPT, RedPajama).


  2. Response Generation

    We then generate responses from the 15 open- and closed-source LLMs and conduct a round-robin over every unique pair of model responses, prompting each model to evaluate its own and other models’ responses (a sketch of this procedure appears after this list).


  3. Pairwise Evaluation

    We then test six different biases to benchmark evaluation quality, categorizing the biases into two groups: (1) Implicit Biases, which can be extracted implicitly from each model’s evaluation via a vanilla prompt, and (2) Induced Biases, which add modifications to the original prompt that are intended to induce negative behaviors.
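
Below is a minimal sketch of steps 2-3, assuming a generic query(model, prompt) helper for calling an LLM; the prompt wording, helper names, and anonymized aliases are illustrative rather than the exact CoBBLEr prompts.

    from itertools import combinations

    def pairwise_round_robin(instruction, responses, evaluators, query):
        """responses: dict mapping model name -> its generated response."""
        votes = []
        for (model_a, resp_a), (model_b, resp_b) in combinations(responses.items(), 2):
            # Anonymize the two systems so the evaluator cannot key on model names
            # (real names are only revealed when probing the COMPASSION bias).
            prompt = (
                f"Instruction: {instruction}\n"
                f"System Star: {resp_a}\n"
                f"System Square: {resp_b}\n"
                "Which system answered the instruction better? "
                "Reply with 'System Star' or 'System Square'."
            )
            for evaluator in evaluators:
                votes.append({
                    "evaluator": evaluator,
                    "pair": (model_a, model_b),
                    "choice": query(evaluator, prompt),
                })
        return votes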

Main Findings


(1) Definition of Biases in CoBBLEr


[Figure: Definitions and examples of each bias type in CoBBLEr]

Definitions and examples of each bias type in CoBBLEr. For each bias type, we display its characteristic format and bold the answers that indicate behavior influenced by the bias. For example, the ORDER bias example shows both orderings of responses x and y, but the evaluator gives an inconsistent answer by always choosing the first-ordered system. Furthermore, we pair the COMPASSION example with ORDER (System Star/System Square vs. Alpaca/Vicuna) to demonstrate how behavior differs when real model names are used.
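
To make the implicit/induced distinction concrete, here is a minimal sketch of how an induced-bias (BANDWAGON) variant of the vanilla pairwise prompt could be constructed; the exact wording of the fake statistic is an illustrative placeholder, not necessarily the prompt used in the paper.

    def vanilla_prompt(instruction, resp_a, resp_b):
        # Implicit-bias setting: a plain pairwise comparison with anonymized names.
        return (
            f"Instruction: {instruction}\n"
            f"System Star: {resp_a}\n"
            f"System Square: {resp_b}\n"
            "Which system answered the instruction better?"
        )

    def bandwagon_prompt(instruction, resp_a, resp_b):
        # Induced-bias setting: append a fake majority-opinion statistic and check
        # whether the evaluator follows the crowd rather than judging text quality.
        return (
            vanilla_prompt(instruction, resp_a, resp_b)
            + "\n85% of people believe that System Star is better."
        )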


(2) Overall Bias Scores by Model and Size


[Figure: Proportion of biased responses for each bias benchmark across the 15 models]

We show the proportion of responses labeled as biased for each bias benchmark, visualizing the distribution of the 15 tested models along the y-axis. The red dashed line indicates the RANDOM threshold for each bias benchmark, which serves as a litmus test between biased and unbiased LLMs-as-evaluators. The spread along the x-axis is random, purely for visual clarity.

  1. Order Bias: the majority of models tend to prefer the first-ordered response, especially models larger than 40B parameters (see the sketch after this list for how order consistency is checked).
  2. Compassion Bias: all models were influenced by real model names in the evaluation setting, especially those with 100B parameters or more.
  3. Egocentric Bias: models smaller than 10B generally tend to prefer their own responses regardless of text quality, compared to when anonymized aliases are used.
  4. Salience Bias: larger models (>100B and >40B) are drawn more strongly to longer responses than smaller models are.
  5. Bandwagon Effect: the majority of models were heavily influenced by a simple fake statistic, regardless of text quality.
  6. Attentional Bias: around half of the models were influenced by irrelevant information that distracts them when evaluating text quality.
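
As referenced in the Order Bias item above, here is a minimal sketch of how positional consistency can be checked for a single response pair, reusing the hypothetical query helper from the earlier sketch; it illustrates the idea rather than the paper's exact implementation.

    def is_order_biased(instruction, resp_a, resp_b, evaluator, query):
        def ask(first, second):
            prompt = (
                f"Instruction: {instruction}\n"
                f"System Star: {first}\n"
                f"System Square: {second}\n"
                "Which system answered the instruction better?"
            )
            return query(evaluator, prompt)  # expected: 'System Star' or 'System Square'

        choice_original = ask(resp_a, resp_b)  # resp_a shown first
        choice_swapped = ask(resp_b, resp_a)   # resp_b shown first
        # A consistent evaluator flips its label when the order flips; returning the
        # same label in both orderings means the preference tracks position, not quality.
        return choice_original == choice_swapped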

[Figures: per-model bias scores relative to the random threshold (left) and average bias scores grouped by model size (right)]

(Left) We show the proportion of biased responses for each evaluator relative to the random threshold, scaling each axis to the score of the most biased model. The polygon plot indicates that the majority of models strongly exhibit several of the different biases, which could compromise their credibility as evaluators.

(Right) Overview of performance across all of the bias benchmarks, categorized into four size groups. The red dotted line denotes the average RANDOM threshold, obtained by summing the per-benchmark RANDOM bias scores and averaging them (see the sketch below). The plot shows that the induced biases (Bandwagon Effect and Attentional Bias) contribute almost half of the average bias score, and that the implicit biases contribute similarly to each model's overall bias score across size groups, so scaling up model size does not reduce bias in LLM evaluators.
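
As a small illustration of that averaging step, the sketch below aggregates per-benchmark bias proportions into a single overall score; the bias names follow the paper, but the numeric values are placeholders, not reported results.

    def average_bias_score(per_benchmark_scores):
        # Overall score for one model (or size group) = mean of its per-benchmark
        # bias proportions; it is compared against the similarly averaged RANDOM
        # threshold (the red dotted line in the plot).
        return sum(per_benchmark_scores.values()) / len(per_benchmark_scores)

    # Placeholder values for illustration only -- not results from the paper.
    example_scores = {"order": 0.6, "compassion": 0.5, "egocentric": 0.2,
                      "salience": 0.5, "bandwagon": 0.7, "attentional": 0.3}
    print(average_bias_score(example_scores))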

Check out our demo to see the N-wise ranking evaluations by each of the 15 LLMs as well as by human annotators!


Demo (Experiment Viewer)

Citation

@misc{koo2023benchmarking,
      title={Benchmarking Cognitive Biases in Large Language Models as Evaluators},
      author={Ryan Koo and Minhwa Lee and Vipul Raheja and Jong Inn Park and Zae Myung Kim and Dongyeop Kang},
      year={2023},
      eprint={2309.17012},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}