How Far Can We Extract Diverse Perspectives from Large Language Models?

Shirley Anugrah Hayati*, Minhwa Lee*, Dheeraj Rajagopal, Dongyeop Kang*
*University of Minnesota, Google Research

EMNLP 2024 (Main, Long Paper)

Abstract

Collecting diverse human opinions is costly and challenging. This has led to a recent trend of collaborative efforts between humans and Large Language Models (LLMs) for generating diverse data, offering potentially scalable and efficient solutions. However, the extent of LLMs' capability to generate diverse perspectives on subjective topics remains an unexplored question. In this study, we investigate LLMs' capacity to generate diverse perspectives and rationales on subjective topics, such as social norms and argumentative texts. We formulate a new problem of maximum diversity extraction from LLMs. Motivated by how humans develop their opinions through their values, we propose a criteria-based prompting technique to ground diverse opinions. To see how far we can extract diverse perspectives from LLMs, which we call diversity coverage, we employ a step-by-step recall prompting that generates more outputs from the model in an iterative manner. Applying our methods to various tasks, we indeed find that LLMs can generate diverse opinions according to the degree of task subjectivity.

Research Contributions

  • First, we propose the idea of perspective diversity for generative LLMs, in contrast to the lexical, syntactic, and semantic diversity that have been the main focus of previous work. We conduct various experiments to measure LLMs' ability to generate maximum perspective diversity.
  • Second, we introduce a new prompting technique called criteria-based diversity prompting as a way of extracting and grounding diverse perspectives from LLMs.
  • Finally, as it is unclear how much diversity LLMs can cover, we propose a step-by-step approach for measuring the coverage of LLMs' diversity generation (i.e., measuring recall for diversity prompting). We then compare this coverage between LLM-generated opinions and human-written opinions.

Methods


[Figure: (a) criteria-based diversity prompting and (b) step-by-step recall prompting]

  1. Criteria-based Diversity Prompting

    Our criteria-based diversity prompting works as follows (shown in Figure [a]):
    "Given a statement, we prompt the LLM to generate its stance (e.g., agree or disagree) and explain its reasons with a list of criteria that affect its perspective."

    Here, criteria are words or phrases (e.g., the model's values) that frame the LLM's high-level decision and ground the generated reasons.
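
    As a rough illustration, the sketch below shows how such a prompt could be assembled. The wording and output format are paraphrased for illustration rather than the paper's verbatim template, and the example statement is hypothetical.

      # Minimal sketch of a criteria-based diversity prompt (paraphrased; not the
      # paper's verbatim template). The example statement is hypothetical.
      def build_criteria_prompt(statement: str, num_opinions: int = 5) -> str:
          return (
              f"Statement: {statement}\n"
              f"Generate {num_opinions} different opinions about this statement.\n"
              "For each opinion, give:\n"
              "- Stance: agree or disagree\n"
              "- Criteria: a short list of values or considerations framing the stance\n"
              "- Reason: an explanation grounded in those criteria\n"
          )

      print(build_criteria_prompt("It is okay to cancel plans with a friend at the last minute."))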


  2. Step-by-Step Recall Prompting

    To measure the LLMs' diversity coverage, we propose step-by-step recall prompting (shown in Figure [b]):
    We first ask the LLM to generate one opinion ('1st Opinion') for the given statement, and then ask the model to keep generating more opinions until the requested number of opinions ('N') is reached.

    Note that the first opinion is used to guide the structured format of the output, since we do not use few-shot prompting in this experiment.
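
    A minimal sketch of this loop follows, assuming a hypothetical query_llm() wrapper around whichever chat model is being probed; it illustrates the iterative idea, not the paper's exact prompts.

      # Sketch of step-by-step recall prompting: ask for one opinion, then keep
      # asking for additional, non-overlapping opinions until N are collected.
      # query_llm is a hypothetical function that sends a prompt and returns text.
      def recall_prompting(statement: str, n: int, query_llm) -> list[str]:
          opinions = []
          # The 1st opinion also serves as the format example (no few-shot demos).
          opinions.append(query_llm(
              f"Statement: {statement}\n"
              "Give one opinion with its stance, criteria, and reason."))
          while len(opinions) < n:
              history = "\n\n".join(opinions)
              opinions.append(query_llm(
                  f"Statement: {statement}\n"
                  f"Opinions so far:\n{history}\n"
                  "Give one more opinion that differs from the ones above, "
                  "with its stance, criteria, and reason."))
          return opinions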


  3. Dataset & Models

    We use the following datasets: (1) Social-Chem-101 (Forbes et al., 2020) and (2) Change My View (CMV) (Hidey et al., 2017). For the recall prompting experiments, we add two more datasets: (3) Hate Speech (Vidgen et al., 2021) and (4) Moral Stories (Emelin et al., 2021).

    We evaluate GPT-4, ChatGPT, and GPT-3 (text-davinci-002), as well as open-source models such as LLaMA2-70B-chat (Touvron et al., 2023) and Mistral-7B-Instruct (Jiang et al., 2023).


  4. Evaluation

    We measure the diversity of LLM-generated opinions using the following two metrics:

    1. Semantic Diversity: For each statement, we first encode the reasons generated by the LLMs into sentence embeddings using SentenceBERT. We then compute the cosine distance between every pair of reasons and average it across all pairs. We use this metric to compare the diversity of models' generated reasons under criteria-based prompting versus free-form prompting.

    2. Perspective Diversity: To evaluate step-by-step recall prompting, we prompt GPT-4 to cluster criteria words with similar meanings into one group. The perspective diversity score for a statement is the percentage of its generated opinions whose criteria do not duplicate those of the other opinions; the higher the score, the more diverse the set of generated opinions.
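
    The sketch below illustrates both metrics under simplifying assumptions: the SentenceBERT checkpoint is an arbitrary choice rather than the paper's, and the GPT-4 clustering of near-synonymous criteria is replaced by exact set overlap.

      # Illustrative, simplified versions of the two diversity metrics.
      from itertools import combinations
      import numpy as np
      from sentence_transformers import SentenceTransformer  # SentenceBERT
      from sklearn.metrics.pairwise import cosine_distances

      def semantic_diversity(reasons: list[str]) -> float:
          # Average pairwise cosine distance between embeddings of the reasons
          # generated for one statement (the checkpoint choice is an assumption).
          emb = SentenceTransformer("all-MiniLM-L6-v2").encode(reasons)
          dists = cosine_distances(emb)
          pairs = combinations(range(len(reasons)), 2)
          return float(np.mean([dists[i, j] for i, j in pairs]))

      def perspective_diversity(criteria_per_opinion: list[set[str]]) -> float:
          # Share of opinions whose criteria do not overlap with any earlier
          # opinion's criteria; the paper first clusters near-synonymous criteria
          # with GPT-4, which is omitted here.
          seen, unique = set(), 0
          for criteria in criteria_per_opinion:
              if not (criteria & seen):
                  unique += 1
              seen |= criteria
          return unique / len(criteria_per_opinion)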

Key Takeaways

  • GPT-4 with criteria-based diversity prompting in a one-shot setting produces the most semantically diverse opinions about social norms and argumentative topics.


  • [Table: semantic diversity results]

    Semantic diversity (cosine distance) results for criteria-based prompting vs. free-form prompting across LLM variants. "1-criteria" refers to one-shot criteria-based prompting, and so on. The highest diversity score within each LLM type is shown in bold. * p < 0.05 when comparing criteria-based prompting with free-form prompting.


  • Task subjectivity of the dataset tends to influence the LLMs' ability to produce the maximum number of diverse opinions.

  • Semantic diversity is not always positively correlated with perspective diversity.


  • [Figure: semantic vs. perspective diversity scatter plot]

    Scatter plot with X = semantic diversity (cosine distance) of the opinions for each statement and Y = perspective diversity (% of statements without duplicate opinions). A green circle marks a statement with agree/hate-speech reasons, while a red triangle marks a statement with disagree/not-hate opinions. Story continuations in Moral Stories have no stances; each story is represented by a purple circle.


  • Humans and LLMs have different perspectives on socially argumentative topics.