Are LLM Agents Behaviorally Coherent?
Latent Profiles for Social Simulation

University of Minnesota · University of Chicago · Grammarly

Overview

High-level overview of the latent profile evaluation pipeline.

The paper tests whether LLM agents remain behaviorally coherent when latent preferences are elicited first and then compared against downstream conversational outcomes.

Abstract

The impressive capabilities of Large Language Models (LLMs) raise the possibility that synthetic agents can serve as substitutes for real participants in human-subject research. To evaluate this claim, prior research has largely focused on whether LLM-generated survey responses align with those produced by human respondents whom the LLMs are prompted to represent. In contrast, this paper asks a more fundamental question: do agents maintain empirical consistency when examined under different experimental settings? The study develops a design that first reveals an agent's latent profile and then tests whether conversational behavior remains consistent with that revealed state. Across model families and sizes, the findings show systematic inconsistencies: agents may match human-like responses at the surface level, but they fail more demanding tests that require coherent behavior across latent states and interaction.

Pipeline figure showing topic selection, agent generation, latent state extraction, pairing, and conversational scoring.

Method

The framework proceeds in five steps: choose a topic, generate agents with demographic and bias prompts, elicit each agent's preference (P) and openness (O), pair agents across the observed (P, O, B) values, where B is the bias condition, and score each pair's final conversational agreement (A).
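The five steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: the agent fields, scales, and the placeholder scoring rule are all assumptions.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Agent:
    name: str
    preference: int  # P: elicited stance on the topic (assumed 1 = oppose .. 5 = support)
    openness: int    # O: elicited willingness to change one's mind (assumed 1 .. 5)
    bias: str        # B: bias-instruction strength in the prompt ("none", "weak", "strong")

def make_pairs(agents):
    """Pair agents across all observed (P, O, B) combinations."""
    return list(combinations(agents, 2))

def preference_gap(a, b):
    """Absolute distance between the two agents' elicited preferences."""
    return abs(a.preference - b.preference)

def score_agreement(pair):
    """Placeholder for the conversational agreement score A.

    In the paper, A comes from scoring the agents' actual dialogue; here we
    simply return the inverse of the preference gap for illustration.
    """
    a, b = pair
    max_gap = 4  # widest possible gap on a 1..5 scale
    return 1.0 - preference_gap(a, b) / max_gap

agents = [
    Agent("a1", preference=1, openness=2, bias="strong"),
    Agent("a2", preference=5, openness=2, bias="strong"),
    Agent("a3", preference=4, openness=5, bias="none"),
]
pairs = make_pairs(agents)
scores = {(p[0].name, p[1].name): score_agreement(p) for p in pairs}
print(scores)
```

The coherence tests then ask whether the scored agreement A moves with (P, O, B) in the direction established behavioral models would predict.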

The study uses nine topics spanning three contentiousness levels. Contentiousness 3 includes taxes, immigration, and free healthcare; contentiousness 2 includes electric scooters, student athletes being paid, and remote work; contentiousness 1 includes spring vs. fall, beaches vs. mountains, and Coca-Cola vs. Pepsi.

The central evaluation question is whether these latent-profile measurements predict downstream social behavior in the way established behavioral models would suggest.

Findings

The summary view highlights the overall failure pattern, while the panels at right show the corresponding empirical result for each focal finding.

Summary figure showing which tests pass at a surface level and which fail under deeper inspection.

Surface trends occasionally pass, but the deeper coherence checks largely fail.

Preference gap lowers agreement.

Larger preference gaps correspond to lower agreement, giving one of the paper’s clearest surface-level passes.
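One way to operationalize this surface-level check is to correlate each pair's preference gap with its agreement score: a clearly negative correlation matches the expected pattern. The sketch below uses synthetic (gap, agreement) values shaped like the reported trend, not the paper's data.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Synthetic pairs: preference gap vs. conversational agreement.
gaps = [0, 1, 1, 2, 3, 4, 4]
agreements = [0.9, 0.8, 0.85, 0.6, 0.5, 0.2, 0.3]

r = pearson(gaps, agreements)
print(round(r, 3))  # negative r: larger gap, lower agreement
```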

Bias instruction asymmetry figure.

Strong bias prompts amplify agreement for already aligned pairs, but they do not produce the expected increase in disagreement for maximally opposed pairs.

Topic contentiousness figure.

Even when two agents share the same preference, agreement still shifts with topic contentiousness, showing that topic context leaks into outcomes.

Low openness and large preference gap figure.

Among maximally opposed pairs, the lowest-openness configuration can produce unexpectedly high agreement rather than the lowest agreement.

Qualitative Examples

These qualitative comparisons illustrate each hypothesis using selected Gemma-3-12B-it interactions, highlighting where the model does and does not exhibit the expected behavior.


Robustness Across Models

The paper reports the same overall pattern across Gemma, Llama, and Qwen families: simpler tests sometimes pass, but in-depth coherence tests mostly fail. This suggests the issue is not isolated to one architecture or one model size.

Model           Test 1  Test 2  Test 3  Test 4  Test 5  Test 6
Qwen
  Qwen3-0.6B    Fail    Fail    Fail    Fail    Fail    Fail
  Qwen3-4B      Fail    Fail    Fail    Fail    Pass    Fail
  Qwen3-8B      Pass    Fail    Fail    Fail    Pass    Pass
Llama
  Llama-3.2-1B  Fail    Fail    Fail    Fail    Pass    Fail
  Llama-3.2-3B  Pass    Fail    Fail    Fail    Pass    Fail
  Llama-3.1-8B  Pass    Pass    Fail    Fail    Pass    Fail
Gemma
  Gemma-3-1B    Fail    Fail    Fail    Fail    Fail    Fail
  Gemma-3-4B    Pass    Fail    Fail    Fail    Pass    Fail
  Gemma-3-12B   Pass    Fail    Fail    Fail    Pass    Fail

BibTeX

@misc{mooney2025llmagentsbehaviorallycoherent,
  title={Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation},
  author={James Mooney and Josef Woldense and Zheng Robert Jia and Shirley Anugrah Hayati and My Ha Nguyen and Vipul Raheja and Dongyeop Kang},
  year={2025},
  eprint={2509.03736},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.03736},
}