A Reddit discussion simulation benchmark

MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

Yaoning Yu¹  ·  Ye Yu¹  ·  Haojing Luo²  ·  Haohan Wang¹*
¹University of Illinois Urbana–Champaign   ²Starc.Institute   *Corresponding author

LLM agents are increasingly used to simulate human behavior and online communities. But existing work measures individual responses, role-playing, or task performance — not whether a set of generated conversations matches the content and reply-tree dynamics of real ones. MiroBench is a Reddit discussion benchmark built from 4,292 real threads across five domains, evaluating both comment-derived signals and thread structure.

[Figure: MiroBench metric chart, 16 metrics in 5 families. Inner ring: 5 families. Outer ring: 16 individual metrics.]

§ 01 — Motivation

Can LLM agents reproduce the distributional patterns of real online discussion, or only fluent text?

LLM-based agents are increasingly used as simulators of human behavior, social interaction, and online communities. The promise: a controllable way to study how people discuss products, react to events, exchange opinions, and form group-level patterns.

But generating conversations is not enough. Existing studies focus on individual responses, controlled role-playing, or task performance in simulated environments. They do not test whether a set of simulated discussions reproduces the content patterns and interaction dynamics of real ones, and they offer no common basis for comparing progress across systems.

Reddit discussions are a practical setting for this measurement: public, topic-grounded, and multi-party. They carry advice-seeking, personal experience, disagreement, emotional tone, repeated talking points, and structural complexity, which makes them an observable window into broader social behavior.

MiroBench combines 4,292 real Reddit threads with matched train / validation / test splits, standardized seed contexts, and a framework that compares generated and real discussions through both comment-derived signals and reply-tree structure.

What we find

Current simulators remain distributionally mismatched to real Reddit threads even when individual comments look fluent. Prompt-based improvement only partly closes the gap. The leaderboard below makes that gap concrete, family by family.

§ 02 — Leaderboard

All models are evaluated with OASIS as the simulation engine.

#   Method · Model                        Metric  Diversity     Tone         Structure    Content      Toxicity      p > 0.05 ↑
    Real reference                        W₁      0.016         0.034        0.165        0.074        0.004         baseline
    held-out test · 200 repeated samples  |δ|     0.060         0.071        0.100        0.098        0.043
1   OASIS · DeepSeek-V4-Flash             W₁      0.162 ×10.1   0.187 ×5.5   1.028 ×6.2   0.192 ×2.6   0.006 ×1.5    23 / 80 (28.7%)
                                          |δ|     0.669 ×11.2   0.472 ×6.6   0.768 ×7.7   0.318 ×3.2   0.132 ×3.1
2   OASIS · DeepSeek-Chat                 W₁      0.165 ×10.3   0.177 ×5.2   1.044 ×6.3   0.269 ×3.6   0.006 ×1.5    12 / 80 (15.0%)
                                          |δ|     0.647 ×10.8   0.447 ×6.3   0.778 ×7.8   0.496 ×5.1   0.164 ×3.8
3   OASIS · GPT-5-mini                    W₁      0.189 ×11.8   0.276 ×8.1   1.051 ×6.4   0.492 ×6.6   0.008 ×2.0    7 / 80 (8.8%)
                                          |δ|     0.680 ×11.3   0.630 ×8.9   0.777 ×7.8   0.760 ×7.8   0.228 ×5.3
4   OASIS · GPT-4o-mini                   W₁      0.261 ×16.3   0.283 ×8.3   1.093 ×6.6   0.194 ×2.6   0.010 ×2.5    3 / 80 (3.8%)
                                          |δ|     0.830 ×13.8   0.611 ×8.6   0.811 ×8.1   0.509 ×5.2   0.703 ×16.3
    OASIS · Gemini 2.5 Pro                pending
    OASIS · DeepSeek-V3.2                 pending

Family columns show per-family averages across the 5 domains. W₁ is the Wasserstein distance between the generated and real Reddit distributions (lower = closer to real); |δ| is the absolute Cliff's δ effect size. The small ×N next to each value is the ratio of the model's value to the real reference, i.e. how many multiples of the real-vs-real noise floor it reaches. ×1.0 would mean the model is indistinguishable from sampling more real data.
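To make the three quantities in the leaderboard concrete, here is a minimal pure-Python sketch of a 1-D Wasserstein distance, Cliff's δ, and the ×N ratio. The function names are illustrative and are not taken from the MiroBench codebase; the benchmark's own implementation may normalise or aggregate values differently.

```python
from bisect import bisect_left, bisect_right


def wasserstein_1d(xs, ys):
    """W1 between two 1-D empirical distributions (sample sizes may differ).

    Integrates |F_x(t) - F_y(t)| over the merged support.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        fx = bisect_right(xs, left) / len(xs)  # empirical CDF of xs at `left`
        fy = bisect_right(ys, left) / len(ys)  # empirical CDF of ys at `left`
        total += abs(fx - fy) * (right - left)
    return total


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    ys = sorted(ys)
    n_y = len(ys)
    greater = sum(bisect_left(ys, x) for x in xs)       # pairs with y < x
    less = sum(n_y - bisect_right(ys, x) for x in xs)   # pairs with y > x
    return (greater - less) / (len(xs) * n_y)


def ratio_over_floor(model_value, real_reference_value):
    """The xN multiplier: model value divided by the real-vs-real value."""
    return model_value / real_reference_value
```

For example, `round(ratio_over_floor(0.162, 0.016), 1)` gives `10.1`, which matches the ×10.1 shown for DeepSeek-V4-Flash on Diversity W₁.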

The top Real reference row is computed from 200 repeated sub-samples drawn from the held-out test split — not a single train/test split — so the floor accounts for sample-size variance.
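One way to picture the repeated-sample floor is the sketch below. It assumes each repetition shuffles the real values for one metric and compares two disjoint, equal-sized halves; the exact sub-sampling scheme MiroBench uses is not specified here, so `real_vs_real_floor` and its parameters are hypothetical.

```python
import random
from statistics import mean


def w1_equal_size(xs, ys):
    """W1 for equal-sized samples: mean absolute gap between order statistics."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return mean(abs(a - b) for a, b in zip(sorted(xs), sorted(ys)))


def real_vs_real_floor(real_values, n_repeats=200, seed=0):
    """Average W1 between two random halves of the same real sample."""
    rng = random.Random(seed)
    half = len(real_values) // 2
    distances = []
    for _ in range(n_repeats):
        shuffled = list(real_values)
        rng.shuffle(shuffled)
        # Disjoint halves, so both sides come from the same distribution.
        distances.append(w1_equal_size(shuffled[:half], shuffled[half:2 * half]))
    return mean(distances)
```

Averaging over many repetitions is what keeps a single lucky or unlucky split from setting the floor.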

Note on Structure. For structure metrics (avg_depth, structural_virality, length_cv), real Reddit threads have larger values than current LLM outputs — bigger is closer to real. The W₁ and |δ| columns still measure distance (so lower is still better), but the underlying gap is one-sided: LLMs under-build thread structure.
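The three structure metrics named above can be sketched as follows. The definitions here are assumptions based on common usage: mean comment depth with the post at depth 0, structural virality as the mean shortest-path distance over all node pairs in the reply tree, and length_cv as the coefficient of variation of comment lengths. MiroBench's exact definitions may differ, and the `parent_of` mapping is a hypothetical input format.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


def avg_depth(parent_of):
    """Mean depth of comments; the root post (parent None) has depth 0."""
    def depth(node):
        d = 0
        while parent_of[node] is not None:
            node = parent_of[node]
            d += 1
        return d

    comment_depths = [depth(n) for n in parent_of if parent_of[n] is not None]
    return mean(comment_depths) if comment_depths else 0.0


def structural_virality(parent_of):
    """Mean shortest-path distance over all pairs of nodes in the tree."""
    adjacency = defaultdict(list)
    for child, parent in parent_of.items():
        if parent is not None:
            adjacency[child].append(parent)
            adjacency[parent].append(child)
    nodes = list(parent_of)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0
    for source in nodes:  # breadth-first search from every node
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in dist:
                    dist[neighbour] = dist[current] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total / (n * (n - 1))  # total counts each ordered pair once


def length_cv(lengths):
    """Coefficient of variation: population std dev divided by the mean."""
    return pstdev(lengths) / mean(lengths)
```

A flat thread where every comment replies to the post scores low on both depth and virality, which is the one-sided gap the note describes.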

The p > 0.05 column counts (metric × domain) cells where Mann–Whitney U cannot reject equal distributions, out of 16 metrics × 5 domains = 80 cells. All models run under OASIS with the same task setup.
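The tally behind the p > 0.05 column can be sketched as a simple count over (metric, domain) cells. The helper name and the input mapping are hypothetical; the p-values are assumed to come from a Mann–Whitney U test run separately for each cell.

```python
def count_passing_cells(p_values, alpha=0.05):
    """Count cells whose p-value exceeds alpha.

    `p_values` maps (metric, domain) -> p-value from a two-sample test.
    Returns (number of passing cells, total number of cells).
    """
    passing = [cell for cell, p in p_values.items() if p > alpha]
    return len(passing), len(p_values)


def format_cell_count(passed, total):
    """Render the count in the leaderboard's 'passed / total' style."""
    return f"{passed} / {total}"
```

With 16 metrics and 5 domains the mapping has 80 entries, so a model that passes in 23 cells would be rendered as `23 / 80`.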

§ 03 — Real-vs-Real Reference

What does a good score look like?

These figures show the noise floor of the benchmark itself — i.e., the gap you observe when comparing one half of real Reddit data to the other half. Any model's score should be read against this floor. These figures are not part of the leaderboard; they characterise the measurement.

[Figure: Repeated-sample MWU & KS p-value pass rates across domains and metrics]

Repeatedly drawing two equal-sized samples from the real distribution and running Mann–Whitney U / Kolmogorov–Smirnov tests yields p > 0.05 in ≈95% of comparisons. This is what we should expect by construction: at a 0.05 threshold, about 5% of same-distribution comparisons are false positives. The ≈95% pass rate is therefore the effective "100% match" line, which no model has approached.
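The ≈95% figure can be reproduced on synthetic data. The sketch below is not MiroBench code: it uses a Mann–Whitney U test with the normal approximation (no tie or continuity correction, so it assumes continuous values) and draws both samples from the same Gaussian as a stand-in for real Reddit metric values.

```python
import math
import random


def mann_whitney_p(xs, ys):
    """Two-sided Mann-Whitney U p-value via the normal approximation.

    Assumes continuous data (no ties) and moderately large samples.
    """
    n_x, n_y = len(xs), len(ys)
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = sum(
        rank for rank, (_, group) in enumerate(pooled, start=1) if group == 0
    )
    u = rank_sum_x - n_x * (n_x + 1) / 2
    mu = n_x * n_y / 2
    sigma = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def null_pass_rate(n_repeats=400, sample_size=50, alpha=0.05, seed=0):
    """Share of same-distribution comparisons with p > alpha."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(n_repeats):
        a = [rng.gauss(0, 1) for _ in range(sample_size)]
        b = [rng.gauss(0, 1) for _ in range(sample_size)]
        if mann_whitney_p(a, b) > alpha:
            passes += 1
    return passes / n_repeats
```

Calling `null_pass_rate()` should land near 0.95, with some run-to-run spread that shrinks as `n_repeats` grows.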

[Figure: Real-vs-real W₁ by family and domain (log scale)]
[Figure: Real-vs-real |Cliff's δ| by family and domain]
[Figure: Model gap vs real-vs-real baseline. Per-model W₁ and |Cliff's δ| ratios over the real-vs-real baseline]

For each family, the dot plot shows the ratio of a model's average W₁ (and |δ|) to the real-vs-real baseline. A ratio of 1.0 would mean a model is indistinguishable from sampling more real data; in the leaderboard above, current models sit between ×1.5 and ×16.3 of the floor, with the largest ratios on Diversity and the largest absolute W₁ on Structure.

§ 04 — Quick start

Three steps.

1 · Install

$ git clone https://github.com/yyu6/MiroBench.git
$ cd MiroBench
$ pip install -e .

2 · Score generated threads

# each thread = directory with a discussion.json
$ mirobench score my_threads/ --device cpu

3 · Compare against real Reddit

$ mirobench compare my_threads/thread_scores.csv \
    --core-only \
    --model-name "my-model"

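Before scoring, it can help to check that the input folder follows the layout described in step 2, where each thread is a directory containing a discussion.json. The helper below is a hypothetical pre-flight check and is not part of the mirobench CLI; it only verifies that the file exists and parses as JSON, and makes no assumption about the schema inside.

```python
import json
from pathlib import Path


def find_thread_problems(root):
    """Return a list of human-readable problems found under `root`.

    Every sub-directory is treated as one thread and is expected to
    contain a parseable discussion.json.
    """
    problems = []
    thread_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for thread_dir in thread_dirs:
        discussion = thread_dir / "discussion.json"
        if not discussion.is_file():
            problems.append(f"{thread_dir.name}: missing discussion.json")
            continue
        try:
            json.loads(discussion.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{thread_dir.name}: invalid JSON ({exc.msg})")
    return problems
```

An empty list from `find_thread_problems("my_threads/")` means every thread directory has a readable discussion.json.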
§ 05 — Citation

If you find MiroBench useful in your research, please cite our paper.

We also welcome citations from any work that builds on the data, the evaluation framework, or the reference distributions released with this benchmark.

mirobench.bib
@article{yu2026mirobench,
  title  = {MiroBench: Benchmarking Realism in Agentic Simulation
            of Real-world Discussions},
  author = {Yu, Yaoning and Yu, Ye and Luo, Haojing and Wang, Haohan},
  year   = {2026},
  url    = {https://github.com/yyu6/MiroBench}
}
Inspired by

MiroBench builds on prior open-source work: OASIS (CAMEL-AI) — the multi-agent social-interaction simulation framework that powers our generation pipeline — and MiroFish, a universal swarm-intelligence prediction engine powered by multi-agent simulation.