A Reddit discussion simulation benchmark

MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

Yaoning Yu¹ · Ye Yu¹ · Haojing Luo² · Haohan Wang¹*

¹University of Illinois Urbana–Champaign · ²Starc.Institute · *Corresponding author

LLM agents are increasingly used to simulate human behavior and online communities. But existing work measures individual responses, role-playing, or task performance — not whether a set of generated conversations matches the content and reply-tree dynamics of real ones. MiroBench is a Reddit discussion benchmark built from 4,292 real threads across five domains, evaluating both comment-derived signals and thread structure.

Figure: MiroBench metric wheel. Inner ring: 5 families. Outer ring: 16 individual metrics.

§ 01 — Motivation

Can LLM agents reproduce the distributional patterns of real online discussion, or only fluent text?

LLM-based agents are increasingly used as simulators of human behavior, social interaction, and online communities. The promise: a controllable way to study how people discuss products, react to events, exchange opinions, and form group-level patterns.

But generating conversations is not enough. Existing studies focus on individual responses, controlled role-playing, or task performance in simulated environments. They do not test whether a set of simulated discussions reproduces the content patterns and interaction dynamics of real ones, and they offer no common basis for comparing progress across systems.

Reddit discussions are a practical setting for this measurement: public, topic-grounded, and multi-party. They carry advice-seeking, personal experience, disagreement, emotional tone, repeated talking points, and structural complexity, which makes them an observable window into broader social behavior.

MiroBench combines 4,292 real Reddit threads with matched train / validation / test splits, standardized seed contexts, and a framework that compares generated and real discussions through both comment-derived signals and reply-tree structure.

What we find

Current simulators remain distributionally mismatched to real Reddit threads even when individual comments look fluent. Prompt-based improvement only partly closes the gap. The leaderboard below makes that gap concrete, family by family.

§ 02 — Leaderboard

All models are evaluated with OASIS as the simulation engine.

| # | Method · Model | Diversity | Tone | Structure | Content | Toxicity | p > 0.05 ↑ |
|---|---|---|---|---|---|---|---|
| — | Real reference (held-out test · 200 repeated samples) | W₁ 0.016 · \|δ\| 0.060 | W₁ 0.034 · \|δ\| 0.071 | W₁ 0.165 · \|δ\| 0.100 | W₁ 0.074 · \|δ\| 0.098 | W₁ 0.004 · \|δ\| 0.043 | baseline |
| 1 | OASIS · DeepSeek-V4-Flash | W₁ 0.162 ×10.1 · \|δ\| 0.669 ×11.2 | W₁ 0.187 ×5.5 · \|δ\| 0.472 ×6.6 | W₁ 1.028 ×6.2 · \|δ\| 0.768 ×7.7 | W₁ 0.192 ×2.6 · \|δ\| 0.318 ×3.2 | W₁ 0.006 ×1.5 · \|δ\| 0.132 ×3.1 | 23 / 80 (28.7%) |
| 2 | OASIS · DeepSeek-Chat | W₁ 0.165 ×10.3 · \|δ\| 0.647 ×10.8 | W₁ 0.177 ×5.2 · \|δ\| 0.447 ×6.3 | W₁ 1.044 ×6.3 · \|δ\| 0.778 ×7.8 | W₁ 0.269 ×3.6 · \|δ\| 0.496 ×5.1 | W₁ 0.006 ×1.5 · \|δ\| 0.164 ×3.8 | 12 / 80 (15.0%) |
| 3 | OASIS · GPT-5-mini | W₁ 0.189 ×11.8 · \|δ\| 0.680 ×11.3 | W₁ 0.276 ×8.1 · \|δ\| 0.630 ×8.9 | W₁ 1.051 ×6.4 · \|δ\| 0.777 ×7.8 | W₁ 0.492 ×6.6 · \|δ\| 0.760 ×7.8 | W₁ 0.008 ×2.0 · \|δ\| 0.228 ×5.3 | 7 / 80 (8.8%) |
| 4 | OASIS · GPT-4o-mini | W₁ 0.261 ×16.3 · \|δ\| 0.830 ×13.8 | W₁ 0.283 ×8.3 · \|δ\| 0.611 ×8.6 | W₁ 1.093 ×6.6 · \|δ\| 0.811 ×8.1 | W₁ 0.194 ×2.6 · \|δ\| 0.509 ×5.2 | W₁ 0.010 ×2.5 · \|δ\| 0.703 ×16.3 | 3 / 80 (3.8%) |
| — | OASIS · Gemini 2.5 Pro (pending) | — | — | — | — | — | — |
| — | OASIS · DeepSeek-V3.2 (pending) | — | — | — | — | — | — |

Family columns show per-family averages across 5 domains: W₁ is the Wasserstein distance between generated and real Reddit distributions (lower = closer to real); |δ| is the absolute Cliff's-δ effect size. The small ×N next to each value is the ratio of the model's value to the real reference, i.e. how many multiples of the real-vs-real noise floor the model sits at. ×1.0 would mean indistinguishable from sampling more real data.
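To make the relationship between the two distances and the ×N ratio concrete, here is a minimal pure-Python sketch. The function names (`wasserstein_1`, `cliffs_delta`, `ratio_over_floor`) are hypothetical and are not MiroBench's API; the benchmark's own implementation may differ in details such as normalization or how per-domain values are averaged.

```python
from bisect import bisect_left, bisect_right


def wasserstein_1(xs, ys):
    """W1 between two empirical 1-D distributions: integral of |F_x - F_y|."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both empirical CDFs are constant on the interval [left, right).
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs (x, y)."""
    ys = sorted(ys)
    greater = sum(bisect_left(ys, x) for x in xs)            # pairs with y < x
    less = sum(len(ys) - bisect_right(ys, x) for x in xs)    # pairs with y > x
    return (greater - less) / (len(xs) * len(ys))


def ratio_over_floor(model_value, real_value):
    """The xN figure: model distance divided by the real-vs-real distance."""
    return model_value / real_value


# Diversity W1 of the top-ranked row vs the real reference row.
print(f"×{ratio_over_floor(0.162, 0.016):.1f}")  # prints ×10.1
```

The leaderboard reports |δ|, so the sign returned by `cliffs_delta` would be dropped with `abs()` before averaging.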

The top Real reference row is computed from 200 repeated sub-samples drawn from the held-out test split — not a single train/test split — so the floor accounts for sample-size variance.

Note on Structure. For structure metrics (avg_depth, structural_virality, length_cv), real Reddit threads have larger values than current LLM outputs — bigger is closer to real. The W₁ and |δ| columns still measure distance (so lower is still better), but the underlying gap is one-sided: LLMs under-build thread structure.
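For readers unfamiliar with these structure metrics, the sketch below shows one plausible way to compute them from a reply tree. Everything here is an assumption made for illustration: the `parents` mapping, the function names, and the exact definitions (mean comment depth, mean pairwise tree distance as structural virality, and population coefficient of variation of comment lengths). MiroBench's actual definitions may differ.

```python
from itertools import combinations
from statistics import mean, pstdev


def path_to_root(node, parents):
    """Nodes from `node` up to the root (the node whose parent is None)."""
    path = [node]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path


def avg_depth(parents):
    """Mean depth of comments; the root post itself (depth 0) is excluded."""
    return mean(len(path_to_root(n, parents)) - 1
                for n in parents if parents[n] is not None)


def tree_distance(u, v, parents):
    """Number of edges on the unique path between u and v."""
    steps_from_u = {n: i for i, n in enumerate(path_to_root(u, parents))}
    for j, n in enumerate(path_to_root(v, parents)):
        if n in steps_from_u:          # first shared ancestor
            return steps_from_u[n] + j
    raise ValueError("nodes are not in the same tree")


def structural_virality(parents):
    """Mean shortest-path distance over all pairs of nodes in the tree."""
    return mean(tree_distance(u, v, parents)
                for u, v in combinations(parents, 2))


def length_cv(lengths):
    """Coefficient of variation of comment lengths (population std / mean)."""
    return pstdev(lengths) / mean(lengths)


# Hypothetical thread: two top-level comments, one of which has a reply.
nested = {"post": None, "a": "post", "b": "post", "c": "a"}
# Same number of comments, but all replying directly to the post.
flat = {"post": None, "a": "post", "b": "post", "c": "post"}
print(avg_depth(nested), avg_depth(flat))
print(structural_virality(nested), structural_virality(flat))
```

Under these assumed definitions the nested thread scores higher than the flat one on both depth and virality, which is the direction in which real threads differ from generated ones according to the note above.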

The p > 0.05 column counts (metric × domain) cells where Mann–Whitney U cannot reject equal distributions, out of 16 metrics × 5 domains = 80 cells. All models run under OASIS with the same task setup.
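The pass count can be sketched as follows. This is an illustrative stand-in, not the benchmark's code: it uses the textbook normal approximation to the two-sided Mann–Whitney U test (with tie and continuity corrections), and the `cells` layout and function names are hypothetical.

```python
import math
from collections import Counter


def mann_whitney_p(xs, ys):
    """Two-sided Mann-Whitney U p-value via the normal approximation."""
    n1, n2 = len(xs), len(ys)
    pooled = sorted(list(xs) + list(ys))
    # Midranks: tied values share the average of the ranks they occupy.
    rank, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        i = j
    u1 = sum(rank[x] for x in xs) - n1 * (n1 + 1) / 2
    n = n1 + n2
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    var = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if var == 0:
        return 1.0                           # every value is identical
    # Continuity-corrected z score; clamp at 0 so that p never exceeds 1.
    z = max((abs(u1 - n1 * n2 / 2) - 0.5) / math.sqrt(var), 0.0)
    return math.erfc(z / math.sqrt(2))


def count_passes(cells, alpha=0.05):
    """cells maps (metric, domain) -> (generated_values, real_values)."""
    return sum(mann_whitney_p(gen, real) > alpha
               for gen, real in cells.values())


# Two hypothetical cells: one matching, one clearly shifted.
cells = {
    ("metric_a", "domain_1"): (list(range(30)), list(range(30))),
    ("metric_b", "domain_1"): (list(range(30)), list(range(100, 130))),
}
print(f"{count_passes(cells)} / {len(cells)}")  # prints 1 / 2
```

With all 80 (metric × domain) cells filled in, the same call would yield the numerator of the p > 0.05 column.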

§ 03 — Real-vs-Real Reference

What does a good score look like?

These figures show the noise floor of the benchmark itself — i.e., the gap you observe when comparing one half of real Reddit data to the other half. Any model's score should be read against this floor. These figures are not part of the leaderboard; they characterise the measurement.
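A real-vs-real floor of this kind can be estimated by repeatedly splitting real values into two halves and averaging the distance between them. The sketch below assumes a single metric stored as a flat list of numbers; `noise_floor` and `w1_equal_size` are hypothetical names, and the benchmark's actual sampling scheme (sample sizes, stratification by domain) is not specified here.

```python
import random
from statistics import mean


def w1_equal_size(xs, ys):
    """W1 for two equal-size 1-D samples: mean gap between sorted values."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return mean(abs(a - b) for a, b in zip(sorted(xs), sorted(ys)))


def noise_floor(real_values, repeats=200, seed=0):
    """Average real-vs-real W1 over repeated random half/half splits."""
    rng = random.Random(seed)
    half = len(real_values) // 2
    gaps = []
    for _ in range(repeats):
        shuffled = rng.sample(real_values, len(real_values))
        gaps.append(w1_equal_size(shuffled[:half], shuffled[half:2 * half]))
    return mean(gaps)


# Synthetic stand-in for one metric measured on 400 real threads.
data_rng = random.Random(1)
real = [data_rng.gauss(0.0, 1.0) for _ in range(400)]
floor = noise_floor(real)
print(f"real-vs-real W1 floor: {floor:.3f}")
```

Because both halves come from the same distribution, the resulting value reflects sampling variance only, which is what makes it a usable denominator for the ×N ratios.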

Repeated-sample MWU & KS p-value pass rates

Figure: repeated-sample p-value pass rates across domains and metrics.

Repeatedly drawing two equal-sized samples from the real distribution and running Mann–Whitney U / Kolmogorov–Smirnov tests, ≈95% of comparisons return p > 0.05. The remaining ≈5% are the false positives expected by construction at the 0.05 significance level. This ≈95% pass rate is the effective "100% match" line, and no model has approached it.

Real-vs-real W₁ (log scale)

Real-vs-real |Cliff's δ|

Figure: real-vs-real |Cliff's δ| by family and domain.

Model gap vs real-vs-real baseline

Figure: per-model W₁ and |Cliff's δ| ratios over the real-vs-real baseline.

For each family, the dotplot shows the ratio of a model's average W₁ (and |δ|) to the real-vs-real baseline. A ratio of 1.0 would mean a model is indistinguishable from sampling more real data; the family-level ratios reported in the leaderboard range from about 1.5× to 16× the floor, with consistently large gaps on Diversity and Structure.

§ 04 — Quick start

Three steps.

1 · Install

$ git clone https://github.com/yyu6/MiroBench.git
$ cd MiroBench
$ pip install -e .

2 · Score generated threads

# each thread = directory with a discussion.json
$ mirobench score my_threads/ --device cpu

3 · Compare against real Reddit

$ mirobench compare my_threads/thread_scores.csv \
    --core-only \
    --model-name "my-model"

§ 05 — Citation

If you find MiroBench useful in your research, please cite our paper.

We also welcome citations from any work that builds on the data, the evaluation framework, or the reference distributions released with this benchmark.

mirobench.bib

@article{yu2026mirobench,
  title  = {MiroBench: Benchmarking Realism in Agentic Simulation
            of Real-world Discussions},
  author = {Yu, Yaoning and Yu, Ye and Luo, Haojing and Wang, Haohan},
  year   = {2026},
  url    = {https://github.com/yyu6/MiroBench}
}