Full Leaderboard

MiroBench · all models, all domains.

All scores are computed on the 16 selected MiroBench metrics across 5 domains, with OASIS as the simulation engine. The summary mirrors the figure on the main page; the per-domain table below shows where each model passes and fails.

Summary by family

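Per-family W₁ (Wasserstein-1 distance) and |Cliff's δ| across 5 domains; lower means closer to real data. Each × multiplier is the ratio of a model's value to the real-reference row.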

#	Method · Model	Diversity	Tone	Structure	Content	Toxicity	p > 0.05 ↑
—	Real reference held-out test · 200 repeated samples	W₁ 0.016 |δ| 0.060	W₁ 0.034 |δ| 0.071	W₁ 0.165 |δ| 0.100	W₁ 0.074 |δ| 0.098	W₁ 0.004 |δ| 0.043	— baseline —
1	OASIS DeepSeek-V4-Flash	W₁ 0.162 ×10.1 |δ| 0.669 ×11.2	W₁ 0.187 ×5.5 |δ| 0.472 ×6.6	W₁ 1.028 ×6.2 |δ| 0.768 ×7.7	W₁ 0.192 ×2.6 |δ| 0.318 ×3.2	W₁ 0.006 ×1.5 |δ| 0.132 ×3.1	23 / 80 28.7%
2	OASIS DeepSeek-Chat	W₁ 0.165 ×10.3 |δ| 0.647 ×10.8	W₁ 0.177 ×5.2 |δ| 0.447 ×6.3	W₁ 1.044 ×6.3 |δ| 0.778 ×7.8	W₁ 0.269 ×3.6 |δ| 0.496 ×5.1	W₁ 0.006 ×1.5 |δ| 0.164 ×3.8	12 / 80 15.0%
3	OASIS GPT-5-mini	W₁ 0.189 ×11.8 |δ| 0.680 ×11.3	W₁ 0.276 ×8.1 |δ| 0.630 ×8.9	W₁ 1.051 ×6.4 |δ| 0.777 ×7.8	W₁ 0.492 ×6.6 |δ| 0.760 ×7.8	W₁ 0.008 ×2.0 |δ| 0.228 ×5.3	7 / 80 8.8%
4	OASIS GPT-4o-mini	W₁ 0.261 ×16.3 |δ| 0.830 ×13.8	W₁ 0.283 ×8.3 |δ| 0.611 ×8.6	W₁ 1.093 ×6.6 |δ| 0.811 ×8.1	W₁ 0.194 ×2.6 |δ| 0.509 ×5.2	W₁ 0.010 ×2.5 |δ| 0.703 ×16.3	3 / 80 3.8%
—	OASIS Gemini 2.5 Pro · pending	—	—	—	—	—	—
—	OASIS DeepSeek-V3.2 · pending	—	—	—	—	—	—
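
The two distances in the summary table can be sketched as follows. This is a minimal illustration, assuming W₁ is the one-dimensional Wasserstein-1 distance between the real and simulated values of a metric and δ is Cliff's delta; the function names and toy samples are hypothetical and are not MiroBench's implementation. The last line reproduces one × multiplier from the table as a plain ratio to the real-reference row.

```python
from bisect import bisect_right


def wasserstein_1(xs, ys):
    """1-D Wasserstein-1 distance: area between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def ratio_to_reference(model_value, reference_value):
    """Multiplier shown next to each score, rounded to one decimal."""
    return round(model_value / reference_value, 1)


# Toy samples (hypothetical values, not benchmark data).
real = [0.0, 1.0, 2.0]
simulated = [1.0, 2.0, 3.0]

w1 = wasserstein_1(real, simulated)
abs_delta = abs(cliffs_delta(real, simulated))

# Diversity W1 of the top-ranked row against the real-reference row.
diversity_ratio = ratio_to_reference(0.162, 0.016)
```

Both quantities are zero when the two samples are identical, which is why the real-reference row is close to, but not exactly, zero: it compares two different draws of real data.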

Per-domain breakdown

p > 0.05 cells passing per (model × domain).

Each table entry shows the number of metrics (out of 16) for which the Mann–Whitney U test fails to reject equal distributions (p > 0.05) for that (model, domain) pair. Colour-coded: low, mid-low, mid, good, excellent.

Method · Model	Credit cards	Cameras	Cell phones	Headphones	Laptops	Total ↑
OASIS DeepSeek-V4-Flash	5/16	6/16	2/16	4/16	6/16	23/80 28.7%
OASIS DeepSeek-Chat	6/16	0/16	0/16	3/16	3/16	12/80 15.0%
OASIS GPT-5-mini	3/16	0/16	1/16	0/16	3/16	7/80 8.8%
OASIS GPT-4o-mini	1/16	1/16	0/16	0/16	1/16	3/80 3.8%
OASIS Gemini 2.5 Pro · pending	—	—	—	—	—	—
OASIS DeepSeek-V3.2 · pending	—	—	—	—	—	—
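
The pass criterion behind each entry can be sketched as below. This is an illustration only: it assumes a two-sided Mann–Whitney U test using the normal approximation with tie and continuity corrections, none of which the page specifies. The names mann_whitney_p and passes, and the sample values, are hypothetical.

```python
import math

ALPHA = 0.05  # threshold used in the tables (p > 0.05 counts as a pass)


def mann_whitney_p(xs, ys):
    """Two-sided p-value via the normal approximation with tie correction."""
    n, m = len(xs), len(ys)
    total = n + m
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < total:
        j = i
        while j < total and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1..j
        group_size = j - i
        tie_term += group_size**3 - group_size
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_x - n * (n + 1) / 2
    mean_u = n * m / 2
    var_u = n * m / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = max((abs(u - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return math.erfc(z / math.sqrt(2))


def passes(real, simulated, alpha=ALPHA):
    """A metric passes when the test cannot reject equal distributions."""
    return mann_whitney_p(real, simulated) > alpha


# Toy samples (hypothetical values, not benchmark data).
same = list(range(30))
shifted = [v + 100 for v in same]
same_passes = passes(same, same)
shifted_passes = passes(same, shifted)
```

A table entry would then be the count of metrics for which passes(...) is true within one (model, domain) pair.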

No model passes more than 6/16 metrics in any single domain. The hardest domain across the board is Cell phones (0–2 / 16); the easiest is Credit cards (1–6 / 16). Laptops has the same range but fewer passes in total across the four scored models (13 vs. 15). For reference, sampling more real data from the same domain produces ≈95% pass rates (≈15/16 cells per domain), i.e., the benchmark's own noise floor.
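
The claims above can be checked directly against the per-domain table. The sketch below copies the pass counts for the four scored models and recomputes totals and per-domain ranges. Ranking domains by summed passes is an assumption made here for illustration; the page itself reports only the ranges.

```python
DOMAINS = ["Credit cards", "Cameras", "Cell phones", "Headphones", "Laptops"]
METRICS_PER_DOMAIN = 16

# Passing metrics per domain, copied from the per-domain table.
PASSES = {
    "DeepSeek-V4-Flash": [5, 6, 2, 4, 6],
    "DeepSeek-Chat": [6, 0, 0, 3, 3],
    "GPT-5-mini": [3, 0, 1, 0, 3],
    "GPT-4o-mini": [1, 1, 0, 0, 1],
}

totals = {model: sum(row) for model, row in PASSES.items()}
best_entry = max(max(row) for row in PASSES.values())

# (min, max, sum) of passes per domain across the four scored models.
domain_stats = {}
for i, domain in enumerate(DOMAINS):
    column = [row[i] for row in PASSES.values()]
    domain_stats[domain] = (min(column), max(column), sum(column))

# Assumption: "hardest" and "easiest" are ranked by summed passes.
hardest = min(domain_stats, key=lambda d: domain_stats[d][2])
easiest = max(domain_stats, key=lambda d: domain_stats[d][2])

for model, total in totals.items():
    print(f"{model}: {total}/{METRICS_PER_DOMAIN * len(DOMAINS)}")
print("hardest:", hardest, domain_stats[hardest])
print("easiest:", easiest, domain_stats[easiest])
```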

Metrics scope

16 metrics in 5 families.

Diversity · 3

self_bleu_4
semantic_mean_cosine
self_bertscore_mean_f1

Tone · 4

hard_disagree_rate
polite_rate
impolite_rate
neutral_rate

Structure · 3

length_cv
avg_depth
structural_virality

Content · 2

mean_story_probability
emotion_entropy

Toxicity · 4

toxicity_mean
severe_toxicity_mean
obscene_mean
threat_mean

Each metric is scored on every (model, domain) pair. A "cell" in the tables above is a single (metric, domain) result, so each model has 16 × 5 = 80 cells. Family columns on the summary table are unweighted averages across the family's metrics.
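
The bookkeeping above can be written out as a short sketch. The family-to-metric mapping is taken from the lists in this section; family_average and the example scores are hypothetical and only illustrate what an unweighted average over a family's metrics looks like.

```python
from statistics import mean

FAMILIES = {
    "Diversity": ["self_bleu_4", "semantic_mean_cosine", "self_bertscore_mean_f1"],
    "Tone": ["hard_disagree_rate", "polite_rate", "impolite_rate", "neutral_rate"],
    "Structure": ["length_cv", "avg_depth", "structural_virality"],
    "Content": ["mean_story_probability", "emotion_entropy"],
    "Toxicity": ["toxicity_mean", "severe_toxicity_mean", "obscene_mean", "threat_mean"],
}
N_DOMAINS = 5

n_metrics = sum(len(names) for names in FAMILIES.values())
cells_per_model = n_metrics * N_DOMAINS  # denominator of the "/ 80" column


def family_average(per_metric_scores, family):
    """Unweighted mean of one score (e.g. W1) over a family's metrics."""
    return mean(per_metric_scores[name] for name in FAMILIES[family])


# Hypothetical per-metric values, for illustration only.
example_scores = {"mean_story_probability": 0.10, "emotion_entropy": 0.30}
content_average = family_average(example_scores, "Content")
```

Because the average is unweighted, a two-metric family such as Content is more sensitive to a single outlying metric than a four-metric family such as Tone.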