Full Leaderboard

MiroBench · all models, all domains.

All scores are computed on the 16 selected MiroBench metrics across 5 domains, with OASIS as the simulation engine. The summary mirrors the figure on the main page; the per-domain table below shows where each model passes and fails.

Summary by family

Per-family W₁ and |Cliff's δ| across 5 domains.

Real reference: held-out test · 200 repeated samples. ×N is the ratio of a model's value to the real-reference value in the same column. The last column counts cells with p > 0.05 out of 80.

#   Method · Model              Stat  Diversity      Tone           Structure      Content        Toxicity       p > 0.05 ↑
    Real reference              W₁    0.016          0.034          0.165          0.074          0.004          baseline
                                |δ|   0.060          0.071          0.100          0.098          0.043
1   OASIS · DeepSeek-V4-Flash   W₁    0.162 ×10.1    0.187 ×5.5     1.028 ×6.2     0.192 ×2.6     0.006 ×1.5     23 / 80 (28.7%)
                                |δ|   0.669 ×11.2    0.472 ×6.6     0.768 ×7.7     0.318 ×3.2     0.132 ×3.1
2   OASIS · DeepSeek-Chat       W₁    0.165 ×10.3    0.177 ×5.2     1.044 ×6.3     0.269 ×3.6     0.006 ×1.5     12 / 80 (15.0%)
                                |δ|   0.647 ×10.8    0.447 ×6.3     0.778 ×7.8     0.496 ×5.1     0.164 ×3.8
3   OASIS · GPT-5-mini          W₁    0.189 ×11.8    0.276 ×8.1     1.051 ×6.4     0.492 ×6.6     0.008 ×2.0     7 / 80 (8.8%)
                                |δ|   0.680 ×11.3    0.630 ×8.9     0.777 ×7.8     0.760 ×7.8     0.228 ×5.3
4   OASIS · GPT-4o-mini         W₁    0.261 ×16.3    0.283 ×8.3     1.093 ×6.6     0.194 ×2.6     0.010 ×2.5     3 / 80 (3.8%)
                                |δ|   0.830 ×13.8    0.611 ×8.6     0.811 ×8.1     0.509 ×5.2     0.703 ×16.3
    OASIS · Gemini 2.5 Pro      pending
    OASIS · DeepSeek-V3.2       pending

Per-domain breakdown

p > 0.05 cells passing per (model × domain).

Each table entry shows the number of metrics (out of 16) for which the Mann–Whitney U test cannot reject equal distributions (p > 0.05) for that (model, domain) pair. On the rendered page, entries are colour-coded on a five-step scale: low, mid-low, mid, good, excellent.
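As a rough illustration of that per-metric check, the sketch below counts how many metrics pass for one (model, domain) pair. It is not the MiroBench implementation. It assumes a two-sided Mann–Whitney U test using the tie-corrected normal approximation with continuity correction, written with the Python standard library only. The names `mann_whitney_p` and `count_passing` and the sample values are hypothetical. A real pipeline might instead call an existing routine such as `scipy.stats.mannwhitneyu`.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05  # threshold taken from the "p > 0.05" column


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(list(x) + list(y))

    # Midranks: tied values share the average of the rank positions they occupy.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # every pooled value is identical

    # Continuity correction, clamped so that z is never negative.
    z = max((abs(u1 - mean_u) - 0.5) / sqrt(var_u), 0.0)
    return 2 * (1 - NormalDist().cdf(z))


def count_passing(real, sim, alpha=ALPHA):
    """Number of metrics with p > alpha for one (model, domain) pair."""
    return sum(mann_whitney_p(real[m], sim[m]) > alpha for m in real)


# Toy values for two metrics, NOT MiroBench data.
real = {"polite_rate": [0.1 * i for i in range(30)], "avg_depth": list(range(30))}
sim = {"polite_rate": [0.1 * i for i in range(30)], "avg_depth": list(range(100, 130))}
print(count_passing(real, sim))  # 1: only polite_rate is indistinguishable
```

With 16 metrics per domain, the same count would range from 0 to 16, matching the entries in the table.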

Method · Model              Credit cards  Cameras  Cell phones  Headphones  Laptops  Total ↑
OASIS · DeepSeek-V4-Flash   5/16          6/16     2/16         4/16        6/16     23/80 (28.7%)
OASIS · DeepSeek-Chat       6/16          0/16     0/16         3/16        3/16     12/80 (15.0%)
OASIS · GPT-5-mini          3/16          0/16     1/16         0/16        3/16     7/80 (8.8%)
OASIS · GPT-4o-mini         1/16          1/16     0/16         0/16        1/16     3/80 (3.8%)
OASIS · Gemini 2.5 Pro      pending
OASIS · DeepSeek-V3.2       pending

No model passes more than 6/16 metrics in any single domain. The hardest domain across the board is Cell phones (0–2 / 16 per model); the easiest is Credit cards (1–6 / 16 per model). For reference, sampling more real data from the same domain produces ≈95% pass rates (≈15/16 cells per domain). That figure is the benchmark's own noise floor, and so the practical ceiling against which the model scores should be read.
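The per-domain observations above can be re-derived from the table with a few lines of arithmetic. The sketch below copies the pass counts from the per-domain table and tallies them by model and by domain. The variable names are illustrative only.

```python
# Pass counts (out of 16) copied from the per-domain table above.
DOMAINS = ["Credit cards", "Cameras", "Cell phones", "Headphones", "Laptops"]
PASSES = {
    "DeepSeek-V4-Flash": [5, 6, 2, 4, 6],
    "DeepSeek-Chat":     [6, 0, 0, 3, 3],
    "GPT-5-mini":        [3, 0, 1, 0, 3],
    "GPT-4o-mini":       [1, 1, 0, 0, 1],
}

# Total passing cells per model (out of 16 metrics x 5 domains = 80).
model_totals = {model: sum(counts) for model, counts in PASSES.items()}

# Passing cells per domain, summed over the four scored models.
domain_totals = {
    d: sum(counts[i] for counts in PASSES.values()) for i, d in enumerate(DOMAINS)
}

# Lowest and highest per-model count within each domain.
domain_ranges = {
    d: (min(c[i] for c in PASSES.values()), max(c[i] for c in PASSES.values()))
    for i, d in enumerate(DOMAINS)
}

hardest = min(domain_totals, key=domain_totals.get)
easiest = max(domain_totals, key=domain_totals.get)
best_entry = max(max(counts) for counts in PASSES.values())

print(model_totals)    # 23, 12, 7 and 3 passing cells out of 80
print(domain_totals)   # 15, 7, 3, 7 and 13 passing cells out of 64
print(hardest, domain_ranges[hardest])   # Cell phones (0, 2)
print(easiest, domain_ranges[easiest])   # Credit cards (1, 6)
print(best_entry)      # 6
```

Summed over the four scored models, Cell phones passes 3 of 64 cells and Credit cards 15 of 64, which is consistent with the ranking stated above.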

Metrics scope

16 metrics in 5 families.

Diversity · 3
  • self_bleu_4
  • semantic_mean_cosine
  • self_bertscore_mean_f1
Tone · 4
  • hard_disagree_rate
  • polite_rate
  • impolite_rate
  • neutral_rate
Structure · 3
  • length_cv
  • avg_depth
  • structural_virality
Content · 2
  • mean_story_probability
  • emotion_entropy
Toxicity · 4
  • toxicity_mean
  • severe_toxicity_mean
  • obscene_mean
  • threat_mean

Each metric is scored on every (model, domain) pair. A "cell" in the tables above is a single (metric, domain) result for one model, so each model has 16 × 5 = 80 cells. Family columns in the summary table are unweighted averages across the family's metrics.
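To make the family columns concrete, here is a minimal sketch of how a per-family W₁ and |δ| could be assembled from per-metric samples. It rests on several assumptions not stated on this page: that W₁ denotes the 1-Wasserstein distance between the real and simulated samples of a metric, that both samples have the same size, and that no rescaling is applied before averaging. The function names and toy values are hypothetical and the numbers are not MiroBench data.

```python
def wasserstein_1(x, y):
    """W1 between two equal-size 1-D samples: mean absolute gap of sorted values."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def cliffs_delta(x, y):
    """Cliff's delta: share of pairs with x > y minus share of pairs with x < y."""
    greater = sum(a > b for a in x for b in y)
    less = sum(a < b for a in x for b in y)
    return (greater - less) / (len(x) * len(y))


def family_scores(real, sim, family_metrics):
    """Unweighted mean of W1 and of |delta| over the metrics of one family."""
    w1 = [wasserstein_1(real[m], sim[m]) for m in family_metrics]
    delta = [abs(cliffs_delta(real[m], sim[m])) for m in family_metrics]
    return sum(w1) / len(w1), sum(delta) / len(delta)


def ratio_to_reference(model_value, reference_value):
    """The xN multiplier shown next to each score in the summary table."""
    return model_value / reference_value


# Toy samples for two Structure metrics, NOT MiroBench data.
real = {"length_cv": [1.0, 2.0, 3.0, 4.0], "avg_depth": [1.0, 2.0, 3.0, 4.0]}
sim = {"length_cv": [2.0, 3.0, 4.0, 5.0], "avg_depth": [5.0, 6.0, 7.0, 8.0]}

w1, abs_delta = family_scores(real, sim, ["length_cv", "avg_depth"])
print(w1, abs_delta)  # 2.5 0.71875

# Diversity W1 for DeepSeek-V4-Flash against the real reference (table values).
print(round(ratio_to_reference(0.162, 0.016), 1))  # 10.1
```

Because the average is unweighted, every metric in a family contributes equally regardless of its scale, so a single metric with a large W₁ can dominate its family column.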