All scores are computed on the 16 selected MiroBench metrics across 5 domains, with OASIS as the simulation engine. The summary table mirrors the figure on the main page; the per-domain table below it shows where each model passes and fails.
| # | Method · Model | Diversity | Tone | Structure | Content | Toxicity | p > 0.05 ↑ |
|---|---|---|---|---|---|---|---|
| — | Real reference (held-out test · 200 repeated samples) | W₁ 0.016 · \|δ\| 0.060 | W₁ 0.034 · \|δ\| 0.071 | W₁ 0.165 · \|δ\| 0.100 | W₁ 0.074 · \|δ\| 0.098 | W₁ 0.004 · \|δ\| 0.043 | — baseline — |
| 1 | OASIS DeepSeek-V4-Flash | W₁ 0.162 ×10.1 · \|δ\| 0.669 ×11.2 | W₁ 0.187 ×5.5 · \|δ\| 0.472 ×6.6 | W₁ 1.028 ×6.2 · \|δ\| 0.768 ×7.7 | W₁ 0.192 ×2.6 · \|δ\| 0.318 ×3.2 | W₁ 0.006 ×1.5 · \|δ\| 0.132 ×3.1 | 23 / 80 (28.7%) |
| 2 | OASIS DeepSeek-Chat | W₁ 0.165 ×10.3 · \|δ\| 0.647 ×10.8 | W₁ 0.177 ×5.2 · \|δ\| 0.447 ×6.3 | W₁ 1.044 ×6.3 · \|δ\| 0.778 ×7.8 | W₁ 0.269 ×3.6 · \|δ\| 0.496 ×5.1 | W₁ 0.006 ×1.5 · \|δ\| 0.164 ×3.8 | 12 / 80 (15.0%) |
| 3 | OASIS GPT-5-mini | W₁ 0.189 ×11.8 · \|δ\| 0.680 ×11.3 | W₁ 0.276 ×8.1 · \|δ\| 0.630 ×8.9 | W₁ 1.051 ×6.4 · \|δ\| 0.777 ×7.8 | W₁ 0.492 ×6.6 · \|δ\| 0.760 ×7.8 | W₁ 0.008 ×2.0 · \|δ\| 0.228 ×5.3 | 7 / 80 (8.8%) |
| 4 | OASIS GPT-4o-mini | W₁ 0.261 ×16.3 · \|δ\| 0.830 ×13.8 | W₁ 0.283 ×8.3 · \|δ\| 0.611 ×8.6 | W₁ 1.093 ×6.6 · \|δ\| 0.811 ×8.1 | W₁ 0.194 ×2.6 · \|δ\| 0.509 ×5.2 | W₁ 0.010 ×2.5 · \|δ\| 0.703 ×16.3 | 3 / 80 (3.8%) |
| — | OASIS Gemini 2.5 Pro · pending | — | — | — | — | — | — |
| — | OASIS DeepSeek-V3.2 · pending | — | — | — | — | — | — |
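
The ×N multipliers in the summary table appear to be each model's value divided by the real-reference value in the same column. That reading is inferred from the numbers rather than stated on the page, so the sketch below is only a consistency check on the Diversity W₁ column. The function name `ratio_to_reference` is an illustrative choice, not part of MiroBench.

```python
# Diversity W1 values copied from the summary table.
REAL_W1 = 0.016  # real reference (held-out test)

# model name -> (W1 value, multiplier printed in the table)
DIVERSITY_W1 = {
    "DeepSeek-V4-Flash": (0.162, 10.1),
    "DeepSeek-Chat": (0.165, 10.3),
    "GPT-5-mini": (0.189, 11.8),
    "GPT-4o-mini": (0.261, 16.3),
}


def ratio_to_reference(value: float, reference: float) -> float:
    """How many times larger a model's distance is than the real-vs-real distance."""
    return value / reference


for name, (w1, printed) in DIVERSITY_W1.items():
    ratio = ratio_to_reference(w1, REAL_W1)
    # The table shows one decimal, so allow half a unit in the last place.
    assert abs(ratio - printed) <= 0.05, name
    print(f"{name}: {ratio:.3f} (table: x{printed})")
```

Read this way, a multiplier close to ×1 would mean the simulated data is about as far from the real data as a second real sample is.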
Each entry shows the number of metrics (out of 16) for which the Mann–Whitney U test fails to reject equal distributions (p > 0.05) for that (model, domain) pair. Entries are colour-coded on a five-step scale: low, mid-low, mid, good, excellent.
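
As a rough illustration of how one per-domain count could be produced, the sketch below runs a two-sided Mann–Whitney U test per metric and counts the metrics with p > 0.05. It rests on several assumptions: the p-value uses the normal approximation with tie correction (the benchmark's actual implementation is not described here), and the metric names and sample values are hypothetical.

```python
import math
from typing import Dict, Sequence

ALPHA = 0.05  # threshold taken from the "p > 0.05" column header


def mann_whitney_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i                   # size of this group of tied values
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # every value identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def count_passes(real: Dict[str, Sequence[float]],
                 sim: Dict[str, Sequence[float]]) -> int:
    """Number of metrics where equal distributions cannot be rejected."""
    return sum(mann_whitney_p(real[m], sim[m]) > ALPHA for m in real)


# Hypothetical metric samples for one (model, domain) pair.
real = {"review_length": list(range(20)), "sentiment": list(range(20))}
sim = {"review_length": list(range(20)), "sentiment": list(range(100, 120))}
print(count_passes(real, sim), "of", len(real), "metrics pass")
```

With 16 metrics per domain, the same count would fill one entry of the per-domain table. An exact test may be preferable to the normal approximation for small samples.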
| Method · Model | Credit cards | Cameras | Cell phones | Headphones | Laptops | Total ↑ |
|---|---|---|---|---|---|---|
| OASIS DeepSeek-V4-Flash | 5/16 | 6/16 | 2/16 | 4/16 | 6/16 | 23/80 (28.7%) |
| OASIS DeepSeek-Chat | 6/16 | 0/16 | 0/16 | 3/16 | 3/16 | 12/80 (15.0%) |
| OASIS GPT-5-mini | 3/16 | 0/16 | 1/16 | 0/16 | 3/16 | 7/80 (8.8%) |
| OASIS GPT-4o-mini | 1/16 | 1/16 | 0/16 | 0/16 | 1/16 | 3/80 (3.8%) |
| OASIS Gemini 2.5 Pro · pending | — | — | — | — | — | — |
| OASIS DeepSeek-V3.2 · pending | — | — | — | — | — | — |
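
The Total column is the sum of the five per-domain counts over 16 × 5 = 80 metric-domain results. A small sketch that recomputes it from the counts in the table; the names `PASSES` and `total_and_rate` are illustrative only.

```python
from typing import List, Tuple

METRICS_PER_DOMAIN = 16

# Pass counts in table order:
# Credit cards, Cameras, Cell phones, Headphones, Laptops.
PASSES = {
    "DeepSeek-V4-Flash": [5, 6, 2, 4, 6],
    "DeepSeek-Chat": [6, 0, 0, 3, 3],
    "GPT-5-mini": [3, 0, 1, 0, 3],
    "GPT-4o-mini": [1, 1, 0, 0, 1],
}


def total_and_rate(counts: List[int]) -> Tuple[int, int, float]:
    """Return (passes, results scored, pass rate) for one model."""
    total = sum(counts)
    scored = METRICS_PER_DOMAIN * len(counts)
    return total, scored, total / scored


for model, counts in PASSES.items():
    total, scored, rate = total_and_rate(counts)
    print(f"{model}: {total}/{scored} = {rate:.2%}")
```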
No model passes more than 6/16 metrics in any single domain. The hardest domain across the board is Cell phones (0–2 / 16); the easiest is Credit cards (1–6 / 16). For reference, sampling more real data from the same domain produces ≈95% pass rates (≈15/16 cells per domain), i.e., the benchmark's own noise floor: even real data does not pass every test.
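
These per-domain claims can be checked directly against the table. The sketch below assumes "hardest" and "easiest" are ranked by total passes across the four scored models, which the page does not state. It also computes the expected pass count if each of the 16 tests passes with probability 0.95 under the null. The variable names are illustrative.

```python
# Pass counts per domain, one value per scored model in table order:
# DeepSeek-V4-Flash, DeepSeek-Chat, GPT-5-mini, GPT-4o-mini.
BY_DOMAIN = {
    "Credit cards": [5, 6, 3, 1],
    "Cameras": [6, 0, 0, 1],
    "Cell phones": [2, 0, 1, 0],
    "Headphones": [4, 3, 0, 0],
    "Laptops": [6, 3, 3, 1],
}

# (lowest, highest) pass count seen in each domain.
ranges = {d: (min(c), max(c)) for d, c in BY_DOMAIN.items()}

# Assumption: rank domains by total passes summed over models.
hardest = min(BY_DOMAIN, key=lambda d: sum(BY_DOMAIN[d]))
easiest = max(BY_DOMAIN, key=lambda d: sum(BY_DOMAIN[d]))

# Expected passes out of 16 when real data is compared with real data
# at a 0.05 significance level.
expected_real_passes = (1 - 0.05) * 16

print(ranges)
print("hardest:", hardest, "| easiest:", easiest)
print("expected real-vs-real passes:", round(expected_real_passes, 1))
```

Laptops spans the same 1–6 range as Credit cards, but Credit cards has more passes in total (15 versus 13).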
Each metric is scored on every (model, domain) pair, giving 16 × 5 = 80 results per model. A "cell" in the tables above is a single (metric, domain) result. Family columns in the summary table are unweighted averages across the family's metrics.
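
A minimal sketch of the unweighted family average: every metric in a family contributes equally, regardless of its scale or sample size. The metric names and values below are hypothetical placeholders, since this section does not list which metrics belong to each family.

```python
from statistics import fmean
from typing import Mapping


def family_average(metric_scores: Mapping[str, float]) -> float:
    """Unweighted mean of the per-metric scores in one family."""
    return fmean(metric_scores.values())


# Hypothetical per-metric W1 values for one family of one model.
tone_w1 = {"metric_a": 0.10, "metric_b": 0.20, "metric_c": 0.30}
print(f"family W1: {family_average(tone_w1):.3f}")
```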