RealPDEBench:
Bridging the Sim-to-Real Gap

The first scientific ML benchmark with paired real-world and simulated data for complex physical systems

Peiyan Hu*¹˒³ Haodong Feng*¹ Hongyuan Liu*¹ Tongtong Yan² Wenhao Deng¹ Tianrun Gao¹˒⁴ Rong Zheng¹˒⁵ Haoren Zheng¹˒² Chenglei Yu¹ Chuanrui Wang¹ Kaiwen Li¹˒² Zhi-Ming Ma³ Dezhi Zhou² Xingcai Lu⁶ Dixia Fan¹ Tailin Wu†¹

* co-first authors (equal contribution). † corresponding author.

{hupeiyan, fenghaodong, liuhongyuan, wutailin}@westlake.edu.cn

¹ School of Engineering, Westlake University
² Global College, Shanghai Jiao Tong University
³ Academy of Mathematics and Systems Science, Chinese Academy of Sciences

⁴ Department of Geotechnical Engineering, Tongji University
⁵ School of Physics, Peking University
⁶ Key Laboratory for Power Machinery and Engineering of M. O. E., Shanghai Jiao Tong University

5 Datasets
700+ Trajectories
10 Baseline Models
9 Evaluation Metrics

Real-World Experiments

All real-world experimental data

CFD Simulations

All CFD simulation data

The Challenge

Why Real-World Data Matters

Most scientific ML models are validated only on simulated data, creating a critical gap between theory and practice.

Numerical Errors

Discretization and modeling assumptions in CFD simulations

Measurement Noise

Camera sensors and particle tracking introduce real-world noise

Unmeasured Modalities

Pressure fields and 3D velocities cannot be fully measured

Benchmark

Baselines & Evaluation


10 Baseline Models

Foundation Models
Traditional & CNN
Neural Operators
Transformers

9 Evaluation Metrics

Data-oriented
Physics-oriented

Results Explorer

Explore Results

Baseline ranking on real-world test data, stratified by dataset and training paradigm.

SINGLE-METRIC COMPARISON
Bar Chart
Bars are sorted best → worst (longest → shortest) for the selected metric. Bar length is min–max normalized across all models in the current dataset + training paradigm (best = 100% / full bar; worst = 0%). For error metrics (↓), smaller raw values correspond to longer bars; for R² (↑), larger values correspond to longer bars.
Training paradigm: Simulated training, i.e., models trained on simulated (numerical/CFD) data.
Metric: RMSE (Root Mean Square Error, ↓ lower is better), the pointwise error between predicted and ground-truth fields.
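As a rough illustration of the min–max normalization used for bar lengths (our own sketch, not the site's plotting code; the function name and example values are hypothetical):

Python
import numpy as np

def normalize_bar_lengths(values, higher_is_better=False):
    """Min-max normalize raw metric values to bar lengths in [0, 100].

    values: one metric across all models in the current dataset + training paradigm.
    higher_is_better: False for error metrics (RMSE, Rel L2, ...), True for R^2.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    if np.isclose(hi, lo):                     # all models tie: full bars for everyone
        return np.full_like(values, 100.0)
    scaled = (values - lo) / (hi - lo)         # 0 at the minimum, 1 at the maximum
    if not higher_is_better:                   # error metrics: smaller value, longer bar
        scaled = 1.0 - scaled
    return 100.0 * scaled

# RMSE (error metric, lower is better) for four hypothetical models
print(normalize_bar_lengths([0.12, 0.35, 0.08, 0.20]))   # roughly [85.2, 0.0, 100.0, 55.6]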
Multi-metric comparison
Radar chart across performance dimensions
Scores are min–max normalized to 0–100 within the current dataset + training paradigm. Use Zoom to normalize within the currently selected models for clearer separation. Higher is better. Axes are computed from the reported benchmark metrics (no extra measurements).
Notes: Reported metrics are evaluated on real-world test data. DMD has no training stage; where a value is unavailable, it is omitted.

Key Takeaways

Key Findings

Real data and simulation fail in different ways.
Real-world measurements are dominated by sensor and measurement noise, while simulated data are dominated by numerical and modeling error (e.g., discretization, LES closures, idealized conditions). That mismatch changes the error distribution—and is a key reason sim-to-real transfer is hard.
Simulation is cheap and information-rich, but imperfect.
Simulated data are cheaper to generate at scale, can expose additional modalities (e.g., pressure), and avoid measurement-induced noise. This makes simulation valuable for coverage and pretraining, even though it cannot perfectly match reality.
Simulation-only training doesn't transfer cleanly to real tests.
Across datasets, models trained on simulated trajectories show a clear performance gap when evaluated on real-world measurements. Even when physical parameters are matched, learning only from simulation tends to miss real-world effects.
Training on real data closes much of the gap.
On real-world benchmarks, training directly on real measurements yields substantially lower errors than training on simulated data only. In our main results, real-world training improves Rel \(L_2\) by 9.39% to 78.91% (depending on dataset and model).
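For concreteness, a minimal sketch of how RMSE and Rel \(L_2\) can be computed, and how an improvement percentage between the two training paradigms would follow. The field shapes, noise levels, and the improvement convention (relative reduction in Rel \(L_2\)) are illustrative assumptions, not the benchmark's evaluation code.

Python
import numpy as np

def rmse(pred, true):
    """Root mean square error between predicted and ground-truth fields."""
    return np.sqrt(np.mean((pred - true) ** 2))

def rel_l2(pred, true):
    """Relative L2 error: ||pred - true||_2 / ||true||_2."""
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

# Toy comparison of the two training paradigms on the same ground-truth field
rng = np.random.default_rng(0)
true = rng.standard_normal((64, 64))
pred_sim_trained  = true + 0.30 * rng.standard_normal((64, 64))   # noisier prediction
pred_real_trained = true + 0.10 * rng.standard_normal((64, 64))   # closer prediction

err_sim  = rel_l2(pred_sim_trained, true)
err_real = rel_l2(pred_real_trained, true)
improvement = 100.0 * (err_sim - err_real) / err_sim               # relative reduction in Rel L2, in %
print(f"RMSE (real-trained): {rmse(pred_real_trained, true):.3f}")
print(f"Rel L2: sim-trained {err_sim:.3f} vs. real-trained {err_real:.3f} "
      f"(improvement {improvement:.1f}%)")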
Pretrain on simulation, finetune on real: best of both.
Simulated pretraining followed by real-world finetuning often outperforms training on real-world data from scratch with the same real-data budget. Pretraining helps models pick up broad dynamics from large simulated corpora, then adapt to real measurement artifacts during finetuning.
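A minimal, self-contained sketch of this two-stage recipe on toy tensors. The model, data sizes, and hyperparameters are placeholders; the benchmark's actual baselines, loaders, and schedules differ.

Python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, loader, epochs, lr):
    """Generic supervised loop: MSE between predicted and target fields."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Toy stand-ins: a large simulated set and a small real set (sizes are illustrative)
sim  = TensorDataset(torch.randn(512, 16), torch.randn(512, 16))
real = TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

model = train(model, DataLoader(sim, batch_size=32, shuffle=True), epochs=3, lr=1e-3)   # simulated pretraining
model = train(model, DataLoader(real, batch_size=16, shuffle=True), epochs=3, lr=1e-4)  # real-world finetuning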
Pretraining saves updates.
Finetuned models reach the same (or better) performance with fewer real-data update steps—reflected by Update Ratios below one for most settings. On Combustion, the validation RMSE curve drops faster under finetuning than training from scratch.
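The Update Ratio is not defined on this page; one plausible reading, used only for the hypothetical sketch below, is the number of real-data updates the finetuned model needs to match the from-scratch model's best validation error, divided by the updates the from-scratch model used.

Python
import numpy as np

def update_ratio(val_curve_finetune, val_curve_scratch):
    """Hypothetical Update Ratio: real-data updates needed by the finetuned model
    to match the from-scratch model's best validation error, divided by the updates
    the from-scratch model used. Inputs are validation errors logged once per update."""
    scratch = np.asarray(val_curve_scratch, dtype=float)
    finetune = np.asarray(val_curve_finetune, dtype=float)
    target = scratch.min()
    steps_scratch = int(scratch.argmin()) + 1
    reached = np.nonzero(finetune <= target)[0]
    if reached.size == 0:                      # never matched within the real-data budget
        return float("inf")
    return (int(reached[0]) + 1) / steps_scratch

# Toy curves: finetuning reaches the from-scratch optimum in half the updates
print(update_ratio([0.9, 0.4, 0.2, 0.15], [0.9, 0.7, 0.5, 0.35, 0.25, 0.2]))   # 0.5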
Architectures trade off pointwise accuracy vs. global structure.
Convolution-based models (e.g., U-Net, CNO) tend to do well on pointwise errors like RMSE. Models with operator / wavelet structure (e.g., MWT) can better preserve periodicity and other global features—so "best model" depends on the metric you care about.
Long-horizon rollouts separate short-term wins from stable dynamics.
Autoregressive evaluation makes error accumulation obvious: a model that looks strong at one-step prediction can drift quickly over a multi-step rollout. In our Cylinder long-horizon analysis, the large pretrained DPOT model is among the most stable under multi-round evaluation.
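A minimal sketch of the autoregressive rollout protocol behind this observation. The model interface and the toy dynamics are illustrative assumptions, not the benchmark's evaluation code.

Python
import numpy as np

def rollout_errors(model_step, initial_state, true_trajectory):
    """Feed each prediction back as the next input and log per-step Rel L2.

    model_step: callable mapping the current state to the next state.
    true_trajectory: iterable of ground-truth states, one per rollout step.
    """
    state = initial_state
    errors = []
    for true_state in true_trajectory:
        state = model_step(state)                                   # prediction feeds back in
        err = np.linalg.norm(state - true_state) / np.linalg.norm(true_state)
        errors.append(err)
    return errors

# Toy "model": a slightly damped identity map drifts away from the truth over time
truth = [np.ones((32, 32)) for _ in range(20)]
errs = rollout_errors(lambda s: 0.98 * s, np.ones((32, 32)), truth)
print([round(e, 3) for e in errs[:5]], "...", round(errs[-1], 3))   # error grows with horizon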

Resources

Reproducibility

Access datasets, baselines, and evaluation scripts to reproduce results and benchmark new models on paired real-world experiments and CFD simulations.

Citation

If you find RealPDEBench useful in your research, please cite:

BibTeX
@misc{hu2026realpdebenchbenchmarkcomplexphysical,
      title={RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data}, 
      author={Peiyan Hu and Haodong Feng and Hongyuan Liu and Tongtong Yan and Wenhao Deng and Tianrun Gao and Rong Zheng and Haoren Zheng and Chenglei Yu and Chuanrui Wang and Kaiwen Li and Zhi-Ming Ma and Dezhi Zhou and Xingcai Lu and Dixia Fan and Tailin Wu},
      year={2026},
      eprint={2601.01829},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.01829}, 
}