RealPDEBench:
Bridging the Sim-to-Real Gap

The first scientific ML benchmark with paired real-world and simulated data for complex physical systems

Peiyan Hu*¹˒³ Haodong Feng*¹ Hongyuan Liu*¹ Tongtong Yan² Wenhao Deng¹ Tianrun Gao¹˒⁴ Rong Zheng¹˒⁵ Haoren Zheng¹˒² Chenglei Yu¹ Chuanrui Wang¹ Kaiwen Li¹˒² Zhi-Ming Ma³ Dezhi Zhou² Xingcai Lu⁶ Dixia Fan¹ Tailin Wu†¹

* co-first authors (equal contribution). † corresponding author.

{hupeiyan, fenghaodong, liuhongyuan, wutailin}@westlake.edu.cn

¹ School of Engineering, Westlake University
² Global College, Shanghai Jiao Tong University
³ Academy of Mathematics and Systems Science, Chinese Academy of Sciences

⁴ Department of Geotechnical Engineering, Tongji University
⁵ School of Physics, Peking University
⁶ Key Laboratory for Power Machinery and Engineering of M. O. E., Shanghai Jiao Tong University

5 Datasets
700+ Trajectories
10 Baseline Models
9 Evaluation Metrics

Real-World Experiments

All real-world experimental data

CFD Simulations

All CFD simulation data

The Challenge

Why Real-World Data Matters

Most scientific ML models are validated only on simulated data, creating a critical gap between theory and practice.

Numerical Errors

Discretization and modeling assumptions in CFD simulations

Measurement Noise

Camera sensors and particle tracking introduce real-world noise

Unmeasured Modalities

Pressure fields and 3D velocities cannot be fully measured

Benchmark

Baselines & Evaluation


10 Baseline Models

Foundation Models
Traditional & CNN
Neural Operators
Transformers

9 Evaluation Metrics

Data-oriented
Physics-oriented

Results Explorer

Explore Results

Baseline ranking on real-world test data, stratified by dataset and training paradigm.

SINGLE-METRIC COMPARISON
Bar Chart
Bars are sorted best → worst (longest → shortest) for the selected metric. Bar length is min–max normalized across all models in the current dataset + training paradigm (best = 100% / full bar; worst = 0%). For error metrics (↓), smaller raw values correspond to longer bars; for R² (↑), larger values correspond to longer bars.
Training paradigm: Simulated training, i.e., models trained on simulated (numerical/CFD) data.
Metric: RMSE (Root Mean Square Error, ↓ lower is better), the pointwise error between predicted and ground-truth fields.
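As a rough illustration of the min–max normalization used for bar lengths (our own sketch, not the site's plotting code; the function name and example values are hypothetical):

Python
import numpy as np

def normalize_bar_lengths(values, higher_is_better=False):
    """Min-max normalize raw metric values to bar lengths in [0, 100].

    values: one metric across all models in the current dataset + training paradigm.
    higher_is_better: False for error metrics (RMSE, Rel L2, ...), True for R^2.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    if np.isclose(hi, lo):                     # all models tie: full bars for everyone
        return np.full_like(values, 100.0)
    scaled = (values - lo) / (hi - lo)         # 0 at the minimum, 1 at the maximum
    if not higher_is_better:                   # error metrics: smaller value, longer bar
        scaled = 1.0 - scaled
    return 100.0 * scaled

# RMSE (error metric, lower is better) for four hypothetical models
print(normalize_bar_lengths([0.12, 0.35, 0.08, 0.20]))   # roughly [85.2, 0.0, 100.0, 55.6]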
Multi-metric comparison
Radar chart across performance dimensions
Scores are min–max normalized to 0–100 within the current dataset + training paradigm. Use Zoom to normalize within the currently selected models for clearer separation. Higher is better. Axes are computed from the reported benchmark metrics (no extra measurements).
Notes: Reported metrics are evaluated on real-world test data. DMD has no training stage; where a value is unavailable, it is omitted.

Key Takeaways

Key Findings

Real data and simulation fail in different ways.
Real-world measurements are dominated by sensor and measurement noise, while simulated data are dominated by numerical and modeling error (e.g., discretization, LES closures, idealized conditions). That mismatch changes the error distribution—and is a key reason sim-to-real transfer is hard.
Simulation is cheap and information-rich, but imperfect.
Simulated data are cheaper to generate at scale, can expose additional modalities (e.g., pressure), and avoid measurement-induced noise. This makes simulation valuable for coverage and pretraining, even though it cannot perfectly match reality.
Simulation-only training doesn't transfer cleanly to real tests.
Across datasets, models trained on simulated trajectories show a clear performance gap when evaluated on real-world measurements. Even when physical parameters are matched, learning only from simulation tends to miss real-world effects.
Training on real data closes much of the gap.
On real-world benchmarks, training directly on real measurements yields substantially lower errors than training on simulated data only. In our main results, real-world training improves Rel \(L_2\) by 9.39% to 78.91% (depending on dataset and model).
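For concreteness, a minimal sketch of how RMSE and Rel \(L_2\) can be computed, and how an improvement percentage between the two training paradigms would follow. The field shapes, noise levels, and the improvement convention (relative reduction in Rel \(L_2\)) are illustrative assumptions, not the benchmark's evaluation code.

Python
import numpy as np

def rmse(pred, true):
    """Root mean square error between predicted and ground-truth fields."""
    return np.sqrt(np.mean((pred - true) ** 2))

def rel_l2(pred, true):
    """Relative L2 error: ||pred - true||_2 / ||true||_2."""
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

# Toy comparison of the two training paradigms on the same ground-truth field
rng = np.random.default_rng(0)
true = rng.standard_normal((64, 64))
pred_sim_trained  = true + 0.30 * rng.standard_normal((64, 64))   # noisier prediction
pred_real_trained = true + 0.10 * rng.standard_normal((64, 64))   # closer prediction

err_sim  = rel_l2(pred_sim_trained, true)
err_real = rel_l2(pred_real_trained, true)
improvement = 100.0 * (err_sim - err_real) / err_sim               # relative reduction in Rel L2, in %
print(f"RMSE (real-trained): {rmse(pred_real_trained, true):.3f}")
print(f"Rel L2: sim-trained {err_sim:.3f} vs. real-trained {err_real:.3f} "
      f"(improvement {improvement:.1f}%)")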
Pretrain on simulation, finetune on real: best of both.
Simulated pretraining followed by real-world finetuning often outperforms training on real-world data from scratch with the same real-data budget. Pretraining helps models pick up broad dynamics from large simulated corpora, then adapt to real measurement artifacts during finetuning.
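A minimal, self-contained sketch of this two-stage recipe on toy tensors. The model, data sizes, and hyperparameters are placeholders; the benchmark's actual baselines, loaders, and schedules differ.

Python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, loader, epochs, lr):
    """Generic supervised loop: MSE between predicted and target fields."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Toy stand-ins: a large simulated set and a small real set (sizes are illustrative)
sim  = TensorDataset(torch.randn(512, 16), torch.randn(512, 16))
real = TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

model = train(model, DataLoader(sim, batch_size=32, shuffle=True), epochs=3, lr=1e-3)   # simulated pretraining
model = train(model, DataLoader(real, batch_size=16, shuffle=True), epochs=3, lr=1e-4)  # real-world finetuning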
Pretraining saves updates.
Finetuned models reach the same (or better) performance with fewer real-data update steps—reflected by Update Ratios below one for most settings. On Combustion, the validation RMSE curve drops faster under finetuning than training from scratch.
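The Update Ratio is not defined on this page; one plausible reading, used only for the hypothetical sketch below, is the number of real-data updates the finetuned model needs to match the from-scratch model's best validation error, divided by the updates the from-scratch model used.

Python
import numpy as np

def update_ratio(val_curve_finetune, val_curve_scratch):
    """Hypothetical Update Ratio: real-data updates needed by the finetuned model
    to match the from-scratch model's best validation error, divided by the updates
    the from-scratch model used. Inputs are validation errors logged once per update."""
    scratch = np.asarray(val_curve_scratch, dtype=float)
    finetune = np.asarray(val_curve_finetune, dtype=float)
    target = scratch.min()
    steps_scratch = int(scratch.argmin()) + 1
    reached = np.nonzero(finetune <= target)[0]
    if reached.size == 0:                      # never matched within the real-data budget
        return float("inf")
    return (int(reached[0]) + 1) / steps_scratch

# Toy curves: finetuning reaches the from-scratch optimum in half the updates
print(update_ratio([0.9, 0.4, 0.2, 0.15], [0.9, 0.7, 0.5, 0.35, 0.25, 0.2]))   # 0.5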
Architectures trade off pointwise accuracy vs. global structure.
Convolution-based models (e.g., U-Net, CNO) tend to do well on pointwise errors like RMSE. Models with operator / wavelet structure (e.g., MWT) can better preserve periodicity and other global features—so "best model" depends on the metric you care about.
Long-horizon rollouts separate short-term wins from stable dynamics.
Autoregressive evaluation makes error accumulation obvious: a model that looks strong at one-step prediction can drift quickly over a multi-step rollout. In our Cylinder long-horizon analysis, the large pretrained DPOT model is among the most stable under multi-round evaluation.
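A minimal sketch of the autoregressive rollout protocol behind this observation. The model interface and the toy dynamics are illustrative assumptions, not the benchmark's evaluation code.

Python
import numpy as np

def rollout_errors(model_step, initial_state, true_trajectory):
    """Feed each prediction back as the next input and log per-step Rel L2.

    model_step: callable mapping the current state to the next state.
    true_trajectory: iterable of ground-truth states, one per rollout step.
    """
    state = initial_state
    errors = []
    for true_state in true_trajectory:
        state = model_step(state)                                   # prediction feeds back in
        err = np.linalg.norm(state - true_state) / np.linalg.norm(true_state)
        errors.append(err)
    return errors

# Toy "model": a slightly damped identity map drifts away from the truth over time
truth = [np.ones((32, 32)) for _ in range(20)]
errs = rollout_errors(lambda s: 0.98 * s, np.ones((32, 32)), truth)
print([round(e, 3) for e in errs[:5]], "...", round(errs[-1], 3))   # error grows with horizon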

Resources

Reproducibility

Access datasets, baselines, and evaluation scripts to reproduce results and benchmark new models on paired real-world experiments and CFD simulations.

Citation

If you find RealPDEBench useful in your research, please cite:

BibTeX
@misc{hu2026realpdebenchbenchmarkcomplexphysical,
      title={RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data}, 
      author={Peiyan Hu and Haodong Feng and Hongyuan Liu and Tongtong Yan and Wenhao Deng and Tianrun Gao and Rong Zheng and Haoren Zheng and Chenglei Yu and Chuanrui Wang and Kaiwen Li and Zhi-Ming Ma and Dezhi Zhou and Xingcai Lu and Dixia Fan and Tailin Wu},
      year={2026},
      eprint={2601.01829},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.01829}, 
}