Datasets
RealPDEBench contains five scenarios with paired real-world measurements and matched numerical simulations. Each scenario provides two branches:
- Real-world (`real`): experimentally measured fields (often incomplete)
- Simulated (`numerical`): CFD/LES fields (often include unmeasured modalities, e.g. pressure)
What "paired" means in RealPDEBench
A scenario provides two branches:
- `real/`: experimentally measured fields (may be incomplete)
- `numerical/`: simulation fields (can include additional, unmeasured modalities)
"Paired" means that the real and numerical trajectories correspond to the same configuration (e.g., Reynolds number, control frequency, mixture ratio), enabling sim→real transfer and modality-mismatch evaluation. Note that paired trajectories are matched by configuration, but their initial frames are not necessarily aligned.
Dataset inventory
The table below summarizes dataset sizes, temporal resolution, spatial resolution, and observed modalities.
| Dataset | `n_traj` | `n_frame` | \(\Delta t\) (s) | Resolution (sim) | Resolution (real) | Memory (GB) | Modalities (sim) | Modalities (real) |
|---|---|---|---|---|---|---|---|---|
| Cylinder | 92 × 2 | 3990 | \(2.5\times 10^{-3}\) | 64×128 | 128×256 | 190.50 | \(u,v,p\) | \(u,v\) |
| Controlled Cylinder | 96 × 2 | 3990 | \(2.5\times 10^{-3}\) | 64×128 | 128×256 | 187.08 | \(u,v,p\) | \(u,v\) |
| FSI | 51 × 2 | 2173 | \(2.0\times 10^{-3}\) | 128×128 | 128×128 | 94.73 | \(u,v,p\) | \(u,v\) |
| Foil | 99 × 2 | 3990 | \(2.5\times 10^{-3}\) | 128×256 | 128×256 | 335.64 | \(u,v,p\) | \(u,v\) |
| Combustion | 30 × 2 | 2001 | \(2.5\times 10^{-4}\) | 128×128 | 128×128 | 110.12 | multi-modal (15 channels) | \(I\) |
Note
We use n_traj = X × 2 to indicate paired trajectories: X real-world and X numerical trajectories for the same scenario.
Windowing: sim_id + time_id
RealPDEBench evaluates forecasting on short spatiotemporal windows sampled from long trajectories:
- Trajectory ID (`sim_id`): trajectory identifier string (e.g., `1800`, `1781_0.5`, `40NH3_1.1`)
- Window start (`time_id`): integer time index
- One sample: a contiguous window `data[time_id : time_id + T]`, where \(T\) is the window length (`in_step` + `out_step`)
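The indexing above can be sketched in a few lines. This is a minimal illustration, not the benchmark's loader: the helper names `split_window` and `num_windows` are hypothetical, the input/target split at `in_step` is inferred from the parameter names, and a stride of one between window starts is an assumption.

```python
def split_window(data, time_id, in_step, out_step):
    """Slice one sample from a trajectory indexed along time.

    `data` is any sequence sliceable along its first (time) axis,
    e.g. a list or a NumPy array. Hypothetical helper, not the
    benchmark's own API.
    """
    T = in_step + out_step                 # window length
    if time_id < 0 or time_id + T > len(data):
        raise IndexError("window exceeds trajectory length")
    window = data[time_id : time_id + T]   # contiguous window
    # Assumed split: the first in_step frames are inputs, the rest targets.
    return window[:in_step], window[in_step:]


def num_windows(n_frame, in_step, out_step):
    """Count valid window starts, assuming a stride of 1."""
    return max(n_frame - (in_step + out_step) + 1, 0)


frames = list(range(10))                   # stand-in for a 10-frame trajectory
inputs, targets = split_window(frames, time_id=3, in_step=2, out_step=3)
print(inputs, targets)                     # [3, 4] [5, 6, 7]
print(num_windows(len(frames), 2, 3))      # 6
```

Under the same stride-one assumption, a trajectory with `n_frame` frames yields `n_frame - T + 1` candidate window starts.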
Public distribution format (Hugging Face snapshot)
We distribute data as Hugging Face Datasets (Arrow) shards. On disk, a downloaded snapshot is organized as:
{dataset_root}/
  {scenario}/
    hf_dataset/
      real_train/ ...
      real_val/ ...
      real_test/ ...
      numerical_train/ ...
      numerical_val/ ...
      numerical_test/ ...
    in_dist_test_params_real.json
    out_dist_test_params_real.json
    remain_params_real.json
    in_dist_test_params_numerical.json
    out_dist_test_params_numerical.json
    remain_params_numerical.json
The `*_test_params_*.json` files drive `test_mode` filtering (`in_dist`, `out_dist`, `seen`, `unseen`) during validation and testing.
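As a rough sketch of how this layout could be navigated, the snippet below builds the path to one split directory and loads it. `split_dir` and `load_split` are hypothetical helpers, the root path and the scenario folder name `cylinder` are placeholders, and using `datasets.load_from_disk` assumes each split directory was written with `Dataset.save_to_disk`, which this page does not state explicitly.

```python
from pathlib import Path

BRANCHES = ("real", "numerical")
SPLITS = ("train", "val", "test")


def split_dir(dataset_root, scenario, branch, split):
    """Return the directory of one split under {scenario}/hf_dataset/."""
    if branch not in BRANCHES:
        raise ValueError(f"unknown branch: {branch!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return Path(dataset_root) / scenario / "hf_dataset" / f"{branch}_{split}"


def load_split(dataset_root, scenario, branch, split):
    """Load one split; assumes it was saved with Dataset.save_to_disk."""
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import load_from_disk
    return load_from_disk(str(split_dir(dataset_root, scenario, branch, split)))


# Placeholder root and scenario names, for illustration only.
print(split_dir("/data/realpdebench", "cylinder", "real", "train").as_posix())
# /data/realpdebench/cylinder/hf_dataset/real_train
```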
Evaluation subsets (JSON mappings)
The `*_test_params_*.json` files define evaluation subsets used by `test_mode` filters:
- `in_dist`: in-distribution parameter settings
- `out_dist`: out-of-distribution parameter settings
- `seen`: settings used for training (held-out time windows)
- `unseen`: settings not used for training
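Conceptually, a `test_mode` filter keeps only the windows whose trajectory belongs to the selected subset. The sketch below illustrates that idea on toy records. It assumes a subset can be reduced to a collection of `sim_id` strings; the internal layout of the JSON files is not described here, so reading the IDs from them is deliberately left out, and `filter_by_sim_ids` is a hypothetical helper.

```python
def filter_by_sim_ids(records, allowed_ids):
    """Keep only the windows whose trajectory is in `allowed_ids`."""
    allowed = set(allowed_ids)             # set gives O(1) membership checks
    return [r for r in records if r["sim_id"] in allowed]


# Toy records; real rows would come from a loaded split.
records = [
    {"sim_id": "1800", "time_id": 0},
    {"sim_id": "1800", "time_id": 1},
    {"sim_id": "1781_0.5", "time_id": 0},
]
subset = filter_by_sim_ids(records, allowed_ids=["1781_0.5"])
print(subset)   # [{'sim_id': '1781_0.5', 'time_id': 0}]
```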
HF Arrow schema (high level)
- Fluid scenarios (Cylinder / Controlled Cylinder / FSI / Foil)
  - `sim_id` (string), `time_id` (int)
  - `u` (bytes), `v` (bytes), `p` (bytes; numerical only)
  - `shape_t`, `shape_h`, `shape_w` (int)
- Combustion
  - `sim_id` (string), `time_id` (int)
  - `observed` (bytes): real-world intensity \(I\) (real) or surrogate intensity (numerical)
  - `numerical` (bytes; numerical only), `numerical_channels` (int; numerical only)
  - `shape_t`, `shape_h`, `shape_w` (int)
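Because the field columns are stored as raw bytes alongside their shape, each one has to be decoded back into an array before use. The following is a minimal sketch of that step for a fluid-scenario row. The element type (`float32`) and the C-ordered `(shape_t, shape_h, shape_w)` layout are assumptions not confirmed by the schema above, and `decode_field` is a hypothetical helper; check the actual dtype before relying on it.

```python
import numpy as np


def decode_field(row, key, dtype=np.float32):
    """Rebuild a (shape_t, shape_h, shape_w) array from a bytes column.

    Assumes a raw C-ordered buffer of `dtype`; float32 is a guess.
    """
    shape = (row["shape_t"], row["shape_h"], row["shape_w"])
    flat = np.frombuffer(row[key], dtype=dtype)
    if flat.size != shape[0] * shape[1] * shape[2]:
        raise ValueError(f"{key}: buffer size does not match shape {shape}")
    return flat.reshape(shape)


# Synthetic row standing in for one record of a fluid scenario.
u = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
row = {"sim_id": "1800", "time_id": 0, "u": u.tobytes(),
       "shape_t": 2, "shape_h": 3, "shape_w": 4}
decoded = decode_field(row, "u")
print(decoded.shape)   # (2, 3, 4)
```

Note that `np.frombuffer` returns a read-only view over the bytes; call `.copy()` on the result if the array needs to be modified in place.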