Datasets
RealPDEBench contains 5 scenarios with paired real-world measurements and matched numerical simulations. Each scenario provides two branches:
- Real-world (real): experimentally measured fields (often incomplete)
- Simulated (numerical): CFD/LES fields (often including unmeasured modalities, e.g. pressure)
What "paired" means in RealPDEBench
A scenario provides two branches:
- real/: experimentally measured fields (may be incomplete)
- numerical/: simulation fields (can include additional, unmeasured modalities)
"Paired" means that the real and numerical trajectories correspond to the same configuration (e.g., Reynolds number, control frequency, mixture ratio), enabling sim→real transfer and modality-mismatch evaluation. Note that paired trajectories are matched by configuration, but their initial frames are not necessarily aligned.
Dataset inventory
The table below summarizes dataset sizes, temporal resolution, spatial resolution, and observed modalities.
| Dataset | n_traj | n_frame | \(\Delta t\) (s) | Resolution (sim) | Resolution (real) | Memory (GB) | Modalities (sim) | Modalities (real) |
|---|---|---|---|---|---|---|---|---|
| Cylinder | 92 × 2 | 3990 | \(2.5\times 10^{-3}\) | 64×128 | 128×256 | 190.50 | \(u,v,p\) | \(u,v\) |
| Controlled Cylinder | 96 × 2 | 3990 | \(2.5\times 10^{-3}\) | 64×128 | 128×256 | 187.08 | \(u,v,p\) | \(u,v\) |
| FSI | 51 × 2 | 2173 | \(2.0\times 10^{-3}\) | 128×128 | 128×128 | 94.73 | \(u,v,p\) | \(u,v\) |
| Foil | 99 × 2 | 3990 | \(2.5\times 10^{-3}\) | 128×256 | 128×256 | 335.64 | \(u,v,p\) | \(u,v\) |
| Combustion | 30 × 2 | 2001 | \(2.5\times 10^{-4}\) | 128×128 | 128×128 | 110.12 | multi-modal (15 channels) | \(I\) |
Note
We use n_traj = X × 2 to indicate paired trajectories: X real-world and X numerical trajectories for the same scenario.
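As a quick sanity check on the temporal columns, the sketch below converts n_frame and \(\Delta t\) into the physical time covered by one trajectory. It assumes frames are uniformly spaced by \(\Delta t\) and that every trajectory in a scenario has exactly n_frame frames; the helper name `duration_seconds` is illustrative and not part of RealPDEBench.

```python
# (n_frame, dt in seconds), copied from the inventory table above.
SCENARIOS = {
    "Cylinder": (3990, 2.5e-3),
    "Controlled Cylinder": (3990, 2.5e-3),
    "FSI": (2173, 2.0e-3),
    "Foil": (3990, 2.5e-3),
    "Combustion": (2001, 2.5e-4),
}


def duration_seconds(n_frame: int, dt: float) -> float:
    """Time between the first and last frame, assuming uniform spacing dt."""
    return (n_frame - 1) * dt


durations = {name: duration_seconds(n, dt) for name, (n, dt) in SCENARIOS.items()}

for name, seconds in durations.items():
    print(f"{name}: {seconds:.4f} s per trajectory")
```

Under these assumptions, the fluid scenarios cover several seconds per trajectory, while a Combustion trajectory covers about half a second at a ten times finer time step.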
Windowing: sim_id + time_id
RealPDEBench evaluates forecasting on short spatiotemporal windows sampled from long trajectories:
- Trajectory ID (sim_id): trajectory identifier string (e.g., 1800, 1781_0.5, 40NH3_1.1)
- Window start (time_id): integer time index
- One sample: a contiguous window data[time_id : time_id + T], where \(T\) is the window length (in_step + out_step)
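The windowing rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the RealPDEBench loader: the function name `slice_window` is hypothetical, and it assumes the first in_step frames of a window serve as model input and the remaining out_step frames as the forecasting target.

```python
import numpy as np


def slice_window(data: np.ndarray, time_id: int, in_step: int, out_step: int):
    """Return (inputs, targets) for the window that starts at time_id.

    data has shape (T_full, ...) with time along the first axis.
    """
    T = in_step + out_step  # window length
    if time_id < 0 or time_id + T > data.shape[0]:
        raise IndexError(
            f"window [{time_id}, {time_id + T}) exceeds trajectory length {data.shape[0]}"
        )
    window = data[time_id : time_id + T]  # a view, no copy
    return window[:in_step], window[in_step:]


# Toy trajectory: 100 frames on a 4x8 grid (real trajectories are far larger).
data = np.arange(100 * 4 * 8, dtype=np.float32).reshape(100, 4, 8)
inputs, targets = slice_window(data, time_id=20, in_step=20, out_step=20)
print(inputs.shape, targets.shape)
```

Because basic slicing returns views, sampling many overlapping windows from one trajectory does not duplicate the underlying frames.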
Public distribution format (Hugging Face snapshot)
We distribute data as Hugging Face Datasets (Arrow) shards using a lazy-slicing architecture. Each trajectory is stored complete (all frames), and train/val/test splits are defined by separate index files. On disk, a downloaded snapshot is organized as:
{dataset_root}/
{scenario}/
hf_dataset/
real/ # Arrow: complete trajectories
data-*.arrow
dataset_info.json
state.json
numerical/ # Arrow: complete trajectories
data-*.arrow
dataset_info.json
state.json
train_index_real.json # Index: [{"sim_id": "xxx.h5", "time_id": 0}, ...]
val_index_real.json
test_index_real.json
train_index_numerical.json
val_index_numerical.json
test_index_numerical.json
in_dist_test_params_real.json
out_dist_test_params_real.json
remain_params_real.json
in_dist_test_params_numerical.json
out_dist_test_params_numerical.json
remain_params_numerical.json
The *_index_*.json files define which (sim_id, time_id) pairs belong to each split. The *_test_params_*.json files are used for test_mode filtering ("in_dist/out_dist/seen/unseen") during validation/testing.
Evaluation subsets (JSON mappings)
The *_test_params_*.json files define evaluation subsets used by test_mode filters:
- in_dist: in-distribution parameter settings
- out_dist: out-of-distribution parameter settings
- seen: settings used for training (held-out time windows)
- unseen: settings not used for training
HF Arrow schema (high level)
Each Arrow row stores one complete trajectory (all frames). Splits are defined externally by the *_index_*.json files.
- Fluid scenarios (Cylinder / Controlled Cylinder / FSI / Foil)
  - sim_id (string): trajectory identifier (e.g., 10031.h5)
  - u, v (bytes): float32 arrays of shape (T_full, H, W), the complete time series
  - p (bytes): float32 array (T_full, H, W) (numerical only)
  - vo (bytes): float32 array (T_full, H, W), vorticity
  - x (bytes): float32 array (H, W), spatial x-coordinate grid (time-invariant)
  - y (bytes): float32 array (H, W), spatial y-coordinate grid (time-invariant)
  - t (bytes): float32 array (T_full,), time stamps
  - shape_t (int): complete trajectory length (e.g., 3990, 2173)
  - shape_h, shape_w (int): spatial dimensions
- Combustion
  - sim_id (string): trajectory identifier (e.g., 40NH3_1.1.h5)
  - observed (bytes): float32 array (T_full, H, W), real-world intensity \(I\) (real) or surrogate (numerical)
  - numerical (bytes): float32 array (T_full, H, W, 15) (numerical only)
  - numerical_channels (int): number of channels (15) (numerical only)
  - x (bytes): float32 array (H, W), spatial x-coordinate grid (time-invariant)
  - y (bytes): float32 array (H, W), spatial y-coordinate grid (time-invariant)
  - t (bytes): float32 array (T_full,), time stamps
  - shape_t (int): complete trajectory length (e.g., 2001)
  - shape_h, shape_w (int): spatial dimensions
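Since array fields are stored as bytes next to their shape metadata, a row has to be decoded before use. The sketch below shows one way to do this for a fluid field such as u. It assumes each bytes field is a raw, C-ordered float32 buffer; if the bytes were serialized differently (for example with np.save), a different decoder is needed. The helper `decode_field` and the synthetic row are illustrative only.

```python
import numpy as np


def decode_field(row: dict, key: str) -> np.ndarray:
    """Decode a (T_full, H, W) float32 field from a row's bytes column.

    Assumption: the bytes are a raw C-ordered float32 buffer.
    """
    shape = (row["shape_t"], row["shape_h"], row["shape_w"])
    flat = np.frombuffer(row[key], dtype=np.float32)
    expected = shape[0] * shape[1] * shape[2]
    if flat.size != expected:
        raise ValueError(f"{key}: got {flat.size} values, expected {expected}")
    # The result is a read-only view on the bytes; call .copy() for a writable array.
    return flat.reshape(shape)


# Synthetic row that mimics the schema above with a tiny 2x3x4 field.
u = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
row = {
    "sim_id": "10031.h5",
    "u": u.tobytes(),
    "shape_t": 2,
    "shape_h": 3,
    "shape_w": 4,
}

decoded = decode_field(row, "u")
print(decoded.shape)
```

The size check catches a mismatch between the stored bytes and the shape_t, shape_h, shape_w metadata before a silent mis-reshape can occur.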
Spatial grids are stored time-invariant
x and y are identical across all frames, so they are stored once as (H, W) instead of (T, H, W). The time array t is stored as (T_full,). For methods that require per-frame coordinate grids (e.g., PINNs), broadcast at runtime:
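import numpy as np
# x, y: (H, W) and t: (T,) are assumed to be decoded NumPy arrays
T = t.shape[0]
H, W = x.shape
# x, y: (H, W) → (T, H, W)
x_grid = np.broadcast_to(x[np.newaxis, :, :], (T, H, W))
y_grid = np.broadcast_to(y[np.newaxis, :, :], (T, H, W))
# t: (T,) → (T, H, W)
t_grid = np.broadcast_to(t[:, np.newaxis, np.newaxis], (T, H, W))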
np.broadcast_to returns a read-only view with zero memory overhead.
Index file format
The {split}_index_{type}.json files map sample indices to trajectory positions:
[
{"sim_id": "10031.h5", "time_id": 0},
{"sim_id": "10031.h5", "time_id": 20},
{"sim_id": "10031.h5", "time_id": 40},
...
]
At runtime, the loader uses these indices to slice windows from the complete trajectories. Because the window length is chosen when slicing rather than fixed in the stored data, this enables dynamic N_autoregressive support.
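The index-driven slicing can be illustrated with a small sketch. It is not the RealPDEBench loader: the class `LazyWindows` is hypothetical, trajectories are assumed to be already decoded into a dict from sim_id to an array with time on the first axis, and dropping windows that would run past the end of a trajectory is a design choice made for this example.

```python
import json

import numpy as np


class LazyWindows:
    """Index-driven view over complete trajectories (illustrative only)."""

    def __init__(self, index_entries, trajectories, window_len):
        self.trajectories = trajectories
        self.window_len = window_len
        # Keep only windows that fit inside their trajectory for this length.
        self.entries = [
            e
            for e in index_entries
            if e["time_id"] + window_len <= trajectories[e["sim_id"]].shape[0]
        ]

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, i):
        entry = self.entries[i]
        start = entry["time_id"]
        return self.trajectories[entry["sim_id"]][start : start + self.window_len]


index_entries = json.loads(
    '[{"sim_id": "10031.h5", "time_id": 0},'
    ' {"sim_id": "10031.h5", "time_id": 20},'
    ' {"sim_id": "10031.h5", "time_id": 40}]'
)
# Toy 70-frame trajectory on a 4x8 grid.
trajectories = {"10031.h5": np.zeros((70, 4, 8), dtype=np.float32)}

short = LazyWindows(index_entries, trajectories, window_len=30)
long = LazyWindows(index_entries, trajectories, window_len=40)
print(len(short), len(long))
```

The same index file and the same stored trajectory serve both window lengths; only the slice taken at access time changes.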