Datasets

RealPDEBench contains 5 scenarios with paired real-world measurements and matched numerical simulations. Each scenario provides two branches:

  • Real-world (real): experimentally measured fields (often incomplete)
  • Simulated (numerical): CFD/LES fields (often include unmeasured modalities, e.g. pressure)

What "paired" means in RealPDEBench

On disk, the two branches of a scenario are stored as:

  • real/: experimentally measured fields (may be incomplete)
  • numerical/: simulation fields (can include additional, unmeasured modalities)

"Paired" means that the real and numerical trajectories correspond to the same configuration (e.g., Reynolds number, control frequency, mixture ratio), enabling sim→real transfer and modality-mismatch evaluation. Note that paired trajectories are matched by configuration, but their initial frames are not necessarily aligned.

Dataset inventory

The table below summarizes dataset sizes, temporal resolution, spatial resolution, and observed modalities.

Dataset             | n_traj | n_frame | \(\Delta t\) (s)      | Resolution (sim) | Resolution (real) | Memory (GB) | Modalities (sim)          | Modalities (real)
--------------------|--------|---------|-----------------------|------------------|-------------------|-------------|---------------------------|------------------
Cylinder            | 92 × 2 | 3990    | \(2.5\times 10^{-3}\) | 64×128           | 128×256           | 190.50      | \(u,v,p\)                 | \(u,v\)
Controlled Cylinder | 96 × 2 | 3990    | \(2.5\times 10^{-3}\) | 64×128           | 128×256           | 187.08      | \(u,v,p\)                 | \(u,v\)
FSI                 | 51 × 2 | 2173    | \(2.0\times 10^{-3}\) | 128×128          | 128×128           | 94.73       | \(u,v,p\)                 | \(u,v\)
Foil                | 99 × 2 | 3990    | \(2.5\times 10^{-3}\) | 128×256          | 128×256           | 335.64      | \(u,v,p\)                 | \(u,v\)
Combustion          | 30 × 2 | 2001    | \(2.5\times 10^{-4}\) | 128×128          | 128×128           | 110.12      | multi-modal (15 channels) | \(I\)

Note

We use n_traj = X × 2 to indicate paired trajectories: X real-world and X numerical trajectories for the same scenario.

Windowing: sim_id + time_id

RealPDEBench evaluates forecasting on short spatiotemporal windows sampled from long trajectories:

  • Trajectory ID (sim_id): trajectory identifier string (e.g., 1800, 1781_0.5, 40NH3_1.1)
  • Window start (time_id): integer time index
  • One sample: a contiguous window data[time_id : time_id + T], where \(T\) is the window length (in_step + out_step)
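The window definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the benchmark's loader: the helper name slice_window, the toy array shape, and the values in_step=20 and out_step=20 are assumptions chosen for the example, as is the convention that the first in_step frames are the model input and the remaining out_step frames are the target.

```python
import numpy as np

def slice_window(data, time_id, in_step, out_step):
    """Cut one sample data[time_id : time_id + T] and split it into input/target.

    Hypothetical helper: `data` is one trajectory with time on axis 0.
    """
    T = in_step + out_step  # window length
    if time_id < 0 or time_id + T > data.shape[0]:
        raise IndexError("window does not fit inside the trajectory")
    window = data[time_id : time_id + T]
    # Assumed convention: first in_step frames are input, the rest are target.
    return window[:in_step], window[in_step:]

# Toy trajectory with 3990 frames on a small 8x16 grid (illustrative sizes only).
trajectory = np.zeros((3990, 8, 16), dtype=np.float32)
inputs, targets = slice_window(trajectory, time_id=40, in_step=20, out_step=20)
```

Basic slicing returns views, so cutting many overlapping windows from one trajectory does not copy the underlying frames.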

Public distribution format (Hugging Face snapshot)

We distribute data as Hugging Face Datasets (Arrow) shards using a lazy-slicing architecture. Each trajectory is stored complete (all frames), and train/val/test splits are defined by separate index files. On disk, a downloaded snapshot is organized as:

{dataset_root}/
  {scenario}/
    hf_dataset/
      real/                           # Arrow: complete trajectories
        data-*.arrow
        dataset_info.json
        state.json
      numerical/                      # Arrow: complete trajectories
        data-*.arrow
        dataset_info.json
        state.json
      train_index_real.json           # Index: [{"sim_id": "xxx.h5", "time_id": 0}, ...]
      val_index_real.json
      test_index_real.json
      train_index_numerical.json
      val_index_numerical.json
      test_index_numerical.json
    in_dist_test_params_real.json
    out_dist_test_params_real.json
    remain_params_real.json
    in_dist_test_params_numerical.json
    out_dist_test_params_numerical.json
    remain_params_numerical.json

The *_index_*.json files define which (sim_id, time_id) pairs belong to each split. The *_test_params_*.json files are used for test_mode filtering (in_dist, out_dist, seen, or unseen) during validation and testing.
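As a rough sketch of how the layout above could be navigated, the helper below builds the Arrow directory and index-file path for one scenario, branch, and split. The function name snapshot_paths, the root "data", and the scenario directory name "cylinder" are hypothetical; check the downloaded snapshot for the actual directory names.

```python
from pathlib import Path

def snapshot_paths(dataset_root, scenario, branch, split):
    """Return (arrow_dir, index_file) for one scenario/branch/split.

    Hypothetical helper that only mirrors the directory tree shown above.
    """
    if branch not in ("real", "numerical"):
        raise ValueError(f"unknown branch: {branch}")
    if split not in ("train", "val", "test"):
        raise ValueError(f"unknown split: {split}")
    hf_dir = Path(dataset_root) / scenario / "hf_dataset"
    return hf_dir / branch, hf_dir / f"{split}_index_{branch}.json"

# "data" and "cylinder" are placeholder names, not confirmed by the docs.
arrow_dir, index_file = snapshot_paths("data", "cylinder", "real", "train")

# The data-*.arrow / dataset_info.json / state.json files match what the
# Hugging Face `datasets` library writes with save_to_disk, so (assumption)
# the directory can likely be opened with datasets.load_from_disk(arrow_dir).
```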

Evaluation subsets (JSON mappings)

The *_test_params_*.json files define evaluation subsets used by test_mode filters:

  • in_dist: in-distribution parameter settings
  • out_dist: out-of-distribution parameter settings
  • seen: settings used for training (held-out time windows)
  • unseen: settings not used for training

HF Arrow schema (high level)

Each Arrow row stores one complete trajectory (all frames). Splits are defined externally by the *_index_*.json files.

  • Fluid scenarios (Cylinder / Controlled Cylinder / FSI / Foil)
      • sim_id (string): trajectory identifier (e.g., 10031.h5)
      • u, v (bytes): float32 arrays of shape (T_full, H, W), complete time series
      • p (bytes): float32 array (T_full, H, W) (numerical only)
      • vo (bytes): float32 array (T_full, H, W), vorticity
      • x (bytes): float32 array (H, W), spatial x-coordinate grid (time-invariant)
      • y (bytes): float32 array (H, W), spatial y-coordinate grid (time-invariant)
      • t (bytes): float32 array (T_full,), time stamps
      • shape_t (int): complete trajectory length (e.g., 3990, 2173)
      • shape_h, shape_w (int): spatial dimensions

  • Combustion
      • sim_id (string): trajectory identifier (e.g., 40NH3_1.1.h5)
      • observed (bytes): float32 array (T_full, H, W), real-world intensity \(I\) (real) or surrogate (numerical)
      • numerical (bytes): float32 array (T_full, H, W, 15) (numerical only)
      • numerical_channels (int): number of channels (15) (numerical only)
      • x (bytes): float32 array (H, W), spatial x-coordinate grid (time-invariant)
      • y (bytes): float32 array (H, W), spatial y-coordinate grid (time-invariant)
      • t (bytes): float32 array (T_full,), time stamps
      • shape_t (int): complete trajectory length (e.g., 2001)
      • shape_h, shape_w (int): spatial dimensions
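Because the fields are stored as bytes, they need to be decoded back into arrays before use. The sketch below shows one plausible way to do this for a (T_full, H, W) field. It assumes the bytes are a raw, C-ordered float32 buffer in native byte order, which the schema suggests but does not state outright. The helper decode_field and the synthetic row are illustrative, not part of the benchmark code.

```python
import numpy as np

def decode_field(row, key):
    """Decode a (T_full, H, W) float32 field from one row (hypothetical helper).

    Assumes row[key] is a raw C-ordered float32 buffer; verify against real data.
    """
    shape = (row["shape_t"], row["shape_h"], row["shape_w"])
    return np.frombuffer(row[key], dtype=np.float32).reshape(shape)

# Synthetic row that mimics the fluid schema with tiny dimensions.
truth = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
row = {
    "sim_id": "10031.h5",
    "u": truth.tobytes(),
    "shape_t": 2,
    "shape_h": 3,
    "shape_w": 4,
}
u = decode_field(row, "u")
```

The same pattern would apply to x and y with shape (shape_h, shape_w) and to t with shape (shape_t,). Note that np.frombuffer on a bytes object returns a read-only array, so call .copy() before modifying it in place.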

Spatial grids are stored time-invariant

x and y are identical across all frames, so they are stored once as (H, W) instead of (T, H, W). The time array t is stored as (T_full,). For methods that require per-frame coordinate grids (e.g., PINNs), broadcast at runtime:

import numpy as np

# x, y: (H, W) and t: (T,), as decoded from one trajectory row
T = t.shape[0]
H, W = x.shape

# x, y: (H, W) → (T, H, W)
x_grid = np.broadcast_to(x[np.newaxis, :, :], (T, H, W))
y_grid = np.broadcast_to(y[np.newaxis, :, :], (T, H, W))

# t: (T,) → (T, H, W)
t_grid = np.broadcast_to(t[:, np.newaxis, np.newaxis], (T, H, W))

np.broadcast_to returns a read-only view with zero memory overhead.
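To make the read-only behaviour concrete, the following sketch builds toy coordinate arrays, broadcasts them, and then stacks them into a per-point coordinate tensor of the kind a PINN-style model might consume. The grid sizes, the meshgrid-based coordinates, and the (t, x, y) channel ordering are assumptions made for illustration.

```python
import numpy as np

T, H, W = 4, 3, 5  # toy sizes, not benchmark values

# Toy time-invariant grids of shape (H, W) and time stamps of shape (T,).
y, x = np.meshgrid(np.linspace(0.0, 1.0, H), np.linspace(0.0, 2.0, W), indexing="ij")
t = np.arange(T, dtype=np.float64) * 2.5e-3

x_grid = np.broadcast_to(x[np.newaxis, :, :], (T, H, W))
y_grid = np.broadcast_to(y[np.newaxis, :, :], (T, H, W))
t_grid = np.broadcast_to(t[:, np.newaxis, np.newaxis], (T, H, W))

# The broadcast result is a view on the original data and cannot be written to.
is_view = np.shares_memory(x_grid, x)
is_writeable = x_grid.flags.writeable

# Stacking materialises a new, writeable array of shape (T, H, W, 3).
coords = np.stack([t_grid, x_grid, y_grid], axis=-1)
```

The memory saving only holds while the arrays stay as views. Operations such as np.stack or .copy() allocate the full (T, H, W) storage.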

Index file format

The {split}_index_{type}.json files map sample indices to trajectory positions:

[
  {"sim_id": "10031.h5", "time_id": 0},
  {"sim_id": "10031.h5", "time_id": 20},
  {"sim_id": "10031.h5", "time_id": 40},
  ...
]

At runtime, the loader uses these entries to slice windows from the complete trajectories. Because windows are cut on the fly rather than stored, the window length can change with the N_autoregressive setting at load time.
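The lookup described above might look roughly like the sketch below. It is a simplified stand-in for the actual loader: the in-memory trajectories dictionary, the function get_sample, and the window length of 40 are assumptions, and the index is parsed from an inline string rather than read from a {split}_index_{type}.json file.

```python
import json
import numpy as np

# Inline stand-in for the contents of an index file.
index_json = """
[
  {"sim_id": "10031.h5", "time_id": 0},
  {"sim_id": "10031.h5", "time_id": 20},
  {"sim_id": "10031.h5", "time_id": 40}
]
"""
index = json.loads(index_json)

# Hypothetical cache of complete trajectories keyed by sim_id.
# Frame k holds the value k so the slicing is easy to verify.
trajectories = {
    "10031.h5": np.arange(100, dtype=np.float32).reshape(100, 1, 1),
}

def get_sample(i, window_len):
    """Return the window for index entry i, sliced lazily from its trajectory."""
    entry = index[i]
    data = trajectories[entry["sim_id"]]
    start = entry["time_id"]
    return data[start : start + window_len]

sample = get_sample(1, window_len=40)
```

Since window_len is an argument rather than a property of the stored data, the same index can serve different rollout lengths, provided each window still fits inside its trajectory.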