# Data-oriented Metrics
These metrics evaluate prediction accuracy directly in data space; the final metric, Update Ratio, instead measures training efficiency.
We follow the notation in our paper: let \(\{\mathbf{y}_k\}_{k=1}^K\) be ground-truth samples and \(\{\hat{\mathbf{y}}_k\}_{k=1}^K\) be model predictions. Each sample is indexed by time \(t=1,\ldots,T\) and spatial grid points \(\{\mathbf{x}_i\}_{i=1}^I\).
## RMSE (Root Mean Square Error)

\[
\mathrm{RMSE} = \sqrt{\frac{1}{KTI} \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{I} \bigl(\hat{y}_k(t, \mathbf{x}_i) - y_k(t, \mathbf{x}_i)\bigr)^2}
\]

Measures the average magnitude of the prediction error.

Better: lower is better (0 is perfect).
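A minimal NumPy sketch, assuming ground truth and predictions are stored as arrays of shape \((K, T, I)\) (sample × time × grid point); the function name `rmse` is illustrative, not part of the benchmark's API.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """RMSE averaged over all samples, time steps, and grid points.

    Assumes arrays of shape (K, T, I); names are illustrative.
    """
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```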
## MAE (Mean Absolute Error)

\[
\mathrm{MAE} = \frac{1}{KTI} \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{I} \bigl|\hat{y}_k(t, \mathbf{x}_i) - y_k(t, \mathbf{x}_i)\bigr|
\]

Measures the mean absolute deviation; less sensitive to outliers than RMSE.

Better: lower is better (0 is perfect).
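The same \((K, T, I)\) array layout works for MAE; again a minimal sketch with illustrative names.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error over all samples, time steps, and grid points."""
    return float(np.mean(np.abs(y_pred - y_true)))
```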
## Relative \(L_2\) Error

\[
\mathrm{Rel.}\ L_2 = \frac{1}{K} \sum_{k=1}^{K} \frac{\lVert \hat{\mathbf{y}}_k - \mathbf{y}_k \rVert_2}{\lVert \mathbf{y}_k \rVert_2}
\]

A normalized error that is independent of the scale of the data.

Better: lower is better (0 is perfect).
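A sketch assuming each sample's relative error is computed over its flattened space-time field and then averaged over the \(K\) samples, matching the formula above; names are illustrative.

```python
import numpy as np

def relative_l2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean over samples of ||y_hat_k - y_k||_2 / ||y_k||_2."""
    K = y_true.shape[0]
    err = np.linalg.norm((y_pred - y_true).reshape(K, -1), axis=1)  # per-sample error norm
    ref = np.linalg.norm(y_true.reshape(K, -1), axis=1)             # per-sample target norm
    return float(np.mean(err / ref))
```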
## \(R^2\) (Coefficient of Determination)

\[
R^2 = 1 - \frac{\sum_{k=1}^{K} \lVert \mathbf{y}_k - \hat{\mathbf{y}}_k \rVert_2^2}{\sum_{k=1}^{K} \lVert \mathbf{y}_k - \bar{\mathbf{y}} \rVert_2^2},
\]

where \(\bar{\mathbf{y}} = \sum_k \mathbf{y}_k / K\).

A goodness-of-fit measure (\(1.0\) = perfect fit).

Better: higher is better (1 is perfect).
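A sketch following the formula above, with \(\bar{\mathbf{y}}\) taken as the mean over the \(K\) ground-truth samples; the function name is illustrative.

```python
import numpy as np

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination over arrays of shape (K, T, I)."""
    y_bar = y_true.mean(axis=0, keepdims=True)   # sample mean \bar{y}, broadcast over (T, I)
    ss_res = np.sum((y_true - y_pred) ** 2)      # residual sum of squares
    ss_tot = np.sum((y_true - y_bar) ** 2)       # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```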
## Update Ratio (Training Efficiency)

Measures how efficiently simulated pretraining followed by real-world finetuning reaches the accuracy of real-world training from scratch.

Let \(\mathrm{RMSE}_0\) denote the best RMSE achieved by real-world training from scratch. Define \(N_1\) as the number of real-world finetuning updates (after simulated pretraining) and \(N_2\) as the number of from-scratch training updates required to reach \(\mathrm{RMSE}_0\). The metric is \(N_1 / N_2\).
Better: lower is better. Values \(< 1\) indicate pretraining reduces the number of updates needed to match real-world training.
Note: Update Ratio is only reported for the Real-world finetuning setting (simulated pretraining → real-world finetuning).
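A hypothetical sketch of how \(N_1 / N_2\) could be computed from per-update validation RMSE curves; the function name, inputs, and curve format are assumptions, not the paper's interface. For example, if finetuning first reaches \(\mathrm{RMSE}_0\) at update 2,000 while from-scratch training needs 10,000 updates, the Update Ratio is 0.2.

```python
import numpy as np

def update_ratio(finetune_rmse: np.ndarray, scratch_rmse: np.ndarray) -> float:
    """N1 / N2 computed from per-update RMSE curves (index 0 = first update).

    RMSE_0 is taken as the best RMSE reached by from-scratch training;
    this assumes the finetuning curve also reaches RMSE_0 at some point.
    """
    rmse_0 = scratch_rmse.min()
    assert finetune_rmse.min() <= rmse_0, "finetuning never reaches RMSE_0"
    n1 = int(np.argmax(finetune_rmse <= rmse_0)) + 1  # first finetuning update at RMSE_0
    n2 = int(np.argmax(scratch_rmse <= rmse_0)) + 1   # first from-scratch update at RMSE_0
    return n1 / n2
```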