from src.utils import COLORS

INTRODUCTION_TEXT = """
This space hosts evaluation results for time series forecasting models.

The results are obtained using [fev](https://github.com/autogluon/fev) - a lightweight library for evaluating time series forecasting models.
"""

LEGEND = """
"""

TABLE_INFO = f"""
The leaderboard summarizes the performance of all models evaluated on the 100 tasks comprising **fev-bench**.
More details are available in the [paper](https://arxiv.org/abs/2509.26468).

Model names are colored by type: Deep Learning and Statistical.

The full matrix $E_{{rj}}$ with the error of each model $j$ on task $r$ is available at the bottom of the page.

* **Avg. win rate (%)**: Fraction of all possible model pairs and tasks where this model achieves lower error than the competing model.
    For model $j$, defined as $W_j = \\frac{{1}}{{R(M-1)}} \\sum_{{r=1}}^{{R}} \\sum_{{k \\neq j}} (\\mathbf{{1}}(E_{{rj}} < E_{{rk}}) + 0.5 \\cdot \\mathbf{{1}}(E_{{rj}} = E_{{rk}}))$, where $R$ is the number of tasks and $M$ is the number of models. Ties count as half-wins.
    Ranges from 0% (worst) to 100% (best). Higher values are better. This value changes as new models are added to the benchmark.
* **Skill score (%)**: Measures how much the model reduces forecasting error compared to the Seasonal Naive baseline.
    Computed as $S_j = 100 \\times (1 - \\sqrt[R]{{\\prod_{{r=1}}^{{R}} E_{{rj}}/E_{{r\\beta}}}})$, where $E_{{r\\beta}}$ is the baseline error on task $r$.
    Relative errors are clipped between 0.01 and 100 before aggregation to avoid extreme outliers.
    Positive values indicate better-than-baseline performance, negative values indicate worse-than-baseline performance. Higher values are better. This value does not change as new models are added to the benchmark.
* **Median runtime (s)**: Median end-to-end time (training + prediction across all evaluation windows) in seconds.
    Note that runtimes depend on hardware, batch sizes, and implementation details, so these serve as a rough guide rather than definitive performance benchmarks.
* **Leakage (%)**: For zero-shot models, the percentage of benchmark datasets included in the model's training corpus.
    Results for tasks with reported overlap are replaced with Chronos-Bolt (Base) performance to prevent data leakage.
* **Failed tasks (%)**: Percentage of tasks where the model failed to produce a forecast. Results for failed tasks are replaced with Seasonal Naive performance.
* **Zero-shot**: Indicates whether the model can make predictions without task-specific training (✓ = zero-shot, × = task-specific).
"""

CHRONOS_BENCHMARK_BASIC_INFO = f"""
**Chronos Benchmark II** contains results for various forecasting models on the 27 datasets used in Benchmark II in the paper [Chronos: Learning the Language of Time Series](https://arxiv.org/abs/2403.07815).
{LEGEND}
"""

CHRONOS_BENCHMARK_DETAILS = f"""
{TABLE_INFO}
Task definitions and the detailed results are available on [GitHub](https://github.com/autogluon/fev/tree/main/benchmarks/chronos_zeroshot).
More information about the datasets is available in [Table 3 of the paper](https://arxiv.org/abs/2403.07815).
"""

FEV_BENCHMARK_BASIC_INFO = f"""
Results for various forecasting models on the 100 tasks of the **fev-bench** benchmark, as described in the paper [fev-bench: A Realistic Benchmark for Time Series Forecasting](https://arxiv.org/abs/2509.26468).
{LEGEND}
"""
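
# Illustrative sketch, not used by the leaderboard: how the average win rate
# described in TABLE_INFO can be computed from an error matrix E with one row
# per task and one column per model. The function name and the nested-list
# input format are assumptions made for this example only.
def _example_avg_win_rate(E: list[list[float]], j: int) -> float:
    """Return the average win rate (in %) of model ``j`` given errors ``E[r][k]``."""
    R, M = len(E), len(E[0])
    wins = 0.0
    for r in range(R):
        for k in range(M):
            if k == j:
                continue
            if E[r][j] < E[r][k]:
                wins += 1.0  # outright win against model k on task r
            elif E[r][j] == E[r][k]:
                wins += 0.5  # ties count as half-wins
    return 100 * wins / (R * (M - 1))
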
FEV_BENCHMARK_DETAILS = f"""
{TABLE_INFO}
Task definitions and the detailed results are available on [GitHub](https://github.com/autogluon/fev/tree/main/benchmarks/).
Datasets used for evaluation are available on [Hugging Face](https://huggingface.co/datasets/autogluon/fev_datasets).
"""

CITATION_HEADER = """
If you find this leaderboard useful for your research, please consider citing the associated paper(s):
"""

CITATION_FEV = """
```
@article{shchur2025fev,
  title={{fev-bench}: A Realistic Benchmark for Time Series Forecasting},
  author={Shchur, Oleksandr and Ansari, Abdul Fatir and Turkmen, Caner and Stella, Lorenzo and Erickson, Nick and Guerron, Pablo and Bohlke-Schneider, Michael and Wang, Yuyang},
  year={2025},
  eprint={2509.26468},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
"""
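
# Illustrative sketch, not used by the leaderboard: the skill score described in
# TABLE_INFO, i.e. 100 * (1 - geometric mean of per-task relative errors vs. the
# Seasonal Naive baseline), with relative errors clipped to [0.01, 100] before
# aggregation. The function name and signature are assumptions for this example.
import math


def _example_skill_score(model_errors: list[float], baseline_errors: list[float]) -> float:
    """Return the skill score (in %) given per-task errors of the model and the baseline."""
    rel_errors = [
        min(max(e_model / e_baseline, 0.01), 100.0)  # clip extreme ratios
        for e_model, e_baseline in zip(model_errors, baseline_errors)
    ]
    # Geometric mean computed in log space for numerical stability.
    geo_mean = math.exp(sum(math.log(x) for x in rel_errors) / len(rel_errors))
    return 100 * (1 - geo_mean)
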

def get_pivot_legend(baseline_model: str, leakage_imputation_model: str) -> str:
    return f"""
Task definitions and raw results in CSV format are available on [GitHub](https://github.com/autogluon/fev/tree/main/benchmarks/fev_bench).

Best results for each task are marked with 🥇 1st 🥈 2nd 🥉 3rd

**Imputation:**
- Failed tasks imputed by {baseline_model}
- Leaky tasks imputed by {leakage_imputation_model}
"""


PAIRWISE_BENCHMARK_DETAILS = """
The pairwise charts show head-to-head results between models:

* **Win rate**: Percentage of tasks where Model 1 achieves lower error than Model 2 (ties count as half-wins).
    A value above 50% means Model 1 is more accurate than Model 2 on average.
* **Skill score**: Average relative error reduction of Model 1 with respect to Model 2.
    A positive value means Model 1 reduces forecasting error compared to Model 2 on average.

**Confidence Intervals**: 95% intervals are estimated using 1000 bootstrap samples over tasks.
For each bootstrap sample, tasks are resampled with replacement and the pairwise win rate / skill score are recomputed.
The intervals correspond to the 2.5th and 97.5th percentiles of these bootstrap distributions, capturing how model comparisons vary under alternative benchmark compositions.
"""

CITATION_CHRONOS = """
```
@article{ansari2024chronos,
  title={Chronos: Learning the Language of Time Series},
  author={Ansari, Abdul Fatir and Stella, Lorenzo and Turkmen, Caner and Zhang, Xiyuan and Mercado, Pedro and Shen, Huibin and Shchur, Oleksandr and Rangapuram, Syama Sundar and Pineda Arango, Sebastian and Kapoor, Shubham and Zschiegner, Jasper and Maddix, Danielle C. and Wang, Hao and Mahoney, Michael W. and Torkkola, Kari and Gordon Wilson, Andrew and Bohlke-Schneider, Michael and Wang, Yuyang},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=gerNCVqqtR}
}
```
"""
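
# Illustrative sketch, not used by the leaderboard: the bootstrap confidence
# intervals described in PAIRWISE_BENCHMARK_DETAILS. Tasks are resampled with
# replacement 1000 times, the pairwise win rate of Model 1 over Model 2 is
# recomputed on each resample, and the 2.5th / 97.5th percentiles are returned.
# The function name, signature, and use of ``random``/``statistics`` are
# assumptions for this example; the skill score can be bootstrapped the same way.
import random
import statistics


def _example_bootstrap_win_rate_ci(
    errors_model_1: list[float],
    errors_model_2: list[float],
    n_bootstrap: int = 1000,
    seed: int = 0,
) -> tuple[float, float]:
    """Return a 95% bootstrap CI (in %) for the win rate of Model 1 over Model 2."""

    def win_rate(e1: list[float], e2: list[float]) -> float:
        # Ties count as half-wins.
        wins = sum(1.0 if a < b else 0.5 if a == b else 0.0 for a, b in zip(e1, e2))
        return 100 * wins / len(e1)

    rng = random.Random(seed)
    n_tasks = len(errors_model_1)
    samples = []
    for _ in range(n_bootstrap):
        # Resample tasks with replacement and recompute the win rate.
        idx = [rng.randrange(n_tasks) for _ in range(n_tasks)]
        samples.append(win_rate([errors_model_1[i] for i in idx], [errors_model_2[i] for i in idx]))
    # statistics.quantiles with n=40 returns cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(samples, n=40)
    return cuts[0], cuts[-1]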