Foundation Model Ensemble (GIFT-Eval)¶
This notebook demonstrates the evaluation of a foundation model ensemble built using the TimeCopilot library on the GIFT-Eval benchmark.
TimeCopilot is an open‑source AI agent for time series forecasting that provides a unified interface to multiple forecasting approaches, from foundation models to classical statistical, machine learning, and deep learning methods, along with built‑in ensemble capabilities for robust and explainable forecasting.
Model Description¶
This ensemble leverages TimeCopilot's MedianEnsemble feature, which combines three state-of-the-art foundation models:
The ensemble uses median aggregation with isotonic regression to ensure monotonic quantiles for probabilistic forecasting, providing robustness against outliers and model-specific biases.
TimeCopilot's Key Features¶
- Foundation model integration: Unified API for 7+ state‑of‑the‑art foundation models
- Ensemble capabilities: Built-in ensemble methods
- Zero-shot capability: Leverages pretrained foundation models out‑of‑the‑box
- Dependency management: Handles complex model requirements automatically
- GPU efficiency: Optimized memory sharing and multi‑model execution
Requirements and Installation¶
Install TimeCopilot library:
%pip install "timecopilot>=0.0.15"
Dataset Setup¶
TimeCopilot includes built-in GIFT-Eval integration for dataset handling:
from timecopilot.gift_eval.eval import GIFTEval
# TimeCopilot's built-in GIFT-Eval dataset downloader
# Handles the complete benchmark dataset with all 97 configurations
storage_path = "./data/gift-eval"
GIFTEval.download_data(storage_path=storage_path)
Model Implementation¶
Using TimeCopilot's model hub and ensemble capabilities to create a foundation model ensemble:
from timecopilot.models.ensembles.median import MedianEnsemble
from timecopilot.models.foundation.moirai import Moirai
from timecopilot.models.foundation.sundial import Sundial
from timecopilot.models.foundation.toto import Toto
from timecopilot.models.utils.forecaster import Forecaster
batch_size = 64
# TimeCopilot's MedianEnsemble with isotonic regression for robust forecasting
# Automatically handles dependency conflicts and GPU memory management
ensemble = MedianEnsemble(
models=[
# Each model uses TimeCopilot's unified interface despite different architectures
Moirai(
repo_id="Salesforce/moirai-1.1-R-large",
batch_size=batch_size,
),
Sundial(batch_size=batch_size),
Toto(
context_length=1_024,
batch_size=batch_size,
),
],
alias="TimeCopilot",
)
Evaluation¶
Defining the evaluator¶
With TimeCopilot you can evaluate any Forecaster in a standardized way using its GIFT-Eval integration.
import pandas as pd
from timecopilot.gift_eval.eval import GIFTEval
from timecopilot.gift_eval.gluonts_predictor import GluonTSPredictor
def evaluate_forecaster(
forecaster: Forecaster,
dataset_name: str,
term: str,
output_path: str,
storage_path: str,
):
"""Evaluate a forecaster on a GIFT-Eval dataset defined by dataset name and term."""
# TimeCopilot's GIFT-Eval loader handles dataset preprocessing automatically
gifteval = GIFTEval(
dataset_name=dataset_name,
term=term,
output_path=output_path,
storage_path=storage_path,
)
# GluonTS wrapper for GIFT-Eval compatibility
# It can receive any Forecaster from TimeCopilot
predictor = GluonTSPredictor(
forecaster=forecaster,
max_length=4_096,
batch_size=1_024,
)
# Run evaluation with GIFT-Eval's standardized metrics
gifteval.evaluate_predictor(predictor, batch_size=512)
Performing evaluation¶
In the GIFT-Eval benchmark, each dataset is defined by a combination of a dataset name and its term (short, medium or long).
import torch
if torch.cuda.is_available(): # remove if you want to run on CPU
combinations = [
("m4_weekly", "short"),
("bizitobs_l2c/H", "short"),
("bizitobs_l2c/H", "medium"),
("bizitobs_l2c/H", "long"),
]
for dataset_name, term in combinations:
evaluate_forecaster(
forecaster=ensemble,
dataset_name=dataset_name,
term=term,
output_path=f"./results/timecopilot",
storage_path=storage_path,
)
# Load consolidated results in GIFT-Eval format
eval_df = pd.read_csv("./results/timecopilot/all_results.csv")
if torch.cuda.is_available():
eval_df
dataset | model | eval_metrics/MSE[mean] | eval_metrics/MSE[0.5] | eval_metrics/MAE[0.5] | eval_metrics/MASE[0.5] | eval_metrics/MAPE[0.5] | eval_metrics/sMAPE[0.5] | eval_metrics/MSIS | eval_metrics/RMSE[mean] | eval_metrics/NRMSE[mean] | eval_metrics/ND[0.5] | eval_metrics/mean_weighted_sum_quantile_loss | domain | num_variates | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | m4_weekly/W/short | TimeCopilot | 315007.262246 | 315007.262246 | 272.970319 | 2.107419 | 0.065258 | 0.065161 | 15.018074 | 561.255078 | 0.102252 | 0.049731 | 0.039081 | Econ/Fin | 1 |
1 | bizitobs_l2c/H/short | TimeCopilot | 87.605865 | 87.605865 | 5.529776 | 0.547469 | 0.414267 | 0.686486 | 4.308162 | 9.359800 | 0.504517 | 0.298069 | 0.239323 | Web/CloudOps | 7 |
2 | bizitobs_l2c/H/medium | TimeCopilot | 170.110478 | 170.110478 | 7.929163 | 0.793578 | 0.633269 | 0.931765 | 5.353425 | 13.042641 | 0.789752 | 0.480123 | 0.368329 | Web/CloudOps | 7 |
3 | bizitobs_l2c/H/long | TimeCopilot | 134.774621 | 134.774621 | 7.192394 | 0.754836 | 0.711176 | 0.865302 | 7.389433 | 11.609247 | 0.709126 | 0.439332 | 0.357904 | Web/CloudOps | 7 |
You can access the complete combination of datasets with the following:
from timecopilot.gift_eval.utils import DATASETS_WITH_TERMS
DATASETS_WITH_TERMS[:3]
[('m4_yearly', 'short'), ('m4_quarterly', 'short'), ('m4_monthly', 'short')]
len(DATASETS_WITH_TERMS)
97
The code for the complete evaluation can be found in the library's repo.
Reproducibility statement¶
The TimeCopilot's GIFT-Eval integration was designed considering reproducibility as one of its main features. The library can replicate the official results provided by the mantainers of the benchmark for the SeasonalNaive
method. The following code replicates the Seasonal Naive performance for the datasets evaluated in this notebook. The reproducibility of the results for the rest of the datasets are tested continuously in the library's repo.
from timecopilot.models.stats import SeasonalNaive
combinations = [
("m4_weekly", "short"),
("bizitobs_l2c/H", "short"),
("bizitobs_l2c/H", "medium"),
("bizitobs_l2c/H", "long"),
]
for dataset_name, term in combinations:
evaluate_forecaster(
forecaster=SeasonalNaive(alias="Seasonal_Naive"),
dataset_name=dataset_name,
term=term,
output_path=f"./results/seasonal_naive",
storage_path=storage_path,
)
eval_df_sn = pd.read_csv("./results/seasonal_naive/all_results.csv")
eval_df_sn
dataset | model | eval_metrics/MSE[mean] | eval_metrics/MSE[0.5] | eval_metrics/MAE[0.5] | eval_metrics/MASE[0.5] | eval_metrics/MAPE[0.5] | eval_metrics/sMAPE[0.5] | eval_metrics/MSIS | eval_metrics/RMSE[mean] | eval_metrics/NRMSE[mean] | eval_metrics/ND[0.5] | eval_metrics/mean_weighted_sum_quantile_loss | domain | num_variates | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | m4_weekly/W/short | Seasonal_Naive | 453525.145918 | 453525.145918 | 347.991483 | 2.777295 | 0.089373 | 0.091613 | 26.631225 | 673.442756 | 0.122691 | 0.063399 | 0.060870 | Econ/Fin | 1 |
1 | bizitobs_l2c/H/short | Seasonal_Naive | 281.843068 | 281.843068 | 12.531653 | 1.214064 | 1.360590 | 1.138373 | 7.486931 | 16.788182 | 0.904926 | 0.675488 | 0.521168 | Web/CloudOps | 7 |
2 | bizitobs_l2c/H/medium | Seasonal_Naive | 456.373289 | 456.373289 | 15.667392 | 1.510286 | 1.691291 | 1.402410 | 18.533654 | 21.362895 | 1.293556 | 0.948684 | 0.904205 | Web/CloudOps | 7 |
3 | bizitobs_l2c/H/long | Seasonal_Naive | 309.272222 | 309.272222 | 13.635488 | 1.426054 | 2.438311 | 0.916854 | 22.036198 | 17.586137 | 1.074212 | 0.832895 | 0.941065 | Web/CloudOps | 7 |
official_eval_sn = pd.read_csv(
"https://huggingface.co/spaces/Salesforce/GIFT-Eval/raw/main/results/seasonal_naive/all_results.csv"
)
official_eval_sn = official_eval_sn.set_index("dataset").loc[eval_df_sn["dataset"]].reset_index()
official_eval_sn
dataset | model | eval_metrics/MSE[mean] | eval_metrics/MSE[0.5] | eval_metrics/MAE[0.5] | eval_metrics/MASE[0.5] | eval_metrics/MAPE[0.5] | eval_metrics/sMAPE[0.5] | eval_metrics/MSIS | eval_metrics/RMSE[mean] | eval_metrics/NRMSE[mean] | eval_metrics/ND[0.5] | eval_metrics/mean_weighted_sum_quantile_loss | domain | num_variates | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | m4_weekly/W/short | Seasonal_Naive | 453525.145918 | 453525.145918 | 347.991483 | 2.777295 | 0.089373 | 0.091613 | 26.631225 | 673.442756 | 0.122691 | 0.063399 | 0.060870 | Econ/Fin | 1 |
1 | bizitobs_l2c/H/short | Seasonal_Naive | 281.843068 | 281.843068 | 12.531653 | 1.214064 | 1.360590 | 1.138373 | 7.486931 | 16.788182 | 0.904926 | 0.675488 | 0.521168 | Web/CloudOps | 7 |
2 | bizitobs_l2c/H/medium | Seasonal_Naive | 456.373289 | 456.373289 | 15.667392 | 1.510286 | 1.691291 | 1.402410 | 18.533654 | 21.362895 | 1.293556 | 0.948684 | 0.904205 | Web/CloudOps | 7 |
3 | bizitobs_l2c/H/long | Seasonal_Naive | 309.272222 | 309.272222 | 13.635488 | 1.426054 | 2.438311 | 0.916854 | 22.036198 | 17.586137 | 1.074212 | 0.832895 | 0.941065 | Web/CloudOps | 7 |
pd.testing.assert_frame_equal(official_eval_sn, eval_df_sn)
Changelog¶
2025-08-05¶
GIFT‑Eval recently enhanced its evaluation dashboard with a new flag that identifies models likely affected by data leakage (i.e., having seen parts of the test set during training). While the test set itself hasn’t changed, this new insight helps us better interpret model performance. To keep our results focused on truly unseen data, we’ve excluded any flagged models from this experiment and added the Sundial model to the ensemble. The previous experiment details remain available here.