Foundation Model Ensemble (GIFT-Eval)¶

This notebook demonstrates the evaluation of a foundation model ensemble built using the TimeCopilot library on the GIFT-Eval benchmark.

TimeCopilot is an open‑source AI agent for time series forecasting that provides a unified interface to multiple forecasting approaches, from foundation models to classical statistical, machine learning, and deep learning methods, along with built‑in ensemble capabilities for robust and explainable forecasting.

Model Description¶

This ensemble leverages TimeCopilot's MedianEnsemble feature, which combines three state-of-the-art foundation models:

The ensemble uses median aggregation with isotonic regression to ensure monotonic quantiles for probabilistic forecasting, providing robustness against outliers and model-specific biases.

TimeCopilot's Key Features¶

Foundation model integration: Unified API for 30+ state‑of‑the‑art foundation models
Ensemble capabilities: Built-in ensemble methods
Zero-shot capability: Leverages pretrained foundation models out‑of‑the‑box
Dependency management: Handles complex model requirements automatically
GPU efficiency: Optimized memory sharing and multi‑model execution

Requirements and Installation¶

Install TimeCopilot library:

In [ ]:

Copied!

%pip install "timecopilot>=0.0.22"
%pip install "timecopilot>=0.0.22"

Dataset Setup¶

TimeCopilot includes built-in GIFT-Eval integration for dataset handling:

In [ ]:

Copied!





from timecopilot.gift_eval.eval import GIFTEval

# TimeCopilot's built-in GIFT-Eval dataset downloader
# Handles the complete benchmark dataset with all 97 configurations
storage_path = "./data/gift-eval"
GIFTEval.download_data(storage_path=storage_path)
from timecopilot.gift_eval.eval import GIFTEval

# TimeCopilot's built-in GIFT-Eval dataset downloader
# Handles the complete benchmark dataset with all 97 configurations
storage_path = "./data/gift-eval"
GIFTEval.download_data(storage_path=storage_path)

Model Implementation¶

Using TimeCopilot's model hub and ensemble capabilities to create a foundation model ensemble:

In [ ]:

Copied!





from timecopilot.models.ensembles.median import MedianEnsemble
from timecopilot.models.foundation.chronos import Chronos
from timecopilot.models.foundation.timesfm import TimesFM
from timecopilot.models.foundation.tirex import TiRex
from timecopilot.models.utils.forecaster import Forecaster

batch_size = 64

# TimeCopilot's MedianEnsemble with isotonic regression for robust forecasting
# Automatically handles dependency conflicts and GPU memory management
ensemble = MedianEnsemble(
    models=[
        # Each model uses TimeCopilot's unified interface despite different architectures
        Chronos(
            repo_id="amazon/chronos-2",
            batch_size=batch_size,
        ),
        TimesFM(
            repo_id="google/timesfm-2.5-200m-pytorch",
            batch_size=batch_size,
        ),
        TiRex(
            batch_size=batch_size,
        ),
    ],
    alias="TimeCopilot",
)
from timecopilot.models.ensembles.median import MedianEnsemble
from timecopilot.models.foundation.chronos import Chronos
from timecopilot.models.foundation.timesfm import TimesFM
from timecopilot.models.foundation.tirex import TiRex
from timecopilot.models.utils.forecaster import Forecaster

batch_size = 64

# TimeCopilot's MedianEnsemble with isotonic regression for robust forecasting
# Automatically handles dependency conflicts and GPU memory management
ensemble = MedianEnsemble(
    models=[
        # Each model uses TimeCopilot's unified interface despite different architectures
        Chronos(
            repo_id="amazon/chronos-2",
            batch_size=batch_size,
        ),
        TimesFM(
            repo_id="google/timesfm-2.5-200m-pytorch",
            batch_size=batch_size,
        ),
        TiRex(
            batch_size=batch_size,
        ),
    ],
    alias="TimeCopilot",
)

Evaluation¶

Defining the evaluator¶

With TimeCopilot you can evaluate any Forecaster in a standardized way using its GIFT-Eval integration.

In [4]:

Copied!





import pandas as pd
from timecopilot.gift_eval.eval import GIFTEval
from timecopilot.gift_eval.gluonts_predictor import GluonTSPredictor


def evaluate_forecaster(
        forecaster: Forecaster,
        dataset_name: str,
        term: str,
        output_path: str,
        storage_path: str,
    ):
    """Evaluate a forecaster on a GIFT-Eval dataset defined by dataset name and term."""

    # TimeCopilot's GIFT-Eval loader handles dataset preprocessing automatically
    gifteval = GIFTEval(
        dataset_name=dataset_name,
        term=term,
        output_path=output_path,
        storage_path=storage_path,
    )

    # GluonTS wrapper for GIFT-Eval compatibility
    # It can receive any Forecaster from TimeCopilot
    predictor = GluonTSPredictor(
        forecaster=forecaster,
        max_length=4_096,
        batch_size=1_024,
    )

    # Run evaluation with GIFT-Eval's standardized metrics
    gifteval.evaluate_predictor(predictor, batch_size=512)
import pandas as pd
from timecopilot.gift_eval.eval import GIFTEval
from timecopilot.gift_eval.gluonts_predictor import GluonTSPredictor


def evaluate_forecaster(
        forecaster: Forecaster,
        dataset_name: str,
        term: str,
        output_path: str,
        storage_path: str,
    ):
    """Evaluate a forecaster on a GIFT-Eval dataset defined by dataset name and term."""

    # TimeCopilot's GIFT-Eval loader handles dataset preprocessing automatically
    gifteval = GIFTEval(
        dataset_name=dataset_name,
        term=term,
        output_path=output_path,
        storage_path=storage_path,
    )

    # GluonTS wrapper for GIFT-Eval compatibility
    # It can receive any Forecaster from TimeCopilot
    predictor = GluonTSPredictor(
        forecaster=forecaster,
        max_length=4_096,
        batch_size=1_024,
    )

    # Run evaluation with GIFT-Eval's standardized metrics
    gifteval.evaluate_predictor(predictor, batch_size=512)

Performing evaluation¶

In the GIFT-Eval benchmark, each dataset is defined by a combination of a dataset name and its term (short, medium or long).

In [ ]:

Copied!





import torch


if torch.cuda.is_available(): # remove if you want to run on CPU
    combinations = [
        ("m4_weekly", "short"),
        ("bizitobs_l2c/H", "short"),
        ("bizitobs_l2c/H", "medium"),
        ("bizitobs_l2c/H", "long"),
    ]

    for dataset_name, term in combinations:
        evaluate_forecaster(
            forecaster=ensemble,
            dataset_name=dataset_name,
            term=term,
            output_path=f"./results/timecopilot",
            storage_path=storage_path,
        )

    # Load consolidated results in GIFT-Eval format
    eval_df = pd.read_csv("./results/timecopilot/all_results.csv")
import torch


if torch.cuda.is_available(): # remove if you want to run on CPU
    combinations = [
        ("m4_weekly", "short"),
        ("bizitobs_l2c/H", "short"),
        ("bizitobs_l2c/H", "medium"),
        ("bizitobs_l2c/H", "long"),
    ]

    for dataset_name, term in combinations:
        evaluate_forecaster(
            forecaster=ensemble,
            dataset_name=dataset_name,
            term=term,
            output_path=f"./results/timecopilot",
            storage_path=storage_path,
        )

    # Load consolidated results in GIFT-Eval format
    eval_df = pd.read_csv("./results/timecopilot/all_results.csv")

In [13]:

Copied!

if torch.cuda.is_available():
  from IPython.display import display

  display(eval_df)
if torch.cuda.is_available():
  from IPython.display import display

  display(eval_df)

	dataset	model	eval_metrics/MSE[mean]	eval_metrics/MSE[0.5]	eval_metrics/MAE[0.5]	eval_metrics/MASE[0.5]	eval_metrics/MAPE[0.5]	eval_metrics/sMAPE[0.5]	eval_metrics/MSIS	eval_metrics/RMSE[mean]	eval_metrics/NRMSE[mean]	eval_metrics/ND[0.5]	eval_metrics/mean_weighted_sum_quantile_loss	domain	num_variates
0	m4_weekly/W/short	TimeCopilot	220437.833920	220437.833920	239.903438	1.916661	0.058617	0.058292	14.666592	469.508077	0.085537	0.043707	0.034997	Econ/Fin	1
1	bizitobs_l2c/H/short	TimeCopilot	54.628522	54.628522	4.459038	0.444425	0.385657	0.580057	2.696251	7.391111	0.398400	0.240354	0.186401	Web/CloudOps	7
2	bizitobs_l2c/H/medium	TimeCopilot	71.800877	71.800877	4.851640	0.488632	0.470714	0.757992	3.374162	8.473540	0.513086	0.293774	0.232035	Web/CloudOps	7
3	bizitobs_l2c/H/long	TimeCopilot	83.786483	83.786483	5.340595	0.566997	0.619369	0.782812	4.585122	9.153496	0.559122	0.326219	0.261100	Web/CloudOps	7

You can access the complete combination of datasets with the following:

In [14]:

Copied!

from timecopilot.gift_eval.utils import DATASETS_WITH_TERMS
from timecopilot.gift_eval.utils import DATASETS_WITH_TERMS

In [15]:

Copied!

DATASETS_WITH_TERMS[:3]
DATASETS_WITH_TERMS[:3]

Out[15]:

[('m4_yearly', 'short'), ('m4_quarterly', 'short'), ('m4_monthly', 'short')]

In [16]:

Copied!

len(DATASETS_WITH_TERMS)
len(DATASETS_WITH_TERMS)

Out[16]:

The code for the complete evaluation can be found in the library's repo.

Reproducibility statement¶

The TimeCopilot's GIFT-Eval integration was designed considering reproducibility as one of its main features. The library can replicate the official results provided by the mantainers of the benchmark for the SeasonalNaive method. The following code replicates the Seasonal Naive performance for the datasets evaluated in this notebook. The reproducibility of the results for the rest of the datasets are tested continuously in the library's repo.

In [ ]:

Copied!





from timecopilot.models.stats import SeasonalNaive

combinations = [
    ("m4_weekly", "short"),
    ("bizitobs_l2c/H", "short"),
    ("bizitobs_l2c/H", "medium"),
    ("bizitobs_l2c/H", "long"),
]

for dataset_name, term in combinations:
    evaluate_forecaster(
        forecaster=SeasonalNaive(alias="Seasonal_Naive"),
        dataset_name=dataset_name,
        term=term,
        output_path=f"./results/seasonal_naive",
        storage_path=storage_path,
    )
eval_df_sn = pd.read_csv("./results/seasonal_naive/all_results.csv")
from timecopilot.models.stats import SeasonalNaive

combinations = [
    ("m4_weekly", "short"),
    ("bizitobs_l2c/H", "short"),
    ("bizitobs_l2c/H", "medium"),
    ("bizitobs_l2c/H", "long"),
]

for dataset_name, term in combinations:
    evaluate_forecaster(
        forecaster=SeasonalNaive(alias="Seasonal_Naive"),
        dataset_name=dataset_name,
        term=term,
        output_path=f"./results/seasonal_naive",
        storage_path=storage_path,
    )
eval_df_sn = pd.read_csv("./results/seasonal_naive/all_results.csv")

In [18]:

Copied!

eval_df_sn
eval_df_sn

Out[18]:

	dataset	model	eval_metrics/MSE[mean]	eval_metrics/MSE[0.5]	eval_metrics/MAE[0.5]	eval_metrics/MASE[0.5]	eval_metrics/MAPE[0.5]	eval_metrics/sMAPE[0.5]	eval_metrics/MSIS	eval_metrics/RMSE[mean]	eval_metrics/NRMSE[mean]	eval_metrics/ND[0.5]	eval_metrics/mean_weighted_sum_quantile_loss	domain	num_variates
0	m4_weekly/W/short	Seasonal_Naive	453525.145918	453525.145918	347.991483	2.777295	0.089373	0.091613	26.631225	673.442756	0.122691	0.063399	0.060870	Econ/Fin	1
1	bizitobs_l2c/H/short	Seasonal_Naive	281.843068	281.843068	12.531653	1.214064	1.360590	1.138373	7.486931	16.788182	0.904926	0.675488	0.521168	Web/CloudOps	7
2	bizitobs_l2c/H/medium	Seasonal_Naive	456.373289	456.373289	15.667392	1.510286	1.691291	1.402410	18.533654	21.362895	1.293556	0.948684	0.904205	Web/CloudOps	7
3	bizitobs_l2c/H/long	Seasonal_Naive	309.272222	309.272222	13.635488	1.426054	2.438311	0.916854	22.036198	17.586137	1.074212	0.832895	0.941065	Web/CloudOps	7

In [19]:

Copied!

official_eval_sn = pd.read_csv(
    "https://huggingface.co/spaces/Salesforce/GIFT-Eval/raw/main/results/seasonal_naive/all_results.csv"
)
official_eval_sn = pd.read_csv(
    "https://huggingface.co/spaces/Salesforce/GIFT-Eval/raw/main/results/seasonal_naive/all_results.csv"
)

In [20]:

Copied!

official_eval_sn = official_eval_sn.set_index("dataset").loc[eval_df_sn["dataset"]].reset_index()
official_eval_sn = official_eval_sn.set_index("dataset").loc[eval_df_sn["dataset"]].reset_index()

In [21]:

Copied!

official_eval_sn
official_eval_sn

Out[21]:

	dataset	model	eval_metrics/MSE[mean]	eval_metrics/MSE[0.5]	eval_metrics/MAE[0.5]	eval_metrics/MASE[0.5]	eval_metrics/MAPE[0.5]	eval_metrics/sMAPE[0.5]	eval_metrics/MSIS	eval_metrics/RMSE[mean]	eval_metrics/NRMSE[mean]	eval_metrics/ND[0.5]	eval_metrics/mean_weighted_sum_quantile_loss	domain	num_variates
0	m4_weekly/W/short	Seasonal_Naive	453525.145918	453525.145918	347.991483	2.777295	0.089373	0.091613	26.631225	673.442756	0.122691	0.063399	0.060870	Econ/Fin	1
1	bizitobs_l2c/H/short	Seasonal_Naive	281.843068	281.843068	12.531653	1.214064	1.360590	1.138373	7.486931	16.788182	0.904926	0.675488	0.521168	Web/CloudOps	7
2	bizitobs_l2c/H/medium	Seasonal_Naive	456.373289	456.373289	15.667392	1.510286	1.691291	1.402410	18.533654	21.362895	1.293556	0.948684	0.904205	Web/CloudOps	7
3	bizitobs_l2c/H/long	Seasonal_Naive	309.272222	309.272222	13.635488	1.426054	2.438311	0.916854	22.036198	17.586137	1.074212	0.832895	0.941065	Web/CloudOps	7

In [22]:

Copied!

pd.testing.assert_frame_equal(official_eval_sn, eval_df_sn)
pd.testing.assert_frame_equal(official_eval_sn, eval_df_sn)

Changelog¶

2025-11-06¶

We introduced newer models based on the most recent progress in the field: Chronos-2, TimesFM-2.5 and TiRex.

2025-08-05¶

GIFT‑Eval recently enhanced its evaluation dashboard with a new flag that identifies models likely affected by data leakage (i.e., having seen parts of the test set during training). While the test set itself hasn’t changed, this new insight helps us better interpret model performance. To keep our results focused on truly unseen data, we’ve excluded any flagged models from this experiment and added the Sundial model to the ensemble. The previous experiment details remain available here.