Online Data Augmentation using Callbacks

Synthetic time series generation can be useful when an adequate sample size is not available.

This notebook shows how to apply data augmentation during training in a forecasting workflow. It covers the following steps:

  1. Loading M3’s Monthly dataset

  2. Setting up a callback that uses moving blocks bootstrapping to augment each batch of time series

  3. Fitting two versions of NHITS, one with the callback and one without

  4. Evaluating both models

[1]:
import warnings

warnings.filterwarnings("ignore")

If necessary, install the package using pip:

[2]:
# !pip install metaforecast -U

1. Data preparation

Let’s start by loading the dataset. This tutorial uses the M3 Monthly dataset, available through datasetsforecast.

We also set the forecasting horizon and the input size (number of lags) to 24, which corresponds to two years of monthly data.

[3]:
import pandas as pd

from datasetsforecast.m3 import M3
from metaforecast.utils.data import DataUtils

horizon = 24
n_lags = 24

df, *_ = M3.load('.',group='Monthly')

Split the dataset into training and testing sets:

[4]:
train, test = DataUtils.train_test_split(df, horizon)

train.query('unique_id=="M1000"').tail()
[4]:
unique_id ds y
238 M1000 1991-10-31 4454.6
239 M1000 1991-11-30 4397.8
240 M1000 1991-12-31 4377.2
241 M1000 1992-01-31 4420.6
242 M1000 1992-02-29 4446.6
[5]:
test.query('unique_id=="M1000"').head()
[5]:
unique_id ds y
72 M1000 1992-03-31 4451.8
73 M1000 1992-04-30 4496.0
74 M1000 1992-05-31 4494.8
75 M1000 1992-06-30 4505.8
76 M1000 1992-07-31 4501.2
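
The outputs above suggest that the split holds out the last `horizon` observations of each series for testing. As a rough sketch of that idea in plain pandas (the function `split_last_h` and the toy data are hypothetical, written for illustration; this is not the code behind `DataUtils.train_test_split`):

```python
import pandas as pd


def split_last_h(df: pd.DataFrame, horizon: int):
    """Hold out the last `horizon` rows of each series as the test set."""
    df = df.sort_values(["unique_id", "ds"])
    # Position of each row counted from the end of its own series (0 = last row)
    pos_from_end = df.groupby("unique_id").cumcount(ascending=False)
    is_test = pos_from_end < horizon
    return df[~is_test], df[is_test]


# Toy long-format data with two series of different lengths
toy = pd.DataFrame({
    "unique_id": ["A"] * 5 + ["B"] * 4,
    "ds": list(pd.date_range("2020-01-01", periods=5, freq="MS"))
    + list(pd.date_range("2020-01-01", periods=4, freq="MS")),
    "y": range(9),
})

toy_train, toy_test = split_last_h(toy, horizon=2)
print(toy_test)
```

Splitting by position from the end of each series, rather than by a fixed date, keeps the test window the same length for every series even when the series end at different dates.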

2. Data Augmentation

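We use seasonal moving blocks bootstrapping (seasonal MBB) to generate synthetic variations of the time series in each batch.

First, set up the callback, using a seasonal period of 12 for the monthly data: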

[6]:
from metaforecast.synth.callbacks import OnlineDataAugmentationCallback
from metaforecast.synth import SeasonalMBB

tsgen = SeasonalMBB(seas_period=12)

augmentation_cb = OnlineDataAugmentationCallback(generator=tsgen)
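
To give some intuition for the resampling step, here is a minimal NumPy sketch of a moving blocks bootstrap. Seasonal variants are commonly described as decomposing the series first, resampling only the remainder, and adding the trend and seasonal parts back. The sketch below covers only the block resampling, and the function name and parameters are assumptions made for illustration; it is not the implementation used by `SeasonalMBB`.

```python
import numpy as np


def moving_blocks_bootstrap(x, block_size, rng):
    """Resample a 1-D series by concatenating randomly chosen overlapping blocks."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if block_size > n:
        raise ValueError("block_size must not exceed the series length")
    n_blocks = int(np.ceil(n / block_size))
    # Each block starts at a random position and spans `block_size` consecutive points
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [x[s:s + block_size] for s in starts]
    # Concatenate the blocks and trim to the original length
    return np.concatenate(blocks)[:n]


rng = np.random.default_rng(0)
remainder = rng.normal(size=48)  # stand-in for the remainder of a decomposed series
synthetic = moving_blocks_bootstrap(remainder, block_size=12, rng=rng)
print(synthetic.shape)
```

Sampling whole blocks instead of individual points preserves the short-range dependence inside each block, which is why the method suits autocorrelated data.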

3. Model setup and fitting

We focus on NHITS with a default configuration.

We train two versions of NHITS: one on the original data (train), and another whose training batches are augmented online by the callback.

[7]:
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

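# accelerator='mps' targets Apple Silicon GPUs; use 'cpu' or 'gpu' on other hardware
models = [NHITS(input_size=n_lags,
                h=horizon,
                start_padding_enabled=True,
                accelerator='mps'),
          # Same configuration, with online augmentation of each batch
          NHITS(input_size=n_lags,
                h=horizon,
                start_padding_enabled=True,
                accelerator='mps',
                callbacks=[augmentation_cb])]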

nf = NeuralForecast(models=models, freq='ME')
[8]:
%%capture

nf.fit(df=train)
INFO: GPU available: True (mps), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (mps), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name         | Type          | Params | Mode
-------------------------------------------------------
0 | loss         | MAE           | 0      | train
1 | padder_train | ConstantPad1d | 0      | train
2 | scaler       | TemporalNorm  | 0      | train
3 | blocks       | ModuleList    | 2.4 M  | train
-------------------------------------------------------
2.4 M     Trainable params
0         Non-trainable params
2.4 M     Total params
9.794     Total estimated model params size (MB)
INFO: `Trainer.fit` stopped: `max_steps=1000` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1000` reached.
[9]:
fcst = nf.predict()
[10]:
fcst = fcst.rename(columns={'NHITS1':'NHITS(MBB)'})

fcst.head()
[10]:
ds NHITS NHITS(MBB)
unique_id
M1 1993-09-30 2349.900635 2416.396240
M1 1993-10-31 2323.687988 1869.681885
M1 1993-11-30 2723.932617 2878.699951
M1 1993-12-31 2504.443115 2124.569092
M1 1994-01-31 2363.329834 2149.853516

4. Evaluation

Finally, we compare both approaches using sMAPE, computed separately for each series and then averaged.

[11]:
test = test.merge(fcst, on=['unique_id','ds'], how="left")

test.head()
[11]:
unique_id ds y NHITS NHITS(MBB)
0 M1 1993-09-30 4800.0 2349.900635 2416.396240
1 M1 1993-10-31 3000.0 2323.687988 1869.681885
2 M1 1993-11-30 3120.0 2723.932617 2878.699951
3 M1 1993-12-31 5880.0 2504.443115 2124.569092
4 M1 1994-01-31 2640.0 2363.329834 2149.853516
[12]:
from neuralforecast.losses.numpy import smape
from datasetsforecast.evaluation import accuracy

evaluation_df = accuracy(test, [smape], agg_by=['unique_id'])
[13]:
eval_df = evaluation_df.drop(columns=['metric','unique_id'])

eval_df
[13]:
NHITS NHITS(MBB)
0 0.481972 0.476603
1 0.244885 0.248488
2 0.084516 0.076151
3 0.017971 0.012527
4 0.039103 0.038067
... ... ...
1423 0.014715 0.019102
1424 0.019006 0.019156
1425 0.059260 0.062880
1426 0.055512 0.056238
1427 0.014873 0.013690

1428 rows × 2 columns

[14]:
eval_df.mean().sort_values()
[14]:
NHITS(MBB)    0.145017
NHITS         0.146005
dtype: float64
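
For reference, the scores above are symmetric mean absolute percentage errors. The sketch below shows one common definition of sMAPE in NumPy. The function name `smape_sketch` is hypothetical, and the scaling (a factor of 2, giving values between 0 and 2) is an assumption; the library's `smape` may scale the result or handle zero denominators differently.

```python
import numpy as np


def smape_sketch(y, y_hat):
    """Symmetric MAPE: 2 * mean(|y - y_hat| / (|y| + |y_hat|))."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    denom = np.abs(y) + np.abs(y_hat)
    # Terms where both the actual and the forecast are zero count as zero error
    ratio = np.divide(np.abs(y - y_hat), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 2 * ratio.mean()


print(smape_sketch([1.0], [3.0]))
```

Because the two averages differ only slightly, it can also help to look at the share of series where each model has the lower error, for example with `(eval_df['NHITS(MBB)'] < eval_df['NHITS']).mean()`.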