Data Augmentation for Improved Forecasting Accuracy

Generating synthetic time series can be useful when an adequate sample size is not available.

This notebook shows how to augment a training set with synthetic time series and how to use the augmented data for forecasting. It covers the following steps:

Loading the M3 Monthly dataset
Generating synthetic time series to augment the training set
Fitting two versions of NHITS, one on each dataset (original and augmented)
Evaluating both models

[1]:

import warnings

warnings.filterwarnings("ignore")

If necessary, install the package using pip:

[2]:

# !pip install metaforecast -U

1. Data preparation

Let’s start by loading the dataset. This tutorial uses the M3 Monthly dataset available on datasetsforecast.

We also set the forecasting horizon and input size (number of lags) to 12, which corresponds to one year of monthly data.

[3]:

import pandas as pd

from datasetsforecast.m3 import M3
from metaforecast.utils.data import DataUtils

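horizon = 12  # forecast 12 months ahead
n_lags = 12   # use the previous 12 months as input

# Download (if needed) and load the monthly series into the current directory
df, *_ = M3.load('.', group='Monthly')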
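To make the roles of n_lags and horizon concrete, the sketch below slices a toy series into input windows of n_lags values, each followed by a target of horizon values. It is only an illustration in NumPy: the helper make_windows is hypothetical, and NHITS builds its training windows internally with details (such as padding and sampling) that may differ.

```python
import numpy as np


def make_windows(y, n_lags, horizon):
    """Hypothetical helper: split a series into (lags, targets) pairs."""
    y = np.asarray(y, dtype=float)
    # Each row holds n_lags inputs followed by horizon targets.
    windows = np.lib.stride_tricks.sliding_window_view(y, n_lags + horizon)
    return windows[:, :n_lags], windows[:, n_lags:]


toy_series = np.arange(30)  # toy monthly series with 30 observations
X, Y = make_windows(toy_series, n_lags=12, horizon=12)
print(X.shape, Y.shape)
```

A series shorter than n_lags + horizon yields no complete window in this sketch, which is presumably one reason the model below is created with start_padding_enabled=True.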

Split the dataset into training and testing sets:

[4]:

train, test = DataUtils.train_test_split(df, horizon)

train.query('unique_id=="M1000"').tail()

[4]:

	unique_id	ds	y
286	M1000	1992-10-31	4563.4
287	M1000	1992-11-30	4551.8
288	M1000	1992-12-31	4577.4
289	M1000	1993-01-31	4592.4
290	M1000	1993-02-28	4632.2

[5]:

test.query('unique_id=="M1000"').head()

[5]:

	unique_id	ds	y
36	M1000	1993-03-31	4625.6
37	M1000	1993-04-30	4668.2
38	M1000	1993-05-31	4598.0
39	M1000	1993-06-30	4619.4
40	M1000	1993-07-31	4640.4
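The outputs above suggest that the split holds out the last horizon observations of each series for testing. A minimal pandas sketch of that idea follows; split_last_h is a hypothetical helper written for illustration, not the implementation of DataUtils.train_test_split, and it assumes rows are sorted by ds within each unique_id.

```python
import pandas as pd


def split_last_h(df, horizon):
    """Hypothetical helper: hold out the last `horizon` rows of each series."""
    # Position of each row counted from the end of its series (0 = latest).
    # Assumes rows are sorted by `ds` within each `unique_id`.
    rank_from_end = df.groupby("unique_id").cumcount(ascending=False)
    is_test = rank_from_end < horizon
    return df[~is_test], df[is_test]


toy = pd.DataFrame({
    "unique_id": ["A"] * 5 + ["B"] * 4,
    "ds": list(pd.date_range("2020-01-01", periods=5, freq="MS"))
          + list(pd.date_range("2020-01-01", periods=4, freq="MS")),
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0],
})

toy_train, toy_test = split_last_h(toy, horizon=2)
print(toy_test)
```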

2. Data augmentation

We use a seasonal moving blocks bootstrap (MBB) to generate synthetic time series from the training set.

[6]:

%%capture

from metaforecast.synth import SeasonalMBB

tsgen = SeasonalMBB(seas_period=12)
synth_df = tsgen.transform(train)

[7]:

synth_df.head()

[7]:

	unique_id	ds	y
0	M1_MBB0	1990-01-31	5162.219129
1	M1_MBB0	1990-02-28	3069.687069
2	M1_MBB0	1990-03-31	1162.915908
3	M1_MBB0	1990-04-30	2808.912064
4	M1_MBB0	1990-05-31	7988.380562
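For intuition, a moving blocks bootstrap resamples contiguous blocks of a series' remainder and adds them back to its structured components. The NumPy sketch below is a deliberately simplified stand-in: it estimates only a seasonal profile (no trend, no STL decomposition), both function names are hypothetical, and it should not be read as how SeasonalMBB is implemented.

```python
import numpy as np


def moving_block_bootstrap(x, block_size, rng):
    """Resample x by concatenating randomly chosen contiguous blocks."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_size))
    # Random block start positions; every block fits inside the series.
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [x[s:s + block_size] for s in starts]
    return np.concatenate(blocks)[:n]


def seasonal_mbb_sketch(y, period, rng):
    """Simplified sketch: keep the seasonal profile, bootstrap the remainder."""
    y = np.asarray(y, dtype=float)
    season_idx = np.arange(len(y)) % period
    # Average value at each position of the seasonal cycle.
    profile = np.array([y[season_idx == k].mean() for k in range(period)])
    seasonal = profile[season_idx]
    remainder = y - seasonal
    # Block size of two seasonal periods is an assumption of this sketch.
    return seasonal + moving_block_bootstrap(remainder, 2 * period, rng)


rng = np.random.default_rng(0)
t = np.arange(72)  # six years of toy monthly data
y = 100 + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, size=72)
y_synth = seasonal_mbb_sketch(y, period=12, rng=rng)
print(y_synth[:5])
```

Resampling blocks rather than single points preserves short-range dependence in the remainder, which is the main reason to prefer this over an ordinary bootstrap for time series.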

[8]:

train_aug = pd.concat([train, synth_df])
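Concatenation only works as intended if synthetic identifiers do not collide with the original ones. The output above shows synthetic ids of the form M1_MBB0, so a collision is unlikely, but the check is cheap. The sketch below runs it on toy frames; all values are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(["1990-01-31", "1990-02-28"])

# Toy stand-ins for `train` and `synth_df` (values are illustrative only).
toy_train = pd.DataFrame({
    "unique_id": ["M1", "M1", "M2", "M2"],
    "ds": list(dates) * 2,
    "y": [10.0, 12.0, 50.0, 55.0],
})
toy_synth = pd.DataFrame({
    "unique_id": ["M1_MBB0", "M1_MBB0", "M2_MBB0", "M2_MBB0"],
    "ds": list(dates) * 2,
    "y": [11.0, 11.5, 52.0, 54.0],
})

toy_aug = pd.concat([toy_train, toy_synth], ignore_index=True)

# Original and synthetic ids should be disjoint.
id_overlap = set(toy_train["unique_id"]) & set(toy_synth["unique_id"])
n_series = toy_aug["unique_id"].nunique()
print(f"overlapping ids: {len(id_overlap)}, series after augmentation: {n_series}")
```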

3. Model setup and fitting

We focus on NHITS with a default configuration.

We train two versions of NHITS: one on the original data (train) and another on the augmented data (train_aug).

[9]:

from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

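CONFIG = {
    'max_steps': 1000,
    'input_size': n_lags,
    'h': horizon,
    'enable_checkpointing': True,
    'accelerator': 'cpu',
}

# Use a separate NHITS instance for each NeuralForecast object so that
# the two fits cannot share any model state
nf_og = NeuralForecast(models=[NHITS(start_padding_enabled=True, **CONFIG)], freq='ME')
nf_aug = NeuralForecast(models=[NHITS(start_padding_enabled=True, **CONFIG)], freq='ME')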

INFO:lightning_fabric.utilities.seed:Seed set to 1

[10]:

%%capture

nf_og.fit(df=train)
nf_aug.fit(df=train_aug)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name         | Type          | Params | Mode
-------------------------------------------------------
0 | loss         | MAE           | 0      | train
1 | padder_train | ConstantPad1d | 0      | train
2 | scaler       | TemporalNorm  | 0      | train
3 | blocks       | ModuleList    | 2.4 M  | train
-------------------------------------------------------
2.4 M     Trainable params
0         Non-trainable params
2.4 M     Total params
9.628     Total estimated model params size (MB)
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1000` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name         | Type          | Params | Mode
-------------------------------------------------------
0 | loss         | MAE           | 0      | train
1 | padder_train | ConstantPad1d | 0      | train
2 | scaler       | TemporalNorm  | 0      | train
3 | blocks       | ModuleList    | 2.4 M  | train
-------------------------------------------------------
2.4 M     Trainable params
0         Non-trainable params
2.4 M     Total params
9.628     Total estimated model params size (MB)
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1000` reached.

[11]:

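# Forecast the series the model was trained on
fcst_og = nf_og.predict()
# Pass df=train so the augmented model forecasts only the original series,
# not the synthetic ones it was also trained on
fcst_aug = nf_aug.predict(df=train)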

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs

Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 365.14it/s]

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 340.97it/s]

[12]:

fcst = fcst_aug.rename(columns={'NHITS':'NHITS(MBB)'}).merge(fcst_og, on=['unique_id','ds'])

fcst.head()

[12]:

	ds	NHITS(MBB)	NHITS
unique_id
M1	1994-09-30	3200.186279	2964.885498
M1	1994-10-31	2518.282959	2712.033936
M1	1994-11-30	2716.197998	2643.748291
M1	1994-12-31	3212.451416	2930.787842
M1	1995-01-31	2180.542480	2436.013672

4. Evaluation

Finally, we compare both approaches on the test set using SMAPE.

[13]:

test = test.merge(fcst, on=['unique_id','ds'], how="left")

test.head()

[13]:

	unique_id	ds	y	NHITS(MBB)	NHITS
0	M1	1994-09-30	1560.0	3200.186279	2964.885498
1	M1	1994-10-31	1440.0	2518.282959	2712.033936
2	M1	1994-11-30	240.0	2716.197998	2643.748291
3	M1	1994-12-31	1800.0	3212.451416	2930.787842
4	M1	1995-01-31	4680.0	2180.542480	2436.013672
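Because the merge uses how="left", every test row is kept, and any row without a matching forecast gets NaN in the model columns. A quick count of missing values can catch mismatched ids or dates before computing metrics. The sketch below uses toy frames with made-up values.

```python
import pandas as pd

# Toy test set with three rows; the forecasts cover only series "A".
toy_test = pd.DataFrame({
    "unique_id": ["A", "A", "B"],
    "ds": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-01-31"]),
    "y": [1.0, 2.0, 3.0],
})
toy_fcst = pd.DataFrame({
    "unique_id": ["A", "A"],
    "ds": pd.to_datetime(["2020-01-31", "2020-02-29"]),
    "NHITS": [1.1, 1.9],
})

merged = toy_test.merge(toy_fcst, on=["unique_id", "ds"], how="left")

# Rows of the test set that did not receive a forecast.
n_missing = int(merged["NHITS"].isna().sum())
print(f"test rows without a forecast: {n_missing}")
```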

[14]:

from neuralforecast.losses.numpy import smape
from datasetsforecast.evaluation import accuracy

evaluation_df = accuracy(test, [smape], agg_by=['unique_id'])
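For reference, one common definition of SMAPE is twice the mean of |y - ŷ| / (|y| + |ŷ|), which ranges from 0 to 2. The NumPy sketch below implements that definition; smape_sketch is a hypothetical name, and the scaling convention of the imported smape function is assumed rather than verified here.

```python
import numpy as np


def smape_sketch(y, y_hat):
    """SMAPE under one common convention: 2 * mean(|y - y_hat| / (|y| + |y_hat|))."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    abs_err = np.abs(y - y_hat)
    scale = np.abs(y) + np.abs(y_hat)
    # Treat the ratio as 0 where both the actual and the forecast are 0.
    ratio = np.divide(abs_err, scale, out=np.zeros_like(abs_err), where=scale > 0)
    return 2.0 * ratio.mean()


score = smape_sketch([100.0, 200.0], [110.0, 180.0])
print(round(score, 4))
```

Because the error is divided by the combined magnitude of actual and forecast values, scores are comparable across series with very different scales, which matters for a heterogeneous collection such as M3.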

[15]:

eval_df = evaluation_df.drop(columns=['metric','unique_id'])

eval_df

[15]:

	NHITS(MBB)	NHITS
0	0.618445	0.608299
1	0.176442	0.202950
2	0.082508	0.093589
3	0.004968	0.013117
4	0.018100	0.020635
...	...	...
1423	0.008247	0.006558
1424	0.018236	0.024858
1425	0.072545	0.079717
1426	0.012381	0.008905
1427	0.010193	0.012599

1428 rows × 2 columns

[18]:

eval_df.mean().sort_values()

[18]:

NHITS         0.127942
NHITS(MBB)    0.128179
dtype: float64
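The two averages are nearly identical in this run (0.1279 for NHITS versus 0.1282 for NHITS(MBB)), so a per-series comparison can add detail that the mean hides. The sketch below counts how often the augmented model has the lower SMAPE, using only the ten rows displayed in the table above. It is therefore not representative of all 1428 series; applying the same lines to eval_df would give the full picture.

```python
import pandas as pd

# The ten rows visible in the evaluation table above (not the full results).
visible = pd.DataFrame({
    "NHITS(MBB)": [0.618445, 0.176442, 0.082508, 0.004968, 0.018100,
                   0.008247, 0.018236, 0.072545, 0.012381, 0.010193],
    "NHITS": [0.608299, 0.202950, 0.093589, 0.013117, 0.020635,
              0.006558, 0.024858, 0.079717, 0.008905, 0.012599],
})

# Negative difference means the augmented model was more accurate.
diff = visible["NHITS(MBB)"] - visible["NHITS"]
n_wins = int((diff < 0).sum())
win_rate = n_wins / len(visible)
print(f"augmented model better on {n_wins} of {len(visible)} displayed series")
```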