Data Augmentation for Improved Forecasting Accuracy
Synthetic time series generation can be useful when an adequate sample size is not available.
This notebook shows how to do data augmentation and how to apply it in a forecasting workflow. It covers:
Loading the Monthly subset of the M3 dataset
Using synthetic time series to augment the training set
Fitting two versions of NHITS, one on each dataset (original and augmented)
Evaluating both models
[1]:
import warnings
warnings.filterwarnings("ignore")
If necessary, install the package using pip:
[2]:
# !pip install metaforecast -U
1. Data preparation
Let’s start by loading the dataset. This tutorial uses the Monthly subset of the M3 dataset, available on datasetsforecast.
We also set the forecasting horizon and the input size (number of lags) to 12, which corresponds to one year of monthly data.
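To make the roles of these two settings concrete, the sketch below slices one series into training pairs: each input holds `n_lags` consecutive observations and each target holds the `horizon` observations that follow. The helper `make_windows` is hypothetical and only illustrates the idea; NHITS builds its own windows internally.

```python
import numpy as np


def make_windows(y, n_lags, horizon):
    """Slice a 1-D series into (input, target) pairs (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    n_windows = len(y) - n_lags - horizon + 1
    if n_windows <= 0:
        raise ValueError("series is too short for the requested window sizes")
    # Each input row holds n_lags consecutive values ...
    inputs = np.stack([y[i:i + n_lags] for i in range(n_windows)])
    # ... and each target row holds the horizon values that come right after.
    targets = np.stack(
        [y[i + n_lags:i + n_lags + horizon] for i in range(n_windows)]
    )
    return inputs, targets


# Toy series: three years of monthly observations (values 0..35).
series = np.arange(36)
X, Y = make_windows(series, n_lags=12, horizon=12)
print(X.shape, Y.shape)
```

With 12 lags and a horizon of 12, a series needs at least 24 observations to yield a single complete window, which is one reason short series benefit from padding or from additional training data.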
[3]:
import pandas as pd
from datasetsforecast.m3 import M3
from metaforecast.utils.data import DataUtils
horizon = 12
n_lags = 12
df, *_ = M3.load('.',group='Monthly')
Split the dataset into training and testing sets:
[4]:
train, test = DataUtils.train_test_split(df, horizon)
train.query('unique_id=="M1000"').tail()
[4]:
| | unique_id | ds | y |
|---|---|---|---|
| 286 | M1000 | 1992-10-31 | 4563.4 |
| 287 | M1000 | 1992-11-30 | 4551.8 |
| 288 | M1000 | 1992-12-31 | 4577.4 |
| 289 | M1000 | 1993-01-31 | 4592.4 |
| 290 | M1000 | 1993-02-28 | 4632.2 |
[5]:
test.query('unique_id=="M1000"').head()
[5]:
| | unique_id | ds | y |
|---|---|---|---|
| 36 | M1000 | 1993-03-31 | 4625.6 |
| 37 | M1000 | 1993-04-30 | 4668.2 |
| 38 | M1000 | 1993-05-31 | 4598.0 |
| 39 | M1000 | 1993-06-30 | 4619.4 |
| 40 | M1000 | 1993-07-31 | 4640.4 |
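The outputs above suggest that the split holds out the final `horizon` observations of each series. A minimal pandas sketch of that idea follows; `split_last_h` is a hypothetical helper written for illustration, not the implementation behind `DataUtils.train_test_split`.

```python
import pandas as pd


def split_last_h(df, horizon, id_col="unique_id", time_col="ds"):
    """Hold out the last `horizon` rows of each series (illustrative helper)."""
    ordered = df.sort_values([id_col, time_col]).reset_index(drop=True)
    # tail() on the grouped frame keeps the final rows of every series.
    test_part = ordered.groupby(id_col).tail(horizon)
    train_part = ordered.drop(test_part.index)
    return train_part, test_part


# Toy long-format data with two series of different lengths.
toy = pd.DataFrame({
    "unique_id": ["A"] * 5 + ["B"] * 4,
    "ds": list(pd.date_range("2020-01-01", periods=5, freq="MS"))
          + list(pd.date_range("2020-01-01", periods=4, freq="MS")),
    "y": [float(v) for v in range(9)],
})
toy_train, toy_test = split_last_h(toy, horizon=2)
print(toy_test)
```

Splitting per series rather than at a single global date keeps the test window the same length for every series, even when the series end on different dates.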
2. Data Augmentation
We use the seasonal moving blocks bootstrap (seasonal MBB) to generate synthetic versions of the training series.
[6]:
%%capture
from metaforecast.synth import SeasonalMBB
tsgen = SeasonalMBB(seas_period=12)
synth_df = tsgen.transform(train)
[7]:
synth_df.head()
[7]:
| | unique_id | ds | y |
|---|---|---|---|
| 0 | M1_MBB0 | 1990-01-31 | 5162.219129 |
| 1 | M1_MBB0 | 1990-02-28 | 3069.687069 |
| 2 | M1_MBB0 | 1990-03-31 | 1162.915908 |
| 3 | M1_MBB0 | 1990-04-30 | 2808.912064 |
| 4 | M1_MBB0 | 1990-05-31 | 7988.380562 |
[8]:
train_aug = pd.concat([train, synth_df])
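For intuition about what a seasonal moving blocks bootstrap does, here is a simplified NumPy sketch. It assumes the general recipe of decomposing a series, resampling the remainder in contiguous blocks, and adding the components back. The linear trend and per-month means are crude stand-ins for a proper decomposition such as STL, and the functions `moving_block_bootstrap` and `seasonal_mbb_sketch` are hypothetical; this is not how `SeasonalMBB` is implemented.

```python
import numpy as np


def moving_block_bootstrap(x, block_size, rng):
    """Resample x by concatenating randomly chosen contiguous blocks."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_size))
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [x[s:s + block_size] for s in starts]
    return np.concatenate(blocks)[:n]


def seasonal_mbb_sketch(y, period, rng):
    """Create one synthetic series from y (simplified, illustrative only)."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    # Trend: least-squares line (assumption: stands in for an STL trend).
    slope, intercept = np.polyfit(t, y, 1)
    trend = intercept + slope * t
    # Seasonality: mean detrended value at each position of the cycle.
    detrended = y - trend
    cycle_means = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = cycle_means[t % period]
    # Remainder: resampled in blocks so short-range dependence is preserved.
    remainder = y - trend - seasonal
    boot = moving_block_bootstrap(remainder, block_size=2 * period, rng=rng)
    return trend + seasonal + boot


rng = np.random.default_rng(0)
t = np.arange(72)
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 72)
y_synth = seasonal_mbb_sketch(y, period=12, rng=rng)
print(y_synth[:5])
```

Because only the remainder is resampled, each synthetic series keeps the trend and seasonal shape of its source while differing in the irregular component.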
3. Model setup and fitting
We focus on NHITS with a default configuration.
We train two versions of NHITS: one on the original data (train) and another on the augmented dataset (train_aug).
[9]:
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS
CONFIG = {'max_steps': 1000, 'input_size': n_lags, 'h': horizon, 'enable_checkpointing': True, 'accelerator': 'cpu'}
models = [NHITS(start_padding_enabled=True, **CONFIG),]
nf_og = NeuralForecast(models=models, freq='ME')
nf_aug = NeuralForecast(models=models, freq='ME')
INFO:lightning_fabric.utilities.seed:Seed set to 1
[10]:
%%capture
nf_og.fit(df=train)
nf_aug.fit(df=train_aug)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
| Name | Type | Params | Mode
-------------------------------------------------------
0 | loss | MAE | 0 | train
1 | padder_train | ConstantPad1d | 0 | train
2 | scaler | TemporalNorm | 0 | train
3 | blocks | ModuleList | 2.4 M | train
-------------------------------------------------------
2.4 M Trainable params
0 Non-trainable params
2.4 M Total params
9.628 Total estimated model params size (MB)
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1000` reached.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
| Name | Type | Params | Mode
-------------------------------------------------------
0 | loss | MAE | 0 | train
1 | padder_train | ConstantPad1d | 0 | train
2 | scaler | TemporalNorm | 0 | train
3 | blocks | ModuleList | 2.4 M | train
-------------------------------------------------------
2.4 M Trainable params
0 Non-trainable params
2.4 M Total params
9.628 Total estimated model params size (MB)
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1000` reached.
[11]:
fcst_og = nf_og.predict()
fcst_aug = nf_aug.predict(df=train)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 365.14it/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 340.97it/s]
[12]:
fcst = fcst_aug.rename(columns={'NHITS':'NHITS(MBB)'}).merge(fcst_og, on=['unique_id','ds'])
fcst.head()
[12]:
| unique_id | ds | NHITS(MBB) | NHITS |
|---|---|---|---|
| M1 | 1994-09-30 | 3200.186279 | 2964.885498 |
| M1 | 1994-10-31 | 2518.282959 | 2712.033936 |
| M1 | 1994-11-30 | 2716.197998 | 2643.748291 |
| M1 | 1994-12-31 | 3212.451416 | 2930.787842 |
| M1 | 1995-01-31 | 2180.542480 | 2436.013672 |
4. Evaluation
Finally, we compare both approaches on the test set using sMAPE, computed separately for each series.
[13]:
test = test.merge(fcst, on=['unique_id','ds'], how="left")
test.head()
[13]:
| | unique_id | ds | y | NHITS(MBB) | NHITS |
|---|---|---|---|---|---|
| 0 | M1 | 1994-09-30 | 1560.0 | 3200.186279 | 2964.885498 |
| 1 | M1 | 1994-10-31 | 1440.0 | 2518.282959 | 2712.033936 |
| 2 | M1 | 1994-11-30 | 240.0 | 2716.197998 | 2643.748291 |
| 3 | M1 | 1994-12-31 | 1800.0 | 3212.451416 | 2930.787842 |
| 4 | M1 | 1995-01-31 | 4680.0 | 2180.542480 | 2436.013672 |
[14]:
from neuralforecast.losses.numpy import smape
from datasetsforecast.evaluation import accuracy
evaluation_df = accuracy(test, [smape], agg_by=['unique_id'])
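As a reminder of what the metric measures, the sketch below computes sMAPE as the mean of |y - ŷ| / (|y| + |ŷ|). This scaling is an assumption: some definitions multiply the result by 2 or by 100, and the library function may differ by such a constant depending on its version. The function `smape_sketch` is hypothetical, and the inputs are the five M1 rows displayed above, so the result is not expected to match the per-series score computed over the full horizon.

```python
import numpy as np


def smape_sketch(y, y_hat):
    """Mean of |y - y_hat| / (|y| + |y_hat|), with 0/0 treated as 0."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    num = np.abs(y - y_hat)
    den = np.abs(y) + np.abs(y_hat)
    # Avoid division by zero when both the actual and the forecast are 0.
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(ratio.mean())


# First five test rows of series M1, copied from the table above.
y_true = [1560.0, 1440.0, 240.0, 1800.0, 4680.0]
y_mbb = [3200.186279, 2518.282959, 2716.197998, 3212.451416, 2180.542480]
y_og = [2964.885498, 2712.033936, 2643.748291, 2930.787842, 2436.013672]

score_mbb = smape_sketch(y_true, y_mbb)
score_og = smape_sketch(y_true, y_og)
print(score_mbb, score_og)
```

Under this scaling the score is bounded between 0 and 1, and it is scale-free, which makes averaging across series with very different magnitudes reasonable.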
[15]:
eval_df = evaluation_df.drop(columns=['metric','unique_id'])
eval_df
[15]:
| | NHITS(MBB) | NHITS |
|---|---|---|
| 0 | 0.618445 | 0.608299 |
| 1 | 0.176442 | 0.202950 |
| 2 | 0.082508 | 0.093589 |
| 3 | 0.004968 | 0.013117 |
| 4 | 0.018100 | 0.020635 |
| ... | ... | ... |
| 1423 | 0.008247 | 0.006558 |
| 1424 | 0.018236 | 0.024858 |
| 1425 | 0.072545 | 0.079717 |
| 1426 | 0.012381 | 0.008905 |
| 1427 | 0.010193 | 0.012599 |
1428 rows × 2 columns
[18]:
eval_df.mean().sort_values()
[18]:
NHITS 0.127942
NHITS(MBB) 0.128179
dtype: float64
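The two averages are very close, so a per-series view can complement the mean. The sketch below shows one way to compute the share of series where the augmented model scores lower and the median score difference. It uses only the ten rows displayed in the table above as toy input, which is an assumption made for illustration; those rows are not representative of all 1,428 series, and the same lines could be applied to the full `eval_df`.

```python
import pandas as pd

# The ten per-series scores shown in the evaluation table (toy subset).
rows = pd.DataFrame({
    "NHITS(MBB)": [0.618445, 0.176442, 0.082508, 0.004968, 0.018100,
                   0.008247, 0.018236, 0.072545, 0.012381, 0.010193],
    "NHITS": [0.608299, 0.202950, 0.093589, 0.013117, 0.020635,
              0.006558, 0.024858, 0.079717, 0.008905, 0.012599],
})

# Negative differences mean the augmented model has the lower (better) sMAPE.
diff = rows["NHITS(MBB)"] - rows["NHITS"]
win_rate = (diff < 0).mean()
median_diff = diff.median()
print(win_rate, median_diff)
```

A mean can be dominated by a few series with large errors, so the win rate and the median difference give a more robust picture of whether augmentation helps on a typical series.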