The Time Series Transformer

Written by Theodoros Ntakouris

Table of Contents:

  • Introduction

  • Preprocessing

  • Learnable Time Representation (Time 2 Vec)

  • Architecture

  • Bag Of Tricks (things to consider when training Transformers)


Attention Is All You Need they said. Is it a more robust convolution? Is it just a hack to squeeze more learning capacity out of fewer parameters? Is it supposed to be sparse? How did the original authors come up with this architecture?

The Transformer Architecture

  • It’s better than RNNs because it is not recurrent: it can attend to features from any previous time step directly, without the loss of detail that comes from compressing history into a hidden state

  • It’s a top-performing architecture on a plethora of tasks, including but not limited to NLP, vision, and regression (it scales)

It is pretty easy to switch from an existing RNN model to the Attention architecture. Inputs are of the same shape!


Preprocessing

Using Transformers for time series tasks is different from using them for NLP or computer vision. We neither tokenize the data nor cut it into 16x16 image chunks. Instead, we follow a more classic/old-school way of preparing data for training.

One thing that definitely holds is that all input features should be fed to the model in the same value range, to eliminate scale bias. This is typically the [0, 1] or [-1, 1] range. In general, it is recommended to apply the same kind of preprocessing pipeline to all of your input features for this reason. Individual use cases may be exempt; different models and data are unique! Think about the origin of your data for a moment.

Popular time series preprocessing techniques include:

  • Just scaling to [0, 1] or [-1, 1]

  • Standard Scaling (removing mean, dividing by standard deviation)

  • Power Transforming (using a power function to push the data to a more normal distribution, typically used on skewed data / where outliers are present)

  • Outlier Removal

  • Pairwise Diffing or Calculating Percentage Differences

  • Seasonal Decomposition (trying to make the time series stationary)

  • Engineering More Features (automated feature extractors, bucketing to percentiles, etc)

  • Resampling in the time dimension

  • Resampling in a feature dimension (instead of using the time interval, use a predicate on a feature to re-arrange your time steps — for example when recorded quantity exceeds N units)

  • Rolling Values

  • Aggregations

  • Combinations of these techniques

Again, preprocessing decisions are tightly coupled to the problem and data at hand, but this is a nice list to get you started.
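As a concrete sketch of the first few techniques, here is how scaling, standard scaling, and pairwise diffing look in plain NumPy (the series values below are made up purely for illustration):

```python
import numpy as np

# Toy single-feature series; the values are arbitrary.
x = np.array([12.0, 15.0, 11.0, 20.0, 14.0, 18.0])

# Min-max scaling to [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: remove the mean, divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

# Pairwise diffing and percentage differences (note: one element shorter).
x_diff = np.diff(x)
x_pct = np.diff(x) / x[:-1]
```

Whichever transform you pick, fit its statistics (min/max, mean/std) on the training split only and reuse them at inference time.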

If your time series can become stationary through preprocessing such as seasonal decomposition, you could get good-quality predictions from smaller models (which also train much faster and require less code and effort), such as NeuralProphet or TensorFlow Probability.

Deep neural networks can learn linear and periodic components on their own during training (we will use Time 2 Vec later). For this reason, I would advise against seasonal decomposition as a preprocessing step.

Other decisions, such as whether to calculate aggregates or pairwise differences, depend on the nature of your data and on what you want to predict.

Treating sequence length as a hyperparameter leads us to an input tensor shape similar to RNNs: (batch size, sequence length, features).

Here is a drawing for all the dimensions set to 3.

Input Shapes
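To make that shape concrete, here is a hypothetical sliding-window helper (the name make_windows is mine, not from any library) that turns a (time, features) array into (batch size, sequence length, features):

```python
import numpy as np

def make_windows(series: np.ndarray, seq_len: int) -> np.ndarray:
    """Slice a (time, features) array into overlapping windows of shape
    (num_windows, seq_len, features) -- the same layout an RNN expects."""
    windows = [series[i : i + seq_len] for i in range(len(series) - seq_len + 1)]
    return np.stack(windows)

# 10 time steps, 3 features per step (values are arbitrary).
series = np.arange(30, dtype=np.float32).reshape(10, 3)
batch = make_windows(series, seq_len=3)
print(batch.shape)  # (8, 3, 3): batch size, sequence length, features
```

Because the windows overlap, shuffle them before splitting into train/validation sets carefully, or split by time first to avoid leakage.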

Learnable Time Representation

For Attention to work, you need to attach the notion of time to your input features. In the original NLP model, a collection of superimposed sinusoidal functions was added to each input embedding. We need a different representation now, because our inputs are scalar values and not distinct words/tokens.

Positional Encoding Visualization from kazemnejad’s blog.

The Time 2 Vec paper comes in handy. It describes a learnable, complementary, model-agnostic representation of time. If you’ve studied Fourier transforms in the past, this should be easy to understand.

Just break down each input feature into a linear component (a line) and as many periodic (sinusoidal) components as you wish. By defining the decomposition as a function, we make its weights learnable through backpropagation.

Time 2 Vec Decomposition Equation
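For reference, the decomposition defined in the Time 2 Vec paper is:

```latex
t2v(\tau)[i] =
\begin{cases}
\omega_i \tau + \varphi_i, & \text{if } i = 0 \\
F(\omega_i \tau + \varphi_i), & \text{if } 1 \le i \le k
\end{cases}
```

where F is a periodic activation function (sin in the paper), and the frequencies ω and phase shifts φ are learnable parameters.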

For each input feature, we apply the same layer in a time-distributed manner. The layer’s learnable weights do not depend on the time index! Finally, we concatenate the original inputs to the result.

Here is an illustration of the learned time embeddings, which are different for each input feature category (1 learned linear component and 1 learned periodic component per feature).

This does not mean that each time step will carry the same embedding value, because the computation of the Time 2 Vec embeddings depends on the input values!

And, in the end, we concatenate these all together to form the input for the attention blocks.
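Here is a sketch of the forward computation in plain NumPy (the function and weight names are mine; in a real model, w and phi would be trainable parameters of a custom time-distributed layer):

```python
import numpy as np

def time2vec(tau: np.ndarray, w: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Time 2 Vec forward pass for one scalar input feature.

    tau: (batch, seq_len, 1) input values
    w, phi: (k + 1,) weights; index 0 is the linear component,
            indices 1..k are the periodic (sinusoidal) components.
    Returns (batch, seq_len, k + 1) embeddings.
    """
    v = w * tau + phi                # broadcast over the last axis
    v[..., 1:] = np.sin(v[..., 1:])  # component 0 stays linear
    return v

# One feature, batch of 2, sequence length 4, k = 3 periodic components.
rng = np.random.default_rng(0)
tau = rng.normal(size=(2, 4, 1))
w, phi = rng.normal(size=4), rng.normal(size=4)

emb = time2vec(tau, w, phi)
# In the model we concatenate the embedding with the original input:
x = np.concatenate([tau, emb], axis=-1)
print(emb.shape, x.shape)  # (2, 4, 4) (2, 4, 5)
```

Note that the embedding is a function of the input value, which is exactly why different time steps carry different embeddings.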


Architecture

We are going to use Multi-Head Self-Attention (setting Q, K, and V to depend on the input through different dense layers/matrices). The next decision is optional and depends on the scale of your model and data, but we are also going to ditch the decoder part completely. This means that we are only going to use one or more attention block layers.

In the last part, we are going to use a few (one or more) Dense layers to predict whatever we want to predict.

Our Architecture

Each Attention Block consists of Self Attention, Layer Normalizations, and a Feed-Forward Block. The input dimensions of each block are equal to its output dimensions.

Optionally, before the head part, you can apply some sort of pooling (Global Average 1D for example).
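A minimal NumPy sketch of one such block, just to show the data flow and shapes. The simplifications here are mine: a single attention head, no bias terms, no trainable scale/offset in the layer norms, and random weights; a real model would use a proper multi-head attention layer (e.g. Keras MultiHeadAttention).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last (feature) axis."""
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention_block(x, wq, wk, wv, w1, w2):
    """One attention block: self-attention, residual + layer norm,
    feed-forward, residual + layer norm.
    x: (batch, seq_len, d); every weight matrix maps back to d, so the
    block's output shape equals its input shape and blocks can be stacked."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1]))
    x = layer_norm(x + scores @ v)        # self-attention sub-layer
    ff = np.maximum(0.0, x @ w1) @ w2     # feed-forward with ReLU
    return layer_norm(x + ff)             # feed-forward sub-layer

d, d_ff = 8, 16
rng = np.random.default_rng(1)
x = rng.normal(size=(2, 5, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
w1, w2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))

out = attention_block(x, wq, wk, wv, w1, w2)
print(out.shape)  # (2, 5, 8): same as the input
```

After the last block, pooling (e.g. global average over the sequence axis) collapses (batch, seq_len, d) to (batch, d) before the dense head.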

Bag Of Tricks

Things to consider when using Transformers and Attention, to get the most out of your model.

1) Start Small

Don’t go crazy with hyperparameters. Start with a single, humble attention layer, a couple of heads, and a low dimension. Observe results and adjust hyperparameters accordingly — don’t overfit! Scale your model along with your data. Nevertheless, nothing is stopping you from scheduling a huge hyperparameter search job :).

2) Learning Rate Warmup

A crucial part of training the attention mechanism stably is the learning-rate warmup. Start with a small learning rate and gradually increase it until you reach the base one, then decrease it again. You can go crazy with exponentially decaying schedules and sophisticated formulas, but I will just give you a simple example that you should be able to understand just by reading the following code out loud:
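A minimal sketch, assuming a linear ramp-up followed by a linear decay; the function name and every constant below are placeholders to tune for your own run:

```python
def lr_schedule(step: int,
                base_lr: float = 1e-3,
                warmup_steps: int = 1000,
                total_steps: int = 10000) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back towards 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps                  # ramp up
    decay = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(decay, 0.0)                          # ramp down

print(lr_schedule(500))   # halfway through warmup: 0.0005
print(lr_schedule(1000))  # peak: 0.001
```

You can plug a function like this into your framework's per-step learning-rate callback or schedule hook.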

3) Use Adam (or variants)

Non-accelerated gradient descent optimization methods do not work well with Transformers. Adam is a good initial optimizer choice to train with. Keep an eye out for newer (and possibly better) optimization techniques like AdamW or NovoGrad!

Thanks for reading all the way to the end!


[1] Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin, 2017

[2] Time2Vec: Learning a Vector Representation of Time, Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart and Marcus Brubaker, 2019