Mischa Dombrowski

Master's Thesis

Systematic Analysis of the Transformer Architecture for Time Series Prediction Applications

Advisors:

Philipp Schlieper (M.Sc.), An Nguyen (M.Sc.), Prof. Dr. Björn Eskofier

Duration:

05/2021 – 11/2021

Abstract:

Ever since the Transformer architecture was introduced by Vaswani et al. [1], it has come to dominate
the natural language processing community. Many state-of-the-art solutions such as BERT [2] or GPT [3]
use a Transformer-like architecture as the backbone of their deep learning model.
Because the architecture has been so successful at capturing and learning temporal dependencies in
these kinds of problems, the natural next step was to look at how the model performs on time series
problems. This resulted in the first publications on the performance of the Transformer when applied
to forecasting problems [4, 5]. Each of them shows promising results when compared to its benchmark
deep and shallow learning models. However, one problem with most of these papers is that they provide
little to no insight into the characteristics of the data; instead, they simply apply the novel
architecture to the problem and analyze the performance. Therefore, it is hard to assess from these
results alone whether the Transformer architecture will be useful for a new dataset, and valuable
information could be gathered to fill this gap.
The goal of this thesis is to provide a data-focused analysis of the applicability of the Transformer
architecture, with a focus on univariate time series forecasting. This will be done by empirically
analyzing which characteristics of the data lead to the Transformer performing better or worse than
alternatives such as long short-term memory (LSTM) networks [8] or convolutional neural networks
(CNN). First, an exhaustive literature search will be conducted, covering publications that compare
at least two of these models, such as [6, 7, 10], to see whether intuition can be built about which
type of data is best suited to each of the three models.
Then, a synthetic dataset will be created, following other publications that use synthetic data
[4, 9]. The difference is that the focus will be on the ability to adjust relevant characteristics
of a typical real-world dataset, such as the length of the signal, short- or long-term dependencies,
the dynamics of the time series, and the sequence length.
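As an illustration only, a synthetic generator with adjustable characteristics could look roughly
like the following sketch. It is not the generator used in the thesis: the function names and all
parameters (`period`, `lag`, `ar_coef`, `noise_std`, `input_len`, `horizon`) are hypothetical knobs
chosen here to mirror the characteristics listed above (signal length, dependency range, dynamics,
and sequence length).

```python
import math
import random


def make_series(length, period=24, lag=1, ar_coef=0.5, noise_std=0.1, seed=0):
    """Generate a univariate series: a sinusoid plus an autoregressive term at a chosen lag.

    Hypothetical illustration; every parameter is an assumed knob, not the thesis setup.
    """
    rng = random.Random(seed)
    series = []
    for t in range(length):
        # Periodic component; `period` controls how fast the signal oscillates.
        seasonal = math.sin(2 * math.pi * t / period)
        # Dependency on the value `lag` steps back; a larger lag means a longer-term dependency.
        past = series[t - lag] if t >= lag else 0.0
        series.append(seasonal + ar_coef * past + rng.gauss(0.0, noise_std))
    return series


def make_windows(series, input_len, horizon=1):
    """Slice a series into (input window, target) pairs for forecasting."""
    pairs = []
    for start in range(len(series) - input_len - horizon + 1):
        x = series[start:start + input_len]
        y = series[start + input_len:start + input_len + horizon]
        pairs.append((x, y))
    return pairs


if __name__ == "__main__":
    # A signal of length 200 with a dependency reaching 50 steps back,
    # cut into input sequences of length 32 with a 4-step forecast target.
    series = make_series(200, lag=50)
    windows = make_windows(series, input_len=32, horizon=4)
    print(len(series), len(windows))
```

Varying `lag` relative to `input_len` is one simple way to decide whether the relevant dependency
lies inside or outside the window a model sees, which is the kind of controlled comparison the
planned experiments call for.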
Finally, the three models will be compared with respect to these characteristics, and it will be
evaluated whether one of them can be considered superior when these characteristics are known.
Additionally, the advantages and disadvantages of each model will be discussed in light of the
experimental results, including aspects such as computational complexity.

References:

[1] Vaswani, Ashish et al.: Attention is all you need. 31st Conference on Neural Information
Processing Systems, 2017.
[2] Devlin, Jacob et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. Google AI Language, 2019.
[3] Brown, Tom B. et al.: Language Models are Few-Shot Learners. arXiv, OpenAI, 2020.
[4] Li, Shiyang et al.: Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. 33rd Conference on Neural Information Processing Systems, 2019.
[5] Wu, Neo et al.: Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. Google LLC, 2020.
[6] Agarwal, Bushagara et al.: Deep Learning based Time Series Forecasting. 19th IEEE International Conference on Machine Learning and Applications, 2020.
[7] Koprinska, Irena et al.: Convolutional Neural Networks for Energy Time Series Forecasting.
Proceedings of the International Joint Conference on Neural Networks, 2018.
[8] Hochreiter, Sepp et al.: Long Short-Term Memory. Neural Computation, 1997.
[9] Borovykh, Anastasia et al.: Conditional Time Series Forecasting with Convolutional Neural
Networks. Journal of Computational Finance, Forthcoming, 2018.
[10] El Idrissi, Touria et al.: Deep Learning for Blood Glucose Prediction: CNN vs
LSTM. International Conference on Computational Science and Its Applications, 2020.