Fabio Philipp Rosenthal

Master's Thesis

Adaptive Resolution Transformer (ART): efficient transformers using adaptive resolution tokens

Advisors

Kai Klede (M. Sc.), Leo Schwinn (M. Sc.), Prof. Dr. Björn Eskofier

Duration
12 / 2022 – 06 / 2023

Abstract

Since their introduction by Vaswani et al. in 2017, transformer models have become one of the most popular architectures in natural language processing [1]. Dosovitskiy et al. transferred transformers to image recognition in 2020 by dividing input images into a sequence of small patches used as input tokens [2]. Since then, the popularity of transformer architectures in image recognition tasks has increased significantly [3], [4]. Nevertheless, transformers have some inherent limitations. The main building block of every transformer is the attention mechanism, which computes attention scores between every pair of tokens in the input sequence [1]. The attention module therefore requires processing time and memory that grow quadratically with the sequence length, which limits the application of transformers to comparatively short sequences [5], [6]. Thus, transformers are not well suited for predictions on high-resolution images, which decompose into many individual image patches and hence a large number of input tokens [5].
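
To illustrate the quadratic cost, the following NumPy sketch implements single-head scaled dot-product attention as defined by Vaswani et al. [1]; the shapes, patch size, and toy inputs are illustrative assumptions for this example, not taken from any specific model.

import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (n, d) arrays for a sequence of n tokens.
    # The score matrix is (n, n), so time and memory grow
    # quadratically with the sequence length n.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (n, n): one score per token pair
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (n, d)

# A 512x512 image cut into 16x16 patches yields n = (512/16)^2 = 1024 tokens,
# i.e., a 1024x1024 score matrix in every attention layer.
n, d = 1024, 64
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, n, d))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (1024, 64)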

More efficient transformer architectures may reduce the computational overhead for inference and training and enable transformers to process high-resolution images. Many approaches to improve the efficiency of transformers have been proposed. However, while these methods reduce the computational overhead, they only approximate the attention mechanism and therefore sacrifice accuracy. In this thesis, we propose the Adaptive Resolution Transformer (ART) to enable vision transformers to process high-resolution images. For ART models, we first rescale the original input to a considerably lower resolution (e.g., from 512×512 to 32×32). Subsequently, we continuously split highly informative tokens in the feed-forward path. For every split token, we add four higher-resolution tokens extracted from a higher-resolution version of the image. Here, the resolution of the image is continuously increased with every token split until the original resolution is reached. This procedure enables ART to process important image regions in high resolution while not wasting computational resources on uninformative image regions.
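
The following is a minimal, illustrative sketch of one such split step, not the thesis implementation: the fixed 8-pixel patch size, the two-level image pyramid, and the variance-based informativeness score (score_fn) are assumptions made purely for this example.

import numpy as np

PATCH = 8  # patch side length in pixels (illustrative choice)

def embed_patch(pyramid, level, y, x):
    # Flatten the (PATCH x PATCH) pixel block at the given pyramid level.
    # A real model would use a learned linear projection instead.
    img = pyramid[level]
    return img[y * PATCH:(y + 1) * PATCH, x * PATCH:(x + 1) * PATCH].reshape(-1)

def split_step(tokens, regions, pyramid, score_fn):
    # One ART-style split: replace the most informative token with four
    # tokens extracted from the next-finer resolution level.
    i = int(np.argmax([score_fn(t) for t in tokens]))
    level, y, x = regions.pop(i)
    tokens.pop(i)
    if level + 1 >= len(pyramid):          # original resolution already reached
        regions.insert(i, (level, y, x))
        tokens.insert(i, embed_patch(pyramid, level, y, x))
        return tokens, regions
    for dy in (0, 1):                      # four children at double resolution
        for dx in (0, 1):
            child = (level + 1, 2 * y + dy, 2 * x + dx)
            regions.append(child)
            tokens.append(embed_patch(pyramid, *child))
    return tokens, regions

# Toy pyramid: a 32x32 downscaled image plus the 64x64 next-finer level.
rng = np.random.default_rng(0)
pyramid = [rng.random((32, 32)), rng.random((64, 64))]
regions = [(0, y, x) for y in range(4) for x in range(4)]  # 4x4 grid of 8px patches
tokens = [embed_patch(pyramid, *r) for r in regions]
tokens, regions = split_step(tokens, regions, pyramid, score_fn=np.var)
print(len(tokens))  # 16 - 1 + 4 = 19 tokens after one split

Each split trades one coarse token for four finer ones, so the token count grows only where the image is informative, rather than quadratically with the full input resolution.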

References
[1] A. Vaswani et al., “Attention Is All You Need.” arXiv, Dec. 05, 2017. Accessed: Oct. 27, 2022. [Online]. Available: http://arxiv.org/abs/1706.03762
[2] A. Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” arXiv, Jun. 03, 2021. Accessed: Oct. 10, 2022. [Online]. Available: http://arxiv.org/abs/2010.11929
[3] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in Vision: A Survey,” ACM Comput. Surv., vol. 54, no. 10s, pp. 1–41, Jan. 2022, doi: 10.1145/3505244.
[4] Y. Liu et al., “A Survey of Visual Transformers.” arXiv, May 02, 2022. Accessed: Oct. 28, 2022. [Online]. Available: http://arxiv.org/abs/2111.06091
[5] Q. Zhang and Y. Yang, “ResT: An Efficient Transformer for Visual Recognition.” arXiv, Oct. 14, 2021. Accessed: Oct. 27, 2022. [Online]. Available: http://arxiv.org/abs/2105.13677
[6] A. Gupta, G. Dar, S. Goodman, D. Ciprut, and J. Berant, “Memory-efficient Transformers via Top-K Attention.” arXiv, Jun. 12, 2021. Accessed: Oct. 27, 2022. [Online]. Available: http://arxiv.org/abs/2106.06899
[7] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998, doi: 10.1109/5.726791.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848.
[9] Z. Ying, H. Niu, P. Gupta, D. Mahajan, D. Ghadiyaram, and A. Bovik, “From Patches to Pictures (PaQ-2-PiQ): Mapping the Perceptual Space of Picture Quality.” arXiv, Dec. 20, 2019. Accessed: Oct. 28, 2022. [Online]. Available: http://arxiv.org/abs/1912.10088
[10] Y. Fang, H. Zhu, Y. Zeng, K. Ma, and Z. Wang, “Perceptual Quality Assessment of Smartphone Photography,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 3674–3683. doi: 10.1109/CVPR42600.2020.00373.
[11] V. Hosu, H. Lin, T. Sziranyi, and D. Saupe, “KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment,” IEEE Trans. Image Process., vol. 29, pp. 4041–4056, 2020, doi: 10.1109/TIP.2020.2967829.