Master's Thesis

Scale-invariant Convolution Kernels


Franz Köferl (M.Sc.), Prof. Dr. B. Eskofier




Convolutional layers within neural networks enjoy great popularity among researchers in many
areas of image processing, including medicine [17], quality control in manufacturing [3] and
autonomous driving [8]. One of the key properties of convolution kernels contributing to this success
is translation invariance [9]. The same kernel weights are applied to every part of the input,
independent of the exact location, thus restricting the search space to those functions that make
sense in a natural image context.
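This weight-sharing property is easy to demonstrate. The following numpy sketch (purely illustrative; the function and variable names are chosen here and do not come from any of the cited works) shows that shifting the input shifts the response map by the same amount, since the same kernel weights are applied at every location:

```python
import numpy as np

def conv2d_valid(img, ker):
    """Plain 'valid' 2-D correlation: the same weights at every location."""
    k = ker.shape[0]
    oy, ox = img.shape[0] - k + 1, img.shape[1] - k + 1
    out = np.empty((oy, ox))
    for y in range(oy):
        for x in range(ox):
            out[y, x] = np.sum(img[y:y + k, x:x + k] * ker)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((6, 6))
ker = rng.standard_normal((3, 3))

# shifting the input by one pixel shifts the response map by one pixel
shifted = np.roll(img, 1, axis=1)
out, out_shifted = conv2d_valid(img, ker), conv2d_valid(shifted, ker)
```

Pooling or striding applied on top of such shifted-but-identical responses is what turns this equivariance into the (approximate) invariance exploited in practice.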
It is argued that this crucial property should also extend to different scales, commonly resulting
from varying zoom levels and object resolutions. Currently, larger-scale kernels need to be
generated by combining multiple small kernels, which is one of the reasons why 3×3 convolutions
are so predominant [13].
There exist many different approaches for incorporating scale in convolutional neural network
(CNN) architectures. The simplest are data augmentation methods [2, 1] which resize the training
images. Other so-called multi-column architectures use an ensemble of CNNs operating at
different scale levels by either scaling the input [14] or the kernels themselves [16]. Yet others
propagate information from different depths and thus scales towards the output layer [4, 18, 10].
Still others directly model arbitrary transformations, e.g. in the form of spatial transformers [7],
dynamic Gaussian receptive fields [15] or matrix capsules [6]. More recently steerable filters have
gained attention for implementing scale-equivariant convolutions [11, 5]. All of the aforementioned
approaches either do not actually yield scale-invariant kernels, cannot capture inter-scale correlations,
or are comparatively complex to understand and implement.
The research in this thesis aims to introduce scale invariance as a first-class citizen of convolutional
layers. This differs from most related work, which focuses on architectural changes.
A standard (2D) convolutional layer takes an input of shape (iy, ix, ic), where iy is the height,
ix is the width and ic is the number of channels. The output is of the shape (oy, ox, oc), where
oy and ox again correspond to the location, or translation, within the image and oc usually
is the number of kernels used. The proposed scale-invariant (2D) convolutional layer adds a new
output dimension os which corresponds to the scale within the image, resulting in a total of four
output dimensions. It thus encapsulates scale information in a similar fashion as is currently the
case for the translation, making this novel layer conceptually easy to understand. This additional
dimension is formed by applying the standard kernel at different scales. Assuming the common
3×3 kernel, it would be successively applied to 3×3, 4×4, 5×5 and larger regions of the input.
Since these regions do not always divide up evenly, interpolation is required. For simplicity, this
work intends to use bilinear interpolation, though other methods are certainly possible. One
remaining open question is how this layer interacts with other common layer types, given the
extra output dimension os. Possible solutions include max pooling to collapse this extra
dimension and 3D convolutions to exploit the inter-scale correlations.
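To make the proposal concrete, the following numpy sketch (illustrative only: the helper names, the single-channel setting and the chosen scales are assumptions of this exposé, not a fixed implementation) applies one 3×3 kernel to 3×3, 4×4 and 5×5 regions, bilinearly resizing each region to the kernel size, and then collapses the resulting scale axis with max pooling as suggested above:

```python
import numpy as np

def bilinear_resize(patch, size):
    """Resize a square 2-D patch to (size, size) via bilinear interpolation."""
    h, w = patch.shape
    out = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            # map the output pixel back into the input patch
            y = i * (h - 1) / (size - 1) if size > 1 else 0.0
            x = j * (w - 1) / (size - 1) if size > 1 else 0.0
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            out[i, j] = ((1 - dy) * (1 - dx) * patch[y0, x0]
                         + (1 - dy) * dx * patch[y0, x1]
                         + dy * (1 - dx) * patch[y1, x0]
                         + dy * dx * patch[y1, x1])
    return out

def scale_invariant_conv2d(image, kernel, scales=(3, 4, 5)):
    """Apply one k×k kernel at several region sizes ('scales').

    Each s×s region is bilinearly resized to k×k before the dot product,
    so the very same weights see the content at every scale. The result
    gains an extra leading scale axis: shape (os, oy, ox).
    """
    k = kernel.shape[0]
    smax = max(scales)
    oy, ox = image.shape[0] - smax + 1, image.shape[1] - smax + 1
    out = np.empty((len(scales), oy, ox))
    for si, s in enumerate(scales):
        for y in range(oy):
            for x in range(ox):
                region = image[y:y + s, x:x + s]
                out[si, y, x] = np.sum(bilinear_resize(region, k) * kernel)
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
ker = np.ones((3, 3)) / 9.0                  # a 3×3 box-filter kernel
resp = scale_invariant_conv2d(img, ker)      # shape (os, oy, ox) = (3, 2, 2)
pooled = resp.max(axis=0)                    # collapse the extra scale axis
```

A small 3D convolution over the (os, oy, ox) volume could replace the final max pooling when the inter-scale correlations are to be learned rather than discarded.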




[1] Elahe Arani, Shabbir Marzban, Andrei Pata, and Bahram Zonooz. RGPNet: A Real-Time
General Purpose Semantic Segmentation. arXiv:1912.01394 [cs], December 2019.
[2] Lokesh Boominathan, Srinivas S S Kruthiventi, and R. Venkatesh Babu. CrowdNet: A Deep
Convolutional Network for Dense Crowd Counting. In Proceedings of the 24th ACM International
Conference on Multimedia, MM ’16, pages 640–644, New York, NY, USA, 2016. ACM.
[3] Nian Cai, Guandong Cen, Jixiu Wu, Feiyang Li, Han Wang, and Xindu Chen. SMT Solder
Joint Inspection via a Novel Cascaded Convolutional Neural Network. IEEE Transactions
on Components, Packaging and Manufacturing Technology, 8(4):670–677, April 2018.
[4] Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, and Nuno Vasconcelos. A Unified Multi-scale
Deep Convolutional Neural Network for Fast Object Detection. In Bastian Leibe, Jiri Matas,
Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, Lecture Notes in
Computer Science, pages 354–370. Springer International Publishing, 2016.
[5] Rohan Ghosh and Anupam K. Gupta. Scale Steerable Filters for Locally Scale-Invariant
Convolutional Neural Networks. arXiv:1906.03861 [cs], June 2019. arXiv: 1906.03861.
[6] Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In
International Conference on Learning Representations, February 2018.
[7] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and koray kavukcuoglu. Spatial Transformer
Networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett,
editors, Advances in Neural Information Processing Systems 28, pages 2017–2025. Curran
Associates, Inc., 2015.
[8] Jelena Kocic, Nenad Jovicic, and Vujo Drndarevic. An End-to-End Deep Neural Network for
Autonomous Driving Designed for Embedded Automotive Platforms. Sensors, 19(9):2064,
January 2019.
[9] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series.
In The Handbook of Brain Theory and Neural Networks, pages 255–258. MIT Press, Cambridge,
MA, USA, 1998.
[10] Gi Pyo Nam, Heeseung Choi, Junghyun Cho, and Ig-Jae Kim. PSI-CNN: A Pyramid-Based
Scale-Invariant CNN Architecture for Face Recognition Robust to Various Image Resolutions.
Applied Sciences, 8(9):1561, September 2018.
[11] Ivan Sosnovik, Michał Szmaja, and Arnold Smeulders. Scale-Equivariant Steerable Networks.
In International Conference on Learning Representations, April 2020.
[12] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German Traffic
Sign Recognition Benchmark: A multi-class classification competition. In The 2011 International
Joint Conference on Neural Networks, pages 1453–1460, July 2011. ISSN: 2161-4407.
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture
for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 2818–2826, June 2016.
[14] Nanne van Noord and Eric Postma. Learning scale-variant and scale-invariant features for
deep image classification. Pattern Recognition, 61:583–592, January 2017.
[15] Dequan Wang, Evan Shelhamer, Bruno Olshausen, and Trevor Darrell. Dynamic Scale Inference
by Entropy Minimization. arXiv:1908.03182 [cs], August 2019. arXiv: 1908.03182.
[16] Yichong Xu, Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, and Zheng Zhang. Scale-Invariant
Convolutional Neural Networks. arXiv:1411.6369 [cs], November 2014. arXiv: 1411.6369.
[17] Rikiya Yamashita, Mizuho Nishio, Richard Kinh Gian Do, and Kaori Togashi. Convolutional
neural networks: an overview and application in radiology. Insights Imaging, 9(4):611–629,
August 2018.
[18] P. Zhou, B. Ni, C. Geng, J. Hu, and Y. Xu. Scale-Transferrable Object Detection. In 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 528–537, June
2018.