Jonas Süskind

Master's Thesis

Video-based Deep Learning Approaches for Animal Behavior Classification


Philip Stoll (M. Sc.), Matthias Zürl (M. Sc.), Prof. Dr. Björn Eskofier


05 / 2022 – 10 / 2023


To assess or study the health of an animal, researchers in the field of animal welfare usually observe the animal for extended periods and note down behavioral patterns, such as stereotypical movement, for later analysis. This approach has several drawbacks: the task is highly repetitive, labor-intensive, costly, and prone to human error such as subjective bias and accidental misclassification. There is therefore high potential for automation, especially using recent advances in deep learning.
The catalog of behaviors exhibited by an animal is called an ethogram. For polar bears, it defines primary (walk, stand, swim, ...) and secondary (attentive, eat, ...) behaviors. Preliminary experiments leading to the proposed thesis showed that even an off-the-shelf convolutional neural network such as EfficientNetV2 [1], applied on a frame-by-frame basis, achieves a 94% validation F1-score when classifying the primary behavior of polar bears. Reliably classifying the secondary behavior remained a challenge in these early experiments with such naive approaches. So far, most misclassifications occurred when temporal information, such as the movement of the animal itself, was vital for differentiating behaviors. For instance, a frame of a standing polar bear may look similar to a frame of a walking polar bear, but additional context in the form of a few subsequent frames helps in deciding the correct class. Significant advances have been made in the related domain of (human) action recognition, including, but not limited to, 3D convolutions [2], RNNs [3], and two-stream architectures with optical flow [4, 5]. The thesis will apply these approaches to the behavior classification of polar bears in order to investigate how well they transfer.
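As an illustration of the validation metric mentioned above, the following minimal sketch computes an F1-score from frame-level labels. It assumes macro averaging (the unweighted mean of per-class F1), which is not stated in the proposal; the function name `macro_f1` and the example labels are hypothetical.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either sequence.

    Assumption: macro averaging; the proposal does not specify the averaging mode.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical per-frame labels for primary behaviors.
truth = ["walk", "walk", "stand", "swim"]
preds = ["walk", "stand", "stand", "swim"]
score = macro_f1(truth, preds)
```

With strongly imbalanced behavior classes, macro averaging weights rare behaviors as heavily as frequent ones, which is one reason it is often preferred over plain accuracy in this kind of setting.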
The data for the experiments consists of 50 hours of video of two polar bears from the Nuremberg Zoo, filmed from three separate camera angles, together with the corresponding behavior annotations. From this raw data, frames showing the bears are cropped and stored as single images for training and testing the models.
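Video-based models need short frame sequences rather than single images, so the stored frames would have to be regrouped into clips. The following is a minimal sketch of one way to do this, not the method used in the thesis. It assumes the frames come from one camera and one continuous recording, are ordered by time, and carry one label each; the function name `make_clips`, the clip length, the stride, and the choice of labeling a clip by its center frame are all hypothetical.

```python
def make_clips(frames, labels, clip_len=8, stride=4):
    """Group time-ordered frames into fixed-length clips.

    Each clip is labeled with the annotation of its center frame
    (an assumption; majority voting would be another option).
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    if clip_len < 1 or stride < 1:
        raise ValueError("clip_len and stride must be positive")

    clips = []
    # Slide a window over the sequence; incomplete windows at the end are dropped.
    for start in range(0, len(frames) - clip_len + 1, stride):
        window = frames[start:start + clip_len]
        center_label = labels[start + clip_len // 2]
        clips.append((window, center_label))
    return clips


# Hypothetical example: ten frame identifiers with a stand-to-walk transition.
frame_ids = list(range(10))
frame_labels = ["stand"] * 5 + ["walk"] * 5
clips = make_clips(frame_ids, frame_labels, clip_len=4, stride=2)
```

A stride smaller than the clip length produces overlapping clips, which yields more training samples but also correlated ones, so training and test clips should be drawn from separate recordings to avoid leakage.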


[1] Mingxing Tan and Quoc V. Le. EfficientNetV2: Smaller models and faster training. 4 2021.
[2] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:221–231, 1 2013.
[3] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell, and Kate Saenko. Long-term recurrent convolutional networks for visual recognition and description. pages 2625–2634. IEEE, 6 2015.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. pages 4724–4733. IEEE, 7 2017.
[5] James P Bohnslav, Nivanthika K Wimalasena, Kelsey J Clausing, Yu Y Dai, David A Yarmolinsky, Tomás Cruz, Adam D Kashlan, M Eugenia Chiappe, Lauren L Orefice, Clifford J Woolf, and Christopher D Harvey. DeepEthogram, a machine learning pipeline for supervised behavior classification from raw pixels. eLife, 10, 9 2021.
[6] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting, 2015.
[7] Oliver Sturman, Lukas von Ziegler, Christa Schläppi, Furkan Akyol, Mattia Privitera, Daria Slominski, Christina Grimm, Laetitia Thieren, Valerio Zerbi, Benjamin Grewe, and Johannes Bohacek. Deep learning-based behavioral analysis reaches human accuracy and is capable of outperforming commercial solutions. Neuropsychopharmacology, 45(11):1942–1952, July 2020.