05 / 2022 – 11 / 2022
Estimating a person’s emotional state from live or recorded video is an active research topic in computer vision. Capturing such estimates in response to events or products, or during an interview, may be useful for marketing, recruitment, medicine, education, entertainment, security and law enforcement [1].
In 1992, Paul Ekman proposed six distinct emotional states: anger, disgust, fear, happiness, sadness and surprise. These states can be distinguished by facial expressions and are independent of age and cultural background [2]. However, because these states are categorical, they carry no information about the intensity of an emotion and cannot represent several overlapping emotional states at once. The continuous two-dimensional valence-arousal space was developed to address this problem. The valence axis models how positive or negative the observed person’s emotional state is: lower values describe unpleasantness, higher values describe pleasantness. The arousal axis models how passive or active the observed person’s emotional state is: lower values describe passivity, higher values describe activity [3].
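The valence-arousal description above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the class name `AffectState`, the value range of [-1, 1] per axis, the use of zero as the neutral midpoint, and the distance from the origin as an intensity proxy are all assumptions made for illustration.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class AffectState:
    """A point in the valence-arousal plane (assumed range: [-1, 1] per axis)."""

    valence: float  # lower = unpleasant, higher = pleasant
    arousal: float  # lower = passive, higher = active

    def __post_init__(self):
        # Reject values outside the assumed range.
        for name in ("valence", "arousal"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [-1, 1], got {value}")

    def quadrant(self) -> str:
        """Coarse description; zero is treated as the neutral midpoint (assumption)."""
        pleasant = "pleasant" if self.valence >= 0 else "unpleasant"
        active = "active" if self.arousal >= 0 else "passive"
        return f"{pleasant}, {active}"

    def intensity(self) -> float:
        """Distance from the neutral origin, a hypothetical intensity proxy."""
        return math.hypot(self.valence, self.arousal)


# Two states in the same quadrant that differ only in strength,
# a distinction that a single categorical label cannot express.
mild = AffectState(valence=0.2, arousal=0.1)
strong = AffectState(valence=0.8, arousal=0.6)
print(mild.quadrant(), round(mild.intensity(), 2))
print(strong.quadrant(), round(strong.intensity(), 2))
```

Both example points fall into the same coarse quadrant, yet their distances from the origin differ, which is the kind of nuance a categorical label leaves out.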
The “Aff-Wild2” dataset was developed for the “Workshop and Competition on Affective Behavior Analysis in-the-wild” (ABAW) and contains 558 YouTube videos with a total of about 2.8 million annotated frames. The dataset provides labels for three different methods of emotion estimation: action units, categorical emotion expressions and valence-arousal estimation. Some submissions to ABAW split the input data into multiple streams that are processed separately [6, 7, 8]. Attention mechanisms have also been used to improve predictive performance [9, 10].
This thesis aims to improve upon a previous work with an audio-visual two-stream approach [6] by evaluating the contribution of the individual streams to the performance and by exploring and implementing viable attention mechanisms. Although the neural network from the previous work produces outputs for all three methods of the competition, this work focuses only on valence-arousal estimation.
[1] D. Canedo and A. J. R. Neves, “Facial expression recognition using computer vision: A systematic review,” Applied Sciences, vol. 9, no. 21, 2019.
[2] P. Ekman, “An argument for basic emotions,” Cognition and Emotion, vol. 6, no. 3-4, pp. 169–200, 1992.
[3] J. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161–1178, 1980.
[4] D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou, “Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond,” International Journal of Computer Vision, pp. 1–23, 2019.
[5] S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia, “Aff-wild: Valence and arousal ’in-the-wild’ challenge,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1980–1987, IEEE, 2017.
[6] F. Kuhnke, L. Rumberg, and J. Ostermann, “Two-stream aural-visual affect analysis in the wild,” in 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), (Los Alamitos, CA, USA), pp. 366–371, IEEE Computer Society, May 2020.
[7] P. Antoniadis, I. Pikoulis, P. P. Filntisis, and P. Maragos, “An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild,” CoRR, vol. abs/2107.03465, 2021.
[8] Y.-H. Zhang, R. Huang, J. Zeng, S. Shan, and X. Chen, “m3t: Multi-modal continuous valence-arousal estimation in the wild,” 2020.
[9] S. Zhang, Y. Ding, Z. Wei, and C. Guan, “Audio-visual attentive fusion for continuous emotion recognition,” CoRR, vol. abs/2107.01175, 2021.
[10] D. L. Hoai, E. Lim, E. Choi, S. Kim, S. Pant, G.-S. Lee, S.-H. Kim, and H.-J. Yang, “An attention-based method for action unit detection at the 3rd ABAW competition,” 2022.