Thomas Neher

Master's Thesis

On the Robustness of Semi-supervised Learning under adverse Data Distributions: The example of Handwriting Classification

Christoffer Löffler (M.Sc.), Prof. Dr. Björn Eskofier

05/2021 – 11/2021

Semi-supervised learning (SSL) promises to bring large amounts of unlabeled data into the machine learning process, potentially improving model performance substantially [1][2][3]. This is especially useful when a purely supervised training scheme lacks sufficient labeled data for the learning task at hand. In practice, however, adding unlabeled data has been shown to degrade performance unpredictably at times rather than improve it [4][5]. This open research problem has recently gained renewed attention from the scientific community [5][15]. The degradation is thought to stem from adverse data distributions that break SSL assumptions such as smoothness [6] or that exhibit shifts between labeled and unlabeled data [4][7]. One application that may benefit substantially from using new, unlabeled data for training is handwriting recognition. However, in the realistic OnHW [15] time series dataset, adverse data distributions inhibit the use of SSL. Labeled data may come from only a limited number of writers, which introduces sample selection bias relative to a larger unlabeled distribution, or unlabeled data may be out-of-distribution due to environmental effects such as a minor rotation of the pen [7]. Furthermore, the literature is highly inconsistent in describing such adverse scenarios, which complicates the ideation of solutions. Terms in use include dataset shift [7], covariate shift, concept drift [8], and class distribution mismatch [5], among others. Problematic data can often be viewed from the perspective of related areas such as domain adaptation [9], transfer learning [10], or out-of-distribution detection [11]. The exact relation to these areas, however, remains unclear. Hence, the lack of a unified terminology for the conditions and reasons that cause performance degradation in SSL impedes progress, and applying SSL in practice becomes a game of chance.

This thesis focuses primarily on the ideation of an improved state-of-the-art SSL method for the adverse OnHW time series dataset for handwriting recognition. A minor contribution is the identification of problematic SSL cases from the literature, creating a unified terminology as well as a taxonomy of adverse distributions and algorithmic solutions. In the first phase, a taxonomy and a summary of possibly transferable solutions for SSL methods under adverse data are created. Each problematic scenario is assessed from the perspective of related fields in order to establish clear relations. For each problem, methods proposed in the literature to counteract it are presented. These methods are selected based on research interest (number of citations) and relevance to the OnHW dataset.

In the second phase, which uses the OnHW dataset, the most severe problems for real-life application are studied. For an in-depth investigation of adverse data distributions, the base problematic cases identified for OnHW are subsequently simulated and analyzed using simple artificial datasets such as 2D Gaussians, or the CIFAR-10/100 datasets predominant in the literature [12]. Performance degradation is examined using a state-of-the-art SSL method (e.g., MixMatch [13]) from an existing code base [14]. Subsequently, a suitable method for countering adverse data distributions, such as MixMOOD [16], Uncertainty-Aware Self-Distillation [5], or methods based on OOD detection [17], is implemented and evaluated on both synthetic and real data, and the mitigation of degradation is analyzed. As a supervised baseline, the CNN architecture that shows the best performance in the OnHW letter classification task is used.
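
As an illustration of the kind of 2D Gaussian simulation mentioned above, the following is a minimal sketch and not the implementation used in the thesis. It assumes two isotropic Gaussian classes, mimics sample selection bias by drawing labels only from one half-plane, and mimics a pen rotation by rotating the shifted data around the origin. All function names, the nearest-centroid classifier, and all parameter values are hypothetical choices made for this example; the thesis itself examines a deep SSL method such as MixMatch [13].

```python
import numpy as np


def make_gaussians(n_per_class, rng, means=((-2.0, 0.0), (2.0, 0.0)), std=1.0):
    """Draw n_per_class points from one isotropic 2D Gaussian per class."""
    X = np.vstack([rng.normal(loc=m, scale=std, size=(n_per_class, 2)) for m in means])
    y = np.repeat(np.arange(len(means)), n_per_class)
    return X, y


def rotate(X, degrees):
    """Rotate all points counter-clockwise around the origin."""
    theta = np.deg2rad(degrees)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])
    return X @ R.T


def biased_labeled_subset(X, y, n_labeled, rng, axis=1, threshold=0.0):
    """Sample selection bias: labels are drawn only where X[:, axis] > threshold."""
    candidates = np.flatnonzero(X[:, axis] > threshold)
    idx = rng.choice(candidates, size=min(n_labeled, candidates.size), replace=False)
    return X[idx], y[idx]


def fit_centroids(X, y):
    """Stand-in classifier (assumption): one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])


def predict_centroids(model, X):
    """Assign each point to the class of its nearest centroid."""
    classes, centroids = model
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dist.argmin(axis=1)]


def accuracy_under_rotation(degrees, seed=0, n_per_class=500, n_labeled=40):
    """Fit on a biased labeled subset, then score on freshly drawn, rotated data."""
    rng = np.random.default_rng(seed)
    X, y = make_gaussians(n_per_class, rng)
    X_lab, y_lab = biased_labeled_subset(X, y, n_labeled, rng)
    model = fit_centroids(X_lab, y_lab)
    X_new, y_new = make_gaussians(n_per_class, rng)
    y_pred = predict_centroids(model, rotate(X_new, degrees))
    return float((y_pred == y_new).mean())


if __name__ == "__main__":
    for deg in (0, 15, 45, 90):
        print(f"rotation {deg:3d} deg -> accuracy {accuracy_under_rotation(deg):.3f}")
```

Sweeping the rotation angle in this way gives a simple, controllable proxy for how far the shifted data has moved away from the labeled distribution. The same generators could feed the unlabeled branch of an SSL method instead of the stand-in classifier.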

[1] J. E. van Engelen and H. H. Hoos, “A survey on semi-supervised learning,” Mach. Learn., vol. 109, no. 2, pp. 373–440, 2020, doi: 10.1007/s10994-019-05855-6.
[2] O. Chapelle, B. Schölkopf, and A. Zien, “Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews],” IEEE Trans. Neural Networks, vol. 20, no. 3, p. 542, 2009.
[3] P. Ren et al., “A survey of deep active learning,” arXiv, 2020.
[4] A. Oliver, A. Odena, C. Raffel, E. D. Cubuk, and I. J. Goodfellow, “Realistic evaluation of deep semi-supervised learning algorithms,” Proc. 32nd Int. Conf. Neural Inf. Process. Syst., pp. 3239–3250, 2018, doi: 10.5555/3327144.3327244.
[5] Y. Chen, X. Zhu, W. Li, and S. Gong, “Semi-Supervised Learning under Class Distribution Mismatch,” Proc. 34th AAAI Conf. Artif. Intell., vol. 34, no. 4, pp. 3569–3576, 2020, doi: 10.1609/aaai.v34i04.5763.
[6] A. Mey and M. Loog, “Improvability through semi-supervised learning: a survey of theoretical results,” arXiv, 2019.
[7] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, “A unifying view on dataset shift in classification,” Pattern Recognit., vol. 45, no. 1, pp. 521–530, 2012, doi: 10.1016/j.patcog.2011.06.019.
[8] G. I. Webb, R. Hyde, H. Cao, H. L. Nguyen, and F. Petitjean, “Characterizing concept drift,” Data Min. Knowl. Discov., vol. 30, no. 4, pp. 964–994, 2016, doi: 10.1007/s10618-015-0448-4.
[9] J. Jiang and C. X. Zhai, “Instance weighting for domain adaptation in NLP,” ACL 2007 – Proc. 45th Annu. Meet. Assoc. Comput. Linguist., pp. 264–271, 2007.
[10] H. Y. Zhou, A. Oliver, J. Wu, and Y. Zheng, “When semi-supervised learning meets transfer learning: Training strategies, models and datasets,” arXiv, 2018.
[11] X. Zhao, K. Krishnateja, R. Iyer, and F. Chen, “Robust semi-supervised learning with out of distribution data,” arXiv, 2020.
[12] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
[13] D. Berthelot, N. Carlini, I. Goodfellow, A. Oliver, N. Papernot, and C. Raffel, “MixMatch: A holistic approach to semi-supervised learning,” Advances in Neural Information Processing Systems, 2019.
[14] J. Goschenhofer, R. Hvingelby, D. Rügamer, J. Thomas, M. Wagner, and B. Bischl, “Deep Semi-Supervised Learning for Time Series Classification,” arXiv, 2020.
[15] F. Ott, M. Wehbi, T. Hamann, J. Barth, B. Eskofier, and C. Mutschler, “The OnHW Dataset: Online Handwriting Recognition from IMU-Enhanced Ballpoint Pens with Machine Learning,” Proc. ACM Interactive, Mobile, Wearable Ubiquitous Technol., vol. 4, no. 3, 2020, doi: 10.1145/3411842.
[16] S. Calderon-Ramirez, L. Oala, J. Torrents-Barrena, S. Yang, A. Moemeni, W. Samek, and M. A. Molina-Cabello, “MixMOOD: A systematic approach to class distribution mismatch in semi-supervised learning using deep dataset dissimilarity measure,” arXiv, 2020.
[17] S. Liang, Y. Li, and R. Srikant, “Enhancing the Reliability of Out-of-distribution Image Detection in Neural Networks,” International Conference on Learning Representations (ICLR), 2018.