Sven Steinkemper

Master's Thesis

Machine Learning for Systems Biology: Data Analysis of IBD Patients’ Microarray Data

Advisors
Thomas Altstidl (M.Sc.), Prof. Dr.-med. Raja Atreya (Uniklinikum Erlangen), Prof. Dr. Björn Eskofier

Duration
01 / 2022 – 07 / 2022

Abstract
Microarrays capture the expression of different genes by measuring the amount of messenger ribonucleic acid (mRNA) under varying conditions [4]. Such data can be very useful for understanding the internal processes of patients undergoing therapy and for predicting and diagnosing diseases. In many cases, a suitable analysis of microarray gene expression data can help determine a more targeted and effective treatment for each patient [10, 1]. One possible application of this technology is predicting how inflammatory bowel disease (IBD) patients respond to different immunotherapies, which could help target treatments more precisely [8].

Analyzing gene expression data poses a classic machine learning problem: medical data sets usually contain few samples, while the number of genes easily reaches the tens of thousands and thus yields a very large number of features [1]. Most related work therefore combines a feature selection algorithm with a classification algorithm. The feature selection methods studied include Principal Component Analysis [3], Laplacian Score [9], Autoencoders [2], and Support Vector Machine (SVM) Recursive Feature Elimination [6]. Classification methods include SVM [6], Artificial Neural Networks [9], and k-Nearest-Neighbour classifiers [5].

Nevertheless, finding a suitable combination of feature selection and classification algorithms remains a critical issue for every application of microarray data analysis. Furthermore, two factors are important for all medical applications: the ability to generalize well from very small sample sizes, and the explainability of gene correlations to enable interpretability.

This thesis aims to further explore feature selection and classification for microarrays with respect to these two vital factors: sample size and explainability. Building on existing research into feature selection and classification approaches, a plug-and-play pipeline will be implemented to evaluate different feature selection methods in combination with a variety of classification methods. The IBD gene expression data set used consists of samples from 120 IBD patients and 26 control samples [11].
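The plug-and-play idea can be illustrated with a minimal sketch in which every feature selector is paired with every classifier. Everything here is an assumption made for illustration: the function names, the toy data, and the two simple selectors and classifiers, which stand in for the methods named above (PCA, Laplacian Score, SVM, and so on) and are not the thesis implementation.

```python
import math
from itertools import product
from statistics import mean, pvariance


def select_by_variance(X, y, k):
    """Unsupervised filter: indices of the k features with the highest variance."""
    n_features = len(X[0])
    scores = [pvariance([row[j] for row in X]) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def select_by_mean_gap(X, y, k):
    """Supervised filter: k features whose class means differ most (labels 0/1)."""
    def gap(j):
        m0 = mean(row[j] for row, label in zip(X, y) if label == 0)
        m1 = mean(row[j] for row, label in zip(X, y) if label == 1)
        return abs(m0 - m1)
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def fit_nearest_centroid(X, y):
    """Return a predict function that assigns the label of the closest class mean."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return lambda x: min(centroids, key=lambda c: math.dist(x, centroids[c]))


def fit_one_nearest_neighbour(X, y):
    """Return a predict function that copies the label of the closest training sample."""
    train = list(zip(X, y))
    return lambda x: min(train, key=lambda pair: math.dist(x, pair[0]))[1]


# Registries make the components interchangeable ("plug and play").
SELECTORS = {"variance": select_by_variance, "mean_gap": select_by_mean_gap}
CLASSIFIERS = {"nearest_centroid": fit_nearest_centroid, "1nn": fit_one_nearest_neighbour}


def run_pipeline(selector, fit, X_train, y_train, X_test, k):
    """Select k features on the training data only, then fit and predict."""
    keep = selector(X_train, y_train, k)
    project = lambda X: [[row[j] for j in keep] for row in X]
    predict = fit(project(X_train), y_train)
    return [predict(x) for x in project(X_test)]


def accuracy(predicted, truth):
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is large-scale noise.
X_train = [
    [0.1, 5.0, 1.0, 0.3], [0.2, -4.0, 1.1, 0.2], [0.0, -1.0, 0.9, 0.4],
    [2.0, -5.0, 1.0, 0.3], [2.1, 4.0, 1.1, 0.2], [1.9, 1.0, 0.9, 0.4],
]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[0.15, 2.0, 1.0, 0.3], [2.05, -2.0, 1.0, 0.3]]
y_test = [0, 1]

# Evaluate every selector/classifier combination.
results = {
    (s_name, c_name): accuracy(
        run_pipeline(selector, fit, X_train, y_train, X_test, k=1), y_test
    )
    for (s_name, selector), (c_name, fit) in product(SELECTORS.items(), CLASSIFIERS.items())
}

for combination, score in results.items():
    print(combination, score)
```

Because the selector is applied inside `run_pipeline`, it only ever sees training data, which avoids leaking test information into the feature choice. Adding a method means adding one entry to the corresponding registry.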

To explore the effects of varying sample sizes on the pipeline configurations, each configuration will be evaluated using k-fold cross-validation with step-wise decreasing sample sizes to compare results and find a minimum viable sample size. Additionally, the best resulting combination will be applied to various data sets gathered by the FAU university hospital, each composed of relatively few samples. On these data sets, the feature selection and classification methods will be further evaluated with respect to their explainability using SHAP (SHapley Additive exPlanations) [7].
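The sample-size experiment described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis protocol: the data are synthetic Gaussian values that only mimic the class sizes of the IBD data set (120 patients, 26 controls), a nearest-centroid classifier stands in for a full pipeline configuration, the folds are not stratified, and the fractions and all function names are hypothetical.

```python
import math
import random
from statistics import mean


def fit_nearest_centroid(X, y):
    """Return a predict function that assigns the label of the closest class mean."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return lambda x: min(centroids, key=lambda c: math.dist(x, centroids[c]))


def stratified_subsample(y, fraction, rng):
    """Draw the given fraction of indices from each class, keeping the class ratio."""
    chosen = []
    for label in sorted(set(y)):
        members = [i for i, lab in enumerate(y) if lab == label]
        keep = max(2, round(len(members) * fraction))
        chosen += rng.sample(members, keep)
    return chosen


def k_fold_accuracy(X, y, k, rng):
    """Mean accuracy over k folds (plain, non-stratified folds for brevity)."""
    indices = list(range(len(X)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predict = fit_nearest_centroid([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(predict(X[i]) == y[i] for i in fold) / len(fold))
    return mean(scores)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in data: 20 features, only the first two carry class signal.
y = [1] * 120 + [0] * 26
X = [[rng.gauss(1.5 * label if j < 2 else 0.0, 1.0) for j in range(20)] for label in y]

# Step-wise decreasing sample sizes, each evaluated with 5-fold cross-validation.
curve = []
for fraction in (1.0, 0.75, 0.5, 0.25):
    subset = stratified_subsample(y, fraction, rng)
    score = k_fold_accuracy([X[i] for i in subset], [y[i] for i in subset], k=5, rng=rng)
    curve.append((len(subset), score))

for n_samples, score in curve:
    print(f"{n_samples:4d} samples: mean CV accuracy {score:.3f}")
```

A minimum viable sample size could then be read off such a curve as the smallest size whose score stays within a chosen tolerance of the full-data score. With real data, repeating each subsample several times would reduce the variance of that estimate.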

References:
[1] Omar Ahmed and Adnan Brifcani. Gene Expression Classification Based on Deep Learning. In Proc. 4th Sci. Int. Conf. Najaf (SICN), pages 145–149, Al-Najef, Iraq, 2019.
[2] Nabendu Bhui, Pintu Kumar Ram, and Pratyay Kuila. Feature Selection from Microarray Data based on Deep Learning Approach. In Proc. 11th Int. Conf. on Comput., Commun. and Netw. Technol. (ICCCNT), pages 1–5, Kharagpur, India, 2020.
[3] Maisa Daoud and Michael Mayo. A survey of neural network-based cancer prediction models from microarray data. Artif. Intell. in Med., 97:204–214, 2019.
[4] Rasool Fakoor, Faisal Ladhak, Azade Nazi, and Manfred Huber. Using deep learning to enhance cancer diagnosis and classification. In Proc. 30th Int. Conf. Machine Learning (ICML), volume 28, pages 3937–3949, Atlanta, GA, USA, 2013.
[5] Mukesh Kumar, Nitish Kumar Rath, and Santanu Kumar Rath. Analysis of microarray leukemia data using an efficient MapReduce-based K-nearest-neighbor classifier. J. of Biomed. Inform., 60:395–409, 2016.
[6] Qingzhong Liu, Andrew H Sung, Zhongxue Chen, Jianzhong Liu, Lei Chen, Mengyu Qiao, Zhaohui Wang, Xudong Huang, and Youping Deng. Gene selection and classification for cancer microarray data based on machine learning and similarity measures. BMC Genomics, 12(5):1–12, 2011.
[7] Scott M Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. In Adv. Neural Inf. Process. Syst. (NeurIPS), pages 4768–4777, Long Beach, CA, USA, 2017.
[8] Heike Schmitt, Ulrike Billmeier, Walburga Dieterich, Timo Rath, Sophia Sonnewald, Stephen Reid, Simon Hirschmann, Kai Hildner, Maximilian J Waldner, Jonas Mudter, Arndt Hartmann, Robert Grützmann, Clemens Neufert, Tino Münster, Markus F Neurath, and Raja Atreya. Expansion of IL-23 receptor bearing TNFR2+ T cells is associated with molecular resistance to anti-TNF therapy in Crohn’s disease. Gut, 68(5):814–828, 2019.
[9] Shamveel Hussain Shah, Muhammad Javed Iqbal, Iftikhar Ahmad, Suleman Khan, and Joel JPC Rodrigues. Optimized gene selection and classification of cancer from microarray gene expression data using deep learning. Neural Comput. and Appl., pages 1–12, 2020.
[10] Reinel Tabares-Soto, Simon Orozco-Arias, Victor Romero-Cano, Vanesa Segovia Bucheli, José Luis Rodríguez-Sotelo, and Cristian Felipe Jiménez-Varón. A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data. PeerJ Comput. Sci., 6:e270, 2020.
[11] Kelli L VanDussen, Aleksandar Stojmirović, Ta-Chiang Liu, Patrick K Kimes, Jacqueline G Perrigoue, Joshua R Friedman, Jennifer E Towne, Richard D Head, and Thaddeus S Stappenbeck. P066 Intestinal Epithelial Microvilli Are Abnormal in Crohn’s Disease. Gastroenterology, 154(1):35, 2018.