An Nguyen (M.Sc.), Thomas Altstidl (M.Sc.), Dr. Dario Zanca, Prof. Dr. Björn Eskofier
07/2021 – 01/2022
Data fusion of multiple modalities is commonly done somewhere between the source (data-level) and the target (decision-level). However, for mixed-type time series consisting of both sequences of categorical events  and real-valued continuous signals, this proves difficult because the data sources are incompatible. This thesis explores and benchmarks methods to overcome this inherent incompatibility on both synthetic and real-world datasets.
Reliable methods for processing mixed-type time series are desirable since combining information from multiple modalities originating from the same underlying process can lead to improved robustness of models . Furthermore, multimodal machine learning can utilize potential complementary information within data sources, unavailable to unimodal approaches . Finally, multimodal systems can be more robust towards noisy inputs and compensate for missing modalities by relying on others .
Prior research has already proposed methods for fusing both modalities. Approaches range from data-level fusion  to feature-level fusion [6, 7] and decision-level fusion . The following directions are proposed to build on existing methods:
- To process irregular time series, standard machine learning methods can be adjusted to capture temporal information . Further exploring time-aware methods might prove beneficial.
- Some multimodal approaches like  require time-series inputs to be of equal or constant length. To bypass this limitation, sequences in mixed-type scenarios are often processed individually. Fusing intermediate temporal representations of both modalities might aid in capturing inter-modality correlations.
- Embedded sequences are often concatenated. Exploring more elaborate fusion methods like multi-level fusion  or canonical-correlation analysis (CCA) based fusion methods [12, 13] might lead to better results.
- A versatile data-fusion method could solve many of the problems mentioned above. Preprocessing methods like adaptive segmentation  and symbolic aggregate approximation  can bring both data types closer together, potentially simplifying the fusion process.
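To illustrate how symbolic aggregate approximation (SAX) can bring a continuous signal closer to a categorical event sequence, the following is a minimal sketch using only the Python standard library. The function name `sax`, the segment count, and the four-letter alphabet are assumptions chosen for illustration; this is not the method evaluated in the thesis.

```python
from statistics import NormalDist, mean, pstdev


def sax(signal, n_segments, alphabet="abcd"):
    """Convert a real-valued series into a SAX word (one symbol per segment)."""
    n = len(signal)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(signal)")

    # Step 1: z-normalize the series (a constant series maps to all zeros).
    mu, sigma = mean(signal), pstdev(signal)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in signal]

    # Step 2: piecewise aggregate approximation, i.e. the mean of each
    # (near-)equal-width segment.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    paa = [mean(z[bounds[i]:bounds[i + 1]]) for i in range(n_segments)]

    # Step 3: breakpoints that cut the standard normal distribution into
    # len(alphabet) equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # Step 4: the symbol index is the number of breakpoints below the segment mean.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


print(sax([0, 0, 0, 0, 10, 10, 10, 10], n_segments=2))  # ad
print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4))      # abcd
```

The resulting word is a sequence of categorical symbols, so it could in principle be handled by the same sequence models as the event modality. The choice of segment length and alphabet size controls how much of the original signal is preserved.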
Because few mixed-type time-series datasets are publicly available, synthetic datasets are often used, as in . For the generation of event sequences, temporal point processes like the Hawkes process are useful tools.
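As a rough illustration of how a Hawkes process can generate synthetic event times, the sketch below simulates a univariate process with an exponential kernel using Ogata's thinning algorithm. The function name `simulate_hawkes`, the kernel form, and the parameter values are assumptions made for this example; the thesis may use a different parameterization or a library implementation.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, t_max, seed=0):
    """Sample event times on [0, t_max) from a univariate Hawkes process with
    intensity lambda(t) = mu + alpha * sum_i exp(-beta * (t - t_i)).

    Assumes mu > 0 and alpha < beta (branching ratio below 1), so the
    process does not explode.
    """
    if mu <= 0 or alpha < 0 or beta <= 0 or alpha >= beta:
        raise ValueError("require mu > 0, 0 <= alpha < beta")

    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # self-exciting part of the intensity at the current time t

    while True:
        # Between events the intensity only decays, so its current value
        # is a valid upper bound for the thinning step.
        lam_bar = mu + excitation
        wait = rng.expovariate(lam_bar)
        t += wait
        if t >= t_max:
            return events

        # Decay the excitation to the candidate time.
        excitation *= math.exp(-beta * wait)

        # Accept the candidate with probability lambda(t) / lam_bar.
        if rng.random() * lam_bar <= mu + excitation:
            events.append(t)
            excitation += alpha  # each accepted event raises the intensity


events = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.0, t_max=50.0, seed=1)
print(len(events), "events; first few:", [round(t, 2) for t in events[:5]])
```

Categorical marks could be attached to each sampled time (for example, by drawing an event type per timestamp) to obtain sequences of categorical events; how the marks depend on the event history is a separate modeling choice.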
[1] N. Du, et al.: Recurrent Marked Temporal Point Processes: Embedding Event History to Vector. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1555-1564, 2016.
[2] G. Potamianos, et al.: Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91.9, 1306-1326, 2003.
[3] T. Baltrušaitis, et al.: Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41.2, 423-443, 2019.
[4] X. Yang, et al.: Deep Multimodal Representation Learning From Temporal Data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1, 5447-5455, 2017.
[5] H.P. Martínez, et al.: Deep Multimodal Fusion: Combining Discrete Events and Continuous Signals. Proceedings of the 16th International Conference on Multimodal Interaction, 34-41, 2014.
[6] S. Xiao, et al.: Modeling the Intensity Function of Point Process Via Recurrent Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence 31, 2374-3468, 2017.
[7] S. Xiao, et al.: Learning Time Series Associated Event Sequences With Recurrent Point Process Networks. IEEE Transactions on Neural Networks and Learning Systems 30.10, 3124-3136, 2019.
[8] B. Cao, et al.: DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 747-755, 2017.
[9] I. M. Baytas, et al.: Patient Subtyping via Time-Aware LSTM Networks. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 65-74, 2017.
[10] S. Du, et al.: A Hybrid Method for Traffic Flow Forecasting Using Multimodal Deep Learning. International Journal of Computational Intelligence Systems 13.1, 85-97, 2019.
[11] V. Sindagi, et al.: Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting. Proceedings of the IEEE/CVF International Conference on Computer Vision, 1002-1012, 2019.
[12] Q. Sun, et al.: Feature fusion method based on canonical correlation analysis and handwritten character recognition. 8th Control, Automation, Robotics and Vision Conference, 1547-1552, 2004.
[13] W. Zuobin, et al.: Feature Regrouping for CCA-Based Feature Fusion and Extraction Through Normalized Cut. 21st International Conference on Information Fusion, 2275-2282, 2018.
[14] L. Liu, et al.: Learning Hierarchical Representations of Electronic Health Records for Clinical Outcome Prediction. AMIA Annual Symposium Proceedings 2019, 597-606, 2019.
[15] J. Zhao, et al.: Learning from heterogeneous temporal data in electronic health records. Journal of Biomedical Informatics 65, 105-119, 2017.
[16] L. Feremans, et al.: Pattern-Based Anomaly Detection in Mixed-Type Time Series. Machine Learning and Knowledge Discovery in Databases ECML PKDD, 240-256, 2019.