Darko Boshkovski

Bachelor's Thesis

Applying speech recognition models to handwriting recognition

Advisors
Mohamad Wehbi (M.Sc.), Prof. Dr. Björn Eskofier

Duration
08/2020 – 12/2020

Abstract

Online handwriting recognition (OHWR) transforms text written with a digitizer into a form that can be interpreted by computer systems [1]. Speech recognition (SR) translates spoken language into a digital form that computer systems can process [2]. Handwriting and speech are both forms of linguistic communication with a similar objective, and thus share features in terms of structure and composition [3]. In terms of data types, data in both fields can be thought of as time series that encode a sequence of letters or phonemes [4].
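To make the time-series view concrete, the following minimal sketch turns a pen trajectory into per-step feature frames, loosely analogous to the frames fed to an SR model. The sample layout (x, y, timestamp, pen-down flag) and the function name `to_frames` are assumptions chosen for illustration; they are not taken from the thesis or from any particular dataset format.

```python
from typing import List, Tuple

# Hypothetical sample layout: (x, y, timestamp_seconds, pen_down)
Sample = Tuple[float, float, float, bool]
Frame = Tuple[float, float, float, int]


def to_frames(samples: List[Sample]) -> List[Frame]:
    """Convert absolute pen samples into per-step frames (dx, dy, dt, pen_down)."""
    frames: List[Frame] = []
    # Pair each sample with its successor and store the differences.
    for prev, cur in zip(samples, samples[1:]):
        dx = cur[0] - prev[0]
        dy = cur[1] - prev[1]
        dt = cur[2] - prev[2]
        frames.append((dx, dy, dt, int(cur[3])))
    return frames


# Three made-up samples; the pen is lifted at the last one.
stroke = [(0.0, 0.0, 0.00, True), (1.0, 2.0, 0.01, True), (3.0, 2.0, 0.02, False)]
frames = to_frames(stroke)
```

Using differences rather than absolute coordinates makes the frames independent of where on the writing surface the text starts, which is one common design choice among several.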

Prior work in the OHWR domain has successfully adapted methods from the SR field [5], relying mainly on Dynamic Time Warping techniques [6] and Hidden Markov Models [7]. Other works use SR language models as a backend: optical character recognition (OCR) methods segment and recognize sentences, words, or letters, and language models from SR methods are then used for text generation [8]. Additionally, some neural networks designed for the SR domain have been successfully applied to online handwriting (OHW), such as Multi-state time delay neural networks (MSTDNN) [9] or Convolutions, LSTMs, and DNNs (CLDNN) [10], with some modifications to the networks to suit the OHW datasets. These methods were evaluated on different datasets, and no common benchmark was defined for comparing their efficiency. Moreover, the use of pretrained SR models fine-tuned on OHW datasets has not been covered in prior work.
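As an illustration of the Dynamic Time Warping idea mentioned above, the sketch below computes the classic DTW distance between two sequences of feature vectors. It is the textbook dynamic-programming formulation, not the cluster generative statistical variant of [6], and the function name `dtw_distance` is an assumption made for this example.

```python
import math
from typing import List, Sequence


def dtw_distance(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Classic DTW: minimal summed Euclidean cost over monotonic alignments."""
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] and b[:j]
    cost: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local distance between frames
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same trajectory written at two different speeds.
fast = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
slow = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (2.0, 2.0)]
distance = dtw_distance(fast, slow)
```

Because DTW may repeat frames on either side, two sequences that differ only in speed align at zero cost, which is why the technique suits both speech and pen trajectories.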

In this thesis, different SR methods are studied and re-implemented on OHW datasets to evaluate the efficiency of such methods on raw OHW data. The availability of speech recognition models (Deep Speech 2, WaveNet, etc.) [11, 12] allows models pre-trained on large speech datasets to be adapted, using the available features, to OHW datasets, where far less data is available. These approaches will be used to develop algorithms for recognition at the character and word level on the public IAM-OnDB dataset [13].

References:
[1] K. Dharmapala, Online handwriting recognition systems, International Journal of Scientific and Engineering Research, vol. 7, pp. 475–481, 2016.
[2] L. Deng and X. Li, Machine learning paradigms for speech recognition: An overview, IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, pp. 1060–1089, 2013.
[3] M. Huckvale, Purpose: the missing link in speech and handwriting recognition, 1994.
[4] Speech and Handwriting Recognition, pp. 345–379. London: Springer London, 2008.
[5] T. Starner, J. Makhoul, R. M. Schwartz, and G. Chou, On-line cursive handwriting recognition using speech recognition methods, Proceedings of ICASSP ’94. IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 5, pp. V/125–V/128, 1994.
[6] C. Bahlmann and H. Burkhardt, The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 299–310, 2004.
[7] J. Hu, S. G. Lim, and M. K. Brown, Writer independent on-line handwriting recognition using an hmm approach, Pattern Recognit., vol. 33, pp. 133–147, 2000.
[8] R. M. Schwartz, C. LaPre, J. Makhoul, C. Raphael, and Y. Zhao, Language-independent ocr using a continuous speech recognition system, Proceedings of 13th International Conference on Pattern Recognition, vol. 3, pp. 99–103, 1996.
[9] S. Jäger, S. Manke, J. Reichert, and A. H. Waibel, Online handwriting recognition: the npen++ recognizer, International Journal on Document Analysis and Recognition, vol. 3, pp. 169–180, 2001.
[10] V. Carbune, P. Gonnet, T. Deselaers, H. A. Rowley, A. N. Daryin, M. C. Lafarga, L.-L. Wang, D. Keysers, S. Feuz, and P. Gervais, Fast multi-language lstm-based online handwriting recognition, International Journal on Document Analysis and Recognition (IJDAR), vol. 23, pp. 89–102, 2020.
[11] D. Amodei et al., Deep Speech 2: End-to-end speech recognition in English and Mandarin, arXiv, abs/1512.02595, 2016.
[12] A. van den Oord et al., WaveNet: A generative model for raw audio, arXiv, abs/1609.03499, 2016.
[13] M. Liwicki and H. Bunke, IAM-OnDB – an on-line English sentence database acquired from handwritten text on a whiteboard, Eighth International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 956–961, 2005.