Applying speech recognition models to handwriting recognition

Bachelor's Thesis

Mohamad Wehbi (M.Sc.), Prof. Dr. Björn Eskofier

08/2020 – 12/2020


Online handwriting recognition (OHWR) transforms text written with a digitizer into a form that
computer systems can interpret [1]. Speech recognition (SR) translates spoken language into a
digital form that computer systems can process [2]. Both fields deal with forms of linguistic
communication and pursue a similar objective, and thus share features in terms of structure and
composition [3]. In terms of data types, the data in both fields can be viewed as time series
that encode a sequence of letters or phonemes [4].
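
To make the time-series view concrete, the following minimal sketch flattens pen strokes into one
sequence of per-sample frames, much as audio is cut into feature frames. The `Stroke` type, the
function name `strokes_to_sequence`, and the (dx, dy, dt, pen_up) feature layout are illustrative
assumptions, not the representation used in the thesis or in any cited work.

```python
from typing import List, Tuple

# Assumed input format: a stroke is a list of (x, y, t) samples recorded
# while the pen touches the writing surface.
Stroke = List[Tuple[float, float, float]]
Frame = Tuple[float, float, float, int]


def strokes_to_sequence(strokes: List[Stroke]) -> List[Frame]:
    """Flatten strokes into a time-ordered list of (dx, dy, dt, pen_up) frames.

    dx, dy, dt are differences to the previous sample. pen_up is 1 when the
    sample starts a new stroke (the pen was lifted before it), else 0.
    The very first sample has no predecessor and therefore yields no frame.
    """
    frames: List[Frame] = []
    prev = None
    for stroke in strokes:
        for i, (x, y, t) in enumerate(stroke):
            if prev is not None:
                px, py, pt = prev
                pen_up = 1 if i == 0 else 0
                frames.append((x - px, y - py, t - pt, pen_up))
            prev = (x, y, t)
    return frames


# Two short strokes, with timestamps in milliseconds.
example = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 10.0)],
    [(1.0, 1.0, 30.0), (2.0, 1.0, 40.0)],
]
frames = strokes_to_sequence(example)
```

A sequence of such frames plays the same role for a handwriting model that a sequence of acoustic
feature vectors plays for a speech model, which is what makes transferring SR architectures
plausible.
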
Prior work in the OHWR domain has successfully adapted methods from the SR field [5], relying
mainly on Dynamic Time Warping techniques [6] and Hidden Markov Models [7]. Other works use SR
language models as a backend: optical character recognition (OCR) methods segment and recognize
sentences, words, or letters, and language models from SR systems then generate the text [8].
Additionally, some neural networks designed for the SR domain have been successfully applied to
online handwriting (OHW), such as multi-state time delay neural networks (MS-TDNN) [9] and
Convolutions, LSTMs, and DNNs (CLDNN) [10], with some modifications to the networks to suit the
OHW datasets. These methods were evaluated on different datasets, and no common benchmark for
comparing their efficiency was defined. Moreover, the use of pre-trained SR models and their
fine-tuning on OHW datasets has not been covered in prior work.
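
As an illustration of one of the SR techniques named above, the sketch below computes a classic
Dynamic Time Warping distance between two feature sequences of different lengths. It is the
textbook dynamic-programming formulation with a Euclidean local cost, not the cluster generative
statistical variant of [6]; the function name `dtw_distance` and the choice of local cost are
assumptions made for this example.

```python
import math
from typing import List, Sequence


def dtw_distance(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Return the accumulated cost of the best monotonic alignment of a and b.

    Each sequence is a list of equal-dimension feature vectors. A frame may be
    matched to several frames of the other sequence, which absorbs differences
    in writing (or speaking) speed.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j].
    cost: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same trajectory written at two speeds aligns with zero cost.
fast = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
slow = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
same_shape = dtw_distance(fast, slow)
```

This version runs in O(n·m) time and memory; practical recognizers usually add a warping-window
constraint to bound both.
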
In this thesis, different SR methods are studied and re-implemented on OHW datasets to evaluate
the efficiency of such methods when applied to raw OHW data. The availability of speech
recognition models (Deep Speech 2, WaveNet, etc.) [11, 12] makes it possible to take models
pre-trained on large speech datasets and adapt them to the features available in OHW datasets,
where far less data is available. These approaches will be used to develop algorithms for
recognition at the character and word level using the public IAM-OnDB dataset [13].



[1] K. Dharmapala, Online handwriting recognition systems, International Journal of Scientific
and Engineering Research, vol. 7, pp. 475–481, 2016.
[2] L. Deng and X. Li, Machine learning paradigms for speech recognition: An overview,
IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, pp. 1060–1089,
[3] M. Huckvale, Purpose: the missing link in speech and handwriting recognition, 1994.
[4] Speech and Handwriting Recognition, pp. 345–379. London: Springer London, 2008.
[5] T. Starner, J. Makhoul, R. M. Schwartz, and G. Chou, On-line cursive handwriting recognition
using speech recognition methods, Proceedings of ICASSP ’94. IEEE International
Conference on Acoustics, Speech and Signal Processing, vol. v, pp. V/125–V/128 vol. 5,
[6] C. Bahlmann and H. Burkhardt, The writer independent online handwriting recognition
system frog on hand and cluster generative statistical dynamic time warping, IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 299–310, 2004.
[7] J. Hu, S. G. Lim, and M. K. Brown, Writer independent on-line handwriting recognition
using an HMM approach, Pattern Recognit., vol. 33, pp. 133–147, 2000.
[8] R. M. Schwartz, C. LaPre, J. Makhoul, C. Raphael, and Y. Zhao, Language-independent
OCR using a continuous speech recognition system, Proceedings of 13th International
Conference on Pattern Recognition, vol. 3, pp. 99–103 vol. 3, 1996.
[9] S. Jäger, S. Manke, J. Reichert, and A. H. Waibel, Online handwriting recognition: the
NPen++ recognizer, International Journal on Document Analysis and Recognition, vol. 3,
pp. 169–180, 2001.
[10] V. Carbune, P. Gonnet, T. Deselaers, H. A. Rowley, A. N. Daryin, M. C. Lafarga,
L.-L. Wang, D. Keysers, S. Feuz, and P. Gervais, Fast multi-language LSTM-based online
handwriting recognition, International Journal on Document Analysis and Recognition
(IJDAR), vol. 23, pp. 89–102, 2020.
[11] Amodei, Dario et al. Deep Speech 2: End-to-End Speech Recognition in English and
Mandarin. ArXiv abs/1512.02595 (2016).
[12] Oord, Aaron van den et al. WaveNet: A Generative Model for Raw Audio. ArXiv
abs/1609.03499 (2016).
[13] Marcus Liwicki and Horst Bunke, IAM-OnDB – an on-line English sentence database acquired
from handwritten text on a whiteboard, Eighth International Conference on Document
Analysis and Recognition (ICDAR), vol. 2, pp. 956–961, 2005.