Thomas Röthenbacher

Thomas Röthenbacher

Master's Thesis

Automated German Crossword Resolution

Advisors
Dr. Dario Zanca, Kai Klede (M.Sc.), Prof. Dr. Björn Eskofier, Dr. Andrea Zugarini (Expert.AI), Dr. Marco Ernandes (Expert.AI)

Duration
10 / 2022 – 04 / 2023

Abstract

The Crossword Puzzle (CP) is a game found in many newspapers and magazines around the world. Because of this, there are many different forms (or grids) and clues formats used in them. The complexity of the clues ranges from simple questions of definitions to more complicated puns. Some clues also only require basic or historical knowledge, while others depend on an understanding of current events.

As the forms of questions vary not only between different crosswords but also between each clue in a single puzzle, crossword solving poses a particularly interesting form of Question Answering (QA). There already exist multiple automated crossword solvers showing very promissing results that are able to compete with human solvers. However, as of now, no solver has been developed to solve German crossword puzzles.

One such automated crossword solver is Webcrow 2.0 [1]. It works in a tow-step approach, where the first step consists of generating a probability-weighted list of candidates for each clue, which then in the next step are used by a constraint satisfaction problem solver to find the most probable, valid assignment for the whole crossword. To generate the lists of candidates, the system relies on several expert modules. These modules include a database lookup module, a search-engine module and some rule-based modules to solve specific types of clues. While this type of matching was originally done by looking for exact or partial matches, they have since shifted to an approach based on semantic search. Zugarini et al. [2] relies on pre-trained language models like Word2Vec [3] and Universal Sentence Encoder [4] to generate word embeddings for clues. This is, however, complicated by the fact that answers in crosswords usually do not include spaces, even though some of the answers might be whole phrases or sentences. This makes it hard for pre-trained language models to find a suitable embedding for these kind of answers. To remedy this, one first has to segment the answers in the database into individual words before computing the embedding [5].

This thesis aims to adapt the precious implementation of expert modules in Webcrow 2.0 to the German language and additionally implement a module for segmentation of words and phrases. Adapting the implemented approaches to German crosswords can potentially prove to be difficult, because not only has German itself some peculiarities like compound words, but also the stlye of German crossword creators will most likely vary to some degree from writers of English or Italian crosswords.

References
[1] M. Ernandes, G. Angelini, and M. Gori, “Webcrow: A web-based system for crossword solving,” in AAAI, pp. 1412-1417, 2005.
[2] A. Zugarini and M. Ernandes, “A multi-strategy approach to crossword clue answer retrieval and ranking,” in Proceedings of the Eighth Italian Conference on Computational Linguistics, CLiC-it 2021, Milan, Italy, January 26-28, 2022 (E. Fersini, M. Passarotti, and V. Patti, eds.), vol. 3033 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient representation of word representations in vector space,” in Proceedings of the international workshop on learning representations (ICLR), 2013.
[4] Y. Yang, D. Cer, A. Ahman, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y.-H. Sung, et al., “Multilingual universal sentence encoder for semantic retrieval,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 87-94, 2020.
[5] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, and D. Klein, “Automated crossword solving,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Dublin, Ireland), pp. 3073-3085, Association for Computational Linguistics, May 2022.