Megh Vora

Master's Thesis

Multi-Modal AI-System for Ingredient Recognition and Nutritional Analysis from Food Images

Advisors

Rebecca Lennartz (M.Sc.),  Yannick Wiesner (viatolea), Pauline Nöldemann (viatolea), Prof. Dr. Björn Eskofier

Duration

03/2025 – 09/2025

Abstract

Ingredient recognition from food images is increasingly important for applications in diet tracking, personalized nutrition, and healthcare. For individuals with intolerances to lactose, fructose, or sorbitol, it is not sufficient to identify dishes like “pasta” or “curry”; detailed knowledge of the ingredients is essential to ensure dietary safety. Advances in image-based food recognition have already simplified food tracking. However, most existing systems only classify entire dishes and lack the granularity required for ingredient-level nutritional analysis [1].

Recent advances in multimodal artificial intelligence, particularly vision-language models such as BLIP-2 [2], have demonstrated the ability to generate descriptive captions from food images. When combined with large language models (LLMs) [3][4], these captions can potentially be transformed into structured ingredient lists. Nonetheless, the reliability and applicability of such systems for real-world dietary use, especially in regional contexts such as Germany, remain largely unexplored. This is especially relevant when nutritional assessments require mapping to standardized food composition databases like the Bundeslebensmittelschlüssel (BLS) [5], which provide compound-specific information on allergens and intolerances. For individuals with food intolerances, the reliability and safety of ingredient recognition systems are critical, as inaccurate or incomplete detection can pose serious health risks.
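For context, generating a caption with a publicly released BLIP-2 checkpoint takes only a few lines via Hugging Face transformers. A minimal sketch follows; the checkpoint name, prompt, and image path are illustrative assumptions, not the fine-tuned model this thesis aims to develop:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("meal.jpg").convert("RGB")  # placeholder path
# A question-style prompt nudges the model toward ingredient-level output.
inputs = processor(
    images=image,
    text="Question: Which ingredients are visible in this dish? Answer:",
    return_tensors="pt",
).to(device)
generated_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```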

The goal of this thesis is to investigate the feasibility of developing a modular, lightweight AI system that enables ingredient-level recognition from food images and assesses intolerance-related risks using the BLS. The proposed system architecture consists of a vision-language model (BLIP-2) for caption generation, an LLM-based module for extracting and normalizing ingredients, and an analysis layer for intolerance detection based on BLS mappings.
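As a rough illustration of this modularity, a Python skeleton of the three stages might look as follows; all function names, types, and fields are hypothetical placeholders, not a fixed implementation:

```python
from dataclasses import dataclass

@dataclass
class IntoleranceReport:
    ingredient: str
    bls_code: str | None       # matched BLS key, or None if unmatched
    risk_compounds: list[str]  # e.g. ["lactose"]

def generate_caption(image_path: str) -> str:
    """Stage 1: vision-language model (BLIP-2) describes the dish."""
    raise NotImplementedError

def extract_ingredients(caption: str) -> list[str]:
    """Stage 2: LLM turns the caption into a normalized ingredient list
    (synonym resolution, deduplication, canonical food names)."""
    raise NotImplementedError

def analyze_intolerances(ingredients: list[str]) -> list[IntoleranceReport]:
    """Stage 3: map ingredients to BLS entries and flag lactose-,
    fructose-, and sorbitol-relevant compounds."""
    raise NotImplementedError

def run_pipeline(image_path: str) -> list[IntoleranceReport]:
    return analyze_intolerances(extract_ingredients(generate_caption(image_path)))
```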

Specifically, the research questions of this thesis are:

  1. Can a vision-language model (e.g., BLIP-2) be effectively fine-tuned to generate ingredient-specific image captions for food items?
  2. Can LLMs accurately extract and standardize ingredient data for structured dietary analysis by resolving synonyms, deduplicating entries, and aligning entries with database formats (a prompt sketch follows this list)?
  3. Can the extracted ingredient data be reliably linked to entries in the Bundeslebensmittelschlüssel (BLS) to identify intolerance-relevant compounds and nutritional metadata (a matching sketch follows as well)?
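
For the second question, one hedged sketch of the extraction step, assuming an OpenAI-compatible chat API as one possible LLM backend (model name and prompt are placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Extract all ingredients from this dish description. "
    "Resolve synonyms to canonical German food names, remove duplicates, "
    'and answer only with JSON: {"ingredients": ["..."]}.\n\nDescription: '
)

def extract_ingredients(caption: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any instruction-tuned LLM works
        messages=[{"role": "user", "content": PROMPT + caption}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)["ingredients"]
```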
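For the third question, a minimal sketch of linking ingredient names to BLS entries via fuzzy string matching; the BLS rows shown are invented examples (the licensed database would replace them), and embedding-based retrieval could replace difflib in practice:

```python
import difflib

# Hypothetical excerpt: BLS entry name -> intolerance-relevant compounds.
BLS_EXCERPT = {
    "Milch, Kuhmilch, 3,5% Fett": ["lactose"],
    "Apfel, roh": ["fructose", "sorbitol"],
    "Weizenmehl, Type 405": [],
}

def match_bls(ingredient: str, cutoff: float = 0.6) -> tuple[str, list[str]] | None:
    """Return the closest BLS entry and its flagged compounds, if any."""
    hits = difflib.get_close_matches(ingredient, BLS_EXCERPT.keys(), n=1, cutoff=cutoff)
    if not hits:
        return None
    return hits[0], BLS_EXCERPT[hits[0]]

print(match_bls("Apfel"))  # -> ('Apfel, roh', ['fructose', 'sorbitol'])
```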

References

[1] W. Min, S. Jiang, L. Liu, Y. Rui, R. Jain, and M. Wang, “A survey on food computing,” ACM Computing Surveys, vol. 52, no. 5, pp. 1–36, 2019, doi: 10.1145/3329168.

[2] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” arXiv preprint arXiv:2301.12597, 2023.

[3] Z. Zhu and Y. Dai, “Food Ingredients Identification by Deep Learning,” Journal of Computer and Communications, vol. 9, no. 10, pp. 33–45, 2021.

[4] R. S. Abdul Kareem, A. A. Al-Shamaa, and F. A. Altamimi, “Food classification and recipe extraction using DNNs and NLP,” Computers in Biology and Medicine, vol. 175, p. 108528, 2024, doi: 10.1016/j.compbiomed.2024.108528.

[5] Max Rubner-Institut, Bundeslebensmittelschlüssel (BLS), Version 3.02. Karlsruhe, Germany: Max Rubner-Institut, 2023. [Online]. Available: https://www.blsdb.de/

[6] Machine Learning and Data Analytics (MaD) Lab, FAU, “Happy Plate,” 2024. [Online]. Available: https://www.mad.tf.fau.de/research/projects/innovation-lab-for-wearable-and-ubiquitouscomputing/happy-plate/ [Accessed: Mar. 2, 2025].

[7] PYRA MEDI, “viatolea,” 2024. [Online]. Available: https://viatolea.de/