1 Introduction

Today’s population lives in a fast-paced society with challenging work environments, a multitude of leisure activities, and increasingly little knowledge about the origin and nutritional value of food. Intelligent dietary self-management systems can save time, improve personal nutrition and thus lead to healthier living and reduced stress. In recent programs such as the Food Scanner Prize of the European Commission (EC), institutions and authorities invest considerable effort in developing systems that reduce food-related problems.

Such a system should meet four general demands. First, it should be a portable solution packed into a small mobile device. Second, it must be simple to use without prior knowledge of nutrition and with only basic computer skills. Third, it should work fast and reliably. Finally, it should provide valuable feedback to users regarding their health and lifestyle, resulting in better decision making.

We propose a system that aims to satisfy all of these requirements and is intended for nutrition-aware persons. Our system is developed as a portable, easy-to-use application running on a smartphone. It enables situated dietary information assistance and simplifies food choices with the aim of improving the user’s overall health and well-being. Our proposed system has two core components (Fig. 1). First, we have implemented a mobile application that integrates personalized dietary concerns into a recommendation system used during grocery shopping. The underlying diet has been developed by medical experts in the field of nutrition and metabolism [18]. The user can take a survey to obtain a customized grocery basket and additionally enter information about her personal condition, which is used for calculating personal energy expenditure. Second, we have developed a fast, lightweight computer vision component that lets the user retrieve information from the food database by pointing the device at grocery items for automatic recognition, instead of tediously entering information by hand. The image recognition system runs with high accuracy on a large set of grocery food classes. It facilitates information retrieval, providing added benefit to the user.

Fig. 1.

System overview. The dietary self-management system consists of two core elements: a personalized list of recommended groceries and computer vision based assistance for information retrieval.

2 Related Work

Dietary Mobile Applications. Mobile health and wellness is a rapidly expanding market, with innovative dietary management apps emerging daily. For example, LoseIt Footnote 1 aids the user in losing weight by setting daily calorie limits and monitoring food intake. It also features a recognition system coupled to a database of dishes, from which the user must select the appropriate one. In contrast, our system is designed to aid already during the food selection process in grocery stores and targets a more general audience that wants to improve its eating behaviour. ShopWell Footnote 2 rates scanned foods and provides appropriate recommendations according to a personalized profile. Its scanning works for barcodes only; barcode scanning is also available in our application in addition to automated video recognition. Several EC research programs fund the investigation of dietary management, e.g., for the care of the elderly. CordonGris Footnote 3 manages relevant data for healthy diet recommendation from different sources: activity sensors, food composition tables, and retailers’ or service providers’ information. HELICOPTER Footnote 4 exploits ambient-assisted living techniques and provides older adults and their informal caregivers with support, motivation and guidance in pursuing a healthy and safe lifestyle, including decision making on nutrition during grocery shopping. ChefMySelf Footnote 5 is a customizable, open and extensible ecosystem built around an automatic cooking solution to support the elderly in preparing healthy meals.

Food Recognition Systems. Most research in computer vision based methods targets the recognition of meals as well as the extraction of the components of plated food. The first food recognition systems were introduced in the late 90s; “Veggie Vision” [2] eases the checkout process at supermarkets. The topic regained attention recently with published food datasets for the comparison of methods, e.g., PFID [5], UNICT-FD889/1200 [7, 8], Food-101 [3], and UECFOOD-100/256 [14, 21]. Until the recent rise of CNN based methods [13], which automatically learn optimal feature representations from thousands of images, researchers mostly combined handcrafted color and texture descriptors with SVM classifiers or other kernel methods. In [12], a CNN recognizes the 10 most frequent food items in the FoodLog [17] image collection and is able to distinguish food from non-food items. In [15, 16] a CNN is fine-tuned on 1000 food-related classes from the ImageNet database [6]. Recent wider [19] and deeper [11] CNNs boosted the results at the cost of high computational requirements. Compared to our method, which runs at 10 fps on standard smartphones, most of the afore-mentioned classification approaches are intractable on mobile devices.

Dietary Self-Management Systems. Few available applications combine dietary mobile systems with automated food recognition. In [20], a mobile recipe recommender recognizes ingredients and retrieves recipes online. A mobile application proposed in [1] supports type 1 diabetes patients in counting carbohydrates and provides insulin dose advice. “Snap-n-Eat” [27] identifies food and portion size for calorie estimation by incorporating contextual features (restaurant locations, user profiles). None of the above aids the user as early as the food selection stage, but only when meals are already prepared. Some also rely on an internet connection, while our system runs entirely on the device. Compared to [26], this work uses a lightweight CNN instead of a Random Forest classifier, which allows recognizing more than twice as many classes. Personalization has also been lifted to a new level, comprising personalized energy expenditure calculation and target weight advice. Usability is evaluated via an innovative user study.

3 Personalized Dietary Self-Management System

The proposed system enables dietary self-management on a mobile device. It includes integrated nutrition assistance based on an augmented reality recommender component. This recommender assistant provides an intuitive interface and is supported by video based food recognition. A user-specific profile is assessed by a dietary questionnaire on first use. Afterwards, upon selection of food items, either from automated video recognition or from manual user selection, tailored nutritional advice is given to the user depending on her profile.

3.1 Dietary Concept for Self-Management

The recommender system builds on a personalized dietary concept that removes stringent rules (e.g., calorie counting, physical activity demands). Instead of forcing the patient to give up all potentially bad eating habits at once, the diet slowly changes nutrition habits to lose (or gain) weight. Considering that every person has individual nutritional requirements and every fruit or vegetable has its own composition of micronutrients, a correct food combination is indispensable to fulfill one’s individual demand of essential nutrients. The utilized diet incorporates this by defining several groups according to lifestyle, job, age, gender and the intensity of personal exercise. The groups are connected to different stress types and include nutrition recommendations, built upon individual baskets of commodities composed based on the investigation of a large medical dataset of 17,000 entries in [18]. Our mobile application incorporates these baskets and provides situated feedback on the user’s food choices during grocery shopping, where decision making for mid- and long-term food choices actually takes place. This aims at a lasting change in eating behaviour for increased mental and physical performance according to the functional eating diet [26]. The app automatically classifies presented food; upon commodity selection, the user receives recommendations triggered by her individual profile and is presented detailed nutrition information. This includes micronutrients with corresponding health claims as well as further food recommendations matching the user profile.

3.2 Personalization

Personalization of the self-management system is based on two main factors: first, the assignment of the user to a certain nutrition group (see [26]) and, second, customized energy expenditure calculation. The questionnaire for user-to-group assignment consists of multiple rating-scale questions from which the user type can be derived, e.g., “Do you often feel stressed?” with possible answers ranging from “1, not at all” to “4, very often”; a sketch of such an assignment is given below. We now describe how the personalized energy expenditure is calculated, as it is influenced by the height, age, body type and physical activity level (PAL) of a person.
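For illustration, a minimal sketch of the questionnaire-based group assignment. The actual questions and the mapping of answer scores to the nutrition groups of [26] are not part of this paper; the scoring rule, thresholds and group names below are purely hypothetical assumptions.

# Hypothetical sketch of the user-to-group assignment; the real mapping
# is defined by the dietary concept of [26] and is not reproduced here.
def assign_nutrition_group(answers):
    """answers: list of rating-scale values, each between 1 ("not at all")
    and 4 ("very often"), one per questionnaire item."""
    score = sum(answers) / len(answers)        # mean lifestyle/stress rating
    if score < 2.0:
        return "group_low_stress"
    elif score < 3.0:
        return "group_medium_stress"
    return "group_high_stress"

# Example: a user answering mostly "3" lands in the medium-stress group.
print(assign_nutrition_group([3, 2, 3, 3, 2]))   # -> group_medium_stress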

To account for different body types (slim, normal, muscular) and gender differences, the body structure of a person is incorporated through a weighting term \(\delta \), which ranges from 0.945 (slim) to 1.055 (muscular) for men and from 0.900 to 1.000 for women. A personalized target weight \(w_{p}\) is calculated by multiplying the body weight w with \(\delta \): \(w_{p}=w \,*\, \delta \). The energy demand E as defined by [10] is adapted to the newly calculated weight, with separate formulas for men and for women, where l is the height and \(\alpha \) is the age of the person. Finally, the PAL is considered through a factor \(\gamma _{PAL}\). It reflects the energy demand in dependence of physical activity and is therefore very suitable for personalized energy expenditure calculation. PAL factors range from \(\gamma _{PAL}=1.2\) for elderly people without any physical activity to \(\gamma _{PAL}=3.3\) for construction workers spending \(20+\) hours on sport. The final personalized daily energy expenditure is calculated as \(E_{p} = E \,*\, \gamma _{PAL}\).
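A minimal sketch of this calculation, assuming the sex- and age-specific energy demand formula of [10] is supplied externally (it is not reproduced in this paper); only the weighting by \(\delta \) and \(\gamma _{PAL}\) follows the text above.

# Personalized energy expenditure: w_p = w * delta, E_p = E * gamma_PAL.
# `basal_energy` stands in for the energy demand E of [10] (not given here).
def personalized_energy(weight_kg, delta, pal, basal_energy):
    """delta: body-structure weighting (0.945-1.055 for men, 0.900-1.000 for women).
    pal: physical activity level factor (1.2 ... 3.3).
    basal_energy: callable implementing E from [10], evaluated at the target weight."""
    w_p = weight_kg * delta          # personalized target weight
    e = basal_energy(w_p)            # energy demand E adapted to the target weight
    return w_p, e * pal              # personalized daily expenditure E_p

# Example (the lambda is a placeholder for the formula of [10], purely illustrative):
w_p, e_p = personalized_energy(80.0, 0.945, 1.6, lambda w: 24.0 * w)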

3.3 Mobile Recognition System

We improve the usability of the recommender system by automatically classifying food items with a Convolutional Neural Network (CNN). The user taps on an item, confirms the classification result and is shown the desired information, instead of performing a cumbersome manual search. Our goal is a fast, scalable classifier, so we implement a shallow CNN running within our Android-based mobile application at 10 fps. We design the CNN to have minimal complexity while still achieving good accuracy.

4 Experimental Results

We evaluate our system with respect to usability through a user study and measure the performance of the recognition system on a novel grocery database.

4.1 Usability

For the purpose of user-centered optimisation of the novel computer vision based interface design, an innovative interaction and usability analysis was performed with 16 persons (\(M=26.3\) years of age). We used eye tracking to evaluate the automated nutrition information feedback interface component and evaluated the user experience of the complete app. Eye tracking is an established method for evaluating novel interface designs, for example via fixation durations on objects in the user interface: depending on the context, high numbers of fixations indicate less efficient search strategies, and long fixation durations indicate difficulties of the user with the perception of the display [9]. Test persons were equipped with SMI eye tracking glasses, the viewing behavior was video captured, and fixations on the display were localised [23]. Based on the investigation of the Seven Stages of (Inter-)Action [22] and the corresponding fixation analysis, the interaction design was updated and optimised towards a SUS usability score [4] of \(80\%\) and a user experience evaluation (UEQ [25]) of \(72\%\) (\(\pm 5\%\), \(90\%\) confidence interval; [24]), which represent high scores considering the early stage of development. See Fig. 2 for illustrations.
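For reference, a minimal sketch of how a standard SUS score [4] such as the one reported above is computed from the ten questionnaire items (each rated 1 to 5); the example responses are illustrative, not data from our study.

# Standard SUS scoring: odd items contribute (rating - 1), even items (5 - rating);
# the 0-40 sum is scaled to a 0-100 score.
def sus_score(responses):
    """responses: list of 10 Likert ratings (1-5), item 1 first."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses):
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5

# Example: print(sus_score([5, 1, 4, 2, 4, 2, 4, 2, 4, 2]))  -> 80.0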

Fig. 2.

Innovative interaction and usability analysis: (a) mobile eye tracking glasses, (b) automated gaze localisation, and analysis of the seven stages of interaction from (c) stage durations and (d) corresponding fixation analysis.

4.2 Recognition System

We evaluate our method on a newly recorded dataset, which we term FruitVeg-81 Footnote 6. The database contains 15,630 images of 81 raw fruit and vegetable classes. The proposed CNN processes images of size \(56\times 56\) pixels and consists of three convolutional layers: the first two have size \(5\times 5\times 32\), the third \(5\times 5\times 64\). We apply pooling with size 3 and stride 2 after each layer. After the third convolutional layer we add a 1024-dimensional fully connected layer with dropout and a soft-max classification layer with 81 units for food classification, or 82 units when integrating a garbage class. We subtract the training set mean and train the network in mini-batches of size 128; the training set is shuffled at the beginning of the training procedure. As it is unlikely to reach \(100\%\) accuracy in practical use, we give the user several choices for selecting the correct food item and reflect this in our experiments by reporting \(top\text {-}1\) to \(top\text {-}5\) accuracies.
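A minimal Keras sketch of this architecture. Padding, activation functions, dropout rate and optimizer are not specified in the text and are assumptions here; only the layer sizes, pooling and output units follow the description above.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(num_classes=81):       # 82 when the non-food/garbage class is added
    return models.Sequential([
        layers.Input(shape=(56, 56, 3)),
        layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Conv2D(64, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

# Report top-1 and top-5 accuracy, as in our experiments.
model = build_cnn()
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.TopKCategoricalAccuracy(k=5)])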

Baseline. As a baseline, we evaluate all models on the 81 classes using leave-one-out cross-validation. We augment the training data with mirroring, cropping, rotating and color shifting. At test time we use mirroring and random crops.
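A sketch of such training-time augmentation (mirroring, cropping, rotation, color shifting). The source image size, rotation steps and shift magnitudes are not specified in the paper and are chosen here purely for illustration.

import tensorflow as tf

def augment(image):                                           # float32 image, e.g. [64, 64, 3]
    image = tf.image.random_flip_left_right(image)            # mirroring
    image = tf.image.random_crop(image, size=[56, 56, 3])     # cropping to network input size
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, tf.int32))  # rotation
    image = tf.image.random_brightness(image, max_delta=0.2)  # color shift (brightness)
    image = tf.image.random_hue(image, max_delta=0.05)        # color shift (hue)
    return image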

Non-food Class. For our real-world application it is important to reduce false positives, e.g., on food items missing from the visual database. When the application recognizes non-food items, appropriate feedback is displayed on the screen. We extract around 200 random images from 500 non-food categories of the ImageNet Challenge [6] and use those to add another category to the CNN. Due to the high variance within this non-food class, we set its amount of training images to 10 times the average number of images per food category. For testing, we use the average number of per-class test images.
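The rough arithmetic behind the size of the non-food class, derived from the dataset figures above; the exact counts used in practice may differ slightly.

# Size of the non-food class relative to the 81 food classes.
total_food_images = 15630
num_food_classes = 81
avg_per_class = total_food_images / num_food_classes     # ~193 images per food class

non_food_train = 10 * avg_per_class                      # ~1930 non-food training images
non_food_test = avg_per_class                            # average per-class test size
print(round(avg_per_class), round(non_food_train))       # 193 1930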

Results are listed in Table 1; it can be seen that the image quality of the mobile phones differs. On average, a top-5 accuracy of around \(90\%\) shows the good performance of the trained network. As we add one more class to the system, the accuracy decreases; however, the resulting drop of the \(top\text {-}1\) mean accuracy is stronger than expected. This is presumably due to the very heterogeneous structure and different domain of the non-food samples, which are hard to model with the limited number of parameters. On the other hand, the \(top\text {-}5\) accuracy is stable, which is the desired behavior for our application.

Table 1. Results for baseline and integration of a non-food class. The mean \(top\text {-}k\) accuracy ranges from \(69.77\%\) to \(90.19\%\) for the baseline and from \(60.47\%\) to \(90.41\%\) for non-food integration (best \(top\text {-}1\) accuracy is \(76.14\%\) and \(71.74\%\)). With non-food integration the \(top\text {-}1\) mean accuracy drops by roughly \(9\%\), while the \(top\text {-}5\) mean accuracy remains the same.

5 Conclusion

We have presented an innovative mobile application with a recommender engine and a fast recognition system running at 10 fps as its core elements. The recommender engine supports users in decision making during grocery shopping and helps to improve health conditions based on scientific findings. The recognition system robustly recognizes food and non-food items. Along with this publication we make our grocery dataset FruitVeg-81 publicly available, with the intention that it be used by researchers around the globe to improve their nutrition-related computer systems.