
1 Introduction and Motivations

Analysis and understanding of food images is a challenging Computer Vision task which has gathered much interest from the research community due to its potential impact on the quality of life of modern society [1]. In this context, the main problems considered by the community are the discrimination of food images from other images [11, 15, 16], the detection/localization of food in images [17, 23], the recognition and classification of the food depicted in an image [19,20,21], the segmentation of food images to distinguish the different parts and ingredients [18, 22, 25], and the estimation of the volume and nutrients contained in a food plate detected in an image [24, 26, 27]. A major issue in this application domain is the limited availability of public datasets, as well as the lack of common procedures for the testing and evaluation of the different tasks. Although some food datasets exist [1, 28], their size and variability are still too limited to properly feed the modern supervised learning approaches currently employed to solve different Computer Vision tasks [29].

In recent years, considering the advancements in the fields of Computer Vision and Machine Learning, the research community has been making a great effort in designing and investigating intelligent systems able to help people in their daily activities [2]. Different studies have been proposed to design robotic personal assistants [4, 5] and advanced wearable vision systems that help people augment their memory [6,7,8], as well as to monitor daily activities in order to improve quality of life [9, 10]. The main motivation behind these studies is to help society by exploiting the advancements of computer and engineering science. In this regard, this paper builds on the following question: can we train a Computer Vision system to recognize the eating utensils to be used during a meal, in order to help patients with dementia remember how to eat food?

Eating is an important aspect of life. It is important to satisfy hunger, to stimulate our senses, and to share moments with others, but, most importantly, to acquire the nutrients needed to live and be in good health. While recognizing which utensils to use when eating a meal might seem straightforward, it is not so simple for people affected by dementia, such as Alzheimer's disease. When the disease becomes severe, patients can experience problems in using utensils during a meal because of memory loss and other mental disabilities. The ability to correctly recognize which utensils to use during a meal is one of the aspects analyzed to monitor the functional abilities of dementia patients [3]. Patients often do not remember how to use eating utensils, and in the late stages of the disease, caregivers usually resort to food which can be eaten with the fingers.

Figure 1 illustrates the investigated problem: given a food image, the Computer Vision engine should be able to predict which utensils are to be used to consume the meal. We would like to note that the proposed investigation is of interest from both an application and a theoretical standpoint. On the one hand, the proposed system can be used for practical purposes. For instance, it could be exploited in a wearable device or in a robotic personal assistant to provide suggestions to patients during meals, a task usually performed by caregivers in real life. On the other hand, we find it interesting to investigate to what extent visual features can be used to infer higher-level concepts such as the utensils to be used for meal consumption.

Fig. 1. The investigated problem.

To benchmark the problem, we consider the UNICT-FD1200 dataset [1]. To perform the experiments, each image of the dataset has been labeled according to five different classes related to the utensils to be used for meal consumption: Chopsticks, Fork, Fork and Knife, Hands, and Spoon. We investigate an approach based on the combination of features extracted using the AlexNet CNN architecture proposed in [12] with a Support Vector Machine to perform classification [14]. This simple pipeline obtains a classification accuracy of \(86.27\%\).

The paper is organized as follows. Section 2 summarizes the representation and classification components adopted to address the considered problem. Section 3 details the experimental settings and discusses the results. Conclusions are given in Sect. 4.

Fig. 2. The considered approach, based on features extracted using the AlexNet CNN architecture and an SVM classifier.

2 Food Image Representation and Classification

In our experiments, we consider an image representation based on deep features. In particular, we use the AlexNet deep learning architecture proposed in [12]. The model has been pre-trained to categorize images from ImageNet into 1000 different object classes. The AlexNet architecture has 8 layers, plus a Softmax module at the end of the network. In our experiments, we use the activations of the seventh, fully connected layer (FC7) as features. We choose to extract features from the FC7 layer since such activations are believed to have high semantic relevance while being more general than the 1000 activations of the FC8 layer, which are to be considered class-related scores. Classification is performed using a multiclass SVM classifier [14] with an RBF (Radial Basis Function) kernel. See Fig. 2 for a diagram of the approach.
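As an illustration of this representation step, the sketch below truncates a pre-trained AlexNet right after the FC7 ReLU and extracts the resulting 4096-dimensional activation for a single image. It uses the torchvision implementation of AlexNet as a stand-in for the Caffe model employed in our pipeline; the preprocessing constants and the function name extract_fc7 are illustrative assumptions, not taken from the original implementation.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Load AlexNet pre-trained on ImageNet (torchvision stand-in for the
# Caffe model used in the paper).
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()  # disable dropout at inference time

# FC7 is the second fully connected layer: keep the classifier up to and
# including its ReLU (indices 0-5 of the Sequential), dropping FC8.
fc7 = nn.Sequential(alexnet.features, alexnet.avgpool, nn.Flatten(),
                    *list(alexnet.classifier.children())[:6])

# Standard ImageNet preprocessing (an assumption, not from the paper).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_fc7(image_path: str) -> torch.Tensor:
    """Return the 4096-dimensional FC7 activation for one image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return fc7(x).squeeze(0)
```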

The SVM has been trained on a balanced set of images, with equal amounts of images for each of the considered classes, i.e., Chopsticks, Fork, Fork and Knife, Hands, and Spoon. The hyper-parameters (e.g., the cost C and the \(\gamma \) of the RBF kernel) are optimized using cross-validation (a possible setup is sketched below).
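Our pipeline relies on LibSVM; the following sketch shows an equivalent cross-validated grid search over C and \(\gamma \) using scikit-learn. The grid values and the synthetic placeholder data are illustrative only, since the actually searched ranges are not reported here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X_train: (n_samples, 4096) FC7 features; y_train: utensil class labels.
# Synthetic placeholders so the snippet runs stand-alone.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 4096))
y_train = rng.integers(0, 5, size=100)  # 5 utensil classes

# Grid values are illustrative; the paper does not report the ranges searched.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, "scale"]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```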

3 Experimental Settings and Results

We consider the UNICT-FD1200 dataset [1] for our experiments. The dataset contains 4754 images of 1200 distinct dishes of food characterized by different nationalities (e.g., English, Japanese, Indian, Italian, Thai, etc.). Each dish has been acquired with a smartphone several times to introduce geometric and photometric variability into the dataset (e.g., flash vs. no flash, different rotations, multiple scales, and different points of view). To carry out the proposed investigation, each image of the dataset has been manually labeled considering the following classes: Chopsticks, Fork, Fork and Knife, Hands, and Spoon. Examples of images belonging to the UNICT-FD1200 dataset are shown in Fig. 3, whereas the number of images belonging to each class is reported in Table 1.

Fig. 3. Examples of images belonging to the UNICT-FD1200 dataset. Each row corresponds to a specific class: (1) Chopsticks, (2) Fork, (3) Fork and Knife, (4) Hands, (5) Spoon.

To perform the evaluation, the dataset has been randomly divided into three balanced, non-overlapping subsets. The three different splits allow us to obtain three independent training set/test set pairs. Once the FC7 features are extracted for all the images in the dataset, the SVM classifier is trained and tested on the three different splits. Accuracy values over the three runs are then averaged to assess the overall performance (a sketch of this protocol is given below). The proposed method is implemented using the Caffe library [13] to extract the FC7 features from a pre-trained AlexNet model [12] and LibSVM [14] to implement the multiclass classifier.
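A minimal sketch of this evaluation protocol follows, assuming the FC7 features have already been extracted: three stratified folds stand in for the three balanced, non-overlapping splits, and the reported figure is the mean accuracy over the three runs. The hyper-parameter values and the placeholder data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# X: (n_images, 4096) FC7 features for the whole dataset; y: utensil labels.
# Synthetic placeholders so the snippet runs stand-alone.
rng = np.random.default_rng(0)
X = rng.standard_normal((150, 4096))
y = rng.integers(0, 5, size=150)

accuracies = []
for train_idx, test_idx in StratifiedKFold(
        n_splits=3, shuffle=True, random_state=0).split(X, y):
    # C and gamma are hypothetical values; in the actual pipeline they
    # are tuned by cross-validation on the training split.
    clf = SVC(kernel="rbf", C=10, gamma=1e-3)
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean accuracy over the three runs: {np.mean(accuracies):.4f}")
```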

Table 2 summarizes the results and reports the performance of the classifier in the different runs. Table 3 reports the confusion matrix with respect to the five considered classes. The approach obtains good results for the Fork and Knife class, probably because the dataset contains more images of this class than of the others. The method has difficulties in recognizing images belonging to the Chopsticks class (e.g., noodle plates), which are confused with those in which a fork is used during the meal.

Table 1. Per-class number of images in the UNICT-FD1200 dataset.
Table 2. Accuracy of our classification model.
Table 3. Confusion Matrix. Rows report real classes, while columns report predicted ones.
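As a side note, a confusion matrix such as the one in Table 3 can be computed directly from the predictions accumulated over the test splits; a minimal sketch with toy labels follows (variable names are ours, not from the original implementation).

```python
from sklearn.metrics import confusion_matrix

CLASSES = ["Chopsticks", "Fork", "Fork and Knife", "Hands", "Spoon"]

# Toy ground-truth and predicted labels (indices into CLASSES),
# standing in for the labels accumulated over the three test splits.
y_true = [0, 0, 1, 2, 2, 3, 4]
y_pred = [1, 0, 1, 2, 2, 3, 4]

cm = confusion_matrix(y_true, y_pred, labels=list(range(len(CLASSES))))
for name, row in zip(CLASSES, cm):
    print(f"{name:>15}: {row}")  # rows: real classes, columns: predictions
```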

4 Conclusions

We have considered the problem of recognizing the utensils to be used during meal consumption. The investigation is of both practical interest (e.g., to design systems to assist people with mental disabilities) and theoretical interest (i.e., to assess whether higher-level concepts related to how to eat food can be inferred from visual features). To address the problem, we augmented the UNICT-FD1200 dataset with labels related to the utensils to be used to consume the food depicted in the images. Experiments show that even a simple pipeline based on AlexNet features and an SVM classifier can be leveraged to perform classification, although it should be considered only a baseline approach to be improved upon.