1 Introduction

Food is the essence of our lives. As time has evolved, food forms have multiplied, and so has their complexity [41]. The main reason for this complexity is their mutual similarity: many different food items look alike, while images of the same item can differ greatly. This contrast makes it very difficult to classify and detect them. Image recognition could be a good solution to food recognition and classification [26].

Food detection and classification have been investigated in the literature using different methods, such as deep learning and mobile computing devices [11]. Convolutional Neural Networks (CNNs) are currently the most widely used technique for image recognition [11].

Many methods employ images containing a single food item, which makes recognition easier, but unfortunately little work has been done on multiple-food recognition. Deep learning approaches, in contrast, are being used extensively for recognition. Heravi et al. [11] used an optimized CNN to recognize images from datasets created using public-domain images.

Wang et al. [34] proposed a weakly supervised learning algorithm that performs food detection and classification using gaze features gathered with an eye tracker. This approach has the advantage that gaze features are required only at training time, while testing is gaze free. Kawano et al. [17] proposed a food recognition system to monitor nutrient and calorie intake and eating habits. They used GrabCut for segmentation, a SURF-based bag of features for feature extraction, and a linear SVM for classification. This approach does not need a server, as the system was implemented on an Android smartphone.

Oliveira et al. [24] addressed food detection based on image recognition. They developed a camera-based smartphone system that detects and classifies multiple foods present on a single plate, proposing a novel segmentation algorithm based on region growing that operates in parallel on multiple feature spaces. More recently, a large collection of food datasets capable of training very complex deep learning architectures was proposed, covering a robust and dynamic range of food images, together with a high-level classification model that works well across such varied food classes.

Matsuda and Yanai [22] introduce a manifold ranking method that exploits co-occurrence statistics to model similarities between different food items in a single image. Zhu et al. [40] introduced a two-step approach to automatically recognize multiple food items present in an image: first, a segmentation process iteratively fragments a food image into similar object classes; second, the segmented regions are classified with an SVM to label them appropriately.

Although promising, CNN-based architectures have some weaknesses. First, because convolutions operate with a fixed window size, the model tends to capture local information rather than long-range spatial relationships between different parts of the image and the image as a whole [23]. Second, some local information is lost through max-pooling. Inspired by natural language processing, vision transformers have recently been proposed as an alternative to CNNs and have shown promising results in computer vision [23]. A vision transformer is free of convolutions and treats an image as a sequence of patches, thereby overcoming the locality and translation-invariance constraints faced by CNNs. This approach has been observed to use hardware resources more efficiently than CNNs and can be pre-trained on the public ImageNet dataset with fewer resources [23].

In this paper, a novel synergistic model based on a vision transformer and hand-crafted features is proposed. Our approach applies the Transformer architecture with self-attention to sequences of image patches without using convolution layers. In addition, ancillary hand-crafted features are computed to provide supplementary information about the food images. Our proposed approach makes the following contributions:

  • We introduced a dataset comprising different varieties of food dishes from every domain and culture, collected from publicly available food logging systems;

  • We have proposed a new mathematical formulation of the objective function of our proposed architecture;

  • We optimized the hyperparameters of the transformer, showing that the vision transformer significantly improves food recognition accuracy compared with other state-of-the-art algorithms that use only hand-crafted features or CNN features;

  • Our hybrid approach shows excellent results on every type and aspect of data (compared on other available datasets), leading to unparalleled results for the task of food detection when compared with state-of-the-art algorithms.

The rest of the paper is organized as follows: Section 2 describes the food datasets employed for the study. Section 3 defines the preliminaries for the proposed approach. Section 4 describes our proposed hybrid vision transformer system, its architecture, components, and algorithms. Section 5 presents the evaluation of the results. Section 6 discusses the drawn conclusions, the comparison between ViT and CNN, and future work.

2 Material and methods

This section gives an overview of the food datasets employed for the study. It elaborates on our created food dataset as well as the food datasets available in the literature that are used to demonstrate the generalizability and robustness of our proposed approach.

2.1 Food image dataset

An extensive dataset enables learning more detailed and fine-grained features, thus helping to combat overfitting. Therefore, the goal was to develop a robust dataset with diverse food images in which each item is represented by as many images as possible; such images ultimately lead to more accurate predictions. We therefore collected a public dataset of images to support realistic feature recognition with high precision and accuracy, as shown in Fig. 1 and Table 2. For this, we gathered approximately 300 to 400 images of the different food items [19]. We compared the results of our proposed approach on the food datasets available in the previous literature (PFID, FOOD85, UECFOOD-100, UECFOOD-256, UMPCFOOD-101, UNICT-FD1200, UNIMIB2016, and VIREO). Table 1 summarizes these datasets, giving the size and number of food classes for each. We considered only publicly available datasets with a minimum of 100 images per class (Table 2).

Fig. 1
figure 1

Input images of different food items (Our Proposed)

Table 1 Food Datasets available in the literature
Table 2 Our Proposed Food Dataset

3 Preliminaries

A. Convolutional Neural Networks (CNN or ConvNet) A ConvNet is a class of machine learning model that uses a variation of the multilayer perceptron designed to require minimal preprocessing. The network can be trained with forward propagation and backpropagation. The main advantage of a CNN is that it requires minimal input pre-processing compared with other methods, which has brought further advances in the processing of images, audio, speech, and video [10].

B. Random Forest (RF) Random forests are ensemble learning architectures. They are built from several decision trees at training time and output the aggregated class (considering the outputs of all the decision trees) [1].

When a random forest is used to solve regression problems, the mean squared error (MSE) measures how the data branch from each node. The following equation calculates the distance of every node's predictions from the actual values, which helps in deciding the best branch in the forest. Here, \(y_i\) denotes the value of the data point used for testing at a particular node, and \(f_i\) denotes the value produced by the decision tree.

$$\begin{aligned} MSE=\frac{1}{N}\sum _{i=1}^{N}(f_i -y_i)^2 \end{aligned}$$
(1)

The Gini index guides the branching of the nodes in a decision tree for classification purposes. The equation uses the class label and a probability measure to determine each branch's Gini index at a node, which determines the most likely branch to occur. Here, \(p_i\) denotes the relative frequency of the class observed in the dataset, and C denotes the number of classes.

$$\begin{aligned} Gini=1-\sum _{i=1}^{C}(p_i)^2 \end{aligned}$$
(2)

Entropy uses the probability measure of a certain output to decide how the node should branch.

$$\begin{aligned} Entropy=\sum _{i=1}^{C}- p_i *\log _2(p_i) \end{aligned}$$
(3)
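
To make Eqs. (1)–(3) concrete, the following minimal NumPy sketch (illustrative only, not the paper's implementation) computes the impurity measures used when growing the trees of a random forest; `node_impurity` and `node_mse` are hypothetical helper names.

```python
import numpy as np

def node_impurity(labels):
    """Gini index (Eq. 2) and entropy (Eq. 3) of the class labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # relative class frequencies p_i
    gini = 1.0 - np.sum(p ** 2)
    entropy = -np.sum(p * np.log2(p))
    return gini, entropy

def node_mse(y_true, y_pred):
    """Mean squared error (Eq. 1) between node predictions and targets."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_pred - y_true) ** 2)

# Example: a node holding 4 samples of class 0 and 2 samples of class 1
print(node_impurity([0, 0, 0, 0, 1, 1]))      # -> (approx. 0.444, 0.918)
```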

C. Support Vector Machine (SVM) SVMs are used for pattern recognition and classification [16]. The goal of an SVM is to find the hyperplane that maximizes the distance between the data points of two classes, thus forming a binary linear classifier [35]. We have n training examples, each x having D dimensions and a class label of either y = +1 or y = -1, with the data separable by a straight line (hyperplane). The SVM constraints can be formulated as:

$$\begin{aligned} w.x_i+b \ge 1 \quad \forall y_i=+1 \end{aligned}$$
(4)
$$\begin{aligned} w.x_i+b \le -1 \quad \forall y_i=-1 \end{aligned}$$
(5)

Combining (4) and (5):

$$\begin{aligned} y_i(w.x_i +b)-1 \ge 0 \quad \forall y_i=\pm 1 \end{aligned}$$
(6)
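
As an illustration only (not the paper's implementation), the sketch below fits a linear SVM with scikit-learn on made-up toy data and checks the combined margin constraint of Eq. (6); the data and variable names are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels y in {+1, -1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Eq. (6): y_i (w . x_i + b) - 1 >= 0 for every training point
margins = y * (X @ w + b) - 1
print(margins)          # each value should be >= 0 (up to solver tolerance)
```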

D. Vision Transformer Approach For the past ten years, deep CNNs have been the go-to method for solving computer vision problems. CNNs have been successful in terms of performance; applying numerous convolutions to images to learn hierarchical spatial features has been their major advantage over standard machine learning algorithms [20]. Since 2012, CNN-based architectures such as VGG, ResNet, DenseNet, and EfficientNet [31] have been solving complex image recognition challenges on ImageNet [6]. Nevertheless, CNNs have flaws too. Because convolution operates on a fixed-size window, the model focuses more on detecting local details than on long-range semantic relationships among parts of the image and the image as a whole. Additionally, max-pooling can lead to a loss of information.

In the NLP field, attention-based architectures such as Transformers [32] have been developed to tackle various language-related tasks, such as language translation and text classification, more effectively. Significant performance gains have been achieved using transformers. Consequently, this has sparked great interest in the computer vision community in using similar self-attention models for vision tasks. Numerous computer vision works have focused on combining self-attention with CNN-like architectures, and some have tried to replace convolution entirely with self-attention mechanisms [36]. Nevertheless, these models have not yet scaled effectively on modern hardware accelerators owing to their specialized attention patterns.

The Vision Transformer (ViT) was recently proposed as the first deep neural network for large-scale computer vision datasets without convolution operators. ViT uses the original transformer developed in [32] for NLP tasks with some changes. First, the input images are split into patches that are arranged as linear embeddings forming the transformer's input. From an NLP point of view, image patches are treated as tokens (words), which are then used to train the network in a supervised manner. In contrast to CNNs, which encode prior knowledge about the spatial domain, transformers operate on vectors and need more data to discover such knowledge from high-dimensional, large-scale datasets. As far as we know, the first deployment of ViT for computer vision tasks was launched by Google, where a ViT was trained on an in-house dataset of 300 million images from the JFT dataset [30] and then fine-tuned on image recognition benchmarks such as ImageNet. Fine-tuning is used to increase the performance of ViT so that it matches state-of-the-art CNN models.

Within this model, the 2D image patches are flattened and reshaped into vectors, which form the required input of the transformer. The resulting vectorized image patches are then converted into linear patch embeddings. Meanwhile, a position embedding is added to each of these patch embeddings so that the network retains the positional information of each patch.

This works similarly to BERT's method: a class token ([cls], which stands for classification) is added at the beginning of the sequence of embedded patches as a learnable embedding. The transformer encoder in the main architecture, adopted from [32], consists of multiple encoder blocks, where each block contains multi-headed self-attention layers. Layer normalization is applied to the embedded patches before they are fed into the multi-head self-attention and again before the MLP blocks. The right side of Fig. 2 shows the general mechanism of an encoder.
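
A minimal NumPy sketch of the input pipeline just described: patches are flattened, linearly projected, a [cls] token is prepended, and position embeddings are added. The random matrices stand in for parameters that would be learned in an actual ViT; all names and sizes here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def embed_patches(image, patch=16, d_model=768, rng=np.random.default_rng(0)):
    """Split an HxWxC image into patch x patch tiles, flatten each tile,
    project it linearly, prepend a [cls] token and add position embeddings."""
    H, W, C = image.shape
    tiles = (image.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * C))              # N x (P*P*C)
    E = rng.normal(scale=0.02, size=(tiles.shape[1], d_model))  # patch projection (learned in practice)
    cls = rng.normal(scale=0.02, size=(1, d_model))             # [cls] token (learned in practice)
    x = np.vstack([cls, tiles @ E])                             # (N + 1) x d_model
    pos = rng.normal(scale=0.02, size=x.shape)                  # position embeddings (learned in practice)
    return x + pos

tokens = embed_patches(np.random.rand(224, 224, 3))   # -> shape (197, 768)
```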

Fig. 2
figure 2

ViT model architecture: an image is segmented into patches (e.g., \(16\times 16\)); after position embeddings are applied to the flattened patches, they are passed to the transformer encoder. NLZ: Normalization, MHA: Multi-Head Attention

Conclusively, the vision transformer divides an image into fixed-size patches, embeds each of them, and includes positional embeddings as input to the transformer encoder. Moreover, ViT models outperform CNNs by almost a factor of four in computational efficiency and accuracy, and they also perform better on small datasets. That is why our vision transformer approach outperforms other CNN networks.

The self-attention layer in ViT makes it possible to embed information globally across the overall image. The model also learns from the training data to encode the relative location of the image patches and thereby reconstruct the structure of the image.
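
To illustrate how a self-attention layer mixes information across all patches, here is a single-head scaled dot-product attention sketch in NumPy. It is a simplified stand-in for the multi-head attention of the encoder; the weight matrices `Wq`, `Wk`, `Wv` are assumed, learned parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: every token (patch embedding)
    attends to every other token, giving each one global context."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over tokens
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(197, 64))                        # 196 patch tokens + [cls]
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # -> (197, 64)
```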

4 Methodology

This section describes: (1) the pre-processing techniques applied to the dataset, (2) the proposed approach, and (3) the post-processing technique.

4.1 Pre-processing

Raw data does not yield good accuracy when fed directly to existing classification methods. Our goal is to improve accuracy using well-known preprocessing techniques. Pre-processing involves the following: denoising the image dataset, mean normalization, standardization, zero component analysis (ZCA), and calibration of the vision transformer hyperparameters [33]. A minimal sketch of these normalization and whitening steps is given after the list below.

  • Mean Normalization: The mean of every feature (image dimension) is computed over all training samples and then subtracted from each image. The entire training dataset is thereby transformed into a centred dataset, so that the brightness of the whole training set is normalized with respect to each dimension. This can be computed as:

    $$\begin{aligned} Y^{'}=Y- \mu \end{aligned}$$
    (7)

    where \(Y^{'}\) denotes the normalized data, Y denotes the original data, and \(\mu \) denotes the mean vector along all features of Y (Fig. 3).

  • Standardization: First, mean normalization of the data takes place; then the standard deviation across each feature of the training samples is calculated and divided out, making the final data mean- and variance-normalized. The raw data are thus organized by computing the mean and variance of each dimension of the training set. Standardization of the input data Y is done by (8).

    $$\begin{aligned} Y^{'}=\frac{(Y- \mu )}{\sigma } \end{aligned}$$
    (8)

    where \(Y^{\prime }\) denotes normalized data, \(\mu \) denotes the mean vector covering all the features, and \(\sigma \) denotes the standard deviation vector covering all the features.

  • Zero Component Analysis (ZCA): ZCA whitening makes the edges of objects more prominent; the subsequent layers can then extract richer features from these edge-enhanced feature maps.

    $$\begin{aligned} Y^{\prime }=\frac{Y}{255} \end{aligned}$$
    (9)

    Initially, the data Y are size-normalized by feature scaling using (9). In (10), V is the eigenvector matrix and S the eigenvalue matrix obtained from the singular value decomposition of the covariance matrix of \(Y^{\prime }\), diag(S) denotes the vector of its diagonal (eigenvalue) entries, \(V^{T}\) denotes the transpose of the eigenvector matrix V, and \(\varepsilon \) denotes the whitening coefficient. This pre-processing method introduces one more hyper-parameter, the whitening coefficient \(\varepsilon \).

    $$\begin{aligned} Y_{ZCA}=V\cdot diag\left( \frac{1}{\sqrt{diag(S)+\varepsilon }}\right) \cdot V^{T}\cdot Y^{\prime } \end{aligned}$$
    (10)
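
As referenced above, here is a minimal NumPy sketch of the three normalization steps (Eqs. (7)–(10)); the function names and the choice of \(\varepsilon\) are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def mean_normalize(Y):
    """Eq. (7): subtract the per-feature mean computed over all samples."""
    return Y - Y.mean(axis=0)

def standardize(Y):
    """Eq. (8): mean-normalize, then divide by the per-feature std deviation."""
    return (Y - Y.mean(axis=0)) / Y.std(axis=0)

def zca_whiten(Y, eps=1e-2):
    """Eqs. (9)-(10): scale pixel values to [0, 1], then apply ZCA whitening."""
    Y = Y / 255.0                                   # Eq. (9)
    Yc = Y - Y.mean(axis=0)                         # centre before whitening
    cov = np.cov(Yc, rowvar=False)
    V, S, _ = np.linalg.svd(cov)                    # eigenvectors V, eigenvalues S
    W = V @ np.diag(1.0 / np.sqrt(S + eps)) @ V.T   # Eq. (10)
    return Yc @ W.T
```
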
Fig. 3
figure 3

Flowcharts describing the complete process of the CNN model and the ViT model

4.2 Proposed hybrid vision transformer framework

The complete structure of our proposed approach is shown in Fig. 4. This section describes: (1) the structure of our proposed approach, (2) feature extraction using the vision transformer and hand-crafted descriptors, and (3) merging the outputs of both classifiers to obtain the final hybrid classification.

A combination of two classifiers is employed. The first uses the vision transformer approach to learn the most relevant features required for training and testing on the image dataset. The second classifier employs hand-crafted features applied to our dataset to extract the information present in the image itself. The hand-crafted features used here are LBP, GIST, and HoG [13].

Fig. 4
figure 4

Proposed methodology for food image classification, where FV: Feature Vector

4.2.1 Extracting features

This section discusses the hand-crafted and vision transformer feature extraction procedures, which are later combined to classify the input food images. The image database contains images \(x_i \in R^{D}\) (D is the number of dimensions), with i = 1, 2, 3, ..., N, N being the total number of image examples. Each image belongs to a corresponding class \(y_j\), where j = 1, 2, 3, ..., K, with K being the number of categories. For calculating the class scores (the mapping from image dimensions to image categories), let:

$$\begin{aligned} \quad \quad \quad \quad \quad f: R^D \rightarrow R^K \end{aligned}$$
(11)

A. Extracting hand-crafted features Different local hand-crafted features extract the essential characteristics of the images. Local Binary Pattern (LBP) is a robust feature extraction algorithm mainly used for texture recognition in computer vision; combined with the Histogram of Oriented Gradients (HoG), it considerably improves detection performance. GIST features are global image features that help characterize various important statistics of a scene. They are computed by convolving a filter bank with an image at different scales and orientations, so that the high- and low-frequency repetitive gradient directions of an image can be measured; the filter responses at each orientation and scale are used as the GIST features. Similarly, HoG is mainly exploited for object detection in images. HoG features describe the distribution of intensity gradients, capturing the shape and appearance of local objects in the image. The approach involves counting the occurrences of gradient orientations while maintaining invariance to photometric and geometric transformations.
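
As an illustration of how such descriptors can be computed (a sketch assuming scikit-image is used; GIST has no standard scikit-image implementation, so a simplified Gabor-filter-bank surrogate is shown, and all parameter values are assumptions rather than the paper's settings):

```python
import numpy as np
from skimage.feature import local_binary_pattern, hog
from skimage.filters import gabor

def handcrafted_features(gray):
    """gray: 2-D grayscale image with values in [0, 1]."""
    # LBP: histogram of uniform local binary patterns (texture)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

    # HoG: distribution of intensity-gradient orientations (shape/appearance)
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2))

    # GIST-like descriptor: mean Gabor responses over a few scales/orientations
    gist = [np.abs(gabor(gray, frequency=f, theta=t)[0]).mean()
            for f in (0.1, 0.25, 0.4)
            for t in np.linspace(0, np.pi, 4, endpoint=False)]

    return np.concatenate([lbp_hist, hog_vec, np.asarray(gist)])
```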

Further, for calculating the hand-crafted features, we performed the following computations. For every image sample, a class is associated, which can be represented as \( \varnothing ( \overrightarrow{x},y) \), where \(\overrightarrow{x}\) are the extracted features and y is their corresponding class. To extract only the effective and useful features from the image samples, we associate a weight vector with the images. At test time, the classifier chooses the class, which is calculated as:

$$\begin{aligned} \quad \quad \quad \quad \quad y_2 = {argmax_{y'}} \hspace{1mm} w^T \hspace{1mm} \varnothing (\overrightarrow{x},y') \end{aligned}$$
(12)

Collecting all the features into a structure gives:

$$\begin{aligned} \quad \quad \quad \quad \quad D_{h} = \left\{ (f(x_i),c_i) \mid 1< i< N \right\} \end{aligned}$$
(13)

where \(D_h\) is the image database and f(.) are the features extracted.

B. Extracting the vision transformer features For calculating the vision transformer-based features, we performed the following computations. For every image sample, a class is associated, which can be represented as \( \varnothing (\overrightarrow{x},y) \), where \(\overrightarrow{x}\) are the extracted features and y is their corresponding class. To extract only the effective and useful features from the image samples, we associate a weight vector with the images, and the predicted class is calculated as:

$$\begin{aligned} \quad \quad \quad \quad \quad y_1 = {argmax_{y'}} \hspace{1mm} w^T \hspace{1mm} \varnothing (\overrightarrow{x},y') \end{aligned}$$
(14)

Collecting all the features into a structure gives:

$$\begin{aligned} \quad \quad \quad \quad \quad D_{e} = \left\{ (f(x_i),c_i) \mid 1< i< N \right\} \end{aligned}$$
(15)

where \(D_e\) is the image database and f(.) are the extracted features. To predict the combined final score for each dish X, its class \(c^*\) is obtained by averaging the class scores produced from the two feature sets:

$$\begin{aligned} \quad \quad \quad \quad \quad final = avg \left\{ D_h,D_e \right\} \end{aligned}$$
(16)
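
A minimal sketch of the late fusion in Eq. (16), assuming each of the two classifiers returns a per-class score vector for a test dish (the function and variable names are hypothetical):

```python
import numpy as np

def fuse_and_predict(scores_handcrafted, scores_vit, class_names):
    """Average the per-class scores of the hand-crafted and ViT classifiers
    (Eq. 16) and return the class with the highest fused score."""
    fused = 0.5 * (np.asarray(scores_handcrafted) + np.asarray(scores_vit))
    return class_names[int(np.argmax(fused))]

# Example with three food classes
print(fuse_and_predict([0.2, 0.5, 0.3], [0.1, 0.3, 0.6],
                       ["pizza", "salad", "soup"]))   # -> "soup"
```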

C. Evaluate the classifier Both classifiers (hand-crafted and vision transformer based) predict the output classes for the input image dataset in the form of a predicted class score or posterior probability [15]. This comparison confirms that vision transformer features outperform hand-crafted features (Table 3). The final classification is then made by comparing the score values from both classifiers (as referenced in Fig. 5). The two classifiers produce score values for the test data that determine which one is more confident about the predicted label.

Table 3 Comparison between the first SVM classifier, based on vision transformer features, and the second SVM classifier, based on hand-crafted features

4.3 Post-processing

When we applied post-processing to the outputs of the hybrid model with a Likelihood Class Filter (LCF), we observed a further improvement in precision and recall. This improvement can be attributed to the effective exploitation of the probability information in the neighborhoods, which yielded higher accuracy values than the raw classification [28]. The likelihood class filter can be thought of as a kernel of \(N\times N\) elements, where N is normally an odd number with \(N \ge 3\), so that the kernel filters the central element. For example, with a \(3\times 3\) kernel, the LCF considers the 8-neighborhood classes and assigns the most probable class to the central element. Edges receive less attention because the \(3\times 3\) kernel cannot cover the elements on the boundary of the matrix. If a larger kernel size is used, classes with small spatial structures might go unnoticed, which decreases the final accuracy and also requires extensive computation time. The filter/kernel size is \(3\times 3\) in our work. Hence, only two conditions are considered for the probable class of the central element (a sketch of this filter is given after the two conditions below).

  1. Only a neighborhood class comprising an absolute majority of elements (condition 1): In a \(3\times 3\) kernel, at least p of the neighboring elements must share a class for it to be the likelihood class. In our work, p = 5, 6, 7, and 8 were tested. When p = 5, the LCF algorithm gives the best thematic map transformation, since \(p\ge 5\) already covers the cases of 6, 7, or 8 elements.

  2. A neighborhood class comprising a relative majority of elements (condition 2): This condition is determined by comparing the number of elements among the neighboring classes. The neighboring class with the most elements is chosen as the likelihood class; if neighboring classes have an equal number of elements, the class initially at the central element is kept. The \(2^{nd}\) condition therefore includes the \(1^{st}\), giving a larger thematic map transformation than the \(1^{st}\) condition, which in turn deteriorates the classification accuracy. It can be calculated as:

    $$\begin{aligned} f_j(Y)=\ln {p(Y/q_j)p(q_j)}=\ln {p(Y/q_j)}+\ln {p(q_j)} \end{aligned}$$
    (17)

    where j denotes the class, Y is the m-dimensional data vector (m is the number of bands), \(p(q_{j})\) is the probability that class \(q_{j}\) occurs in the image (assumed equal for all classes), \(\mid \Sigma _{j}\mid \) is the determinant of the covariance matrix of the data in class \(q_{j}\), \(\Sigma _{j}^{-1}\) is its inverse matrix, and \(n_{j}\) is the mean vector.

    $$\begin{aligned} Y \in q_{j} \quad \text {if} \quad f_j(Y)>f_i(Y) \quad \forall i \ne j \end{aligned}$$
    (18)
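
As referenced above, here is a sketch of the \(3\times 3\) likelihood class filter under condition 1 (absolute majority with p = 5), using SciPy's generic_filter; this is an illustrative reconstruction, not the paper's exact post-processing code.

```python
import numpy as np
from scipy.ndimage import generic_filter

def likelihood_class_filter(label_map, p=5):
    """Relabel each pixel to the class held by at least `p` of its 8
    neighbors in a 3x3 kernel; otherwise keep the centre label."""
    def relabel(window):
        centre = window[4]                      # centre of the flattened 3x3 window
        neighbors = np.delete(window, 4)
        values, counts = np.unique(neighbors, return_counts=True)
        best = counts.argmax()
        return values[best] if counts[best] >= p else centre
    return generic_filter(label_map, relabel, size=3, mode="nearest")

labels = np.array([[1, 1, 1],
                   [1, 2, 1],
                   [1, 1, 1]])
print(likelihood_class_filter(labels))          # the isolated centre 2 is relabelled to 1
```
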
Fig. 5
figure 5

Flow chart of Precision and Recall of different food classes of the proposed approach on our Proposed Food Dataset

5 Results and discussion

This section elaborates on the experiments performed on the different food datasets available in the literature (including our created dataset), demonstrating and analyzing how our proposed approach outperforms the state-of-the-art algorithms.

A. Experimental Scenario 1 (CNN-2L+RF): In this architecture, we combined two CNNs for feature extraction, one for the left half of the image and the other for the right half. We evaluated the performance with different classifiers, of which RF produced the best result. We achieved an accuracy of 75.54%.

B. Experimental Scenario 2 (CNN-3L+SVM): In this architecture, we combined three CNNs for feature extraction: one for the left half of the image, one for the complete image, and one for the right half. We evaluated the performance with different classifiers, of which SVM produced the best result. We achieved an accuracy of 78.95%.

C. Experimental Scenario 3 (CNN-4L+SVM): In this architecture, we combined four CNNs for feature extraction: one for the top-left part of the image, one for the top-right, one for the bottom-left, and one for the bottom-right part of the image. All CNN features are combined to produce the final feature vector. Finally, classification is performed using an RF classifier. We achieved an accuracy of 79.21%.

D. Experimental Scenario 4 (CNN-5L+RF): To improve the accuracy, we exploited an ensemble of five CNNs: the \(1^{st}\) CNN is for the top-left part of the image, the \(2^{nd}\) for the bottom-left, the \(3^{rd}\) for the full image, the \(4^{th}\) for the top-right, and the \(5^{th}\) for the bottom-right part of the image, with the best classification accuracy achieved using the SVM classifier. We achieved an accuracy of 83.11%.

E. Experimental Scenario 5 (CNN-(4L+LBP, HoG and GIST)+SVM): Here, four different CNNs are used, which helped in finding better features. Classification accuracy increased further when the CNN features were integrated with the hand-crafted LBP, GIST, and HoG feature vectors. We achieved an accuracy of 91.21%.

F. Experimental Scenario 6 (Hybrid Vision Transformer approach (VT+LBP, HoG and GIST)): In the sixth scenario, the hybrid vision transformer approach is used, as discussed in Section 4.2. Classification accuracy increased further when integrated with the hand-crafted LBP, GIST, and HoG feature vectors. We achieved an accuracy of 94.63%.

Of all the scenarios, scenario 6 performs best, as shown in Table 4, followed by scenario 5. Between scenarios 3 and 4, only a small variation in classification accuracy with the number of CNNs is observed, while the training time increases. Therefore, as a good trade-off, a 4-CNN ensemble is combined with the LBP, GIST, and HoG features to improve accuracy.

The highest accuracy of \(94.63\%\), with specificity \(95.23\%\), sensitivity \(84.42\%\), and kappa coefficient 0.93, is obtained by integrating the hybrid vision transformer with hand-crafted features (Table 4). The precision and recall values obtained for different food categories are given in Table 5 and Fig. 5. We also compare the performance of our proposed model with state-of-the-art algorithms on the different available datasets (PFID, FOOD85, UECFOOD-100, UECFOOD-256, UMPCFOOD-101, UNICT-FD1200, UNIMIB2016, and VIREO), as can be seen in Tables 6–13. Our proposed approach outperforms the other algorithms (Tables 6–13), proving its generalizability and robustness.

Table 4 Accuracy Assessment of different experimental scenarios on Our Proposed Food Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 5 Precision and Recall of different Food Items of proposed approach on our proposed Food Dataset

5.1 Building the Food-class Mapping Models

We created a vast library of classification algorithms by varying their parameters to estimate various ensemble combinations. Specifically, we used SVMs with different kernels, viz. polynomial, radial basis, and sigmoid functions. Similarly, for the ANN, we composed diverse designs with one or more hidden layers. For every individual classifier, the tuning stage was executed methodically to choose the ideal configuration for the most effective execution of that classifier. We also evaluated the performance of K-means clustering with different combinations of parameters. A total of 225 combinations were attempted on the dataset.
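
For illustration, a parameter sweep of this kind could be set up with scikit-learn's GridSearchCV as sketched below; the grid values are placeholders and do not reproduce the exact 225 combinations evaluated here, and `X_train`, `y_train` stand for the extracted feature vectors and labels.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder grid over SVM kernels and parameters (illustrative values only)
param_grid = {
    "kernel": ["poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.001],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)        # X_train, y_train: feature vectors and labels
# print(search.best_params_, search.best_score_)
```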

Table 6 Accuracy Assessment of different experimental scenarios on PFID Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 7 Accuracy Assessment of different experimental scenarios on FOOD85 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 8 Accuracy Assessment of different experimental scenarios on UECFOOD-100 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 9 Accuracy Assessment of different experimental scenarios on UECFOOD-256 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 10 Accuracy Assessment of different experimental scenarios on UMPC Food-101 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 11 Accuracy Assessment of different experimental scenarios on UNICT-FD 1200 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 12 Accuracy Assessment of different experimental scenarios on UNIMIB2016 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 13 Accuracy Assessment of different experimental scenarios on VIREO Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient

The experimental evaluation of the food model was performed using ROC curves (Fig. 6). The curve is plotted with "sensitivity" on the y-axis and "100 − specificity" on the x-axis. The number of food pixels accurately classified as belonging to the "food" class gives the sensitivity, and the number of non-food pixels accurately classified as belonging to the "non-food" class gives the specificity. The performance of the predictive model is summarized by the area under the curve (AUC), i.e., the area under the ROC curve; an ideal model has an AUC value close to 1.
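
The ROC/AUC computation described above can be reproduced with scikit-learn as in the toy sketch below (the labels and scores are made-up illustrative values, not results from the paper):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# y_true: 1 = food pixel, 0 = non-food pixel; y_score: classifier confidence
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.90, 0.80, 0.35, 0.60, 0.40, 0.20, 0.70, 0.55])

fpr, tpr, _ = roc_curve(y_true, y_score)   # x: 1 - specificity, y: sensitivity
print("AUC =", auc(fpr, tpr))              # an ideal model approaches AUC = 1
```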

Investigating the results of each food model utilizing a deep learning network shows that scenarios S1–S5 outperform the state-of-the-art algorithms, which have AUC values of SVM = 0.76, KNN = 0.61, and RF = 0.68 (Fig. 6). This is expected, because a framework using a deep learning architecture gives better results than the other state-of-the-art algorithms.

Fig. 6
figure 6

Validation of the model using ROC curve and analysis of AUC for the experimental scenarios as depicted in Table 4, on our proposed Food Dataset

A detailed study of the results indicates that the proposed methodology, with an AUC of 0.92, has the best performance, followed by scenario 4 with an AUC of 0.88, then scenario 3 with an AUC of 0.85, and lastly scenario 2 with an AUC of 0.83. This is because deep learning architectures have some major limitations. First, as convolutions work with a fixed window size, the model tends to capture local information rather than long-range spatial relationships between different parts of the image and the complete image [38]. Second, some local information is lost through max-pooling. Vision transformers, in contrast, act as an alternative to CNNs and have shown promising results in computer vision. The vision transformer is free of convolutions and treats an image as a sequence of patches, hence overcoming the locality and translation-invariance constraints faced by CNNs. This approach has been observed to use hardware resources more efficiently than CNNs and can be pre-trained on the public ImageNet dataset with fewer resources [38].

As can be seen from Table 3, deep features used together with hand-crafted features gave better results than the CNN deep features alone. The more CNNs employed side by side, the more features are extracted, which improves the classification accuracy. Furthermore, the hand-crafted features provide additional information that is absent in the other scenarios. Unlike the plain deep learning approach, the hand-crafted features integrated here with the deep CNNs showed impressive accuracy gains and rich feature extraction.

5.2 Discussion

5.2.1 ViT versus CNN

Among several computer vision tasks, the Vision Transformer (ViT) has demonstrated good performance. Since it is based on the multi-head attention concept, it can encode the image as patches to form meaning. We have been intrigued by the fundamental differences in the operation of convolution and self-attention, which have not been extensively explored in the context of robustness and generalization. While convolutions excel at learning local interactions between elements in the input domain (e.g., edges and contour information), self-attention has been shown to learn global interactions effectively (e.g., relations between distant object parts) [14]. Given a query embedding, self-attention finds interactions with the other embeddings in the sequence, thereby conditioning on the local content while modeling global relationships. In contrast, convolutions are content-independent, as the same filter weights are applied to all inputs regardless of their distinct nature. Given these content-dependent long-range interaction modeling capabilities, our analysis shows that ViTs can easily adjust their receptive field to cope with nuisances in the data and strengthen the expressivity of the representations.

6 Conclusion

In this paper, we have proposed a novel food framework architecture for recognizing the patterns of various food cuisines. A detailed analysis has been performed on the food patterns extracted by the proposed approach in comparison with state-of-the-art machine learning algorithms. The proposed framework utilizes a hybrid vision transformer approach that amalgamates the contributions of the vision transformer and hand-crafted features. The major strengths of our proposed work are as follows:

1) We proposed a robust dataset comprising a dynamic range of food cuisines, gathered from publicly available food logging systems;

2) We have proposed a novel framework for extracting fine pattern details for different food cuisines. Further, a detailed analysis has been performed on the patterns extracted by our proposed approach in comparison with state-of-the-art machine learning algorithms;

3) We have optimized the transformer hyperparameters to improve the recognition ability of the proposed architecture in comparison with deep CNN and hand-crafted-feature-based pattern recognition algorithms;

4) Our proposed hybrid approach showed promising results over different types and aspects of datasets, which led to unparalleled results for the task of food detection when compared with state-of-the-art algorithms.

Hand-crafted features helped the transformer mechanism raise the accuracy to a new level. The paper also gives a detailed analysis of the different parameters employed in the transformer architecture.

So far, CNNs have dominated computer vision tasks. The notion behind an image is that one pixel depends on its immediate neighbors, and the next pixel depends on its neighbors in turn. CNNs work on this concept and extract significant features and edges by applying filters to sections of a picture. This allows the model to learn only the most important elements of an image rather than the fine details of each pixel. Our proposed model, in contrast, works on the principle that the complete image is fed into it, rather than only the sections that the filters can extract (or find relevant). This is why our proposed approach recognizes patterns more meaningfully. Our proposed architecture obtained an accuracy of 94.63%, specificity of 95.23%, sensitivity of 84.42%, and a kappa coefficient of 0.93, which is better than state-of-the-art food recognition systems using only hand-crafted features or CNN features.

Our proposed approach also has a limitation. The main challenge with transformers in general is that they require a huge number of tokens at every layer to obtain reasonable results. This increases the computation cost of the transformer with each layer, making it intractable for large images. In future work, we could employ a token learner, which improves the transformer's performance by generating a smaller number of tokens in each layer, for food pattern recognition systems.