1 Introduction

Food is the essence of our lives. As time has evolved, food forms have multiplied, and so has their complexity [41]. The main reason for this complexity is their mutual similarity: many different food items look alike, while images of the same item can differ greatly. This contrast makes it very difficult to classify and detect them. Image recognition could be a good solution to food recognition and classification [26].

Food detection and classification have been investigated in the literature using different methods, such as deep learning and mobile computing devices [11]. Convolutional Neural Networks (CNNs) are currently the most widely used technique for image recognition [11].

Many methods employ images containing a single food item, which makes recognition easier, but unfortunately little work has been done on multiple-food recognition. Deep learning approaches, in contrast, are being used extensively for recognition. Heravi et al. [11] used an optimized CNN to recognize images from datasets created using public-domain images.

Wang et al. [34] proposed a weakly supervised learning algorithm that performs food detection and classification using gaze features gathered with an eye tracker. This approach has the advantage that gaze features are required only at training time, while testing is gaze free. Kawano et al. [17] proposed a food recognition system to monitor nutrient and calorie intake and eating habits. They used GrabCut for segmentation, a SURF-based bag of features for feature extraction, and a linear SVM for classification. This approach does not need a server, as the system was implemented on an Android smartphone.

Oliveira et al. [24] addressed food detection based on image recognition. They developed a camera-based smartphone system that detects and classifies multiple foods present on a single plate, proposing a novel segmentation algorithm based on region growing that operates in parallel on multiple feature spaces. More recently, a large collection of food datasets capable of training very complex deep learning architectures was proposed, covering a robust and dynamic range of food images, together with a high-level classification model that works well across such varied food classes.

Matsuda and Yanai [22] introduce a manifold ranking method that exploits co-occurrence statistics to model similarities between different food items in a single image. Zhu et al. [40] introduced a two-step approach to automatically recognize multiple food items present in an image: first, a segmentation process iteratively fragments a food image into similar object classes; second, the segmented regions are classified with an SVM to label them appropriately.

Although promising, CNN-based architectures have some weaknesses. First, because convolutions operate with a fixed window size, the model tends to capture local information rather than long-range spatial relationships between different parts of the image and the image as a whole [23]. Second, some local information is lost through max-pooling. Inspired by natural language processing, vision transformers have recently been proposed as an alternative to CNNs and have shown promising results in computer vision [23]. A vision transformer is free of convolutions and treats an image as a sequence of patches, thereby overcoming the locality and translation-invariance constraints faced by CNNs. This approach has been observed to use hardware resources more efficiently than CNNs and can be pre-trained on the public ImageNet dataset with fewer resources [23].

In this paper, a novel synergistic model based on a vision transformer and hand-crafted features is proposed. Our approach applies the Transformer architecture with self-attention to sequences of image patches without using convolution layers. In addition, ancillary hand-crafted features are computed to provide supplementary information about the food images. Our proposed approach makes the following contributions:

  • We introduced a dataset comprising different varieties of food dishes from every domain and culture, collected from publicly available food logging systems;

  • We have proposed a new mathematical formulation of the objective function of our proposed architecture;

  • We optimized the hyperparameters of the transformer, showing that the vision transformer significantly improves food recognition accuracy compared with other state-of-the-art algorithms that use only hand-crafted features or CNN features;

  • Our hybrid approach shows excellent results on every type and aspect of data (compared on other available datasets), leading to unparalleled results for the task of food detection when compared with state-of-the-art algorithms.

The rest of the paper is organized as follows: Section 2 describes the food datasets employed for the study. Section 3 defines the preliminaries for the proposed approach. Section 4 describes our proposed hybrid vision transformer system, its architecture, components, and algorithms. Section 5 presents the evaluation of the results. Section 6 discusses the drawn conclusions, the comparison between ViT and CNN, and future work.

2 Material and methods

This section gives an overview of the food datasets employed for the study. It elaborates on our created food dataset as well as the food datasets available in the literature that are used to demonstrate the generalizability and robustness of our proposed approach.

2.1 Food image dataset

An extensive dataset enables learning more detailed and fine-grained features, thus helping to combat overfitting. Therefore, the goal was to develop a robust dataset with diverse food images in which each item is represented by as many images as possible; such images ultimately lead to more accurate predictions. We therefore collected a public dataset of images to support realistic feature recognition with high precision and accuracy, as shown in Fig. 1 and Table 2. For this, we gathered approximately 300 to 400 images of the different food items [19]. We compared the results of our proposed approach on the food datasets available in the previous literature (PFID, FOOD85, UECFOOD-100, UECFOOD-256, UMPCFOOD-101, UNICT-FD1200, UNIMIB2016, and VIREO). Table 1 summarizes these datasets, giving the size and number of food classes for each. We considered only publicly available datasets with a minimum of 100 images per class (Table 2).

Fig. 1
figure 1

Input images of different food items (Our Proposed)

Table 1 Food Datasets available in the literature
Table 2 Our Proposed Food Dataset

3 Preliminaries

A. Convolutional Neural Networks (CNN or ConvNet) A ConvNet is a class of machine learning model that uses a variation of the multilayer perceptron designed to require minimal preprocessing. The network can be trained with forward propagation and backpropagation. The main advantage of a CNN is that it requires minimal input pre-processing compared with other methods, which has brought further advances in the processing of images, audio, speech, and video [10].

B. Random Forest (RF) Random forests are ensemble learning architectures. They are built from several decision trees at training time and output the aggregated class (considering the outputs of all the decision trees) [1].

When a random forest is used to solve regression problems, the mean squared error (MSE) measures how the data branch from each node. The following equation calculates the distance of every node's predictions from the actual values, which helps in deciding the best branch in the forest. Here, \(y_i\) denotes the value of the data point used for testing at a particular node, and \(f_i\) denotes the value produced by the decision tree.

$$\begin{aligned} MSE=\frac{1}{N}\sum _{i=1}^{N}(f_i -y_i)^2 \end{aligned}$$
(1)

The Gini index guides the branching of the nodes in a decision tree for classification purposes. The equation uses the class label and a probability measure to determine each branch's Gini index at a node, which determines the most likely branch to occur. Here, \(p_i\) denotes the relative frequency of the class observed in the dataset, and C denotes the number of classes.

$$\begin{aligned} Gini=1-\sum _{i=1}^{C}(p_i)^2 \end{aligned}$$
(2)

Entropy uses the probability measure of a certain output to decide how the node should branch.

$$\begin{aligned} Entropy=\sum _{i=1}^{C}- p_i *\log _2(p_i) \end{aligned}$$
(3)
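
To make Eqs. (1)–(3) concrete, the following minimal NumPy sketch (illustrative only, not the paper's implementation) computes the impurity measures used when growing the trees of a random forest; `node_impurity` and `node_mse` are hypothetical helper names.

```python
import numpy as np

def node_impurity(labels):
    """Gini index (Eq. 2) and entropy (Eq. 3) of the class labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # relative class frequencies p_i
    gini = 1.0 - np.sum(p ** 2)
    entropy = -np.sum(p * np.log2(p))
    return gini, entropy

def node_mse(y_true, y_pred):
    """Mean squared error (Eq. 1) between node predictions and targets."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_pred - y_true) ** 2)

# Example: a node holding 4 samples of class 0 and 2 samples of class 1
print(node_impurity([0, 0, 0, 0, 1, 1]))      # -> (approx. 0.444, 0.918)
```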

C. Support Vector Machine (SVM) SVMs are used for pattern recognition and classification [16]. The goal of an SVM is to find the hyperplane that maximizes the distance between the data points of two classes, thus forming a binary linear classifier [35]. We have n training examples, each x having D dimensions and a class label of either y = +1 or y = -1, with the data separable by a straight line (hyperplane). The SVM constraints can be formulated as:

$$\begin{aligned} w.x_i+b \ge 1 \quad \forall y_i=+1 \end{aligned}$$
(4)
$$\begin{aligned} w.x_i+b \le -1 \quad \forall y_i=-1 \end{aligned}$$
(5)

Combining (4) and (5):

$$\begin{aligned} y_i(w.x_i +b)-1 \ge 0 \quad \forall y_i=\pm 1 \end{aligned}$$
(6)
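
As an illustration only (not the paper's implementation), the sketch below fits a linear SVM with scikit-learn on made-up toy data and checks the combined margin constraint of Eq. (6); the data and variable names are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels y in {+1, -1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Eq. (6): y_i (w . x_i + b) - 1 >= 0 for every training point
margins = y * (X @ w + b) - 1
print(margins)          # each value should be >= 0 (up to solver tolerance)
```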

D. Vision Transformer Approach For the past ten years, deep CNNs have been the go-to method for solving computer vision problems. CNNs have been successful in terms of performance; applying numerous convolutions to images to learn hierarchical spatial features has been their major advantage over standard machine learning algorithms [20]. Since 2012, CNN-based architectures such as VGG, ResNet, DenseNet, and EfficientNet [31] have been solving complex image recognition challenges on ImageNet [6]. Nevertheless, CNNs have flaws too. Because convolution operates on a fixed-size window, the model focuses more on detecting local details than on long-range semantic relationships among parts of the image and the image as a whole. Additionally, max-pooling can lead to a loss of information.

In the NLP field, attention-based architectures such as Transformers [32] have been developed to tackle various language-related tasks, such as language translation and text classification, more effectively. Significant performance gains have been achieved using transformers. Consequently, this has sparked great interest in the computer vision community in using similar self-attention models for vision tasks. Numerous computer vision works have focused on combining self-attention with CNN-like architectures, and some have tried to replace convolution entirely with self-attention mechanisms [36]. Nevertheless, these models have not yet scaled effectively on modern hardware accelerators owing to their specialized attention patterns.

The Vision Transformer (ViT) was recently proposed as the first deep neural network for large-scale computer vision datasets without convolution operators. ViT uses the original transformer developed in [32] for NLP tasks with some changes. First, the input images are split into patches that are arranged as linear embeddings forming the transformer's input. From an NLP point of view, image patches are treated as tokens (words), which are then used to train the network in a supervised manner. In contrast to CNNs, which encode prior knowledge about the spatial domain, transformers operate on vectors and need more data to discover such knowledge from high-dimensional, large-scale datasets. As far as we know, the first deployment of ViT for computer vision tasks was launched by Google, where a ViT was trained on an in-house dataset of 300 million images from the JFT dataset [30] and then fine-tuned on image recognition benchmarks such as ImageNet. Fine-tuning is used to increase the performance of ViT so that it matches state-of-the-art CNN models.

Within this model, the 2D image patches are flattened and reshaped into vectors, which form the required input of the transformer. The resulting vectorized image patches are then converted into linear patch embeddings. Meanwhile, a position embedding is added to each of these patch embeddings so that the network retains the positional information of each patch.

This works similarly to BERT's method: a class token ([cls], which stands for classification) is added at the beginning of the sequence of embedded patches as a learnable embedding. The transformer encoder in the main architecture, adopted from [32], consists of multiple encoder blocks, where each block contains multi-headed self-attention layers. Layer normalization is applied to the embedded patches before they are fed into the multi-head self-attention and again before the MLP blocks. The right side of Fig. 2 shows the general mechanism of an encoder.
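
A minimal NumPy sketch of the input pipeline just described: patches are flattened, linearly projected, a [cls] token is prepended, and position embeddings are added. The random matrices stand in for parameters that would be learned in an actual ViT; all names and sizes here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def embed_patches(image, patch=16, d_model=768, rng=np.random.default_rng(0)):
    """Split an HxWxC image into patch x patch tiles, flatten each tile,
    project it linearly, prepend a [cls] token and add position embeddings."""
    H, W, C = image.shape
    tiles = (image.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * C))              # N x (P*P*C)
    E = rng.normal(scale=0.02, size=(tiles.shape[1], d_model))  # patch projection (learned in practice)
    cls = rng.normal(scale=0.02, size=(1, d_model))             # [cls] token (learned in practice)
    x = np.vstack([cls, tiles @ E])                             # (N + 1) x d_model
    pos = rng.normal(scale=0.02, size=x.shape)                  # position embeddings (learned in practice)
    return x + pos

tokens = embed_patches(np.random.rand(224, 224, 3))   # -> shape (197, 768)
```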

Fig. 2
figure 2

ViT model architecture: an image is segmented into patches (e.g., \(16\times 16\)); after position embeddings are applied to the flattened patches, they are passed to the transformer encoder. NLZ: Normalization, MHA: Multi-Head Attention

Conclusively, the vision transformer divides an image into fixed-size patches, embeds each of them, and includes positional embeddings as input to the transformer encoder. Moreover, ViT models outperform CNNs by almost a factor of four in computational efficiency and accuracy, and they also perform better on small datasets. That is why our vision transformer approach outperforms other CNN networks.

The self-attention layer in ViT makes it possible to embed information globally across the overall image. The model also learns from the training data to encode the relative location of the image patches and thereby reconstruct the structure of the image.
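
To illustrate how a self-attention layer mixes information across all patches, here is a single-head scaled dot-product attention sketch in NumPy. It is a simplified stand-in for the multi-head attention of the encoder; the weight matrices `Wq`, `Wk`, `Wv` are assumed, learned parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: every token (patch embedding)
    attends to every other token, giving each one global context."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over tokens
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(197, 64))                        # 196 patch tokens + [cls]
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # -> (197, 64)
```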

4 Methodology

This section describes: (1) the pre-processing techniques applied to the dataset, (2) the proposed approach, and (3) the post-processing technique.

4.1 Pre-processing

Raw data does not yield good accuracy when fed directly to existing classification methods. Our goal is to improve accuracy using well-known preprocessing techniques. Pre-processing involves the following: denoising the image dataset, mean normalization, standardization, zero component analysis (ZCA), and calibration of the vision transformer hyperparameters [33]. A minimal sketch of these normalization and whitening steps is given after the list below.

  • Mean Normalization: The mean of every feature (image dimension) is computed over all training samples and then subtracted from each image. The entire training dataset is thereby transformed into a centred dataset, so that the brightness of the whole training set is normalized with respect to each dimension. This can be computed as:

    $$\begin{aligned} Y^{'}=Y- \mu \end{aligned}$$
    (7)

    where \(Y^{'}\) denotes the normalized data, Y denotes the original data, and \(\mu \) denotes the mean vector along all features of Y (Fig. 3).

  • Standardization: First, mean normalization of the data takes place; then the standard deviation across each feature of the training samples is calculated and divided out, making the final data mean- and variance-normalized. The raw data are thus organized by computing the mean and variance of each dimension of the training set. Standardization of the input data Y is done by (8).

    $$\begin{aligned} Y^{'}=\frac{(Y- \mu )}{\sigma } \end{aligned}$$
    (8)

    where \(Y^{\prime }\) denotes normalized data, \(\mu \) denotes the mean vector covering all the features, and \(\sigma \) denotes the standard deviation vector covering all the features.

  • Zero Component Analysis (ZCA): ZCA whitening makes the edges of objects more prominent; the subsequent layers can then extract richer features from these edge-enhanced feature maps.

    $$\begin{aligned} Y^{\prime }=\frac{Y}{255} \end{aligned}$$
    (9)

    Initially, the data Y are size-normalized by feature scaling using (9). In (10), V is the eigenvector matrix and S the eigenvalue matrix obtained from the singular value decomposition of the covariance matrix of \(Y^{\prime }\), diag(S) denotes the vector of its diagonal (eigenvalue) entries, \(V^{T}\) denotes the transpose of the eigenvector matrix V, and \(\varepsilon \) denotes the whitening coefficient. This pre-processing method introduces one more hyper-parameter, the whitening coefficient \(\varepsilon \).

    $$\begin{aligned} Y_{ZCA}=V\cdot diag\left( \frac{1}{\sqrt{diag(S)+\varepsilon }}\right) \cdot V^{T}\cdot Y^{\prime } \end{aligned}$$
    (10)
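
As referenced above, here is a minimal NumPy sketch of the three normalization steps (Eqs. (7)–(10)); the function names and the choice of \(\varepsilon\) are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def mean_normalize(Y):
    """Eq. (7): subtract the per-feature mean computed over all samples."""
    return Y - Y.mean(axis=0)

def standardize(Y):
    """Eq. (8): mean-normalize, then divide by the per-feature std deviation."""
    return (Y - Y.mean(axis=0)) / Y.std(axis=0)

def zca_whiten(Y, eps=1e-2):
    """Eqs. (9)-(10): scale pixel values to [0, 1], then apply ZCA whitening."""
    Y = Y / 255.0                                   # Eq. (9)
    Yc = Y - Y.mean(axis=0)                         # centre before whitening
    cov = np.cov(Yc, rowvar=False)
    V, S, _ = np.linalg.svd(cov)                    # eigenvectors V, eigenvalues S
    W = V @ np.diag(1.0 / np.sqrt(S + eps)) @ V.T   # Eq. (10)
    return Yc @ W.T
```
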
Fig. 3
figure 3

Flowcharts describing the complete process of the CNN model and the ViT model

4.2 Proposed hybrid vision transformer framework

The complete structure of our proposed approach is shown in Fig. 4. This section describes: (1) the structure of our proposed approach, (2) feature extraction using the vision transformer and hand-crafted descriptors, and (3) merging the outputs of both classifiers to obtain the final hybrid classification.

A combination of two classifiers is employed. The first uses the vision transformer approach to learn the most relevant features required for training and testing on the image dataset. The second classifier employs hand-crafted features applied to our dataset to extract the information present in the image itself. The hand-crafted features used here are LBP, GIST, and HoG [13].

Fig. 4
figure 4

Proposed methodology for food image classification, where FV: Feature Vector

4.2.1 Extracting features

This section discusses the hand-crafted and vision transformer feature extraction procedures, which are later combined to classify the input food images. The image database contains images \(x_i \in R^{D}\) (D is the number of dimensions), with i = 1, 2, 3, ..., N, N being the total number of image examples. Each image belongs to a corresponding class \(y_j\), where j = 1, 2, 3, ..., K, with K being the number of categories. For calculating the class scores (the mapping from image dimensions to image categories), let:

$$\begin{aligned} \quad \quad \quad \quad \quad f: R^D \rightarrow R^K \end{aligned}$$
(11)

A. Extracting hand-crafted features Different local hand-crafted features extract the essential characteristics of the images. Local Binary Pattern (LBP) is a robust feature extraction algorithm mainly used for texture recognition in computer vision; combined with the Histogram of Oriented Gradients (HoG), it considerably improves detection performance. GIST features are global image features that help characterize various important statistics of a scene. They are computed by convolving a filter bank with an image at different scales and orientations, so that the high- and low-frequency repetitive gradient directions of an image can be measured; the filter responses at each orientation and scale are used as the GIST features. Similarly, HoG is mainly exploited for object detection in images. HoG features describe the distribution of intensity gradients, capturing the shape and appearance of local objects in the image. The approach involves counting the occurrences of gradient orientations while maintaining invariance to photometric and geometric transformations.
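
As an illustration of how such descriptors can be computed (a sketch assuming scikit-image is used; GIST has no standard scikit-image implementation, so a simplified Gabor-filter-bank surrogate is shown, and all parameter values are assumptions rather than the paper's settings):

```python
import numpy as np
from skimage.feature import local_binary_pattern, hog
from skimage.filters import gabor

def handcrafted_features(gray):
    """gray: 2-D grayscale image with values in [0, 1]."""
    # LBP: histogram of uniform local binary patterns (texture)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

    # HoG: distribution of intensity-gradient orientations (shape/appearance)
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2))

    # GIST-like descriptor: mean Gabor responses over a few scales/orientations
    gist = [np.abs(gabor(gray, frequency=f, theta=t)[0]).mean()
            for f in (0.1, 0.25, 0.4)
            for t in np.linspace(0, np.pi, 4, endpoint=False)]

    return np.concatenate([lbp_hist, hog_vec, np.asarray(gist)])
```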

Further, for calculating the hand-crafted features, we performed the following computations. For every image sample, a class is associated, which can be represented as \( \varnothing ( \overrightarrow{x},y) \), where \(\overrightarrow{x}\) are the extracted features and y is their corresponding class. To extract only the effective and useful features from the image samples, we associate a weight vector with the images. At test time, the classifier chooses the class, which is calculated as:

$$\begin{aligned} \quad \quad \quad \quad \quad y_2 = {argmax_{y'}} \hspace{1mm} w^T \hspace{1mm} \varnothing (\overrightarrow{x},y') \end{aligned}$$
(12)

Collecting all the features into a structure gives:

$$\begin{aligned} \quad \quad \quad \quad \quad D_{h} = \left\{ (f(x_i),c_i) \mid 1< i< N \right\} \end{aligned}$$
(13)

where \(D_h\) is the image database and f(.) are the features extracted.

B. Extracting the vision transformer features For calculating the vision transformer-based features, we performed the following computations. For every image sample, a class is associated, which can be represented as \( \varnothing (\overrightarrow{x},y) \), where \(\overrightarrow{x}\) are the extracted features and y is their corresponding class. To extract only the effective and useful features from the image samples, we associate a weight vector with the images, and the predicted class is calculated as:

$$\begin{aligned} \quad \quad \quad \quad \quad y_1 = {argmax_{y'}} \hspace{1mm} w^T \hspace{1mm} \varnothing (\overrightarrow{x},y') \end{aligned}$$
(14)

Collecting all the features into a structure gives:

$$\begin{aligned} \quad \quad \quad \quad \quad D_{e} = \left\{ (f(x_i),c_i) \mid 1< i< N \right\} \end{aligned}$$
(15)

where \(D_e\) is the image database and f(.) are the extracted features. To predict the combined final score for each dish X, its class \(c^*\) is obtained by averaging the class scores produced from the two feature sets:

$$\begin{aligned} \quad \quad \quad \quad \quad final = avg \left\{ D_h,D_e \right\} \end{aligned}$$
(16)
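
A minimal sketch of the late fusion in Eq. (16), assuming each of the two classifiers returns a per-class score vector for a test dish (the function and variable names are hypothetical):

```python
import numpy as np

def fuse_and_predict(scores_handcrafted, scores_vit, class_names):
    """Average the per-class scores of the hand-crafted and ViT classifiers
    (Eq. 16) and return the class with the highest fused score."""
    fused = 0.5 * (np.asarray(scores_handcrafted) + np.asarray(scores_vit))
    return class_names[int(np.argmax(fused))]

# Example with three food classes
print(fuse_and_predict([0.2, 0.5, 0.3], [0.1, 0.3, 0.6],
                       ["pizza", "salad", "soup"]))   # -> "soup"
```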

C. Evaluate the classifier Both classifiers (hand-crafted and vision transformer based) predict the output classes for the input image dataset in the form of a predicted class score or posterior probability [15]. This comparison confirms that vision transformer features outperform hand-crafted features (Table 3). The final classification is then made by comparing the score values from both classifiers (as referenced in Fig. 5). The two classifiers produce score values for the test data that determine which one is more confident about the predicted label.

Table 3 Comparison between the first SVM classifier, based on vision transformer features, and the second SVM classifier, based on hand-crafted features

4.3 Post-processing

When we applied post-processing to the outputs of the hybrid model with a Likelihood Class Filter (LCF), we observed a further improvement in precision and recall. This improvement can be attributed to the effective exploitation of the probability information in the neighborhoods, which yielded higher accuracy values than the raw classification [28]. The likelihood class filter can be thought of as a kernel of \(N\times N\) elements, where N is normally an odd number with \(N \ge 3\), so that the kernel filters the central element. For example, with a \(3\times 3\) kernel, the LCF considers the 8-neighborhood classes and assigns the most probable class to the central element. Edges receive less attention because the \(3\times 3\) kernel cannot cover the elements on the boundary of the matrix. If a larger kernel size is used, classes with small spatial structures might go unnoticed, which decreases the final accuracy and also requires extensive computation time. The filter/kernel size is \(3\times 3\) in our work. Hence, only two conditions are considered for the probable class of the central element (a sketch of this filter is given after the two conditions below).

  1. Only a neighborhood class comprising an absolute majority of elements (condition 1): In a \(3\times 3\) kernel, at least p of the neighboring elements must share a class for it to be the likelihood class. In our work, p = 5, 6, 7, and 8 were tested. When p = 5, the LCF algorithm gives the best thematic map transformation, since \(p\ge 5\) already covers the cases of 6, 7, or 8 elements.

  2. A neighborhood class comprising a relative majority of elements (condition 2): This condition is determined by comparing the number of elements among the neighboring classes. The neighboring class with the most elements is chosen as the likelihood class; if neighboring classes have an equal number of elements, the class initially at the central element is kept. The \(2^{nd}\) condition therefore includes the \(1^{st}\), giving a larger thematic map transformation than the \(1^{st}\) condition, which in turn deteriorates the classification accuracy. It can be calculated as:

    $$\begin{aligned} f_j(Y)=\ln {p(Y/q_j)p(q_j)}=\ln {p(Y/q_j)}+\ln {p(q_j)} \end{aligned}$$
    (17)

    where j denotes the class, Y is the m-dimensional data vector (m is the number of bands), \(p(q_{j})\) is the probability that class \(q_{j}\) occurs in the image (assumed equal for all classes), \(\mid \Sigma _{j}\mid \) is the determinant of the covariance matrix of the data in class \(q_{j}\), \(\Sigma _{j}^{-1}\) is its inverse matrix, and \(n_{j}\) is the mean vector.

    $$\begin{aligned} Y \in q_{j} \quad \text {if} \quad f_j(Y)>f_i(Y) \quad \forall i \ne j \end{aligned}$$
    (18)
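
As referenced above, here is a sketch of the \(3\times 3\) likelihood class filter under condition 1 (absolute majority with p = 5), using SciPy's generic_filter; this is an illustrative reconstruction, not the paper's exact post-processing code.

```python
import numpy as np
from scipy.ndimage import generic_filter

def likelihood_class_filter(label_map, p=5):
    """Relabel each pixel to the class held by at least `p` of its 8
    neighbors in a 3x3 kernel; otherwise keep the centre label."""
    def relabel(window):
        centre = window[4]                      # centre of the flattened 3x3 window
        neighbors = np.delete(window, 4)
        values, counts = np.unique(neighbors, return_counts=True)
        best = counts.argmax()
        return values[best] if counts[best] >= p else centre
    return generic_filter(label_map, relabel, size=3, mode="nearest")

labels = np.array([[1, 1, 1],
                   [1, 2, 1],
                   [1, 1, 1]])
print(likelihood_class_filter(labels))          # the isolated centre 2 is relabelled to 1
```
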
Fig. 5
figure 5

Flow chart of Precision and Recall of different food classes of the proposed approach on our Proposed Food Dataset

5 Results and discussion

This section elaborates on the experiments performed on the different food datasets available in the literature (including our created dataset), demonstrating and analyzing how our proposed approach outperforms the state-of-the-art algorithms.

A. Experimental Scenario 1 (CNN-2L+RF): In this architecture, we combined two CNNs for feature extraction, one for the left half of the image and the other for the right half. We evaluated the performance with different classifiers, of which RF produced the best result. We achieved an accuracy of 75.54%.

B. Experimental Scenario 2 (CNN-3L+SVM): In this architecture, we combined three CNNs for feature extraction: one for the left half of the image, one for the complete image, and one for the right half. We evaluated the performance with different classifiers, of which SVM produced the best result. We achieved an accuracy of 78.95%.

C. Experimental Scenario 3 (CNN-4L+SVM): In this architecture, we combined four CNNs for feature extraction: one for the top-left part of the image, one for the top-right, one for the bottom-left, and one for the bottom-right part of the image. All CNN features are combined to produce the final feature vector. Finally, classification is performed using an RF classifier. We achieved an accuracy of 79.21%.

D. Experimental Scenario 4 (CNN-5L+RF): To improve the accuracy, we exploited an ensemble of five CNNs: the \(1^{st}\) CNN is for the top-left part of the image, the \(2^{nd}\) for the bottom-left, the \(3^{rd}\) for the full image, the \(4^{th}\) for the top-right, and the \(5^{th}\) for the bottom-right part of the image, with the best classification accuracy achieved using the SVM classifier. We achieved an accuracy of 83.11%.

E. Experimental Scenario 5 (CNN-(4L+LBP, HoG and GIST)+SVM): Here, four different CNNs are used, which helped in finding better features. Classification accuracy increased further when the CNN features were integrated with the hand-crafted LBP, GIST, and HoG feature vectors. We achieved an accuracy of 91.21%.

F. Experimental Scenario 6 (Hybrid Vision Transformer approach (VT+LBP, HoG and GIST)): In the sixth scenario, the hybrid vision transformer approach is used, as discussed in Section 4.2. Classification accuracy increased further when integrated with the hand-crafted LBP, GIST, and HoG feature vectors. We achieved an accuracy of 94.63%.

Of all the scenarios, scenario 6 performs best, as shown in Table 4, followed by scenario 5. Between scenarios 3 and 4, only a small variation in classification accuracy with the number of CNNs is observed, while the training time increases. Therefore, as a good trade-off, a 4-CNN ensemble is combined with the LBP, GIST, and HoG features to improve accuracy.

The highest accuracy of \(94.63\%\), with specificity \(95.23\%\), sensitivity \(84.42\%\), and kappa coefficient 0.93, is obtained by integrating the hybrid vision transformer with hand-crafted features (Table 4). The precision and recall values obtained for different food categories are given in Table 5 and Fig. 5. We also compare the performance of our proposed model with state-of-the-art algorithms on the different available datasets (PFID, FOOD85, UECFOOD-100, UECFOOD-256, UMPCFOOD-101, UNICT-FD1200, UNIMIB2016, and VIREO), as can be seen in Tables 6–13. Our proposed approach outperforms the other algorithms (Tables 6–13), proving its generalizability and robustness.

Table 4 Accuracy Assessment of different experimental scenarios on Our Proposed Food Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 5 Precision and Recall of different Food Items of proposed approach on our proposed Food Dataset

5.1 Building the Food-class Mapping Models

We created a vast library of classification algorithms by varying their parameters to estimate various ensemble combinations. Specifically, we used SVMs with different kernels, viz. polynomial, radial basis, and sigmoid functions. Similarly, for the ANN, we composed diverse designs with one or more hidden layers. For every individual classifier, the tuning stage was executed methodically to choose the ideal configuration for the most effective execution of that classifier. We also evaluated the performance of K-means clustering with different combinations of parameters. A total of 225 combinations were attempted on the dataset.
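
For illustration, a parameter sweep of this kind could be set up with scikit-learn's GridSearchCV as sketched below; the grid values are placeholders and do not reproduce the exact 225 combinations evaluated here, and `X_train`, `y_train` stand for the extracted feature vectors and labels.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder grid over SVM kernels and parameters (illustrative values only)
param_grid = {
    "kernel": ["poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.001],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)        # X_train, y_train: feature vectors and labels
# print(search.best_params_, search.best_score_)
```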

Table 6 Accuracy Assessment of different experimental scenarios on PFID Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 7 Accuracy Assessment of different experimental scenarios on FOOD85 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 8 Accuracy Assessment of different experimental scenarios on UECFOOD-100 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 9 Accuracy Assessment of different experimental scenarios on UECFOOD-256 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 10 Accuracy Assessment of different experimental scenarios on UMPC Food-101 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 11 Accuracy Assessment of different experimental scenarios on UNICT-FD 1200 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 12 Accuracy Assessment of different experimental scenarios on UNIMIB2016 Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient
Table 13 Accuracy Assessment of different experimental scenarios on VIREO Dataset, AC: Accuracy, Sen: Sensitivity, Spe: Specificity, K.C: Kappa Coefficient

The experimental evaluation of the food model was performed using ROC curves (Fig. 6). The curve is plotted with "sensitivity" on the y-axis and "100 − specificity" on the x-axis. The number of food pixels accurately classified as belonging to the "food" class gives the sensitivity, and the number of non-food pixels accurately classified as belonging to the "non-food" class gives the specificity. The performance of the predictive model is summarized by the area under the curve (AUC), i.e., the area under the ROC curve; an ideal model has an AUC value close to 1.
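
The ROC/AUC computation described above can be reproduced with scikit-learn as in the toy sketch below (the labels and scores are made-up illustrative values, not results from the paper):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# y_true: 1 = food pixel, 0 = non-food pixel; y_score: classifier confidence
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.90, 0.80, 0.35, 0.60, 0.40, 0.20, 0.70, 0.55])

fpr, tpr, _ = roc_curve(y_true, y_score)   # x: 1 - specificity, y: sensitivity
print("AUC =", auc(fpr, tpr))              # an ideal model approaches AUC = 1
```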

Investigating the results of each food model utilizing a deep learning network shows that scenarios S1–S5 outperform the state-of-the-art algorithms, which have AUC values of SVM = 0.76, KNN = 0.61, and RF = 0.68 (Fig. 6). This is expected, because a framework using a deep learning architecture gives better results than the other state-of-the-art algorithms.

Fig. 6
figure 6

Validation of the model using ROC curve and analysis of AUC for the experimental scenarios as depicted in Table 4, on our proposed Food Dataset

A detailed study of the results indicates that the proposed methodology, with an AUC of 0.92, has the best performance, followed by scenario 4 with an AUC of 0.88, then scenario 3 with an AUC of 0.85, and lastly scenario 2 with an AUC of 0.83. This is because deep learning architectures have some major limitations. First, as convolutions work with a fixed window size, the model tends to capture local information rather than long-range spatial relationships between different parts of the image and the complete image [38]. Second, some local information is lost through max-pooling. Vision transformers, in contrast, act as an alternative to CNNs and have shown promising results in computer vision. The vision transformer is free of convolutions and treats an image as a sequence of patches, hence overcoming the locality and translation-invariance constraints faced by CNNs. This approach has been observed to use hardware resources more efficiently than CNNs and can be pre-trained on the public ImageNet dataset with fewer resources [38].

As can be seen from Table 3, deep features used together with hand-crafted features gave better results than the CNN deep features alone. The more CNNs employed side by side, the more features are extracted, which improves the classification accuracy. Furthermore, the hand-crafted features provide additional information that is absent in the other scenarios. Unlike the plain deep learning approach, the hand-crafted features integrated here with the deep CNNs showed impressive accuracy gains and rich feature extraction.

5.2 Discussion

5.2.1 ViT versus CNN

Among several computer vision tasks, the Vision Transformer (ViT) has demonstrated good performance. Since it is based on the multi-head attention concept, it can encode the image as patches to form meaning. We have been intrigued by the fundamental differences in the operation of convolution and self-attention, which have not been extensively explored in the context of robustness and generalization. While convolutions excel at learning local interactions between elements in the input domain (e.g., edges and contour information), self-attention has been shown to learn global interactions effectively (e.g., relations between distant object parts) [14]. Given a query embedding, self-attention finds interactions with the other embeddings in the sequence, thereby conditioning on the local content while modeling global relationships. In contrast, convolutions are content-independent, as the same filter weights are applied to all inputs regardless of their distinct nature. Given these content-dependent long-range interaction modeling capabilities, our analysis shows that ViTs can easily adjust their receptive field to cope with nuisances in the data and strengthen the expressivity of the representations.

6 Conclusion

In this paper, we have proposed a novel food framework architecture for recognizing the patterns of various food cuisines. A detailed analysis has been performed on the food patterns extracted by the proposed approach in comparison with state-of-the-art machine learning algorithms. The proposed framework utilizes a hybrid vision transformer approach that amalgamates the contributions of the vision transformer and hand-crafted features. The major strengths of our proposed work are as follows:

1) We proposed a robust dataset comprising a dynamic range of food cuisines, gathered from publicly available food logging systems;

2) We have proposed a novel framework for extracting fine pattern details for different food cuisines. Further, a detailed analysis has been performed on the patterns extracted by our proposed approach in comparison with state-of-the-art machine learning algorithms;

3) We have optimized the transformer hyperparameters to improve the recognition ability of the proposed architecture in comparison with deep CNN and hand-crafted-feature-based pattern recognition algorithms;

4) Our proposed hybrid approach showed promising results over different types and aspects of datasets, which led to unparalleled results for the task of food detection when compared with state-of-the-art algorithms.

Hand-crafted features helped the transformer mechanism raise the accuracy to a new level. The paper also gives a detailed analysis of the different parameters employed in the transformer architecture.

So far, CNNs have dominated computer vision tasks. The notion behind an image is that one pixel depends on its immediate neighbors, and the next pixel depends on its neighbors in turn. CNNs work on this concept and extract significant features and edges by applying filters to sections of a picture. This allows the model to learn only the most important elements of an image rather than the fine details of each pixel. Our proposed model, in contrast, works on the principle that the complete image is fed into it, rather than only the sections that the filters can extract (or find relevant). This is why our proposed approach recognizes patterns more meaningfully. Our proposed architecture obtained an accuracy of 94.63%, specificity of 95.23%, sensitivity of 84.42%, and a kappa coefficient of 0.93, which is better than state-of-the-art food recognition systems using only hand-crafted features or CNN features.

Our proposed approach also has a limitation. The main challenge with transformers in general is that they require a huge number of tokens at every layer to obtain reasonable results. This increases the computation cost of the transformer with each layer, making it intractable for large images. In future work, we could employ a token learner, which improves the transformer's performance by generating a smaller number of tokens in each layer, for food pattern recognition systems.