1 Introduction

The food industry across the globe has reported losses estimated at about £31 billion per year due to economically motivated food adulteration, often referred to as food fraud (Johnson 2014). A joint report by Interpol and Europol in 2017 revealed that fake foods and adulterated drinks worth €230 m were seized (OPSON 2017; EC 2020). The quality of food can be degraded by adding low-cost ingredients or by mislabelling the actual contents. Food authentication is the process of verifying a declared food item by checking its ingredients list, origin, and the method and technology used for production and processing. Notorious past incidents such as the rapeseed oil scandal, melamine adulteration in milk and the infamous horse meat scandal have eroded consumers’ trust in the food they purchase (Kendall et al. 2019).

The ability to identify unique fingerprints of components in a non-destructive way makes spectroscopy a highly desirable technique in the field of food analysis. Spectral data analysis for food quality focuses mainly on the visible (VIS: 400–700 nm), ultra-violet (UV: 185–400 nm), near infra-red (NIR: 700–2500 nm) and mid infra-red (MIR: 2500–5000 nm) ranges of the electromagnetic spectrum (Didham, Truong, Chapman & Cozzolino 2019; Xiong et al. 2021). Although laboratory-based spectrometers can detect the fingerprints of food products, they are relatively cumbersome and expensive, which hinders their deployment outside the laboratory, in situ by consumers, or when a short analysis time is needed (Cifuentes 2012; Noviyanto & Abdulla 2020). An inexpensive, miniature, portable spectrometer that is simple to deploy could therefore be used by consumers to assess food quality; as such, the consumer could become the first line of defence against food fraud. There is an increasing demand for research and development to make laboratory experiments portable and implementable for the public, according to an assessment of biosensors and their expanding applications in food and bioscience (Cifuentes 2012). The transformation of laboratory experiments into portable and implementable devices has already been published (Beć et al. 2022). However, spectral data quality from miniature and portable spectrometers is lower compared to data collected in a controlled laboratory environment with non-portable devices. Limited control over the data collection environment can degrade the signal-to-noise ratio and introduce various distortions due to changes in ambient temperature, lighting conditions, distance from the devices and background noise. All these issues need to be addressed to make miniature spectroscopy a widely accepted technology for effective and efficient food quality checks. Furthermore, spectral data is characterized by high dimensionality, collinearity and non-linearity.
Identifying and preserving the useful information without losing efficiency is another major challenge that must be addressed in order to increase the performance and accuracy of such portable devices (Didham et al. 2019). Machine learning (ML) algorithms can be used to overcome these issues (e.g. with pocket-sized spectrometers such as Tellspec and SCiO). When augmented with adequate signal processing and ML, such portable devices can attain high performance rates with short processing times, adding to their suitability for in situ food authentication scenarios.

1.1 Related works

The use of ML for food quality analysis has gained significant attention in recent years, with numerous research studies demonstrating the potential of this approach for improving food safety and quality control. Linear discriminant analysis (LDA) has been successfully applied to spectral data from a portable NIR spectrometer for establishing meat quality (Bazar et al. 2017) and for assessing the authenticity of fish fillets and patties (Grassi et al. 2018). Support vector machine (SVM) has been used on spectral data acquired from a handheld NIR spectrometer to differentiate varieties of barley and chickpeas, with overall classification accuracies of 87% and 95%, respectively (Kosmowski and Worku 2018). The k-nearest neighbour (k-NN) algorithm has been applied to spectral data to establish the quality of olive oil blends (Jiménez-Carvelo et al. 2017) and to analyse the quality of pineapple fruits (Amuah et al. 2019). Random forests (RF) have been used to classify premium quality extra virgin olive oil; RF attained a better discrimination rate with better selection of variables for clustering (Ai et al. 2014). Further related works are mentioned in section S1 (Supplementary Material).

Olive oil is widely consumed due to its high nutritional and health benefits. It has a high economic value; hence it is particularly prone to adulteration with cheaper, lower-quality oils such as vegetable oils (Didham et al. 2019; Jiménez-Carvelo et al. 2017). A consumer-centric, real-time solution for the quality analysis of olive oil is therefore important in the fight against adulteration. The present work investigates the capabilities and challenges of miniature-spectrometer data analytics for food quality analysis, illustrated by a case study establishing the purity of various olive oil blends. To gauge the potential of such a device for food screening, a multi-step data analytic pipeline is presented. This pipeline can be used to augment the classification performance of a miniature UV–Vis spectrometer leveraging various ML classification algorithms. The aim of this study was to understand the distinctive characteristics of such spectral data and its challenges. The study also highlights the need for pre-processing methods, especially dimension reduction, filtering, and normalization. The ML classifiers were integrated with effective pre-processing methods, which is expected to enhance the overall performance of spectral data analysis from portable devices. This article also presents an overview of related work in the field, including background information on the topic, the devices used, the dataset collection process, and the methods previously used for data analysis. It also provides an in-depth analysis of the results from different ML algorithms, focusing on the most significant findings and their implications for the field.

The results of this study are based on cross-validation on a dataset with a random 10% held out as a test set, together with multi-session evaluation results. The list of symbols and notations used in this paper is shown in Table S1 (Supplementary Material).

2 Material and methods

2.1 Data acquisition and experimental setup

The experimental set-up was prepared to carry out data acquisition in a scenario that closely mimics how an ordinary consumer would use a portable device, with limited control over the environment and background. Multiple data collection sessions were carried out at normal room temperature and under natural lighting. There are several reasons for conducting multiple data collection sessions in this research study. The first is to analyse the differing conditions that affect the quality of data acquired by the portable devices. Another is to check the robustness of the models and ensure that the results are consistent and reliable, especially when data collection is done in a less controlled environment. This way of collecting data mimics ordinary consumers performing food quality checks in an uncontrolled environment in which various distortions may affect data quality. The OceanView™ software was used to collect the raw spectral data from the spectrometer. Different spectral datasets were collected using STS-UV and STS-NIR miniature spectrometers, covering the ranges 190–650 nm and 650–1100 nm, respectively. The datasets prepared are broadly classified into liquid (oils), solid (apples and dates), and powder (flours and spices) differentiation. A summary of the prepared datasets is given in Table S2 (Supplementary Material). The dataset for oils comprised extra virgin olive oil (EVOO) gradually mixed with vegetable oil (VO). The proportion of vegetable oil was set between 10 and 50% in steps of 10%, yielding samples containing 100% down to 50% olive oil. Another dataset comprised dates, with spectral data from 16 distinct types of dates from various countries of origin. The dataset for apples was prepared from 10 distinct types of apples available on the market.
The dataset for powders was prepared from 18 different types of powders which include spices, flours, and chocolate powders. In this work, the spectral data of olive oil from STS-UV spectrometer has been taken as a case study for the detailed analysis. Figure S1 (Supplementary Material) shows the various steps involved in checking the quality of olive oil including the data acquisition phase.

As the spectrometer sensor used had no built-in light source, an integrated tungsten halogen lamp was used. The device used here is the DR-probe, a 45° diffuse reflectance probe with an integrated tungsten halogen light source that allows convenient reflectance measurements. By fixing the probe’s collection optics in place relative to the light source, the DR-probe design ensures that diffuse light is collected and that the measurement geometry is stable. The position of the probe relative to the oil surface was kept uniform at a distance of 1.5 cm across the measurements, making the results more reliable and consistent (Integrated Light Source, Ocean Optics GmbH). The portable spectrometer used has an optical resolution of 1.5 nm full width at half maximum (FWHM) with a slit size of 25 µm (STS UV Microspectrometer, Ocean Optics GmbH). An optical fibre was employed to connect the lighting source to the entrance aperture of the portable spectrometer. Each scan covers the UV and Vis ranges with 1024 intensity pixels per sample. The adopted integration time was one second; the device was calibrated by the manufacturer, and this calibration was subsequently updated using the manufacturer’s calibration procedure.

The raw, unprocessed spectral data was further processed and analysed using ML methods. Matlab and Simulink (Mathworks®) were used for data processing and ML methods. A schematic view of the experimental setup including all components is shown in Figure S2a (Supplementary Material). Oil samples were collected in 10 ml cuvettes containing olive oil gradually mixed with vegetable oil, as shown in Figure S2b (Supplementary Material). The proportion of vegetable oil was set between 0 and 50% in steps of 10%, leading to different classes of mixtures containing 100% down to 50% olive oil. The small 10% steps in adulteration were adopted to allow a thorough analysis of the feasibility of using miniature spectrometers for quality screening of olive oil. Over 3 weeks, 10 different data collection sessions took place. Once a session started, it was completed before another one began. Further, there were morning, afternoon, and early evening sessions; and beyond the collection times, to further include any possible lighting-induced distortions, the window blinds were randomly kept open or closed. In addition, all experiments were conducted at ambient room temperature to mimic the use of such a device by an ordinary consumer in a relatively wide deployment scenario. In each session, spectral data from pure olive oil and adulterated oils were obtained, leading to a total of 380 samples; 10 such sessions were carried out, completing a dataset of 3,800 samples. Figure S3 (Supplementary Material) shows samples from each class, emphasising the most responsive wavelength region; the intensity measured for each class is plotted in the sub-images.

2.2 Exploratory statistical analysis

All analyses were conducted in Matlab (Mathworks®) using statistical analysis packages. This evaluation includes a description of the characteristics of the spectral data, its distribution, and differences in spectra across multiple data acquisitions, as well as a summary of the key challenges analysed. The spectral data collected using portable and miniature spectrometers is of lower quality compared to that collected using benchtop spectrometers. Figure 1 displays the mean and standard deviation (SD) of an apple type and an EVOO sample from various sessions, as well as the combination of data from all available sessions. The data for apples is from 2 different sessions, and the EVOO data is from 10 different sessions. The data exhibits distortions such as limited immunity to background noise, a shift in the baseline, and misalignment.

Fig. 1
figure 1

Mean and SD of an apple type and EVOO sample in 2 sessions and all sessions combined

The spectral data exhibited non-linearity and multimodality, following a multimodal distribution that is accentuated by background noise and distortions. There were differences in the spectra for the same class of samples across different sessions of data acquisition using miniature spectrometers. Figure S4 (Supplementary Material) shows an example of clear variations in samples of the 70% adulterated EVOO class collected in different sessions; these variations were not identical in nature across the different sessions. Such variations are not observed with high-resolution, accurate benchtop spectrometers; they make data modelling a challenging task, and the extent of the variation depends on various distortions such as temperature, humidity, and sudden changes in lighting conditions that affect data collection. The correlation coefficients for the apple, olive oil, dates, and powders datasets are illustrated in Fig. 2, indicating a strong correlation between most pairs of variables. This highlights the challenges associated with using miniature spectrometers for food analysis and quality management. The data exploration of the food datasets collected from miniature spectrometers revealed various distortions affecting the quality of the data. These distortions included variations in the spectral data for the same samples across different sessions, higher dimensionality, and highly collinear features. The correlation coefficients (Fig. 2) also indicated a high degree of correlation among the features, which can make it challenging for ML models to achieve high classification rates.

Fig. 2
figure 2

Correlation coefficients for the apple, olive oil, dates, and powders datasets

To establish miniaturized spectroscopy as a widely adopted method for accurate and effective food quality checks, these issues must be addressed. Another important factor in improving the efficiency and effectiveness of portable devices is the ability to identify and preserve relevant information. This highlights the importance of pre-processing techniques and specialized algorithms to address these distortions and improve the accuracy of food quality control using miniature spectrometers. To overcome these challenges, this article presents a comprehensive pipeline encompassing data pre-processing, dimension reduction, classification and regression analysis, and the use of ML techniques to achieve higher classification rates.

2.3 A multistep data analytic pipeline

A data analytic pipeline encompassing all data analytic steps has been devised for this work. These steps include data pre-processing, dimensionality reduction, classification, and evaluation, using various signal pre-processing and classical ML methods. The pipeline and methods used are shown in Figure S5 (Supplementary Material).

2.3.1 Data pre-processing

Data pre-processing prepares the raw data and renders it suitable for the subsequent processing, classification and regression steps in an artificial intelligence and pattern analysis pipeline. It refines the collected data by removing redundant and irrelevant information and is an essential part of modelling, addressing signal distortion and noise reduction. Pre-processing methods include smoothing and the removal or reduction of spikes and peaks, which are usually caused by high-frequency noise and physical phenomena. This can be carried out using linear moving-average and wavelet filters or nonlinear median filters (Huang et al. 1979; Daubechies 1992). Another approach to data smoothing is to employ interpolating filters, of which the Savitzky-Golay (SG) class of filters is widely used for linear filtering of spectral data in analytical chemistry and applied statistics (Savitzky and Golay 1964). In this work, wavelet filters and Savitzky-Golay (SG) filters have been combined for better smoothing and de-noising into a DSG (Daubechies, Savitzky-Golay) filter, in which a wavelet filter (Daubechies-3 with decomposition level 4) is applied first, followed by an SG filtering step (order 3 and window size 41). Samples of the 50%, 70%, and 100% pure olive oil classes were plotted along with their DSG-filtered results (Fig. 3). Multiplicative Scatter Correction (MSC) is a normalisation technique useful for baseline shift correction and for addressing distortions induced by scattered light (Amuah et al. 2019). A data sample with additive effect \(a\) and multiplicative effect \(m\) is modelled as:

Fig. 3
figure 3

Comparison of raw data, filtered data and normalized data

$${\varvec{X}}=a+{\varvec{\mu}}m+noise$$
(1)

MSC-corrected spectra are calculated as:

$${{\varvec{X}}}_{corrected}=\frac{\left({\varvec{X}}-a\right)}{m}$$
(2)

The DSG-filtered data was further normalised using the MSC method for a better data representation (Fig. 3).
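As an illustrative sketch (the paper's processing was done in Matlab), the SG stage of the DSG filter and the MSC correction of Eqs. (1)–(2) can be reproduced in Python with NumPy/SciPy; the wavelet stage (Daubechies-3, decomposition level 4) is omitted here and could be added with a wavelet library such as PyWavelets. The spectra below are synthetic and purely for demonstration:

```python
import numpy as np
from scipy.signal import savgol_filter

def msc(spectra):
    """Multiplicative Scatter Correction: regress each spectrum on the
    mean spectrum (x ~ a + m*ref) and return (x - a) / m, per Eqs. (1)-(2)."""
    ref = spectra.mean(axis=0)
    corrected = np.empty_like(spectra, dtype=float)
    for i, x in enumerate(spectra):
        m, a = np.polyfit(ref, x, 1)   # slope m (multiplicative), intercept a (additive)
        corrected[i] = (x - a) / m
    return corrected

# synthetic spectra with per-sample additive and multiplicative distortions
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 3, 256))
X = np.array([0.5 * i + (1 + 0.2 * i) * base + rng.normal(0, 0.01, 256)
              for i in range(6)])

# SG stage of the DSG filter (order 3, window 41), then MSC normalisation
X_sg = savgol_filter(X, window_length=41, polyorder=3, axis=1)
X_msc = msc(X_sg)
```

Regressing each spectrum against the mean spectrum estimates its additive offset \(a\) and multiplicative factor \(m\); dividing these out aligns the spectra to a common baseline and scale.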

2.3.2 Dimensionality reduction using PCA

A major concern in spectral data analysis is the large number of variables relative to the small number of available samples. This can adversely affect the performance of many classification and regression methods. Dimensionality reduction techniques reduce the data size whilst keeping the essential information intact. For instance, PCA is a widely used technique to reduce the dimension of spectral data and lessen collinearity-related issues (Liu et al. 2019). In the literature, PCA, its variants, marginal relevance, and genetic algorithms have been employed for dimensionality reduction and feature extraction (Singh and Domijan 2019). A comparison of these dimensionality reduction techniques on NIR data, evaluating their effect on the classification of olive oil, showed that PCA-based classification can attain higher classification rates (Singh and Domijan 2019). PCA combined with discriminant analysis (DA) has been successfully applied to Fourier Transform Infra-Red (FTIR) spectral data for the authentication of olive oil blends and to Vis–NIR data for establishing Mediterranean origin (Jiménez-Carvelo et al. 2017).
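A minimal PCA projection via the SVD can be sketched as follows (a NumPy illustration, not the paper's Matlab implementation); the synthetic data mimics the collinearity of spectral channels driven by a few latent factors:

```python
import numpy as np

def pca_scores(X, n_components):
    """Project mean-centred data onto its top principal components via the SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    var_ratio = (S ** 2) / (S ** 2).sum()   # explained-variance ratios
    return scores, var_ratio[:n_components]

# toy "spectra": 40 samples x 1024 collinear channels driven by 3 factors
rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 3))
X = latent @ rng.normal(size=(3, 1024)) + rng.normal(0, 0.01, (40, 1024))
scores, var_ratio = pca_scores(X, 32)
```

Because the toy data has only three underlying factors, the first few components capture nearly all the variance, which is exactly the property that makes PCA effective on collinear spectra.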

2.3.3 Classification and regression methods

Classification can refer to both supervised and unsupervised learning approaches which predict the outcome of an unknown sample. The supervised approach is based on a model built and trained using data whose outcomes are known. Regression is another statistical process which models the output variable with selected input variables. Linear discriminant analysis (LDA) is a supervised classification technique used to separate 2 or more classes based on a linear combination of features (Iatan 2010). The goal of LDA is to find a projection matrix \(\varvec{W}\) which projects the data to LDA space such that the between-class measure is maximised, and within-class measure is minimised (Fisher 1936). The method uses \({\varvec{S}}_{\varvec{b}}\) as a measure for class separability and \({\varvec{S}}_{\varvec{w}}\) as a measure for class compactness; the 2 matrices are defined as:

$${{\varvec{S}}}_{b}=\frac{1}{{\varvec{N}}}\sum_{i=1}^{{{\varvec{C}}}_{{\varvec{N}}}}{{\varvec{N}}}_{{\varvec{i}}}\left({{\varvec{\mu}}}_{i}-{\varvec{\mu}}\right){\left({{\varvec{\mu}}}_{i}-{\varvec{\mu}}\right)}^{^{\prime}}$$
(3)
$${{\varvec{S}}}_{w}=\frac{1}{{\varvec{N}}}\sum_{i=1}^{{{\varvec{C}}}_{{\varvec{N}}}}\sum_{j=1}^{{{\varvec{N}}}_{{\varvec{i}}}}({{\varvec{x}}}_{ij}-{{\varvec{\mu}}}_{{\varvec{i}}}){\left({{\varvec{x}}}_{ij}-{{\varvec{\mu}}}_{i}\right)}^{^{\prime}}$$
(4)

Fisher criterion \({\varvec{J}}({\varvec{W}})\) is the ratio of between-class scatters to within-class scatter (Fisher 1936):

$${\varvec{J}}\left({\varvec{W}}\right)=\frac{{{\varvec{W}}}^{\boldsymbol{^{\prime}}}{{\varvec{S}}}_{{\varvec{b}}}{\varvec{W}}}{{{\varvec{W}}}^{\boldsymbol{^{\prime}}}{{\varvec{S}}}_{{\varvec{w}}}{\varvec{W}}}$$
(5)

LDA can be seen as an optimisation problem with the objective of maximising the Fisher criterion in (5) (Iatan 2010) to find an optimal projection matrix. LDA has been utilized to analyze the spectral data obtained from a portable NIR spectrometer to determine meat quality (Bazar et al. 2017) and verify the authenticity of fish fillets and patties (Grassi et al. 2018).
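A compact NumPy sketch of Eqs. (3)–(5): the scatter matrices are accumulated per class and the projection maximising the Fisher criterion is obtained from the leading eigenvectors of \({\varvec{S}}_{w}^{-1}{\varvec{S}}_{b}\). The two-class toy data is illustrative only:

```python
import numpy as np

def lda_projection(X, y, n_dims):
    """Projection W maximising the Fisher criterion J(W) of Eq. (5),
    built from the scatter matrices Sb and Sw of Eqs. (3)-(4);
    the common 1/N factors cancel in the ratio and are omitted."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)    # between-class scatter
        Sw += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
    # eigenvectors of pinv(Sw) @ Sb maximise the Fisher criterion
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1]
    return evecs[:, order[:n_dims]].real

# two well-separated 2-D classes -> one discriminant direction
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0.0, 0.0], 0.1, (50, 2)),
               rng.normal([3.0, 3.0], 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
W = lda_projection(X, y, 1)
proj = X @ W
```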

Support vector machines (SVM) constitute a supervised classification technique that seeks a clear gap between different categories of samples (Sain and Vapnik 1996). The main objective of SVM is to construct a hyperplane, or a set of hyperplanes, that maximises the distance to the nearest training samples of each class, helping to classify samples distinctly. SVM classification can be seen as an optimization problem in which one minimises a mapping function \({\varvec{\phi}}\left({\varvec{v}}\right)\) that maps the training sample points from the input space to a feature space (Boser et al. 1992; Sun et al. 2016):

$$\left\{ \begin{gathered} Min\,\,\varvec{\phi} \left(\varvec{ v }\right) = \frac{1}{2}v^{\prime}v \hfill \\ st.\varvec{c_{x}} \left( {v^{\prime}.\,\varvec{x} + \varvec{\delta} } \right) \ge 1 \hfill \\ \end{gathered} \right.$$
(6)

where \(v\) is a weight vector, \({{\varvec{c}}}_{{\varvec{x}}}\) is the class labels of sample points \({\varvec{x}}\) and \({\varvec{\delta}}\) is a bias. SVM was applied to spectral data from a handheld NIR spectrometer to distinguish between different varieties of barley and chickpeas (Kosmowski and Worku 2018).
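As a hedged illustration of the SVM approach (using scikit-learn rather than the Matlab tools used in this work), two classes of synthetic "spectra" with slightly shifted absorption peaks can be separated as follows:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 100)
# two classes: absorption-like peaks at 0.4 vs 0.6, plus channel noise
X = np.vstack([np.exp(-((t - 0.4) ** 2) / 0.01) + rng.normal(0, 0.05, (60, 100)),
               np.exp(-((t - 0.6) ** 2) / 0.01) + rng.normal(0, 0.05, (60, 100))])
y = np.array([0] * 60 + [1] * 60)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[::2], y[::2])                   # train on every other sample
acc = float(clf.score(X[1::2], y[1::2]))  # held-out accuracy
```

The RBF kernel plays the role of the mapping function \(\phi\), implicitly lifting the spectra into a feature space where a maximum-margin hyperplane separates the classes.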

The k-nearest neighbour algorithm (k-NN) is a non-parametric classification method in which a sample observation is classified by the majority vote of its \(k\) nearest neighbours. When the parameter \(k\) is one, the data point is assigned to the class of its closest neighbour. The selection of \(k\) is very important, and the value mainly depends on the data distribution. The classification performance depends on the selection of \(k\), which can be either static or dynamic. A static \(k\) is selected for the entire dataset using various methods, including cross-validation (Zhu et al. 2011); a dynamic \(k\) is selected for different sample sets to improve performance, as in (Zhang et al. 2018). k-NN has been used to analyse spectral data in order to determine the quality of olive oil blends (Jiménez-Carvelo et al. 2017) and to analyse the quality of pineapple (Amuah et al. 2019).
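The majority-vote rule with a static \(k\) can be sketched in a few lines (an illustrative NumPy implementation, not the paper's code):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by the majority vote of its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.05, 0.05]), k=3)
```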

Partial Least Squares regression (PLS) is used for the analysis of highly collinear spectral data (Kettaneh-Wold 1992; Song 2018). PLS regression regresses a matrix of predicted variables Y on a matrix of observed variables X. It yields a latent structure spanned by a few latent variables (LVs), as the model attempts to find projection directions in the X-space that explain variance changes in the Y-space:

$$\mathrm{min}\| {\varvec{Y}}-{\varvec{X}}{{\varvec{B}}}_{PLS}\|$$
(7)

where \({\varvec{X}}\) is an \(n\times d\) matrix and \({\varvec{Y}}\) is an \(n\times q\) matrix and \({{\varvec{B}}}_{PLS}\) is a \(d\times q\) regression coefficient matrix computed to build the PLS model. Matrices \({\varvec{X}}\) and \({\varvec{Y}}\) can be decomposed as follows:

$${\varvec{X}}={\varvec{R}}{{\varvec{P}}}^{\boldsymbol{^{\prime}}}+{\varvec{E}}$$
(8)
$${\varvec{Y}}=\boldsymbol{ }{\varvec{U}}{{\varvec{Q}}}^{\boldsymbol{^{\prime}}}+{\varvec{F}}$$
(9)

PLS-discriminant analysis (PLS-DA) is an extension of PLS to cover cases where Y represents a categorical matrix. Further, the selection of the number of LVs can vary; a high number of LVs does not necessarily guarantee high accuracy, due to overfitting. Usually, the number of LVs is selected using cross-validation (Tobias 1995). PLS regression has been effectively utilized to predict the amount of additional chemical components present in olive oil (Uncu and Ozen 2015) and beverages (Wang et al. 2020).

Artificial neural networks (ANN) mimic biological neural networks for data processing and are used in various fields of science and engineering to provide good decision support (Hopfield 1982). In an ANN, neurons are arranged in layers: an input layer, one or more hidden layers, and an output layer. The first layer receives the input data, which is mapped to the output through the hidden layers. Each neuron connects to every neuron in the next layer, and each neuron applies a non-linear activation function that maps its input to an output (Gurney and York 1997). The ability to handle nonlinearity and to solve very complex problems makes ANNs successful in numerous applications.

Classifier combination can be attempted to improve the recognition accuracy by fusing the evidence of individual classifiers (Fig. 4). For a new sample \({\varvec{x}}\), the decision \(D({\varvec{x}})\) is taken by combining the decisions from individual classifiers (\({{D}_{1}\left({\varvec{x}}\right){,D}_{2}\left({\varvec{x}}\right),\dots ,D}_{L}\left({\varvec{x}}\right))\) using a combination method and a decision rule. Combination can be carried out at the abstract level, the rank level, or the score level (Kuncheva 2014). Combination at the abstract level uses the output label from each of the individual classifiers, whereas at the rank level, decisions are based on the ranks given to the outputs of each classifier. A score-level combination approach considers the best labels from each classifier together with their confidence scores. The scores can be fused by density or transformation methods. Various mathematical rules have been used for decision-making in classifier combination, including the simple rules of majority voting and the max, min and sum of individual classifier decisions (Kittler et al. 1998).

Fig. 4
figure 4

A schematic representation of classifier combination

Decision rules behave differently in different cases; hence, there is no single combination rule that works best in all cases (Tulyakov et al. 2008). Another approach is the decision template method, which uses degrees of support among classes and classifiers. For this, a decision profile (DP) is first constructed, giving the decision degree of an input sample \({\varvec{x}}\) for a particular class using the available classifiers. A decision template for each class is then calculated by taking the mean of the DPs of that class. A sample is therefore classified by first constructing its DP and comparing it with the decision template for each class; the nearest match based on the distance measure is selected.
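The abstract-level (majority vote) and score-level (sum/max/min) rules described above can be sketched as follows (illustrative NumPy code with made-up scores):

```python
import numpy as np

def combine_scores(score_list, rule="sum"):
    """Score-level fusion of per-classifier score matrices (n_samples x n_classes)."""
    stacked = np.stack(score_list)                 # (L, n_samples, n_classes)
    if rule == "sum":
        fused = stacked.sum(axis=0)
    elif rule == "max":
        fused = stacked.max(axis=0)
    elif rule == "min":
        fused = stacked.min(axis=0)
    else:
        raise ValueError(rule)
    return fused.argmax(axis=1)

def majority_vote(label_list):
    """Abstract-level fusion: most frequent label across classifiers."""
    labels = np.stack(label_list)                  # (L, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, labels)

# three classifiers' scores for 2 samples over 3 classes
s1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
s2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
s3 = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])
fused = combine_scores([s1, s2, s3], rule="sum")
voted = majority_vote([s.argmax(axis=1) for s in (s1, s2, s3)])
```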

The evidence theory can be used for classifiers combination at the score level (Chitroub 2010). Dempster-Shafer theory (DST) uses belief functions for the combination of evidence from different classifiers (Kuncheva 2014). Using DST, decision for a test sample \({\varvec{x}}\) is computed based on the belief degree \({b}_{ji}({\varvec{x}})\) estimated for every class \(j\) and for each classifier \(i\):

$${b}_{ji}({\varvec{x}})= \frac{{\boldsymbol{\Psi }}_{j,i}({\varvec{x}})\prod_{k\ne j}(1-{\boldsymbol{\Psi }}_{k,i}\left(x\right))}{1-{\boldsymbol{\Psi }}_{j,i}\left({\varvec{x}}\right)[1-\prod_{k\ne j}(1-{\boldsymbol{\Psi }}_{k,i}\left({\varvec{x}}\right))]}$$
(10)

where \({\boldsymbol{\Psi }}_{j,i}\left({\varvec{x}}\right)\) is the function used to find the relationship between the class \(j\) and the classifier \(i\).
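Eq. (10) can be implemented directly; in this sketch the per-classifier belief degrees are then combined by a simple product across classifiers, which is one plausible fusion rule and an assumption of this illustration (the support values are made up):

```python
import numpy as np

def dst_belief(psi):
    """Belief degrees b[j, i] from support values psi[j, i]
    (class j, classifier i), following Eq. (10)."""
    C, L = psi.shape
    b = np.zeros_like(psi, dtype=float)
    for i in range(L):
        for j in range(C):
            prod = np.prod([1 - psi[k, i] for k in range(C) if k != j])
            b[j, i] = psi[j, i] * prod / (1 - psi[j, i] * (1 - prod))
    return b

# two classifiers, three classes: supports for one test sample
psi = np.array([[0.8, 0.7],
                [0.1, 0.2],
                [0.1, 0.1]])
belief = dst_belief(psi)
decision = int(belief.prod(axis=1).argmax())   # fuse across classifiers, pick the best-supported class
```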

DST has been used to fuse different classifiers and has yielded higher classification rates in various applications; nevertheless, its use for food quality analysis is limited and relatively scarce, except for e.g. the fusion of classifiers to classify different types of edible oils (Saha and Saha 2017). Compared to individual classifiers, the results based on the DST approach were higher. Another approach to classifier combination is to employ ensemble methods that combine many classifiers. Ensemble methods include bagging, boosting and stacked ensembles (Rokach 2010). Ensemble methods can improve classification accuracies by avoiding overfitting issues (Polikar 2006). However, selecting a set of individual classifiers and a suitable decision strategy that gives higher rates remains the main challenge.

Random forest (RF) is an ensemble learning algorithm that can be used for both classification and regression. RF constructs multiple decision trees; it overcomes overfitting problems by randomly selecting subsets of the training data for the construction of these trees. The ensemble approach to learning is evidenced by the fact that a forest aggregates outcomes from multiple trees; a classifier constructed in this way is robust to noise and reduces variance and error by training on various datasets (Breiman 2001). Each dataset is used to build one decision tree; during classification, an unknown sample \({\varvec{x}}\) is given to all of the decision trees and the class probabilities of each leaf node visited are noted. The class with the largest average of the class probabilities \({{\varvec{P}}}_{{\varvec{t}}}({\varvec{c}}|{\varvec{x}})\) attained from all the decision trees, as given by (11), is taken as the classification decision (Breiman 2001). RF has been employed to classify premium quality extra virgin olive oil (Ai et al. 2014).

$$P\left({\varvec{c}}|{\varvec{x}}\right)=\frac{1}{T}\sum_{t=1}^{T}{{\varvec{P}}}_{{\varvec{t}}}({\varvec{c}}|{\varvec{x}})$$
(11)

Ensembles of classifiers based on subspace methods have been used to attain higher classification accuracies. The random subspace method has been used with discriminant analysis and k-NN classifiers. A subspace discriminant method attained higher classification rates in classifying images (Ashour et al. 2018; Radhika and Varadarajan 2018). An ensemble based on the subspace discriminant method was used for the classification of saffron and attained good classification rates (Mohamadzadeh et al. 2020). The ensemble method used is described in the following algorithm:

figure a
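The algorithm figure is not reproduced here; the following is a generic random-subspace ensemble sketch (scikit-learn's discriminant analysis as base learner, majority-vote fusion), illustrating the idea rather than the authors' exact procedure:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def subspace_ensemble_predict(X_train, y_train, X_test,
                              n_learners=10, subspace_dim=5, seed=0):
    """Random-subspace ensemble: each learner is trained on a random
    subset of features; test predictions are fused by majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_learners):
        feats = rng.choice(X_train.shape[1], size=subspace_dim, replace=False)
        clf = LinearDiscriminantAnalysis().fit(X_train[:, feats], y_train)
        votes.append(clf.predict(X_test[:, feats]))
    votes = np.stack(votes)                        # (n_learners, n_test)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# toy 20-feature data with two shifted classes
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (60, 20)),
               rng.normal(1.5, 1.0, (60, 20))])
y = np.array([0] * 60 + [1] * 60)
pred = subspace_ensemble_predict(X[::2], y[::2], X[1::2])
acc = float((pred == y[1::2]).mean())
```

Each learner sees a different random feature subset, so their errors are only partially correlated and the majority vote tends to outperform any single subspace classifier.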

3 Results and discussion

3.1 Preprocessing

LDA was applied to PCA-projected data while retaining various numbers of principal components to observe the algorithm’s performance. From Table 1, and based on the average success rate, the number of PCs selected was 32, which was used in all further applications involving PCA. As it is difficult to visualize the data in higher dimensions, a 3D plot of the PCA scores of the first 3 PCs for all 6 classes of the pre-processed data is shown (Fig. 5).

Table 1 Average LDA classification performance with different numbers of PCs
Fig. 5
figure 5

PCA scatter plot of pre-processed data using the first 3 PCs

Given that the present work is preliminary research, the ideal number of LVs for the PLS-DA method was chosen using a brute-force approach, ranging from 2 to 128 (Table 2). The average performance rises from 36.58% to 55% as the number of LVs increases from 2 to 12; beyond that, there were no further signs of improvement. The PLS-DA model appears to overfit beyond 12 LVs, and thus 12 LVs were chosen for the remainder of the PLS-DA analysis (Table 2).

Table 2 Average PLS-DA performance with different numbers of LVs

We applied LDA to raw, filtered, and normalised data (Fig. 6). Based on the obtained results, applying the DSG filter followed by MSC yielded the best performance; hence this pre-processing approach was adopted for the remainder of the work.
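The adopted pre-processing chain can be sketched as follows; `savgol_filter` provides the derivative Savitzky-Golay (DSG) step, and the `msc` helper is an illustrative implementation of multiplicative scatter correction (synthetic spectra, hypothetical filter settings):

```python
import numpy as np
from scipy.signal import savgol_filter

def msc(spectra):
    """Multiplicative scatter correction: regress each spectrum
    against the mean spectrum and remove slope and offset."""
    ref = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, offset = np.polyfit(ref, s, 1)
        corrected[i] = (s - offset) / slope
    return corrected

# Synthetic spectra: a shared signal plus per-scan noise.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4, 128))
spectra = signal + 0.05 * rng.standard_normal((20, 128))

# DSG: first-derivative Savitzky-Golay smoothing, then MSC.
smoothed = savgol_filter(spectra, window_length=11, polyorder=2,
                         deriv=1, axis=1)
normalised = msc(smoothed)
```

The window length and polynomial order are assumptions for the sketch; in practice they are tuned to the instrument's spectral resolution.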

Fig. 6
figure 6

PCA scatter plot of pre-processed data using the first 3 PCs

3.2 Classification analysis

Each dataset represents a complete set of data from one session with pure and adulterated olive oil samples. Two types of tests were carried out. First, a tenfold cross-validation was applied to the entire dataset, in which 10% of the data were randomly held out as the test set and the remaining 90% used for training; the average performance was calculated by repeating this process 10 times (Fig. 7). Second, a multi-session validation test was performed, in which the data recorded in one session were taken as the test set and the data from the remaining sessions were used for training.

Fig. 7
figure 7

Multi-session testing methodology (data not shuffled as for the traditional k-fold approach)

The second test helped evaluate the small devices' sensitivity to various environmental factors, as well as their suitability for use outside a laboratory's controlled environment. To gauge the efficiency of a classifier, 9 of the 10 datasets were used to build the training model and the remaining dataset was used for testing. For instance, a model can be built using dataset-2 to dataset-10 (9 datasets) during the training phase, with dataset-1, which was not part of the training, used for testing. In this way, all sample classes representing different adulteration levels in a particular session were used for testing, allowing a clear analysis of the classification performance. The main difference from the k-fold cross-validation methodology is that the data are not selected at random but split according to the sessions in which they were collected. The average performance was calculated by averaging the individual classification results over every test dataset.
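This session-wise split corresponds to leave-one-group-out cross-validation, sketched here with scikit-learn's `LeaveOneGroupOut` on synthetic data with hypothetical session labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Synthetic stand-in: 10 acquisition sessions of 30 samples each.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=10, n_classes=6,
                           n_clusters_per_class=1, random_state=0)
sessions = np.repeat(np.arange(10), 30)

# Each fold holds out one whole session, never shuffling across sessions.
scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sessions):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(f"per-session accuracies: {np.round(scores, 3)}")
print(f"average multi-session accuracy: {np.mean(scores):.3f}")
```

Because a held-out session never contributes to training, any session-specific ambient distortion shows up directly as a drop in that fold's score.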

3.2.1 Individual classifiers

Six classification algorithms were employed on the six-class problem, and two types of testing were carried out to assess the impact of ambient distortions on the spectral data acquired from the miniature/portable devices: the cross-validation test and the multi-session test. The performance reported for both testing methods includes the maximum, minimum and median values (Figs. 8, 9).

Fig. 8
figure 8

Classification rates obtained from all methods using tenfold cross-validation

Fig. 9
figure 9

Classification rates for individual classifiers using the multi-session test

For the cross-validation test, the accuracy of each model was determined using tenfold cross-validation on the entire dataset, disregarding the different data acquisition sessions. To obtain the most accurate performance profile possible, the cross-validation was itself repeated ten times using random partitions of 90% training and 10% test data (Fig. 8).

The k-NN method outperformed all other methods, with maximum and minimum accuracies of 97.27% and 94.85%, respectively. The 75th and 25th percentiles were 96.97% and 95.45%, showing consistently high classification. One possible reason is the nature of the data, which is well clustered. The next highest accuracy was achieved by SVM, with maximum and minimum values of 93.64% and 89.09%, respectively. LDA, RF and NN showed almost identical accuracies, with maximum and minimum values of 89.69% and 83.64% (LDA), 90% and 85.76% (RF), and 89.69% and 83.64% (NN). The classification rates obtained with PLS-DA contrasted with the high rates attained by the other methods: the maximum accuracy for PLS-DA was 77.27% and the minimum 73.33%. This lower performance was probably due to the large number of predictor variables relative to the number of samples. Overall, the results from the different classifiers clearly demonstrated the good performance of the spectrometer and its suitability for food quality checks.

The multi-session test made use of inter-session data. Olive oil samples were re-scanned on several days using the same methodology; the newly scanned data served as the test set, and the data from the other sessions served as training data for the model. This test measures the system's sensitivity to variations in the scan environment and confirms that the ML models identified relevant signals rather than simply picking up well-correlated background noise (Fig. 9). The accuracy showed a considerable drop among all models, with different levels of performance observed because each dataset was collected in a distinct session affected by noise from various distortions.

The SVM method outperformed all other methods, with an average classification accuracy of 83.32% (maximum 96.58%, minimum 66.04%), and achieved ≥ 90% in 5 of the sessions. The maximum performance was obtained on dataset-1 (96.58%) and the minimum on dataset-4 (66.04%). These contrasting results give an idea of the amount of noise affecting data quality.

The second highest average accuracy was achieved by the LDA method, with 79.95 ± 7.42%. This method achieved > 90% accuracy in 2 sessions, with the maximum performance of 90.53% obtained on dataset-9.

The third highest performance was achieved by the ANN method, with an average accuracy of 78.58 ± 7.0%. RF showed an average accuracy of 71.5% with an SD of 15.49%: out of 10 sessions, 2 were > 90% and 2 were > 80%, while the remaining sessions fell in the 50–60% range, which explains the large SD. The maximum accuracy for all 3 methods (LDA, RF and ANN) was 90.53%, whereas the minimum values were 66.32%, 51.05% and 66.32%, respectively.

The classification rates for PLS-DA and k-NN were markedly lower than those of the other methods, with maxima of 69.47% and 76.32% and minima of 44.21% and 36.84%, respectively. The average accuracies for PLS-DA and k-NN were 60.68 ± 8.39% and 57.13 ± 12.49%, respectively. Dataset-1, which gave the highest accuracy for SVM, yielded 79.74% with LDA and 73.95%, 51.32%, 44.21% and 36.84% with ANN, RF, PLS-DA and k-NN, respectively.

These results, contrasting with those of the cross-validation test, highlight the need for a proper analysis of data quality across the different sessions and of the various distortions affecting the data. There could be several reasons for the performance gap between the cross-validation and multi-session validation methods for spectral data from portable devices. One possibility is the difference in ambient conditions between the two methods: spectral data collected in an uncontrolled environment may be affected by factors such as temperature, humidity and lighting, which introduce distortions and noise and can degrade the performance of the ML models, leading to lower classification rates in the multi-session validation method. Another possible reason is that the multi-session validation method used a test set collected in a different environment from the training set. This can introduce a domain shift between the training and test sets, causing the models to perform poorly on test data from a different session. Additionally, the data collected in the multi-session validation method may be of lower quality due to the uncontrolled environment, which can further degrade model performance. Given these overlapping ranges of performance, an appealing solution is to fuse the decisions of standalone classifiers to improve the performance.

3.2.2 Classifier combination

To address the impact of ambient distortions on the multi-session test, classifier combination was carried out using the MAX, MIN, SUM, decision template, majority voting and DST rules, as well as the ensemble approach based on subspace discriminant methods, with LDA, SVM and k-NN as the combined individual classifiers (Table 3). The majority voting method attained an average classification rate of 82.95 ± 8.57%, outperforming the other methods with > 90% in 3 test cases and > 80% in 4 test cases. The subspace discriminant method came next in terms of accuracy, with an average rate of 80.89 ± 6.16%.
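Majority voting and the SUM rule can be sketched as follows, with LDA, SVM and k-NN as the combined classifiers (synthetic data; the other fusion rules follow the same pattern of aggregating labels or posteriors):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic 6-class stand-in for the spectral data.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=10, n_classes=6,
                           n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                          random_state=0)

classifiers = [LinearDiscriminantAnalysis(),
               SVC(probability=True),
               KNeighborsClassifier()]
preds = np.array([c.fit(X_tr, y_tr).predict(X_te) for c in classifiers])

# Majority voting: the class predicted by most classifiers wins.
vote = np.array([np.bincount(col).argmax() for col in preds.T])

# SUM rule: add the posterior probabilities, take the arg-max class.
probs = sum(c.predict_proba(X_te) for c in classifiers)
sum_rule = probs.argmax(axis=1)

print(f"vote acc: {(vote == y_te).mean():.3f}, "
      f"sum acc: {(sum_rule == y_te).mean():.3f}")
```

`np.bincount(...).argmax()` breaks ties in favour of the lowest class index; a production implementation would define tie-breaking explicitly.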

Table 3 Classification rates (%) obtained from classifier combination

The belief-rule-based combination underlying DST did not excel compared to the other combination methods: its average accuracy was 78.53 ± 8.48%, with a maximum of 91.84% and a minimum of 68.16%. Interestingly, DST gave higher accuracy in the second test case, where SVM attained its minimum. The average classification rates obtained for the other combination methods MAX, SUM, MIN and decision template were 76.45 ± 10.67%, 80.87 ± 8.41%, 72.76 ± 13.12% and 76.82 ± 8.29%, respectively. Another observation was that the average accuracy of the SUM rule was almost the same as that of the ensemble method. The maximum accuracies for the combination methods occurred in the test cases of dataset-7 and dataset-9, and the minimum in dataset-1, with the lowest accuracy (45%) obtained with the MIN rule. The average classification results attained by all methods (individual and combination) are shown in Figure S1 (Supplementary Material). Comparing individual classifiers and classifier combinations, the best results were obtained with the SVM and majority voting methods.

Further, a detailed performance analysis was carried out for the best individual classifier and the best classifier combination, focusing on how well the individual classes were classified by the SVM and majority voting methods. The performance metrics used for the evaluation are accuracy, sensitivity, specificity, precision, recall, f-measure and g-mean (Table 4). Precision, an indicator of the quality of a positive prediction made by the model, was 0.8333 for SVM and 0.7303 for the MV combination. The f-measure for SVM and majority voting was 0.9091 and 0.8442, respectively; similarly, the g-mean was 0.9791 and 0.9612.

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP}+\text{FN}}$$
$$\text{Specificity} = \frac{\text{TN}}{\text{TN}+\text{FP}}$$
$$\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}$$
$$\text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}$$
$$\text{F1 measure} = \frac{2\,(\text{Precision}\times\text{Recall})}{\text{Precision}+\text{Recall}}$$
$$\text{g-mean} = \sqrt{\text{Precision}\times\text{Recall}}$$
where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
Table 4 Performance metrics used to evaluate the model
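The metrics above can be computed directly from the confusion counts; a small self-contained sketch for the binary case, using the paper's definition of g-mean as √(precision × recall):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Sensitivity/recall, specificity, precision, F1 and g-mean
    from the confusion counts, per the formulas above."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = np.sqrt(precision * sensitivity)
    return sensitivity, specificity, precision, f1, g_mean

# Tiny illustrative example: 3 TP, 1 FN, 3 TN, 1 FP.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1])
sens, spec, prec, f1, gm = binary_metrics(y_true, y_pred)
print(sens, spec, prec, f1, gm)  # all 0.75 for this example
```

For the six-class problem these metrics are computed per class in a one-vs-rest fashion and then averaged, which is how Table 4 would be derived from the confusion matrix.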

Based on the results, some ML models performed well on the spectral data collected from portable devices, while others did not. In the multi-session test, SVM gave the highest results, followed by LDA, neural networks, random forest and PLS-DA. On the other hand, k-NN gave the best results in cross-validation but performed very poorly in the multi-session approach. One possible reason for k-NN's high performance under cross-validation is that the data are well clustered, which makes it easier for the algorithm to classify samples accurately based on their similarity to nearby points. Other possible reasons for the high performance of k-NN and of models such as LDA, SVM, RF and NN could be suitable hyperparameter tuning and the suitability of each algorithm to the nature of the data. Regarding the poor performance of PLS-DA, one reason could be the number of variables relative to the number of samples: PLS-DA is a dimensionality reduction technique that is sensitive to the ratio of samples to variables, and when there are more variables than samples it may fail to find the underlying structure of the data. Additionally, PLS-DA assumes that the data are linear; if the data are non-linear, the method may be unsuitable and perform poorly.

There could be several reasons for the different performances between the models. One possibility is that each model is sensitive to different types of noise and distortion in the data: SVM, LDA, PLS-DA, RF and NN may be more robust to the noise and distortions in data collected from portable devices, while k-NN is not. Furthermore, each model makes different assumptions about the data distribution; the assumptions of SVM, LDA, PLS-DA, RF and NN may align more closely with the data collected from portable devices than those of k-NN. Additionally, k-NN might be sensitive to the domain shift between the training and test sets, leading to poor performance in the multi-session validation method. It is also possible that the specific k-NN configuration used in this study was not well suited to the data and task at hand, which would explain the poor performance. Further research and experimentation are necessary to confirm the exact reason.

There is a clear performance gap between the results obtained by multi-session validation and those obtained by cross-validation. The k-NN algorithm, which performed well in the cross-validation method (96.73% on average), performed poorly in the multi-session validation method (57.13% on average). This is likely due to variations in ambient conditions during data collection in different sessions, which can affect the data acquisition process and introduce distortions in the spectral data, making it more challenging for the k-NN algorithm to classify the samples accurately. The SVM algorithm performed best for multi-session validation (83.32% on average), followed by LDA (79.95% on average). PLS-DA performed poorly (63.01% on average), but still better than k-NN. SVM is a robust algorithm that handles non-linearity and high dimensionality better than the other algorithms, making it well suited to data collected in different sessions. LDA is a linear method and performs well with relatively small datasets such as the one in this study. PLS-DA's poor performance in multi-session validation confirms its sensitivity to the ratio of samples to variables and suggests it may not be suitable for handling data collected in different sessions.

One goal of this study was to improve performance by combining classifiers. Despite the use of combined methods, the results were not as successful as expected. The majority voting method, which selects the most common prediction among multiple classifiers, showed the best results (82.95% on average), but this still did not match the performance of SVM, the best-performing individual classifier. Other combined methods, e.g., subspace discriminant and SUM, produced better results than most individual classifiers but not as good as SVM. Further research is needed to develop more effective methods for combining the predictions of multiple classifiers, in order to fully utilize their strengths and improve performance in food quality analysis using portable spectrometers.

Comparing the results of this study to those from traditional, lab-based spectrometers: Downey et al. (2003) used NIR-VIS spectral data acquired from a lab-based NIR spectrometer for the classification of extra virgin olive oil by geographic origin; factorial discriminant analysis achieved a success rate of 93.9%. Uncu et al. (2019) used a UV spectrometer on a set of 89 samples and applied PLS to mean-centered data for the quality analysis of olive oil from 3 different regions, with a classification rate of 99%; they also reported that similar results were attained on data collected with an FTIR spectrometer. Kružlicová et al. (2008) used k-NN for the classification of 193 olive oil samples of 5 different types and 3 different origins, achieving a rate of 98.7%. A laboratory-based UV–VIS spectrometer was used to acquire spectral data for the detection of vegetable oil adulteration in olive oil (Didham et al. 2019), yielding a classification rate of 90% using PCA and PLS-DA; the same authors reported a classification rate of 97% with a laboratory-based FTIR spectrometer (Didham et al. 2019). The classification rate of 96.73% reported in this paper, attained with a miniature device, compares well with the results mentioned above, even though those works may have employed non-hand-held spectrometers. We have also compared our results to those of other studies on oil analysis using portable and miniature devices (Table 5). This study focused on the multi-session test case, which is not commonly seen in the literature, on the understanding that data acquired by consumers may be affected by various ambient distortions and that these limitations must be properly addressed. The results are comparable to similar outcomes in the literature and provide further insights for future research on food authenticity analysis using portable and miniature spectrometer devices.

Table 5 Comparison of related works using laboratory based and portable/miniature devices

4 Conclusions

This study explored the use of miniature spectrometers and ML techniques for food analysis, with a specific focus on assessing the quality and purity of olive oil. Through an exploratory analysis of food datasets collected from miniature spectrometers, the study highlights the various distortions affecting the quality of the data, including variations in spectra across different sessions, high dimensionality, and collinearity. The spectral data collected from a miniature spectrometer accurately differentiated pure olive oil from adulterated samples, with a classification rate of 96.73% using cross-validation on 10% random test data and an average of 83.32% using multi-session testing. This suggests that miniature spectrometers, when augmented with a suitable ML pipeline, have the potential to perform classification tasks comparable to non-portable and more expensive spectrometers at lower cost. Furthermore, this work highlights the importance of specialized algorithms to handle varying ambient conditions during data acquisition and to minimize performance gaps. The findings also suggest that miniature spectrometers could be well suited to in situ scenarios with a broader range of applications. Further research should focus on different pre-processing and classification algorithms to enhance the performance of multi-session testing and on addressing the distortions affecting newly acquired data. Future studies should also apply deep learning techniques, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), to the analysis of spectral data. These methods are promising for handling high-dimensional and collinear data and may improve classification performance compared to traditional ML algorithms. Additionally, research should investigate the use of transfer learning and the fine-tuning of pre-trained models to improve performance on spectral data from miniature spectrometers.
The potential of miniature spectrometers in fighting food fraud is significant, as they could allow even ordinary consumers to use such devices for in situ food quality analysis.