Geographical classification of Spanish bottled mineral waters by means of iterative models based on linear discriminant analysis and artificial neural networks

  • Francisco Gutiérrez-Reguera
  • J. Marcos Jurado
  • Rocío Montoya-Mayor
  • Miguel Ternero-Rodríguez
Original Article

Abstract

The composition of Spanish natural mineral waters has been determined by inductively coupled plasma-mass spectrometry, inductively coupled plasma-atomic emission spectrometry, ion chromatography and other routine techniques. The methods were applied to samples of bottled water from springs situated in five mountain systems: Cordillera Costero-Catalana, Macizo Galaico, Sistemas Béticos, Sistema Central and Sistema Ibérico. Pattern recognition techniques were applied to differentiate the origin of the samples. Data were first examined by nonparametric multiple comparison techniques and principal component analysis to highlight data trends. Classification models based on linear discriminant analysis and multilayer perceptron artificial neural networks were built and validated by means of a stratified jackknifing methodology. An iterative approach was used to build an artificial neural network model based on the variables selected by linear discriminant analysis. The prediction ability of the resulting model was 94 %.

Keywords

Pattern recognition; Multivariate analysis; Multielemental analysis; Geographical characterization; Natural mineral water

1 Introduction

Water is one of the most important compounds on Earth because of its essential role in life. Human consumption of water has changed from ancient times to the present. At first, humans drank water directly from sources such as rivers, lakes or wells, whereas today water is treated before consumption. Potabilization and chlorination of water are perhaps among the most important advances in human history and one of the main contributions of chemistry to the development of society. These processes sometimes sacrifice the organoleptic characteristics of water in order to improve its suitability for human consumption by eliminating small particles, pathogens and chemical contaminants. Nevertheless, people appreciate products with their natural characteristics, and natural mineral waters are therefore in high demand among consumers.

According to Directive 2009/54/EC of the European Union [1], natural mineral water is obtained from a protected underground source and bottled directly without any chemical treatment, except for the separation of unstable elements, such as sulphur, iron, manganese and arsenic compounds. These compounds are usually removed by filtration or decantation (compounds of iron or sulphur), possibly preceded by oxygenation, whereas certain natural mineral waters are treated with ozone-enriched air, provided that such treatment does not modify the composition of the water as regards its essential constituents. Each natural mineral water has its own stable mineral composition, and it must be labelled stating the analytical composition, the place of origin and the name of the source. Most European producers belong to national associations. In Spain, approximately a hundred companies are engaged in the exploitation of natural water springs, most of them belonging to the Asociación Nacional de Empresas de Aguas de Bebida Envasada (ANEABE). The national associations are federated into the European Federation of Bottled Waters (EFBW). Created in 2003, EFBW is committed to protecting the unique specificities of natural waters and works to promote the sector and its products.

As noted above, the chemical composition of a bottled water acts as a fingerprint of its natural source. This composition depends on the nature of the soil and rock formations and on the weather, so similar waters can be expected in nearby areas. For instance, Sipos et al. [2] used sensory evaluation, electronic tongue responses and chemical composition to differentiate the geographical origin of Hungarian spring waters. The elemental profile of mineral waters is very stable, and for this reason, pattern recognition techniques based on this composition are a potential tool for authenticity and adulteration testing. Some studies have focused on the mineral profiling of bottled waters from different countries, such as Cameroon [3], Croatia [4], Hungary [5], Italy [6], Germany [7], Turkey [8], Spain [9] and the UK [10], but few of them developed authentication studies. Authentication has been explored in the case of Brazilian spring waters, with prediction abilities of 94–97 % [11]. Birke et al. [7] studied the geographical dependency of German bottled waters according to their major and trace element composition by means of principal component analysis (PCA). Güler [8] applied PCA and cluster analysis to characterize Turkish bottled waters according to their major components. Oyebog et al. [3] used factor analysis and cluster analysis to relate the composition of waters to the composition of the soils near the springs and to other surface enrichment phenomena. Grošelj et al. [12] also studied the relationship between the chemical composition of bottled waters and their geological origin, but using an artificial neural network (ANN) approach.

In the Iberian Peninsula, most sources of bottled natural mineral water are distributed among five mountain systems: Sistemas Béticos, Sistema Central, Sistema Ibérico, Cordillera Costero-Catalana and Macizo Galaico. Sistemas Béticos are located in the south and east of the Iberian Peninsula, reaching from western Andalusia to the region of Murcia, as well as the south of Castilla La Mancha and Valencia. Sistema Central is a mountain range separating the Tajo and Duero basins and forms the natural boundary between Castilla León, to the north, and Castilla La Mancha, Madrid and Extremadura, to the south. The mountains of Sistema Ibérico extend from Burgos to the north of Valencia. The mountains known as Cordillera Costero-Catalana run parallel to the Catalan coast. In the case of Macizo Galaico, the mountains are distributed from the south-west to the north-east of Galicia.

Based on the cited literature [2, 11], samples from these regions can be expected to be distinguishable according to their elemental composition. The purpose of this work is to explore the adequacy of the inorganic profile and some non-specific parameters of Spanish mineral waters from the above mountain systems to establish classification models. The contents of Al, As, B, Ba, Co, Cr, Cs, Fe, Li, Mn, Mo, Ni, Se, Sb, Sr, Ti, U and Zn were determined by inductively coupled plasma-mass spectrometry (ICP-MS). The contents of Si, Ca, Mg, Na and K were determined by inductively coupled plasma-atomic emission spectrometry (ICP-AES); SO42− and Cl− were determined by ion chromatography, whereas HCO3− was determined by potentiometric titration. Other parameters such as pH, electrical conductivity (EC), redox potential (E) and dry extract (DE) were also measured experimentally. Pattern recognition techniques, such as principal component analysis (PCA), stepwise linear discriminant analysis (SLDA) and artificial neural networks (ANN), were used to obtain suitable classification models.

2 Materials and methods

2.1 Samples and study area

A total of 52 samples of commercial bottled mineral waters were purchased in markets or obtained directly from suppliers. Samples from Cordillera Costero-Catalana (N = 11), Macizo Galaico (N = 8), Sistemas Béticos (N = 10), Sistema Central (N = 8) and Sistema Ibérico (N = 15) were considered for this study. The information about samples is included in electronic supplementary material (Table S1). Prior to analysis, water samples were stored at 4 °C.

2.2 Analytical method for metals and metalloids

Major elements (Si, Na, K, Ca and Mg) were analysed by ICP-AES, whereas minor and trace elements (Al, As, Sb, Ba, Be, B, Cd, Cs, Zn, Co, Cu, Cr, Sr, Fe, Li, Mn, Mo, Ni, Ag, Pb, Se, Tl, Ti, Th, U and V) were analysed using ICP-MS. An ULTIMA 2 Instrument (Horiba Scientific, Japan) was used for ICP-AES determinations, while the ICP-MS Instrument was an X7 SERIES ICP-MS (Thermo Elemental, USA). The instrumental details and operating conditions are summarized in Tables S2 and S3 of the electronic supplementary material, respectively. ICP-AES and ICP-MS measurements were carried out at the Research General Services of the University of Seville.

Complete ICP-MS analyses were conducted according to US EPA method 200.8, with some modifications related to tuning and mass calibration. These adaptations were established from the X Series ICP-MS Getting Started Guide [13] and were restricted to the isotopes of interest. The sample matrix was reproduced in the calibration and QC standards, and appropriate internal standards were selected. In order to minimize metal residues, all glassware was cleaned with a 0.2 M solution of nitric acid for 24 h. All reagents, materials and samples were handled within a vertical laminar airflow cabinet (Indelab, model IDL-48 V). The cabinet contained a high-efficiency particulate air (HEPA) filter that ensured air cleanliness class 100, according to Federal Standard 209E.

The performance characteristics of the analytical method, namely trueness, precision, limit of detection (LOD), limit of quantification (LOQ) and linearity within the calibration range, were tested. The accuracy of the spectrometric determinations was established by analysing international certified reference materials (CRM). Experimental concentrations were obtained from 18 replicates over 6 days. Quality control samples were used according to the EPA protocol (Section 9.0 of EPA 200.8:1994). CRM-TMDW (Trace Metals in Drinking Water Standard) from High Purity Standards (Charleston, USA), which is certified for trace metals in drinking water, was used to determine the accuracy of the US EPA method for drinking water by ICP-MS. The trueness of the method was evaluated by determining specific elemental concentrations in the CRM. Recoveries (90.5–105.6 %) demonstrated that the method presented optimum trueness, with values within the AOAC range [14]. Precision, expressed as the relative standard deviation (RSD) of repeatability, ranged from 0.6 to 5.1 %, also within the AOAC range according to the elemental content. Limits of detection and quantification were calculated using the standard deviations obtained from the calibration curves. LOD and LOQ were obtained as the concentrations corresponding to a signal of 3 and 10 times the standard deviation of the intercept, respectively. Limits of quantification were quite low for ICP-MS (0.1–1.5 µg L−1), enabling the analysis of very low levels of metals and metalloids in drinking water. Linearity within the calibration range was calculated as 100·(1 − sb/b), where b is the slope of the calibration curve and sb is its standard deviation [15]. The ICP-MS calibration curves were linear for all elements analysed, generally with values higher than 95.0 %.
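The calibration figures of merit described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the calibration line is fitted by ordinary least squares, LOD and LOQ are read as 3·sa/b and 10·sa/b (sa being the standard deviation of the intercept), and the function name and the calibration data are hypothetical, not taken from the study.

```python
import math

def calibration_figures(conc, signal):
    """Fit signal = a + b*conc by least squares and derive LOD, LOQ and linearity."""
    n = len(conc)
    mean_x = sum(conc) / n
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in conc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc, signal))
    b = sxy / sxx                # slope of the calibration curve
    a = mean_y - b * mean_x      # intercept
    # residual standard deviation with n - 2 degrees of freedom
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(conc, signal))
    s_res = math.sqrt(ss_res / (n - 2))
    s_b = s_res / math.sqrt(sxx)                                   # SD of the slope
    s_a = s_res * math.sqrt(sum(x * x for x in conc) / (n * sxx))  # SD of the intercept
    return {
        "slope": b,
        "intercept": a,
        "LOD": 3 * s_a / b,                     # assumed reading of the LOD definition
        "LOQ": 10 * s_a / b,                    # assumed reading of the LOQ definition
        "linearity_pct": 100 * (1 - s_b / b),   # 100*(1 - sb/b), as in the text
    }

# Hypothetical calibration data (concentration, instrument signal), for illustration only
figures = calibration_figures([0, 1, 2, 5, 10], [0.1, 10.3, 19.8, 50.4, 99.9])
print(figures)
```

With this reading, LOQ is always 10/3 times LOD, and the linearity value approaches 100 % as the uncertainty of the slope shrinks.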

Each sample was analysed in three replicates, and the standard deviation for each element was calculated. Internal standards used in ICP-MS were Sc, In, Tb and Bi, which presented optimum accuracies (97.8–101.2 %). Standard solutions for metals and acids were from MERCK. Ultra-pure water was from WATERS-MILLIPORE (Milli-Q-grade, Model Plus).

In the case of elements also included in the CRM-TMDW certificate but determined by ICP-AES, recoveries vary from 90.1 to 102.4 % and precision from 1.9 to 6.2 %. Linearity was higher than 98 % for all these elements. LOQs varied from 0.036 to 0.18 mg L−1.

2.3 Analytical method for inorganic anions

Inorganic anions (Cl−, SO42−) were determined by ion chromatography (IC) with conductivity detection and chemical suppression (H2SO4). A 792 Basic IC (Metrohm, Germany) was used. A METROSEP A Supp5-250 column, protected by a METROHM precolumn module, was employed for the determination of the anions under the following conditions: mobile phase, 3.2 mM Na2CO3/1.0 mM NaHCO3; flow rate, 0.7 mL min−1; injection volume, 100 µL.

Accuracy and precision of the applied method were established using international certified reference materials (BCR, Simulated Rain Water) at two concentration levels (CRM 408, low content, and CRM 409, high content) supplied by the Institute for Reference Materials and Measurements (IRMM, Belgium). All the results obtained for these materials present optimum recoveries (>98 %). Precision, expressed as % RSD, was 1.3 in the case of SO42− and 2.1 in the case of Cl−. Limits of detection and quantification and linearity were also calculated. LOD and LOQ for SO42− were 0.04 and 0.13 mg L−1, respectively. In the case of Cl−, LOD and LOQ were 0.15 and 0.5 mg L−1, respectively. A linearity of 98 % was achieved for both anions.

Alkalinity was determined by potentiometric titration with HCl, according to the ISO 9963-1:1994 procedure [16]. General parameters such as pH, electrical conductivity (EC) and redox potential (E) were measured according to standard methods [17].

2.4 Chemometric calculations

A data matrix consisting of 52 rows (samples) and 30 columns (variables) was obtained for the chemometric calculations. Basic statistics and the Kruskal–Wallis test were used to highlight differences between the five mountain systems considered in this study. PCA was used for an initial exploration of data trends. LDA and ANN were used to obtain classification models. Before these calculations, all variables were auto-scaled, i.e. the data in each column were mean-centred and divided by the standard deviation of that column. All chemometric calculations were carried out with the software package Statistica 8.0 (StatSoft, Tulsa, OK, USA).
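The auto-scaling step can be illustrated with a minimal Python sketch. The calculations in the study were done in Statistica; the code below is only an equivalent illustration, and it assumes the sample standard deviation (n − 1 denominator), which the text does not specify. The toy matrix is hypothetical.

```python
from statistics import mean, stdev

def autoscale(matrix):
    """Column-wise auto-scaling: subtract the column mean, divide by the column SD."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample SD (n - 1); an assumption
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in matrix
    ]

# Hypothetical 3 x 2 matrix (rows = samples, columns = variables such as pH and EC)
toy = [[7.1, 210.0], [7.9, 540.0], [8.3, 330.0]]
scaled = autoscale(toy)
print(scaled)
```

After scaling, every column has mean 0 and standard deviation 1, so variables measured in very different units carry equal weight in the subsequent models.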

3 Results and discussion

3.1 Chemical composition

The contents of Al, As, B, Ba, Co, Cr, Cs, Fe, Li, Mn, Mo, Ni, Se, Sb, Sr, Ti, U, Zn, Ca, Mg, Na, K, Si, HCO3−, SO42− and Cl−, as well as the values of pH, EC, E and DE, for the water samples from the five mountain systems are given in Table S4 of the electronic supplementary material. As can be seen, pH varies from 6.96 to 8.89, without apparent differences between the groups. The same can be observed for EC, which ranges from 129 to 794 µS cm−1. Median values of E vary from 202 mV, in the case of Sistemas Béticos, to 217 mV, in the case of Cordillera Costero-Catalana. Samples from Cordillera Costero-Catalana and Sistemas Béticos present the lowest median values of DE (174 mg L−1), whilst those from Macizo Galaico present the highest (291 mg L−1). HCO3− is the most abundant anion, with median values ranging from 140 to 297 mg L−1, the highest contents being found in samples from Sistema Ibérico. Median values of SO42− and Cl− ranged from 10.0 to 22.2 and from 6.0 to 32.9 mg L−1, respectively. Samples from Sistema Central present the highest median concentration of Cl− and the lowest of SO42−. The highest median content of Ca (88.7 mg L−1) was found in samples from Sistema Ibérico and the lowest in waters from Macizo Galaico and Sistema Central, with contents of 22.8 and 21.5 mg L−1, respectively. Samples from Sistema Ibérico also present the highest content of Mg (23.4 mg L−1), whilst those from Macizo Galaico present the lowest (5.4 mg L−1). In contrast, the highest median contents of Na (67.1 mg L−1), K (4.0 mg L−1) and Si (3.81 mg L−1) were found in Macizo Galaico waters. The highest median contents of Al (16.7 mg L−1), Ba (29.0 µg L−1), Zn (5.8 µg L−1), Co (0.40 µg L−1), Mn (0.56 µg L−1), Mo (1.4 µg L−1), Ti (3.3 µg L−1) and U (6.5 µg L−1) were found in samples from Cordillera Costero-Catalana. Samples from Macizo Galaico present the highest contents of Li, Sb, B, Cs and Sr, with medians of 816, 0.27, 237, 54.7 and 161 µg L−1, respectively. Waters from Sistema Central present the highest median contents of As (1.6 µg L−1) and Se (0.65 µg L−1). Samples from Sistema Ibérico and Sistemas Béticos present the highest median values of Fe, 137 and 149 µg L−1, respectively.

In light of these results, a nonparametric multiple comparison method, the Kruskal–Wallis test [18], was applied to find significant differences between groups. The Kruskal–Wallis H value is computed and compared with the tabulated χ² value for 4 degrees of freedom and α = 0.05. If a variable presents a significant H value, a post hoc comparison is performed to highlight significant differences between pairs of groups. Table 1 summarizes the results. As can be seen, samples from Sistemas Béticos and Cordillera Costero-Catalana present significant differences in the case of K, Si, Mo, Ti, U and Zn. Waters from Sistemas Béticos also differ from samples from Macizo Galaico in the contents of Mg, Na, K, Si, Cs, Li and Ti. Differences between Sistemas Béticos and Sistema Central were found for Cl−, Na, K, Si, As and U. Macizo Galaico and Sistema Ibérico differ statistically in the levels of Ca, Mg, Na, K, Si and Cs. The contents of HCO3−, Ca, Si and As present significant differences in the comparison between Sistema Central and Sistema Ibérico. Cs, Li, Sb and U were statistically different for the pair Cordillera Costero-Catalana and Macizo Galaico, whilst the comparison between Cordillera Costero-Catalana and Sistema Ibérico presents significant differences for E, Mo, Se and Ti. Cordillera Costero-Catalana and Sistema Central show differences in the contents of Cl−, As and Se. Waters from Sistemas Béticos and Sistema Ibérico differ in only two variables, DE and Ba. The pair Macizo Galaico and Sistema Central differs only in the case of U. Taking these comparisons into account, pattern recognition methods were applied to obtain classification models.
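The H statistic and its comparison with the χ² distribution can be sketched as follows. This is a simplified illustration, not the Statistica routine used in the study: ties receive average ranks but no tie correction is applied, the post hoc pairwise comparisons are not included, and the closed-form tail probability is valid only for 4 degrees of freedom (five groups). All function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    acc, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        acc += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * acc - 3 * (n_total + 1)

def chi2_sf_df4(x):
    """Upper-tail probability of the chi-square distribution with 4 d.f."""
    return math.exp(-x / 2) * (1 + x / 2)

# The critical value 9.49 corresponds to a tail probability of about 0.05
print(round(chi2_sf_df4(9.49), 4))
# H values reported in Table 1: pH (3.47) is not significant, Si (35.91) is
print(chi2_sf_df4(3.47) > 0.05, chi2_sf_df4(35.91) < 0.05)
```

For five groups, a variable is flagged as significant when its H value exceeds 9.49, which is the threshold quoted in the footnote of Table 1.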
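Table 1

Kruskal–Wallis test results

| Parameter | H (a) | CC–MG | CC–SB | CC–SC | CC–SI | MG–SB | MG–SC | MG–SI | SB–SC | SB–SI | SC–SI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| pH | 3.47 | | | | | | | | | | |
| EC | 7.93 | | | | | | | | | | |
| E | 11.7 | | | | X | | | | | | |
| DE | 13.44 | | | | | | | | | X | |
| HCO3− | 11.24 | | | | | | | | | | X |
| SO42− | 4.58 | | | | | | | | | | |
| Cl− | 19.52 | | | X | | | | | X | | |
| Ca | 22.06 | | | | | | | X | | | X |
| Mg | 18.15 | | | | | X | | X | | | |
| Na | 28.35 | | | | | X | | X | X | | |
| K | 32.60 | | X | | | X | | X | X | | |
| Si | 35.91 | | X | | | X | | X | X | | X |
| Al | 7.77 | | | | | | | | | | |
| As | 23.43 | | | X | | | | | X | | X |
| B | 8.66 | | | | | | | | | | |
| Ba | 19.71 | | X | | | | | | | X | |
| Co | 9.13 | | | | | | | | | | |
| Cr | 7.30 | | | | | | | | | | |
| Cs | 16.23 | X | | | | X | | X | | | |
| Fe | 5.38 | | | | | | | | | | |
| Li | 17.46 | X | | | | X | | | | | |
| Mn | 6.26 | | | | | | | | | | |
| Mo | 15.01 | | X | | X | | | | | | |
| Ni | 5.25 | | | | | | | | | | |
| Sb | 11.27 | X | | | | | | | | | |
| Se | 10.21 | | | X | X | | | | | | |
| Sr | 4.54 | | | | | | | | | | |
| Ti | 23.17 | | X | | X | X | | | | | |
| U | 25.35 | X | X | | | | X | | X | | |
| Zn | 11.22 | | X | | | | | | | | |

(a) Significant differences at H > 9.49. X marks a significant difference for the pair of groups in the column heading.

CC Cordillera Costero-Catalana, MG Macizo Galaico, SB Sistemas Béticos, SC Sistema Central, SI Sistema Ibérico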

3.2 Differentiation of geographical origin

PCA was first applied in order to visualize data trends in the space of the considered variables. PCA obtains linear combinations of the original variables to produce new, uncorrelated variables called principal components (PCs). It can be used to reduce the dimensionality of the n-dimensional space of original variables by computing PCs that retain as much of the original variance of the data as possible [19]. The first principal component (PC1) expresses the largest share of the variability of the data, and each successive PC accounts for as much of the residual variance as possible. Because the data matrix is auto-scaled, each observed variable contributes one unit of variance to the total variance of the data set. An eigenvalue is computed for each PC, indicating the amount of variance explained by that PC. In order to reduce dimensionality, only PCs with eigenvalues greater than 1 were retained, because these components account for more variance than a single observed variable [20]. In this case, the first 9 PCs present eigenvalues >1 and explain 81.21 % of the total variance (Table S5 of the electronic supplementary material).

Factor loadings of the variables can be calculated as the correlation coefficients between the original variables and the obtained PCs. Factor loadings with absolute values equal to or higher than 0.7 indicate a strong association between the variable and the principal component. Loadings in the range 0.4–0.7 show a moderate participation of the original variable in the calculated principal component [21]. According to the loadings of the variables in the obtained PCs (Table S6), PC1 was negatively correlated to EC and DE, with factor loadings higher than 0.7 in absolute value. In the case of PC2, the variables contributing most were Ca and Mg, with positive correlation, and K and Na, with negative correlation. PC3 was positively correlated to Ti and Al. The remaining components do not present a marked correlation to any variable (data not provided). This indicates that the variability of the data is distributed among a large number of original variables. A large number of PCs could be retained according to their eigenvalues, but this leads to models that include noise [22]. On the other hand, considering a smaller number of PCs gives a poor representation of data trends. For instance, the percentage of explained variance falls to 47.67 % when only the first three PCs are considered. In addition, according to the Kruskal–Wallis results, DE, Ca, Mg, K, Na and Ti present significant H values, but the information given by these variables is not enough to differentiate the five sample provenances, as can be observed in the distribution of samples in the space of the first three PCs (Fig. 1). Samples from Cordillera Costero-Catalana appear at positive values of PC1 and PC3. This is due to the slightly lower values of EC and DE (negatively correlated to PC1) found in samples from Cordillera Costero-Catalana in comparison with the other origins. This trend is reinforced by low values of HCO3−, SO42−, Cl−, Na, B, Cs and Li, which present a moderate negative correlation to this PC. On the other hand, most samples from Macizo Galaico appear at negative values of PC1 because of their highest values of EC and DE, reinforced by high contents of Na, B, Cs and Li. Samples from Sistema Ibérico and Sistemas Béticos are distributed at positive values of PC2, whilst most of the other samples appear at negative scores. Considering the correlations between variables and PC2, this trend is due to the slightly higher contents of Ca and Mg and the lower contents of K and Na found in samples from Sistema Ibérico and Sistemas Béticos. This trend is also reinforced by low values of Si, Li and Cs, which present a moderate negative correlation to PC2. Most samples from Macizo Galaico appear at negative PC2 scores. This can be related to their highest contents of B, which also presents a moderate negative correlation to PC2. The distribution of samples from Cordillera Costero-Catalana at positive PC3 scores is related to their high contents of Al and Ti, which were positively correlated to PC3. Samples from Macizo Galaico and Cordillera Costero-Catalana presented the highest median values of Sr, which has a moderate positive correlation to PC3. This can also influence the distribution of those samples at positive values of that PC. In the case of Cordillera Costero-Catalana, the lowest median content of Cr, moderately and negatively correlated to PC3, can have the same effect. Samples from different origins generally overlap. For this reason, none of the original variables can be considered a good chemical descriptor using PCA results alone, and other pattern recognition techniques must be used.
Fig. 1

Distribution of the samples in the space of the three first PCs. The variance explained by PC1, PC2 and PC3 was 19.01, 17.05 and 11.61 %, respectively
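As a consistency check on these figures, the explained-variance percentages can be converted back into eigenvalues. The short sketch below assumes that the total variance equals the number of auto-scaled variables (30, as stated in Section 2.4), so that each eigenvalue is the explained fraction multiplied by 30; the variable names are illustrative only.

```python
N_VARIABLES = 30  # columns of the auto-scaled data matrix (Section 2.4)

# Explained variance (%) of the first three PCs, taken from the Fig. 1 caption
explained_pct = {"PC1": 19.01, "PC2": 17.05, "PC3": 11.61}

# With auto-scaled data the total variance is N_VARIABLES, so
# eigenvalue = explained fraction * N_VARIABLES
eigenvalues = {pc: pct / 100 * N_VARIABLES for pc, pct in explained_pct.items()}

# Eigenvalue-greater-than-1 rule used in the text to decide which PCs to retain
retained = [pc for pc, ev in eigenvalues.items() if ev > 1.0]

cumulative_pct = round(sum(explained_pct.values()), 2)

print(eigenvalues)
print(retained)
print(cumulative_pct)
```

Under this assumption the three eigenvalues are all well above 1, and the cumulative percentage reproduces the 47.67 % quoted in the text for the first three PCs.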

Forward SLDA was applied to obtain a classification model that uses the most discriminant variables to compute discriminant functions (DFs), which allow the differentiation of the considered classes [23]. In the first instance, all samples were used as training cases in order to select the variables to be included in the model. The selected variables were Al, As, Co, Cr, Fe, Li, Mn, Mo, Ni, Se, Sb, Sr, Ti, Zn, Ca, Mg, Na, K, SO42− and Cl−, and the model presented a recognition ability of 100 % for samples of Cordillera Costero-Catalana, Sistema Central and Sistemas Béticos. The recognition ability for Macizo Galaico and Sistema Ibérico was 87.5 and 93.3 %, respectively. As shown in Fig. 2a, the groups appear separated in the plane of the first two DFs. In order to identify which of the variables selected by SLDA are responsible for this separation, a correlation study was carried out between the original variables and the calculated DFs. The correlations are depicted in Fig. 2b. Cl−, As and Na appear correlated with DF1, with correlation coefficients of 0.61, 0.60 and 0.48, respectively. Samples from Sistema Central and Macizo Galaico present the highest contents of these elements, and accordingly, they are distributed at positive values of DF1. In the case of DF2, K, Ti and Mo present negative correlation coefficients of −0.56, −0.56 and −0.46, respectively, whilst Mg and Cr present a positive correlation coefficient of 0.47 in both cases. Samples from Sistemas Béticos and Sistema Ibérico appear at positive values of DF2 because of their low contents of K and high contents of Mg. Samples from Macizo Galaico, distributed at negative DF2 scores, are influenced by high contents of Mo and Ti and low contents of Cr.
Fig. 2

a Distribution of the samples in the plane of the two first DFs obtained by model LDA1. b Correlation coefficients between original variables and discriminant functions

In order to obtain a more reliable model, it must be tested using two sets of samples: a training set (75 % of samples) to build the model and a test set (25 %) to compute the prediction capability. To this end, sensitivity (SENS) and specificity (SPEC) were calculated. SENS is the percentage of cases belonging to a given class that are correctly classified, and SPEC is the percentage of cases not belonging to a class that are correctly not assigned to it [24]. A stratified delete-a-group jackknife (SDAGJK) cross-validation procedure was followed to compute these parameters [25]. SDAGJK randomly discards a group of samples from each class before computing the model and uses them as test samples to obtain SENS and SPEC. In this case, nine replicates were obtained, and mean SENS and SPEC were computed for each class (Table 2). The obtained model, denoted as LDA1, presents poor results, with SENS ranging from 56 to 87 %. The overall SENS and SPEC were 72 and 92 %, respectively.
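The validation scheme can be illustrated with a small Python sketch. It is a simplified stand-in for the SDAGJK procedure of ref. [25], not a reproduction of it: each replicate simply draws about 25 % of every class at random as the test group, and the function names, the seed and the rounding rule are assumptions. The class sizes are those given in Section 2.1.

```python
import random

def sens_spec(true_labels, predicted_labels, target):
    """SENS and SPEC (in %) for one class, from true and predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    tn = sum(t != target and p != target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    # Assumes the target class and at least one other class are present
    return 100 * tp / (tp + fn), 100 * tn / (tn + fp)

def stratified_delete_a_group(labels, test_fraction=0.25, seed=0):
    """Randomly set aside roughly test_fraction of each class as the test group."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    test = []
    for members in by_class.values():
        k = max(1, round(len(members) * test_fraction))
        test.extend(rng.sample(members, k))
    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)

# Class sizes from Section 2.1: CC 11, MG 8, SB 10, SC 8, SI 15 (52 samples)
labels = ["CC"] * 11 + ["MG"] * 8 + ["SB"] * 10 + ["SC"] * 8 + ["SI"] * 15
train_idx, test_idx = stratified_delete_a_group(labels)
print(len(train_idx), len(test_idx))

# Toy predictions to show the SENS/SPEC calculation for class "CC"
print(sens_spec(["CC", "CC", "MG", "MG", "SB"], ["CC", "MG", "MG", "MG", "SB"], "CC"))
```

Repeating the split with different seeds and averaging the per-class SENS and SPEC over the replicates gives figures of the kind reported in Table 2.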
Table 2

Results of pattern recognition models (SENS and SPEC in %)

| Model | CC SENS | CC SPEC | MG SENS | MG SPEC | SB SENS | SB SPEC | SC SENS | SC SPEC | SI SENS | SI SPEC | Overall SENS | Overall SPEC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LDA1 | 67 ± 26 | 96 ± 7 | 56 ± 30 | 98 ± 4 | 87 ± 20 | 88 ± 9 | 72 ± 26 | 95 ± 7 | 72 ± 18 | 88 ± 6 | 72 ± 10 | 92 ± 3 |
| LDA2 | | | 83 ± 35 | 83 ± 12 | 83 ± 26 | 90 ± 12 | 39 ± 42 | 92 ± 13 | 61 ± 21 | 90 ± 10 | 67 ± 11 | 89 ± 5 |
| ANN1 | 100 ± 0 | 97 ± 5 | 72 ± 36 | 100 ± 0 | 94 ± 17 | 95 ± 7 | 89 ± 22 | 99 ± 3 | 89 ± 13 | 96 ± 5 | 90 ± 9 | 97 ± 2 |
| ANN2 | | | 72 ± 36 | 100 ± 0 | 100 ± 0 | 95 ± 7 | 94 ± 17 | 99 ± 4 | 100 ± 0 | 97 ± 6 | 93 ± 9 | 98 ± 3 |
| ANN3 | 100 ± 0 | 96 ± 5 | 72 ± 36 | 100 ± 0 | 96 ± 11 | 100 ± 0 | 100 ± 0 | 100 ± 0 | 97 ± 8 | 96 ± 5 | 94 ± 6 | 98 ± 2 |

CC Cordillera Costero-Catalana, MG Macizo Galaico, SB Sistemas Béticos, SC Sistema Central, SI Sistema Ibérico

In order to improve these results, a nonlinear approach, MLP-ANN, was applied. MLP-ANNs are feed-forward networks consisting of neurons arranged in an input layer, one or more hidden layers and an output layer. Like LDA, ANN uses a training and a test set, but a third set (validation set) is needed to avoid overtraining [26]. In this case, samples were divided into training (50 %), validation (25 %) and test (25 %) sets, maintaining this proportion in each class. The model was trained by back-propagation for 50 cycles, minimizing the prediction error made by the network. Learning rate and momentum were set to 0.1 and 0.3, respectively. Logistic sigmoid activation functions were used for the hidden nodes, and softmax (normalized exponential) activation functions were used for the output layer. A network with 20 inputs (one for each variable selected by LDA), 10 hidden neurons and 5 outputs was obtained. The model was cross-validated using SDAGJK, and SENS and SPEC were computed for each class. As shown in Table 2, model ANN1 achieved a SENS of 100 % and a SPEC of 97 % for Cordillera Costero-Catalana. The other classes presented SENS ranging from 72 %, in the case of Macizo Galaico, to 94 %, in the case of Sistemas Béticos. The overall values were 90 and 97 % for SENS and SPEC, respectively.
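The architecture described above (20 inputs, 10 logistic hidden neurons, 5 softmax outputs) can be sketched in plain Python. This is an illustration only: the weights are random rather than trained, the weight-initialisation range is arbitrary, and the momentum update shown is the common textbook form, which may differ from the exact rule implemented in Statistica. All names are hypothetical.

```python
import math
import random

def logistic(z):
    """Logistic sigmoid activation used for the hidden nodes."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Normalized exponential activation used for the output layer."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """One row per neuron: n_in weights followed by one bias term."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(x, hidden_layer, output_layer):
    """Forward pass of a single-hidden-layer MLP; returns class probabilities."""
    hidden = [
        logistic(sum(w * v for w, v in zip(row[:-1], x)) + row[-1])
        for row in hidden_layer
    ]
    scores = [
        sum(w * h for w, h in zip(row[:-1], hidden)) + row[-1]
        for row in output_layer
    ]
    return softmax(scores)

def momentum_step(weight, gradient, previous_delta, learning_rate=0.1, momentum=0.3):
    """Textbook back-propagation update with momentum for a single weight."""
    delta = -learning_rate * gradient + momentum * previous_delta
    return weight + delta, delta

rng = random.Random(0)
hidden_layer = init_layer(20, 10, rng)     # 20 inputs -> 10 hidden neurons
output_layer = init_layer(10, 5, rng)      # 10 hidden neurons -> 5 classes
x = [rng.gauss(0, 1) for _ in range(20)]   # one hypothetical auto-scaled sample
probabilities = forward(x, hidden_layer, output_layer)
print(probabilities)
```

The five outputs sum to 1 and can be read as class membership probabilities; a sample is assigned to the class with the largest output.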

Because the prediction ability of model ANN1 for the class Cordillera Costero-Catalana was 100 %, it is useful to build a new model considering only the four remaining classes in order to select the most adequate variables to differentiate them. These variables were obtained by forward SLDA considering the classes Macizo Galaico, Sistemas Béticos, Sistema Central and Sistema Ibérico. As shown in Fig. 3a, samples from the different origins appear separated in the plane of the first two DFs. The variables selected by this procedure were Al, As, B, Co, Cr, Cs, Fe, Li, Mn, Ni, Se, Sb, Sr, Ti, U, Zn, Ca, Mg, Na, K, HCO3−, SO42−, Cl−, EC and DE. A correlation study, depicted in Fig. 3b, reveals that DF1 is negatively correlated to the contents of Cl− (−0.59), Na (−0.48), K (−0.55), As (−0.52) and Ti (−0.50). These coefficients explain the group distributions in Fig. 3a. Sistema Central presented the highest median values of Cl−, Ti and As, and Macizo Galaico the highest contents of Na and K; accordingly, both groups appear at negative DF1 scores. On the other hand, Sistemas Béticos and Sistema Ibérico appear at positive values of that discriminant function. The most positive DF1 scores of Sistemas Béticos can be explained by the lowest contents of K and Na in this group. In addition, samples from Macizo Galaico appear at negative DF2 scores because of their highest contents of Na, K, B, Cs and Li, which are negatively correlated with DF2.
Fig. 3

a Distribution of the samples in the plane of the two first DFs obtained by model LDA2. b Correlation coefficients between original variables and discriminant functions

The resulting LDA model was cross-validated by means of SDAGJK using a data division of 75 % for the training set and 25 % for the test set. Table 2 (LDA2) shows an improvement in the SENS for Macizo Galaico, but the results for the other three classes are worse. In order to improve these results, an ANN model was computed. The same variables selected by LDA2 were used to obtain model ANN2, with architecture 25:13:4. This model was also built applying logistic sigmoid and softmax activation functions for the hidden and output layers, respectively. Learning rate and momentum were the same as for ANN1. In this case, after applying cross-validation, SENS rises to 100, 94 and 100 % for Sistemas Béticos, Sistema Central and Sistema Ibérico, respectively. In the case of Macizo Galaico, a SENS of 72 % was obtained. The overall performance of model ANN2 was 93 and 98 % for SENS and SPEC, respectively.

According to the previous results, a combination of models ANN1 and ANN2 could work as an iterative model to better solve the classification problem. Iterative models for LDA have proved useful in problems with a high number of classes, because the appropriate variables are selected in each comparison [27]. In this case, an iterative model (ANN3) was obtained by combining ANN1 and ANN2, including SLDA selection of the variables. A scheme of model ANN3 is depicted in Fig. 4. In a first step, samples are classified as belonging or not belonging to Cordillera Costero-Catalana by applying ANN1, which uses the variables selected by LDA1. Samples classified as belonging to Cordillera Costero-Catalana were not considered in subsequent calculations. In the second step, all the remaining samples were introduced into model ANN2 (built with the variables selected by LDA2) to be classified into one of the other four classes. This iterative model showed a prediction ability of 100 % for samples from Cordillera Costero-Catalana and Sistema Central. In the case of Sistemas Béticos, Sistema Ibérico and Macizo Galaico, the obtained SENS was 96, 97 and 72 %, respectively. The overall SENS of this model was 94 % and the SPEC was 98 %. The performance of this model is similar to that obtained by Souza et al. [11] in the case of Brazilian waters and by Sipos et al. [2] for the sensory evaluation of Hungarian waters.
Fig. 4

Working scheme of model ANN3. SLDA was used to select the most discriminant variables at each step. The data matrix was divided into training, verification and test sets before ANN computation. Stratified delete-a-group jackknifing cross-validation was applied to the whole model in nine replicates. CCC Cordillera Costero-Catalana, MG Macizo Galaico, SB Sistemas Béticos, SC Sistema Central, SI Sistema Ibérico
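As a rough illustration of stratified resampling and of how per-class SENS and SPEC can be computed, the Python sketch below draws nine stratified random splits and evaluates a small toy prediction. It is a sketch under stated assumptions: repeated stratified hold-out with a 75/25 division (the ratio quoted for the LDA models) is used as a stand-in and may differ from the exact delete-a-group jackknife procedure applied here, SENS and SPEC follow their usual definitions (true positive rate and true negative rate for a given class), and the labels and sample counts are invented for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), sampling the test set class by class."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train, test = [], []
    for lab in sorted(by_class):
        idx = by_class[lab][:]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def sens_spec(true, pred, cls):
    """Sensitivity and specificity of the predictions for one class."""
    pairs = list(zip(true, pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Invented data set: 8 samples per class (not the real sample counts)
labels = [c for c in ("CCC", "MG", "SB", "SC", "SI") for _ in range(8)]

rng = random.Random(42)
splits = [stratified_split(labels, 0.25, rng) for _ in range(9)]  # nine replicates

# Toy evaluation: one "A" sample is misclassified as "B"
toy_true = ["A", "A", "B", "B"]
toy_pred = ["A", "B", "B", "B"]
print(sens_spec(toy_true, toy_pred, "A"))  # (0.5, 1.0)
print(sens_spec(toy_true, toy_pred, "B"))  # (1.0, 0.5)
```

In a full validation run, the whole two-step model would be refitted on each training subset and `sens_spec` applied to the predictions for the corresponding test subset, with the results averaged over the replicates.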

4 Conclusions

Bottled natural mineral waters are food products highly appreciated by consumers, and consequently, the chemical characterization and geographical traceability of these products have gained increasing importance from an economic point of view. In this work, natural mineral waters from Spain have been chemically characterized in order to study the relationship between their composition and their production area. Samples from five Spanish mountain systems (Sistemas Béticos, Sistema Central, Sistema Ibérico, Cordillera Costero-Catalana and Macizo Galaico) were collected and analysed. Some differences were detected by applying simple nonparametric tests and principal component analysis. Samples from Cordillera Costero-Catalana generally presented lower values of electrical conductivity and dry extract than those of the other origins, while their Ti and Al contents were slightly higher. Samples from Sistemas Béticos and Sistema Ibérico presented higher contents of Ca and Mg and lower contents of K and Na than those of the other three origins.

Classical nonparametric multiple comparison methods, such as the Kruskal–Wallis test, and principal component analysis do not allow a good differentiation among the considered origins. For this reason, a pattern recognition approach is necessary to solve the classification problem. Stepwise linear discriminant analysis models allow the selection of the most discriminant variables, but these models do not solve the classification problem by themselves. In this case, nonlinear models based on artificial neural networks obtain better results.

In this study, water samples from Cordillera Costero-Catalana are usually the best differentiated from the others. This can lead to a biased model that performs worse when classifying samples into the other groups. Consequently, iterative models are appropriate for obtaining the most discriminant variables for differentiating among the remaining groups. The proposed model first differentiates samples from Cordillera Costero-Catalana from those of the other mountain systems and then classifies the remaining groups. This model presented an average classification ability of 94 %.

Supplementary material

521_2016_2459_MOESM1_ESM.docx (48 kb)
Supplementary material 1 (DOCX 48 kb)

References

  1. European Union (2009) Directive 2009/54/EC of the European Parliament and of the Council of 18 June 2009 on the exploitation and marketing of natural mineral waters. Official Journal of the European Union, L 164/45, Brussels. http://eur-lex.europa.eu/eli/dir/2009/54/oj. Accessed 9 July 2016
  2. Sipos L, Kovács Z, Sági-Kiss V, Csiki T, Kókai Z, Fekete A, Éberger K (2012) Discrimination of mineral waters by electronic tongue, sensory evaluation and chemical analysis. Food Chem 135:2947–2953. doi:10.1016/j.foodchem.2012.06.021
  3. Oyebog SA, Ako AA, Nkeng GE, Suh EC (2012) Hydrogeochemical characteristics of some Cameroon bottled waters, investigated by multivariate statistical analyses. J Geochem Explor 112:118–130. doi:10.1016/j.gexplo.2011.08.003
  4. Peh Z, Šorša A, Halamić J (2010) Composition and variation of major and trace elements in Croatian bottled waters. J Geochem Explor 107:227–237. doi:10.1016/j.gexplo.2010.02.002
  5. Fugedi U, Kuti L, Jordan G, Kerek B (2010) Investigation on the hydrogeochemistry of some bottled mineral waters in Hungary. J Geochem Explor 107:305–316. doi:10.1016/j.gexplo.2010.10.011
  6. Naddeo V, Zarra T, Belgiorno V (2008) A comparative approach to the variation of natural elements in Italian bottled waters according to the national and international standard limits. J Food Comp Anal 21:505–514. doi:10.1016/j.jfca.2008.02.010
  7. Birke M, Rauch U, Harazim B, Lorenz H, Glatte W (2010) Major and trace elements in German bottled water, their regional distribution and accordance with national and international standards. J Geochem Explor 107:245–271. doi:10.1016/j.gexplo.2010.06.002
  8. Güler C (2007) Characterization of Turkish bottled waters using pattern recognition methods. Chemom Intell Lab Syst 86:86–94. doi:10.1016/j.chemolab.2006.08.009
  9. Gutiérrez-Reguera F, Seijo-Delgado I, Montoya-Mayor R, Ternero-Rodríguez M (2012) Caracterización fisicoquímica (parámetros generales y componentes mayoritarios) de las aguas minerales naturales envasadas de España. Afinidad 519:165–174
  10. Smedley PL (2010) A survey of the inorganic chemistry of bottled mineral waters from the British Isles. Appl Geochem 25:1872–1888. doi:10.1016/j.apgeochem.2010.10.003
  11. Souza AL, Lemos SG, Naozuka J, Miranda-Correia PR, Oliveira PV (2011) Exploring the emission intensities of ICPOES aided by chemometrics in the geographical discrimination of mineral waters. J Anal At Spectrom 26:852–860. doi:10.1039/C0JA00071J
  12. Grošelj N, van der Veer G, Tušar M, Vračko M, Novič M (2010) Verification of the geological origin of bottled mineral waters using artificial neural networks. Food Chem 118:941–947. doi:10.1016/j.foodchem.2008.11.085
  13. Thermo Electron Corporation (2004) X series ICP-MS getting started guide. Ref. no. S419MA. Thermo Electron Corporation, Winsford
  14. AOAC (2012) Appendix F: guidelines for standard method performance requirements. In: Official methods of analysis of AOAC international, 19th edn. AOAC International, Gaithersburg
  15. Cuadros L, García AM, Bosque JM (1996) Statistical estimation of linear calibration range. Anal Lett 29:1231–1239. doi:10.1080/00032719608001471
  16. ISO (1994) ISO 9963-1:1994 Water quality. Determination of alkalinity. Part 1: determination of total and composite alkalinity. International Organization for Standardization, Geneva
  17. ISO (1985) ISO 7888:1985 Water quality. Determination of electrical conductivity. International Organization for Standardization, Geneva
  18. Muth JE (1999) Basic statistic and pharmaceutical statistical applications, 1st edn. Chapman and Hall/CRC, New York
  19. Jolliffe IT (2002) Principal components analysis, 2nd edn. Springer, New York
  20. Palacios-Morillo A, Alcázar A, Pablos F, Jurado JM (2013) Differentiation of tea varieties using UV–Vis spectra and pattern recognition techniques. Spectrochim Acta A 103:79–83. doi:10.1016/j.saa.2012.10.052
  21. Tsakovski S, Simeonov V (2009) Chemometrics as a tool for treatment processing of multiparametric analytical data sets. In: Namiesnik J, Szefer P (eds) Analytical measurements in aquatic environments. CRC Press, Boca Raton, pp 369–388
  22. Valle S, Li W, Qin SJ (1999) Selection of the number of principal components: the variance of reconstruction error criterion with comparison to other methods. Ind Eng Chem Res 38:4389–4401. doi:10.1021/ie990110i
  23. Massart DL (1998) Handbook of chemometrics and qualimetrics, part B. Elsevier, Amsterdam
  24. Forina M, Armanino C, Leardi R, Drava G (1991) A class modelling technique based on potential functions. J Chemom 5:435–453. doi:10.1002/cem.1180050504
  25. Kott PS (2001) The delete-a-group jackknife. J Off Stat 17:521–526
  26. Tetko IV, Livingstone DJ, Luik AI (1995) Neural network studies. 1. Comparison of overfitting and overtraining. J Chem Inform Comput Sci 35:826–833. doi:10.1021/ci00027a006
  27. Martin AE, Watling RJ, Lee GS (2012) The multi-element determination and regional discrimination of Australian wines. Food Chem 133:1081–1089. doi:10.1016/j.foodchem.2012.02.013

Copyright information

© The Natural Computing Applications Forum 2016

Authors and Affiliations

  • Francisco Gutiérrez-Reguera (1)
  • J. Marcos Jurado (1)
  • Rocío Montoya-Mayor (1)
  • Miguel Ternero-Rodríguez (1)
  1. Department of Analytical Chemistry, Faculty of Chemistry, University of Sevilla, Seville, Spain
