All the data collected were organised in a matrix, X, of dimension 71 × 34. All 71 biofilm samples (Table 1) were characterised by the contents of 17 chemical elements analysed in the biofilms and in the water phases extracted at the sampling locations. Three groups of samples were distinguished depending on the type of water body, namely flowing water, standing water and seawater.
Firstly, PCA was used for an overall exploration of the data structure. PCA is an unsupervised approach, and is frequently employed for data compression and visualisation [9]. With PCA, the original data matrix, X (m × n), is decomposed into two matrices: a scores matrix, T (m × n), the columns of which contain principal components (PCs) and a loadings matrix, P (n × n). PCs are found as linear combinations of explanatory variables by maximising the variance of projected data. The loadings matrix, P, describes the contributions of each variable to the constructed PCs.
Prior to PCA, the explanatory variables were autoscaled, because they had been measured in different units. Autoscaling is performed by subtracting the column mean from each data element and dividing the result by the corresponding column standard deviation. This gives all variables the same importance in PCA. The results of PCA for autoscaled data are presented in Fig. 1.
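The autoscaling and decomposition steps described above can be illustrated with a minimal Python sketch. It assumes NumPy and computes PCA through a singular value decomposition, which is one common route and not necessarily the one used in this study. The function names, the use of the sample standard deviation (ddof=1) and the random stand-in data are assumptions made for illustration only.

```python
import numpy as np

def autoscale(X):
    """Centre each column and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation
    return (X - mean) / std

def pca(X):
    """PCA of an already preprocessed matrix via singular value decomposition."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U * s                            # scores, one column per PC
    P = Vt.T                             # loadings, one column per PC
    explained = s**2 / np.sum(s**2)      # fraction of variance per PC
    return T, P, explained

# Synthetic stand-in for the 71 x 34 data matrix (NOT the biofilm data):
# columns are given very different scales to mimic different units.
rng = np.random.default_rng(0)
X = rng.normal(size=(71, 34)) * rng.uniform(0.1, 100.0, size=34)

Xa = autoscale(X)
T, P, explained = pca(Xa)
print("variance explained by the first three PCs:", explained[:3].sum())
```

With all PCs retained, the product of the scores and the transposed loadings reproduces the autoscaled data, and the cumulative sum of `explained` corresponds to the kind of curve shown in Fig. 1a.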
The first three PCs explain about 50% of the total data variance (Fig. 1a). Figure 1a indicates that the compression is not very effective, because the data variance is distributed over all PCs. However, some general trends in the data structure can be revealed.
All the seawater samples are differentiated from the standing water and flowing water samples in the PC 1–PC 2 score plot (Fig. 1b). The sea samples can be divided into two subgroups. The first subgroup contains the samples from Steinbeck (Germany), Travemünde (Germany) and Damp (Germany), located in the Baltic Sea, while the second subgroup includes the samples from Punta Skala (Croatia), Nin (Croatia) and Majorca (Spain), situated in the Mediterranean Sea. The standing water and flowing water biofilm samples overlap. The flowing water biofilm sample from Munich can also be distinguished from all the other samples along PC 2. This distinction is even more evident along PC 3 (Fig. 1c). Looking at the loading plots, shown in Fig. 1d and e, one finds the reasons for the objects’ distributions observed in the score plots. PC 1 represents the Sr, Cu, Mg, Se, K and Na contents in the water phase (W-Sr, W-Cu, W-Mg, W-Se, W-K, W-Na). This factor can be related to the salt content of the water phase and is conditionally called the ‘salt’ factor. The second PC, PC 2, is mainly associated with Fe and Mg (Fig. 1d). These elements are basic components participating in the biofilm formation. PC 3 reflects the Mn, Zn, Cd, Pb, Fe, and Co contents in the water phase. The presence of Cd and Pb is usually a result of environmental contamination and that is why PC 3 is associated with the anthropogenic influence. From the information obtained from the score and loading plots, it follows that the water phase of the biofilm grown in seawater is indeed richer in dissolved salts than the water phase of the standing water and flowing water bodies. The levels of Fe and Mg in sea biofilms are also higher in comparison with those for the other biofilms. The salt content, pH and temperature of water vary at different sampling locations and they influence the biofilm formation. It was reported in [10] that Mg strongly influences attachment and biofilm structure. 
The surface colonisation and biofilm depth increase with the increasing Mg concentration. The biofilms collected in Punta Skala, Nin and Majorca contain higher levels of dissolved salts in the water phase and higher Fe and Mg contents in comparison with the biofilms collected in Travemünde, Damp and Steinbeck. The sample originating from Munich shows a high anthropogenic influence, i.e. it has higher Mn, Zn, Cd, Pb, Fe and Co contents in the water phase and lower Fe and Mg contents in the biofilm in comparison with the other samples.
In order to see whether the biofilms developed in standing water could be distinguished from the biofilms grown in flowing water, supervised approaches such as CART, DPLS and UVE-DPLS were applied. Furthermore, it was important to determine whether the models constructed could predict the origin of new biofilm samples, and how well. Another question to be answered was which variables are responsible for a possible discrimination of the groups. Only seven biofilms were grown in seawater; therefore, they were excluded from the subsequent analysis.
To construct a reliable discriminant model and to test its predictive ability, the data were divided into two subsets (model and test) with the Kennard and Stone [11, 12] and duplex [13] algorithms, which enable a uniform subset selection. In the Kennard and Stone method, objects in the model set are selected sequentially, starting with the object closest to the data mean. The next object included in the subset is the one situated furthest away from the first one. Each subsequent object is the unassigned object most distant from the objects already in the model set, i.e. the one whose distance to its nearest selected object is the largest. The selection continues until a predefined number of objects has been assigned to the model set. The remaining objects form the test set. As a similarity measure, the Euclidean distance was used. With the duplex algorithm, the two most distant objects in the data are found and included in the model set. The next two most distant objects are assigned to the test set. The remaining objects are then added alternately: the unassigned object most distant from the model set is added to the model set, then the unassigned object most distant from the test set is added to the test set, and so on. The Kennard and Stone algorithm ensures that the objects in the model set cover all possible sources of data variance, while the duplex method guarantees the representativeness of both subsets. Selection of model and test sets should be done for each group separately. When a preprocessing procedure is required, the selection of objects is applied to the preprocessed data. In our study, the model and test sets were selected using autoscaled data in order to remove the scale differences among variables while evaluating the Euclidean distances among objects. It should be mentioned that the performance of CART is not influenced by autoscaling.
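The Kennard and Stone selection described above can be sketched as follows. This is a minimal illustration assuming NumPy, not the implementation used in the study. The function name `kennard_stone` and the one-dimensional demo data are hypothetical.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select objects chosen by a Kennard and Stone style selection."""
    # Start with the object closest to the data mean, as described in the text.
    d_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    selected = [int(np.argmin(d_mean))]
    # min_dist[i]: Euclidean distance from object i to its nearest selected object
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < n_select:
        min_dist[selected] = -np.inf       # never pick an object twice
        nxt = int(np.argmax(min_dist))     # furthest from the current model set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Hypothetical one-dimensional demo data (five objects, one variable).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
model_idx = kennard_stone(X_demo, 3)
test_idx = [i for i in range(len(X_demo)) if i not in model_idx]
print(model_idx, test_idx)
```

In line with the text, such a function would be applied to the autoscaled data of each group separately, and the per-group index lists would then be merged into the model and test sets.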
In our study, the model set of dimension 42 × 34 contains 21 flowing water and 21 standing water biofilm samples, whereas the test set of dimension 22 × 34 includes 15 biofilms of flowing water and seven biofilms of standing water.
Results of CART, DPLS and UVE-DPLS for model and test sets designed with the Kennard and Stone algorithm
To trace the importance of variables responsible for the discrimination of both groups, a classification tree was built. After tenfold cross-validation, an optimal tree, containing two terminal nodes, was selected. The cross-validation error is 14%, indicating a relatively good predictive ability of the constructed tree shown in Fig. 2.
Since there is only one split in the tree, the discrimination problem is rather simple, and the most discriminative variable is the Mg content in the water phase (W-Mg). As mentioned before, Mg plays an important role during biofilm formation [10]. One terminal node contains 17 model set samples, all belonging to the group of standing water, with an Mg content in the water phase below 37 mg g−1. The remaining samples (21 flowing water biofilms and four standing water biofilms) are placed in the left terminal node, which results in a misclassification error of 9.5% (four out of 42 samples) for the complete tree. Although four model standing water biofilms are recognised as flowing water biofilms, the constructed classification tree provides a correct classification of 100% for the test samples (Table 2). Therefore, the model yields high sensitivity (percentage of correct classification of the test flowing water biofilms) and selectivity (percentage of correct classification of the test standing water biofilms).
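A tree with a single split amounts to finding one threshold on one variable. The sketch below, assuming NumPy, searches for the threshold that minimises the weighted Gini impurity of the two child nodes. The Gini criterion is an assumption, since the splitting criterion is not stated here, and the function names and demo values are hypothetical rather than taken from the biofilm data.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def best_split(x, y):
    """Threshold on one variable minimising the weighted Gini impurity of both children."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_imp = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                      # no threshold between equal values
        thr = 0.5 * (x[i] + x[i - 1])     # midpoint between neighbouring values
        left, right = y[:i], y[i:]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp

# Hypothetical values of one variable for eight samples.
x_demo = np.array([5.0, 12.0, 20.0, 30.0, 41.0, 55.0, 60.0, 72.0])
y_demo = np.array(["standing", "standing", "standing", "standing",
                   "flowing", "flowing", "standing", "flowing"])
thr, imp = best_split(x_demo, y_demo)
print(thr, imp)
```

In the demo, the low-value node is pure and the high-value node keeps one standing water sample among the flowing water samples, which mirrors the situation described for W-Mg on a smaller scale. Running `best_split` on every variable and ranking the impurities is one way to identify competitive variables.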
Table 2 Correct classification rate (CCR), sensitivity and selectivity of the models
Additionally, good discrimination results can be obtained when the primary split is made on the variable describing the Ca content in the water phase (W-Ca). This variable is a competitive variable selected after removing W-Mg. The split on W-Ca leads to a total misclassification error of 14.3%. The presence of Ca has been shown to have an influence on mechanical properties of biofilms [14].
In the next step of the investigation, DPLS was considered, in order to check if a discrimination model using linear combinations of explanatory variables can perform better than CART. The DPLS model has complexity 1. The RMSCV is 0.95 and RMS error is 0.64. The DPLS model constructed allows for 81.8% correct classification of the test set samples. The analysis of the misclassified test samples indicates that four out of 15 (26.7%) flowing water biofilms collected in the Leutra river are incorrectly predicted as standing water biofilms; therefore, the model has a lower sensitivity (73.3%) than the CART model. All the test samples belonging to the group of standing water biofilms are well predicted, which again indicates the high selectivity (100%) of the model constructed (Table 2). An improved DPLS model was obtained by use of the UVE-DPLS approach, after discarding the uninformative variables. The one-factor UVE-DPLS model constructed with three informative variables (W-Mg, W-Ca, W-Sr), offers a total correct classification of 90.9% for the test set. It yields a selectivity of 100% and a better sensitivity (86.7%) in comparison with the DPLS model, because only two out of 15 (13.3%) biofilms grown in the flowing water of the Leutra river are now assigned to the group of standing water biofilms (Table 2).
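The percentages quoted above follow directly from the test set counts. As a small worked check in Python (the function name is hypothetical; flowing water is treated as the class defining sensitivity and standing water as the class defining selectivity, as in the definitions given earlier):

```python
def classification_rates(n_flowing, n_standing, wrong_flowing, wrong_standing):
    """Return (CCR, sensitivity, selectivity) in percent for a two-class test set."""
    sensitivity = 100.0 * (n_flowing - wrong_flowing) / n_flowing
    selectivity = 100.0 * (n_standing - wrong_standing) / n_standing
    n_total = n_flowing + n_standing
    ccr = 100.0 * (n_total - wrong_flowing - wrong_standing) / n_total
    return ccr, sensitivity, selectivity

# Test set: 15 flowing water and 7 standing water biofilms.
dpls = classification_rates(15, 7, 4, 0)   # four flowing water samples misclassified
uve = classification_rates(15, 7, 2, 0)    # two flowing water samples misclassified
print([round(v, 1) for v in dpls])
print([round(v, 1) for v in uve])
```

Rounded to one decimal place, the two calls give 81.8, 73.3 and 100.0 for DPLS, and 90.9, 86.7 and 100.0 for UVE-DPLS.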
The best discrimination results are obtained from CART, even though this model shows a misclassification error of 9.5% for the complete tree. Since the splits are done in a univariate way, the correlation between variables is not taken into account. Therefore, CART provides unsatisfactory results when a linear combination of variables is responsible for discriminating the samples. This, however, cannot be verified unless multivariate approaches such as DPLS and UVE-DPLS are used. Although CART and UVE-DPLS have different objective functions, common variables are selected as essential for the discrimination. The primary variable, W-Mg, and two competitive variables, W-Ca and W-Sr, in CART are also selected by UVE-DPLS.
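To make the contrast with the univariate splits of CART concrete, a one-factor discriminant PLS model can be sketched as below. This is a minimal illustration assuming NumPy, a +1/−1 coding of the two classes and a decision threshold of zero. None of these choices, nor the function names and the synthetic data, are taken from the study.

```python
import numpy as np

def fit_dpls_one_factor(X, y):
    """One-factor PLS regression of a +1/-1 coded class vector y on X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = Xc.T @ yc                  # weights: covariance of each variable with y
    w /= np.linalg.norm(w)
    t = Xc @ w                     # scores: a linear combination of all variables
    q = (t @ yc) / (t @ t)         # regression of y on the scores
    b = w * q                      # regression coefficients per variable
    return x_mean, y_mean, b

def predict_dpls(model, X):
    """Predicted class labels (+1 or -1) and the underlying predicted y values."""
    x_mean, y_mean, b = model
    y_hat = (X - x_mean) @ b + y_mean
    return np.where(y_hat >= 0.0, 1, -1), y_hat

# Synthetic two-class data: only the first of five variables separates the groups.
rng = np.random.default_rng(1)
y = np.concatenate([np.ones(20), -np.ones(20)])
X = rng.normal(scale=0.3, size=(40, 5))
X[:, 0] += 2.0 * y

model = fit_dpls_one_factor(X, y)
labels, y_hat = predict_dpls(model, X)
print("largest |coefficient| at variable", int(np.argmax(np.abs(model[2]))))
```

Unlike a single tree split, the score vector combines all variables at once, so the size of each coefficient in `b` indicates how much a variable contributes to the discrimination.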
Results of CART, DPLS and UVE-DPLS for model and test sets designed with the duplex algorithm
Results of CART, DPLS and UVE-DPLS were obtained using data designed with the duplex algorithm, which ensures the representativeness of the model and test sets.
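The duplex selection described earlier can be sketched in the same spirit. The code assumes NumPy and at least four objects. It alternates between the two sets until every object is assigned, which is one reading of the procedure. The function name and demo data are hypothetical.

```python
import numpy as np

def duplex(X):
    """Split object indices into (model, test) lists with a duplex style selection."""
    n = len(X)                                   # assumption: n >= 4
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    remaining = list(range(n))
    sets = ([], [])                              # (model set, test set)

    # Seed each set with the two most distant objects that are still unassigned.
    for s in sets:
        sub = D[np.ix_(remaining, remaining)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        pair = [remaining[i], remaining[j]]
        s.extend(pair)
        remaining = [k for k in remaining if k not in pair]

    # Alternate: add to each set the unassigned object furthest from that set.
    turn = 0
    while remaining:
        s = sets[turn]
        d_min = D[np.ix_(remaining, s)].min(axis=1)   # distance to nearest member
        pick = remaining[int(np.argmax(d_min))]
        s.append(pick)
        remaining.remove(pick)
        turn = 1 - turn
    return sets

# Hypothetical one-dimensional demo data (six objects, one variable).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])
model_idx, test_idx = duplex(X_demo)
print(model_idx, test_idx)
```

Because both sets are built with the same criterion, extreme objects can end up in the test set, which is relevant when the two designs are compared.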
The classification tree built has two terminal nodes and the primary split is again made on the variable representing the Mg content (W-Mg) in the water phase. The cross-validation error is 7.1%. Two out of 42 model set samples are wrongly classified, which leads to a misclassification error of 4.8% for the complete tree. Compared with the previous results, the constructed tree shows a better performance for the model set samples, but worse prediction rates for the test set (Table 2): the model again has a high sensitivity (100%), but quite a low selectivity (57.1%).
The DPLS model constructed for the data designed by the duplex algorithm shows slightly better prediction ability (86.4%) than the model built for the data designed by the Kennard and Stone algorithm (81.8%). It presents a better sensitivity (100%), but a reduced selectivity, with only 57.1% of standing water samples being well recognised. A discriminant model characterised by relatively high sensitivity and selectivity is to be preferred over a model with a high sensitivity and a low selectivity. Therefore, the UVE-DPLS model for data designed by the Kennard and Stone algorithm is to be favoured (Table 2). With the duplex design, all the methods correctly predict 86.4% of the test samples. The samples collected at Chemnitz and White Dak Pond, Metebach, are improperly classified by all methods. This is not surprising when the data contain some samples that differ from the majority. With the Kennard and Stone method, such samples are always assigned to the model set, so the test samples are correctly predicted. With the duplex method, some atypical samples are assigned to the test set, which leads to overly pessimistic estimates of the predictive ability of the models.
Results of CART, DPLS and UVE-DPLS for biofilm samples grown on natural substrates
Another important issue to be discussed is whether the biofilm samples grown on natural substrates (see the group of uniquely sampled biofilms in Table 1) lead to conclusions similar to those drawn using the whole data. If so, the sampling procedure could be carried out in a simpler way, which would be less time-consuming and less costly.
For an initial inspection of the data structure, PCA was considered. PCA was applied to autoscaled data (25 × 34) containing only uniquely sampled biofilms and the results are presented in Fig. 3.
The first three PCs account for 59.1% of the total data variance (Fig. 3a). As with PCA of the whole data, the compression is not very effective. The biofilms grown in seawater can again be distinguished along PC 1 (Fig. 3b). Moreover, two subgroups of sea biofilms are distinguished along PC 3 (Fig. 3c). The content of the subgroups is the same as before. The biofilm sample collected in Munich is again found far away from all the other samples. Another extreme biofilm sample, collected in the Aller river (Celle, Germany), appears along PC 3. Regarding the variable loadings (Fig. 3d, e), PC 1 is again associated with the salt content of the biofilm water phases, while PC 2 is now probably linked to the contamination of the biofilm water phases, because the variables W-Zn, W-Cd, W-Ni, W-Fe, W-Co, W-Mn and W-Pb possess high loading values. PC 3 consists of Zn, Cd and Pb, which are usually associated with an anthropogenic influence, and this factor is therefore associated with contaminants accumulated by the biofilm. Summarising the results of PCA, one can additionally point out that the biofilm samples collected in Majorca, Punta Skala and Nin are richer in Zn, Cd and Pb in comparison with the remaining sea biofilm samples. Moreover, the highest Zn, Cd and Pb contents are characteristic of the biofilm sample from Celle.
In order to construct the CART, DPLS and UVE-DPLS models, only data of natural biofilms grown in flowing water and in standing water were considered. Since the number of samples in each group is small (Table 1), the models were used for an exploratory purpose only. Because of this, the predictive abilities of the models were not tested using an independent test set.
The complete classification tree with three terminal nodes is shown in Fig. 4. The primary split is made on the variable describing the Pb content in the water phase (W-Pb). W-Pb is the most discriminant variable. The next split on variable Al corrects the improper assignment of one sample and it is of a lower importance. Owing to the small number of samples, the required tenfold cross-validation procedure could not be applied and, therefore, the cross-validation error was not reported. All the biofilms grown in standing water are well classified, but two biofilm samples grown in flowing water are wrongly classified, which results in a total classification rate of 88.9%. The incorrectly classified samples originate from Steinach (Germany) and Geithain (Germany).
The DPLS model constructed has complexity 1. The RMSCV is 1.78 and the RMS error is 0.59. Two samples are incorrectly classified. One of them belongs to the biofilms of standing water and originates from Chemnitz (Germany), while the other one is the biofilm collected in the flowing water body (São Lourenço) located in Juquitiba (Brazil). The DPLS model built yields a total classification rate of 88.9%. It should be emphasised that DPLS can lead to an overly optimistic result when the variables outnumber the samples [7]. A remedy for this problem is to reduce the number of variables by the use of a feature selection technique, e.g. UVE-DPLS. The UVE-DPLS model has an RMSCV of 0.95. One variable, namely W-Pb, is selected. With this model, all biofilms grown in flowing water are correctly classified, but all biofilms grown in standing water are improperly classified.