1 Introduction

Minera Panamá S.A. (MPSA), wholly owned by Minera Panama S.A-First Quantum Minerals Ltd (MPSA-FQML), is investigating the feasibility of developing the MPSA Project Mina de Cobre Panamá (the Project). The proposed Project would mine and process copper sulfide ore in the Petaquilla Concession, Panamá. This concession covers an area of 130 square kilometers (km2) and is located in the District of Donoso, Colón Province, in north-central Panamá. The concession contains at least three spatially distinct copper ore bodies (Colina, Botija and Valle Grande) and three conventional open pit mines are currently planned to exploit these ore bodies [1].

The copper sulfide ore will be mined using conventional open pit methods and processed by crushing, milling, flotation recovery and concentrate dewatering. The proposed design ore feed to the processing plant is 150,000 tons per day (t/d), which is expected to be expanded to 225,000 t/d at year ten by the addition of a third processing line. The Project will export materials through a port to be constructed on the Caribbean coast at Punta Rincón and linked to the main Project site by a road, a power line corridor, and buried pipelines for the transfer of products and other materials. As the nation develops, increasing industrialization and urbanization have led to wide-scale contamination of many surface water resources by industrial effluents, domestic sewage discharges, excessive use of fertilizers and pesticides, and emerging mining activities. It may therefore be inferred that increased anthropogenic pressures, together with natural processes, account for the degradation of surface water and groundwater quality [2]. Given these pressures on the water resources of the area, the main objectives of conservation must be to control and minimize pollution and to provide water of adequate quality for different purposes, such as drinking and irrigation [3, 4]. Monitoring the water quality of any water body must therefore be one of the highest priorities of its protection policy [5].

Given that the physical and chemical characteristics of water bodies affect the species composition, abundance, productivity and physiological condition of aquatic organisms [6], water quality monitoring must be among the highest priorities of any environmental protection policy [7].

Multivariate statistical methods such as factor analysis and principal components analysis have been used successfully in hydrochemistry for many years. With the emergence of data mining techniques, the water quality record of a river can now reveal features not otherwise seen by conventional methods. The evaluation of surface water and groundwater quality, like environmental assessment in general, relies on multi-component techniques that are well defined in the literature [8, 9]. Multivariate techniques allow us to uncover information hidden in the data set about possible environmental influences on water quality [10]. Data mining is now popular among water quality researchers. For example, the authors of [11] proposed a decision tree to forecast chlorophyll levels for the next day, and the authors of [12] proposed a fuzzy c-means clustering method to classify and assess rural surface water quality, built on monitoring data from 33 water quality stations in 23 rural rivers and 4 reservoirs in Lianyungang city. Multivariate methods have several shortcomings, such as the heavy mathematical calculations they involve, the equal treatment of old and new data, and problems with prediction and classification tasks due to multivariate overlapping of the parameters. Data mining and machine learning techniques, by contrast, have achieved great success in many disciplines [13]. Although data mining algorithms are generally held to work best on large data sets, several studies encourage their application to small databases [14,15,16].

In this study, we evaluated whether a smaller group of water quality parameters could provide sufficient information for assessing water quality. To this end, factor analysis and data mining methods were applied to water quality data obtained from the surface waters of three (3) water quality monitoring stations in the Petaquilla River Basin during two hydrological seasons (high and low flows). The first part of the evaluation was done using the MATLAB 2011a Package, version 7.5 (March 18, 2011), (http://www.matworks.com/) for MAC OSX, and the second part was carried out with data mining approaches using the R statistical computing language [16]. The water quality record comprised 12 observations of 41 water quality parameters selected to estimate the physical-chemical characteristics: Total Alkalinity (T_Alk), Electrical Conductivity (EC), Oxidation-Reduction Potential (ORP), pH, Total Suspended Solids (TSS), Turbidity (NTU), Chlorides (Cl), Fluorides (F), Nitrogen-Nitrite (N_NO2), Nitrogen-Nitrate (N_NO3), Sulfate (SO4), Total Cyanide (CN_T), Weak Acid Dissociable Cyanide (CN_W), Ammonia (NH3), Aluminum (Al), Boron (B), Barium (Ba), Beryllium (Be), Bismuth (Bi), Calcium (Ca), Cobalt (Co), Chromium (Cr), Copper (Cu), Iron (Fe), Potassium (K), Lithium (Li), Magnesium (Mg), Manganese (Mn), Sodium (Na), Nickel (Ni), Phosphorus (P), Selenium (Se), Silicon (Si), Strontium (Sr), Titanium (Ti), Vanadium (V), Zinc (Zn), Phenol (PHEN), Total Kjeldahl Nitrogen (TKN), Total Nitrogen (TN) and Dissolved Organic Carbon (DOC). Water samples were collected on site and the physical-chemical analysis was performed in a local accredited laboratory according to the methodology contained in the Standard Methods for the Examination of Water and Wastewater, 21st Ed., 2005.

The paper is organized as follows. Section 2 describes the background of the overall study location. Section 3 presents the methodology used to tackle the problem. The experimental setup and the results of the analysis are presented in Sect. 4. Finally, conclusions and future work are discussed in Sect. 5.

2 Background

2.1 Data Source and Study Area

The Petaquilla basin is the westernmost drainage basin at the mine site. Surface runoff in this basin reports to the Petaquilla River, which flows northwest and discharges directly to the Caribbean Sea. Two open pits (Colina and Valle Grande), as well as the southwest waste rock storage facility and associated sedimentation ponds, will be developed in the southeast region of the Petaquilla basin. The community of Nueva Lucha and the Faldalito sector are also located in this basin. The three surface water baseline sampling stations established in the Petaquilla River Basin, specifically on the Petaquilla River, are described in Table 1 and shown in Fig. 1.

Table 1. Petaquilla River Basin water quality stations and community location.
Fig. 1. Map showing the Petaquilla River Basin location. (Source: MPSA, ESIA, 2010)

3 Methods

The Isthmus of Panama has essentially two seasons: the dry, low flow season (January to April) and the wet, high flow season (May to December). The climate in the region of the Petaquilla River Basin is governed by these two seasons, so the hydrological conditions during the low and high flow seasons differ. Accordingly, water quality was assessed separately for the dry season (low flow conditions) and the wet season (high flow conditions). This assists in understanding the major sources of water pollutants and in establishing priorities for monitoring this water course during these periods.

Before running the experiments, it was necessary to impute the missing data in the small available data record by means of PCA. No data is registered for the year 2010. The experiment was then divided into two parts. First, we applied principal components analysis (PCA) and factor analysis (FA) to the 41 water quality parameters from the 12 surface water samples collected for both seasons, dry (January-April 2008) and rainy (May to December 2011), to reduce the number of interrelated measures to a few representative constructs or factors [17]. From the PCA, FA, correlation and variance matrices, we chose 9 variables for the purposes of the experiment. Of particular note, the FA left out T_Alk, whereas the PCA left out N_NO2. Second, once this phase was completed, and given the scarcity of the data and the aim of making the most of the information it contains, the data was also treated with cluster analysis (CA) and classification analysis (CLA) in order to discover features not otherwise revealed by the PCA and FA.

The descriptive statistics of the data are presented in Tables 2 and 3 for the low and high flow seasons, respectively.

Table 2. Descriptive statistics for water quality variables during low flow season.
Table 3. Descriptive statistics for water quality variables during high flow season.

4 Experimental Setup and Results

4.1 Assessment of the Water Quality Profile with Principal Components and Factor Analysis

Factor Analysis: Low flow season

The correlation matrix of the selected variables was generated, and the factors were extracted and rotated using the Promax criterion, a common oblique method which has the advantage of being fast and conceptually simple [18]. The eigenvalues, the percentage of total variance, the factor loadings and the cumulative variance are given in Tables 4 and 5 for the low and high flow seasons, respectively.
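
As an illustration of the first step, the sketch below computes a Pearson correlation matrix for a handful of variables. It is a minimal, standard-library example and not the MATLAB procedure used in the study; the function names and the four sample values per variable are hypothetical and were chosen only to make the output easy to check.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of variables a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values (NOT the study data): four samples of three variables.
columns = {
    "TSS": [1.0, 2.0, 3.0, 4.0],
    "ORP": [8.0, 6.0, 4.0, 2.0],   # falls as TSS rises
    "Cl":  [1.0, 3.0, 2.0, 5.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(row[other], 3) for other in columns])
```

In this toy data ORP decreases linearly as TSS increases, so their correlation is -1, the same kind of inverse relationship that a strong negative loading expresses in the factor solution.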

Table 4. Matrix of factor loadings and total variance explained for low flow season.
Table 5. Matrix of factor loadings and total variance explained for high flow season.

The FA for the dry season generated three (3) important factors (with eigenvalues of 2.53, 2.05 and 1.24, respectively). This solution, which explains 76% of the total variance in the surface water data set, was kept for rotation. From the analysis, the following factors were indicators of the physical-chemical profile of the water quality:

  • Factor 1: TSS, ORP, Cl

  • Factor 2: F, EC, NTU

  • Factor 3: pH

The first factor accounted for 32% of the explained variance. TSS and Cl recorded high positive loadings (1.0114 and 0.7206, respectively). ORP showed a strong negative loading (−0.9130), meaning that it was inversely correlated with this factor.

Water bodies may vary in their physical-chemical composition and concentration on a seasonal, diurnal or even hourly basis. These variations may be related to patterns of land and water use and precipitation [19, 20].

Land uses such as urbanization and industrialization influence the water cycle through changes in flow and water quality. Urban land use (Na+, F, Cl) may be differentiated from other land disturbances such as agricultural runoff (F, EC, NTU). Subsurface drainage water commonly includes salts such as sulfates, chlorides, carbonates, and bicarbonates of calcium and magnesium. Excess surface water drainage may also contain these salts, but generally in much lower concentrations than in subsurface drainage water [21]. Based on the results of the factor analysis and the typical sources of water pollutants, Factor 1, marked by TSS, ORP and Cl, is designated the erosion processes factor. The negative loading of ORP reflects its inverse relationship with Factor 1. These parameters are usually associated with large earth movements and with disturbance of sediment levels in the study area, which is consistent with TSS having the highest loading (1.0114).

Factor 2 is strongly correlated with F, EC and NTU (factor loadings of 1.0046, 0.6691 and 0.6195, respectively) and is designated the agricultural processes factor. pH is the only variable in Factor 3, which is taken as the indicator of biochemical processes in water.

In summary, the three factors represent three different processes:

  • Erosion processes

  • Agricultural processes

  • Biochemical processes

Hence, the water quality of the Petaquilla River Basin during the dry season period was mainly controlled by activities which trigger erosion processes.

Factor 1: High factor scores were observed for TSS (erosion processes factor). A gold mine is already established within the study area, and illegal artisanal small-scale miners use domestic gravel pumping and flushing devices in their activities. The use of these devices in gold mining constitutes a point source of contamination by suspended solids.

Factor 2: High factor scores (agricultural processes factor) were observed in the west of the basin. The middle and eastern parts, where high values were recorded, face pollution risks originating from agricultural uses.

Factor 3: (Biochemical processes factor) pH values were distributed in the basin almost uniformly and are indicative of decay. The mean value was 6.89.

Factor Analysis: High flow season

The wet season may have a positive effect by diluting surface water with rain and stormwater. Conversely, runoff can increase pollutant concentrations and thereby decrease water quality. To assess the water quality of the Petaquilla River Basin under rainy conditions, factor analysis was applied to the data sets obtained from the 3 monitoring stations between May and December. Descriptive statistics of the data are shown in Table 3.

Results of the factor analysis including factor-loading matrix, eigenvalues, total and cumulative variance values are given in Table 5.

The factor analysis yielded the three factors indicated below, which together explained 75% of the total variance.

  • Factor 1: Cl, EC, NTU

  • Factor 2: N_NO2, pH

  • Factor 3: TSS

It is proposed that Factor 1 represents agricultural processes, as shown by the presence of Cl and EC, with NTU inversely related to this factor. This factor explained 40% of the variance. Factor 2 is strongly correlated with N_NO2 and pH and is mainly influenced by biochemical processes. Factor 3 was marked by TSS. Thus, agricultural processes were the major pollution source for this hydrological period.

Considering the distribution of factor scores and locations of the monitoring stations, it is concluded that:

Factor 1: High factor scores (agricultural processes factor) were observed at the southwest part, upstream of the basin.

Factor 2: Relatively high values for biochemical processes were obtained throughout the entire basin, where agriculture and cattle grazing are the most important economic activities.

Factor 3: Significant scores for total suspended solids pollution were distributed across the basin, because Factor 3 depends on point pollution sources and is affected by eroded soils in the study area.

4.2 Assessment of the Water Quality Profile with Cluster Analysis and Decision Tree

Spatial Similarity and Stations Grouping: Low flow season

The experimental approach used hierarchical cluster analysis (CA) with Euclidean distance and the ward.D2 linkage method from the R Statistical Package, applied to the scaled data sorted by season (dry and wet). To make clear which sampling campaign each observation belongs to, we labeled each station with four levels, one per sampling date. Thus station W-1 has the labels W-1.1 to W-1.4, station W-2 has W-2.1 to W-2.4 and station W-10 has W-10.1 to W-10.4, such that each level corresponds to a sampling date (01-Jan-08, 01-Feb-09, 11-Jan-11 and 05-Apr-11, in that order for each station). Temporal CA for the dry period generated the dendrogram shown in Fig. 2, grouping the observations from the 3 water quality stations into three clusters. Cluster I comprised Stations W-1 and W-2, cluster II included Stations W-1, W-2 and W-10, and cluster III grouped all three stations over several consecutive periods. Cluster III shows similarities in water quality among the sampling periods of February 2009, January 2011 and April 2011. Figure 2 shows that the spatial patterns of water quality are not fully consistent within the dry season. From the monitoring campaigns, the highest pollution levels at Stations W-1 and W-2 (TSS = 5.0 and 10.5 mg/l, respectively) were recorded in February 2009. In cluster II, which includes stations W-1, W-2 and W-10, W-1 had the highest value of T_Alk (13 mg/l as CaCO3) in January 2008. Aside from the eroded landscape, the high levels of TSS recorded in the dry season can probably be attributed to the location of Station W-1, which lies directly in the project footprint, and of W-2, which lies downstream of it. In addition, as previously mentioned, artisanal mining by peasants is reported in the area of station W-2. With an increase in TSS, other water quality constituents such as chlorides (Cl) are more likely to be present, as indicated by the factor loadings in Table 4.
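
The grouping step can be sketched as follows. This is a plain standard-library reimplementation of agglomerative clustering with Ward's minimum-variance criterion on z-scored data, offered as an illustration only; it is not the R `hclust` code used in the study. The function names, the two variables and the six sample values are hypothetical.

```python
from statistics import mean, stdev

def zscore_columns(rows):
    """Scale each column to zero mean and unit sample standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def centroid(points):
    return [mean(c) for c in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2

def ward_clusters(rows, labels, k):
    """Merge the cheapest pair of clusters repeatedly until k clusters remain."""
    clusters = [([lab], [row]) for lab, row in zip(labels, rows)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i][1], clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(c[0]) for c in clusters]

# Hypothetical observations (NOT the study data): two variables per sample.
labels = ["W-1.1", "W-2.1", "W-1.2", "W-2.2", "W-10.1", "W-10.2"]
samples = [
    [2.0, 5.0], [2.2, 5.5],      # low on both variables
    [10.0, 40.0], [10.5, 42.0],  # high on both variables
    [1.0, 20.0], [1.1, 21.0],    # low first variable, intermediate second
]
groups = ward_clusters(zscore_columns(samples), labels, k=3)
print(groups)
```

Labeling each observation as station.level, as in the text, is what lets a cluster be read both spatially (which stations) and temporally (which campaigns).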

Fig. 2. Cluster analysis on sampling stations along Petaquilla River during low flow season.

Spatial Similarity and Stations Grouping: High flow season

For the wet season, Fig. 3 shows a grouping into 3 clusters, the third being the largest. In cluster I, station W-1 is isolated from the other stations; cluster II groups Stations W-1 and W-2; and cluster III contains all three stations. As with the dry season, the systematic arrangement of the stations with respect to the sampling dates is as follows: 1-Sep-08, 1-Oct-08, 1-Nov-08, 13-Oct-10, 1-Jun-08, 1-Oct-08, 1-Nov-08, 27-Sep-10, 1-Jun-08, 1-Sep-08, 13-Oct-10 and 15-Jun-11. Again, Station W-1, isolated in cluster I for November 2008, proves to be the most polluted in terms of TSS (248 mg/l). Cluster II shows high Cl levels (3.36 and 3.33 mg/l, respectively) at Stations W-1 and W-2 in October 2008. Cluster III shows lower levels than the other two clusters. The FA loadings in Table 5 indicate that the constituents associated with agricultural and biochemical processes accompany the high levels of eroded soil during the high flow season, and this is reflected at all three stations for the sampling campaign.

Fig. 3. Cluster analysis on sampling stations along Petaquilla River during high flow season.

The results showed that the CA technique can be useful for classifying river water [22] in the study region and was able to distinguish the sampling sites and their associated monitoring periods. In spite of the short database, the small data record was used without losing much information.

Decision Tree: Low flow season

Decision trees are among the most widely used inductive inference algorithms. They approximate discrete-valued functions, are robust to noise, and yield rules that can be analyzed and extracted as expressions. Decision trees are usually generated from top to bottom, using a greedy search to traverse the space of candidate trees. Every decision or event may lead to two or more events with different results. These decision branches can be drawn in a figure resembling the branching of a tree, which is why the method is called a decision tree. Commonly used decision tree algorithms include CART, ID3, C4.5 and C5.0 [23,24,25].
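
The greedy, top-down construction described above chooses one split at a time. A minimal sketch of one such step is given below, using entropy-based information gain of the kind employed by the ID3/C4.5 family; it is an illustration under that assumption and not the internals of the C5.0 implementation. The function names and the six T_Alk values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

def best_threshold(values, labels):
    """Greedy choice: the midpoint between sorted distinct values with highest gain."""
    cuts = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))

# Hypothetical values (NOT the study data): one attribute, two stations.
t_alk = [5.0, 6.0, 7.0, 12.0, 13.0, 14.0]
station = ["W-1", "W-1", "W-1", "W-2", "W-2", "W-2"]
cut = best_threshold(t_alk, station)
print(cut, information_gain(t_alk, station, cut))
```

A full tree repeats this choice recursively on each resulting subset, over all attributes, until the subsets are pure or too small to split.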

In this part of the experiment, the objective was to test the applicability of decision trees to the limited data set and to find rules that could reveal unseen patterns in the data and corroborate the analyses carried out previously. Using R, after several rounds of trial and error with the Party, RPart, Random Forest and C5.0 algorithms, a classifier was finally built with C5.0. Although C5.0 was able to deal with the twelve observations, we could go only as far as dividing the data in half for training and testing, as other data splitting schemes did not suffice for the task. The output tree for the dry season is shown in Fig. 4.

Fig. 4. Decision tree generated by C5.0 for the water quality stations during dry season period.

As shown in Fig. 4, the C5.0 decision tree model classified the three stations according to two key indicators (T_Alk and NTU) and generated the following water quality rules: when T_Alk <= 8, W-1; when T_Alk > 8 and NTU > 3, W-1; when T_Alk > 8 and NTU <= 3, then W-2 if NTU <= 1 and W-10 if NTU > 1.
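
The dry season rules read from Fig. 4 can be written as a small function. The thresholds are those reported in the text; the function name and the sample inputs are hypothetical and serve only to exercise each branch.

```python
def classify_dry_season(t_alk, ntu):
    """Apply the dry season rules reported in the text.

    t_alk: total alkalinity; ntu: turbidity.
    Returns the station label assigned by the tree.
    """
    if t_alk <= 8:
        return "W-1"
    # From here on T_Alk > 8, and turbidity decides.
    if ntu > 3:
        return "W-1"
    if ntu <= 1:
        return "W-2"
    return "W-10"  # 1 < NTU <= 3

# Hypothetical inputs (NOT the study data), one per branch.
for t_alk, ntu in [(7.0, 0.5), (13.0, 4.0), (13.0, 0.8), (13.0, 2.0)]:
    print(t_alk, ntu, classify_dry_season(t_alk, ntu))
```

Written this way, it is apparent that station W-1 is reached by two different paths: low alkalinity, or higher alkalinity combined with high turbidity.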

The tree model was built from the sampled data of the 12 observations. The error rate was 16.7%. The other metrics of the model are shown in Fig. 4. Prediction on the test data suggested a model accuracy of 83%, with only two misclassified instances for station W-2.

The decision tree corroborates the information given previously by the FA and the CA regarding impairment of the water quality profile at stations W-1, W-2 and W-10 by high levels of TSS. It also reveals the direct effect of TSS on the water column, seen as increased NTU levels.

Decision Tree: High flow season

The water quality conditions under the wet season, as labeled by the decision tree, are shown in Fig. 5. In this case, three water quality indicators (EC, Cl and TSS) govern the water quality profile of Stations W-1, W-2 and W-10. The water quality rules generated for this scenario are: when EC <= 37.4 and Cl <= 1.84, W-1; when EC <= 37.4 and Cl > 1.84, W-10; when EC > 37.4 and TSS <= 1, W-1; and when EC > 37.4 and TSS > 1, W-2. The error rate of this model is 25%, 8.3 percentage points higher than that of the model built under dry season conditions (16.7%). According to the confusion matrix, the model correctly classified three of the four instances for Station W-1, all four instances for Station W-10 and only two of the four instances for Station W-2. The model therefore predicted the classes with 75% accuracy.
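
The wet season rules and the reported accuracy can be checked with the short sketch below. The thresholds and the per-station counts of correctly classified instances are taken from the text; the function name, variable names and sample inputs are hypothetical.

```python
def classify_wet_season(ec, cl, tss):
    """Apply the wet season rules reported in the text (EC, Cl and TSS)."""
    if ec <= 37.4:
        return "W-1" if cl <= 1.84 else "W-10"
    return "W-1" if tss <= 1 else "W-2"

# Correctly classified instances per station, as reported from the confusion matrix.
correct = {"W-1": 3, "W-2": 2, "W-10": 4}
total = {"W-1": 4, "W-2": 4, "W-10": 4}

accuracy = sum(correct.values()) / sum(total.values())  # 9 of 12 instances
error_rate = 1 - accuracy

# Hypothetical inputs (NOT the study data), one per branch.
for ec, cl, tss in [(30.0, 1.5, 0.0), (30.0, 2.0, 0.0), (40.0, 3.0, 0.5), (40.0, 3.0, 5.0)]:
    print(ec, cl, tss, classify_wet_season(ec, cl, tss))
print(f"accuracy = {accuracy:.2%}, error rate = {error_rate:.2%}")
```

Nine of the twelve instances are classified correctly, which gives the 75% accuracy and 25% error rate stated above.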

Fig. 5. Decision tree generated by C5.0 for the water quality stations during wet season period.

5 Conclusions and Future Work

In this study, the application of multivariate techniques demonstrated that the water quality of the Petaquilla River Basin is under environmental stress.

The factors obtained from the FA indicated that water quality differed slightly between the two hydrological periods and between locations. Factor 1 of the dry season model is interpreted as water affected by erosion processes, in spite of the low flow. This may be attributed to illegal artisanal small-scale miners who, in their search for gold, employ gravel pumping and flushing devices to dredge the river bed. Added to this are the presence of an abandoned gold mining firm and, more recently, ongoing earth-moving activities in the study area. During the wet season, on the other hand, water quality was controlled by agricultural pollutant sources: this factor explained 40% of the total variance, whereas under dry season conditions it explained only 26%. The major pollutant source thus changed from erosion to agricultural processes during the wet season, which is attributed to the negative effects of runoff on surface water quality.

A further contribution of this study is that the multivariate approach was complemented with data mining techniques. The results show the importance and usefulness of applying cluster analysis to a small and complex database to obtain better information on surface water quality. Hierarchical CA grouped the 12 observed sampling campaigns into three clusters and classified the 3 sampling sites based on the similarity of their water quality parameters. The temporal pattern shows that February and November have high pollution levels in contrast with the rest of the months. The spatial pattern shows that sampling sites W-1 and W-2 have the highest levels of pollution. We also applied the C5.0 algorithm, available in R, to the water quality data. The experiment with the 12 observations and nine water quality attributes revealed interesting information through the resulting decision trees. The extracted if-then rules and the generated trees are shown in Sect. 4.2. This information is enriching and would not have been gained from PCA/FA alone.

It is evident from this study that the major pollution threats during the dry and wet seasons were erosion and agricultural processes, which are defined as nonpoint pollution sources. This suggests that it is imperative to establish an environmental monitoring and assessment plan to minimize these sources so as to protect and improve water quality in the basin.

The above analysis demonstrates the use of multivariate statistics, complemented with data mining techniques, to study a small database of physical-chemical parameters in surface water systems. It is a useful methodology that could aid decision making by determining the extent and severity of pollution via practical pollution indicators. It could also provide guidelines for prioritizing preventive measures in the proper management of the surface water resources of the basin. Future work should involve gathering new and current data and exploring or implementing new algorithms to support the analysis of small databases of river basins.