Introduction

The presence of potentially toxic elements (PTEs) poses a significant risk in various environmental sectors (Papamichaela et al. 2023; Boumaza et al. 2023; Özkan et al. 2024; Hadzi et al. 2024). Over the last few decades, the cumulative impact of these elements on the environment has been considerable. During that time, there has been an exponential increase in the concentrations of PTEs, which has heightened the risk to humans and the environment (Antoniadis et al. 2017; Kumudunis et al. 2020). Mining and heavy industrial activities may contribute to these high observed levels of PTEs and may give rise to numerous sources of contamination (Boente et al. 2018, 2022; Carvalho et al. 2022). Thus, in recent decades, researchers have invested in the development of new techniques that offer accurate scenarios of the spatial distribution of PTEs (Özkan et al. 2024; Sulemana et al. 2024; Petryshen 2023; Zhang et al. 2023). The definition of geochemical backgrounds and the identification of enrichment sources are key to the accomplishment of this objective (Wang et al. 2021; McKinley et al. 2016). Depicting pollutants requires simulated maps that visualize spatial–temporal distribution models. The definition of vulnerability and risk hotspot clusters may provide a basis for environmental policy-making in complex scenarios (Boente et al. 2020; Albuquerque et al. 2017; McKinley et al. 2016). In soil and stream-sediment science, mapping a new variable called an index or an indicator is a common technique for describing the distribution of PTEs. Classical characterization methods such as statistics and soil/stream sediment pollution indexes (SPIs) can help identify potentially polluted areas. A review by Kowalska et al. (2018) provides a detailed and critical assessment of heavy metal soil pollution using various indicators. Unfortunately, however, the compositional nature (Pawlowsky-Glahn et al. 2015; Filzmoser et al. 2009) of the geochemical data is usually not considered. In most cases, the indicators are related to the study of individual elements without considering the interdependence of the concentrations of all the elements in the same set. Non-compositional indices that are often used to study geochemical data include the geoaccumulation index (Muller 1969), the enrichment factor (Sucharova et al. 2012), and the single pollution index (SPI) (Hakanson 1980), as reviewed in Kowalska et al. (2018). Nevertheless, it is well known that a traditional statistical approach applied directly to raw data can be misleading (Chayes 1962, 1971). Aitchison (1982, 1986) addressed these problems in his fundamental work on the log-ratio method. Theories of compositional data (CoDa) have enhanced our understanding of the sample space of compositional data and its corresponding structure (Pawlowsky-Glahn and Egozcue 2001). Representations of data that consider pairwise log ratios (pwlr), isometric log-ratio coordinates (ilr), centered log-ratio coordinates (clr), and additive log-ratio coordinates (alr) are statistically robust approaches to deal with the compositional nature of chemical concentration data (Pawlowsky-Glahn and Egozcue 2001; Egozcue et al. 2003; Buccianti and Grunsky 2014). The compositional approach is well represented in various fields of research in environmental science, such as ecotoxicology (Mullineaux et al. 2021), urban impacts (Cicchella et al. 2020), water quality management (Wei et al. 2018), and human health (Tepanosyan et al. 2020; Pawlowsky-Glahn and Buccianti 2011; Filzmoser et al. 2021). Recently, the adoption of compositional indicators for characterizing PTE pollution of soil has been increasing (Boente et al. 2022; Petrik et al. 2018). Compositional indicators involving the definition of geochemical baselines offer a valuable contribution as they are scale invariant and sub-compositionally coherent; scale invariance means that a change in the concentration unit used will not modify the study’s results (Pawlowsky-Glahn et al. 2015).

Compositional indicators are commonly used to measure water and air contamination. In stream sediments and soils, the use of compositional indexes or indicators to address pollution has only recently been explored (Boente et al. 2022). The primary challenge is the highly varied geochemical background, which hinders the distinction between what is polluted and what is natural. In addition, to assess stream-sediment pollution, the compositional baseline needs to account for a set of key issues: (1) the compositional nature of the data; (2) spatial changes in the background; (3) the definition of pollution; (4) the formulation of the indicator as a log-contrast; and (5) the need for an indicator for every type of pollution. This research introduces a new compositional pollution indicator (CPI) for riverine sediments, based on expert criteria, to characterize pollution at the Caveira mine in southern Portugal. This indicator corresponds to a balance of elements that respects the CoDa principles (Aitchison 1982).

Material and methods

Characteristics of the study area and the dataset

The studied sector is part of the Portuguese Iberian Pyrite Belt and is an example of a European post-mining area that dates back to the 1990s. Mining activity ceased there mainly because of ore exhaustion and the emergence of more profitable methods worldwide, which reduced ore prices and made local mining activities infeasible (Martins and Oliveira 2000). As a result, major pollution problems related to metal dispersion and mine waste management are present there. The geological sequence at Caveira mine, which closed back in the 1980s, corresponds to (from bottom to top) phyllites and quartzites (PQG) followed by a volcanic sedimentary complex (VSC) unit (Late Famennian) represented by pyroclastics, rhyolitic lavas, tuffs, dark gray and siliceous shales, and rare jaspers. Intruding diabase rocks can be seen in the northern sector (Fig. 1). The massive sulfide deposits that were exploited in the region occurred in the vicinity of felsic volcanic rocks. The Mértola formation, of Visean age, overlies the VSC and corresponds to a flysch sequence consisting of sandstones alternating with shales and thin-bedded siltstones. From a structural point of view, the whole sequence is part of the South Portuguese Zone, a thin-skinned fold and thrust belt of Variscan age. Tailings and associated waste rock resulting from 129 years of pyrite and Cu mining are scattered along Grândola Stream. The semi-arid climatic conditions result in high erosion of residues by surface water, primarily during rainfall, causing serious contamination of Grândola Stream and its tributaries due to the degradation of sediments (Ferreira da Silva et al. 2015).

Fig. 1 Study area and the sample collection locations

A dataset of 33 bottom-sediment samples was obtained from small, narrow creeks, two of which flow past the mine tailings pile, and from the larger Grândola Stream, of which they are tributaries. These streams belong to the Sado River basin, the second-largest hydrographical basin in Southern Portugal. Samples were collected at 0 to 10 cm depth with an environmental hand soil sampling kit (#209.55, AMS) from within a grid of 1 km × 1 km. These samples were preserved at about 4 °C and later analyzed for 12 chemical elements, including PTEs of variable toxicity (As, Cd, Co, Cr, Hg, Mn, Ni, Pb, Zn, V) and major elements from lithogenic sources (Fe, Al). The most extractable forms of the metals (except for Hg) were obtained by partial digestion with aqua regia (HCl and HNO3) in a high-pressure microwave digestion unit (Anton Paar Multiwave PRO) following US EPA (2007) method 3051A. The metals and As were analyzed by inductively coupled plasma optical emission spectroscopy (ICP-OES, PerkinElmer Optima 8300), using yttrium as an internal standard. The accuracy and analytical precision of all the analyses were checked through the analysis of reference materials and duplicate samples in each analytical set.

Mercury (Hg) was analyzed with a mercury analyzer (NIC MA-3000) based on thermal decomposition, gold amalgamation, and cold vapor atomic absorption spectroscopy detection. Sampling was followed by immediate readings of pH and redox potential values in wet samples using a portable multi-parameter analyzer (Consort C5020, with the SP10T model for pH and the SP50X model for redox potential). In samples with insufficient moisture for direct pH readings, this parameter was measured in a water–sediment suspension (2.5:1) in the laboratory. Concerning the samples’ chemistry, the dataset included PTEs of variable toxicity (Fabian et al. 2014). The set of 12 elements was reported for each of the 33 sampling points, resulting in a 12-part composition that was assumed to represent the stream sediments.

Compositional pollution indicator (CPI) construction

The fundamental principles of compositional data are to be found in the founding work of Aitchison (1986). These initial contributions were explained and expanded in general-purpose works such as those of Pawlowsky-Glahn et al. (2015), van den Boogaart and Tolosana-Delgado (2013), Filzmoser and Hron (2011), Pawlowsky-Glahn and Buccianti (2011), and Pawlowsky-Glahn and Serra (2019).

The analysis of the chemical composition of a riverine sediment sample, reported in units such as mg/kg, should be performed under the assumption that the data are compositional. Moreover, a conversion of units, for example from mg/kg to g/kg, must not change the information about the sample. This is summed up by one of the principles of CoDa analysis, the principle of scale invariance. Thus, when analyzing the data, the functions used to describe the composition should be invariant under multiplication by a positive constant (Boente et al. 2022). Consequently, any composition can be expressed as proportions (where the components add to 1) without adding or losing information, regardless of the units in which the data were originally reported. A second principle is known as sub-compositional coherence. The whole periodic table is never reported; only a subset of elements is measured, and this subset may change over time and across the field. The elements observed form a composition, and any subset of them is a sub-composition that is again subject to the principle of scale invariance. Analyses of the initial composition and of a sub-composition should lead to coherent conclusions about the roles of the common elements (Aitchison 1986).
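The two principles can be checked numerically. The following is a minimal sketch in Python (standard library only); the four-part composition and the helper names `closure` and `log_ratio` are hypothetical and chosen for illustration, not taken from the dataset or from any CoDa package.

```python
import math

def closure(parts):
    """Rescale a composition so that its parts sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]

def log_ratio(parts, i, j):
    """Natural log of the ratio between parts i and j."""
    return math.log(parts[i] / parts[j])

# Hypothetical composition in mg/kg (illustrative values, not from Table 1).
sample_mg_kg = [120.0, 45.0, 800.0, 30000.0]
sample_g_kg = [p / 1000.0 for p in sample_mg_kg]  # same sample, other unit

# Scale invariance: a unit change or a closure leaves the log-ratio unchanged.
lr_mg = log_ratio(sample_mg_kg, 0, 3)
lr_g = log_ratio(sample_g_kg, 0, 3)
lr_closed = log_ratio(closure(sample_mg_kg), 0, 3)

# Sub-compositional coherence: dropping the third part and re-closing
# does not alter the log-ratio between the parts that remain.
sub = closure([sample_mg_kg[0], sample_mg_kg[1], sample_mg_kg[3]])
lr_sub = log_ratio(sub, 0, 2)
```

All four log-ratios agree up to floating-point rounding, whereas the raw concentrations and the closed proportions themselves differ between the representations. This is why log-ratios, rather than individual concentrations, are the natural building blocks for an indicator.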


The CPI balance was obtained based on expert criteria applied to a selection of elements (Boente et al. 2022), some of which are considered pollutants while others are not. In the case of the Caveira mine, the main contaminants were selected from typical pollutants, namely As, Zn, Pb, and Hg, while Al and Fe were selected as the main natural-source elements (or non-pollutants). Based on that previous study, the selected balance, the CPI, was constructed as follows:

$$\text{CPI}=\sqrt{\frac{4}{3}}\,\ln\left(\frac{{\left(\text{As}\cdot \text{Zn}\cdot \text{Pb}\cdot \text{Hg}\right)}^{1/4}}{{\left(\text{Al}\cdot \text{Fe}\right)}^{1/2}}\right).$$
(1)
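Equation 1 can be evaluated sample by sample. The sketch below is one possible Python implementation (standard library only), assuming that each sample is stored as a dictionary of strictly positive concentrations in a common unit; the function names and the example values are hypothetical. The coefficient is written in the general balance form sqrt(r·s/(r + s)) with r = 4 numerator parts and s = 2 denominator parts, which reduces to the factor sqrt(4/3) of Eq. 1.

```python
import math

POLLUTANTS = ("As", "Zn", "Pb", "Hg")  # numerator parts of Eq. 1
NATURAL = ("Al", "Fe")                 # denominator parts of Eq. 1

def mean_log(sample, elements):
    """Log of the geometric mean of the selected parts."""
    return sum(math.log(sample[e]) for e in elements) / len(elements)

def cpi(sample):
    """Compositional pollution indicator of Eq. 1 for one sample."""
    r, s = len(POLLUTANTS), len(NATURAL)
    coefficient = math.sqrt(r * s / (r + s))  # sqrt(8/6) = sqrt(4/3)
    return coefficient * (mean_log(sample, POLLUTANTS) - mean_log(sample, NATURAL))

# Hypothetical concentrations in mg/kg (illustrative only, not from Table 1).
hypothetical_sample = {"As": 100.0, "Zn": 200.0, "Pb": 400.0,
                       "Hg": 1.0, "Al": 20000.0, "Fe": 30000.0}
cpi_value = cpi(hypothetical_sample)
```

Because the logarithm is undefined for zero, values below the detection limit would need to be treated before this function is applied; how that was handled is not stated in the text, so the sketch simply assumes positive inputs.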

Spatial modeling—a geostatistical approach

The computed compositional pollution indicator (CPI) is unbounded—a real random variable. Therefore, it fulfills the assumptions underlying a conventional geostatistical approach. Its spatial probability maps were computed by following a two-step geostatistical modeling method: (1) a structural analysis and a computation of experimental variograms (Journel and Huijbregts 1978) were performed, followed by (2) sequential Gaussian simulation (SGS), which was used as a stochastic simulation algorithm over a 100 × 100 km grid mesh.

The new CPI can be considered a regionalized variable (Matheron 1971), as it depends on the spatial location determined by the coordinates and is additive by construction. Indeed, the mean value within a given observed support is equal to the arithmetic average of the sample values independently of the associated statistical distribution (Albuquerque et al. 2017; Rivoirard 2005). Thus, the vector function used to calculate the spatial variation structure was the semi-variogram (Journel and Huijbregts 1978).

$$\gamma \left(h\right)=\frac{1}{2N(h)}\sum_{i=1}^{N(h)}{\left[Z\left({x}_{i}\right)-Z({x}_{i}+h)\right]}^{2}$$
(2)

Here, h is the separation distance (lag), Z(xi) and Z(xi + h) are the values of the variable at locations xi and xi + h, and N(h) is the total number of pairs of points separated by a distance h. The semi-variogram is therefore half the average squared difference between all pairs of points in the geometric field spaced at a distance h (Journel and Huijbregts 1978). Plotting the variogram gives an overall view of the spatial structure of the variable. One of the parameters that provide this information is the nugget effect (C0), which describes the behavior at the origin. The two other parameters are the sill (C1) and the range, or amplitude (a), which define the inertia used in the subsequent interpolation process and the influence radius of the variable, respectively.
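As an illustration of Eq. 2, the following Python sketch (standard library only) computes an omnidirectional experimental semi-variogram and evaluates a spherical model, the model type named in the caption of Fig. 4. The lag tolerance `tol`, the function names, and the reading of C1 as the partial sill added to the nugget are assumptions made for this example, not details given in the text.

```python
import math
from itertools import combinations

def experimental_variogram(coords, values, lags, tol):
    """Eq. 2 per lag class; gamma is None where no pair falls in the class."""
    sums = [0.0] * len(lags)
    counts = [0] * len(lags)
    for (p, zp), (q, zq) in combinations(zip(coords, values), 2):
        dist = math.dist(p, q)
        for k, h in enumerate(lags):
            if abs(dist - h) <= tol:  # pair belongs to lag class h
                sums[k] += (zp - zq) ** 2
                counts[k] += 1
    gamma = [s / (2 * n) if n else None for s, n in zip(sums, counts)]
    return gamma, counts

def spherical_model(h, nugget, partial_sill, a):
    """Spherical variogram model with nugget C0, partial sill C1, range a."""
    if h == 0:
        return 0.0
    if h >= a:
        return nugget + partial_sill
    ratio = h / a
    return nugget + partial_sill * (1.5 * ratio - 0.5 * ratio ** 3)
```

In practice, the model parameters would be adjusted until the curve follows the experimental points, and the quality of the fit would then be checked by cross-validation.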

The SGS starts by computing the univariate experimental distribution of values and performing a normal score transformation of the original values to a standard normal distribution (Goovaerts 1997). Normal scores at grid node locations are then simulated sequentially with simple kriging (SK) using the normal score data and a zero mean. Once all normal scores have been simulated, they are back-transformed to their original units. The outcome of a simulation is always a random version of the estimation process that reproduces the statistics of the known data and builds a realistic picture of the phenomenon. The associated spatial uncertainty is visualized through the construction of probability maps. If multiple realizations are computed, it is possible to obtain reliable probabilistic maps. The mean image map and maps of the probability of exceeding the third quartile (Q3) and the probability of not exceeding the first quartile (Q1) were computed.
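The first and last steps of this workflow, the normal score transformation and its back-transformation, can be sketched as follows in Python (standard library only). This is a simplified illustration under stated assumptions: the plotting position (rank + 0.5)/n, the linear interpolation between data pairs, and the clamping of the tails to the observed minimum and maximum are choices made for this example, and tied values are not given special treatment. The kriging-based sequential simulation itself is not reproduced here.

```python
from statistics import NormalDist

def normal_scores(values):
    """Map values to standard normal scores through their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # The plotting position keeps the probability strictly inside (0, 1).
        scores[idx] = std_normal.inv_cdf((rank + 0.5) / n)
    return scores

def back_transform(score, values, scores):
    """Map a simulated normal score back to the original data units."""
    pairs = sorted(zip(scores, values))
    if score <= pairs[0][0]:
        return pairs[0][1]   # clamp the lower tail to the smallest datum
    if score >= pairs[-1][0]:
        return pairs[-1][1]  # clamp the upper tail to the largest datum
    for (s0, v0), (s1, v1) in zip(pairs, pairs[1:]):
        if s0 <= score <= s1:
            t = (score - s0) / (s1 - s0)
            return v0 + t * (v1 - v0)  # linear interpolation between data
```

The transform preserves the rank order of the data, so a back-transformed realization honors the sample distribution of the original variable.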

Results and discussion

Geochemical data

Physicochemical parameters and the levels of PTEs of variable toxicity (As, Cd, Co, Cr, Hg, Mn, Ni, Pb, Zn, V), together with the selected elements from lithogenic sources (Fe, Al), were analyzed with the aim of mapping contamination, accounting for the elements’ capacity for solubilization and mobilization. The evaluation of each metal’s mobility was based on partial digestion analysis (using aqua regia), considering the pH values. The element concentrations and pH values in the stream sediment samples are reported in Table 1. The pH, the physicochemical parameter that most affects the solubility, mobility, and precipitation of potentially toxic metals in the sediments from shallow streams, ranges from 2.06 to 7.39. The lowest values (2.06–4.57) are found in the sediments from the two creeks that flow through the mine tailings pile. As would be expected, those sediments (Cv1, Cv2, Cv3, Cv26, Cv33, Cv34) contain the highest values of Pb, As, and Hg, the main contaminants in the mine tailings. The levels of those contaminants rise above the critical levels that require immediate intervention according to European regulations (based on the Netherlands legislation, Soil Quality Regulation, 2006). Zn, another element with levels of concern, and one that is abundant in the massive sulfides that were exploited in this mine, shows slight contamination levels in all the streams flowing from the tailings pile, mostly in locations that do not coincide with those where the other elements exceed critical levels. The highest values of this element also do not coincide with the most acidic conditions in the environment. Although the ores that were exploited in this mining area contained all of these elements, Zn has a high chemical mobility that is mostly influenced by the oxidation conditions that occur in all the sediments (240–650 mV), so its distribution is more diffuse.

Table 1 Element concentrations (mg/kg) in the stream sediment samples (spring season) from Grândola and its tributary streams. These waterways belong to the Sado watershed

Descriptive statistical analysis

A preliminary descriptive analysis was conducted to gain a comprehensive overview of the dataset in terms of the statistical distribution of the elements chosen for the CPI balance. This crucial step offers a preliminary glimpse into the central tendencies, dispersion, and distribution of the variables under scrutiny and at the same time provides a succinct summary of the main features of the studied dataset (Table 2).

Table 2 Descriptive statistics of the main selected contaminants: As, Zn, Pb, and Hg (pollutants) as well as Al and Fe (the main natural-source elements or non-pollutants)

The pollution assessment is strongly influenced by the presence of severe outliers for Pb, which point to very high concentrations of this element at some sampling sites (Fig. 2).

Fig. 2 Box-plot diagrams of the CPI’s elements

A heat map was also used for a synthetic, simultaneous exploratory analysis of the geochemical composition and the sample clustering (Fig. 3) (Wilkinson and Friendly 2009; Langella et al. 2013). The heat map shows that the elements are divided into two groups (upper dendrogram). The first group corresponds to Al and Fe (non-pollutants) and the second group corresponds to As, Cd, Co, Cr, Mn, Ni, V, Zn, Hg, and Pb. However, Pb is significantly separated from the other covariates in the major group. Based on an expert-driven approach, As, Zn, Pb, and Hg were selected as pollutants. The dendrogram of samples (the left dendrogram) is divided into three groups. The central group corresponds to samples Cv6, Cv8, Cv10, Cv13, Cv14, Cv16, Cv17, Cv18, Cv21, Cv23, Cv25, Cv28, Cv29, Cv7, and Cv32; the left group corresponds to samples Cv1, Cv2, Cv15, and Cv26; and the right group corresponds to samples Cv3, Cv4, Cv5, Cv9, Cv11, Cv12, Cv19, Cv20, Cv24, Cv27, Cv30, Cv31, Cv33, and Cv34. The map presents the values in the dataset rearranged according to the dendrograms. Focusing on the rectangle/square patterns in the map (note that the level of significance increases from red to blue through white), in particular those for the bottom group of samples Cv1, Cv2, Cv15, and Cv26 together with samples Cv34 and Cv3 of the upper group, it is possible to discern a lower-significance cluster for Al and a higher-significance cluster for Pb relative to the other samples. In future work, the relationship between the samples’ geochemical print and the associated geology will be explored.

Fig. 3 Heat map and simultaneous sample/geochemical print dendrograms

The compositional pollution indicator

The compositional balance of the CPI was obtained according to expert criteria. These criteria account for a selection of factors, some of which are considered pollutants while others are not. In the case of the Caveira mine, the identification of the main pollutants was addressed in previous studies (e.g., Ferreira da Silva et al. 2015), where typical pollutants such as As, Zn, Pb, and Hg were identified as being related to the activities of the old Iberian Pyrite Belt mines. The main natural-source elements (or non-pollutants) were two major elements (i.e., Al and Fe). Spatial modeling of the CPI was performed to identify hazardous clusters. A two-step geostatistical approach was used. As no clear evidence of anisotropy was found, the experimental isotropic variogram was computed, and the corresponding fitted model is shown in Fig. 4. The cross-validation correlation index for the observed and estimated CPI values is 0.70 and is therefore considered satisfactory for the selected models. Furthermore, a hundred simulations were performed using SGS as a conditional stochastic simulation of the CPI value distribution, and a hundred equiprobable scenarios were computed.

Fig. 4 Experimental omnidirectional variogram and the fitted spherical model

Probability maps corresponding to different thresholds allowed the visualization of spatial variability while setting aside the discussion of local accuracy, and they allowed the identification of hotspot clusters of pollution in the subject area. Realization numbers 1, 15, 32, 52, 67, and 99 are shown in Fig. 5.

Fig. 5 Six different scenarios obtained by sequential Gaussian simulation (SGS)

The difficulty is that all realizations (scenarios) are equally probable, which means that no single realization can be regarded as a better representation of reality than the others. Therefore, the mean spatial image (MI), i.e., the average map, was computed and used as the CPI spatial distribution (Fig. 6a). The maps of the probability of exceeding the third quartile (Q3) and the probability of not exceeding the first quartile (Q1) allow a broad discussion of the CPI spatial distribution and the identification of hazard clustering (Fig. 6b, c). To create distinct classes by minimizing the within-class variance and maximizing the between-class variance, the Jenks natural breaks classification (Jenks 1967) was used, which allowed the determination of the best arrangement of values. The software Space-Stat v.4.0.18 (BioMedware) was used for the computation (Boente et al. 2022; Albuquerque et al. 2017).
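The post-processing of the realizations can be sketched in a few lines of Python (standard library only). This is a hedged illustration rather than the procedure implemented in the software cited above: it assumes that each realization is stored as a list with one simulated CPI value per grid node, and that the Q1 and Q3 thresholds are supplied by the user. The function name is hypothetical.

```python
def summarize_realizations(realizations, q1, q3):
    """Node-wise mean image and probability maps from equiprobable realizations.

    realizations: list of equally sized lists, one simulated value per node.
    Returns (mean_image, p_above_q3, p_not_above_q1), one value per node.
    """
    n_real = len(realizations)
    mean_image, p_above_q3, p_not_above_q1 = [], [], []
    for node_values in zip(*realizations):
        mean_image.append(sum(node_values) / n_real)
        # Share of realizations in which the node exceeds Q3.
        p_above_q3.append(sum(v > q3 for v in node_values) / n_real)
        # Share of realizations in which the node does not exceed Q1.
        p_not_above_q1.append(sum(v <= q1 for v in node_values) / n_real)
    return mean_image, p_above_q3, p_not_above_q1
```

With a hundred realizations, each probability is a multiple of 0.01, so the resulting maps can be classified afterwards, for example with natural breaks.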

Fig. 6 a Average SGS image (MI). Maps of the probability of b exceeding Q3 and c not exceeding Q1

The compositional pollution indicator (CPI) provides a fair representation of hotspots, especially along Grândola Stream and its tributaries, thereby confirming the high pollution detected around the old mine tailings and associated waste rock.

Conclusions

Geochemical data are compositional data, as the concentrations of the elements in any environmental matrix are commonly expressed as parts of the whole and vary together. Once this feature is accepted, compositional data procedures can be applied to obtain indicators that address pollution in, for example, stream sediment. The method was tested with up to 11 chemical elements in 33 sediment samples from the old Caveira mine in Portugal.

A high risk of contamination is observed along Grândola Stream and in the vicinity of the mine tailings. It is important to consider agricultural and organic stocks as the main economic activities when establishing two lines of intervention: (1) the installation of a surveillance network for continuous control in all areas and (2) the definition of mitigation actions for the northern area, where high levels of contamination are observed.