Assessment of groundwater safety surrounding contaminated water storage sites using multivariate statistical analysis and Heckman selection model: a case study of Kazakhstan

Petrochemical enterprises in Kazakhstan discharge polluted wastewater into special recipients. Contaminants infiltrate through the soil into the groundwater, which potentially affects public health and environment safety. This paper presents the evaluation of a 7-year monitoring program from one of the factories and includes nineteen variables from nine wells during 2013–2019. Several multivariate statistical techniques were used to analyse the data: Pearson’s correlation matrix, principal component analysis and cluster analysis. The analysis made it possible to specify the contribution of each contaminant to the overall pollution and to identify the most polluted sites. The results also show that concentrations of pollutants in groundwater exceeded both the World Health Organization and Kazakhstani standards for drinking water. For example, average exceedance for total petroleum hydrocarbons was 4 times, for total dissolved solids—5 times, for chlorides—9 times, for sodium—6 times, and total hardness was more than 6 times. It is concluded that host geology and effluents from the petrochemical industrial cluster influence the groundwater quality. Heckman two-step regression analysis was applied to assess the bias of completed analysis for each pollutant, especially to determine a contribution of toxic pollutants into total contamination. The study confirms a high loading of anthropogenic contamination to groundwater from the petrochemical industry coupled with natural geochemical processes.


Introduction
Safe drinking water is one of the sustainable development goals announced by the UN; however, in many countries, the goal remains far off. In 2015, the distribution of global groundwater use was estimated to 50% for drinking purpose and 43% for irrigation (UNESCO 2015). Historically, groundwater quality has been deteriorated by human activities, such as agricultural, industrial, and urbanization processes (WHO 2006). In Kazakhstan, groundwater withdrawal amounted to 1.078 km 3 in 2016 (UN 2019). One crucial problem in the country is toxic wastewater from petrochemical factories (Radelyuk et al. 2019), a very important factor in Kazakhstani economy. The oil refinery industry is represented by three large factories, and their capacity is estimated to be 360 thousand barrels daily with an annual growth of 2.9% (BP 2018). Additionally, refineries are associated with petrochemical industry. Industrial clusters are established around core refineries. It leads to growth of production and increasing level of contamination. The problem is that the current methods of wastewater treatment in the petrochemical sector, as well as the conditions of the treatment units built during the Soviet era, do not assure a safe level of contaminants concentrations for the ecological systems. Thus, the existing discharge system has a significant negative impact on the environment and could potentially become a health issue for the population.
The groundwater is the main source for decentralized and centralized drinking water supply in rural areas in Kazakhstan, where more than half of the population live (Zhupankhan et al. 2018;Bekturganov et al. 2016). The perceived water quality has been assessed in several research and showed relative satisfaction (Tussupova et al. 2015. However, in situ water quality and potential risk for groundwater safety have not been covered within existing scientific literature. Simultaneously, petrochemical plants in Kazakhstan continue to discharge wastewater with high concentrations of different pollutants and these contaminants may reach the groundwater very easily. Despite of existing system of ecological monitoring, oil refinery cluster in Kazakhstan is ranked as one of the biggest sources of water contamination by United Nations Economic Commission for Europe (UNECE 2019). Recent studies showed that approximately 1.5% of total deaths in Kazakhstan caused by waterborne diseases related to water pollution, including industrial sources (Karatayev et al. 2017).
While contaminated sites occupy relatively small area, they belong to larger aquifers and potentially cause serious hazard (Maskooni et al. 2020). The contaminated sites are considered as a serious problem worldwide (Kovalick and Montgomery 2017). Moreover, the situation becomes worse if governments deny any environmental pollution or the contaminated sites are not investigated (Naseri Rad and Berndtsson 2019). Research-based approach can deal with the situations and helps do right decisions about remediation programs and protect population and environment from related risks . Thus, it is urgent to identify the main sources of groundwater pollution from petrochemical industry in Kazakhstan in order to eliminate the risks.
Multivariate statistical techniques have been widely used for assessment of surface and ground water quality (Shrestha and Kazama 2006;Naseh et al. 2018;Cloutier et al. 2008;Ghahremanzadeh et al. 2018;Noori et al. 2010;Patil et al. 2020). The natural transformations happen due to saltwater intrusion, lithological/geochemical processes, rainfall and snowmelt, eutrophication processes. The anthropogenic invasion due to urban development, industrial and agricultural activities, influence by rural settlements significantly contributes to groundwater pollution, and consequently, affects the water quality. Thus, multivariate statistical techniques are efficient tools identifying and separating the main probable sources of pollution in the context of land-use changes. Three techniques are particularly common: Pearson's correlation, Principal Component Analysis and Cluster Analysis. Correlation matrix is used to determine potential interactions between different chemicals by pairwise variables comparison. PCA is used to identify statistically the most significant parameters, which are considered as major contributors to total contamination. Finally, CA combines similar groups of observations together. The techniques have been successfully applied, e.g., Egbueri (2019) divided his study area in Nigeria into insignificantly and highly polluted sites by using CA. Awomeso et al. (2020) investigated and identified possible sources of groundwater contamination such as leachate from septic tanks, nutrients from agricultural fields and chlorine pollution. The multiple natural and anthropogenic sources of surface and groundwater pollution have been presented by Omo-Irabor et al. (2008) in Nigeria. Impact on shallow groundwater in irrigated areas has been investigated by Trabelsi and Zouari (2019) in Tunisia. Shrestha and Kazama (2007) combined sites as less polluted, medium polluted and highly polluted, based on the similarities of water quality indicators in Japan. The same was for Kazi et al. (2009) who investigated the problem of water contamination by agriculture and industry in Pakistan. Liu et al. (2003) showed influence of processes of saltwater intrusion and arsenic pollution in Taiwan. Groundwater pollution sources apportionment in a land with high density of agriculture, industry and urbanization has been investigated in southwestern China (Li et al. 2019). Hence, the multivariate statistical techniques let researchers successfully investigate certain case studies.
The aim of this paper is to analyse and interpret a dataset obtained during a 7-year (2013-2019) monitoring program of the wastewater discharge systems in one of the Kazakhstani industrial clusters. This dataset includes concentrations of substances in groundwater from nine observed wells surrounding the wastewater recipient. Kazakhstani law (Kazakhstan 2015) requires that strict standards for groundwater quality surrounding recipients are followed. If the requirements are neglected, the responsible company should take actions to eliminate the risks for the environment and people. Matrix correlation, PCA, and CA multivariate techniques were applied to (1) determine main pollutants with elevated concentrations in groundwater, (2) assess the contribution of each contaminant to temporal variations in groundwater quality and identify their potential origin, and (3) group the contamination sites affecting water quality and their potential sources by relevant similarities. The results contribute to the description of the spatial-temporal changes in groundwater quality of the study area. Heckman selection model was used to avoid bias of the results and look at specific properties of each pollutant more carefully. Moreover, the study highlights the main sources of contamination at the different locations of the study area and is thus of interest for local key stakeholders, groundwater modelling researchers, and risk analysis managers.

Study area
The industrial site of this study belongs to the special economic zone and is located in the north-eastern part of Kazakhstan. The region is located in a sharply continental zone, where mean monthly temperatures range from -19.3°C in January to ? 21.5°C in July, with an annual mean of 3.5°C, absolute maximum of ? 42°C and absolute minimum of -47°C.
Annual precipitation is around 303-352 mm, including 264 mm in liquid phase. The driest months are May, June, and July. Potential annual evaporation is around 957 mm (Heaven et al. 2007). Average relative humidity equals 82% and 45% for the coldest and the hottest period of the year, respectively. 70-85 days of the year is represented with the humidity 80% and more.
The recipient pond ( Fig. 1) is based on a natural bitter-salty pond for receiving and storing biologically treated wastewater from the nearby located petrochemical industry. According to Kazakhstani legislation (Kazakhstan 2012), this pond is not a source for drinking, domestic and irrigation water. The annual volume of received wastewater amounted to 1.63-2.21 million m 3 for the period 2009-2019, instead of designed 4.12 million m 3 . The water volume and water surface for the same period are maintained within 3.6-6.7 million m 3 and 2.45-3.73 km 2 , respectively, instead of the designed 23.5 million m 3 and 5.23 km 2 , respectively. Observation wells are located out of barrier for groundwater quality monitoring and belong to permanent control from governmental bodies. The installation procedures followed appropriate installation technique in case of required installation materials and methods and planning of the location of the monitoring system (Houlihan and Lucia 1999). The depth of the wells varies between 10.1 and 24.6 m below ground level. The groundwater depth in the wells varied between 1.1 and 4.9 m.
The hydrogeological conditions of the study area have been poorly investigated during soviet and postsoviet periods. The geological cross-section is represented by four geologic-genetic layers: contemporary sediments (land cover), upper-quarternary and contemporary aeolian-deluvial deposits (clayey sand) and upper-quarternary alluvial deposits (loam and/or fine to medium-grained sands). The geological profiles of the examined wells are presented in Fig. 2. Groundwater is represented by two aquifers: shallow unconfined and confined aquifers. The upper aquifer is composed of clay-sand and mixed size sands. The bottom of the aquifer lays on the depth 8.0-24.0 m below surface level. The aquifer is mainly recharged from water infiltration from the surface. The discharge is partly due to evapotranspiration and partly due to percolation to the underlying aquifer. Amplitude of seasonal fluctuation of groundwater table is about 0.7 m (Fig. 3). The figure shows that the GW level has peak values after the winter during the snowmelt season and after that reaches its minimal values during the summer. Interpolation using inverse distance method was used to establish GW flow direction and the bottom of the first aquifer. Figure 4 shows a contour map of the groundwater level and the elevation of the bottom of the unconfined aquifer. The second aquifer composed of medium-grained and small-grained sands. It is recharged from the head water and from the upper aquifer. The aquifer discharges to the nearest river, which is located 4 km west from the pond.
A total of 117 groundwater samples from the shallow aquifer were collected and analyzed between 2013 and 2019, from all observation wells. Sampling was made two times per year, in spring and autumn. The groundwater depth was measured regularly from March to November each year. The procedures of the sampling and measurements are controlled by Kazakhstani legislation from the sufficient international standards (Houlihan and Lucia 1999). Before sampling, the groundwater in the well was evacuated several times (usually, three times) by pumping. The pumping equipment was also flushed prior to sampling  to avoid unwanted pollution. After establishing a static water level, the sampler was immersed to a depth below the water table by 0.5 m or less. Water samples were collected in 1-l dark glass bottles. The vessels were moved into a transportable fridge for immediate delivery and analysis to the licensed factory

Multivariate statistical techniques
Correlation analysis, principal components analysis (factor analysis), and hierarchical cluster analysis were applied to identify the multivariate relationships between different variables and samples in the study area. The dataset was normalized for elimination of the effect from differences in units (Eq. 1).
where Z ij are normalized values from x ij , i is represented variables, j is the sample number, m i is the mean value and SD is the standard deviation of the sample. The relation between each pair of variables was measured by Pearson's correlation coefficient to determine the geochemical associations among different variables. Correlation coefficients greater than 0.5 were considered significant. PCA recognizes the most significant parameters from a big dataset of intercorrelated parameters and created independent variables (Eq. 2).
where z is the component score, a is the component loading, x is the measured value of variable, i is the component number, j is the sample number and m is the total number of variables. Factor analysis (FA) is a similar technique as PCA. However, PC is presented as a linear combination of parameters. FA follows PCA and takes into account unobservable, hypothetical, and latent variables. They are included in equation with the special residual term (Eq. 3).
where z is the measured variable, a is the factor loading, f is the factor score, e is the residual term according to errors or other source of variation, i is the sample number and m is the total number of factors. Cluster analysis was used to assemble similar groups of observed wells due to similarities between their variables. Hierarchical agglomerative CA provided Ward's linkage distance, reported as D link /D max , which represents the quotient between the linkage distances for each case divided by maximal linkage distance. Produced dendrogram lets to analyse similarities easily. Ward's linkage, the Euclidean distance as similarity measurements, and Q-mode are usually used for cluster analysis for assessment of groundwater quality (Egbueri 2019;Cloutier et al. 2008;Kazi et al. 2009;Awomeso et al. 2020;Trabelsi and Zouari 2019;Amanah et al. 2019;Bouteraa et al. 2019).

Heckman selection analysis
Heckman selection analysis, to the authors' knowledge, has never been applied to assess the environmental characteristics. This type of analysis was adapted from the original work of Heckman in the economical science (Heckman 1979) and from the application this type of this method in other fields, for example in the assessment of energy production (Sun et al. 2014), urban transportation research (Kaplan et al. 2016) and estimating crash rate (Xu et al. 2017). The method in this study is used to assess unobservable variables, that potentially impact on the total contamination rate. Gadgil investigated the list of chemicals in the WHO guidelines (Gadgil 1998) and concluded that certain chemicals have no strong requirements for their concentrations in drinking water, as the exposure of exceeded concentrations for human health is not significant. The idea of this assessment is not just looking at several contaminants and their concentrations, but also to consider and evaluate other important factors such as location of the sampled value, percent of exceeding of the certain contaminant and individual characteristics of the contaminant. Selected variables were divided into two categories. First: chemicals seriously affecting health (rated as sanitary toxic due to Kazakhstani standard (Kazakhstan 2015)); second: other hazardous materials (rated as non-toxic). It is aimed to compare potential effect of toxic and non-toxic contaminants. We focused, on the one hand, on several pollutants with elevated concentrations, such as chlorides or sulfates, which are not rated as significant impact on health, but can be dangerous for other cases, for instance, for corrosion of pipes, or for irrigation properties of soil; on the other hand, on the contaminants, rated as dangerous for the health, or toxic (for example, hardness or petroleum hydrocarbons).
This model includes two-step equation, which is assumed as an advanced regression model equation: where Y i is considered as total contamination, S i represents the concentration of chemicals, and X i shows several contaminants as a set of control variables. The effect of the exceeded concentrations on the total contamination is given by the parameter b 1 . Parameter i represents each individual observation.
Equation (4) does not consider other potentially important independent variables which can affect for final result. For example, it could be locations of the wells or individual characteristics of different contaminants such as their toxicity and exposure level in case of influence of chemicals for people's health. There could be a different input of high exceeding of non-toxic contaminant and low exceeding of toxic contaminant. Second one would be much more dangerous for health. Thus, more attention should be paid to the level of toxicity. Specific description of this equation can be written as: where (Y i , D i , S i , X i , Z i ) are observed random variables and 1(.) is an indicator function. The first equation represents the total contamination of all contaminants. The second equation is the selection equation, where D i is added as a dummy variable indicating whether value i represents a measurement of toxic/non-toxic pollutant. A set of variables Z i includes additional parameter such as a well value i. Set of control variables Z i must include at least one variable which is not included in X i (Sartori 2003). All mathematical and statistical computations were performed using Microsoft Office Excel 2016, IBM SPSS Statistics 26 software and STATA 15.0 (StataCorp LP).

Results and discussion
Groundwater quality parameters Table 1 presents the results of measurements of groundwater quality from the wells surrounding the recipient pond. Kazakhstani and WHO standards for drinking water were used for assessing all parameters.
The concentrations of several parameters in wastewater, which are discharged into the recipient pond, are also presented in the table. Those characteristics came from the previous publication of the authors (Radelyuk et al. 2019).
As shown, all wells had exceeding concentrations of total petroleum hydrocarbons (see also Fig. 5). When the permissible concentration of TPH is 0.1 mg/ L, the concentrations of TPH varied between 0.08 and 1.20 mg/L with mean value 0.42 mg/L, which exceeded the norm 4 times. Although low concentrations of TPH in water might be considered harmless, researchers found that long-term exposure to TPH causes carcinogenic diseases (Pinedo et al. 2013;Wake 2005). Table 1 also shows that dangerous concentrations of phenols were identified in all nine wells. This pollutant had been evaluated as very toxic and was included in the list of priority pollutants by Environmental protection agency (EPA 2012). The number of disorders has been discovered by acute exposure of phenol: muscular convulsions, hypothermia, muscle weakness and tremor, collapse, coma, etc. (Nair et al. 2008). There is a limitation in the assessment of the presence of phenol in our case study. However, the limit of the concentration of simple phenol (phenol index) is 0.25 mg/L according Kazakhstani standard. The same value is established in the standard of the factory for the observed wells (Radelyuk et al. 2019). At the same time, protocols of GW quality measurements name this parameter ''volatile phenols''. This type of phenolic compound is considered to be limited 0.001 mg/L. Thus, there is unclear situation of what limit should be used. Measured TDS values exceeded the KZ and WHO maximum permissible levels of 1000 mg/L in most cases on average five times (Fig. 5). Further, the total hardness in the groundwater samples ranged between 2 and 390 mmol/L with mean exceeding the standard six times (Fig. 5). According the Todd classification, almost all samples might be categorized as very hard water. Hard water may cause cerebrovascular and cardiovascular diseases (Stambuk-Giljanovic and Stambuk 2005). The chloride ion presence were between 56 and 24,757 mg/L, with most samples elevated WHO's 250 mg/L recommended limit (exceeding 9 times on average) (Fig. 5). There are possible health-related concerns regarding Na ? content in the groundwater because the mean elevated concentrations in the wells were six times over the  non-described *WHO does not cover all chemical contaminants in the guidelines, but only those, which pose a risk in a high level (Gadgil 1998) **EPA, EU and WHO present a range of phenol-derivatives according their toxicity rate. Kazakhstani standard assumes ''phenols'' as phenolic compounds, which evaporate under high temperature (Angelino and Gennaro 1997) ***From WHO Guidelines for drinking water quality (1984) permissible KZ limits and WHO indirect recommendation (Fig. 5). Consumption of high amount of sodium has been correlated with cardiovascular disease, such hypertension and stroke (Lucas et al. 2011). Finally, individual exceedings of surfactants were identified. Such high level of surfactants is related to several potential problems. The presence of some surfactants in connection with other contaminants may decrease the biodegradation rate of contaminant or stops the process at all. In other cases, the presence of the surfactants enhances the biodegradation rate. The desirable result is not clear without knowing the role of the surfactant participating the biodegradation process in a given remediation plan (West and Harwell 1992). Moreover, special focus should be paid to Well 9 which had extremely high values. For example, TDS had a value 37 times above the limit, chloride 99 times higher than limit, sulphate exceeded the limit 38 times, total hardness with associated cations by 56 times as well as highly elevated concentrations of ammonia, nitrites, nitrates, potassium, sodium and surfactants (Table 1). This is the reason why Fig. 5 does not include Well 9 presenting the concentrations of some chemicals comparatively with WHO recommendations.
The water containing such levels of those substances would normally be rejected by consumers. Additional epidemiological research should be provided in municipalities nearby the area of pond to assess potential connections between the high concentrations of some parameters, such as TPH, phenols, Na ? , Cl -, SO 4 2-, TDS and TH and cardiovascular and oncological diseases in the region. Figure 6 shows temporal distribution of some chemicals. The pH values (Fig. 6a) normally were highest during the spring, while the value for W9 differs significantly and instead shows the lowest values during the same period. It could be explained by influence of recharge of snowmelt and geological characteristics of the area. The same situation can be applied for TPH. All wells show the highest concentrations of TPH during the spring (Fig. 6b). Moreover, the graphs mainly tend to raise their fluctuations and display an increasing trend. It potentially says that the pollution problem is growing in the area. Figure 6c represents the fluctuation of TDS in the groundwater. There are relatively flexible lines without significantly extremal changes.
Spatial distribution of the chemicals is presented in Fig. 7. pH values (Fig. 7a) are more than 7 for all wells, defining groundwater alkaline. According to Hem (1970) dissociation of carbonate and carbonate salts is a dominant process in nature, which leads to pH above 7. The maximal value of pH is found in well 6, and minimal value belongs to the wells 2 and 9. Piper diagram is widely used to show the dominant hydrogeochemical faces (Piper 1944). The Piper plot (Fig. 8) verifies the direct relationships between the hydrochemical regime of groundwater in the area and the pH value. Total petroleum hydrocarbons have a maximal value in the well 9 and minimal in the well 3 (Fig. 7b). There are plotted only TDS, instead of TH, Ca 2? , Mg 2? , Na ? , Cland SO 4 2-, on the figure, because they are parts of the TDS and distributed in the same manner (Fig. 7c). Thus, we can consider from the hydrogeological characteristics of this site (Fig. 4) and spatial distribution of pH and pollutants (Fig. 7) that groundwater flow has a slope toward western direction from the pond.  The correlation matrix (Table 3) was employed for all 117 measurements for determining the loads of the principal components (PCs) shown in Table 2. The first six PCs were selected for the following reasons as variables of dimensionality reduction: the six PCs together gave a cumulative contribution of 78.34%, which is typically regarded as being sufficiently high; the eigenvalues of these PCs are all greater than 1.0 and, according to the Kaiser criterion these PCs must be chosen (Table 2) (Kaiser 1958). The factors can be conditionally divided into two groups. First group accounts to 52.34% of the total variance and is represented by Factors 1 and 2. Usually, the parameters, belonging to those factors, characterize natural conditions of the groundwater. Factors 3-6 contribute to 26% of the total variance and can be categorized, as anthropogenically appeared factors. The detailed interpretation of each Factor is explained below.

PC 1
PC 1 explains 42.02% of the total variance (Table 2). It is characterized by high positive weight values TH, Ca 2? , Mg 2? , TDS, Na ? , K ? , Cl -, SO 4 2and surfactants. As Table 3 indicates, there is a strong positive correlation between TDS and Ca 2? , Mg 2? , Na ? , K ? , SO 4 2-, Cl -. These ions are the major contributors to the total dissolved solids. Additionally, these ions correlate with each other. These results show that the groundwater has suffered serious mineralization There also is a clear correlation between TH and subsequent ions: Ca 2? , Mg 2? , Cl -, SO 4 2- (Table 3). In addition, it can be seen that all these ions correlate with each other. This correlation points to the existence of non-carbonate, or constant hardness, (MeSO 4 , MeCl 2 , where Me-Ca, Mg), which is difficult to remove. It is clear from Table 3 that there is no correlation between carbonate ions and the hardness metals ions, which suggests a weak temporary hardness. This factor can be explained by the natural conditions of the site. In contrast, surfactants are synthetic compounds. Surfactants are produced for cleaning and washing operations (West and Harwell 1992). Their existence in groundwater is not natural.

PC 2
PC 2 explains 10.32% of the total variance (Table 2) with negative weight values of pH and CO 3 2-, and positive value of CO 2 . It is important to note a correlation between CO 2 , CO 3 2and pH (Table 3), which points to alkalinity reactions in the groundwater (Eq. 6). The relationship exists between these parameters and CO 2 , which potentially could be described a process of CO 2 creation or the presence of the CO 2 as an atmospheric gas in the unsaturated zone (Hem 1970). Moreover, the high concentration of chlorides in wastewater coupled with the natural salt water leads to changing pH in groundwater by decreasing pH. These processes are naturally based.

PC 3
Factor 3 is characterized by a positive value of nitrite ion (  Fig. 8 Piper diagram for identification of water type of the study area variance. NO 2 does not correlate with any chemicals. The presence of the parameter could be explained as a semi-product of the natural denitrification/deammonification processes in the groundwater environment according to Hisckock et al. (1991).

PC 4
TPH and ammonia ion represent PC 4 and account for 6.89% of the total contamination (Table 2). Both chemicals have no correlation according Table 3, which shows their independence on the other variables. This level of petroleum hydrocarbons in drinking water can lead to damage of the nervous system and carcinogen and narcotic effects associated caused by some hydrocarbons (Logeshwaran et al. 2018). In addition, even a few micrograms of TPH per litre deteriorate the odour and taste of the contaminated water. The high loading of NH 4 ? is associated with extremally high concentrations of ammonia in discharges (Radelyuk et al. 2019). Hence, the amount of ammonia is not degraded during the saturation processes and some traces still presence in the groundwater. This factor is certainly attributed to groundwater pollution from the petrochemical industry.

PC 5
PC 5 is characterized by positive value of phenols (Table 2), which accounts for 6.11% of the whole contamination. This parameter is characterized as a very toxic pollutant. Concentrations of the phenolic compounds probably exceed the permissible level (Table 1); the exposure is evaluated as a potential risk for public health. The loading of this parameter is directly related to the specification of petrochemical wastewater.  One more significant factor belongs to the influence of phosphate-ions and is rated by 5.34% of the total variance (Table 2). It should be pointed out that the enterprise does not provide monitoring of phosphate concentration in the discharges. Nevertheless, the refining process is associated with a vast number of washing processes, which leads to big consumption of different detergents, which contain phosphate substances. As the rocks and fertilizers are absent in the study area (Rao and Prasad 1997), we can conclude that the loading of the contaminant is an indicator of anthropogenic impact on the groundwater.

Cluster analysis
Based on the performed CA and results above, the study area was divided into three clusters. Figure 9 shows a dendrogram of all nine sampling sites into three statistically meaningful clusters yielded by cluster analysis. Cluster 1 combines observed wells W9 and W2. These wells are labelled as highly contaminated with the highest exceeding of many chemical parameters. Figure 4a shows their similarities in the distribution of pH, which is followed by host geology. The wells are situated on the southwest site from the pond and probably approve an assumption about direction of groundwater flow. Cluster 2 is formed by wells W7, W8, W1 and W3. These wells are located on the south and west sides of the pond and characterized by twofold characteristics: firstly, significant pollution rate, including the same concentrations of the TDS and TDS related chemicals and secondly, the equal temporal distribution of pH. It means that groundwater on that site is affected by pollutant transport from the pond in the same manner. Finally, Cluster 3 is represented by wells W6, W4 and W5. All wells are located north of the pond and are characterized by lower concentrations of the pollutants compared to other wells. We may consider that groundwater flow originates from east to west, and potential hazard exists for rural inhabitants towards to west and south-west direction from the pond.

The Heckman selection model
This study uses the Heckman selection model to estimate relationships between total contamination and other characteristics, especially, significance of toxicity rate. If we adapt Eqs. (4) and (5) for our case according Stata manual (STATA 2013), we can represent the equations respectively as: and we assumed that ''% of exceeding'' is estimated if where ui and mi have positive correlation q. Table 4 shows the selected variables used in this analysis and their descriptive statistics. The first dependent variable (D i ) represents toxicity of the chosen chemical. The value equals 1 if the pollutant is toxic and 0 if not. The second set of dependent variables (Y i ) includes percentage of exceeding. This characteristic mathematically represents rate of contamination. Mean percentage of selected (toxic or nontoxic) exceeding was calculated. For example, if the concentration of TPH measurement was 0.25 mg/L, but standard value is no more than 0.1 mg/L, then dependent variable equals 250%. This variable includes only exceeded values. Otherwise, if the value is normal, a cell in a matrix is empty. Numbers in parentheses are standard deviations of the average values. Set of control variables (X i ) includes chosen contaminants, their concentrations and locations. According requirements (Kazakhstan 2015), TH, TPH, and Na ? are considered as hazardous for public health and rated with value 1.0 for the variable D i . TDS, sulphates and chlorides are considered as nontoxic and were rated as value 0.0 for the variable D i . We encrypted TDS, Cl -, SO 4 2-, Na ? , TH and TPH in the table of variables as ''1'', ''2'', ''3'', ''4'', ''5'' and ''6'', respectively. The contaminants are not a subject for assessment in this analysis. Table 5 presents the estimation for this type of analysis. Rho has a positive value, which means that it is possible to estimate relationships between chosen variables and final contamination. All variables, excluding number of well (which represents location of the wells), are considered as significant. The concentration of pollutants has the greatest influence on total contamination. Positive value explains likelihood of potential hazard for people health. Obviously, the high concentrations of the pollutants lead to deterioration of health, especially during long-term exposure. In our case, 503 of 675 values exceed acceptable limits by 7-8 times averagely. The variable of toxicity rate is the second significant factor. This variable reflects to lower percentage of exceedings for toxic contaminants than for non-toxic, instead of higher number of exceeded values for toxic contaminants than non-toxic. Our hypothesize assumes that even if the concentration of the toxic contaminant exceeds the standard by just a few units, the toxic properties could be much more dangerous for human health, compared with the consumption of highly polluted water by non-toxic contaminants. The independent chemicals represent the third significant variable. Individual characteristics of chosen chemicals are explained in sub-section ''Groundwater quality parameters''. The location of the well is rated as not significant parameter. Nevertheless, the investigation of hydrogeological characteristics deserves attention in the future work and determines the spread of contamination.

Conclusions
This study investigated the current situation of groundwater safety for public health surrounding a contaminated site in Kazakhstan. The results show that PCAs have high loading of anthropogenic contamination to groundwater from the oil refinery industry coupled with natural geochemical processes. In addition, exceeding concentrations of hazardous substances, including TPH, phenols, TH, and TDS were identified. By means of cluster analysis we were able to combine the examined wells in three groups according to the concentrations of chemicals and their locations. Highly polluted groundwater was distributed especially in west and south-west direction from the pond. The results enable the prediction of the groundwater flow in the study area as well as the estimation of sites heavily affected by contamination. The usage of Heckman selection model, to the authors' knowledge, is the first attempt in the literature, applied to evaluation of environmental factors. According to obtained data from Heckman analysis, focus should be paid to the distribution of toxic contaminants.
For this purpose, further research considers: (1) Groundwater modelling for definite identification of groundwater flow and potentially affected rural areas; (2) Contamination transport modelling, as the industry continue polluting the environment, the assessment of present and future hazards is highly needed; (3) Development of a remediation plan, which has to be built on the qualitative studies (1) and (2).
This study might be used as a trigger to drive and engage all stakeholders into the transparent dialogue about potential consequences of non-sustainable wastewater management at oil refinery industry. The potential actions might include implementation of successful legislative standards, development of new efficient monitoring programs, stimulation the industry to innovative and water-saving treatment methods and a creation of a site contamination/remediation programs.
This research has several limitations. Firstly, the limited dataset covers only period from 2013 to 2019. Secondly, despite of the concentrations of TPH are identified, the lack of data on specific hydrocarbon type such as PAH and BTEX limited the analysis on the toxicity. Thirdly, the lack of access to hydrogeological data limited the accuracy of the ground water flow estimation. Authors of this paper recommend initiating a dialogue between industry, government, and academia for research-based decision-making in this area.