Mapping of total suspended solids using Landsat imagery and machine learning

Torres-Vera, M.-A.

doi:10.1007/s13762-023-04787-y

Original Paper
Open access
Published: 14 February 2023

Volume 20, pages 11877–11890, (2023)

International Journal of Environmental Science and Technology

M.-A. Torres-Vera ORCID: orcid.org/0000-0001-6059-0231¹

Abstract

The main objective of this work is to propose a new technique for water quality parameters monitoring by applying artificial intelligence methods to optimize remote sensing data processing. A multiple regression model was developed to create a total suspended solids (TSS) prediction model, using unsupervised machine learning. Currently, water bodies throughout the world are poorly supervised in terms of quality, so it is necessary to implement efficient mechanisms to obtain synoptic information for a good diagnosis in TSS evolution, because they are a key indicator of the biophysical state of lakes and an essential marker for continuous monitoring. Conventional methods used to monitor the physical parameters of water bodies, for example, in situ sampling, have proven impractical due to time, cost and space constraints, and remote sensing tools can help to achieve this purpose more efficiently. The proposed multiple regression model requires calibration and to that end, Lake Chapala data from the monitoring time series collected by the National Water Commission (CONAGUA) were used. Lake Chapala is the largest freshwater body in Mexico, and the human intervention that develops around the lake has caused drastic changes such as decrease in the size of the lake and increase in suspended matter and aquatic vegetation. These changes alter the balance of the system, endangering the health of the lake. This work presents a generalized semi-empirical model that uses Landsat image data and machine learning methods for estimating total suspended solids (TSS) in water bodies, with a good prediction precision (R = 0.81, RMSE = 32.52).

Introduction

Remote sensing applications are common techniques to map TSS concentrations and their temporal and spatial fluctuations. Nechad et al. (2010) demonstrated that using just one Landsat band can yield a sensitive approach for estimating TSS but only if the right band is selected; the first four Landsat bands are closely correlated with total suspended matter, according to several studies, but the strength of this link varies with wavelength and water depth (Cox et al. 1998; Dekker et al. 2002; Brezonik et al. 2005; Akbar et al. 2010).

Few remote sensing studies have performed long-term monitoring of biophysical variables in lakes, particularly TSS. The monitoring of coastal water quality dates back almost to the beginning of the satellite exploration era, and such studies have continued to be developed, mainly based on the correlation between in situ observations and satellite images of the same dates. Nevertheless, these studies focus on river mouths and coastal systems; important examples include estimates of TSS input to the ocean (Overeem et al. 2017), variability in sediment plume size (Brando et al. 2015), reservoir impacts on sediment concentration (Pereira et al. 2017), the impacts of land use change on sediment input (Telmer et al. 2006) and the variability of sediments in lagoons (Volpe et al. 2011). Lakes are an important primary continental source of freshwater for industrial, recreational, and irrigation purposes (Carvalho et al. 2013).

Due to limitations in time, money, and space, traditional on-site monitoring has been found to be ineffective (Philipson et al. 2016). By contrast, remote sensing records synoptic radiation from the water surface, from which total suspended solids (TSS), organic properties (TOC and TIC), and microbiological properties such as chlorophyll-a (Chl-a) can be estimated simultaneously (Matsushita et al. 2015; Tyler et al. 2016; Dörnhöfer et al. 2018). Numerous research projects have attempted to determine the concentration of TSS using mainly two satellite platforms and sensors: Landsat (Zhou et al. 2007; Kallio et al. 2008; Wu et al. 2008; Zhang et al. 2014; Vanhellemont and Ruddick 2014; Wu et al. 2015), and MODIS (Miller and McKee 2004; Chen et al. 2007, 2015; Raag et al. 2013; Hudson et al. 2014; Petus et al. 2014; Ayana et al. 2015).

Three approaches are generally used to establish the relationship between spectral properties detected by satellite and in situ measurements (Ma and Dai 2005). These are: (i) establishing a relationship through theoretical formulas, (ii) the empirical method in which curve fitting is used, and (iii) the semi-empirical method, which is the combination of theoretical and empirical techniques. Analytical models consider parameters related to water surface reflectance affected by TSS concentration, and regression methods are commonly used in empirical approaches. The regression method is one of the most widely used techniques to study the relationship between reflectance values of multispectral images and in situ measurements. Acquired satellite data are used to establish relationships with water quality parameters using multiple regression techniques. In empirical approaches, remote sensing data are correlated with TSS concentration using interpolation techniques applied to a set of in situ sampling samples. Several AI-based algorithms have been proposed for the empirical approach (Balaguer-Ballester et al. 2002).

Standard linear or non-linear regression between band-spectrum data and in situ water quality measurements is the remote sensing research method most frequently used to study inland waters. In these models, a single spectral band or a combination of a few spectral bands is used (Tyler et al. 2006; Kallio et al. 2008; Kratzer et al. 2008; Wang et al. 2008; Wu et al. 2008; Duan et al. 2009; Cui et al. 2013; Espinoza-Villar et al. 2013; Qiu 2013; Choi et al. 2014; Kaba et al. 2014; Shen et al. 2014; Chen et al. 2015; Shi et al. 2015; Membrillo-Abad et al. 2016). However, no technique employs all the spectral bands as input variables to correlate with the results of in situ measurements.

Due to their straightforward development, empirical approaches are the most widely employed to estimate TSS concentration; however, because these methods lack physical foundation, their applicability is restricted to the location and time in which they were produced (Chen et al. 2015). Empirical approaches are primarily constrained to the place where they were developed by their inability to generalize over wide spatial and temporal scales, because of the changes in water composition and the significant variability in the spectral signatures. As a result, the range and frame of the input data limit the reliability of empirical model predictions. These models, however, do not incorporate any inverse models of the optical qualities that are inherent to a particular body of water. Semi-empirical models use multiband ratio values based on the physical properties of interest, like land vegetation indices.

The results of previous studies indicate the need for new approaches to replace traditional methods. An inherent limitation of the traditional approaches is the difficulty of generalizing their results to large spatial and temporal scales, due to variations in atmospheric composition and the specific characteristics of the studied area. In previous investigations, the TSS determination algorithm has been generated using linear regression with a single band or a combination of several bands as predictors (Nezlin and DiGiacomo 2005; Nechad et al. 2010; Caballero et al. 2014). However, the linear regression method has some major drawbacks: (i) training can be expensive; (ii) minimizing training errors can lead to poor algorithm performance in generalization.

In this work, a novel method is presented to determine the concentration of various parameters (TSS, Chl-a, TOC and TIC) in lakes that can be used to evaluate water quality. The method is based on the spectral response related to those parameters and on the use of artificial intelligence. Its main contribution is that, instead of using a specific combination of bands (ratios, indices, etc.), the 7 bands of the Landsat images are used simultaneously and a multiple correlation is subsequently performed using unsupervised artificial intelligence, in order to make the method more robust. The site chosen to develop this method, as well as to test its precision and accuracy, was Lake Chapala, because of the availability of a data time series that represents more than a decade of in situ sampling, and because of its environmental, economic, and social importance. Lake Chapala is located in the central part of Mexico, and it is currently at imminent risk of collapse, mainly due to three factors: high levels of pollution caused by industrial discharges, overexploitation of its waters, and lack of wastewater treatment plants. The negative impacts are accentuated by climate change, the introduction of exotic species, the construction of local infrastructure, changes in land use on the shore and increased anthropogenic activities that result in eutrophication processes and pollution. This has caused its size to decrease and affects its water levels and productive capacity.

Study area

Lake Chapala is located in the western part of central Mexico (20° 06′ 36″–20° 18′ 00″ North, 102° 42′ 00″–103° 25′ 30″ West, 1520 m altitude) in the Lerma–Chapala basin (Fig. 1) and has an area of approximately 1,100 km². Its major axis has an east–west orientation with a maximum length of 77 km, and its minor axis has a length of 22 km (De Anda et al. 1998). The region's climate is classified as subtropical, according to the modified Köppen scale (García 1988), with strong winds of up to 8–10 m/s that generate strong mixing in the water column (Filonov et al. 1998). Its main tributary is the Lerma River, which is fed mainly by drainage in the Lerma–Chapala basin. The main flow losses are related to evaporation and to use for human consumption; only when the lake fills to its maximum capacity is there an outflow of water northwards to the Santiago River (Sandoval 1994; Aparicio 2001). Because the lake is shallow, fine clay resuspension produces high inorganic turbidity (Lind and Dávalos 2001). In addition, the clay particles act as a substrate for plankton growth, which promotes changes in the optical properties of the water column (Sandoval 1994; Aparicio 2001).

The cartography reveals two clearly differentiated periods in the lacustrine contours, separated by the construction of the embankment, or dike, in the area known as Ciénega de Chapala, at the north-eastern end of the old lake. Before this construction, the storage capacity of the lake was a maximum of 5,600 Mm³; the work, carried out between 1905 and 1910, reduced its volume to 4,500 Mm³.

Lake Chapala has tributaries in the Lerma–Chapala–Santiago hydrological system with an estimated area of 130,000 km² (Fig. 1), and the calculated average annual total volume of solids contributed by the tributaries to Lake Chapala is 69,506 tonnes per year, which for the 400 ha of Lake reservoir area means that the average sediment deposit is 40 cm per year (Zarate et al. 2001).

Materials and methods

Methodology

To develop the TSS prediction model, it is necessary to determine the relation between the remote sensing data and the in situ data. Since field measurements do not always coincide in space and time with satellite images, the pairs with the closest correspondence between dates and pixels were selected.

The methodology to correlate Landsat images with in situ measurements consists of four phases (Chang et al., 2014): “(1) radiometric correction; (2) extraction of spectral reflectance values in sample point windows; (3) statistical analysis of in situ data and (4) generation of the most efficient multiple linear regression model by statistical analysis”. In this work, the selected statistical tools were the following: EDA statistical analysis, multiple linear regression, MLR generation, and advanced machine learning application.

Remote sensing data

Satellite data acquisition: Landsat satellite images

The data used in this study included 32 cloud-free multispectral Landsat-5 TM and Landsat ETM+ scenes from May 2005 to November 2016 (Table 1) that cover the total area (path 28, row 46 and path 29, row 46).

Table 1 Dates of water sampling, acquisition of Landsat images and number of in situ observations

Pre-processing

Pre-processing operations comprise georeferencing and atmospheric correction, that is, image restoration and rectification; they are intended to correct atmospheric radiometric distortions and platform geometric distortions of the data. The atmospheric correction was calculated using Jensen’s (1996) dark body reflectance method. After pre-processing the Landsat data and pairing them with in situ measured TSS, a total of 22 matched data sets were available for development of the turbidity simulation model.
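The source does not detail the correction beyond naming Jensen's (1996) method. As a hedged illustration only, the sketch below assumes that it behaves like the common dark-object subtraction, in which the darkest pixel of each band is taken as the atmospheric contribution and subtracted from every pixel. The function name and the tiny 2 × 3 band are hypothetical.

```python
def dark_object_subtraction(band):
    """Subtract the darkest pixel value of one band from all of its pixels.

    `band` is a 2-D list of pixel values; the darkest pixel is assumed to
    represent the additive atmospheric signal (an assumption of this sketch).
    """
    dark_value = min(min(row) for row in band)
    return [[pixel - dark_value for pixel in row] for row in band]


# Hypothetical 2 x 3 band of raw pixel values.
band = [[12, 15, 30],
        [11, 40, 22]]
corrected = dark_object_subtraction(band)
```

Each band would be corrected separately, because the atmospheric contribution varies with wavelength.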

On-site measurements

In Lake Chapala, the Mexican government agency responsible for water resources management carries out continuous monitoring of water quality (IMTA, 2009). It collects field information on turbidity and visibility using a Secchi disc and water sampling. Chemical laboratory analyses are performed on the collected water samples to determine TSS concentration. The chemical data of the lake water are available from the database of the National Water Commission (CONAGUA). The data used here are those that coincide with the available cloud-free Landsat images from 2005 to 2015.

Satellite-in situ match-ups

The TSS-RS model requires determination of the relationship between satellite data and in situ experimental data. A match-up is obtained in consecutive steps. First, satellite data are extracted over the pixel that corresponds to the location of the field station. Second, the reflectance value of that pixel is obtained. The in situ measurements are then matched with the reflectance values from the processed Landsat images.
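The match-up steps can be sketched as follows. This is a minimal illustration, not the procedure actually used: the record layout, the function names, the `max_days` tolerance of 3 days and all numerical values are hypothetical assumptions.

```python
from datetime import date


def nearest_scene(sample_date, scene_dates, max_days=3):
    """Return the scene date closest to sample_date, or None if too far apart."""
    best = min(scene_dates, key=lambda d: abs((d - sample_date).days))
    return best if abs((best - sample_date).days) <= max_days else None


def match_up(samples, scenes, max_days=3):
    """Pair each in situ sample with the band values of its station pixel."""
    pairs = []
    for sample in samples:
        scene_date = nearest_scene(sample["date"], list(scenes), max_days)
        if scene_date is None:
            continue  # no image close enough in time: the sample is discarded
        bands = scenes[scene_date]  # one 2-D grid per spectral band
        reflectance = [band[sample["row"]][sample["col"]] for band in bands]
        pairs.append({"date": scene_date, "bands": reflectance,
                      "tss": sample["tss"]})
    return pairs


# Hypothetical scene with two bands on a 2 x 2 grid, and two field samples.
scenes = {date(2009, 5, 25): [[[0.10, 0.12], [0.11, 0.13]],
                              [[0.20, 0.22], [0.21, 0.23]]]}
samples = [{"date": date(2009, 5, 24), "row": 1, "col": 0, "tss": 35.0},
           {"date": date(2009, 7, 1), "row": 0, "col": 0, "tss": 50.0}]
pairs = match_up(samples, scenes)
```

Here only the first sample is retained, because the second was collected more than `max_days` away from the only available scene.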

Model construction

The objective of this step was to generate a general model to quantify the concentrations of suspended solids present in the water, using the information obtained through remote sensors regardless of the time. Thus, multiple regressions with all independent variables combined were considered, taking care to avoid dependency between covariables.

Considering that the sample size of this work is not large enough, all observations (Table 1) were used to compute the MLR model, due to the probabilistic assumptions of this set of models: the sample size for the multiple regression analysis requires at least 10 cases per independent variable in the analysis, in our case 90 samples. The determination of this sample size is based on the 95% confidence intervals associated with correlations at the degree of precision (Green, 1991). After adjustment, a cross-validation step could be performed. In this work, two processing steps were considered in the machine learning processing for the prediction of TSS, the first is the model development from a training subgroup, and the second is the test with a data set different from the one used in the first group. In this study, we used 80% of the collected data for machine learning training and 20% for testing.

The model to estimate the TSS of Lake Chapala is developed using the relation between the Landsat data and the measured TSS data for each date. The multiple linear regression (MLR) algorithm was selected as the development method for the TSS model, assuming that the dependent variable, the TSS ground measurement, is a linear function of the reflectance values of the Landsat image bands. The regression variance ratio is denoted R² and indicates the predictive ability of the model. Additionally, the magnitude of R² represents the correlation between the independent and dependent variables for each date, as presented in Table 2.

Table 2 Descriptive statistics, including sampling date, minimum (Min), maximum (Max), mean (Mean), standard deviation (Std Dev), R-squared of the linear regression analysis (R²), and the root mean square error (RMSE). “No single analysis” indicates that it was not possible to calculate the linear regression because of the small number of samples

Statistical methods

Statistical analysis of in situ data

We selected the value for each point and correlated all bands with the TSS ground data, treating each event as an independent sample. We first performed exploratory data analysis (EDA) and examined the data to suggest partial descriptions and hidden relationships, regardless of the statistical criteria used in confirmatory settings (Table 2).

Regression analysis can be used to estimate the concentration of TSS, and a model generated through a best-fit equation can describe the relationship between satellite data and the corresponding in situ data. The coefficient of determination R² is used to evaluate the precision of these models. The objective of regression analysis is to build a function of predictor variables to express the response variable.
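For reference, the two evaluation metrics used throughout (R² and RMSE) can be computed as in the sketch below. The observed and predicted values are hypothetical and serve only to show the arithmetic.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error, in the units of the observations."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical TSS values in mg/L.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
r2_value = r_squared(observed, predicted)   # 1 - 26/500 = 0.948
rmse_value = rmse(observed, predicted)      # sqrt(26/4), about 2.55 mg/L
```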

Exploratory data analysis (EDA)

Exploratory data analysis (EDA) is a term coined by John W. Tukey in his book on statistics (Tukey 1977); it is also known as descriptive statistics. The purpose of EDA is to inspect and explore data using summary statistics. EDA is the forerunner of any geostatistical analysis; it is performed to become familiar with the data and to detect regular patterns. Exploratory analysis provides the distribution and the experimental or empirical behaviour of the data regardless of their location (Kitanidis 1997). One of the most important purposes of EDA on geospatial data is to characterize the range of autocorrelation presented by the data, as well as the possible correlation between different variables. The application of geostatistical techniques to environmental variables requires that the data have a normal distribution (Webster and Oliver 2007); therefore, a previous evaluation of the data is necessary.

Regression analysis

The simple linear regression model has been used successfully by many investigators in a wide variety of disciplines to relate the dependent variable to a single predictor variable. However, in many situations, the relationship between the dependent variable and a single predictor variable is not strong, and MLR has been proposed as a better approach to this problem. The application of these techniques helps in the interpretation of complex data matrices; in this work, we use multiple linear regression (MLR) models. These models relate a dependent variable to several independent (explanatory) variables using a predefined equation (Rogerson 2001). Equation 1 shows the general multiple linear regression model:

$$Y = A + B_{1}X_{1} + B_{2}X_{2} + \cdots + B_{n}X_{n}$$

(1)

where Y is the dependent variable, Xi are the independent variables, and B1… Bn are the regression coefficients. If the coefficients and input variables are known, then the regression equation can be used to make predictions. However, the prediction made by the regression model (Eq. 1) frequently does not coincide with the observed values of Y, so it is necessary to calculate the error of the model, that is, the difference between the predicted values and the actual values (Eq. 2):

$$Y = A + B_{1}X_{1} + B_{2}X_{2} + \cdots + B_{n}X_{n} + \epsilon$$

(2)

where ε is the random error, the value that indicates the amount of dispersion in the estimation of the Y value. The most common method to estimate the regression coefficients is least squares, which is used in this work. In regression analysis, three criteria must be met for the fit of the function to be acceptable: (1) the mean of the random error should be zero and its variance should be constant; (2) the function fitted to the data should be significant (at a significance level of α = 0.05) so that the analysis of variance can be used; and (3) the value of the coefficient of determination (R²) should be as close to 1 as possible. Based on Granian et al. (2015), the criteria to evaluate the regression analysis are: “1. The variance and the mean of the random error should be a constant value and zero, respectively. 2. The coefficient of determination value which is called (R²) should be tested 3. Given the fact that adding independent variables to the model will increase the R² value, the adjusted determination coefficient which is called (R²adj) 4. In regression analyses, the p value of final coefficients for each specific model could be applied after choosing the best model. Accordingly, the p value of the regression model in the analysis of variance test should be acceptable (less than or equal to 0.05)”.
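As a minimal sketch of how the least-squares coefficients of Eq. 2 can be estimated, the code below builds and solves the normal equations with plain Gaussian elimination. The function names `solve` and `fit_mlr` and the small data set are illustrative assumptions, not the software or data used in this work; in practice a numerical library would normally be preferred.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(X, y):
    """Return [A, B1, ..., Bn] minimizing the sum of squared errors."""
    rows = [[1.0] + list(r) for r in X]  # leading 1 carries the intercept A
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data generated exactly from Y = 2 + 3*X1 - 1*X2 (no error term).
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [3.0, 7.0, 6.0, 11.0, 9.0]
coefficients = fit_mlr(X, y)
```

Because the example data contain no random error, the fit recovers the generating coefficients (2, 3 and −1) up to rounding.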

Best regression subset for the data set

When there are multiple predictor variables, many combinations can be used in a regression model, and the optimal combination can be determined through a series of trial-and-error attempts. Many works on the relation between TSS and remote sensing have considered the need to select a subset of the available Landsat spectral bands to reduce the dimensionality of the remote sensing data, and considerable attention has been paid in the literature to the choice of the criterion used to evaluate and determine the best subset. Automatic variable selection procedures choose which variables to include in a regression model when there are many predictor variables (here, the spectral bands of the Landsat image) and one response variable (TSS). The method used here to select the best regression model is the best regression subset (BRS).

In the best subset regression method, all possible models based on the specified independent variables are fitted. BRS then compares these models and shows the most appropriate ones containing one predictor, two predictors, and so on. The criterion for selecting the most suitable models is R², because it measures the degree to which the dependent variable can be predicted from the set of predictor variables; the best subset is the one with the R² nearest to 1.
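The exhaustive search described above can be sketched as follows: every combination of predictors is fitted by least squares, and the combination with the highest R² is kept for each subset size. The helper names and the three-predictor data set are hypothetical; the actual analysis used the seven Landsat bands.

```python
from itertools import combinations


def _solve(a, b):
    """Gaussian elimination with partial pivoting for a x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def _r2(X, y, cols):
    """R² of the least-squares fit that uses only the predictor columns `cols`."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = _solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def best_subsets(X, y):
    """Map each subset size to (best R², column indices of that subset)."""
    n = len(X[0])
    return {size: max((_r2(X, y, cols), cols)
                      for cols in combinations(range(n), size))
            for size in range(1, n + 1)}


# Hypothetical data: y = 1 + 2*x0 + 0.5*x2, so x1 carries no information.
X = [[1, 4, 2], [2, 1, 7], [3, 6, 1], [4, 2, 9], [5, 8, 3], [6, 3, 5]]
y = [4.0, 8.5, 7.5, 13.5, 12.5, 15.5]
best = best_subsets(X, y)
```

In this example the best two-predictor subset is columns 0 and 2, the pair that generated `y`. Because R² cannot decrease when a predictor is added, the largest subset always scores at least as well under this criterion.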

Machine learning

Machine learning (ML) is an evolution of artificial intelligence. ML is well suited to problems where our theoretical knowledge is still incomplete but numerous observations and additional data are available. Such systems can be massively multivariate, involving even thousands of variables. Through retrieval algorithms, ML has been shown to be useful in numerous applications in many parts of the Earth system (land, ocean, and atmosphere) and beyond. In remote sensing, ML is an effective empirical method for regression and classification.

When analysing remote sensing data, the four most common goals of machine learning are classification, clustering, regression, and dimensionality reduction. Regression methods are suitable when the objective is to estimate or predict the effect of a variable based on a set of covariates. A regression model is developed or trained based on a set of input variables with known answers. In this work, the regression results estimate or predict TSS concentrations from spectral bands extracted from Landsat images.

In supervised machine learning, an artificial intelligence system "learns" to determine the best fit for the predictor variable by analysing the supplied data (examples). The goal of supervised ML is to learn the rules for mapping sets of inputs and outputs; in this case, the input set consists of the Landsat spectral bands, and the output consists of the TSS. The construction of the model from a training subgroup was the first phase that was performed in the ML processing for the prediction of TSS. The second step involved testing the model using data that was distinct from the training subgroup. In this study, we randomly selected 80% of the obtained data for machine learning training and 20% for testing. By explicitly fitting a model to the data, machine learning seeks to identify the optimal relationship between the input (reflectance) and output (TSS). In order to find the ideal settings for the model, the model parameters are modified by minimizing the prediction error in the validation data set.
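A random 80/20 partition of the kind described above can be sketched as follows. The function name, the fixed seed and the use of integer placeholders instead of real match-up records are assumptions made for illustration; only the 80/20 proportion and the total of 315 observations come from the text.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Randomly split samples into (train, test) lists without overlap."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(len(samples) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test


# Placeholders standing in for the 315 match-up observations.
samples = list(range(315))
train, test = train_test_split(samples)  # 252 for training, 63 for testing
```

The model would then be fitted on `train` only, with R² and RMSE computed on `test` to estimate how well it generalizes.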

Results and discussion

Match-up between satellite images and in situ measurements and EDA of in situ TSS data

As a result of the match-up of all images and field observations, a total of 315 observations were obtained (Table 1). We used these data to predict TSS in Lake Chapala. Table 2 shows the descriptive statistics of the measured TSS concentration, which ranged from 1 mg/L, on 05/25/2009, to 257 mg/L, on 04/06/2015. The average values ranged from 10.34 mg/L, on 06/09/2011, to 70.46 mg/L, on 02/27/2013.

MLR results of data sets for each date

Twenty-two MLR models were constructed to identify the relation for each single date and to explain the highest proportion of variance in the TSS measurements, as inferred from R² and adjusted R². The performance of MLR with respect to minimum, maximum, mean, standard deviation, R², and RMSE for the 22 models is displayed in Table 2.

The objective of this calculation was to assess the feasibility of generating a model of the TSS present in the water from the Landsat satellite images for each specific date, and then of using all the data to obtain a general relation for the eleven years. To evaluate the general multiple linear regression model, it is necessary to analyse the results for single dates. Table 2 presents R² and RMSE for all the single-date analyses. The value of R² nearest to 1 corresponds to the 05/24/2006 data, with 0.96, which is a very good correlation, and an RMSE of 3.01. Conversely, the lowest value of R² is 0.44, on 05/25/2009, with an RMSE of 9.61, which is a very poor correlation. Another important aspect to consider is that each of these results yields a correlation that is valid only for its specific date and not for a general model. In the last line of Table 2, all data sets were input into the model, assuming that all the in situ measurements are independent; in this general model, R² is 0.52 and RMSE is 25.52. This exercise is useful for comparison because it shows the results of a broad model constructed by simply including all data in the MLR, without applying machine learning in the calculation process. It is important to consider that, in this specific case using all data, the results are not valid because the data are not independent and do not fulfil the requirements of the method.

Best regression subset

Table 3 shows the results of the best subset regression for all data sets (see Table 1 for data sets). The regression of the best subset method calculates all possible models and shows the best candidates (the spectral bands of the Landsat images) based on R² (Table 3). Each row represents a different case, showing the best two options for each number of included independent variables. The X denotes the independent variables used in each model.

Table 3 Results of the application of regression of the best subsets for all data sets. X indicates the variables included in each model. R² represents the maximum value for each subset

After calculation of all the possible band combinations, Table 4 presents the results of the best subset regression analysis for all sampling dates, including the group "all data". R² min is the minimum value of R² in the analysis and corresponds to 1 band, and R² max is the maximum R², which represents the best subset of the regression group; the best results always use the 7 Landsat spectral bands. This analysis shows that the best TSS MLR requires the use of all available satellite imagery bands. It is important to note that the “all data” group does not yield the best R².

Table 4 Minimum (1 spectral band) and maximum (7 spectral bands) R² of the regression of the best subsets

Machine learning

Using the best-fitted model for all cloud-free available Landsat scenes for the study area, we obtained the best-estimated correlation between Landsat images and TSS for the period February 2005 to August 2015.

As mentioned in the Methods section, machine learning was used to create a model correlating the in situ TSS measurements with the Landsat satellite data, and its predictive performance was evaluated using the coefficient of determination (R²) and the root mean square error (RMSE) of the predictions of the training model.

The model presented in Eq. (3) was generated by multiple linear regression, applying machine learning to correlate the in situ TSS sampling data with the 7 bands of the Landsat satellite:

$$\text{TSS} = 18.246 - 1.637\,B_{1} - 0.826\,B_{2} + 2.541\,B_{3} + 2.322\,B_{4} - 1.018\,B_{5} + 0.373\,B_{6} - 0.824\,B_{7}$$

(3)

where B1, B2, B3, B4, B5, B6, B7 are the Landsat image bands.
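Applying Eq. (3) to one pixel amounts to a weighted sum of its seven band values, as the sketch below illustrates. The coefficients are those of Eq. (3); the function name and the input values are hypothetical, and the band values are assumed to be expressed in the same units and pre-processing state as those used to fit the model.

```python
# Intercept and coefficients for bands B1..B7, taken from Eq. (3).
INTERCEPT = 18.246
COEFFICIENTS = (-1.637, -0.826, 2.541, 2.322, -1.018, 0.373, -0.824)


def predict_tss(bands):
    """Estimate TSS (mg/L) for one pixel from its seven Landsat band values."""
    if len(bands) != len(COEFFICIENTS):
        raise ValueError("expected exactly 7 band values (B1..B7)")
    return INTERCEPT + sum(c * b for c, b in zip(COEFFICIENTS, bands))


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
tss_zero = predict_tss([0.0] * 7)   # reduces to the intercept
tss_unit = predict_tss([1.0] * 7)   # intercept plus the sum of coefficients
```

Mapping TSS over a whole scene would repeat this calculation for every water pixel of the image.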

This model has an R² value of 0.818, an RMSE of 22.89 and p values less than 0.05. These three values indicate that the model is valid and reliable. The correlation equation was better than the previously obtained correlations using all data: R² increased from 0.56 to 0.818 and RMSE decreased from 31.52 to 22.89, a decrease of approximately 27%.

This model shows that the Landsat spectral bands B3 [red], B4 [near infrared] and B6 [thermal infrared] have a positive correlation; conversely, B1 [blue], B2 [green], B5 [medium infrared] and B7 [medium infrared] have a negative correlation. Similarly, the obtained coefficients indicate that B1, B3, B4 and B5 had the strongest linear relationship with TSS, whereas B6 had the weakest. This is consistent with previous work on the spectral characteristics of TSS.

Equation (3) represents the relationship between the TSS data measured in situ and the reflectance information contained in the spectral bands of the Landsat images (Fig. 2). Figure 2 shows the scatter plot of the predicted values versus the actual values (in mg/L). If all the calculated values were correct, the points would follow a 45-degree line, which indicates that the predicted and actual values are the same on both axes. Most values are concentrated in a region close to that line. Therefore, the model assumptions seemed to be satisfied.

Figure 3 shows the residual patterns of the computed model (Eq. 3). The model residuals appear to be normally distributed with a mean of 0, and the residuals versus fitted values show no marked trend (homoscedasticity).

The plot in Fig. 3 complements these values: it shows the difference between the values predicted by the model and the real values in the data set used to generate the model. A red 45-degree line is included to help visualize the results. The best-fitted values are those closest to the red line, and the worst-fitted values are those farthest from it. Some outliers are evident at the lowest values and, most importantly, at the highest TSS values. This is important because it indicates that the model predicts values below 100 mg/L well, but that its performance decreases at higher values.
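One way to quantify the behaviour described above is to compute the error separately below and above 100 mg/L. The following sketch does this with RMSE; the function name `rmse_by_range` and the sample values are hypothetical and do not reproduce the study's data.

```python
import math


def rmse_by_range(actual, predicted, threshold=100.0):
    """Return (rmse_low, rmse_high) for samples below / at or above threshold."""
    squared = {"low": [], "high": []}
    for a, p in zip(actual, predicted):
        key = "low" if a < threshold else "high"
        squared[key].append((a - p) ** 2)

    def _rmse(values):
        # An empty group has no defined error, so report NaN.
        return math.sqrt(sum(values) / len(values)) if values else float("nan")

    return _rmse(squared["low"]), _rmse(squared["high"])


# Hypothetical TSS values in mg/L with larger errors at high concentrations.
actual = [20.0, 50.0, 80.0, 150.0, 200.0]
predicted = [22.0, 47.0, 84.0, 120.0, 240.0]
low, high = rmse_by_range(actual, predicted)
print(f"RMSE below 100 mg/L: {low:.2f}; at or above 100 mg/L: {high:.2f}")
```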

MLR with machine learning was implemented using the 315 observations to determine the predictive relationship between all spectral bands of the Landsat images (independent variables) and the in situ TSS measurements. This approach gave the best regression relationship (p < 0.01; R² = 0.818; Table 5).
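The fitting procedure is not detailed in this section, so the following is only a sketch of one common way to combine multiple linear regression with a machine-learning style evaluation: fit by ordinary least squares on a training subset and score on held-out samples. The 80/20 split, the synthetic data and the coefficients used to generate them are all assumptions; only the sample count (315) and the number of bands (7) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bands = 315, 7  # sizes from the text; the data below are synthetic

# Synthetic band values and TSS generated from hypothetical coefficients.
X = rng.uniform(0.0, 50.0, size=(n_samples, n_bands))
true_coef = np.array([-1.6, -0.8, 2.5, 2.3, -1.0, 0.4, -0.8])
y = 18.0 + X @ true_coef + rng.normal(0.0, 5.0, size=n_samples)

# Random 80/20 train/test split (the ratio is an assumption).
order = rng.permutation(n_samples)
n_train = n_samples * 4 // 5
train, test = order[:n_train], order[n_train:]


def design(bands):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(bands)), bands])


# Ordinary least squares fit on the training subset.
beta, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)

# Evaluate on the held-out subset.
pred = design(X[test]) @ beta
rmse = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
ss_res = float(np.sum((y[test] - pred) ** 2))
ss_tot = float(np.sum((y[test] - y[test].mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
print(f"held-out R2 = {r2:.3f}, RMSE = {rmse:.2f}")
```

Scoring on held-out samples gives an error estimate that is not inflated by fitting and evaluating on the same observations.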

Table 5 Multiple linear regression with machine learning EDA summary


Application to the Landsat image of Lake Chapala

The equation obtained by the MLR method was applied to a multispectral Landsat image of Lake Chapala. This image was selected because a previous investigation determined the TSS content at numerous sampling sites on a date matching that of the satellite image (Membrillo-Abad et al. 2016).

The results of applying the TSS MLR-ML model (Eq. 3) to the January 2013 Landsat TM image are shown in Table 6. The calculated TSS interval is compared with the observations: the lowest measured TSS is 10.00 mg/L and the highest is 215 mg/L, whereas the lowest calculated TSS is 10.79 mg/L and the highest is 252.83 mg/L, with a mean and standard deviation of 59.29 ± 70.46 and 73.71 ± 7.54, respectively (Table 6). Average, maximum, and minimum values are overestimated in the MLR-ML calculation by approximately 15–20%. More importantly, the calculated and measured patterns of TSS variation are similar, allowing the sources and their distribution in the lake to be identified.
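The descriptive statistics listed in Table 6 (Min, Max, Mean, Std Dev) and a relative over- or underestimation can be computed as sketched below. The sample values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics


def describe(values):
    """Return the four descriptive statistics named in Table 6."""
    return {
        "Min": min(values),
        "Max": max(values),
        "Mean": statistics.mean(values),
        "Std Dev": statistics.stdev(values),  # sample standard deviation
    }


def percent_difference(calculated, observed):
    """Relative difference of a calculated value from an observed one, in %."""
    return 100.0 * (calculated - observed) / observed


# Hypothetical observed and calculated TSS values in mg/L.
observed = [10.0, 40.0, 70.0, 120.0]
calculated = [12.0, 46.0, 80.0, 138.0]

obs_stats, calc_stats = describe(observed), describe(calculated)
for name in ("Min", "Max", "Mean"):
    diff = percent_difference(calc_stats[name], obs_stats[name])
    print(f"{name}: observed {obs_stats[name]:.2f}, "
          f"calculated {calc_stats[name]:.2f} ({diff:+.1f}%)")
```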

Table 6 Descriptive statistics, including observed suspended sediments and calculated suspended sediments, minimum (Min), maximum (Max), mean (Mean), standard deviation (Std Dev)


Figure 4 shows the TSS concentration map obtained by applying the MLR-ML model developed in this research (Eq. 3) to the 7 bands of the January 2013 Landsat image; the statistical results are shown in Table 6. The map shows the calculated distribution of TSS concentration. The highest concentration clearly occurs in the eastern part of Lake Chapala, where the tributary rivers discharge most of the sediment (see Fig. 1). Correspondingly, the lowest values are in the western area, because the major sediment input in the east does not disturb the central and western parts of the lake. The calculated TSS values correlate well with the actual TSS data.
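Producing a map like this amounts to evaluating Eq. (3) at every pixel. A minimal NumPy sketch follows, assuming the image is already loaded as an array of shape (7, rows, cols) ordered B1 to B7. The function name `tss_map`, the optional water mask and the tiny test image are hypothetical; reading the Landsat file and any preprocessing are outside this sketch.

```python
import numpy as np

# Intercept and band coefficients of Eq. (3), ordered B1..B7.
INTERCEPT = 18.246
COEFFS = np.array([-1.637, -0.826, 2.541, 2.322, -1.018, 0.373, -0.824])


def tss_map(bands, water_mask=None):
    """Apply Eq. (3) per pixel.

    bands: array of shape (7, rows, cols), ordered B1..B7.
    water_mask: optional boolean array (rows, cols); non-water pixels become NaN.
    """
    bands = np.asarray(bands, dtype=float)
    if bands.ndim != 3 or bands.shape[0] != 7:
        raise ValueError("expected an array of shape (7, rows, cols)")
    # Weighted sum over the band axis gives one TSS value per pixel.
    tss = INTERCEPT + np.tensordot(COEFFS, bands, axes=1)
    if water_mask is not None:
        tss = np.where(water_mask, tss, np.nan)
    return tss


# Hypothetical 2 x 3 image in which every band value equals 1.
image = np.ones((7, 2, 3))
mask = np.array([[True, False, True], [True, True, False]])
print(tss_map(image, mask))
```

Masking land pixels with NaN keeps them out of any statistics computed later, for example with `np.nanmean`.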

Discussion

The results have shown that multiple linear regression analysis can be improved by adding machine learning to the modelling. Using various statistical methods in data analysis is a strategy that should be more widely applied, because such methods provide suitable correlations for any amount of data and make any outliers evident in the results.

This study correlated in situ TSS measurements from Lake Chapala with paired spectral images from the Landsat satellite using the MLR model. Twenty-two data sets and their associated image pixels, 315 TSS data in total, were obtained from Lake Chapala and used for model development. The MLR model was selected to build the TSS machine learning model, and R² was used to assess the accuracy of the model. MLR was performed to determine the relationships of the different variables for each in situ data set, as well as for the complete data (315 data), for which we obtained an R² of 0.818, one of the highest values obtained for this type of application.

The comparison between the MLR and the MLR-ML results demonstrates that machine learning can improve the accuracy of predictions. The results of this study support the hypothesis that implementing ML algorithms to predict water quality parameters improves the overall predictive accuracy of spectral relationships and interactions.

The methodology presented in this work is the first attempt to correlate in situ measurements with the surface reflectance provided by Landsat images without using a “specific combination of bands” method; instead, it uses a more robust approach that applies multiple linear correlation to both data sets. The results show that the developed MLR algorithm successfully correlated in situ TSS values with the reflectance of the 7 spectral bands of Landsat TM images, and that the application of machine learning techniques helped make this model more robust. This technique provided a linear equation, which was used to generate a TSS map of Lake Chapala from the reflectance measured by Landsat TM. The obtained map correlates well with previously published reports; therefore, Landsat TM images can be used to produce high-precision TSS maps, and such a model can be used to monitor the variation in TSS values of Lake Chapala without continuous in situ measurements.

Conclusion

This research has accomplished two aims: first, to improve a methodology that combines remote sensing, multiple linear correlation, and machine learning to correlate Landsat satellite images with in situ TSS data; and second, to obtain the multiple linear regression between the predictor variables (Landsat spectral bands) and the dependent variable (TSS) in Lake Chapala. The multiple R values show the reliability of the relationship between the Landsat data and the TSS field data.

The results of this research contribute to the search for a global empirical model, which has become very important for developing a continuous monitoring system for inland water bodies, as they provide a reliable relationship between TSS concentrations and the reflectance pattern of the water surface detected by satellite images (Gordon et al. 1980, 1983; Clark 1981).

Two important improvements in the proposed methodology can be considered. The first is to reduce the errors involved in the correlation algorithms between remote sensing and TSS values, using all available spectral bands instead of band combinations or trial and error to obtain the best match. The second improvement is the use of multiple regression with machine learning as an application of the multiple correlation model.

In addition to proposing this new methodology, this work applied it to a case study, with the main objective of showing its pertinence and how the processing methodology is used to find relationships between Landsat images and TSS samples. As a result, an algorithm applicable to Lake Chapala was developed and its suitability demonstrated. From this study, it is clear that the information obtained from the developed MLR model can be used to map water bodies in a wide range of conditions and sizes, using freely available data, which will make it more feasible to monitor the environmental health of water bodies.

Unsupervised artificial intelligence has been used in this work to find a general multiple correlation model. This is the most important contribution of this work because the use of a robust statistical method, artificial intelligence, implies that a multiple regression can be obtained that applies to the entire data series and this general model can be employed to carry out extrapolations and to monitor TSS in lakes, in our case Lake Chapala, without the requirement of continuing in situ sampling data. The results of this study indicate that the established model has a great potential for reliably mapping different water quality parameters.

Data availability

Datasets related to this article are available at: https://www.gob.mx/conagua/articulos/calidad-del-agua.

References

Akbar T, Hassan Q, Achari GA (2010) Framework based on remote sensing to predict water quality from different water sources. In: Proceedings of the ISPRS Commission I Midterm Symposium, Image Data Acquisition–Sensors and Platforms, Calgary, AB, Canada, 15–18
Aparicio J (2001) Hydrology of the Lerma-Chapala Basin. In: van Afferden M, Hansen AM (eds) The Lerma-Chapala Basin. Evaluation and management. Kluwer Academic/Plenum Publishers, USA, pp 3–30
Ayana EK, Worqlul AW, Steenhuis TS (2015) Evaluation of stream water quality data generated from MODIS images in modeling total suspended solid emissions to a freshwater lake. Sci Total Environ 523:170–177
Balaguer-Ballester E, Camps-Valls G, Carrasco-Rodríguez JL, Soria-Olivas E, Del Valle-Tascon S (2002) Effective prediction 1 day in advance of hourly surface ozone concentrations in eastern Spain using linear models and neural networks. Ecol Model 156:27–41
Brando VE, Braga F, Zaggia L, Giardino C, Bresciani M, Matta E, Bellafiore D, Ferrarin C, Maicu F, Benetazzo A et al (2015) High-resolution satellite observations of sea surface temperature and turbidity of river plume interactions during significant flooding. Ocean Sci 11:909–920
Brezonik P, Menken KD, Bauer M (2005) Landsat-based remote sensing of lake water quality characteristics, including chlorophyll and colored dissolved organic matter (CDOM). Lake Reserv Manag 21:373–382
Caballero I, Morris E, Prieto L, Navarro G (2014) The influence of the Guadalquivir River on the spatio-temporal variability of suspended solids and chlorophyll in the Eastern Gulf of Cádiz. Mediterr Mar Sci 15(4):721–738
Carvalho L, Poikane S, Solheim LA, Phillips G, Borics G, Catalan J, Hoyos DC, Drakare S, Dudley B, Järvinen M et al (2013) Strength and uncertainty of phytoplankton metrics to assess the impacts of eutrophication on lakes. Hydrobiologia 704:127–140. https://doi.org/10.1007/s10750-012-1344-1
Chen Z, Hu C, Muller-Karger F (2007) Monitoring turbidity in Tampa Bay using MODIS/Aqua 250-m images. Remote Sens Environ 109:207–220
Chen S, Han L, Chen X, Li D, Sun L, Li Y (2015) Estimation of wide-range total suspended solids concentrations from 250-m MODIS images: an improved method. ISPRS J Photogramm Remote Sens 99:58–69
Choi JK, Park YJ, Lee BR, Eom J, Moon JE, Ryu JH (2014) Geostationary ocean color imager (GOCI) application to map temporal dynamics of coastal water turbidity. Remote Sens Environ 146:24–35
Clark DK (1981) Phytoplankton pigment algorithms for Nimbus-7 CZCS. In: Gower JFR (ed) Oceanography from Space. Plenum Press, New York, pp 227–237
Cox RM, Forsythe RD, Vaughan GE, Olmsted LL (1998) Assessing water quality in Catawba river reservoirs using Landsat thematic mapper satellite data. Lake Reserv Manag 14:405–416
Cui L, Qiu Y, Fei T, Liu Y, Wu G (2013) Using remotely detected suspended sediment concentration variation to improve Poyang Lake management, China. Lake Reserv Manag 29:47–60
De Anda J, Quiñones SE, French RH, Guzmán M (1998) Hydrological balance of Lake Chapala (Mexico). J Am Water Resour Assoc 34(6):1319–1331. https://doi.org/10.1111/j.1752-1688.1998.tb05434.x
Dekker AG, Vos R, Peters S (2002) Analytical algorithms for estimating lake water SST for retrospective analysis of TM and SPOT sensor data. Int J Remote Sens 23:15–35
Dörnhöfer K, Klinger P, Heege T, Oppelt NN (2018) In situ and multisensor satellite monitoring of phytoplankton development in a eutrophic-mesotrophic lake. Sci Total Environ 612:1200–1214. https://doi.org/10.1016/j.scitotenv.2017.08.219
Duan H, Ma R, Zhang Y, Zhang B (2009) Remote sensing assessment of water clarity of regional inland lakes in Northeast China. Limnology 10:135–141
Espinoza-Villar RJMM, Le Texier M, Guyot JL, Fraizy P, Meneses PR, Oliveira ED (2013) Study of sediment transport in the Madeira River, Brazil, using MODIS remote sensing images. J S Am Earth Sci 44:45–54
Filonov AE, Tereshchenko IE, Monzón CO (1998) Oscillations of the hydrometeorological characteristics in the region of Lake Chapala by intervals of days to decades. Geofís Int 37(4):293–307
García E (1988) Modifications to the Köppen climatic classification system (to adapt it to the conditions of the Mexican Republic). Talleres de Offset Larios, México
Gordon HR, Clark DK, Mueller JL, Hovis WA (1980) Phytoplankton pigments from the coastal Nimbus-7 color scanner: comparisons with surface measurements. Science 210:63–66
Gordon HR, Clark DK, Brown JW, Brown OB, Evans RH, Broenkow WW (1983) Phytoplankton pigment concentrations in the Mid-Atlantic Bay: comparison of ship determinations and CZCS estimates. Appl Opt 22:20–36
Hudson B, Overeem I, McGrath D, Syvitski JPM, Mikkelsen A, Hasholt B (2014) MODIS observed an increase in the length and spatial extent of sediment plumes in the Greenland fjords. Cryosphere 8:1161–1176
Jensen JR (1996) Introduction to digital image processing: a remote sensing perspective, 2nd edn. Prentice Hall, Upper Saddle River
Kaba E, Philpot W, Steenhuis T (2014) Evaluation of the suitability of MODIS-Terra images to reproduce historical sediment concentrations in water bodies: Lake Tana, Ethiopia. Int J Appl Earth Obs Geoinform 26:286–297
Kallio K, Attila J, Härmä P, Koponen S, Pulliainen J, Hyytiäinen UM, Pyhälahti T (2008) Landsat ETM+ images in estimating the water quality of seasonal lakes in the basins of the boreal rivers. Environ Manag 42:511–522
Kitanidis PK (1997) Introduction to geostatistics: applications in hydrogeology. Cambridge University Press
Kratzer S, Brockmann C, Moore G (2008) Using full resolution MERIS data to monitor coastal waters: a case study from Himmerfjärden, a fjord-like bay in the northwestern Baltic Sea. Remote Sens Environ 112:2284–2300
Lind O, Dávalos-Lind L (2001) Introduction to the Limnology of Lake Chapala, Jalisco, Mexico. In: Hansen AM, van Afferden M (eds) The Lerma-Chapala Basin. Evaluation and management. Kluwer Academic/Plenum Publishers, USA, pp 139–149
Ma R, Dai J (2005) Investigation of chlorophyll-a and total suspended matter concentrations using Landsat ETM and field spectral measurement in Taihu Lake, China. Int J Remote Sens 26(13):2779–2795. https://doi.org/10.1080/01431160512331326648
Matsushita B, Yang W, Yu G, Oyama Y, Yoshimura K, Fukushima T (2015) A hybrid algorithm for estimating the chlorophyll-a concentration across different trophic states in Asian inland waters. ISPRS J Photogramm Remote Sens 102:28–37. https://doi.org/10.1016/j.isprsjprs.2014.12.022
Membrillo-Abad AS, Torres-Vera MA, Alcocer-Durand J, Prol-Ledesma RM, Oseguera-Pérez LA, Ruiz-Armenta JR (2016) Estimation of the trophic state index from remote sensing data from Lake Chapala, Mexico. Rev Mex Cienc Geol 33(2):183–191
Mexican Institute of Water Technology (IMTA) (2009) General strategy for environmental rescue and sustainability of the Lerma-Chapala Basin. IMTA, Mexico
Miller RL, McKee BA (2004) Using MODIS terra 250 m imagery to map concentrations of total suspended matter in coastal waters. Remote Sens Environ 93:259–266
Nechad B, Ruddick K, Park Y (2010) Calibration and validation of a generic multisensor algorithm for mapping total suspended matter in turbid waters. Remote Sens Environ 114:854–866
Nezlin NP, DiGiacomo PM (2005) Satellite observations of the ocean color of stormwater runoff columns along the San Pedro shelf (Southern California) during 1997–2003. Cont Shelf Res 25(14):1692–1711
Overeem I, Hudson BD, Syvitski JPM, Mikkelsen AB, Hasholt B, van den Broeke MR, Noël BPY, Morlighem M (2017) Substantial export of suspended sediments to global oceans by glacial erosion in Greenland. Nat Geosci 10:859
Pereira LSFF, Andes LC, Cox AL, Ghulam A (2017) Measuring suspended sediment concentration and turbidity in the Middle Mississippi and Lower Missouri rivers using Landsat data. JAWRA J Am Water Resour Assoc 63103:1–11
Petus C, Marieu V, Novoa S, Chust G, Bruneau N, Froidefond JM (2014) Monitoring the spatio-temporal variability of the cloudy plume of the Adour River (Bay of Biscay, France) with MODIS images of 250 m. Cont Shelf Res 74:35–49
Philipson P, Kratzer S, Mustapha SB, Strömbeck N, Stelzer K (2016) Satellite monitoring of water quality in Lake Vänern, Sweden. Int J Remote Sens 37:3938–3960. https://doi.org/10.1080/01431161.2016.1204480
Qiu Z (2013) A simple optical model to estimate suspended particulate matter in the Yellow River estuary. Opt Express 21:27891–27904
Raag L, Uiboupin R, Sipelgas L (2013) Analysis of historical data from MERIS and MODIS to evaluate the impact of dredging on the monthly mean surface TSM concentration. Proc SPIE
Rogerson P (2001) Statistical methods for geography. Sage Publications, London
Sandoval FP (1994) Past and Future of Lake Chapala, General Secretariat Editorial Unit. Government of the State of Jalisco, Mexico
Shen F, Zhou Y, Peng X, Chen Y (2014) Satellite multisensor mapping of suspended particulate matter in turbid estuaries and coastal oceans, China. Int J Remote Sens 35:4173–4192
Shi K, Zhang Y, Zhu G, Liu X, Zhou Y, Xu H, Qin B, Liu G, Li Y (2015) Long-term remote monitoring of total suspended matter concentration in Lake Taihu using MODIS-Aqua data of 250 m. Remote Sens Environ 164:43–56
Telmer K, Costa M, Angélica RS, Araujo ES, Maurice Y (2006) The source and destination of sediments and mercury in the Tapajos River, Para, Brazilian Amazon: terrestrial and spatial evidence. J Environ Manag 81:101–113
Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading, MA
Tyler AN, Svab E, Preston T, Présing M, Kovács WA (2006) Remote sensing of shallow lake water quality: a mixing modeling approach to quantify phytoplankton in water characterized by high suspended sediments. Int J Remote Sens 27:1521–1537
Tyler AN, Hunter PD, Spyrakos E, Groom S, Constantinescu AM, Kitchen J (2016) Developments in Earth observation for the assessment and monitoring of inland, transitional, coastal and marine platform waters. Sci Total Environ 572:1307–1321. https://doi.org/10.1016/j.scitotenv.2016.01.020
Vanhellemont Q, Ruddick K (2014) Cloudy trails associated with offshore wind turbines observed with Landsat 8. Remote Sens Environ 145:105–115
Volpe V, Silvestri S, Marani M (2011) Remote sensing recovery of suspended sediment concentration in shallow waters. Remote Sens Environ 115:44–54
Wang F, Zhou B, Xu J, Song L, Wang X (2008) Application of the neural network and MODIS 250 m images to estimate the concentration of suspended sediments in Hangzhou Bay, China. Environ Geol 56:1093–1101
Webster R, Oliver MA (2007) Geostatistics for Environmental Scientists. Wiley, UK
Wu G, De Leeuw J, Skidmore AK, Prins HHT, Liu Y (2008) Comparison of MODIS and Landsat TM5 images to map the tempo-spatial dynamics of the depths of the secchi disk in the Poyang Lake national nature reserve, China. Int J Remote Sens 29:2183–2198
Wu G, Cui L, Liu L, Chen F, Fei T, Liu Y (2015) Statistical model development and estimation of concentrations of particulate matter in suspension with Landsat 8 OLI images of Dongting Lake, China. Int J Remote Sens 36:343–360
Zarate-del Valle PF, Michaud F, Parrón C, Solana-Espinoza G, Alcántara I, Ramírez-Sánchez HU, Fernex F (2001) Geology, sediments and soils. In: Hansen AM, Van Afferden M (eds) The Lerma-Chapala Basin. Evaluation and management. Kluwer Academic/Plenum Publishers, USA, pp 31–57
Zhang M, Dong Q, Cui T, Xue C, Zhang S (2014) Monitoring and evaluation of suspended sediments for the Yellow River estuary from Landsat TM and ETM+ images. Remote Sens Environ 146:136–147
Zhou F, Liu Y, Guo H (2007) Application of multivariate statistical methods to water quality evaluation. Environ Monit Assess 132:1–13


Funding

The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Author information

Authors and Affiliations

Faculty of Engineering, National Autonomous University of Mexico, Ciudad Universitaria, 04510, Mexico City, Mexico
M.-A. Torres-Vera


Corresponding author

Correspondence to M.-A. Torres-Vera.

Ethics declarations

Conflict of interest

The author declares that he has no conflict of interest.

Additional information

Editorial responsibility: Samareh Mirkia.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Torres-Vera, MA. Mapping of total suspended solids using Landsat imagery and machine learning. Int. J. Environ. Sci. Technol. 20, 11877–11890 (2023). https://doi.org/10.1007/s13762-023-04787-y


Received: 27 September 2021
Revised: 08 December 2022
Accepted: 17 January 2023
Published: 14 February 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s13762-023-04787-y
