Introduction

Land cover (LC) can be significantly altered by factors such as excessive agricultural development, rapid population growth, and the overexploitation of natural resources, leading to landscape degradation (Beuchle et al. 2015). Mountain environments worldwide, including the Hindu Kush, Karakoram, and Himalayan (HKH) range, are highly susceptible to various anthropogenic and natural hazards, such as climate change, tourism, urbanization, population growth, and economic development (us Saqib et al. 2019; Saini & Singh 2024). These mountain regions are prone to natural disasters like landslides, flash floods, earthquakes, and glacier lake outburst floods (GLOFs) (Bacha et al. 2018; Jamil et al. 2019; Xu et al. 2009). Understanding the LC patterns in such areas can help identify locations that are at greater risk of such natural disasters and can aid in developing strategies to mitigate them (Saini & Singh 2024).

Apart from these natural disasters, the rapid population growth, excessive agricultural expansion, and uncontrolled urbanization in mountainous areas have led to the overexploitation of natural resources, landscape degradation, and land deterioration. Understanding the LC features and implementing better management strategies for natural resources are crucial for sustainable development in the environmental, social, and economic sectors (Dang and Kawasaki 2017). Any adverse impact on the fragile mountainous environment can pose severe challenges for the large human population residing in these areas. Different studies have emphasised the importance of LC analysis for effective management of natural resources in fragile mountainous environments (Satti et al. 2023). To examine such changes in LC with high accuracy, remote sensing (RS) images classified with machine learning (ML) algorithms are considered standard tools and are widely used around the world (Gargiulo et al. 2020; Jia et al. 2023).

A large number of ML-based LC classification algorithms have been explored over the past decade to produce accurate, up-to-date, and long-term LC maps (Zhang & Zhang 2020; Wang et al. 2023). For instance, artificial neural network (ANN) (Yuan et al. 2009), random forest (RF) (Gislason et al. 2006; Wang et al. 2023), classification and regression tree (CART) (Shao & Lunetta 2012), and support vector machine (SVM) (He et al. 2005) have demonstrated superior performance in mapping different LC types compared to traditional classifiers (Belgiu and Drăgu 2016). The RF classifier is particularly popular in the RS community (Xiong et al. 2017) due to its high accuracy, achieved by constructing multiple decision trees (DTs). For example, the RF algorithm was used in the Western Himalayas to classify vegetation types with an overall accuracy (OA) of 80%, considering topographic and climate variables for improved accuracy (Singh et al. 2023). Similarly, a study conducted by Zurqani (2024) for forest canopy cover using RF achieved an OA ranging between 83.31 and 94.35%. Moreover, Mansaray et al. (2019) deployed both SVM and RF classifiers for mapping paddy rice in China using Landsat-8 and Sentinel-2 images for 2015 and 2016, and the RF classifier demonstrated higher accuracy (95%) compared with the SVM classifier (90.8%). Delalay et al. (2019) utilized CART, maximum entropy, and RF classifiers within the Google Earth Engine (GEE) environment for LC classification using Sentinel-2 data. The results showed that the RF technique had the highest OA (95%), followed by maximum entropy (93%) and CART (61%) in the mountainous region of Nepal. A recent study (Mahmoodzada et al. 2024) utilized the SVM and the multilayer perceptron (MLP) to map snow cover area in the Pamir region of the Hindu Kush with kappa coefficients of 0.75 and 0.83, respectively. Shetty et al. (2021) evaluated the impacts of training sampling design on LC classification results within the GEE and concluded that RF outperformed both CART and SVM.

The selection of an appropriate ML classifier for LC mapping is a challenging task due to the large number of available algorithms, their varying computational performance, and the conflicting information about their OA. Additionally, the combination of spectral bands used as input can significantly affect the classification accuracy (Shetty et al. 2021; Xiong et al. 2017). Various researchers have explored the use of different spectral band combinations from RS data, such as Sentinel-2 (Silveira et al. 2023), to improve the LC classification accuracy (Gumma et al. 2020; Stromann et al. 2020). However, there appears to be a lack of published research evaluating the performance of diverse ML algorithms applied to Sentinel-2 imagery for LC mapping in the HKH region of Pakistan. The majority of existing studies in this geographic context have focused on classifying a limited set of LC types (Khan et al. 2020a, b; Qamer et al. 2016; Satti et al. 2023, 2024). These investigations have typically employed a single image and conventional classification algorithms for mapping within small sub-regions, often resulting in varying and even contradictory outcomes. For instance, in the current study area, Khan et al. (2019) and Ali et al. (2019a, b) performed LC mapping for Gilgit city and Gilgit district in Pakistan, respectively, using Landsat data and the maximum likelihood classifier (MLC) to identify five generic LC classes. This makes it challenging to compare the accuracy of the generated LC maps and identify the most reliable approach for this complex mountainous environment (Delalay et al. 2019). At the time of writing, there have been no published studies focusing on detailed LC mapping in the HKH region of Pakistan utilizing high-resolution RS imagery, despite the significant advancements in the field.

This gap in the literature underscores the need for a comprehensive evaluation of the performance of a wider range of ML algorithms, including their computational efficiency and classification accuracy, when applied to high-resolution Sentinel-2 data for LC mapping in the HKH region of Pakistan. Such an assessment would provide valuable insights to support the selection of the most appropriate LC classification approach for this ecologically significant yet geographically challenging area. When mapping LC over a large extent (such as the current study area), researchers must consider the key challenges of processing ‘big earth data’ and the availability of images (Satti et al. 2024). Previous researchers in the study area were limited in their ability to run ML classification algorithms due to constraints in computing power and storage. However, the free, cloud-based GEE computing platform has enabled scientists to use satellite data for large-scale LC mapping more efficiently and effectively (Gorelick et al. 2017; Zurqani 2024).

The objective of this scientific study is to compare and evaluate the performance of various machine learning algorithms available within the Google Earth Engine platform for land cover mapping in the HKH region of Pakistan. This assessment will be conducted without any hyperparameter tuning, using the full range of spectral band combinations from Sentinel-2 imagery with a temporal aggregation method. Moreover, the final product generated from this study will be made freely available for users, facilitating broader access and utilization of the LC mapping results to support land management, spatial planning, and disaster risk management to achieve sustainable development in the region.

Material and Methods

Study Area

Gilgit-Baltistan (GB), located in the north of Pakistan, is characterized by a remote mountainous environment surrounded by some of the world’s highest mountain ranges, i.e., the Hindu Kush, Karakoram, and Himalaya. Administratively, GB is divided into ten districts, namely, Astore (5179 km²), Diamer (6901 km²), Ghanche (8525 km²), Ghizer (12043 km²), Gilgit (4009 km²), Hunza (11343 km²), Nagar (2993 km²), Kharmang (2802 km²), Shigar (8810 km²), and Skardu (7200 km²) (Amin et al. 2021) (Fig. 1).

Fig. 1

Overview of the study area Gilgit-Baltistan (GB) overlaid over an elevation map derived from 30 m Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM). The inset map indicates the location of study area in Pakistan

GB can be broadly classified into five distinct ecological zones: dry alpine zones and permanent snowfields, alpine meadows and alpine scrub, sub-alpine scrub, dry temperate coniferous forest, and dry temperate evergreen oak scrub. Moreover, GB has an exceptionally complex mountain system, with approximately 90% of its area covered by rugged mountain ranges and glaciers, while the remaining area consists of arable land (Hussain & Bangash 2017).

The climate of Gilgit-Baltistan is influenced by both the monsoon season, which contributes up to 80% of the region’s summer precipitation, and westerly cyclones, which account for approximately 66% of the high-elevation snowfall. However, the steep topography of the Karakoram range diminishes the influence of these wind systems as one moves towards the northern parts of the region (Bolch et al. 2012; Rankl et al. 2014). Generally, the weather conditions are severe with cold winters and extremely hot summers. The region receives precipitation of approximately 200–2000 mm per year, varying in different elevation zones, whereas temperature ranges between 10 °C (in winters) and 40 °C (in summer) depending on the valley’s elevation range (Gilani et al. 2020; Nawaz et al. 2019).

Training Sample Selection

Nine LC classes (Table 1) were defined based on a detailed literature review of the study area (Ali et al. 2019a, b; Gumma et al. 2020; Khan et al. 2020a, b; Qamer et al. 2016; Rahim et al. 2018). Training and validation sample points for each LC class were selected through simple random sampling using high-resolution Google Earth imagery and the authors’ personal experience of the study area. During the preparation and verification of the sample points, the principles of ‘consistency’ and ‘reliability’ were carefully maintained (Hill et al. 2008). This involved minimizing the inclusion of mixed pixels by avoiding sample collection from the edges of LC class boundaries and fragmented landscapes (Hu & Hu 2019; Phan et al. 2020).

Table 1 General information about various land cover classes with labels identified in the study area

Auxiliary data available in the GEE, such as the Advanced Land Observing Satellite (ALOS) Digital Surface Model (DSM) (Tadono et al. 2014), Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM) (Farr et al. 2007), and Defence Meteorological Satellite Program (DMSP) Operational Line-Scan System (OLS) night time imagery, were used to support the selection of training samples and improve the overall classification accuracy. Additionally, DEM-derived products, such as slope, aspect, and elevation, were utilized as supplementary data for visual interpretation during the training sample selection process.

In total, 2759 sample points were randomly divided, with 70% used for training the classification models and 30% reserved for validation and accuracy assessment. The distribution of sample points across the nine LC classes was as follows: water (341), forest (304), grasses (306), wetland (364), agriculture (389), barren land (467), built-up (197), glacier (176), and snow (215). The same training and validation datasets were used to evaluate the performance of each classification model.
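The 70/30 random division described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow used in GEE: the function name `split_samples`, the fixed seed, and the use of integer IDs as stand-ins for the labelled sample points are all hypothetical.

```python
import random

def split_samples(samples, train_fraction=0.7, seed=42):
    """Randomly split samples into training and validation lists."""
    shuffled = list(samples)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# 2759 hypothetical sample IDs standing in for the labelled points
train, valid = split_samples(range(2759))
print(len(train), len(valid))  # 1931 828
```

Because the split is a plain random shuffle, the share of each LC class in the two subsets is only approximately 70/30; a stratified split would be needed to enforce the ratio per class.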

Data Acquisition and Processing

The European Space Agency (ESA) and the European Commission (EC) launched the Sentinel-2 (S-2) satellites in 2015 and 2017 under the Copernicus program. The Sentinel-2A and 2B satellites have a revisit interval of 10 days individually and a combined revisit time of 5 days. The improved spatial and spectral resolution of the S-2 imagery has opened up new possibilities for environmental studies and monitoring (ESA 2015; Gargiulo et al. 2020). The twin S-2 satellites provide global coverage with 13 spectral bands, ranging from the visible to the short-wave infrared (SWIR) wavelengths, offering different spatial and spectral resolutions (Table 2). In the GEE catalogue, the Sentinel-2 Multispectral Instrument (MSI) data is available as a Level-2A Surface Reflectance (SR) product, processed by running the sen2cor algorithm (COPERNICUS 2017). The Sentinel-2 Level-2A SR data within the GEE can be accessed using the code snippet ee.ImageCollection("COPERNICUS/S2_SR"), which provides 12 UINT16 spectral bands representing surface reflectance scaled by a factor of 10,000.

Table 2 Details of spatial and spectral resolution of Sentinel-2 satellite

Figure 2 provides an overview of the image processing steps employed to map the LC of the study area. The methodology implemented in this study involved temporal aggregation of all available images within the study area to generate a composite image free from cloud cover. To mitigate the influence of monsoon rains and fresh snow on the classification outcome, a temporal frame ranging from 1st May to 30th September 2019 (152 days) and cloud coverage threshold of less than 20% was selected. This resulted in an image collection of 192 scenes, distributed across ten Sentinel-2 tiles. Subsequently, a 10-m spatial resolution cloud free image composite was generated by calculating the median pixel value from the entire image collection.

Fig. 2

Flowchart of land cover mapping using Google Earth Engine. The developed methodology included data acquisition and processing, training sample selection, and accuracy assessment for the GEE built-in machine learning classifiers

The median function was utilized as the temporal aggregation method to minimize the impact of missing or gapped cells in the Sentinel-2 imagery. This approach is commonly employed in the literature to reduce noise, particularly along scene borders (Carrasco et al. 2019; Rudiyanto et al. 2019). For each of the 10 spectral bands considered, any pixel values identified as blanks were replaced by the median value of that band across all images in the acquisition period. This ensures that a pixel identified as blank remains as such only if all images taken during the study period have a blank value for that location. However, in this study, no blank or void pixels were identified after the temporal aggregation process was applied.
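The per-pixel median aggregation with blank handling can be illustrated with a small pure-Python sketch. It is not the GEE implementation: the function name `median_composite`, the use of `None` to mark blank pixels, and the tiny 2 × 2 example scenes are assumptions made for illustration only.

```python
from statistics import median

def median_composite(images):
    """Per-pixel median over a time series of single-band images.

    images: list of 2-D lists of equal shape; None marks a blank (masked) pixel.
    A composite pixel stays None only if it is blank in every image.
    """
    rows, cols = len(images[0]), len(images[0][0])
    composite = []
    for r in range(rows):
        row = []
        for c in range(cols):
            valid = [img[r][c] for img in images if img[r][c] is not None]
            row.append(median(valid) if valid else None)
        composite.append(row)
    return composite

# Three hypothetical 2x2 acquisitions of one band (reflectance x 10,000)
scenes = [
    [[1200, None], [3000, None]],
    [[1000, 2500], [9000, None]],   # 9000 mimics a bright cloud pixel
    [[1100, 2700], [3200, None]],
]
print(median_composite(scenes))  # [[1100, 2600.0], [3200, None]]
```

The example shows both properties mentioned in the text: the outlier value 9000 does not affect the composite, and the lower-right pixel remains blank because no acquisition has a valid value there.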

For Sentinel-2 image classification, ten scenarios with different band combinations were used (Table 3) based on comprehensive literature review (Adepoju & Adelabu 2020; Alifu et al. 2020; Li et al. 2020; Xiong et al. 2017). These studies have suggested specific band combinations that have shown promising results in LC mapping applications. By evaluating these band combinations against ML classifiers in the GEE platform, we aimed to identify the best performing classifier without engaging in any hyperparameter tuning.

Table 3 Rationale and detail of experimental scenarios used in the study based on S-2 SR band combinations

Classification

The ML classifiers included in the current study were classification and regression tree (CART), maximum entropy, minimum distance, support vector machine (SVM), and random forest (RF). All of these classification models were implemented without any hyperparameter tuning, using the default parameter values to ensure a consistent comparison across all models (Table S1), as suggested by Maxwell et al. (2018).

CART Classifier

The CART algorithm proposed by Breiman et al. (1984) is a decision tree construction method that works on the principle of dichotomous recursive segmentation. The CART algorithm utilizes the Gini index as the criterion for identifying the ideal test variables and segmentation thresholds to create a binary decision tree (DT) for classification. The algorithm operates by recursively splitting the training data at each decision node, an approach known as greedy splitting, to increase the homogeneity of the data in the resulting nodes. The Gini index is defined in Eqs. (1–3):

$$\text{Gini Index} = 1 - \sum_{j} P^{2}(j|h)$$
(1)
$$P(j|h) = \frac{n_{j}(h)}{n(h)}$$
(2)
$$\sum_{j} P(j|h) = 1$$
(3)

where P(j|h) is the relative frequency of category j, i.e., the probability that a sample randomly selected from the training set with test variable value h belongs to category j; nj(h) is the number of training samples in category j for which the test variable takes the value h; n(h) is the total number of training samples for which the test variable takes the value h; and j denotes the category index.
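As a worked illustration of the Gini index, the short Python sketch below computes it from the class labels that fall into a node. The function name `gini_index` and the two example nodes are hypothetical and are not taken from the study data.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a node: 1 - sum_j P(j|h)^2, with P(j|h) = n_j(h) / n(h)."""
    n = len(labels)
    counts = Counter(labels)  # n_j(h) for every category j present in the node
    return 1.0 - sum((n_j / n) ** 2 for n_j in counts.values())

pure_node = ["water"] * 10                    # one category only
mixed_node = ["water"] * 5 + ["snow"] * 5     # two categories in equal shares
print(gini_index(pure_node))   # 0.0
print(gini_index(mixed_node))  # 0.5
```

A value of 0 indicates a perfectly homogeneous node, so a split that lowers the index of the resulting nodes increases their homogeneity.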

To be precise, the CART algorithm takes the training dataset and partitions it into smaller subsets recursively. This partitioning process continues until the smaller cells are grouped based on the same class label, with the maximum accuracy of prediction validated by the pruning value (Hayes et al. 2015; Mondal et al. 2019). The CART algorithm does not require parameters and has the advantages of fast operating speed and easy manipulation (Shao & Lunetta 2012); different studies have successfully deployed CART for classification of satellite imagery with promising accuracy (Hu et al. 2018; Johansen et al. 2015).

Maximum Entropy Classifier

Maximum entropy (MaxEnt) or gmoMaxEnt classifier in GEE is based on the maximum entropy principle to select the data with maximum entropy from all training sets (Mcdonald et al. 2009). It works best in a condition where the prior distribution and conditional dependency are unknown making it difficult to perform prediction with any assumption. Therefore, the gmoMaxEnt classifier utilizes a machine learning approach to perform spatial predictions using incomplete or limited training data (Moreno et al. 2011).

Minimum Distance Classifier

The minimum distance classifier uses spectral characteristics of the training samples which have been selected as representatives of the different feature classes. The Euclidean distance between the selected pixel values and the mean values of each class is calculated. Later, the candidate or selected pixel is assigned to the class with which it has the shortest Euclidean distance (Hu & Hu 2019).
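The assignment rule described above can be sketched as follows. This is a minimal illustration, assuming two-band feature vectors and made-up reflectance values; the helper names `class_means` and `classify_min_distance` are hypothetical and do not correspond to the GEE implementation.

```python
import math

def class_means(samples):
    """samples: dict mapping class name -> list of feature vectors."""
    means = {}
    for name, vectors in samples.items():
        n = len(vectors)
        n_bands = len(vectors[0])
        means[name] = [sum(v[k] for v in vectors) / n for k in range(n_bands)]
    return means

def classify_min_distance(pixel, means):
    """Return the class whose mean vector is closest in Euclidean distance."""
    return min(means, key=lambda name: math.dist(pixel, means[name]))

# Hypothetical training samples with two spectral features per pixel
training = {
    "water": [[0.02, 0.01], [0.04, 0.03]],
    "forest": [[0.05, 0.40], [0.07, 0.44]],
}
means = class_means(training)
print(classify_min_distance([0.03, 0.05], means))  # water
print(classify_min_distance([0.06, 0.35], means))  # forest
```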

SVM Classifier

SVM, described by Cortes and Vapnik (1995), is commonly used in a range of RS applications (Rudiyanto et al. 2019; Stromann et al. 2020). SVM is a supervised machine learning technique that aims to find an optimal hyperplane that separates different classes at their decision boundary. During classification, SVM classifiers use an iterative process to allocate candidate pixels to classes by maximizing class separability from the training set, and label each pixel according to its nearest class in feature space (Boser et al. 1992; Tsai et al. 2018). The selection of the support vectors mainly depends on the choice of the cost parameter C, the kernel function, and gamma. The most used kernel functions include linear, polynomial, and radial basis function (RBF). A detailed description of the SVM classifier can be found in Melgani and Bruzzone (2004). The mathematical equations of the linear, polynomial, radial basis, and sigmoid kernel functions are listed below as Eqs. (4–7).

$$k(x_{i}, x_{j}) = x_{i}^{T} x_{j}$$
(4)
$$k(x_{i}, x_{j}) = (\gamma\, x_{i}^{T} x_{j} + r)^{d}$$
(5)
$$k(x_{i}, x_{j}) = e^{-\gamma \| x_{i} - x_{j} \|^{2}}$$
(6)
$$k(x_{i}, x_{j}) = \tanh(\gamma\, x_{i}^{T} x_{j} + r)$$
(7)

where k is the kernel function and \(x_{i}\) and \(x_{j}\) are input data points (feature vectors). In the polynomial kernel, d is the degree of the polynomial, whereas \(r\) in the polynomial and sigmoid functions is a bias term. \(\gamma\) is the gamma term, present in all kernel functions except the linear one, which describes the range of influence of the training samples. If the value of gamma is too small, the model is too constrained to capture the complexity of the data, and vice versa.
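The four kernel functions can be written out directly to make the roles of gamma, r, and d concrete. The sketch below assumes the standard form in which each kernel takes two input feature vectors, here named `x_i` and `x_j`; the function names and default parameter values are illustrative assumptions, not the defaults of the GEE SVM classifier.

```python
import math

def dot(a, b):
    """Inner product of two feature vectors."""
    return sum(p * q for p, q in zip(a, b))

def linear_kernel(x_i, x_j):
    return dot(x_i, x_j)

def polynomial_kernel(x_i, x_j, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(x_i, x_j) + r) ** d

def rbf_kernel(x_i, x_j, gamma=1.0):
    sq_dist = sum((p - q) ** 2 for p, q in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x_i, x_j, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x_i, x_j) + r)

a, b = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(a, b))           # 11.0
print(polynomial_kernel(a, b, d=2))  # 121.0
print(rbf_kernel(a, a))              # 1.0
print(rbf_kernel(a, b, gamma=0.1))   # similarity decays with distance
```

In the RBF kernel, a larger gamma makes the similarity fall off faster with distance, so each training sample influences a smaller neighbourhood.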

RF Classifier

Random forest (RF) is a nonparametric supervised ML algorithm (Lee et al. 2018). The RF classifier is a nonlinear, relatively fast classifier that acts robustly to noisy training data. The RF algorithm was developed as an extension of the CART decision tree method and generates multiple classification trees to improve the overall prediction performance of the model. It operates by using a number of decision trees (DTs), where each tree is created from an independently constructed random sample of the training data to assign classification labels to each class (McCord et al. 2017). The RF algorithm applies a bagging technique, randomly selecting a subset of features from the input observations for each decision tree to be grown (Belgiu and Drăgu 2016).

As described in Ahmed et al. (2019), for each tree ‘d’ of the total ‘D’ trees, a random sample is drawn from the training dataset and the random forest tree Td is grown from the randomly selected points ‘i’. This process is repeated recursively until the minimum node size is reached. Combining the outputs of all trees gives Eq. (8):

$$\{T_{d}(i)\}_{d=1}^{D}$$
(8)

For the new prediction at any point i, the regression will be Eq. (9):

$$F_{rf}^{D}(i)=\frac{1}{D}\sum\nolimits_{d=1}^{D}T_{d}(i)$$
(9)
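The aggregation step of Eq. (9) can be sketched in Python as below. The sketch does not grow real trees: each "tree" is a hypothetical stand-in function, and the majority-vote variant is included on the assumption that class labels, rather than numeric outputs, are combined when RF is used for classification.

```python
from collections import Counter

def average_prediction(trees, i):
    """Eq. (9)-style aggregation: mean of the D tree outputs at point i."""
    return sum(tree(i) for tree in trees) / len(trees)

def majority_vote(trees, i):
    """Classification counterpart: the label predicted by most trees."""
    votes = Counter(tree(i) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical "trees" returning numeric outputs
regression_trees = [lambda i: i - 1.0, lambda i: i, lambda i: i + 1.0]

# Three hypothetical "trees" applying different thresholds to one feature
label_trees = [
    lambda i: "snow" if i > 0.5 else "water",
    lambda i: "snow" if i > 0.7 else "water",
    lambda i: "snow" if i > 0.9 else "water",
]

print(average_prediction(regression_trees, 10.0))  # 10.0
print(majority_vote(label_trees, 0.8))             # snow
```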

Methods for Accuracy Assessment

The accuracy of the LC classification was measured using various derivatives of the confusion matrix, such as overall accuracy (OA), kappa coefficient, producer’s accuracy (PA), and user’s accuracy (UA) for each class. The OA represents the percentage of pixels that were correctly labelled by the classifier, while the kappa coefficient is a measure of the overall agreement between the classification results and the reference data, accounting for chance agreement. The OA estimation and the kappa coefficient were used to compare the classification accuracy of each machine learning algorithm across the different band combination scenarios, ultimately supporting the selection of the best-performing classifier. Additionally, the Pearson correlation (PC) coefficient was calculated between the OA and kappa values. The PC coefficient measures the statistical relationship or association between these two continuous variables, providing information about the magnitude and direction of the correlation, which can range from − 1 to + 1 (Benesty et al. 2009).
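The confusion-matrix metrics named above can be computed as in the following sketch. It assumes that rows hold the reference classes and columns the predicted classes, and that every class occurs at least once in both; the function names, the 2 × 2 example matrix, and the numeric lists passed to `pearson` are hypothetical and are not results from this study.

```python
import math

def accuracy_metrics(cm):
    """cm[i][j] = validation samples of reference class i labelled as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]                            # reference totals
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted totals
    diag = [cm[i][i] for i in range(k)]
    oa = sum(diag) / total                                         # overall accuracy
    pe = sum(row_tot[i] * col_tot[i] for i in range(k)) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    pa = [diag[i] / row_tot[i] for i in range(k)]  # producer's accuracy per class
    ua = [diag[i] / col_tot[i] for i in range(k)]  # user's accuracy per class
    return oa, kappa, pa, ua

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

cm = [[45, 5], [10, 40]]  # hypothetical two-class confusion matrix
oa, kappa, pa, ua = accuracy_metrics(cm)
print(round(oa, 2), round(kappa, 2))  # 0.85 0.7
print(pa)
print(ua)
print(pearson([1, 2, 3, 4], [2, 4, 5, 9]))  # positive, close to +1
```

In this example the OA is 0.85, while the agreement expected by chance is 0.5, which is why the kappa coefficient (0.7) is lower than the OA.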

To further evaluate the performance of the LC classification models, we employed standard metrics including precision, recall, and F1-score (Saini & Singh 2024). These metrics were computed for each LC class, providing a comprehensive evaluation of the classifier’s ability to accurately identify and differentiate between distinct LC types. Precision quantifies the proportion of samples assigned to a specific class that truly belong to it, while recall represents the proportion of reference samples from a given class that were correctly identified. The F1-score, the harmonic mean of precision and recall, offers a balanced measure of the classifier’s overall performance for each class, providing a robust indicator of the model’s effectiveness in classifying different LC types (Rapinel et al. 2023).
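The three per-class metrics can be illustrated with a short Python sketch. The function name `per_class_scores` and the six example labels are hypothetical; the convention of returning 0.0 when a denominator is zero is an assumption made to keep the example well defined.

```python
def per_class_scores(reference, predicted, label):
    """Precision, recall, and F1-score of one class from paired label lists."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == label and p == label for r, p in pairs)  # true positives
    fp = sum(r != label and p == label for r, p in pairs)  # false positives
    fn = sum(r == label and p != label for r, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical validation labels: two forest samples are mislabelled as water
reference = ["forest", "forest", "forest", "forest", "water", "water"]
predicted = ["forest", "forest", "water", "water", "water", "water"]

print(per_class_scores(reference, predicted, "forest"))  # precision 1.0, recall 0.5
print(per_class_scores(reference, predicted, "water"))   # precision 0.5, recall 1.0
```

Computed this way, per-class precision corresponds to the user's accuracy and per-class recall to the producer's accuracy of the confusion matrix.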

Results

Comparison of Classification Accuracy of GEE Classifiers

The results show that the various ML classifiers achieved differing levels of OA and kappa index under the different band combination scenarios (Table 4 and Fig. 3). The 10-band set (S1, as detailed in Table 3) of Sentinel-2 data, covering the visible to SWIR band regions, resulted in OAs ranging between 0.59 and 0.79 and kappa coefficients between 0.54 and 0.76 across the ML algorithms. In the S2 scenario (Table 3), which utilized a different band configuration, the CART classifier achieved the highest OA of 0.73 and kappa of 0.69 (Table 4). In contrast, the S9 scenario, which used only the combination of Sentinel-2 bands B2-B4, did not perform well, with no classification method exhibiting satisfactory accuracy. The comparative analysis of the ML algorithms and band combinations provides valuable insights for the final selection of the optimal approach for the LC mapping task.

Table 4 Overall accuracy and Kappa statistics for each classification method under investigation for each scenario. The best assessed statistical values are indicated in bold font and poor results are indicated in italic and bold (column wise)
Fig. 3

Five selected thematic land cover maps of Gilgit-Baltistan in the HKH region (with highest OA) produced in the GEE for the year 2019 using S-2 imagery, where (a) RF, (b) gmoMaxEnt, (c) CART, (d) SVM, and (e) minDistance

From the results presented in Table 4, it is evident that the RF classifier generally had the highest OA and kappa values across all the evaluated scenarios. The SVM classifier had the second-highest OA and kappa values in some scenarios, but its performance was generally lower than that of RF. In contrast, the minDistance classifier exhibited the lowest OA and kappa values across all scenarios and among all the classifiers. The gmoMaxEnt and CART classifiers generally had an intermediate level of performance, with OA and kappa values that were lower than RF but higher than minDistance. To better represent and explain the results, the relatively low-performing scenarios for all classifiers were dropped, and only the top five scenarios were selected for further analysis and discussion (Fig. 3). This approach allows for a more focused and informative presentation of the most promising classification outcomes for Gilgit-Baltistan.

A PC coefficient of 0.9 with a p-value of 0.0 was observed between the OA and kappa values across all ML classifiers. This indicates a strong positive and statistically significant relationship between these two performance metrics. The close correspondence between OA and kappa suggests that the kappa index was closely tied to the overall classification accuracy, such that higher OA values were consistently associated with higher kappa scores, and vice versa.

The analysis of individual LC class accuracies revealed several notable trends. The RF exhibited strong performance, achieving high PA of 0.97 and UA of 0.88 for the water class. In contrast, the SVM and minDistance were unable to match the UA and PA values attained by RF for this class. RF also demonstrated high UA (0.77) and PA (0.75) for the forest cover, as well as for the grasses (PA = 0.71, UA = 0.78). Among the classifiers, CART and minDistance showed the least underestimation for the grasses class, with UA = 0.55 and PA = 0.48. For agriculture class, RF achieved a high PA of 0.89 and UA of 0.78, while minDistance recorded relatively lower UA and PA values (UA = 0.60, PA = 0.65). The RF classifier also performed exceptionally well for the wetland class, with UA of 0.85 and PA of 0.79, whereas minDistance exhibited considerably lower UA (0.53) and PA (0.69) values. Interestingly, SVM attained the highest PA (0.93) but the lowest UA (0.59) among all classifiers for the barren land class. Similarly, for the built-up areas, RF achieved the highest UA (0.89), while minDistance had the lowest UA (0.32). However, SVM had the lowest PA (0.07), whereas the gmoMaxEnt classifier recorded the highest PA (0.68) for the built-up class. In the case of glacier classification, CART achieved the highest PA (0.59), while gmoMaxEnt had the lowest PA (0.14). RF, on the other hand, attained the highest UA (0.67), and minimum distance had the lowest UA (0.39). For the snow class, minDistance obtained the highest UA (0.97) and gmoMaxEnt had the lowest UA (0.75), while RF achieved the highest PA (0.97), and CART had the lowest PA (0.89). These findings suggest that classifiers utilizing a higher number of spectral bands generally exhibit more favourable UA and PA performance across the various LC classes.

Moreover, misclassification of LC classes was observed across all classification results. For instance, many barren land validation points were incorrectly labelled as agriculture, and similar issues occurred for built-up areas being mislabelled as water (Fig. 4a–e). These misclassification patterns highlight the challenges in accurately discriminating certain LC types, likely due to spectral similarities or mixed pixels. Despite these issues, the comparison of OA and kappa index indicates that the RF classifier performed excellently among the five implemented algorithms, by exhibiting the highest and most consistent classification results.

Fig. 4

Confusion matrices for the best classification scenario of each classifier: (a) RF with scenario S1, (b) gmoMaxEnt with scenario S1, (c) CART with scenario S2, (d) SVM with scenario S1, and (e) minimumDistance with scenario S1. The LC classes are numbered 0 through 8, where 0 is water, 1 is forest, 2 is grasses, 3 is wetland, 4 is agriculture, 5 is barren land, 6 is built-up, 7 is glacier, and 8 is snow. UA represents the user’s accuracy, and PA represents the producer’s accuracy

Moreover, we also compared and evaluated the classification performance of ML algorithms across different LC categories using precision, recall, and F1-score metrics (Fig. 5). These findings reveal that RF demonstrated superior precision, recall, and F1-scores for the water bodies, wetlands, agriculture, barren land, and snow LC types. It achieves precision scores of 0.96, 0.83, 0.85, and 0.91, respectively. SVM and gmoMaxEnt also exhibit competitive precision scores, particularly in the water, wetland, and agriculture categories. However, all algorithms struggle to accurately classify the forest category, with significantly lower precision scores across the board. The minimumDistance algorithm consistently performs the poorest in terms of precision for all LC categories. When considering recall, RF again shows strong performance in correctly identifying the LC classes of water (0.94), grasses (0.67), and snow (0.98). However, SVM and gmoMaxEnt exhibit relatively lower recall scores, especially for the forest and wetland categories. The minimumDistance algorithm consistently has lower recall scores compared to other algorithms, indicating difficulties in correctly classifying various LC types. CART also shows mixed results across different LC categories. In terms of precision, CART performs relatively well for water (0.82), agriculture (0.71), and barren land (0.75). However, its precision scores are lower compared with RF in most categories. CART struggles particularly in the forest category, achieving a precision score of 0.57, which is significantly lower compared to other algorithms. In terms of recall, CART performs decently for the water (0.86) and barren land (0.68) categories. However, it falls behind RF in terms of recall scores across most LC classes. It demonstrates challenges in correctly identifying instances of forest and wetland, with recall scores of 0.59 and 0.73, respectively.

Fig. 5

Precision, recall, and F1-score comparison results for the top-performing classification scenario of each of the five algorithms against each LC class. The LC classes are numbered 0 through 8, where 0 is water, 1 is forest, 2 is grasses, 3 is wetland, 4 is agriculture, 5 is barren land, 6 is built-up, 7 is glacier, and 8 is snow. The vertical colour bar shows the accuracy of classification as low (0) with red and high (1) with green

Additionally, focusing on the F1-score (Fig. 5c), RF consistently achieves high scores for the water, agriculture, barren land, and snow categories, indicating a balanced performance in terms of precision and recall. SVM and gmoMaxEnt also exhibit competitive F1-scores for the water, wetland, and agriculture categories. As with the other metrics, the forest category poses a challenge for all algorithms, resulting in relatively lower F1-scores. The minDistance algorithm consistently performs the worst across all LC categories, while CART achieves moderate F1-scores for the water (0.75) and barren land (0.71) categories. However, its overall performance is lower compared to RF and SVM in most LC categories.

Comparing the performance among LC classes (Fig. 5), it is evident that the algorithms excel in different categories. RF consistently performs well in water, agriculture, barren land, and snow. SVM and gmoMaxEnt show strengths in the water and wetland categories. However, all algorithms struggle to classify the forest category accurately, indicating the inherent complexity of distinguishing forest cover. CART, on the other hand, lags behind RF, SVM, and gmoMaxEnt in terms of F1-scores across most LC categories. It struggles to classify the forest category accurately, resulting in a relatively low F1-score of 0.58. Moreover, the minimumDistance algorithm consistently has the lowest F1-scores among all the algorithms, indicating limitations in accurately classifying LC categories.

These results suggest that RF is the most reliable algorithm across multiple LC categories, demonstrating superior precision, recall, and F1-scores. However, the performance of each algorithm varies depending on the specific LC class. These findings emphasize the importance of selecting appropriate ML models based on the target LC type. Further research and experimentation are required to uncover the underlying factors causing the observed performance variations and to refine the classification accuracy for challenging LC categories, such as forests.
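The per-class metrics discussed above follow directly from a confusion matrix. The following is a minimal sketch, assuming rows hold the reference classes and columns the predicted classes; the three-class matrix, the class names, and the function name `per_class_metrics` are hypothetical and are not taken from the study's validation data.

```python
def per_class_metrics(matrix):
    """Return precision, recall, and F1-score for each class.

    matrix[i][j] is the number of reference samples of class i
    that the classifier assigned to class j.
    """
    n = len(matrix)
    metrics = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column total
        reference_k = sum(matrix[k])                       # row total
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / reference_k if reference_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


# Hypothetical validation counts (illustration only, not the study's data).
class_names = ["water", "forest", "barren land"]
confusion = [
    [47, 2, 1],    # reference: water
    [3, 30, 17],   # reference: forest
    [0, 8, 42],    # reference: barren land
]

scores = per_class_metrics(confusion)
for name, m in zip(class_names, scores):
    print(f"{name:12s} precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} f1={m['f1']:.2f}")
```

In this toy matrix, the forest class is frequently confused with barren land, which lowers both its precision and its recall, in the same way that a spectrally ambiguous class lowers the F1-score in a real assessment.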

Land Cover Classification of Gilgit-Baltistan

The classified maps (Fig. 3) produced from the ten scenarios (Table 3) and five ML classifiers were visually consistent with one another; not all of them can be displayed because of space limitations. Based on the accuracy assessment (OA and Kappa), RF under scenario S1 was found to be the best ML classifier. Therefore, the RF LC classification was selected as the final map and used to derive the area distribution of each LC class in Gilgit-Baltistan (Fig. 6).

Fig. 6

Final classified map of Gilgit-Baltistan produced by RF under scenario S1 on the GEE platform for the year 2019 using Sentinel-2 satellite data
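The two statistics used to select the final classifier, OA and Kappa, can both be derived from a confusion matrix. The sketch below assumes the standard definitions (OA as the proportion of correctly classified samples, and Cohen's Kappa as the agreement beyond chance); the matrix values and the function name `overall_accuracy_and_kappa` are hypothetical and do not reproduce the study's accuracy assessment.

```python
def overall_accuracy_and_kappa(matrix):
    """Compute overall accuracy and Cohen's Kappa.

    matrix[i][j] is the number of reference samples of class i
    predicted as class j. Assumes the chance agreement is below 1,
    i.e. more than one class is present.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[k][k] for k in range(n)) / total
    # Chance agreement: product of row and column totals per class.
    expected = sum(
        sum(matrix[k]) * sum(matrix[i][k] for i in range(n))
        for k in range(n)
    ) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical three-class confusion matrix (illustration only).
confusion = [
    [47, 2, 1],
    [3, 30, 17],
    [0, 8, 42],
]

oa, kappa = overall_accuracy_and_kappa(confusion)
print(f"OA = {oa:.3f}, Kappa = {kappa:.3f}")
```

Kappa is lower than OA here because part of the observed agreement would be expected by chance alone, which is why the two statistics are usually reported together.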

The final RF-based LC results (Figs. 6 and 7) show that barren land (33,201.77 km²) is the largest LC type in the study area, followed by snow cover (16,275.59 km²), glacier (5651.42 km²), grasses (5141.72 km²), water (3356.22 km²), wetland (2063.60 km²), build-up (1928.81 km²), agriculture (1316.73 km²), and forest (868.04 km²).

Fig. 7

Area estimates of each land cover class resulting from image classification using the best-performing RF classifier. The label on each bar represents the area of the land cover class in km²
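The class areas reported above can be converted into shares of the total classified area with simple arithmetic. The sketch below uses only the area values stated in the text; the variable names are illustrative.

```python
# Class areas in km² as reported for the final RF (scenario S1) map.
areas_km2 = {
    "barren land": 33201.77,
    "snow": 16275.59,
    "glacier": 5651.42,
    "grasses": 5141.72,
    "water": 3356.22,
    "wetland": 2063.60,
    "build-up": 1928.81,
    "agriculture": 1316.73,
    "forest": 868.04,
}

# Total classified area is the sum of all class areas.
total_area = sum(areas_km2.values())

# Percentage share of each class in the total classified area.
shares = {name: 100.0 * area / total_area for name, area in areas_km2.items()}

print(f"Total classified area: {total_area:.2f} km²")
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {share:5.1f} %")
```

Expressing the areas as shares makes the dominance of barren land and snow, relative to agriculture and forest, easier to compare across districts or with other LC products.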

The study area is predominantly classified as barren land, which is distributed throughout the region, while the snow and glacier classes are distributed primarily across the northern and eastern sides. The agriculture class has a distinct distribution, concentrated along the river and stream channels, which also contain the human settlements and built-up areas distributed throughout the valleys. Forest and wetland classes were concentrated along the high-altitude ridges and rangelands, mostly between the south-west and north of the study area. Grasses were distributed in the northern, southern, south-eastern, and north-western parts, comprising the alpine pastures and rangelands that are usually covered with snow during the winter months (approximately 6 months). The built-up class in GB is typically distributed in the valleys, surrounded by agricultural and uncultivated lands, with the most concentrated areas observed in the main cities of Astore, Ghizer, Skardu, Chilas, and Gilgit (Ali et al. 2019a, b). This distribution provides insights into the landscape characteristics and human–environment interactions within the study area.

The district-wise area estimates of each LC class (Table 5) revealed that water is concentrated in Hunza district (1099.85 km²), and that Diamer district has the largest area of forest (353.66 km²), grasses (1162.57 km²), and wetland (644.36 km²), while Astore district has the largest area of agricultural land. Moreover, barren land and build-up were distributed largely in Ghizer district, with areas of 7053.31 km² and 329.99 km², respectively, while the glacier (1751.57 km²) and snow (4114.97 km²) classes cover their largest areas in Hunza and Shigar districts, respectively.

Table 5 District-wise area estimates of each land cover class resulting from image classification using the best-performing RF algorithm with scenario S1. The colour bar indicates the highest (green), medium (orange to yellow), and lowest (red) coverage of the land cover class area for each column in km²

Discussion

Advantages and Opportunities of GEE Cloud Platform

Mapping LC features over large areas often faces challenges due to the limited availability and inconsistency of cloud-free satellite imagery. In this study, these challenges were addressed by employing temporal aggregation of Sentinel-2 imagery to create a comprehensive LC map for the Gilgit-Baltistan region of HKH. Previous classification efforts in this area had been hindered by the shortage of cloud-free images. The introduction of the web-based GEE cloud platform has significantly addressed the computational limitations that often hinder LC mapping in developing countries (Li et al. 2020; Zhou et al. 2020).

A few localized studies applying traditional approaches to LC mapping have been conducted in the study area. However, as already mentioned, these previous efforts used fewer LC classes or focused on easily distinguishable classes such as barren land, forests, or water bodies in small sub-basin areas (Khan et al. 2020a, b; Qamer et al. 2016). Comparing our results with previous studies is challenging because of differences in methodologies and data sources. Previous studies in the area used Landsat-8 (L-8), Landsat-7/5/4, and MODIS products, while our study is the first to use Sentinel-2 data. This is significant because Sentinel-2 offers shorter revisit times and higher-resolution imagery, enabling more accurate and practical LC classification in our target area.

The accuracy evaluation in this study demonstrates that the use of the GEE cloud platform enables robust and accurate regional-scale LC mapping. GEE offers key advantages, such as the ability to quickly and precisely select sample points using supplementary data like socio-economic information, population, DMSP-OLS, DEM products, and satellite datasets, surpassing the capabilities of traditional tools. Moreover, GEE scripts can be easily extended for long-term monitoring of LC changes and driving indicators. Collaboration among stakeholders and agencies can further strengthen the capacity to address challenges like food security and flood mapping using open-source geospatial data. One such example is the SERVIR program, a joint NASA and USAID initiative that helps developing countries use Earth observation satellites and geospatial technologies (SERVIR 2005). A similar collaborative approach can be adopted in Gilgit-Baltistan for improved planning and decision-making. The use of the GEE cloud platform has enabled the development of high-resolution, regional-scale LC products, overcoming the computational limitations that had previously hindered such large-scale mapping efforts (Faqe Ibrahim et al. 2023). This approach provides a strong foundation for future LC monitoring and change analysis to support sustainable management in the study region.

Strengths and Implications of Our Land Cover Classification

The LC information on the nine classes mapped (Fig. 6) in this study is essential for improving our understanding of various earth surface processes in Gilgit-Baltistan. For instance, the delineation of water and glacier bodies can enable disaster monitoring of GLOFs, which are a frequent occurrence in the study area (Jamil et al. 2019).

In the current study, the ML classifiers under investigation were not exhaustively tuned. Instead, the models were run with their default input settings to map the land features from the same training dataset. This approach identified the most efficient ML classifier, which could potentially achieve higher accuracy after hyperparameter tuning. Choosing a set of optimal hyperparameters for an ML model is a time-consuming and user-dependent process, which may affect the classification accuracy. Although default values for these parameters are usually suggested, they may need adjustment to ensure that an accurate classification is produced (Maxwell et al. 2018). The optimal parameters of the model vary from area to area depending on the quality of the dataset, the number and spatial distribution of sample points, and derivative features such as texture and spectral indices (Tsai et al. 2018). However, the limitation of a small number of sample points in the current study was likely mitigated by the bagging method of the RF classifier (Breiman et al. 1984), which performed very well. RF is considered one of the best ML classifiers for overcoming such limitations under various scenarios (Table 3) and has been tested widely by many researchers (Gargiulo et al. 2020; Pradhan et al. 2020; Stromann et al. 2020). Thus, the current approach provides an opportunity for future studies in Gilgit-Baltistan to select RF for LC classification, which would likely provide higher classification accuracy after hyperparameter tuning.
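The bagging idea referred to above can be illustrated with a deliberately simplified sketch: several weak models are each trained on a bootstrap resample of a small training set, and their predictions are combined by majority vote. The example assumes a single hypothetical feature (an NDVI-like index), two classes, and one-threshold decision stumps in place of full decision trees; it omits the random feature selection of RF and is not the GEE implementation used in the study. All function names and sample values are illustrative.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a one-threshold classifier on (value, label) pairs.

    Returns (threshold, label_at_or_below, label_above) with the
    fewest training errors.
    """
    values = sorted({value for value, _ in samples})
    labels = sorted({label for _, label in samples})
    best = None
    for threshold in values:
        for low in labels:
            for high in labels:
                errors = sum(
                    (low if value <= threshold else high) != label
                    for value, label in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low, high)
    return best[1], best[2], best[3]


def fit_bagged(samples, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(samples) items with replacement.
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(fit_stump(bootstrap))
    return models


def predict_bagged(models, value):
    """Combine the stump predictions by majority vote."""
    votes = Counter(
        low if value <= threshold else high
        for threshold, low, high in models
    )
    return votes.most_common(1)[0][0]


# Hypothetical training samples: (NDVI-like value, class label).
training = [
    (0.05, "barren land"), (0.08, "barren land"), (0.10, "barren land"),
    (0.12, "barren land"), (0.15, "barren land"),
    (0.45, "agriculture"), (0.50, "agriculture"), (0.55, "agriculture"),
    (0.60, "agriculture"), (0.70, "agriculture"),
]

models = fit_bagged(training)
print(predict_bagged(models, 0.02))
print(predict_bagged(models, 0.90))
```

Because each stump sees a slightly different resample, an unrepresentative draw affects only a few votes, which is the property that makes bagging useful when the number of sample points is small.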

Various regional and global LC datasets exist for estimating the extent of LC classes; they are prepared using different imagery (Landsat-5/7/8, S-2, MODIS, etc.), algorithms, and spatial resolutions. The accuracy of these products needs to be improved when they are applied at the local or regional level (Wagle et al. 2020), especially in mountainous regions. Differences in input data, classifier type, data acquisition time, and spatial resolution lead to poor agreement among the products when they are applied at the regional or global level. Thus, producing reliable local or regional LC classification products is essential. A figure comparing multiple global LC products with the results of the current study is included in the supplementary file (Fig. S1) and provides evidence for this argument.

During the study, a few challenges were encountered, mostly pertaining to uncertainties among the build-up, water, agriculture, wetland, and barren land classes, which were misclassified probably because of their similar spectral responses (Figs. 8 and 9). This problem was more pronounced in villages where the houses are of masonry type with roofs covered with clay and sand (Rafi et al. 2016), causing confusion between the barren land and build-up classes. It was also difficult to distinguish between croplands and seasonal grasslands using the S-2 imagery because of the overlapping phenological signatures of agriculture, forest, and grasslands, especially where they are sparse (Fig. 8a). We tackled these difficulties by acquiring high-quality samples from secondary data and by using high-resolution Google Earth data as a reference. However, higher mapping accuracy would likely be achieved with larger and more accurate training datasets along with hyperparameter tuning of the classifier (Ka & Sa 2018; Tsai et al. 2018).

Fig. 8

Comparison of the accuracy of various LC classes between the current study results and the ESA WorldCover product, based on Sentinel-2 and the RF model, at different locations. a, d, g show the basemap from Maxar, b, e, h are results from the current study, and c, f, i represent the ESA global LC product, whereas (j) is an orthorectified image acquired using an unmanned aerial vehicle (UAV) in July 2022 for location (a)

Fig. 9

Comparison of Sentinel-2 SR and wavelength (nm) response for each land cover class at the same locations. a Reflectance over the median composite image used for land cover analysis. b Reflectance over a single image acquired on 5 July 2019. The grey rectangles illustrate the wavelength ranges for each Sentinel-2 band

To establish a rigorous comparison between our LC model and a global LC product and to ensure the scientific validity of the results, we used the ESA WorldCover product (Chaaban et al. 2022; Zanaga et al. 2022), which has the same spatial resolution and is also produced using an RF model, with an OA of 74.4% (Fig. 8). Our RF-based LC model achieves better results at the local scale and could improve further with hyperparameter tuning. For instance, Fig. 8a–c shows agricultural fields that are clearly visible and accurately mapped by our model. In contrast, ESA WorldCover has inaccurately classified these areas as shrubland or grassland. The reference image (Fig. 8j) acquired by an unmanned aerial vehicle (UAV) over the same area further corroborates the accurate mapping of agricultural fields by our model. Also, Fig. 8d–f illustrates an example where barren land (a braided river) was misclassified as water by the WorldCover product (Fig. 8f), while the same extent is accurately mapped by the current study model (Fig. 8e). Additionally, Fig. 8i highlights areas of misclassification compared to Fig. 8g–h, where the water class (water stream/river) in Fig. 8i was misclassified as the shrub, tree, and bare classes. The ESA WorldCover product has a better resolution than other global products; nevertheless, producing global-scale LC maps remains a challenge. In such cases, regional-level studies prove to be a better option, offering improved efficiency and OA of LC products. Practitioners at the local and regional level can use accurate regional data products for planning and designing new development programs for local communities.

Another aspect of the current classification is the use of the median operation for temporal aggregation of satellite imagery, which might influence the overall classification accuracy. The use of a median composite from Sentinel-2 data offers significant advantages for LC mapping. To illustrate this, the spectral response of the median composite using all images (Fig. 9a) and of a single image at the same locations (Fig. 9b) is provided. By combining data from multiple acquisitions, the median composite effectively reduces the impact of atmospheric conditions and minimizes temporal variability, resulting in more accurate and reliable reflectance values. The response of the LC classes remained equivalent without showing irregular responses, which suggests that the median composite contains no erroneous data that might have negatively influenced the overall classification. However, there is an observed overlap in the median composite SWIR region (Band 12) for snow and water (Fig. 9a), which can be attributed to the common physical properties of these two materials, such as their high reflectance in the NIR region and their relatively low absorption in the SWIR region (Shao et al. 2020). RF is nevertheless well suited to tackling the challenge of overlapping reflectance values among various LC classes by employing ensemble learning. RF combines multiple decision trees trained on random subsets of the data from all bands, introducing diversity into the model. This allows the algorithm to capture a broader range of spectral patterns and features beyond the overlapping wavelengths, enhancing the discrimination between snow and water. Previous studies (Phan et al. 2020; Xie et al. 2019) have also validated the accuracy of the median operation for preparing composite images.
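The effect of the median operation described above can be illustrated with a minimal per-pixel sketch. It assumes each image is a small grid of reflectance values for one band, with masked (for example, cloud-covered) pixels stored as `None`; the values and the function name `median_composite` are hypothetical, and the sketch is a plain-Python illustration rather than the GEE reducer applied in the study.

```python
from statistics import median


def median_composite(stack):
    """Return the per-pixel median of a stack of single-band images.

    stack is a list of images; each image is a list of rows, and each
    pixel is a reflectance value or None when the pixel is masked.
    Pixels masked in every image remain None in the composite.
    """
    n_rows = len(stack[0])
    n_cols = len(stack[0][0])
    composite = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            valid = [image[r][c] for image in stack if image[r][c] is not None]
            row.append(median(valid) if valid else None)
        composite.append(row)
    return composite


# Three hypothetical acquisitions of a 1 x 2 pixel area (one band).
# The second image has a bright, cloud-contaminated value in the first
# pixel and a masked second pixel.
images = [
    [[0.10, 0.30]],
    [[0.90, None]],
    [[0.12, 0.32]],
]

composite = median_composite(images)
print(composite)
```

The contaminated value of 0.90 does not propagate into the composite because the median ignores isolated extremes, whereas a mean of the same three values would be strongly biased upward.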

Conclusions

Sentinel-2 is exceptional among presently operating Earth-observation satellites because of its wide spectral range, its 10 m spatial resolution, and its revisit time. This study represents a first assessment and evaluation of five ML algorithms using GEE to classify LC classes using temporally aggregated Sentinel-2 data for the year 2019 (May–September) based on ten scenarios (50 LC products) over a complex mountainous environment. The five tested ML algorithms produced OA ranging between 0.59 and 0.79, without any hyperparameter tuning. Among these classifiers, RF and gmoMaxEnt performed exceptionally well with scenario S1, while CART (S2 scenario) and SVM (S1 scenario) performed moderately, with OA lower than that of the RF classifier (S1 scenario) by 0.06 and 0.14, respectively.

Moreover, the current study compared GEE's built-in ML algorithms using default input parameter values to remove bias among classifiers and provide a consistent environment. The OA-based evaluation identified the RF classifier as the most suitable for mapping areas with complex mountain systems such as Gilgit-Baltistan. Therefore, in the future, the RF classifier with scenario S1 within the GEE environment should be used for advanced multi-source image classification, with hyperparameter tuning to increase OA and improve prediction. It is also suggested to build the capacity of various stakeholders in Gilgit-Baltistan for better monitoring of LC changes and resource management using big data coupled with the GEE cloud platform.