Model development to estimate site index values for six major tree species in North Korea

Due to considerable deforestation in North Korea, there is a need to plan forest restoration programs based on scientific forest management. In this study, a methodology was developed for estimating the site index values of six major tree species and the forest productivity potential. The site index values of these tree species were derived in South Korea using the Chapman-Richards equation. These values were used with data from the 6th National Forest Inventory, which included 20 types of edaphic, topographic, and climatic factors, and random forest analysis—a widely used machine learning technique for spatial prediction—to develop a new model for estimating the site index values of these species across South Korea. The prediction accuracy of this model was evaluated using the root mean square error. The results show that the prediction accuracy was high, with a root mean square error of ~ 1 m. Moreover, the importance of the variables related to climate and geography was generally high. The proposed site index estimation model for six tree species was applied across North Korea, and its effectiveness tested by comparing the estimated values with those reported in literature from North Korea. The differences between the model outputs and recorded data in the northern alpine regions were presumably due to the lack of data for high-altitude regions in South Korea. This model is based on the determination of the suitability of tree species in restoration efforts. Therefore, it can contribute to the evaluation of forest productivity in North Korea and may help plan efficient forest restoration programs.


Introduction
The Korean Peninsula is divided into South Korea and North Korea, and although forest cover has remained relatively constant in South Korea, it declined sharply in North Korea after the 1980s (Engler et al. 2014). North Korea became aware of the seriousness of this forest degradation, and in 2015 set up a 10-year forest restoration plan (2015-2024) to restore 1.84 million ha of mountainous area (Kim et al. 2015a, b). At the 8th Party Congress held in 2021, it was announced that approximately 1 million ha of forest had been restored (DPRK 2021; Kim et al. 2021a). Press releases during 2011-2017 reported that the tree survival rate was 90-100%; however, Huh (2014) reported that the tree survival rate had been 30-50% over the preceding decade (i.e., until 2014). The slow rate of change in the deforestation rate in North Korea over the past 10 years suggests that, despite continuous research into ways to increase survival rates, the actual survival rate and afforestation are lagging.
Successful afforestation requires the identification of suitable tree species at each site, which is a core element of the planning stage. In North Korea, several studies have examined the correlation between plant growth and environmental factors at various sites (Kim and Lee 2006; Sim et al. 2006; Han and Byung 2019; Lee and Jeon 2019). These data are essential for evaluating suitable tree species for a given afforestation site. South Korea is an exemplary country that has succeeded in forest restoration in a short period of time (Brown 2008; Steiner 2008). The Korean Peninsula is eco-geologically continuous with relatively similar features (except in the high-altitude areas in the northern alpine region). Therefore, using the site index data obtained in South Korea may be a viable strategy to determine suitable tree species for sites in North Korea. A scientifically established model to estimate site index values would thus be useful in setting up reforestation programs in North Korea and would play a positive role in inter-Korean forestry cooperation (Park et al. 2021).
Forest site quality, the capacity of a site to grow trees, is a potential determinant of suitable tree species at a given site. A high site quality indicates a high production potential; i.e., a high potential for tree growth (Clutter et al. 1983;Pyo et al. 2009). The production potential for each species is a basic variable for forest resource measurement and management. Site index is one of the most widely used measures of site quality (Pyo et al. 2009) and is defined as the height of the dominant or co-dominant tree species at a certain base age of a stand. The uppermost tree canopy is relatively less affected by anthropogenic factors, the surrounding environment, and stand density; therefore, this height is typically used as a measure of the site quality and represents the productivity potential of a site (Jung 1994;Gadow and Hui 1999;Husch et al. 2002;Kim and Park 2013;Kang 2017).
In South Korea, site index is estimated using the site index curve of each species derived from the correlation between stand age and height (Son et al. 2016). Although this method is relatively accurate, it does not sufficiently reflect the various environmental factors that affect plant growth at a forest site. Moreover, this method cannot be used in locations where data regarding tree height and stand age cannot be obtained (such as deforested lands, forest fire areas, and forest regeneration areas), or where onsite surveys cannot be conducted, such as in North Korea (Jung 2006). There is ongoing research to estimate site index using environmental factors, which would enable the evaluation of forest productivity in the absence of stand data. The linear regression model is the oldest and most widely used method to estimate site index using environmental factors (Curt et al. 2001; Kim et al. 2001; Shin et al. 2002; Seynave et al. 2005; Aertsen et al. 2010) and is a useful tool. However, since natural phenomena are not linear, this method may leave part of the variation unexplained. To solve this problem, nonparametric or machine learning methods have also been used for site index estimation (Aertsen et al. 2010; González-Rodríguez and Diéguez-Aranda 2020).
Machine learning techniques have recently attracted attention as a nonparametric approach for site index modeling. Among these, the random forest method has excellent recognition performance and is widely used for efficient site classification (Sabatia and Burkhart 2014; Carvalho et al. 2017; Park et al. 2019). 'Random forest' is an ensemble classifier based on decision trees, and presents several advantages, including the rapid processing of high-dimensional data, fewer instances of overfitting, and good predictive performance. These features make it suitable for handling a large amount of data encompassing a large geographical area with numerous environmental and other variables. In this study, a site index estimation model was developed and evaluated as follows: (1) the random forest method was used to develop a model for site index estimation based on the data provided by the National Forest Inventory (NFI) and a forest site survey in South Korea. These models were developed for six major tree species and addressed the limitations of existing models; and (2) the model was used to estimate the site index values for the six tree species across North Korea, which has areas that are inaccessible for surveys. The suitability of the model for North Korea was tested by comparing the importance of variables and the estimated site index map with data from relevant literature from North Korea.

Study area
North Korea (the Democratic People's Republic of Korea) is in the northern part of the Korean Peninsula (37.41-43.01° N, 124.18-130.41° E). It encompasses a total land area of 120,540 km², of which forest areas account for 74%. The country lies in a temperate zone with four distinct seasons, and has a continental climate, with winter lasting for more than four months in the northern mountainous region. The average annual temperature is approximately 8.5 °C, and the average annual precipitation (concentrated in the summer months of June-August) is 919.7 mm. Latitudinally, the vegetation zones are classified into cold, subarctic, and temperate from north to south. Altitudinally, the vegetation zones are classified into flat, hilly, mountainous, sub-alpine, and alpine (Korea Forest Service 2016).
As confirmed by satellite images, forest cover in North Korea decreased from 9.17 million ha in 1999 to 8.99 million ha in 2008, and then increased to 9.39 million ha in 2018 (Kim et al. 2020; Statistics Korea 2020). Forests in North Korea are divided into stocked land and deforested land (forest land with a slope ≥ 15° and no woody plants). The forest land was severely damaged during the "arduous march" in the mid-1990s, which caused deforestation in the stocked land and led to a significant increase in deforested land area (Central Forest Design Technical Institute 2014; Kim et al. 2020). North Korea possibly underwent deforestation and soil degradation due to its high dependence on firewood and various activities to overcome food shortages, and due to an increase in terrace agriculture, logging, and forest fires (Kim et al. 2021a).

Selection of tree species
Six tree species were used to construct a site index prediction model for South Korea and the model was then applied to North Korea. The selected species met the following criteria: (1) they were dominant throughout the Korean Peninsula; (2) abundant data sufficient for plotting site index curves was available for these species in South Korea; and, (3) the species were useful for reforestation and were preferred in North Korea. The six species were: Larix kaempferi (Lamb) Carr., Pinus densiflora Sieb. & Zucc., Pinus koraiensis Sieb. & Zucc., Pinus rigida Mill., Quercus mongolica Fisch. ex Ledeb., and Robinia pseudoacacia L. (Fig. 1).
These species were selected for the following specific reasons: L. kaempferi is characterized by fast growth, straight boles, and high wood quality. This species is widely used for timber and contributes to the rapid greening of denuded land. In South Korea, large-scale afforestation efforts began during the Japanese colonial period. Since the 1960s, L. kaempferi has grown to occupy the largest proportion of planted forests in South Korea. It is now a major timber species on the Korean Peninsula, along with P. densiflora (Kang 1995;Son et al. 2000;Bae et al. 2012d).
Pinus densiflora is Korea's native evergreen coniferous species that grows throughout the Korean Peninsula (even on poor soils). Most of P. densiflora forests are naturally regenerated. Its wood is used for building structures and furniture, and to produce resin and paper (Kang 1995;Bae et al. 2012c).
Pinus koraiensis is also an evergreen conifer that grows well throughout the Korean Peninsula. It is preferred by residents for its economic value as a raw material for nut and oil production and is also used as timber wood (Son et al. 2000;Han and Kim 2018).
Pinus rigida grows rapidly, even in poor soils, and is highly resistant to pests and diseases, thus contributing to rapid forest restoration. It has a higher modulus of elasticity than P. densiflora, and is widely used as raw material for construction, industry, and as firewood (Kang 1995;Son et al. 2000;Korea Forest Service 2016).
Quercus mongolica is the most widespread of the Quercus (oak) species in Korea and the highest biomass producer among them. It is distributed across the North Korean territory (except in the northern alpine region) and accounts for 25% of the total biomass (Kang 1995; Bae et al. 2012a).
Robinia pseudoacacia is a fast-growing species that is beneficial not only for forest restoration, but also for honey production. Its fragrant flowers attract honeybees, and residents therefore prefer it to other tree species. It is also useful for soil stabilization and is widely used as firewood and as agricultural and construction material (Sim 2001).
Site index and forest site environmental factors (edaphic, geographic, and climatic factors) were estimated based on the NFI data collected across South Korea for each of these six species. The data were then used as input data to develop a machine learning-based site index estimation model.

Site index for species in South Korea
The 6th NFI data (tree height and stand age) and site index curves were used to calculate the site index of the sample area across South Korea; the site index curves were applied to the height and stand age data for the six major tree species. The NFI survey was conducted on sample plots arranged on a 4 km × 4 km grid. Of the 7239 NFI points, we used the tree data for 4275 P. densiflora, 281 P. koraiensis, 483 L. kaempferi, 572 P. rigida, 1406 Q. mongolica, and 222 R. pseudoacacia. The site index values were calculated using the Chapman-Richards equation (Clutter et al. 1983) and the parameter estimates of the site index estimation model (tree height curves) by species (as developed by Son et al. 2016). These were used in Eq. 1 as follows:

$$SI = H_D \left[ \frac{1 - e^{-b\,t_j}}{1 - e^{-b\,t_i}} \right]^{c} \qquad (1)$$

where SI is the site index, H_D is the height of the dominant tree, t_i is the stand age, t_j is the reference stand age, which was set at 30 years (the age at which these species are known to have their highest average annual growth rates) (Jeon et al. 2017), and b and c are the species-specific parameter estimates.
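The calculation described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the anamorphic (guide-curve) form of the Chapman-Richards model, in which the observed dominant height is scaled by the ratio of the curve at the base age to the curve at the stand age. The function name `site_index` and the parameter values `B_HYP` and `C_HYP` are hypothetical and are not the species-specific estimates of Son et al. (2016).

```python
import math


def site_index(h_dom, age, b, c, base_age=30.0):
    """Scale dominant height (m) at `age` to the height expected at `base_age`.

    Assumes the anamorphic Chapman-Richards form:
    SI = H_D * ((1 - exp(-b * base_age)) / (1 - exp(-b * age))) ** c
    """
    if age <= 0 or base_age <= 0:
        raise ValueError("ages must be positive")
    ratio = (1.0 - math.exp(-b * base_age)) / (1.0 - math.exp(-b * age))
    return h_dom * ratio ** c


# Hypothetical parameters, for illustration only.
B_HYP, C_HYP = 0.05, 1.3

# A 15-year-old stand is projected forward to the 30-year base age.
si_young = site_index(h_dom=9.0, age=15, b=B_HYP, c=C_HYP)
# A stand measured exactly at the base age keeps its observed height.
si_base = site_index(h_dom=14.0, age=30, b=B_HYP, c=C_HYP)
# A 40-year-old stand is projected back to the base age.
si_old = site_index(h_dom=20.0, age=40, b=B_HYP, c=C_HYP)
```

With this form, a stand younger than the base age receives a site index above its current height and an older stand receives one below it, which is the behavior expected of a site index curve.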

Environmental factors at the forest site
To estimate forest productivity in relation to environmental factors at the forest site, the site index calculated above was matched with the environmental factors at the location of the sample area. Twenty factors were used-including edaphic, geographic, and climatic factors-as indicators of growth properties. In the selection of the factors, factors used in previous site index estimation studies were considered as well as those used in North Korean eco-geological zoning research (Jung 2006;Shin et al. 2007;Ju et al. 2012). For categorical data, upper categories were determined through a comparison of sites from South and North Korea. All site-related information was coded and organized for each factor (Table 1). The 20 environmental factors were categorized into three clusters as follows: (1) Seven edaphic factors were derived from digital forest soil maps (1:25,000) constructed by the NIFoS (South Korea) and the Korea Forestry Promotion Institute (North Korea): soil type, soil depth, soil texture, organic matter content, consistency, moisture content, and parent rock.
(2) Four geographical factors were obtained from the Shuttle Radar Topography Mission Digital Elevation Model constructed by NASA (resolution: 30 m): elevation, slope, aspect, and curvature type. (3) Nine climatic factors were from data collected by WorldClim (resolution: 1 km): annual average biological temperature, precipitation, evaporation, temperature difference between the hottest and coldest months, average temperature of the hottest month, warmth index, Paterson's climate vegetation productivity (CVP), mean annual temperature, and isothermality.
Annual average biological temperature indicates the growth limit temperature and growth period of plants and is the average annual temperature within the range suitable for growth (0-35 °C). Annual biological temperature was calculated as follows:

$$BT = \frac{\sum T}{12}$$

where BT is annual biological temperature (°C), and T is mean monthly temperature (°C) (0 ≤ T ≤ 35).
The warmth index measures the degree of warmth during the growing season and is expressed as the annual sum of the amounts by which the mean monthly temperatures exceed the threshold temperature for plant growth, 5 °C (Kira 1945). The warmth index is calculated as:

$$WI = \sum (t - 5)$$

where WI is warmth index, and t is mean monthly temperature (°C; t ≥ 5).
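The two temperature indices defined above can be illustrated with a short Python sketch. The function names and the monthly temperatures are hypothetical. One point is an assumption rather than something stated in the text: months whose mean temperature falls outside 0-35 °C are treated as contributing zero to the biological temperature sum, while the divisor remains 12.

```python
def biological_temperature(monthly_means):
    """Annual biological temperature (°C) from 12 monthly mean temperatures.

    Assumption: months outside 0-35 °C contribute 0 to the sum.
    """
    if len(monthly_means) != 12:
        raise ValueError("expected 12 monthly values")
    return sum(t for t in monthly_means if 0.0 <= t <= 35.0) / 12.0


def warmth_index(monthly_means, threshold=5.0):
    """Warmth index: sum of (t - threshold) over months warmer than threshold."""
    return sum(t - threshold for t in monthly_means if t > threshold)


# Hypothetical monthly mean temperatures (°C), January to December.
monthly = [-4.0, -1.0, 4.0, 11.0, 16.0, 21.0, 24.0, 25.0, 20.0, 13.0, 6.0, -2.0]

bt = biological_temperature(monthly)  # 140 / 12
wi = warmth_index(monthly)            # 6 + 11 + 16 + 19 + 20 + 15 + 8 + 1
```

For these values the three sub-zero months drop out of the biological temperature, and only the eight months warmer than 5 °C contribute to the warmth index.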
The climate vegetation productivity (CVP) index is used to evaluate the potential production of forests. It is a major indicator that can be used anywhere in the world, as it evaluates potential productivity from climate alone, even for bare lands that are difficult to access and have no vegetation (Rahman and Akter 2015). The CVP index is calculated as follows:

$$CVP = \frac{TV \times P \times G \times E}{Ta \times 12 \times 100}$$

where CVP is Climate Vegetation Productivity, TV is mean temperature of hottest month (°C), P is mean annual precipitation (mm), G is growing period (d), E is mean annual evaporation (mm), and Ta is temperature difference between hottest month and coldest month (°C). Isothermality is defined as follows:

$$\text{Isothermality} = \frac{\text{mean diurnal temperature range}}{\text{annual temperature range}} \times 100$$
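As a rough numerical illustration of the last two indices, the sketch below implements Paterson's CVP index in its commonly cited form and isothermality as defined for the WorldClim bioclimatic variables (mean diurnal range divided by annual temperature range, times 100). Both function names and all input values are hypothetical. The CVP function assumes Paterson's original units, with the growing period in months and E as a percentage; the units listed above differ (days and mm), so the constants would need to be checked against the source before any real use.

```python
def paterson_cvp(tv, p, g_months, e_percent, ta):
    """Paterson's CVP index in its commonly cited form (assumed units).

    tv: mean temperature of the hottest month (°C)
    p: mean annual precipitation (mm)
    g_months: growing period in months (assumption, see lead-in)
    e_percent: evapotranspiration term as a percentage (assumption)
    ta: difference between hottest and coldest month means (°C)
    """
    if ta <= 0:
        raise ValueError("temperature range must be positive")
    return (tv * p * g_months * e_percent) / (ta * 12.0 * 100.0)


def isothermality(mean_diurnal_range, annual_temp_range):
    """Mean diurnal range as a percentage of the annual temperature range."""
    if annual_temp_range <= 0:
        raise ValueError("annual temperature range must be positive")
    return mean_diurnal_range / annual_temp_range * 100.0


# Hypothetical inputs, for illustration only.
cvp = paterson_cvp(tv=24.0, p=1000.0, g_months=6.0, e_percent=60.0, ta=30.0)
iso = isothermality(mean_diurnal_range=9.0, annual_temp_range=36.0)
```

A strongly continental site, with a large annual range relative to its daily range, yields a low isothermality, which is why this variable can separate inland from coastal growing conditions.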

Principal component analysis
Among the 20 environmental factors, the continuous variables were selected and their pairwise correlation coefficients were examined (cutoff: > 0.9) using Pearson's product-moment correlation.
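The correlation screening step can be sketched as follows. This is a minimal standard-library illustration, not the workflow used in the study: the variable names and values in `toy` are hypothetical, and whether the cutoff was applied to the absolute value of the coefficient is an assumption.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def highly_correlated_pairs(columns, cutoff=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the cutoff."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) > cutoff:
            pairs.append((a, b, r))
    return pairs


# Hypothetical plot-level values for three continuous factors.
toy = {
    "warmth_index": [60.0, 75.0, 90.0, 105.0, 120.0],
    "mean_annual_temp": [6.0, 8.1, 9.9, 12.2, 14.0],
    "precipitation": [1300.0, 1100.0, 1450.0, 1000.0, 1250.0],
}
flagged = highly_correlated_pairs(toy)
```

In this toy data only the two temperature-derived variables exceed the cutoff, which mirrors the situation in which several temperature indices carry nearly the same information and are candidates for replacement by principal components.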

Model construction with random forest model
A model was developed to estimate site index using the random forest machine learning technique and the 20 variables extracted from the environmental factors in South Korea. The random forest model is an algorithm first proposed by Breiman (2001). It is a classifier consisting of several decision trees that are created from subsets of the data through the process of bagging. The optimal result is inferred by building an ensemble model through voting to combine individual predictions of multiple base models. Because variables are selected at random, only some of the variables are used. Randomization contributes to preventing inter-model correlations; moreover, the bagging of multiple decision trees helps prevent overfitting and improves the prediction accuracy of the model. In addition, the simplicity of this method allows fast model construction even with large datasets, and it can accurately select important variables from a high-dimensional dataset.

Table 1 (excerpt) Environmental factors and their category codes
Top soil (A layer) wetness: 1: 0, Dry / 2: 0-6, Slightly dry / 3: 6-9, Moderately moist / 4: 9-14, Slightly wet / 5: 14-30, Wet
X9 Top soil (A layer) organic matter: 1: 0-2% / 2: 2-4% / 3: 4-6% / 4: > 6%
X10 Top soil (A layer) consistence: 1: very crumbly / 2: crumbly / 3: soft / 4: hard / 5: very hard
X11 Annual average biological temperature (°C)
X12 Average annual precipitation (mm)
X13 Average annual evaporation (mm)
X14 Temperature difference between the hottest month and the coldest month (°C)
X15 Average temperature of the hottest month (°C)
X16 Warmth index
X17 Paterson's climate vegetation productivity
X18 Mean annual temperature (°C)
X19 Isothermality
X20 Parent rock: 1: sedimentary rock / 2: metamorphic rock / 3: igneous rock
In the random forest model constructed in this study, the number of decision trees was set to 500, and the maximum number of variables considered at each split was set to 6. The input data were resampled to a resolution of 30 m (in line with the DEM resolution), and the analysis was performed in R.

Model evaluation and map construction with the estimated site index by species in North Korea
Since a random forest model uses bagging, each bootstrap set theoretically includes approximately two-thirds (about 63%) of the training data. The remaining third (about 37%) is called the out-of-bag (OOB) set and can be used to evaluate the generalization performance of the model. In this study, the reliability of the model was tested with the root mean square error (RMSE) calculated on the OOB set. The prediction accuracy of the model was evaluated by comparing its performance with those of other site index estimation models proposed by Korean researchers.
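The out-of-bag proportion and the RMSE mentioned above can be checked with a small worked example. A sample of size n drawn with replacement n times misses any given observation with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The sketch below uses only the Python standard library; the function names are illustrative, the sample size 4275 is taken from the number of P. densiflora records reported earlier, and the observed and predicted values passed to `rmse` are hypothetical.

```python
import math
import random


def expected_oob_fraction(n):
    """Probability that one observation is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, rng):
    """Draw one bootstrap sample and return the share of observations left out."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


theory = expected_oob_fraction(4275)
simulated = simulated_oob_fraction(4275, random.Random(0))

# Hypothetical site index values (m): observed vs. OOB predictions.
error = rmse([14.0, 12.0, 10.0], [13.0, 12.0, 12.0])
```

Because every observation is out-of-bag for roughly a third of the trees, averaging the predictions of only those trees gives an error estimate that does not require a separate validation set.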
Because the random forest method combines many decision trees into an ensemble, the contribution of each variable is harder to interpret than in a single decision tree. However, the method can derive variable importance, which indicates the contribution of each variable to predicting the outcome. In this model, we estimated the order of importance of the variables for each tree species and verified the results against widely available information on the corresponding species. The importance of the variables for each species was derived by comparing the OOB error of the trees created in each bootstrap set with the OOB error calculated after the values of a given variable were randomly permuted.
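The idea behind permutation-based variable importance can be illustrated with a simplified sketch. It is not the per-tree OOB procedure of a random forest implementation: here a fixed, hypothetical function `toy_predict` stands in for a fitted model, and the increase in RMSE is measured on the whole dataset after shuffling one column at a time. All names and values are assumptions made for illustration.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def permutation_importance(predict, rows, targets, n_features, rng, repeats=20):
    """Mean increase in RMSE after shuffling each feature column in turn."""
    baseline = rmse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        increases = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(rmse(targets, [predict(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / repeats)
    return importances


def toy_predict(row):
    """Hypothetical stand-in for a fitted model: depends on elevation only."""
    elevation_km, organic_class = row
    return 16.0 - 4.0 * elevation_km + 0.0 * organic_class


rng = random.Random(42)
rows = [(rng.uniform(0.0, 1.5), rng.choice([1, 2, 3, 4])) for _ in range(200)]
targets = [toy_predict(r) for r in rows]
scores = permutation_importance(toy_predict, rows, targets, 2, rng)
```

Shuffling the elevation column breaks its link with the target and raises the error, whereas shuffling the organic matter class leaves the error unchanged, so the first variable ranks as more important.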
Using the site index estimation model trained on the South Korean input data, the site index values of the six major tree species across North Korea were estimated. To do this, the edaphic, climatic, and geographic factors of North Korea were supplied, in place of those of South Korea, to the site index prediction model trained for each species with the random forest algorithm.
The estimated site index values across North Korea were visualized as a high-resolution (30 m) map in ArcMap.

PCA results
The correlation coefficients between variables were calculated using Pearson's correlation test, and PCA was conducted for the variables that showed high correlation coefficients with others (Fig. 2). Five continuous variables (X16, X11, X13, X15, and X18) showed high intercorrelations (> 0.9) with each other. Nominal variables were not considered, as their correlation coefficients were generally low (0.015-0.23). In order to highlight the individual characteristics of the variables, the five raster-type variables were replaced with new representative variables using the 'rasterPCA' function in R statistical software.

Evaluation of the prediction accuracy of the site index estimation models for the six species
Site index estimation models were constructed with the random forest technique using the site index estimates of the six species across South Korea. To evaluate the prediction accuracy of these models, we compared the estimated site index values for each tree species across South Korea with those derived using the Chapman-Richards method based on tree height and stand age. The resultant RMSE values for the six species are listed in Table 2.
The RMSE values were in the range of 0.63-1.04 m and showed interspecies differences. Previous studies that estimated site index using linear regression analysis reported higher RMSE values: 1.57-2.14 m in Shin et al. (2007), 0.96-3.23 m in Song (2003), 3.23 m in Kim (2003), and 3.42 m in Jung (2006). These values were higher due to the limitations of the linear regression model noted earlier. Compared to the previously reported RMSE values, our values were significantly lower (approximately 1.0 m). Thus, the site index prediction model proposed in this study showed higher prediction performance by overcoming the limitations of a linear regression model. This suggests that the model constructed has a relatively high prediction accuracy when applied in South Korea.

Factors in decreasing order of variable importance for each tree species
The variable importance to quantify the contribution of each variable to the site index estimation model was derived for each species. The 20 factors included in the model showed different orders of variable importance for each species (Table 3). The order of variable importance by species and the corresponding information from actual data are compared below.

Larix kaempferi
In this model, L. kaempferi showed high variable importance of geographic factors (aspect and elevation) and climatic factors (average annual evaporation, biological temperature, CVP, and average annual precipitation). Of these, aspect showed the highest variable importance. Bae et al. (2012d) reported that L. kaempferi thrives on sites exposed to warm sunlight and does not grow well in barren or shady places. Of the species considered, L. kaempferi has the highest demand for light and lower resistance to cold (Son et al. 2000). It has also been reported to have relatively low soil quality and nutrient requirements (Kang 1995).
Consistent with these patterns, our model also showed that climatic and other factors associated with the influence of sunlight were highly important, whereas soil organic matter was less important.

Pinus densiflora
Climatic factors (CVP, warmth index, average annual precipitation, and biological temperature) and a geographic factor (elevation) were associated with high variable importance. This suggests that the growth of P. densiflora is more affected by climatic and geographic factors than by soil factors.  Previous studies have reported that the growth of P. densiflora is closely related to temperature and precipitation (Jung 2006;Byun et al. 2010;Bae et al. 2012c;Kim et al. 2015a, b). Thus, the results of our model are consistent with those of previous studies.
According to Son et al. (2000), P. densiflora shows relatively low requirements for soil moisture content and soil composition. Soil environmental factors showed lower variable importance in our model, indicating that it reflects the patterns recorded based on empirical evidence.

Pinus koraiensis
The model showed high variable importance for geographic (elevation and aspect) and climatic factors (isothermality, average annual evaporation, and temperature difference between the hottest and coldest months). In particular, elevation was the most important variable.
P. koraiensis is vulnerable to drought and winds and grows only at high altitudes in South Korea (Bae et al.). In addition, Son et al. (2000) note a considerable influence of light on its growth. In our model, elevation and aspect, and isothermality and annual evaporation were important factors, which is consistent with previous studies. Aspect can affect light conditions, and evaporation reflects the dryness of the climate, suggesting that the model yielded results similar to those of previous studies. In addition, soil environmental factors, except for depth in mid-latitude regions, showed relatively low importance compared to climatic and geographic factors.

Pinus rigida
The model for this species showed high variable importance for climatic factors (isothermality, average annual precipitation, and average temperature) and certain geographic factors (aspect and slope). Son et al. (2000) noted that P. rigida has low requirements for soil moisture and nutrients and is relatively uninfluenced by geographic conditions. In our model, soil moisture and organic contents had low variable importance, whereas aspect and elevation, which relate to ease of access, were more important, presumably because P. rigida is prioritized as an afforestation species in South Korea.

Quercus mongolica
The model for Q. mongolica showed high variable importance, in decreasing order, for average annual precipitation, temperature differences between the hottest and coldest months, biological temperature, and CVP, followed by elevation. Shin et al. (2002) derived the environmental factors influencing the site index for Q. mongolica by performing regression analysis and identified temperature and moisture as factors affecting the forest productivity. Lee et al. (2014) reported that the variables related to precipitation and other factors related to humidity and temperature affected the site index.
In our model, annual precipitation was the most important factor. In addition, other climatic factors, such as the temperature differences between the hottest and coldest months and biological temperature, also had high variable importance. This verifies the high reliability of the model. The findings of other studies indicate that models for Q. mongolica generally have low explanatory power. In contrast, the random forest-based model in this study has a significantly higher explanatory power and considerably higher utility than existing models (Curt et al. 2001;Kabrick et al. 2004;Lee et al. 2014).

Robinia pseudoacacia
The model for R. pseudoacacia showed high variable importance for temperature differences between the hottest and coldest months, average temperature of the hottest month, and average temperature, as well as for aspect and elevation.
R. pseudoacacia is light-demanding and is vulnerable to cold. However, although it needs high soil moisture and nutrient content, it also tolerates low humidity and low nutrient levels (Son et al. 2000;Sim 2001). The high variable importance of temperature-related factors and aspect in our model indicates that this model reflects well the light and temperature requirements of R. pseudoacacia. The low variable importance values of soil moisture and organic matter are also consistent with descriptions in the North Korean literature.
The results show that the variable importance values calculated by the model were generally higher for climatic and geographic factors. There were some exceptions, namely the geographic factors associated with P. rigida and the limited information available on Q. mongolica. Apart from these limitations, the results show a consistent pattern between the calculated variable importance values and those from the literature. Based on these findings, it is suggested that the random forest-based site index estimation model developed in this study is reliable and adequate in estimating forest productivity. Since South and North Korea have similar climates and geographic features, the application of this model to North Korean data is expected to yield reliable site index values.

Figure 3 illustrates the site index map for the six species across North Korea. Currently available maps of North Korea show broad distinctions between forest types without further elaboration, and there are no digital forest cover maps or data for stand age assessment. This makes it difficult to determine whether the site index values estimated using our model match the actual values in the country. However, the current forest classification by eco-geographical zones shows a smooth climatic and geographic gradient across the Korean Peninsula, except in the northern alpine regions. Therefore, it may be assumed that the site index map of North Korea constructed here, using the model developed with South Korean data, can be readily used for practical application (Park et al. 2021). For additional confirmation, the model was validated using site index values and tree species characteristics reported in academic papers, press releases, and other literature from North Korea.

Comparison of the estimated Larix kaempferi site index values
The mean site index for L. kaempferi across North Korea was 14.8 m (range: 12.0-7.9), a considerably higher mean height than that of the other species (Fig. 3a). The mean site index for L. kaempferi was higher than the overall average in Gangwon-do and a part of Hwanghaebuk-do, with other regions showing values similar to the overall mean. Son et al. (2000) note that L. kaempferi is widely distributed in Gangwon-do, Pyeongannam-do, and Hamgyeongnam-do. Our model showed that the site index value of L. kaempferi was highest in Gangwon-do, and that the maximum values in Pyeongannam-do and Hamgyeongnam-do were relatively high. Thus, our results are generally consistent with the actual distribution.
Site index values of L. kaempferi reported in North Korea, calculated by substituting stand age and height data from North Korean studies into the tree height curves of South Korea, are somewhat higher than those estimated by our model. Lee (2013b) conducted a full survey of the sample area and estimated the site index of L. kaempferi trees (stand age: 30 years) at 19-20 m (mean height of 35-year-old L. kaempferi = 21.6 m). Son and Kim (2014) reported that the site index of seed stands of L. kaempferi in Inanm-ri, Seoncheon-gu, Pyeonganbuk-do ranged from 16 to 18 m (mean height of an 18-year-old stand = 12 m; mean height of a 13-year-old stand = 7 m). Baek and Yoon (2012) examined the growth characteristics of L. kaempferi in Ganggye, Jagang-do, and found a site index, as measured by the Forest Management Bureau Industry Branch, of 18 m (at 15 years = 9.35 m). Moon (1987) noted that the site index values of L. kaempferi stands in Jueuidong-ri (Deokseong-gun) and Cheonseong-ri (Gowon-gun, Hamgyeongnam-do) ranged from 18.5 to 20.8 m (at 9 years = 5.1 m, 10 years = 6.7 m, 12 years = 9.0 m, 14 years = 10.1 m, 15 years = 11.1 m). Yoon et al. (1988) found that the mean site index for L. kaempferi was approximately 17.5 m, using data from surveys conducted over 1967-1980. Compared to the values in North Korean literature, our model estimated a slightly lower site index value for L. kaempferi in North Korea. This is presumably due to the general geographical distribution of L. kaempferi in the boreal region of the northern hemisphere. On the Korean Peninsula, the range of L. kaempferi lies primarily in cold regions in North Korea and in high-altitude regions in South Korea. The model can be further corrected via additional analysis of training data collected from the border region between North Korea and China.

Comparison of the estimated Pinus densiflora site index values
The mean site index value for P. densiflora across North Korea was 11.9 m, as shown in Fig. 3b. Mean site index values were high along the Taebaek Mountain Range in Gangwon-do and Hamgyeongnam-do, and wide areas with relatively high site index values were distributed in the Ryanggang-do and Jagang-do regions. Son et al. (2000) reported that P. densiflora forests were most densely distributed in Hamgyeongnam-do, followed by Hamgyeongbuk-do and Pyeongannam/buk-do. The highest proportion of P. densiflora forests was in Pyongyang, followed by Pyeongannam-do, Hwanghaenam-do, Hwanghaebuk-do, and Hamgyeongnam-do. Ryanggang-do had the lowest proportion of P. densiflora in its forest land. Our model underestimated the site index values in Ryanggang-do, possibly due to the lack of training data for the northern highlands. Son et al. (2000) classify P. densiflora forests into forest site grades using the criteria for the site index of a 30-year-old stand: Grade 1 or Grade 2 = 11.30 m; and Grade 3 or Grade 4 = 71.13 m. These values are within the range of the estimates calculated by our model and are similar to the estimated site index values across North Korea.
In North Korea, Kim and Lee (2006) measured the height of P. densiflora according to stand age using 20-year forest standard data, and calculated the mean site index to be 11.11 m. Ju (2010) generated a mean height table for the dominant tree species throughout North Korea, and calculated the overall mean site index of a 30-year-old stand at 12.8 m. Kim and Jeon (1989) reported that the mean site index of P. densiflora forests in Pyeongannam/buk-do was 8.7 m (Grade 4). These values are generally consistent with those estimated by our model, providing empirical evidence of the validity of the model.
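As a rough check on this consistency, the literature values quoted above can be set against the model mean of 11.9 m. The sketch below is only an arithmetic illustration: the helper name `deviations` is hypothetical, and comparing a national model mean with regional or stand-level reports is an assumption made for simplicity rather than the validation procedure of this study.

```python
MODEL_MEAN_SI = 11.9  # model mean for P. densiflora across North Korea (m)

# Site index values (m) quoted above from North Korean literature.
literature_si = {
    "Kim and Lee (2006)": 11.11,
    "Ju (2010)": 12.8,
    "Kim and Jeon (1989)": 8.7,
}

def deviations(reported, model_value):
    """Return reported-minus-model differences, rounded to 0.01 m."""
    return {src: round(val - model_value, 2) for src, val in reported.items()}

diffs = deviations(literature_si, MODEL_MEAN_SI)
mean_abs_diff = sum(abs(d) for d in diffs.values()) / len(diffs)

for src, d in diffs.items():
    print(f"{src}: {d:+.2f} m")
print(f"mean absolute difference: {mean_abs_diff:.2f} m")
```

The largest difference comes from the Pyeongannam/buk-do value, which refers to a single region rather than to the whole country.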

Comparison of the estimated Pinus koraiensis site index values
The mean site index for P. koraiensis across North Korea was 11.9 m, as shown in Fig. 3c. The mean site index values were generally high in North Korea, especially along the Taebaek Mountain Range in Gangwon-do and Hamgyeongnam-do. In contrast, there were lower site index values along the west and east coasts and in the northern parts of Pyeonganbuk-do and Jagang-do. Son et al. (2000) noted that P. koraiensis was naturally distributed in the range of 100-1900 m a.s.l., primarily in Jagang-do and Gangwon-do, followed by Hwanghaebuk-do, Hamgyeongnam-do, and Pyeonganbuk-do. Pyeongannam-do, Hamgyeongnam-do, Gaesong, and Pyongyang were areas with low site index values. Moreover, Son and Kim (2014) found that P. koraiensis is vulnerable to ocean winds. These findings roughly correspond with the estimates of the model in this study.
According to Son et al. (2000), naturally growing P. koraiensis reaches an average height of 7.5 m by 30 years of age. However, P. koraiensis plantations have been reported to grow to a height of 15.4 m (Son et al. 2000). The crowns of these trees can be trimmed to increase nut production. This suggests that the measured site index values for P. koraiensis do not directly indicate the potential forest productivity and may contain errors due to human intervention. Thus, we advocate caution in interpreting these results.
To estimate the site index values of natural P. koraiensis trees unaffected by anthropogenic factors, a P. koraiensis stand in the Ogasan Nature Reserve was examined. The mean height was 34 m, and its estimated age was 582 years as of 2010 (Son et al. 2000; Yoon and Song 2010). Taking the stand age into account, a site index of 18.5 m was estimated by applying the data of this stand to the height curves generated from actual South Korean data; this value is higher than that estimated by our model. Assuming that height growth is barely discernible on the growth curve of a real forest after the standard stand age is reached, our model does not provide a reliable estimate of the site index of a natural reserve forest with a high stand age.
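The flattening of height growth at high stand ages can be illustrated numerically. The sketch below assumes a Chapman-Richards height curve with hypothetical parameters `A_ASYM`, `B`, and `C` (not the fitted values of this study) and prints the height gain predicted for the decade following several stand ages.

```python
import math

def chapman_richards(age_yr, a, b, c):
    """Stand height (m) at a given age: H(A) = a * (1 - exp(-b * A)) ** c."""
    return a * (1.0 - math.exp(-b * age_yr)) ** c

# Hypothetical parameters, for illustration only.
A_ASYM, B, C = 30.0, 0.03, 1.3

def decadal_increment(age_yr):
    """Height gained over the 10 years following age_yr."""
    later = chapman_richards(age_yr + 10, A_ASYM, B, C)
    now = chapman_richards(age_yr, A_ASYM, B, C)
    return later - now

for age in (20, 30, 100, 300, 580):
    height = chapman_richards(age, A_ASYM, B, C)
    print(f"age {age:>3} yr: height {height:5.2f} m, "
          f"next-decade growth {decadal_increment(age):.4f} m")
```

Under these assumptions, the curve is effectively flat at stand ages of several centuries, so a site index back-calculated from such a stand is determined almost entirely by the assumed curve shape rather than by the measured age.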
The site index values of P. koraiensis in studies from North Korea are detailed below. Baek and Park (2009) conducted surveys in reference locations of P. koraiensis forests and reported the following site index values: 12 m in Jagang-do in the northern inland region (stand age: 36.5; height: 13.7 m), 10 m in Gangwon-do in the central east coastal region and parts of Hamgyeongnam-do (stand age: 36.5; height: 11.8 m), and 10 m in the Hwanghaebuk-do region (stand age: 34; height: 11.1 m). Lee (2013a) found site index values in the range of 10.8-12.2 m (stand age: 40; height: 13.2 m; stand age: 20; height: 8.1 m) in Jeoncheon-gun (Jagang-do), Rinsan-gun (Hwanghaebuk-do), and Sepo-gun (Gangwon-do). These studies indicate that the P. koraiensis site index values reported in studies from North Korea are generally similar to the values estimated by our model.

Comparison of the estimated Pinus rigida site index values
The mean site index for P. rigida across North Korea was 10.8 m, as shown in Fig. 3d. The mean site index was generally high in wide areas of Gangwon-do and Hamgyeongnam-do along the east coast, followed by Pyeongannam/buk-do and Hwanghaebuk-do. Son et al. (2000) reported that P. rigida grows to an average height of 11.9 m by 30 years of age in ideally humid regions and up to 9.9 m in dry regions. Moreover, P. rigida is adapted to salinity and can grow well in the eastern and western coastal areas. The horizontal distribution map of P. rigida in Son et al. (2000) shows that the species is distributed along the west coast up to Changseong-gun (Pyeonganbuk-do) and along the east coast up to Kimchaek (Hamgyeongbuk-do), and that the northern isothermal boundary is 8 °C. This map shows a U-shaped pattern, which is similar to the pattern created by our model, consisting of regions with high site index values along the east coast-inland-west coast (Fig. 3d).

Comparison of the estimated Quercus mongolica site index values
The mean site index for Q. mongolica across North Korea was 10.1 m, as shown in Fig. 3e. The mean site index was high along the west coast, and wide areas with high values appeared in Pyeongannam/buk-do and Hwanghaebuk-do. In contrast, wide areas with low site index values occurred in Jagang-do, Ryanggang-do, and parts of Hamgyeongnam/buk-do.
Since natural Q. mongolica forests are dominant over afforested ones, there has been little research on Q. mongolica plantation forest sites and their site index values. Son et al. (2000) briefly mention that the site index of a Q. mongolica sprout forest ranges from 12 to 14 m. This report notes that Q. mongolica is distributed across the country except in the northern highlands, and is densely distributed in Jagang-do, except in the Nangrim Plateau, Hamgyeongnam-do, Hamgyeongbuk-do, and Gangwon-do. The distribution of Q. mongolica, as recorded in the Joseon Vegetation Cover Map (Kim et al. 2021b), spans a wide zone, especially around the cold and subarctic vegetation cover zones in the northern highlands that are populated with Khingan fir, spruce, and larch. Our model also showed low site index values for Q. mongolica in the northern highlands, and its distribution generally coincided with those reported by the Forestry Series of North Korea and the Joseon Vegetation Cover Map (Kim et al. 2021b).

Comparison of the estimated Robinia pseudoacacia site index values
The mean site index for R. pseudoacacia across North Korea was 10.2 m, as shown in Fig. 3f. Broad areas with high site index values were in Ryanggang-do, the northern part of Hamgyeongbuk-do, Jagang-do, Pyeongannam/buk-do, and parts of Hwanghaebuk-do, followed by the western coastal areas of Hamgyeongnam-do and Gangwon-do. Son et al. (2000) reported that R. pseudoacacia is distributed across the country except in the northern highlands, especially in Hwanghaebuk-do, Nampo, and Hamgyeongbuk-do. The model overestimated the site index values for Jagang-do and Ryanggang-do in the northern highlands, likely due to the lack of data for high-altitude regions in South Korea. A direct comparison with literature from North Korea could not be made due to a lack of research on the growth patterns and site index values of R. pseudoacacia.

Conclusion
Site index values of six major species across the Korean Peninsula were investigated using the random forest machine learning algorithm. Our model shows enhanced prediction accuracy, and thus addresses the drawbacks of existing models, such as linear regression. By generating high-resolution site index maps for North Korea based on the environmental factors of forest sites, forest productivity of North Korean forests could be evaluated in each region.
The site index values estimated by our model coincide with those reported in the literature from North Korea (Forestry Series of North Korea and research papers), except for those pertaining to regions in the northern highlands. A direct comparison could not be made, as it was not possible to survey and measure forest cover types and stand ages in North Korea. Nevertheless, our model, developed using South Korean data, shows high prediction accuracy. Moreover, the comparison with North Korean literature verified the high reliability of the North Korean site index maps generated using our model-estimated values. The lower prediction accuracy for the regions in the northern highlands is attributable to the lack of data for high-altitude regions (> 1,200 m) in South Korea. However, the regions in the northern highlands in North Korea have a lower deforestation rate. The intended application of this model is to evaluate the suitability of forest sites for reforestation; therefore, it has high utility for regions other than the northern highlands. To improve the reliability and accuracy of site index estimations for the northern highlands, a follow-up study is intended that will include data from the border regions between China and North Korea.
The site index maps for the six species constructed in this study were based on the suitability of tree species for afforestation on various sites. Therefore, these are expected to be useful for restoration programs in deforested lands in North Korea. However, the utility of this model is not limited to North Korea; its use can also be extended to other regions with low accessibility or data availability.