Stream gauge network grouping analysis using community detection

Stream gauging stations are important in hydrology and water science for obtaining water-related information, such as stage and discharge. However, for efficient operation and management, a more accurate grouping method is needed, which should be based on the interrelationships between stream gauging stations. This study presents a grouping method that employs community detection based on complex networks. The proposed grouping method was compared with the cluster analysis approach, which is based on statistics, to verify its adaptability. To achieve this goal, 39 stream gauging stations in the Yeongsan River basin of South Korea were investigated. The numbers of groups (clusters) in the study were two, four, six, and eight, which were determined to be suitable by fusion coefficient analysis. Ward’s method was employed for cluster analysis, and multilevel modularity optimization was applied for community detection. A higher level of cohesion between stream gauging stations was observed in the community detection method at the basin scale and the stream link scale within the basin than in the cluster analysis. This suggests that community detection is more effective than cluster analysis in terms of hydrologic similarity, persistence, and connectivity. As such, these findings could be applied to grouping methods for efficient operation and maintenance of stream gauging stations.


Introduction
Stages or water levels are widely used in various fields, such as hydrology, water resource management, and environmental science, and one of the primary hydraulic structures for stage measurement is a stream gauging station (Sauer and Turnipseed 2010). Stage and flow data observed at stream gauging stations provide important water information for flood forecast warnings, operation of multipurpose dams, identification of available water resources, and operation of agricultural reservoirs.
Consistent efforts have been made to achieve accurate stage measurements and quality control as high-quality stage data affect the reliability of flood, drought, water quality, and ecological management (Joo et al. 2019a, b), which require efficient management of stream gauging stations.
The operation and management of stream gauging stations require basin-scale analysis that accounts for upstream and downstream reaches, main streams, and tributaries. Specifically, stream gauge networks organized into small or mid-sized groups of gauging stations that share common characteristics within a basin are more favorable for the operation and management of the stations. Stage data could be managed more efficiently with accurate grouping methods based on the characteristics of stream gauging stations. In other words, an operation and maintenance strategy tailored to each group of stream gauging stations would allow stage data to be managed when problems arise. This requires a reasonable comparison and review of the grouping methods for stream gauging stations within a basin.
Popular grouping methods in hydrology include the well-known cluster analysis technique. Cluster analysis is based on statistics and identifies differences between groups by organizing similar objects into the same group. Cluster analysis based on time series data has been widely used in the field of hydrology (Kumar et al. 2015; Lin and Chen 2005; Kyung et al. 2007; Ouyang et al. 2010; Corduas 2011). Kumar et al. (2015) performed a cluster analysis to distinguish between seasonal periods accurately using a metric function based on the error distribution of seasonal data. Lin and Chen (2005) developed a time series prediction model for groundwater based on the self-organizing map (SOM), which is a two-dimensional map that directly identifies the number of clusters hidden in the radial basis function network (RBFN). Kyung et al. (2007) used cluster analysis to create a Korean version of the hydrological drought severity-area-duration (SAD) curve, and high levels of severity were observed in the north and central areas along the eastern coast of the Korean Peninsula. Ouyang et al. (2010) performed a K-means cluster analysis on the mean monthly discharge, monthly maximum discharge, monthly amplitude, and monthly standard deviation from 1961 through 2000 for the Shaligunlanke Station in the Tarim River basin of China. The results showed that the annual process of daily discharge could be classified into five segments. Corduas (2011) performed a cluster analysis based on the bond energy algorithm (BEA), which is applicable to complex data arrays. The analysis was based on 89 hydrological time series of mean daily discharge from rivers in Oregon and Washington in the United States.
Network theory, which originated in the eighteenth century (Euler 1741), has advanced to a new stage over the last two decades through complex network studies of small-world networks, scale-free networks, network motifs, and community structure. Complex network theory is an important tool because its practical application can describe complicated and varied phenomena (Sivakumar and Woldemeskel 2015). Complex network theory has also been applied recently in the field of hydrology (Rinaldo et al. 2006; Malik et al. 2012; Boers et al. 2013; Scarsoglio et al. 2013; Halverson and Fleming 2015; Sivakumar and Woldemeskel 2014; Fang et al. 2017; Han et al. 2018; Alarcòn and Lozano 2019; Kim et al. 2019). Community detection is a method based on network theory for grouping nodes that share similar or common goals. Fang et al. (2017) applied community detection in hydrology and organized communities using six methods (edge betweenness centrality, the greedy algorithm, multilevel modularity optimization, the leading eigenvector method, the label propagation method, and the Walktrap method). The analysis was based on the similarity of daily streamflow for 1663 gauging stations across the Mississippi River in the United States. Halverson and Fleming (2015) organized communities of stream gauging stations located in the Coast Mountains of British Columbia and the Yukon in Canada according to seasonal flow regimes for each region and geographical proximity. Alarcòn and Lozano (2019) used Interbasin Transfer (IBT) for Spanish river basins to build a community structure consisting of seven small groups of two or three nodes.
Grouping methods have been used in the field of hydrology, including existing stream gauging stations, to identify differences in hydrological properties between groups. However, there have been insufficient efforts to review the accuracy and reliability of such grouping methods. Moreover, grouping methods have been recognized as a secondary process performed before the primary analysis. Comprehensive maintenance should be ensured across the board for stream gauging stations in the same group by applying more accurate grouping methods. Hydrologic aspects of gauging stations are directly affected by upstream gauging stations, and no individual station is independent.
The aim of this study is to present a grouping method using community detection based on complex networks. The proposed grouping method was compared with a statistical cluster analysis approach to verify its adaptability. This paper is organized as follows. Section 2 describes community detection and clustering methods, including multilevel modularity optimization and Ward's method, which are used in this study, and the methodology for selecting the optimal number of groups. Section 3 constructs the stream gauge network, which consists of nodes and links, using 39 stream gauging stations in the Yeongsan River basin in South Korea and analyzes its community detection characteristics. A hierarchical cluster analysis is then conducted according to the similarity between water levels, and the grouping result and its characteristics are analyzed. Based on the basin hydrology, the applicability of the complex network-based community detection method is compared with that of the statistical cluster analysis method. Finally, Sect. 4 provides a summary of the study.
A network or graph is a set of points that are connected by a series of lines, as shown in Fig. 1. Points are called vertices or nodes, and lines are called edges or links. A network can be expressed as G = {P, E}, where P is a set of N nodes (P_1, P_2, …, P_N), and E is a set of n links. Figure 1 shows a network consisting of N = 7 nodes and n = 8 links. This network has a set of nodes P = {1, 2, 3, 4, 5, 6, 7} and a set of links E = {(1,7), (2,7), (3,5), (3,7), (4,7), (5,6), (6,7)}. Figure 1 shows the simplest form of a network; real networks may take more complex forms.
Examples of more complex networks include: (1) a network with one or more different types of nodes and links, (2) a network with different weights for different nodes and links, depending on the nodes and connection strength, (3) a network with cyclic or acyclic links, (4) a network with multi-links, self-links, and hyper-links, and (5) a network with two types of nodes that are separated by type and operate independently within their own type. Sivakumar (2015) provides a more detailed description of such networks.
Network characteristics can be studied in different ways. The key concepts of a network in the context of the modern theory of complex networks include centrality analysis, the clustering coefficient, degree distribution, and community detection. This study uses community detection for grouping the stream gauging stations within a basin.

Community detection
In complex networks, nodes cluster together and form a group. The nodes in a group are closely connected, and the attributes of a group are typically independent of other groups. These groups are called communities, and finding communities is called community detection.
A community can also be called a cluster in a broad sense. However, the difference between clustering and community detection is that groups are formed only by the similarity of data in cluster analysis, while groups are formed by data similarity and by network theory and structure in the community detection method. Modularity is the first quantity used when constructing communities. For a given assignment of nodes to communities within the entire network, modularity quantifies the difference between the number of connections inside the communities and the number of connections expected at random. The modularity equation proposed by Newman (2004) is commonly cited:

$$Q = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ a_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{1}$$

where $m$ is the total number of links, $n$ is the total number of nodes, $a_{ij}$ is the connectivity between nodes $i$ and $j$, and $k_i$ and $k_j$ are the numbers of all connections to nodes $i$ and $j$, respectively. In addition, $\delta(c_i, c_j)$ is one when $c_i$ and $c_j$ are in the same community and zero when they are in different communities.
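As an illustration of the modularity definition above, the following minimal Python sketch evaluates Q for a small undirected, unweighted graph. The function name `modularity`, the six-node test graph (two triangles joined by one link), and the community labels are assumptions made for this example and are not taken from the study.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: list of (i, j) pairs without duplicates or self-links.
    community: dict mapping each node to its community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    # Store both directions so that a_ij can be looked up symmetrically.
    linked = set(edges) | {(j, i) for i, j in edges}
    q = 0.0
    for i in community:
        for j in community:
            if community[i] != community[j]:
                continue  # delta(c_i, c_j) = 0
            a_ij = 1.0 if (i, j) in linked else 0.0
            q += a_ij - degree[i] * degree[j] / (2.0 * m)
    return q / (2.0 * m)


# Hypothetical example: two triangles connected by the single link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {node: "a" for node in range(6)}

q_split = modularity(edges, split)    # one community per triangle
q_single = modularity(edges, single)  # all nodes in one community
print(round(q_split, 4), round(q_single, 4))
```

Placing every node in one community gives Q = 0 by construction, whereas separating the two triangles gives a clearly positive value, which is the behavior that community detection methods exploit.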
Optimizing modularity requires time-consuming calculations. Possible methods for overcoming this disadvantage include edge betweenness centrality (Newman and Girvan 2004), the greedy algorithm (Clauset et al. 2004), the Walktrap method (Pons and Latapy 2005), the leading eigenvector method (Newman 2006), the label propagation method (Raghavan et al. 2007), and multilevel modularity optimization (Blondel et al. 2008). The aim of all methods is to improve modularity to optimize community detection.
Multilevel modularity optimization (or the Louvain method) is the most recently developed of these modularity optimization methods and was employed for this study because it is designed to address the mentioned problems. For example, the greedy algorithm method and the multilevel modularity optimization method have the fastest community detection but poor optimization as they tend to create super communities. Multilevel modularity optimization consists of two phases that are repeated iteratively. The modularity gain obtained by moving node $i$ into a community can be expressed as (Blondel et al. 2008):

$$\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right] \tag{2}$$

where $\Sigma_{in}$ is the sum of the weights of the links inside the community, $\Sigma_{tot}$ is the sum of the weights of the links incident to nodes in the community, $k_i$ is the sum of the weights of the links incident to node $i$, $k_{i,in}$ is the sum of the weights of the links from $i$ to nodes in the community, and $m$ is the sum of the weights of all the links in the network. Phase 1 forms communities using the improved modularity, and Phase 2 combines the communities created in Phase 1 into blocks that are each treated as a node. Next, the algorithm in Phase 1 is applied again to the newly modified network. The procedure stops when no further changes occur in Phase 1 following Phase 2.
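The local-moving step of Phase 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of Blondel et al. (2008): it handles only unweighted links, omits the Phase 2 aggregation, and recomputes the full modularity for every candidate move instead of using the closed-form gain, which is slow but easy to check. The names `modularity` and `local_moving` and the test graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Q as a sum over communities of L_c/m - (d_c/(2m))^2.

    L_c is the number of links inside community c and d_c is the
    total degree of its nodes (undirected, unweighted graph).
    """
    m = len(edges)
    inside = defaultdict(int)
    total_degree = defaultdict(int)
    for i, j in edges:
        total_degree[membership[i]] += 1
        total_degree[membership[j]] += 1
        if membership[i] == membership[j]:
            inside[membership[i]] += 1
    return sum(inside[c] / m - (d / (2.0 * m)) ** 2
               for c, d in total_degree.items())


def local_moving(edges, n_nodes):
    """Phase 1 only: move nodes to neighbouring communities while Q increases."""
    neighbours = defaultdict(set)
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    membership = list(range(n_nodes))  # every node starts in its own community
    improved = True
    while improved:
        improved = False
        for node in range(n_nodes):
            current = membership[node]
            best_label, best_q = current, modularity(edges, membership)
            candidates = sorted({membership[v] for v in neighbours[node]})
            for label in candidates:
                if label == current:
                    continue
                membership[node] = label  # tentative move
                q = modularity(edges, membership)
                if q > best_q + 1e-12:  # accept only strict improvements
                    best_label, best_q = label, q
            membership[node] = best_label
            if best_label != current:
                improved = True
    return membership


# Hypothetical example: two triangles connected by the single link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
membership = local_moving(edges, 6)
print(membership, round(modularity(edges, membership), 4))
```

Because every accepted move strictly increases Q and the number of possible partitions is finite, the loop terminates. In a full multilevel scheme the resulting communities would then be collapsed into single nodes and the same step repeated.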

Cluster analysis
A cluster is based on the similar properties present in the interconnection between nodes. The task of classifying clusters based on similarity is called cluster analysis or clustering. Cluster analysis is based on statistics and can be classified into two types: hierarchical (agglomerative) clustering and partitional (divisive) clustering. Hierarchical clustering yields different cluster results step by step without predetermining the number of clusters. Partitional clustering is a method of specifying the number of clusters in advance. Based on these methods, the general procedure for network clustering is shown in Fig. 2.
Cluster analysis was employed to split stream gauging stations into groups based on stage data obtained from the stations. Hierarchical cluster analysis was applied because it can derive clusters of stream gauging stations without a predetermined number of clusters (Kaufman and Rousseeuw 2005). Several methods can be used to form clusters, such as single linkage, complete linkage, average linkage, and Ward's method.
Ward's method was used in this study. Unlike other methods, it is less sensitive to noise and outliers in data. Ward's method is very efficient and is widely used in many fields of science (Yoo et al. 2011). Other methods, such as single linkage, complete linkage, and average linkage, establish a group based on the similarity of each group using the squared Euclidean distance (L2). In contrast, Ward's method measures similarity using the error sum of squares (ESS) that results when two groups are combined. In other words, it conducts grouping so as to minimize the increase of the ESS. In the initial clustering, each node forms its own cluster, which can be expressed as $ESS_i = 0$ for all $i$. The ESS increases as further clustering occurs, which can be written as:

$$ESS = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \sum_{k=1}^{p} \left( X_{ijk} - \bar{X}_{i \cdot k} \right)^2 \tag{3}$$

where $X_{ijk}$ is the value of variable $X_k$ for the $j$th object in the $i$th cluster, $K$ is the number of clusters, $n_i$ is the number of objects in the $i$th cluster, $p$ is the number of variables, and $\bar{X}_{i \cdot k}$ is the mean of $X_k$ in the $i$th cluster.
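The merging rule of Ward's method can be illustrated with a short Python sketch that computes the ESS of a cluster and selects the pair of clusters whose merger increases the total ESS the least. The function names and the three one-dimensional sample points are assumptions chosen for this example, not data from the study.

```python
def ess(cluster):
    """Error sum of squares of one cluster (list of equal-length tuples)."""
    n = len(cluster)
    dims = len(cluster[0])
    means = [sum(point[k] for point in cluster) / n for k in range(dims)]
    return sum((point[k] - means[k]) ** 2
               for point in cluster for k in range(dims))


def ess_increase(a, b):
    """Increase in total ESS caused by merging clusters a and b."""
    return ess(a + b) - ess(a) - ess(b)


def best_merge(clusters):
    """Index pair (i, j) of the two clusters whose merger increases ESS least."""
    pairs = [(i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    return min(pairs,
               key=lambda ij: ess_increase(clusters[ij[0]], clusters[ij[1]]))


# Hypothetical example: three singleton clusters on a line.
clusters = [[(0.0,)], [(1.0,)], [(5.0,)]]
first_pair = best_merge(clusters)
first_increase = ess_increase(clusters[0], clusters[1])
print(first_pair, first_increase)
```

Each singleton cluster has ESS = 0, so the first merger joins the two closest points. Repeating the step until one cluster remains yields the hierarchy that a dendrogram displays.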

Study area
The Yeongsan River basin is located in southwestern South Korea (34°40′16″–35°29′01″ N, 126°26′12″–127°06′07″ E). The need for maintenance based on stages at stream gauging stations has long been recognized for the Yeongsan River basin. The Yeongsan River is one of the four major rivers in South Korea. It has a basin area of 3455 km² and a river length of 129.5 km and accommodates 39 stream gauging stations. Of the 39 stream gauging stations, 14 are deployed in the Yeongsan River, which is the main stream, and 25 are in the tributaries. These stations are under the supervision of the Ministry of Environment (ME) at the national level, which reflects the importance of stream gauging station management and stage data. Figure 3 shows the location and the corresponding number of stream gauging stations within the basin.

Stage data setting
The data collection period must be predetermined because both cluster analysis and the complex network configuration for community detection are based on time series stage data obtained at each stream gauging station. To ensure the reliability of the stage data, a large amount of data for an extended period is needed (e.g., 30 years, which is generally considered suitable for hydrological analysis). This study employs a grouping method based on the similarity of stage data, which requires an analysis of data collected over the same period of time from all stream gauging stations (Fang et al. 2017). Considering all the data, this study employed daily stage data collected over five years (January 2011 to December 2015), a period for which consistent data were available for all stream gauging stations. Figure 4 summarizes the distribution of stage data for each stream gauging station in a box plot.
Water levels remained constant at most stream gauging stations but showed high degrees of variability at some stations. Stream gauging stations exhibited higher variability in water levels over short and long periods of time, which can be attributed to the fact that the stations are located directly downstream of dams or near weirs (Stations 1, 5, 13, 17, 23, and 30), which are operated for flow control. Moreover, most outliers observed at each gauging station resulted from a rapid rise in the stage caused by localized heavy rainfall during the flood season. Water level events that vary significantly with regional, meteorological, and manmade factors were also included for analysis as these events have effects on other gauging stations at the watershed level.

Community detection of stream gauging stations based on network theory
In a complex network, stream gauging stations can be represented as nodes, but no links initially connect them. Stream gauging stations are installed along a water system that only serves as a means of accommodating the stations, not as a link to connect them. Therefore, links should be constructed to connect each stream gauging station based on the correlation of stage data. This requires a focus on the network configuration that changes with threshold values (T) of similarity in stage data to investigate the diverse communities. Therefore, a complex network was constructed with threshold values based on the correlation between 0.1 and 1.0, and community formation was carried out for the 39 stream gauging stations using multilevel modularity optimization (Fig. 5). The results showed that networks configured differently depending on the threshold values also had different community formation. Depending on the threshold value, the number of communities could be as high as 39, which is the total number of stream gauging stations. The appropriate number of communities could be estimated as the number at which the modularity is maximized in the multilevel modularity optimization method (see Eq. 2). The maximum modularity (Q = 0.518) is obtained with eight communities, and this study examines 2, 4, 6, and 8 communities as four grouping events, which correspond to threshold values of 0.2, 0.4, 0.5, and 0.6, respectively. The networks for the four events had different link configurations and exhibited the typical structure of complex networks. Figure 6 shows the community results for different stream gauging stations, where boxes of the same color indicate that they belong to the same community group. Figure 7 shows the community results for different locations of the stream gauging stations.
The solid lines between stream gauging stations are based on threshold values and determine the network structure.
The results of community detection showed that clustering mostly took place in a group (the green group) across all group events, similar to the cluster analysis. However, the results were more centralized in community detection. Furthermore, stream gauging stations (nodes) that are not connected by links did not always organize into different communities. In addition to nodes, community formation in the network involves other factors, such as the number of links for neighboring adjacent nodes and the degree and intensity of connection between links. In other words, even if nodes are not linked to each other due to a lack of similarity in stage data, they can still be organized into the same community if indirectly connected by other nodes.
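The threshold-based network construction described in this section can be sketched in Python as follows. The study does not state which correlation coefficient was used or whether the comparison with T is inclusive, so the Pearson coefficient and the `>=` test are assumptions of this sketch; the function names and the three short stage series are likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (assumed similarity measure)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # undefined for a constant series


def threshold_network(series, threshold):
    """Link two stations when the correlation of their stage data is >= threshold.

    series: dict mapping a station id to its list of stage values.
    Returns a list of (station_a, station_b) links.
    """
    ids = sorted(series)
    return [(a, b)
            for index, a in enumerate(ids)
            for b in ids[index + 1:]
            if pearson(series[a], series[b]) >= threshold]


# Hypothetical stage series for three stations (not data from the study).
stages = {
    1: [1.0, 2.0, 3.0, 4.0],
    2: [2.0, 4.1, 5.9, 8.0],
    3: [4.0, 1.0, 3.0, 2.0],
}
links_strict = threshold_network(stages, 0.6)
links_loose = threshold_network(stages, -1.0)
print(links_strict, len(links_loose))
```

Raising the threshold removes links and tends to split the network into more communities, while lowering it connects more stations, which is why the network structure and the resulting communities change with T.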

Cluster analysis of stream gauging station based on stage data
A hierarchical cluster analysis was performed for the 39 stream gauging stations. The resulting dendrogram is shown in Fig. 8. The maximum number of clusters is 39, which is the total number of stream gauging stations investigated. The clustering can be divided into 10 steps depending on the number of groups. Determining the number of clusters is extremely challenging in a typical cluster analysis. This also applies to the case where the number of clusters is determined using the similarity of stage data for efficient management of stream gauging stations. Many studies have been conducted to determine the appropriate number of clusters (Aaker et al. 2001). When using Ward's method, the ESS variance with the number of clusters is represented by the fusion coefficient (Eq. 3). The fusion coefficient derived at every stage of clustering was effectively used for this purpose. The fusion coefficient is estimated by considering the distances between clusters at each stage of clustering, so its value can be used to determine how newly made clusters differ from each other. That is, if the fusion coefficient shows a significant increase when decreasing the number of clusters, two relatively different clusters can be made into a single cluster at the current stage (Aldenderfer and Blashfield 1984;Yoo et al. 2011). Figure 9 summarizes the derived fusion coefficients. Like community detection, the results show that a significant change in the fusion coefficient is observed when the number of clusters is equal to eight, which suggests that the appropriate number of clusters will be eight or less. Therefore, the changes in four events were investigated, which consisted of two, four, six, and eight clusters (groups) for the same comparison with the result of community detection.
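One simple way to read a fusion coefficient curve is to look for the largest increase between consecutive clustering stages. The sketch below illustrates that idea only; it is a simplification under stated assumptions, the function name is hypothetical, and the coefficient values are invented for the example rather than taken from Fig. 9.

```python
def clusters_before_largest_jump(fusion):
    """Number of clusters that exist just before the largest fusion jump.

    fusion: dict mapping a number of clusters k to the fusion coefficient
    of the merger that leaves k clusters. A large increase when going from
    k + 1 to k clusters suggests stopping at k + 1 clusters.
    """
    jumps = {k + 1: fusion[k] - fusion[k + 1]
             for k in sorted(fusion) if k + 1 in fusion}
    return max(jumps, key=jumps.get)


# Hypothetical fusion coefficients (not the values of the study).
fusion = {5: 1.0, 4: 1.4, 3: 1.9, 2: 6.0, 1: 7.0}
suggested = clusters_before_largest_jump(fusion)
print(suggested)
```

In this invented example the coefficient rises sharply when three clusters are reduced to two, so merging at that stage would combine two relatively different clusters and three clusters would be retained.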
The results of cluster analysis performed on four events based on fusion coefficients are illustrated in Figs. 10 and 11. The boxes of the same color in Fig. 10 indicate that they belong to the same cluster group, while Fig. 11 shows the clustering results according to the relative location of the stream gauging stations.
The results show that clustering mostly took place in a group (the green group) across all group events. This can be interpreted as indicating possible similarity of stage data at the basin level. However, at the same time, it indicates that stream gauging stations in the target basin have somewhat unclear clustering without distinguishing characteristics compared with community detection (comparing Figs. 8 and 11). Water levels at nearby stream gauging stations are generally similar to each other. In contrast, stream gauging stations within the target river basin were often found to belong to different groups, despite their proximity. This can be attributed to different stage data resulting from topographic conditions, such as river bed elevation, despite the close proximity of stream gauging stations. A more quantitative analysis is required to provide a more detailed comparison between community detection and cluster analysis.

Comparison and discussion of grouping methods based on basin hydrology
The grouping methods for stream gauging stations based on cluster analysis and community detection were compared in terms of basin hydrology and evaluated for suitability. As shown in Table 1, when the number of groups is two, four, six, and eight, the number of stream gauging stations that can be included in communities and clusters is 156 in total (39 stations × 4 events). As mentioned in Sect. 3.4, gauging stations were found to belong to one group (the green group) in most cases and for both grouping methods.
Table 1 Number of stream gauging stations in each group for community detection and cluster analysis

Number of    Community detection (group)              Cluster analysis (group)
groups       1    2    3    4    5    6    7    8     1    2    3    4    5    6    7    8
2            33   6    -    -    -    -    -    -     24   15   -    -    -    -    -    -
4            29   6    3    1    -    -    -    -     24   6    5    4    -    -    -    -
6            26   6    3    1    2    1    -    -     21   4    5    4    3    2    -    -
8            25   4    3    1    2    1    1    2     21   4    3    3    3    2    2    1
Sub sum      113  22   9    3    4    2    1    2     91   29   13   10   6    4    2    1
Total sum    156                                      156

Stations were more likely to belong to one group for communities than clusters, and very few stations changed to another group as grouping took place. This indicates that community detection in the stream gauge networks at the basin level resulted in relatively high levels of similarity; that is, both direct and indirect cohesion occurred among stream gauging stations, in contrast to cluster analysis. For hydrologic comparison by group and grouping method, it is necessary to investigate the groups connected by the same stream links between gauging stations. Therefore, the changes in stream gauging stations according to grouping method were studied for a total of 12 stream links. The main stream of the Yeongsan River within the basin was set as stream link Index A, and the remaining tributaries were set as stream links B through L (Table 2 and Fig. 12). Table 2 shows the stream gauging stations located in stream links A through L and the group structure based on cluster and community. The total value in Table 2 represents the number of stream gauging stations located in each stream link. In addition, column (1) represents the number of gauging stations in the group that has the largest number of stations. This value can be used to analyze the cohesion between gauging stations for different stream links in comparison to the entire set of stations. Column (2) represents the remaining stations other than those listed in column (1). Therefore, the sum of the number of gauging stations in the group consisting of the most stations and the number of the remaining stations is the total number of gauging stations located in each stream link.
As shown in Table 2, the numbers of stream gauging stations located in stream links A through L are 15, 1, 1, 1, 3, 2, 4, 5, 2, 2, 2, and 1, respectively. This indicates that the largest share of gauging stations in the Yeongsan River basin is located in the main stream (stream link A) of the Yeongsan River. In stream link A, the number of stream gauging stations (column (1)) in the group consisting of the most stations under cluster analysis was 12, 12, 9, and 9 when the number of groups was two, four, six, and eight, respectively. These values are smaller than those of the community detection method (14, 13, 12, and 12). This indicates that the stream gauging stations in stream link A are densely connected by the stream link and exhibit relatively strong cohesion in the application of community detection.
In contrast, when the number of groups was eight, the number of stream gauging stations in the group consisting of the most stations in the E stream link was two for cluster analysis and one for community detection. This indicates that cluster analysis identifies gauging stations with strong group cohesion. As a result of community detection, three gauging stations were organized into different groups. Figure 12 shows a diagram and basin map of gauging station grouping for different stream links. In both methods, stream gauging stations such as 1, 17, 25, and 39 often did not belong to the same group as the nearby stations, which can be attributed to the dissimilarity of the stage due to the topographic attributes rather than meteorological effects. Moreover, stream gauging station 8 located in stream link D could not be grouped into the main stream of the Yeongsan River (stream link A) in the cluster analysis due to lack of similarity in stage data between the two stream links. In contrast, station 8 was organized into the same group in the community detection method based on network theory. This indicates that stream link D affects the main stream of the Yeongsan River despite the lack of similarity in stage data. This also suggests that station 8 must be operated and managed in connection with gauging stations located on the main stream in community detection.
To make a quantitative comparison between the two methods at the stream scale, cohesion among the gauging stations was expressed as:

$$G_C = \frac{S_M}{S_T} \times 100 \tag{4}$$

where $G_C$ is the cohesion according to grouping methods, $S_M$ is the number of gauging stations in the group consisting of the most stations in a certain stream link, and $S_T$ is the total number of gauging stations in a certain stream link, which can be expressed as (column (1)/Total) × 100 based on Table 2. The cohesion calculated using this method is shown in Table 3; the stream links consisting of a single station (B, C, D, and L) were excluded because the stream-scale comparison was irrelevant in this case.
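As a worked example of the cohesion measure, the short Python sketch below applies (column (1)/Total) × 100 to stream link A, using the counts reported in the text for that link (15 stations in total; largest-group sizes of 12, 12, 9, and 9 for cluster analysis and 14, 13, 12, and 12 for community detection). The function and variable names are assumptions made for this illustration.

```python
def cohesion(largest_group_size, total_stations):
    """Cohesion G_C in percent: share of stations in the largest group of a stream link."""
    return largest_group_size / total_stations * 100.0


# Stream link A, as reported in the text (15 stations in total).
total_a = 15
numbers_of_groups = [2, 4, 6, 8]
largest_cluster = [12, 12, 9, 9]       # cluster analysis
largest_community = [14, 13, 12, 12]   # community detection

cluster_cohesion = {g: cohesion(s, total_a)
                    for g, s in zip(numbers_of_groups, largest_cluster)}
community_cohesion = {g: cohesion(s, total_a)
                      for g, s in zip(numbers_of_groups, largest_community)}

for g in numbers_of_groups:
    print(g, round(cluster_cohesion[g], 1), round(community_cohesion[g], 1))
```

For every number of groups, the cohesion of stream link A is higher for community detection than for cluster analysis, for example about 93% versus 80% with two groups.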
The results showed that community detection identified highly cohesive gauging stations in most stream links compared to cluster analysis. This appears more evident in the stream links containing many gauging stations, such as A and G, throughout the entire group. In other words, gauging stations in the same stream link are more likely to be grouped together in the community detection method. Other numbers of groups (three, five, and seven) that were not considered in this study show the same results. This indicates that community detection based on the network structure of nodes and links is more suitable than cluster analysis in hydrology. The streamflow in a stream link generally exhibits a high degree of hydrologic similarity, persistence, and connectivity, indicating that gauging stations are not independent but are closely related to each other. Therefore, a network-based community detection method that deals with communities with high cohesion would be a better alternative to the cluster analysis method, which is simply based on data correlation. In the community detection method, stream gauging stations that are not organized into major groups at the stream link scale require special maintenance tailored to the attributes of the nonmajor groups. The results of this study are expected to serve as an appropriate selection method for a small number of stream gauging stations with different characteristics.
Fig. 12 Diagram and basin map based on cluster analysis and community detection

Conclusions
This study evaluated the adaptability of community detection based on complex networks as a grouping method for efficient operation and maintenance of stream gauging stations. To achieve this goal, 39 stream gauging stations in the Yeongsan River basin of South Korea were investigated using the community detection method. These results were compared with statistical cluster analysis results. Multilevel modularity optimization was employed for community detection, and Ward's method was employed for cluster analysis. The number of groups was set to two, four, six, and eight based on modularity and fusion coefficient analysis, respectively.
The results showed that communities are more likely to be arranged into a group in the community detection method than in the cluster analysis. This indicates that the grouping of stream gauging stations at the basin scale has higher levels of cohesion in community detection than in cluster analysis. For comparison purposes in terms of hydrological conditions, the changes of the stream gauging stations located in a total of 12 stream links (A through L) and including the main stream of the Yeongsan River were investigated for different groups and methods. Higher levels of cohesion among the gauging stations were observed in the community detection method in most stream links.
High cohesion in a stream link means a high degree of hydrologic similarity, persistence, and connectivity. In turn, this makes the community detection method a better candidate for grouping as it can successfully reproduce the general stream attributes. The present findings are expected to serve as a grouping method for the comprehensive management of stream gauging stations. This study analyzed only water level. Further work is needed on the integrated grouping of water level and other hydrologic components such as water utilization, flow control, environment, and gauging station impact.