In the following section, we present a systematic framework designed to extract and map the evolution of UAOIs from the subset of geotagged Flickr photographs outlined in the previous section. Our methodology consists of two main parts: cluster detection and boundary delineation.
Extracting urban areas of interest with the hierarchical density-based spatial clustering of applications with noise algorithm
We define UAOIs as those areas where multiple Flickr users have gathered and taken large numbers of spatiotemporally clustered photographs, reflecting a consensual view that some aspect of the urban environment is of interest. The extraction of such areas can be understood as a clustering problem, in particular one that aims to identify robust, non-overlapping and dense concentrations of points. Following recent advances in the literature, we selected a density-based method. The advantages of such an approach are that it can produce results without pre-specifying the number of clusters, and that it is robust to arbitrary cluster shapes and to outliers or noise that deviate from the main spatial distribution (Hans-Peter et al. 2011).
We applied HDBSCAN (hierarchical density-based spatial clustering of applications with noise; Campello et al. 2013) as our clustering method because it overcomes several of the major drawbacks of other density-based algorithms. In contrast to more traditional algorithms, there is only one parameter to tune in HDBSCAN: the other key parameter in the original DBSCAN implementation, i.e. the neighbourhood distance threshold (epsilon), is endogenously determined by the method. This approach represents a step forward in the direction of more robust, automated and data-driven techniques for the delineation of UAOIs. McInnes et al. (2017) describe the HDBSCAN process as comprising five steps:
Transform the space based on the estimates of density by defining a ‘mutual reachability’ distance, which is a new distance metric between points;
Build a minimum spanning tree to implement single-linkage clustering, which is a core feature of this algorithm;
Construct a cluster hierarchy of connected components by iteratively sorting the edges of the tree by distance in an increasing order. The result can be viewed as a dendrogram that shows where robust single-linkage stops;
Condense the cluster hierarchy shown in the dendrogram into a smaller tree by attaching more data to each node;
Extract clusters that persist and are robust from the condensed tree.
Operationally, epsilon values are generated automatically from the different density levels of the single-linkage hierarchy, which allows HDBSCAN to find clusters of varying density. The method also improves on OPTICS and DBSCAN by providing a clustering hierarchy from which a simplified tree of the most significant clusters (i.e. those with maximised stability) can be easily extracted.
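To make the first two steps concrete, the following is a minimal sketch, not the implementation used in this study, of the mutual reachability distance and the minimum spanning tree built on it. It assumes Euclidean distances on projected coordinates; the neighbour count `k`, the function names and the toy points are hypothetical and chosen only for illustration.

```python
from math import dist, inf

def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (k is an assumed setting)."""
    cores = []
    for p in points:
        ordered = sorted(dist(p, q) for q in points)
        cores.append(ordered[k])  # ordered[0] is the point itself at distance 0
    return cores

def mutual_reachability(points, k):
    """Dense matrix of max(core(a), core(b), d(a, b)) for every pair of points."""
    core = core_distances(points, k)
    n = len(points)
    return [[0.0 if i == j else max(core[i], core[j], dist(points[i], points[j]))
             for j in range(n)] for i in range(n)]

def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense matrix; returns (i, j, weight) edges, lightest first."""
    n = len(weights)
    in_tree = [False] * n
    best = [inf] * n      # cheapest known connection of each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return sorted(edges, key=lambda e: e[2])

# Hypothetical toy data: two tight groups of three points each.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
edges = minimum_spanning_tree(mutual_reachability(points, k=2))
```

With the edges sorted by increasing weight, as in the third step, the single long edge bridging the two toy groups ends up last in the list; cutting it separates the data into the two dense groups.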
When using HDBSCAN, the only parameter to specify is the minimum cluster size (mclSize), representing the minimum number of points (i.e. Flickr photographs) required for a UAOI to exist. To select an appropriate mclSize, we extensively explored the sensitivity of the final solution to changes in this parameter. A few representative thresholds, from 10 to 1000, were set as mclSize and applied in all time slots. Figure 2 presents example outputs from this sensitivity analysis. If mclSize is small (e.g. 10 or 50), more UAOIs are identified but greater numbers of points are also labelled as noise (i.e. not part of any cluster); if mclSize is larger (e.g. 500 or 1000), more robust results emerge, although clusters are significantly larger, causing potentially interesting but smaller areas to be missed. Furthermore, because the number of Flickr photographs and users varies between months, it would arguably be inappropriate to assign a single absolute value to all time sequences. To handle these issues, values of 1–4% of the Flickr photographs in each month were assigned to mclSize across different iterations, as discussed previously, in order to produce a number of clusters that fits the definition of a UAOI. After multiple experiments, 1% of the Flickr photographs in each month was used as the value of the minimum cluster size parameter, ensuring a higher number of UAOIs while remaining cognisant of smaller clusters that may be of relevance.
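The relative threshold can be sketched as follows. This is only an illustration: the rounding rule (rounding up), the lower bound of 2, the function name and the monthly counts are all assumptions rather than details reported here.

```python
def min_cluster_size(n_photos, percent=1, floor=2):
    """Minimum cluster size as a percentage of the photographs in one month.

    Rounds up using integer arithmetic and never returns less than `floor`
    (both choices are assumptions made for this sketch).
    """
    return max(floor, -(-n_photos * percent // 100))

# Hypothetical monthly photograph counts.
monthly_photos = {"2015-01": 4550, "2015-02": 12000, "2015-03": 150}
mcl_sizes = {month: min_cluster_size(n) for month, n in monthly_photos.items()}
```

Because the threshold scales with the monthly volume, a busy month and a quiet month are held to the same relative standard rather than the same absolute count.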
UAOIs should be formed through the collective actions of multiple users within each specific time slice, but the 1% parameter selection does not by itself ensure that a set number of Flickr users is captured in each UAOI. It was therefore necessary to verify the practical significance of the extracted UAOIs. An intuitive approach is to examine the relationship between the number of Flickr photographs and the number of users in each month. If they are correlated, then we can estimate the number of Flickr users from the number of photographs per month. Specifically, the scatter plot in Fig. 3a shows a high positive correlation between the two variables, with a Pearson coefficient of 0.85, implying that as the number of photographs increases, so too does the number of users in a given month. A linear regression model was then fitted using these two variables so that the user number could be estimated from the number of photographs in each month. The resulting R-squared was 0.725, with a p value for the coefficient below 0.05, implying that the model is statistically significant and that 72.5% of the variation in user numbers could be explained by the model. Figure 3b presents the number of photographs, users, and the calculated user number across the time sequences. The red line fluctuates slightly around the black line, meaning that using 1% of the photograph number as the HDBSCAN parameter value can be interpreted as having at least 1% of users in each UAOI, which satisfies our definition of a UAOI. Therefore, we adopt these clustering results for the next stage of the analysis, which turns clusters of points into polygon boundaries.
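The check described above can be reproduced in outline with a Pearson coefficient and an ordinary least squares line. The sketch below uses invented monthly counts and hypothetical function names; it is not the data or code behind Fig. 3.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical monthly totals of photographs and distinct users.
photos = [4000, 5500, 7000, 9000, 12000, 15000]
users = [310, 380, 520, 560, 790, 860]

r = pearson_r(photos, users)
slope, intercept = fit_line(photos, users)
r2 = r_squared(photos, users, slope, intercept)
estimated_users = [intercept + slope * p for p in photos]
```

For a model with a single predictor, R-squared equals the squared Pearson coefficient, so a correlation of 0.85 corresponds to an R-squared of about 0.72, in line with the reported 0.725.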
Constructing a perceptual boundary to enclose the extracted urban areas of interest
The clusters from the method described above are represented as a group of points. However, within this study, we are interested in extracting largely non-overlapping UAOIs that refer to an area within a specific border. In other words, we are interested in identifying polygons rather than sets of points. The reason behind this procedure is twofold. First, as mentioned when introducing the concept, a UAOI was defined as a section of the city with an extraordinarily large density of images. Under this definition, two overlapping UAOIs would simply be merged into one. Secondly, our focus is to quantify spatiotemporal changes in the shape and extent of these polygons. In this context, even though a UAOI is identified with fixed borders at each point in time, its definition over time is much vaguer and is allowed to change, evolve and morph in line with changes to its underlying structure.
As such, the next step involves the construction of boundaries that enclose all geotagged images identified as part of a UAOI cluster. To delineate these shapes, we adopted a variant of the concave hull algorithm: alpha shapes (Edelsbrunner et al. 1983). The alpha shape is a widely used, robustly tested algorithm that creates a tighter boundary than the traditional convex hull method, which may produce large empty areas that do not belong to the original point data set (Akdag et al. 2014).
An alpha shape, which is a geometric concept, is a linear approximation of an original shape. It is a generalisation of the convex hull, and a subgraph of the Delaunay triangulation (Edelsbrunner et al. 1983). It establishes a connection between each point and nearby points and removes the triangles that lie furthest from their neighbours. In this context, α is a parameter that controls the desired level of detail, ranging from the standard “crude” convex hull (α = ∞) to the set of points itself (α = 0, Da 2018). The algorithm first computes a Delaunay triangulation of the set of points (S) and, for each Delaunay edge e, computes the values α-min (e) and α-max (e). Next, for each edge e, if α-min (e) ≤ α ≤ α-max (e), the edge is kept in the α-shape of S.
We have tailored this general method to our application by developing a technique to find the most appropriate alpha value for each cluster. As with the parameter selection in HDBSCAN clustering, a single absolute alpha value for all point clusters would not be suitable, in that some areas would contain more empty space than others. We therefore tested alpha values in the range from 0.001 to 0.005 for each cluster. In the values reported here, larger alpha values yield tighter shapes, i.e. the scale runs in the opposite direction to the convention just described. We then identified the first case where a single point was excluded from the main polygon and selected the previous value of alpha. This strategy resulted in the tightest polygon that still contained every point in the cluster. As an illustration, Fig. 4 presents three examples of a UAOI produced with varying alpha values. In this case, 0.003 excludes a point (which in the original algorithm is still linked through an edge, but not an area), and 0.001 gives too loose a solution compared with 0.002, which allows a tighter shape that still includes all points in the cluster within the same polygon. Hence, the value selected for this case is 0.002.
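The selection rule can be illustrated with a small sketch. It rests on several assumptions that are not taken from this study: a triangle is kept when its circumradius is below 1/alpha (a common simplification, not the edge-based α-min/α-max test described above), the Delaunay triangulation is found by brute force (suitable only for tiny point sets), coordinates are in projected metres, and the function names and toy points are hypothetical.

```python
from itertools import combinations
from math import dist

def circumcircle(a, b, c):
    """Centre and radius of the circle through a, b, c; None if they are collinear."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    uy = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return (ux, uy), dist((ux, uy), a)

def delaunay_triangles(points):
    """Brute force: keep each triple whose circumcircle contains no other point."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        centre, radius = circle
        if all(dist(centre, p) >= radius - 1e-9
               for n, p in enumerate(points) if n not in (i, j, k)):
            triangles.append(((i, j, k), radius))
    return triangles

def covered_points(triangles, alpha):
    """Indices of points that belong to at least one triangle kept at this alpha."""
    covered = set()
    for vertices, radius in triangles:
        if radius < 1 / alpha:  # assumed criterion: larger alpha gives a tighter shape
            covered.update(vertices)
    return covered

def select_alpha(points, alphas):
    """Largest candidate alpha reached before any point drops out of the shape."""
    triangles = delaunay_triangles(points)
    chosen = None
    for alpha in sorted(alphas):
        if len(covered_points(triangles, alpha)) < len(points):
            break  # first alpha that excludes a point: keep the previous one
        chosen = alpha
    return chosen

# Hypothetical cluster: three nearby points and one more distant point.
points = [(0, 0), (100, 0), (50, 80), (50, 500)]
alphas = [0.001, 0.002, 0.003, 0.004, 0.005]
best_alpha = select_alpha(points, alphas)
```

In this toy case the distant point is only attached through triangles with a circumradius of roughly 474 m, so it stays in the shape at 0.002 (threshold 500 m) and drops out at 0.003 (threshold about 333 m), mirroring the pattern described for Fig. 4.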