5.1 On Mining Spatial Relationships in Spatial Data

The study of relationships in space has long been at the core of geographical research. In the simplest case, we might be interested in characterizing them by some simple indicators. Sometimes we might be interested in knowing how things co-vary in space; from the perspective of data mining, this is the discovery of spatial associations in data. Oftentimes, we are interested in relationships in which the variation of one phenomenon can be explained by the variations of other phenomena. In terms of data mining, we are looking for causal relationships that might be expressed in functional forms. Statistics in general, and spatial statistics in particular, have been commonly employed in such studies (Cliff and Ord 1972; Anselin 1988; Cressie 1993).

Regardless of what relationships are of interest, the geographers' main concern is whether they are local or global. In the characterization of a spatial phenomenon, for example, is it appropriate to use an overall mean to describe the central tendency of a distribution in space? Would it be too sweeping an indicator, hiding the distinct local variations that would otherwise be more telling? The task of data mining is thus to discover whether significant local variations are embedded in a general distribution, and if so, to unravel the appropriate parameters and/or functional form for their description. In the identification of spatial associations, we often wonder whether spatial autocorrelations are local or global. Again, it is essential to have a means to unravel such associative relationships. To discover causal relationships in space, the local vs. global issue rests on whether the effect of an explanatory variable on the dependent variable can be summarized by a global parameter, or whether it is localized with different effects at different points in space. In a word, the basic issue is the discovery of spatial non-stationarity from data.

The inappropriateness of using global estimates to represent local relationships has long been a concern not only of geographers, but also of statisticians and other social scientists. Simpson's (1951) study of the local effect on interaction in contingency tables, Linneman's (1966) examination of international trade flows, and Cox's (1969) and Johnston's (1973) local analyses of voting behavior are early examples. Over the years, researchers, particularly geographers, have developed methods for local and global analyses. The geographical analysis machine (Openshaw et al. 1987), a limited version of the "scan statistics" (Kulldorff et al. 1997), for example, caters to the study of point patterns with local variations that might not be appropriately captured by the global statistics described by Dacey (1960), Tinkler (1971), and Boots and Getis (1988). Differing from the concept advanced by Cliff and Ord (1972), which gives a global statistic to describe spatial association, Getis and Ord (1992), Anselin (1995, 1998) and Ord and Getis (1995, 2001) propose local statistics to depict local variations in the study of spatial autocorrelation. It has been demonstrated that local clusters that cannot be detected by the global statistic can be identified by the local statistics. Leung et al. (2003d) make the analysis more rigorous by generalizing the local statistics into quadratic forms.

Besides the development of local statistics for the description of spatial dependency, the local and global issue has also surfaced in the study of spatial relationships within the framework of regression analysis. Similar to the study of spatial association, a key issue in the analysis of causal relationships is to discover whether a cause-effect relation is non-stationary in space. Specifically, we are interested in finding out whether the spatial effect is local or global. Within the context of regression, if the parameters of a regression model are functions of the locations at which the observations are made, then local patterns exist and the spatial relationship is non-stationary. The relationship can then be represented by the varying-parameter regression model (Cleveland 1979). In spatial terminology, the relationship is said to be captured by geographically weighted regression (Brunsdon et al. 1996). Thus, the data mining task is to determine whether the underlying structure is global or local in terms of some statistics. For complex systems, however, spatial non-stationarity is not restricted to the variation of the parameters of a universal model. Spatial data manifesting such systems may contain several populations embedded in a mixture distribution. In other words, the functional form representing the relationship varies over space. Local relationships take on different functional expressions, and our task is to unravel all of them in a spatial database. It is particularly important to develop robust data mining methods for such a highly noisy environment (Leung et al. 2001a).

In this chapter, the discovery of spatial associations is first discussed in Sect. 5.2. The emphasis is placed on the employment of various measures for the mining of global and local associations in space with rigorous statistical tests. Discovery of non-stationarity of spatial relationships is then discussed in Sect. 5.3. Local variations are unraveled by detecting significant variations of the parameters of a regression model in space. The general framework is the varying-parameter regression, with geographically weighted regression as a special case. Spatial autocorrelation in geographically weighted regression is further discussed in Sect. 5.4. A more general model of geographically weighted regression is briefly discussed in Sect. 5.5. In Sect. 5.6, spatial non-stationarity is extended to situations in which relationships take on different forms in space. The regression-class mixture decomposition method is employed to mine local variations of spatial relationships captured by different functional forms.

5.2 Discovery of Local Patterns of Spatial Association

5.2.1 On the Measure of Local Variations of Spatial Associations

Many geographical problems can only be adequately analyzed by taking into account the relative locations of observations. Failure to take the necessary steps to account for spatial association in spatial data sets often leads to misleading conclusions (see, for example, Anselin and Griffith 1988; Arbia 1989). The well-known statistics for the identification of global patterns of spatial association are Moran's \( I \) (Moran 1950) and Geary's \( c \) (Geary 1954). They are used as overall measures of spatial dependency for the whole data set. The properties of these two statistics and their null distributions have been intensively studied over the years (see, for example, Cliff and Ord 1981; Anselin 1988; Tiefelsdorf and Boots 1995; Hepple 1998; Tiefelsdorf 1998, 2000; Leung et al. 2003d).

However, with the increasingly large geo-referenced data sets obtained from complex spatial systems, stationarity of dependency over space may be an unrealistic presumption. Thus, there has been a surge of interest in recent years in discovering local patterns of spatial association based on local forms of statistics. The local forms of statistics focus mainly on exceptions to the general patterns represented by the conventional global forms, and on the search for local areas exhibiting spatial heterogeneities, with significant local departures from randomness.

The commonly used statistics for detecting local patterns of spatial association are Ord and Getis' \( {G_i} \) and \( G_i^* \) statistics (Ord and Getis 1995) and Anselin's LISAs (Anselin 1995), including local Moran's \( {I_i} \) and local Geary's \( {c_i} \). As defined in Anselin (1995), a LISA must indicate the extent of spatial clustering of observations around a reference location, and it must obey the additivity requirement for any coding scheme of the spatial link matrix. That is, the sum of the values of a LISA at all locations must be proportional to a global indicator of spatial association. With its additivity, a LISA can also be used as a diagnostic of local instability in measures of global spatial association in the presence of significant global association. However, the \( {G_i} \) or \( G_i^* \) statistic, while being a statistic for local spatial association, is not a LISA in the sense of the additivity requirement because its individual components are not related to a global statistic of spatial association (Anselin 1995). In addition to the fundamental works by Anselin (1995), Getis and Ord (1992) as well as Ord and Getis (1995), the properties of these local statistics have been extensively studied and applied to many real-world and simulated spatial data sets (see, for example, Bao and Henry 1996; Sokal et al. 1998; Tiefelsdorf and Boots 1997; Fotheringham and Brunsdon 1999; Unwin 1996; Wilhelm and Steck 1998).

One of the important issues in the study of local spatial associations is to find the null distributions of these local statistics, because only when their null distributions are made available can the other challenging subjects be addressed (Tiefelsdorf 2000). In this respect, Tiefelsdorf and associates have defined the local Moran's \( {I_i} \) as a ratio of quadratic forms. By means of this definition, and under either the assumption of spatial independence or conditional on a global spatial process, they have investigated the unconditional and conditional exact distributions of \( {I_i} \) and its moments with the statistical theory for ratios of quadratic forms (Boots and Tiefelsdorf 2000; Tiefelsdorf 1998, 2000; Tiefelsdorf and Boots 1997). Unfortunately, the null distributions of the other local statistics have not been examined along this line of reasoning. Furthermore, normal approximation and randomized permutation are still the common approaches for deriving the \( p \)-values of the local statistics. Some GIS modules for spatial statistical analysis also employ the normal approximation to compute the null distribution of \( {I_i} \) (Boots and Tiefelsdorf 2000). Nevertheless, there are problems with these two methods.

For the local statistics \( {I_i} \), \( {c_i} \), and \( {G_i} \) or \( G_i^* \), the underlying spatial structure or spatial contiguity is typically star-shaped. Cliff and Ord (1981, Chap. 2) have shown that the null distributions of global Moran's \( I \) and Geary's \( c \) with star-shaped spatial structures deviate markedly from the normal distribution. A series of experiments performed by Anselin (1995), Boots and Tiefelsdorf (2000) and Sokal et al. (1998) have also demonstrated that the normal approximation to the distribution of the local Moran's \( {I_i} \) is inappropriate because of the excessive kurtosis of the distribution of \( {I_i} \). Although asymptotic normality is a reasonable assumption for the null distribution of \( {G_i} \) or \( G_i^* \), a misleading significance level may be obtained if the number of neighbors at a specific location is too small and the weights describing the contiguities are too uneven (Ord and Getis 1995).

Although the randomized permutation approach seems to provide a reliable basis for inference for both the LISAs and the \( {G_i} \) or \( G_i^* \) (Anselin 1995), it may suffer from resampling error, and the very large sample sizes needed for resampling are rather expensive for routine significance tests (Costanzo et al. 1983). Furthermore, in the significance tests of spatial association with these local statistics, empirical distribution functions are calculated by resampling from the observations under the assumption of equi-probability of selection across the space. If the spatial units are not uniformly defined, the assumption of equi-probability of selection may not hold and the derived test values may be biased (Bao and Henry 1996). In the regression context, if spatial association among the residuals is to be tested, then the randomized permutation approach is inappropriate since regression residuals are correlated (Anselin and Rey 1991).

Given the above shortcomings in performing significance tests for local spatial association by normal approximation and randomized permutation, it is especially useful to develop exact, or more accurate approximate, methods for testing local spatial association. The idea is to develop the exact and approximate \( p \)-values of the aforementioned local statistics for testing local spatial clusters when global autocorrelation is not significant. Such a structure discovery process essentially addresses the following statistical test issues:

  1. Is a reference location surrounded by a cluster of high or low values? Or

  2. Is the observed value at this location positively (similarly) or negatively (dissimilarly) associated with the surrounding observations?

To offer a more formal approach in line with the classical statistical framework, Leung et al. (2003d) have developed an exact method for computing the \( p \)-values of the local Moran's \( {I_i} \), local Geary's \( {c_i} \) and the modified Ord and Getis \( G \) statistics based on the distributional theory of quadratic forms in normal variables. Furthermore, an approximate method, called the three-moment \( {\chi^2} \) approximation, with explicit calculation formulae, has also been proposed to achieve a computational cost lower than that of the exact method. Their study not only provides exact tests for local patterns of spatial association, but also puts the tests of several local statistics within a unified statistical framework.

5.2.2 Local Statistics and their Expressions as a Ratio of Quadratic Forms

I first introduce in this section the local Moran’s \( {I_i} \) and Geary’s \( {c_i} \) of Anselin’s LISAs (Anselin 1995) as well as \( {G_i} \) and \( G_i^* \) of Ord and Getis \( G \) statistics (Ord and Getis 1995), and express them as ratios of quadratic forms in observations. By taking the square of \( {G_i} \) and \( G_i^* \) in particular, the analysis of \( {G_i} \) and \( G_i^* \) can be brought within the common framework of ratios of quadratic forms.

Let \( \mathbf{x} = {\left( {{x_1},{x_2}, \cdots, {x_n}} \right)^T} \) be the vector of observations on random variable \( X \) at \( n \) locations and let \( \mathbf{W} = {\left( {{w_{ij}}} \right)_{n \times n}} \) be a symmetric spatial link matrix which is defined by the underlying spatial structure of the geographical units where the observations are made. The simplest form of W can be such a matrix with elements taking the value one if the corresponding units \( i \) and \( j \) come in contact and zero otherwise. It should be noted that the link matrix can also incorporate information on distances, flows and other types of linkages.
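For concreteness, the following minimal sketch (my own illustration, not part of Leung et al.'s development) constructs such a binary link matrix under rook contiguity on a regular grid; the grid layout and function name are assumptions of the example.

```python
# Illustrative sketch: a symmetric 0/1 spatial link matrix W for a g x g grid
# under rook (edge-sharing) contiguity, with w_ii = 0 by convention.
import numpy as np

def rook_link_matrix(g):
    """Return the (g*g) x (g*g) binary contiguity matrix W."""
    n = g * g
    W = np.zeros((n, n))
    for r in range(g):
        for c in range(g):
            i = r * g + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < g and 0 <= cc < g:
                    W[i, rr * g + cc] = 1.0
    return W

W = rook_link_matrix(5)
assert np.allclose(W, W.T) and np.all(np.diag(W) == 0)  # symmetric, zero diagonal
```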

5.2.2.1 Local Moran’s \( {I_i} \)

For a reference location \( i \), the local Moran’s \( {I_i} \) in its standardized form is (Anselin 1995)

$$ {I_i} = \frac{{\left( {{x_i} - \bar x} \right)\sum\limits_{j = 1}^n {{w_{ij}}\left( {{x_j} - \bar x} \right)} }}{{\tfrac{1}{n}\sum\limits_{j = 1}^n {{{\left( {{x_j} - \bar x} \right)}^2}} }}, $$
((5.1))

where \( \bar x = \frac{1}{n}\sum\nolimits_{j = 1}^n {x_j} \), \( \left( {{w_{i1}},{w_{i2}}, \cdots, {w_{in}}} \right) \) is the \( i \)th row of the symmetric spatial link matrix W, and \( {w_{ii}} = 0 \) by convention. A large positive value of \( {I_i} \) indicates spatial clustering of similar values (either high or low) around location \( i \), and a large negative value indicates a clustering of dissimilar values, that is, a location with a high value is surrounded by neighbors with low values and vice versa.

We actually can express \( {I_i} \) as a ratio of quadratic forms as follows (Leung et al. 2003d):

$$ {I_i} = \frac{{{\mathbf{x}^T}\mathbf{BW}\left( {I_i} \right)\mathbf{Bx}}}{{\tfrac{1}{n}{\mathbf{x}^T}\mathbf{Bx}}}, $$
((5.2))

where

$$ {\left( {{x_1} - \bar x, \cdots, {x_n} - \bar x} \right)^T} = \left( {\mathbf{I} - \frac{1}{n}\mathbf{1}{\mathbf{1}^T}} \right)\mathbf{x} = \mathbf{Bx}, $$
((5.3))

in which \( \mathbf{I} \) is the identity matrix of order \( n \), \( \mathbf{B} = \mathbf{I} - \frac{1}{n}\mathbf{1}{\mathbf{1}^T} \) is an idempotent and symmetric matrix, \( \mathbf{1} = {\left( {1,1, \cdots, 1} \right)^T} \), and \( \mathbf{W}\left( {I_i} \right) \) is the \( n \times n \) symmetric star-shaped matrix defined as:

$$ \mathbf{W}\left( {I_i } \right) = \frac{1}{2}\left( {\begin{array}{*{20}c} 0 & \cdots & 0 & {w_{1i} } & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & {w_{i - 1,i} } & 0 & \cdots & 0 \\ {w_{i1} } & \cdots & {w_{i,i - 1} } & 0 & {w_{i,i + 1} } & \cdots & {w_{in} } \\ 0 & \cdots & 0 & {w_{i + 1,i} } & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & {w_{ni} } & 0 & \cdots & 0 \\ \end{array}} \right) $$
((5.4))

Since \( \sum\nolimits_{i = 1}^n {\mathbf{W}\left( {I_i} \right) = \mathbf{W}} \), we have

$$ \sum\limits_{i = 1}^n {I_i} = \frac{{{\mathbf{x}^T}\mathbf{BWBx}}}{{\tfrac{1}{n}{\mathbf{x}^T}\mathbf{Bx}}} = sI, $$
((5.5))

where \( s = \sum\nolimits_{i = 1}^n {\sum\nolimits_{j = 1}^n {{w_{ij}}} } \), and \( I \) is the global Moran statistic (Cliff and Ord 1981, p. 47). This means that, when we take \( \mathbf{W}\left( {I_i} \right) \) as a local link matrix, the additivity requirement is fulfilled by \( {I_i} \).
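As a quick numerical illustration of (5.1), (5.2) and the additivity property (5.5), the following sketch computes all the \( {I_i} \) at once and checks that they sum to \( sI \); the randomly generated link matrix and data are assumptions of the example.

```python
# Numerical check (illustrative only) of (5.1)-(5.2) and additivity (5.5).
import numpy as np

rng = np.random.default_rng(0)
n = 25
W = (rng.random((n, n)) < 0.2).astype(float)   # any symmetric 0/1 link matrix
W = np.triu(W, 1); W = W + W.T                 # symmetrize, zero diagonal
x = rng.normal(size=n)

z = x - x.mean()                    # Bx, with B = I - (1/n) 1 1^T
den = z @ z / n                     # (1/n) x^T B x
I_loc = z * (W @ z) / den           # I_i of (5.1), all locations at once
s = W.sum()
I_glob = (n / s) * (z @ W @ z) / (z @ z)       # global Moran's I
assert np.isclose(I_loc.sum(), s * I_glob)     # additivity (5.5)
```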

5.2.2.2 Local Geary’s \( {c_i} \)

The local Geary’s \( {c_i} \) at a reference location i is defined by Anselin (1995) as

$$ {c_i} = \frac{{\sum\limits_{j = 1}^n {{w_{ij}}{{\left( {{x_i} - {x_j}} \right)}^2}} }}{{\frac{1}{n}\sum\limits_{j = 1}^n {{{\left( {{x_j} - \bar x} \right)}^2}} }}, $$
((5.6))

where \( {w_{ii}} = 0 \). A small value of \( {c_i} \) suggests a positive spatial association (similarity) of observation \( i \) with its surrounding observations, while a large value of \( {c_i} \) suggests a negative association (dissimilarity) of observation \( i \) with its surrounding observations.

Based on Leung et al. (2003d), \( {c_i} \) can again be expressed as a ratio of quadratic forms as:

$$ {c_i} = \frac{{{\mathbf{x}^T}\mathbf{BW}\left( {c_i} \right)\mathbf{Bx}}}{{\tfrac{1}{n}{\mathbf{x}^T}\mathbf{Bx}}}, $$
((5.7))

where \( \mathbf{W}\left( {c_i} \right) = \mathbf{D}(i) - 2\mathbf{W}\left( {I_i} \right) \) is symmetric, and \( \mathbf{D}(i) = {\text{diag}}\left( {{w_{i1}}, \cdots, {w_{i,i - 1}},{w_{i + }},{w_{i,i + 1}}, \cdots, {w_{in}}} \right) \) is a diagonal matrix with the \( i \)th element in its main diagonal being \( {w_{i + }} = \sum\limits_{j = 1}^n {{w_{ij}}} \).

According to the symmetry of \( \mathbf{W} \) and \( {w_{ii}} = 0 \) for all \( i \), it is easy to prove that

$$ \sum\limits_{i = 1}^n {\mathbf{W}\left( {c_i} \right) = \sum\limits_{i = 1}^n {\mathbf{D}(i) - 2\sum\limits_{i = 1}^n {\mathbf{W}\left( {I_i} \right) = 2\left( {\mathbf{D} - \mathbf{W}} \right)} } }, $$
((5.8))

where \( \mathbf{D} = {\text{diag}}\left( {{w_{1 + }},{w_{2 + }}, \cdots, {w_{n + }}} \right) \). From Cliff and Ord (1981, p. 167) as well as Leung et al. (2003d), the global Geary's \( c \) can be expressed as

$$ c = \frac{n - 1}{s}\,\frac{{{\mathbf{x}^T}\mathbf{B}\left( {\mathbf{D} - \mathbf{W}} \right)\mathbf{Bx}}}{{{\mathbf{x}^T}\mathbf{Bx}}}. $$
((5.9))

Therefore

$$ \sum\limits_{i = 1}^n {c_i} = \frac{{{\mathbf{x}^T}\mathbf{B}\left[ {\sum\limits_{i = 1}^n {\mathbf{W}\left( {c_i} \right)} } \right]\mathbf{Bx}}}{{\tfrac{1}{n}{\mathbf{x}^T}\mathbf{Bx}}} = \frac{2ns}{n - 1}c. $$
((5.10))

That is, the additivity requirement is still fulfilled by \( {c_i} \) with the expression in (5.7).
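The same kind of numerical check can be sketched for local Geary's \( {c_i} \): compute (5.6) directly, recompute it through the quadratic form (5.7) with \( \mathbf{W}\left( {c_i} \right) = \mathbf{D}(i) - 2\mathbf{W}\left( {I_i} \right) \), and verify the additivity (5.10). The random link matrix and data are, again, illustrative assumptions.

```python
# Illustrative check of (5.6)-(5.10) for local Geary's c_i.
import numpy as np

rng = np.random.default_rng(1)
n = 20
W = (rng.random((n, n)) < 0.25).astype(float)
W = np.triu(W, 1); W = W + W.T
x = rng.normal(size=n)
z = x - x.mean()
den = z @ z / n

# (5.6) computed directly for every reference location i
c_direct = (W * (x[:, None] - x[None, :]) ** 2).sum(axis=1) / den

# (5.7): the same values via the quadratic form with W(c_i) = D(i) - 2 W(I_i)
c_quad = np.empty(n)
for i in range(n):
    Wi = np.zeros((n, n)); Wi[i, :] = Wi[:, i] = W[i, :] / 2.0  # W(I_i) of (5.4)
    d = W[i].copy(); d[i] = W[i].sum()                          # diagonal of D(i)
    c_quad[i] = z @ (np.diag(d) - 2 * Wi) @ z / den

s = W.sum()
c_glob = (n - 1) / (2 * s) * (W * (x[:, None] - x[None, :]) ** 2).sum() / (z @ z)
assert np.allclose(c_direct, c_quad)
assert np.isclose(c_direct.sum(), 2 * n * s / (n - 1) * c_glob)  # additivity (5.10)
```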

5.2.2.3 \( G \) Statistics Expressed as Ratios of Quadratic Forms

Ord and Getis \( {G_i} \) and \( G_i^* \) statistics in their original forms (Getis and Ord 1992) are, respectively,

$$ {G_i} = \frac{{\sum\limits_{j \ne i} {{w_{ij}}{x_j}} }}{{\sum\limits_{j \ne i} {x_j} }} $$
((5.11))

and

$$ G_i^* = \frac{{\sum\limits_{j = 1}^n {{w_{ij}}{x_j}} }}{{\sum\limits_{j = 1}^n {x_j} }}. $$
((5.12))

For simplicity, \( d \) in \( {w_{ij}}(d) \) (the weight for the link of location \( j \) and a given location \( i \), with \( j \) being within distance \( d \) from \( i \,\)) is omitted here. The statistics \( {G_i} \) and \( G_i^* \) in (5.11) and (5.12) require that the underlying variable \( X \) has a natural origin and is positive (Getis and Ord 1992). In order to overcome this restriction, Ord and Getis (1995) have standardized them as

$$ {G_i} = \frac{{\sum\limits_{j \ne i} {{w_{ij}}\left[ {{x_j} - \bar x(i)} \right]} }}{{{{\left\{ {\frac{1}{n - 1}\sum\limits_{j \ne i} {{{\left[ {{x_j} - \bar x(i)} \right]}^2}} } \right\}}^{\frac{1}{2}}}}}, $$
((5.13))

and

$$ G_i^* = \frac{{\sum\limits_{j = 1}^n {{w_{ij}}\left( {{x_j} - \bar x} \right)} }}{{{{\left[ {\frac{1}{n}\sum\limits_{j = 1}^n {{{\left( {{x_j} - \bar x} \right)}^2}} } \right]}^{\frac{1}{2}}}}}, $$
((5.14))

where \( \bar x(i) = \frac{1}{n - 1}\sum\limits_{j \ne i} {x_j} \). Here, the scale factor in each statistic is omitted because it does not affect the \( p \)-value to be derived. A large positive value of \( {G_i} \) or \( G_i^* \) indicates a spatial clustering of high values, while a large negative value indicates a spatial clustering of low values. However, unlike the LISAs, these two local statistics are not related to a global one and therefore the additivity requirement is not satisfied. In order to put \( {G_i} \) and \( G_i^* \) into the framework of ratios of quadratic forms, Leung et al. (2003d) take the square of \( {G_i} \) and \( G_i^* \) and obtain the modified \( G \) statistics, respectively, as follows:

$$ {\tilde G_i} = {\left( {G_i} \right)^2} = \frac{{{{\left\{ {\sum\limits_{j \ne i} {{w_{ij}}\left[ {{x_j} - \bar x(i)} \right]} } \right\}}^2}}}{{\frac{1}{n - 1}\sum\limits_{j \ne i} {{{\left[ {{x_j} - \bar x(i)} \right]}^2}} }}, $$
((5.15))

and

$$ \tilde G_i^* = {\left( {G_i^*} \right)^2} = \frac{{{{\left[ {\sum\limits_{j = 1}^n {{w_{ij}}\left( {{x_j} - \bar x} \right)} } \right]}^2}}}{{\frac{1}{n}\sum\limits_{j = 1}^n {{{\left( {{x_j} - \bar x} \right)}^2}} }}. $$
((5.16))

A large value of the transformed statistic \( {\tilde G_i} \) or \( \tilde G_i^* \) indicates a spatial clustering of either high values or low values. With this modification, \( {G_i} \) and \( G_i^* \) can then be expressed as ratios of quadratic forms and their null distributions can be obtained from the distributional theory of quadratic forms. Statistically, using the modified statistics for exploring local spatial association is equivalent to using \( {G_i} \) or \( G_i^* \), except that a spatial clustering of high values cannot be distinguished from one of low values by the extreme values of \( {\tilde G_i} \) or \( \tilde G_i^* \). However, the loss of directional association can be compensated by reexamining the values of the observations at location \( i \) and its neighbors after a significant value of \( {\tilde G_i} \) or \( \tilde G_i^* \) is obtained at location \( i \). Since \( {\tilde G_i} \) and \( \tilde G_i^* \) can be expressed as ratios of quadratic forms in a similar way, we henceforth only need to discuss the statistic \( \tilde G_i^* \).

It should be noted that, with \( \mathbf{w}(i) = {\left( {{w_{i1}},{w_{i2}}, \cdots, {w_{in}}} \right)^T} \) denoting the \( i \)th row of \( \mathbf{W} \) written as a column vector, the numerator of \( \tilde G_i^* \) in (5.16) can be written as

$$ \begin{array}{rl} {\left[ {\sum\limits_{j = 1}^n {{w_{ij}}\left( {{x_j} - \bar x} \right)} } \right]^2} & = \left( {{x_1} - \bar x, \cdots, {x_n} - \bar x} \right)\mathbf{w}(i){\mathbf{w}^T}(i){\left( {{x_1} - \bar x, \cdots, {x_n} - \bar x} \right)^T} \\ & = {\mathbf{x}^T}\mathbf{Bw}(i){\mathbf{w}^T}(i)\mathbf{Bx}. \\ \end{array} $$
((5.17))

Therefore, we obtain

$$ \tilde G_i^* = \frac{{{\mathbf{x}^T}\mathbf{BW}\left( {\tilde G_i^*} \right)\mathbf{Bx}}}{{\tfrac{1}{n}{\mathbf{x}^T}\mathbf{Bx}}}, $$
((5.18))

where

$$ \mathbf{W}\left( {\tilde G_i^*} \right) = \mathbf{w}(i){\mathbf{w}^T}(i) $$
((5.19))

is a symmetric matrix.
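A small sketch of (5.16) and (5.18), verifying that the squared statistic coincides with the rank-one quadratic form (illustrative data and names):

```python
# Illustrative check: (5.16) equals the rank-one quadratic form (5.18).
import numpy as np

rng = np.random.default_rng(2)
n = 15
W = (rng.random((n, n)) < 0.3).astype(float)
W = np.triu(W, 1); W = W + W.T
x = rng.normal(size=n)
z = x - x.mean()
den = z @ z / n

G_tilde = (W @ z) ** 2 / den                  # (5.16) for every i at once
i = 4
quad = z @ np.outer(W[i], W[i]) @ z / den     # (5.18), W(G*_i) = w(i) w(i)^T
assert np.isclose(G_tilde[i], quad)
```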

5.2.2.4 The Null Distributions of \( {I_i} \), \( {c_i} \) and \( \tilde G_i^* \) and Their \( p \)-values for the Spatial Association Test

Based on the above measures, we can derive the \( p \)-values of these local statistics to test for local spatial clusters in the absence of global spatial autocorrelation. Assume that the underlying distribution for generating the observations is normal. Then under the null hypothesis:

H0: no local spatial association is present.

The variables \( {x_1},{x_2}, \cdots, {x_n} \) are then independent and identically distributed as \( N\left( {\mu, {\sigma^2}} \right) \), a normal distribution with mean \( \mu \) and variance \( {\sigma^2} \). Therefore, \( \mathbf{x}\sim N\left( {\mu \mathbf{1},{\sigma^2}\mathbf{I}} \right) \). In this case, for a specific spatial structure that is stipulated by the spatial link matrix \( \mathbf{W} \), the null distributions of the aforementioned local statistics can be obtained via the distributional theory of quadratic forms in normal variables, and significance tests for local spatial association can be performed by computing the \( p \)-values of the local statistics. In the following discussion, the exact and approximate methods for deriving the \( p \)-values of the local statistics \( {I_i} \), \( {c_i} \) and \( \tilde G_i^* \) are introduced.

5.2.2.4.1 The Exact Method

Under the null hypothesis H0, \( \mathbf{x}\sim N\left( {\mu \mathbf{1},{\sigma^2}\mathbf{I}} \right) \), we have \( \mathbf{y} = \tfrac{1}{\sigma }\left( {\mathbf{x} - \mu \mathbf{1}} \right)\sim N\left( {\mathbf{0},\mathbf{I}} \right) \). Substituting \( \mathbf{x} = \sigma \mathbf{y} + \mu \mathbf{1} \) into the expression of \( {I_i} \) in (5.2) and noting that \( {\mathbf{1}^T}\mathbf{B} = {\mathbf{1}^T}\left( {\mathbf{I} - \tfrac{1}{n}\mathbf{1}{\mathbf{1}^T}} \right) = \mathbf{0} \) and \( \mathbf{B1} = \left( {\mathbf{I} - \tfrac{1}{n}\mathbf{1}{\mathbf{1}^T}} \right)\mathbf{1} = \mathbf{0} \), we have, by omitting the scale factor \( 1/n \),

$$ {I_i} = \frac{{{\mathbf{y}^T}\mathbf{BW}\left( {I_i} \right)\,\mathbf{By}}}{{{\mathbf{y}^T}\mathbf{By}}}. $$
((5.20))

Similar expressions for \( {c_i} \) and \( \tilde G_i^* \) can be obtained by replacing \( \mathbf{W}\left( {I_i} \right) \) with \( \mathbf{W}\left( {c_i} \right) \) and \( \mathbf{W}\left( {\tilde G_i^*} \right) \) respectively.

For any real number \( r \), the value of the null distribution function of \( {I_i} \) at \( r \) can be expressed as

$$ {{\text{P}}_{\text{H}}}_{_0}\left( {{I_i} \le r} \right) = {\text{P}}\left\{ {{\mathbf{y}^T}\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\,\mathbf{By} \le 0} \right\}. $$
((5.21))

Since \( \mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B} \) is a symmetric matrix with real elements and \( \mathbf{y} \) is distributed as \( N\left( {\mathbf{0},\mathbf{I}} \right) \), Imhof's results on the distribution of quadratic forms (Hepple 1998; Imhof 1961; Leung et al. 2003d; Tiefelsdorf and Boots 1995) can be used to obtain the null distribution of \( {I_i} \). That is,

$$ {{\text{P}}_{\text{H}}}_{_0}\left( {{I_i} \le r} \right) = \frac{1}{2} - \frac{1}{\pi }\int_0^\infty {\frac{{\sin \left[ {\theta (t)} \right]}}{{t\rho (t)}}} \;dt, $$
((5.22))

where

$$ \theta (t) = \frac{1}{2}\sum\limits_{k = 1}^m {\left[ {{h_k}\arctan \left( {{\lambda_k}t} \right)} \right]}, $$
((5.23))
$$ \rho (t) = \prod\limits_{k = 1}^m {{{\left( {1 + \lambda_k^2{t^2}} \right)}^{\frac{1}{4}{h_k}}}}, $$
((5.24))

with \( {\lambda_1},{\lambda_2}, \cdots, {\lambda_m} \) being the distinct nonzero eigenvalues of the matrix \( \mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B} \), and \( {h_1},{h_2}, \cdots, {h_m} \) being their respective orders of multiplicity.

The same formulae for computing the null distributions of \( {c_i} \) and \( \tilde G_i^* \) can be obtained by replacing \( {\lambda_1},{\lambda_2}, \cdots, {\lambda_m} \) and \( {h_1},{h_2}, \cdots, {h_m} \) with the eigenvalues and their orders of multiplicity of the matrices \( \mathbf{B}\left[ {\mathbf{W}\left( {c_i} \right) - r\mathbf{I}} \right]\,\mathbf{B} \) and \( \mathbf{B}\left[ {\mathbf{W}\left( {\tilde G_i^*} \right) - r\mathbf{I}} \right]\,\mathbf{B} \) respectively.
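As a sketch of how the exact method might be implemented, one can form \( \mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B} \), extract its nonzero eigenvalues, and evaluate the Imhof integral (5.22) by numerical quadrature. The random link matrix, the chosen \( r \), and the use of scipy are assumptions of the example.

```python
# Illustrative sketch of the exact method: Imhof integration of (5.22).
import numpy as np
from scipy.integrate import quad

def imhof_prob_le(lam):
    """P(y^T M y <= 0) for y ~ N(0, I), lam = eigenvalues of the symmetric M."""
    lam = lam[np.abs(lam) > 1e-10]                   # nonzero eigenvalues only
    theta = lambda t: 0.5 * np.sum(np.arctan(lam * t))       # (5.23)
    rho = lambda t: np.prod((1.0 + lam ** 2 * t ** 2) ** 0.25)  # (5.24)
    val, _ = quad(lambda t: np.sin(theta(t)) / (t * rho(t)),
                  1e-12, np.inf, limit=200)
    return 0.5 - val / np.pi                         # (5.22)

rng = np.random.default_rng(3)
n, i, r = 12, 0, 1.5     # r: observed I_i with the 1/n factor omitted, cf. (5.20)
W = (rng.random((n, n)) < 0.3).astype(float)
W = np.triu(W, 1); W = W + W.T
B = np.eye(n) - np.ones((n, n)) / n
Wi = np.zeros((n, n)); Wi[i, :] = Wi[:, i] = W[i, :] / 2.0   # W(I_i) of (5.4)
lam = np.linalg.eigvalsh(B @ (Wi - r * np.eye(n)) @ B)
p_pos = 1.0 - imhof_prob_le(lam)   # P_{H0}(I_i >= r): test of positive association
```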

As a special case of the above results, we can obtain the exact \( p \)-values of the statistics \( {I_i} \), \( {c_i} \) and \( \tilde G_i^* \) for the spatial association test. Let \( {r_I} \), \( {r_c} \) and \( {r_G} \) be, respectively, the observed values of \( {I_i} \), \( {c_i} \) and \( \tilde G_i^* \), which can be computed from (5.1), (5.6) and (5.16), or from (5.2), (5.7) and (5.18), by omitting the scale factor \( 1/n \) in each expression. For \( {I_i} \), the \( p \)-value for testing positive spatial autocorrelation (a spatial cluster of similar values) is \( {{\text{P}}_{{{\text{H}}_0}}}\left( {{I_i} \ge {r_I}} \right) \), and the \( p \)-value for testing negative spatial autocorrelation (a spatial cluster of dissimilar values) is \( {{\text{P}}_{{{\text{H}}_0}}}\left( {{I_i} \le {r_I}} \right) \). For \( {c_i} \), the \( p \)-value for testing positive spatial autocorrelation is \( {{\text{P}}_{{{\text{H}}_0}}}\left( {{c_i} \le {r_c}} \right) \) and the \( p \)-value for testing negative spatial autocorrelation is \( {{\text{P}}_{{{\text{H}}_0}}}\left( {{c_i} \ge {r_c}} \right) \). For \( \tilde G_i^* \), the \( p \)-value for testing a spatial clustering of high or low values is \( {{\text{P}}_{{{\text{H}}_0}}}\left( {\tilde G_i^* \ge {r_G}} \right) \). All these \( p \)-values can be calculated through the corresponding exact formulae in (5.22)–(5.24). The derivations of \( \theta (t) \) and \( \rho (t) \) in (5.22) for \( {I_i} \), \( \tilde G_i^* \) and \( {c_i} \) are given in Leung et al. (2003d). For \( {I_i} \), we have

$$ \theta (t) = \frac{1}{2}\left\{ {\arctan \left[ {\left( {{\lambda_I}(1) - r} \right)t} \right] + \arctan \left[ {\left( {{\lambda_I}(2) - r} \right)t} \right] - \left( {n - 3} \right)\arctan \left( {rt} \right)} \right\}, $$
((5.25))
$$ \rho (t) = {\left\{ {1 + {{\left[ {{\lambda_I}(1) - r} \right]}^2}{t^2}} \right\}^{\frac{1}{4}}}{\left\{ {1 + {{\left[ {{\lambda_I}(2) - r} \right]}^2}{t^2}} \right\}^{\frac{1}{4}}}{\left( {1 + {r^2}{t^2}} \right)^{\frac{n - 3}{4}}}, $$
((5.26))

where \( {\lambda_I}(1) \) and \( {\lambda_I}(2) \) are the non-zero eigenvalues of the matrix \( \mathbf{BW}\left( {I_i} \right)\,\mathbf{B}. \)

For \( \tilde G_i^* \), with \( {\lambda_G} \) being the nonzero eigenvalue of the matrix \( \mathbf{BW}\left( {\tilde G_i^*} \right)\mathbf{B} \), we have

$$ \theta (t) = \frac{1}{2}\left\{ {\arctan \left[ {\left( {{\lambda_G} - r} \right)t} \right] - \left( {n - 2} \right)\arctan \left( {rt} \right)} \right\}, $$
((5.27))
$$ \rho (t) = {\left[ {1 + {{\left( {{\lambda_G} - r} \right)}^2}{t^2}} \right]^{\frac{1}{4}}}{\left[ {1 + {r^2}{t^2}} \right]^{\frac{n - 2}{4}}}. $$
((5.28))

For \( {c_i} \), we have

$$ \theta (t) = \frac{1}{2}\left\{ {\arctan \left[ {\left( {{w_{i + }} + 1 - r} \right)t} \right] + \left( {{w_{i + }} - 1} \right)\arctan \left[ {\left( {1 - r} \right)t} \right] - \left( {n - {w_{i + }} - 1} \right)\arctan \left( {rt} \right)} \right\}, $$
((5.29))
$$ \rho (t) = {\left[ {1 + {{\left( {{w_{i + }} + 1 - r} \right)}^2}{t^2}} \right]^{\frac{1}{4}}}{\left[ {1 + {{\left( {1 - r} \right)}^2}{t^2}} \right]^{\frac{{{w_{i + }} - 1}}{4}}}{\left( {1 + {r^2}{t^2}} \right)^{\frac{{n - {w_{i + }} - 1}}{4}}}. $$
((5.30))

5.2.2.4.2 The Approximate Method

Computing numerically the eigenvalues of an \( n \times n \) matrix and evaluating an integral over an infinite interval are in fact computationally expensive. Therefore, the above exact method for computing the \( p \)-values of the statistics is not very efficient in practice, especially when the sample size \( n \) of a data mining task is large. Some approximate methods may be useful in solving this problem. As pointed out above, the null distributions of the LISAs cannot be effectively approximated by the normal distribution. Leung et al. (2003d) hence propose a higher-moments procedure, called the three-moment \( {\chi^2} \) approximation, to compute the \( p \)-values of the local statistics for the spatial association test, and derive explicit computational formulae which significantly reduce the computational overhead.

The main idea of the three-moment \( {\chi^2} \) approximation is to approximate the distribution of a quadratic form in normal variables by that of a linear function of a \( {\chi^2} \) variable with appropriate degrees of freedom, say \( a + b\chi_d^2 \). The coefficients \( a \) and \( b \) of the linear function and the degrees of freedom \( d \) are chosen in such a way that the first three moments of \( a + b\chi_d^2 \) match those of the quadratic form. This method was originally proposed by Pearson (1959) to approximate the distribution of a noncentral \( {\chi^2} \) variable. Imhof (1961) has extended this method to approximate the distribution of a general quadratic form in normal variables.

For local Moran’s \( {I_i} \), we have

$$ \begin{array}{rl} {{\text{P}}_{{{\text{H}}_0}}}\left( {{I_i} \le r} \right) & = {\text{P}}\left\{ {{\mathbf{y}^T}\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{By} \le 0} \right\} \\ & \approx \begin{cases} {\text{P}}\left\{ {\chi_d^2 \le d - \frac{1}{b}\,tr\left\{ {\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B}} \right\}} \right\}, & {\text{if}}\ \ tr{\left\{ {\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B}} \right\}^3} > 0, \\ {\text{P}}\left\{ {\chi_d^2 \ge d - \frac{1}{b}\,tr\left\{ {\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B}} \right\}} \right\}, & {\text{if}}\ \ tr{\left\{ {\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B}} \right\}^3} < 0, \end{cases} \end{array} $$
((5.31))

where

$$ b = \frac{{tr{{\left\{ {\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B}} \right\}}^3}}}{{tr{{\left\{ {\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B}} \right\}}^2}}}, $$
((5.32))
$$ d = \frac{{{{\left\{ {tr{{\left[ {\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B}} \right]}^2}} \right\}}^3}}}{{{{\left\{ {tr{{\left[ {\mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B}} \right]}^3}} \right\}}^2}}}. $$
((5.33))

Therefore, the approximate \( p \)-value of \( {I_i} \) for testing local positive or negative spatial autocorrelation can be computed via (5.31) once the observed value \( {r_I} \) is obtained.

For local Geary’s \( {c_i} \), the probability \( {{\text{P}}_{{{\text{H}}_0}}}\left( {{c_i} \le r} \right) \) can be computed by the same formulae as those in (5.31)–(5.33) except that the matrix \( \mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B} \) is replaced by \( \mathbf{B}\left[ {\mathbf{W}\left( {c_i} \right) - r\mathbf{I}} \right]\mathbf{B} \). For the modified statistic \( \tilde G_i^* \), the probability \( {{\text{P}}_{{{\text{H}}_0}}}\left( {G_i^* \le r} \right) \) can still be calculated by replacing the matrix \( \mathbf{B}\left[ {\mathbf{W}\left( {I_i} \right) - r\mathbf{I}} \right]\mathbf{B} \) in (5.31), (5.32) and (5.33) with \( \mathbf{B}\left[ {\mathbf{W}\left( {G_i^*} \right) - r\mathbf{I}} \right] \).

When the underlying variable generating the data is normally distributed and the null hypothesis of "no local spatial association" is true, each of the local statistics \( {I_i} \), \( {c_i} \) and \( \tilde G_i^* \) can be expressed as a ratio of quadratic forms in standard normal variables. Therefore, the well-known result that "a ratio of quadratic forms in normal variables with the matrix in its denominator being idempotent is distributed independently of its denominator" (see, for example, Cliff and Ord 1981, p. 43 as well as Stuart and Ord 1994, pp. 529–530 for the proof) can be employed to obtain the exact moments of \( {I_i} \), \( {c_i} \) and \( \tilde G_i^* \). According to this result, we have from (5.20) that, for any positive integer \( k \),

$$ {\text{E}}\left( {I_i^k} \right) = \frac{{{\text{E}}{{\left[ {{\mathbf{y}^T}\mathbf{BW}\left( {I_i} \right)\mathbf{By}} \right]}^k}}}{{{\text{E}}{{\left( {{\mathbf{y}^T}\mathbf{By}} \right)}^k}}}. $$
((5.34))

Similar to the derivation in Tiefelsdorf (2000, pp. 100–102), for example, we can obtain in particular

$$ E({I_i}) = \frac{1}{n - 1}tr\left[ {\mathbf{BW}\left( {I_i} \right)\mathbf{B}} \right], $$
((5.35))
$$ Var\left( {I_i} \right) = \frac{2}{{{{\left( {n - 1} \right)}^2}\left( {n + 1} \right)}}\left\{ {\left( {n - 1} \right)tr\left\{ {{{\left[ {\mathbf{BW}\left( {I_i} \right)\mathbf{B}} \right]}^2}} \right\} - {{\left\{ {tr\left[ {\mathbf{BW}\left( {I_i} \right)\mathbf{B}} \right]} \right\}}^2}} \right\}. $$
((5.36))

Leung et al. (2003d) show that the normal approximation of the null distribution of \( {I_i} \) can be expressed as

$$ {{\text{P}}_{{{\text{H}}_0}}}\left( {{I_i} \le r} \right) \approx \Phi \left( {\frac{{r - {\text{E}}\left( {I_i} \right)}}{{\sqrt {{\text{Var}}\left( {I_i} \right)} }}} \right), $$
((5.37))

where \( \Phi (x) \) is the distribution function of \( N\left( {0,1} \right) \). Similar normal approximation formulae to (5.37) can be obtained for the null distributions of \( {c_i} \) and \( \tilde G_i^* \) respectively. Simulations conducted by Leung et al. (2003d) demonstrate that the three-moment \( {\chi^2} \) approximation generally performs better than the normal approximation and is very accurate in some instances.
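For comparison, a sketch of the normal approximation (5.35)–(5.37) computed from the trace formulas (same illustrative setup as in the previous sketches):

```python
# Illustrative sketch of the normal approximation (5.35)-(5.37) for I_i.
import numpy as np
from scipy.stats import norm

def normal_prob_le(BWiB, n, r):
    t1, t2 = np.trace(BWiB), np.trace(BWiB @ BWiB)
    mean = t1 / (n - 1)                                            # (5.35)
    var = 2 * ((n - 1) * t2 - t1 ** 2) / ((n - 1) ** 2 * (n + 1))  # (5.36)
    return norm.cdf((r - mean) / np.sqrt(var))                     # (5.37)

rng = np.random.default_rng(3)
n, i, r = 12, 0, 1.5
W = (rng.random((n, n)) < 0.3).astype(float)
W = np.triu(W, 1); W = W + W.T
B = np.eye(n) - np.ones((n, n)) / n
Wi = np.zeros((n, n)); Wi[i, :] = Wi[:, i] = W[i, :] / 2.0
p_le = normal_prob_le(B @ Wi @ B, n, r)
```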

It should be emphasized that both the exact and approximate \( p \)-values of \( {I_i} \), \( {c_i} \) and \( \tilde G_i^* \) are obtained under the assumptions that global spatial autocorrelation is insignificant and that the underlying distribution generating the observations is normal. The first assumption means that the results can only be used in significance tests for local spatial clusters that the global statistics fail to detect. This is one of the two important purposes that the LISAs intend to serve (Anselin 1995). In practice, a test for the non-existence of global spatial autocorrelation should first be performed. If global autocorrelation is not significant, the results obtained by Leung et al. (2003d) can then be used to assess the significance of local spatial clusters.

5.3 Discovery of Spatial Non-Stationarity Based on the Geographically Weighted Regression Model

5.3.1 On Modeling Spatial Non-Stationarity within the Parameter-Varying Regression Framework

In spatial analysis, the ordinary linear regression (OLR) model has been one of the most useful statistical means to identify the nature of relationships among variables. In this technique, a variable \( y \), called the dependent variable, is modeled as a linear function of a set of independent variables \( {x_1},{x_2}, \cdots, {x_p} \). Based on \( n \) observations \( \left( {{y_i};{x_{i1}},{x_{i2}}, \cdots, {x_{ip}}} \right) \), \( i = 1,2, \cdots, n \), taken from a study region, the model can be expressed as

$$ {y_i} = {\beta_0} + \sum\limits_{k = 1}^p {{\beta_k}{x_{ik}}} + {\varepsilon_i}, $$
((5.38))

where \( {\beta_0},{\beta_1}, \cdots, {\beta_p} \) are parameters and \( {\varepsilon_1},{\varepsilon_2}, \cdots, {\varepsilon_n} \) are error terms which are generally assumed to be independent normally distributed random variables with zero means and constant variance \( {\sigma^2} \). In this model, each of the parameters can be thought of as the “slopes” between the dependent variable and one of the independent variables. The least squares estimate of the parameter vector can be written as

$$ \hat \beta = {\left( {{{\hat \beta }_0}\;{{\hat \beta }_1} \cdots {{\hat \beta }_p}} \right)^T} = {\left( {{\mathbf{X}^T}\mathbf{X}} \right)^{ - 1}}{\mathbf{X}^T}\mathbf{Y}, $$
((5.39))

where

$$ \mathbf{X} = \left( {\begin{array}{*{20}c} 1 & {x_{11} } & \cdots & {x_{1p} } \\ 1 & {x_{21} } & \cdots & {x_{2p} } \\ \vdots & \vdots & & \vdots \\ 1 & {x_{n1} } & \cdots & {x_{np} } \\ \end{array}} \right), \qquad \mathbf{Y} = \left( {\begin{array}{*{20}c} {y_1 } \\ {y_2 } \\ \vdots \\ {y_n } \\ \end{array}} \right). $$
((5.40))

Statistical properties of these estimates have been well studied and various hypothesis tests have also been established.
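A minimal numerical sketch of (5.38)–(5.39), with simulated data and illustrative names:

```python
# Illustrative OLR fit via the closed form (5.39).
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix (5.40)
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T Y of (5.39)
```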

Although the OLR model has been used extensively in the study of spatial relationships, it cannot incorporate spatial non-stationarity, since the relationships between the dependent variable and the independent variables, manifested by the slopes (parameters), are assumed to be global across the study area. However, in many real-life situations, there is ample evidence indicating a lack of uniformity in the effects of space. Local variations of relationships over space commonly exist in spatial data sets, and the assumption of stationarity or structural stability over space may be unrealistic (see, for example, Anselin 1988; Fotheringham et al. 1996; Fotheringham 1997). As stated in Brunsdon et al. (1996): (1) relationships can vary significantly over space, and a "global" estimate of the relationships may obscure interesting geographical phenomena; and (2) variation over space can be sufficiently complex that it invalidates simple trend-fitting exercises. So when analyzing spatial data, particularly in data mining, we should take this kind of spatial non-stationarity into account.

Over the years, some approaches have been proposed to incorporate spatial structural instability or spatial drift into the models. For example, Anselin (1988, 1990) has investigated regression models with spatial structural change. Casetti (1972, 1986), Jones and Casetti (1992), and Fotheringham and Pitts (1995) have studied spatial variations by the expansion method. Based on the locally weighted regression method, Cleveland (1979), Cleveland and Devlin (1988), Casetti (1982), Foster and Gorr (1986), Gorr and Olligschlaeger (1994), Brunsdon et al. (1996, 1997) and Fotheringham et al. (1997a, b) have examined the following varying-parameter regression model:

$$ {y_i} = {\beta_{i0}} + \sum\limits_{k = 1}^p {{\beta_{ik}}{x_{ik}} + {\varepsilon_i}} . $$
((5.41))

Unlike the OLR model in (5.38), this model allows the parameters to vary in space. However, this model in its unconstrained form is not implementable because the number of parameters increases with the number of observations, i.e., the curse of dimensionality. Hence, strategies for limiting the number of degrees of freedom used to represent variation of the parameters over space should be developed when the parameters are estimated.

There are several methods for estimating the parameters. For example, the method of spatial adaptive filtering (Foster and Gorr 1986; Gorr and Olligschlaeger 1994) uses generalized damped negative feedback to estimate the spatially-varying parameters of the model in (5.41). However, this approach incorporates spatial relationships in a rather ad hoc manner and produces parameter estimates that cannot be tested statistically. The locally weighted regression and kernel regression methods (Cleveland 1979; Casetti 1982; Cleveland and Devlin 1988; Cleveland et al. 1988; Brunsdon 1995; Wand and Jones 1995) focus mainly on the fit of the dependent variable rather than on the spatially varying parameters. Furthermore, the weighting system depends on the location in the "attribute space" (Openshaw 1993) of the independent variables. Along this line of thinking, Brunsdon et al. (1996, 1997) and Fotheringham et al. (1997a, b, 2002) suggest the so-called geographically weighted regression (GWR) technique. The mathematical representation of the GWR model is actually the same as the varying-parameter regression model in (5.41). In the following subsection, I will outline the GWR model and the basic issues involved in using it as a means to unravel local variations in spatial relationships.

5.3.2 Geographically Weighted Regression and the Local–Global Issue About Spatial Non-Stationarity

In the GWR Model, the parameters are assumed to be functions of the locations on which the observations are obtained. That is,

$$ {y_i} = {\beta_{i0}} + \sum\limits_{k = 1}^p {{\beta_{ik}}{x_{ik}}} + {\varepsilon_i}, \qquad i \in C = \left\{ {1,2, \cdots, n} \right\}, $$
((5.42))

where C is the index set of locations of \( n \) observations and \( {\beta_{ik}} \) is the value of the \( k \)th parameter at location \( i \).

The parameters in the GWR model are estimated by the weighted least squares approach. The weighting matrix is taken as a diagonal matrix, each element of whose diagonal is assumed to be a function of the location of the observation. Suppose that the weighting matrix at location \( i \) is \( \mathbf{W}(i) \). Then the parameter vector at location \( i \) is estimated as

$$ \hat \beta (i) = {\left( {{\mathbf{X}^T}\mathbf{W}(i)\mathbf{X}} \right)^{ - 1}}{\mathbf{X}^T}\mathbf{W}(i)\mathbf{Y}, $$
((5.43))

where \( \mathbf{W}(i) = {\text{diag}}\left( {{w_1}(i),{w_2}(i), \cdots, {w_n}(i)} \right) \) and \( \mathbf{X} \), \( \mathbf{Y} \) are the same matrices as in (5.40). Here we assume that the inverse of the matrix \( {\mathbf{X}^T}\mathbf{W}(i)\mathbf{X} \) exists.

According to the principle of the weighted least squares method, the generated estimators at location \( i \) in (5.43) are obtained by solving the following optimization problem. That is, determine the parameters \( {\beta_0},{\beta_1}, \cdots, {\beta_p} \) at each location \( i \) so that

$$ \sum\limits_{j = 1}^n {{w_j}(i){{\left( {{y_j} - {\beta_0} - {\beta_1}{x_{j1}} - \cdots - {\beta_p}{x_{jp}}} \right)}^2}} $$
((5.44))

is minimized. Given appropriate weights \( {w_j}(i) \) which are a function of the locations at which the observations are made, different emphases can be given to different observations for generating the estimated parameters at location \( i \).
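A sketch of the estimator (5.43), looping over locations and using a Gaussian distance-decay weight of the kind introduced in Sect. 5.3.2.1 below; the coordinates, bandwidth value and simulated spatial drift are assumptions of the example.

```python
# Illustrative GWR fit via the weighted least-squares estimator (5.43).
import numpy as np

def gwr_fit(X, y, coords, theta):
    """Return the n x (p+1) matrix whose i-th row is beta_hat(i) of (5.43)."""
    betas = np.empty((len(y), X.shape[1]))
    for i in range(len(y)):
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)
        w = np.exp(-theta * d2)                       # diagonal of W(i)
        XtW = X.T * w                                 # X^T W(i), W(i) never formed
        betas[i] = np.linalg.solve(XtW @ X, XtW @ y)  # (5.43)
    return betas

rng = np.random.default_rng(5)
n = 100
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
slope = 1.0 + coords[:, 0]                  # slope drifts across the region
y = X[:, 0] + slope * X[:, 1] + 0.1 * rng.normal(size=n)
betas = gwr_fit(X, y, coords, theta=20.0)   # betas[:, 1] roughly tracks `slope`
```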

5.3.2.1 Possible Choices of the Weighting Matrix

The role of the weighting matrix is to place different emphases on different observations in generating the estimated parameters. In spatial analysis, observations close to a location \( i \) are generally assumed to exert more influence on the parameter estimates at location \( i \) than those farther away. When the parameters at location \( i \) are estimated, more emphases should be placed on the observations which are close to location \( i \). A simple but natural choice of the weighting matrix at location \( i \) is to exclude those observations that are farther than some distance \( d \) from location \( i \). This is equivalent to setting a zero weight on observation \( j \) if the distance from \( i \) to \( j \) is greater than \( d \). If the distance from \( i \) to \( j \) is expressed as \( {d_{ij}} \), the elements of the weighting matrix at location \( i \) can be chosen as

$$ {w_j}(i) = \begin{cases} 1, & {\text{if}}\ {d_{ij}} \le d \\ 0, & {\text{if}}\ {d_{ij}} > d \end{cases}, \qquad j = 1,2, \cdots, n. $$
((5.45))

The above weighting function suffers from the problem of discontinuity over the study area. One way to overcome this problem is to specify \( {w_j}(i) \) as a continuous and monotone decreasing function of \( {d_{ij}} \). One obvious choice can be

$$ {w_j}(i) = \exp \left( { - \theta d_{ij}^2} \right), \qquad j = 1,2, \cdots, n, $$
((5.46))

so that if \( i \) is a point at which an observation is made, the weight assigned to that observation will be unity, and the weights of the others will decrease according to a Gaussian curve as \( {d_{ij}} \) increases. Here, \( \theta \) is a non-negative constant depicting the way the Gaussian weights vary with distance. Given \( {d_{ij}} \), the larger \( \theta \) is, the less emphasis is placed on the observation at location \( j \). Under (5.46), nonzero weights are assigned to the observations at all locations of the study area.

A compromise between the above two weighting functions can be reached by setting the weights to be zero outside a radius \( d \) and to decrease monotonically to zero inside the radius as \( {d_{ij}} \) increases. For example, we can take the elements of the weighting matrix as a bi-square function, i.e.,

$$ {w_j}(i) = \begin{cases} {\left( {1 - \frac{{d_{ij}^2}}{{d^2}}} \right)^2}, & {\text{if}}\ {d_{ij}} \le d \\ 0, & {\text{if}}\ {d_{ij}} > d \end{cases}, \qquad j = 1,2, \cdots, n. $$
((5.47))

The weighting function in (5.46) is the most common choice in practice.
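The three weighting schemes (5.45)–(5.47) translate directly into code; a minimal sketch:

```python
# The weighting schemes (5.45)-(5.47) as vectorized functions of distance.
import numpy as np

def w_box(dij, d):        # (5.45): unit weight within radius d, zero outside
    return (dij <= d).astype(float)

def w_gauss(dij, theta):  # (5.46): Gaussian decay, nonzero everywhere
    return np.exp(-theta * dij ** 2)

def w_bisquare(dij, d):   # (5.47): smooth decay to zero at radius d
    return np.where(dij <= d, (1.0 - dij ** 2 / d ** 2) ** 2, 0.0)
```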

Compared with other methods, the GWR technique appears to be a relatively simple but useful geographically-oriented method to explore spatial non-stationarity. Based on the GWR model, not only can variation of the parameters be explored, but the significance of the variation can also be tested. Unfortunately, at present, only Monte Carlo simulation has been used to perform tests on the validity of the model. In this technique, under the null hypothesis that the global linear regression model holds, any permutation of the observations \( \left( {{y_i};{x_{i1}},{x_{i2}}, \cdots, {x_{ip}}} \right) \), \( i = 1,2, \cdots, n \), among the geographical sampling points is equally likely to occur. The observed values of the proposed statistics can then be compared with these randomization distributions, and the significance tests can be performed accordingly. The computational overhead of this method is however considerable, especially for a large data set. Also, since the validity of these randomization distributions is limited to the given data set, this in turn restricts the generality of the proposed statistics. The ideal way to test the model is to construct appropriate statistics and to perform the tests in a conventional statistical manner.

To test whether relationships unraveled from spatial data are local or global, the following two questions are the most important and should be rigorously tested within the conventional hypothesis testing framework:

  1. Does a GWR model describe the data significantly better than an OLR model? That is, on the whole, do the parameters in the GWR model vary significantly over the study region?

  2. Does each set of parameters \( {\beta_{ik}} \), \( i = 1,2, \cdots, n \), exhibit significant variation over the study region? That is, the effect of which independent variable exhibits significant local variation?

The first question is, in fact, a goodness-of-fit test for the GWR model. It is equivalent to testing whether or not \( \theta = 0 \) if we use (5.46) as the weighting function. In the second case, for any fixed \( k \), the deviation of \( {\beta_{ik}} \), \( i = 1,2, \cdots, n \), can be used to evaluate the variation of the slope of the \( k \)th independent variable. Since it is very difficult to find the null distribution of the estimated parameter, say \( \theta \) in (5.46), in the weighting matrix, a Monte Carlo technique has been employed to perform the tests (Brunsdon et al. 1996; Fotheringham et al. 1997a). However, as pointed out above, the computational overhead of this method is considerable. Furthermore, the validity of the reference distributions obtained by randomized permutation is limited to the given data set, which in turn may restrict the generality of the corresponding statistics.

5.3.2.2 Goodness-of-Fit Test of the Independent Variables

Based on the notion of residual sum of squares and the following assumptions, some statistics are constructed in Leung et al. (2000b):

Assumption 5.1

The error terms \( {\varepsilon_1},{\varepsilon_2}, \cdots, {\varepsilon_n} \) are independently and identically distributed as a normal distribution with zero mean and constant variance \( {\sigma^2} \).

Assumption 5.2

Let \( {\hat y_i} \) be the fitted value of \( {y_i} \) at location \( i \) . For all \( i = 1,2, \cdots, n \) , \( {\hat y_i} \) is an unbiased estimate of \( E\left( {y_i} \right) \) . That is, \( E\left( {{{\hat y}_i}} \right) = E\left( {y_i} \right) \) for all \( i \) .

Assumption 5.1 is in fact the conventional assumption in theoretical analyses of regression. Assumption 5.2 is in general not exactly true for local linear fitting unless an exact global linear relationship between the dependent variable and the independent variables exists (see Wand and Jones 1995, pp. 120–121 for the univariate case). However, the local-regression methodology is mainly oriented towards the search for low-bias estimates (Cleveland et al. 1988). In this sense, the bias of the fitted value can be neglected. So, Assumption 5.2 is a realistic one for the GWR model, since this technique still belongs to the local-regression methodology.

  1. The residual sum of squares and its approximated distribution

Let \( \mathbf{x}_i^{\rm T} = \left( {1\,{x_{i1}} \cdots {x_{ip}}} \right) \) be the \( i \)th row of X, \( i = 1,2, \cdots, n \), and \( \hat \beta (i) \) the estimated parameter vector at location \( i \). Then the fitted value of \( {y_i} \) is

$$ {\hat y_i} = \mathbf{x}_i^T\hat \beta (i) = \mathbf{x}_i^T{\left( {{\mathbf{X}^T}\mathbf{W}(i)\mathbf{X}} \right)^{ - 1}}{\mathbf{X}^T}{\mathbf{W}}(i){\mathbf{Y}}. $$
((5.48))

Let \( {\mathbf{\hat Y}} = {\left( {{{\hat y}_1}{{\hat y}_2} \cdots {{\hat y}_n}} \right)^T} \) be the vector of the fitted values and \( \hat \varepsilon \) \( = {\left( {{{\hat \varepsilon }_1}{{\hat \varepsilon }_2} \cdots {{\hat \varepsilon }_n}} \right)^T} \) the vector of the residuals. Then

$$ {\mathbf{\hat Y}} = {\mathbf{LY}}, $$
((5.49))
$$ \hat \varepsilon = {\mathbf{Y}} - {\mathbf{\hat Y}} = \left( {{\mathbf{I}} - {\mathbf{L}}} \right){\mathbf{Y}}, $$
((5.50))

where

$$ \mathbf{L} = \left( {\begin{array}{*{20}c} {\mathbf{x}_1^T{{\left( {{\mathbf{X}^T}\mathbf{W}(1)\mathbf{X}} \right)}^{ - 1}}{\mathbf{X}^T}\mathbf{W}(1)} \\ {\mathbf{x}_2^T{{\left( {{\mathbf{X}^T}\mathbf{W}(2)\mathbf{X}} \right)}^{ - 1}}{\mathbf{X}^T}\mathbf{W}(2)} \\ \vdots \\ {\mathbf{x}_n^T{{\left( {{\mathbf{X}^T}\mathbf{W}(n)\mathbf{X}} \right)}^{ - 1}}{\mathbf{X}^T}\mathbf{W}(n)} \\ \end{array}} \right) $$
((5.51))

Denote the residual sum of squares by \( RS{S_g} \). Then

$$ RS{S_g} = \sum\limits_{i = 1}^n {\hat \varepsilon_i^2 = {{\hat \varepsilon }^T}\hat \varepsilon = {{\bf{Y}}^T}{{\left( {{\bf{I}} - {\bf{L}}} \right)}^T}\left( {{\bf{I}} - {\bf{L}}} \right){\bf{Y}}} . $$
((5.52))

This quantity measures the goodness-of-fit of a GWR model for the given data and can be used to estimate \( {\sigma^2} \), the common variance of the error terms \( {\varepsilon_i} \), \( i = 1,2, \cdots, n \).

  2. Goodness-of-Fit Test

Using the residual sum of squares and its approximated distribution, we can test whether a GWR model describes a given data set significantly better than an OLR model. If a GWR model is used to fit the data, under Assumption 5.2, Leung et al. (2000b) show that the residual sum of squares can be expressed as (5.52) and the distribution of \( {\delta_1}RS{S_g}/{\delta_2}{\sigma^2} \) can be approximated by a chi-square distribution with \( \delta_1^2/{\delta_2} \) degrees of freedom, where \( {\delta_1} = tr{\left[ {{{\left( {{\mathbf{I}} - {\mathbf{L}}} \right)}^T}\left( {{\mathbf{I}} - {\mathbf{L}}} \right)} \right]} \), \( {\delta_2} =tr{\left[ {{{\left( {{\mathbf{I}} - {\mathbf{L}}} \right)}^T}\left( {{\mathbf{I}} - {\mathbf{L}}} \right)} \right]^2} \), and \( {\sigma^2} \) is the common variance of the error terms whose unbiased estimate is \( RS{S_g}/{\delta_1} \).

If an OLR model is used to fit the data, the residual sum of squares is \( RS{S_o} = {{\mathbf{Y}}^T}\left( {{\mathbf{I}} - {\mathbf{Q}}} \right){\mathbf{Y}} \), where \( {\mathbf{Q}} = {\mathbf{X}}{\left( {{{\mathbf{X}}^T}{\mathbf{X}}} \right)^{ - {\mathbf{1}}}}{{\mathbf{X}}^T} \) and \( {\mathbf{I}} - {\mathbf{Q}} \) is idempotent. So, \( RS{S_o}/{\sigma^2} \) is exactly distributed as a chi-square distribution with \( n - p - 1 \) degrees of freedom (Neter et al. 1989; Hocking 1996).

If the null hypothesis, \( {H_o} \) : there is no significant difference between OLR and GWR models for the given data, is true, then the quantity \( RS{S_g}/RS{S_o} \) is close to one. Otherwise, it tends to be small. Let

$$ F = \frac{{RS{S_g}/{\delta_1}}}{{RS{S_o}/\left( {n - p - 1} \right)}}. $$
((5.53))

Then a small value of \( F \) supports the alternative hypothesis that the GWR model has a better goodness-of-fit. On the other hand, the distribution of \( F \) may reasonably be approximated by an \( F \)-distribution with \( \delta_1^2/{\delta_2} \) degrees of freedom in the numerator and \( n - p - 1 \) degrees of freedom in the denominator. Given a significance level \( \alpha \), we denote by \( {F_{1 - \alpha }}\left( {\delta_1^2/{\delta_2},n - p - 1} \right) \) the upper \( 100\left( {1 - \alpha } \right) \) percentage point. If \( F < {F_{1 - \alpha }}\left( {\delta_1^2/{\delta_2},n - p - 1} \right) \), we reject the null hypothesis and conclude that the GWR model describes the data significantly better than the OLR model. Otherwise, we conclude that the GWR model cannot significantly improve the fit compared with the OLR model. Testing the goodness-of-fit via the analysis-of-variance method and a stepwise procedure for selecting the independent variables are also given in Leung et al. (2000b).
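A sketch of the whole procedure: build \( \mathbf{L} \) row by row as in (5.51), form both residual sums of squares, and compare the statistic \( F \) of (5.53) against the approximating \( F \)-distribution. The Gaussian weighting scheme, the simulated data and the scipy call are assumptions of the example.

```python
# Illustrative goodness-of-fit test (5.51)-(5.53) for GWR vs. OLR.
import numpy as np
from scipy.stats import f as f_dist

def gwr_f_test(X, y, coords, theta):
    n, p1 = X.shape                                    # p1 = p + 1
    L = np.empty((n, n))
    for i in range(n):
        w = np.exp(-theta * np.sum((coords - coords[i]) ** 2, axis=1))
        XtW = X.T * w
        L[i] = X[i] @ np.linalg.solve(XtW @ X, XtW)    # i-th row of L, (5.51)
    R = (np.eye(n) - L).T @ (np.eye(n) - L)
    rss_g = y @ R @ y                                  # (5.52)
    d1, d2 = np.trace(R), np.trace(R @ R)
    Q = X @ np.linalg.solve(X.T @ X, X.T)
    rss_o = y @ (np.eye(n) - Q) @ y                    # OLR residual sum of squares
    F = (rss_g / d1) / (rss_o / (n - p1))              # (5.53)
    return F, f_dist.cdf(F, d1 ** 2 / d2, n - p1)      # small F (and p) favours GWR

rng = np.random.default_rng(6)
n = 80
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X[:, 0] + (1 + coords[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=n)
F, p = gwr_f_test(X, y, coords, theta=20.0)
```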

3. Test for Variation of Each Set of Parameters

After a final model is selected, we can further test whether or not each set of parameters in the model varies significantly across the study region. For example, if the set of parameters \( \left\{ {{\beta_{ik}};\;i = 1,2, \cdots, n} \right\} \) of \( {x_k} \) (if \( k = 0 \), the parameters examined correspond to the intercept terms) is found not to vary significantly over the region, we can treat the coefficient of \( {x_k} \) as constant and conclude that the slope between \( {x_k} \) and the dependent variable is uniform over the area when the other variables are held fixed. Statistically, this is equivalent to testing the hypotheses

$$\begin{array}{rl} {H_0}: & {\beta_{1k}} = {\beta_{2k}} = \cdots = {\beta_{nk}}\quad {\text{for a given }}k, \\ {H_1}: & {\text{not all }}{\beta_{ik}},\ i = 1,2, \cdots, n,\ {\text{are equal}}. \\ \end{array}$$

First, we must construct an appropriate statistic which can reflect the spatial variation of the given set of parameters. A practical and yet natural choice is the sample variance of the estimated values of \( {\beta_{ik}},i = 1,2, \cdots, n \). We denote by \( V_k^2 \) the sample variance of the n estimated values, \( {\hat \beta_{ik}},i = 1,2, \cdots, n \), for the \( k \)th parameter. Then

$$ V_k^2 = \frac{1}{n}\sum\limits_{i = 1}^n {{{\left( {{{\hat \beta }_{ik}} - \frac{1}{n}\sum\limits_{j = 1}^n {{{\hat \beta }_{jk}}} } \right)}^2}}, $$
(5.54)

where \( {\hat \beta_{ik}}\;\left( {i = 1,2, \cdots, n} \right) \) are obtained by (5.43).

The next stage is to determine the sampling distribution of \( V_k^2 \) under the null hypothesis \( {H_o} \). Let \( {\hat \beta_k} = {\left( {{{\hat \beta }_{1k}},{{\hat \beta }_{2k}}, \cdots, {{\hat \beta }_{nk}}} \right)^T} \) and \( \mathbf{J} \) be an \( n \times n \) matrix with unity for each of its elements. Then \( V_k^2 \) can be expressed as

$$ V_k^2 = \frac{1}{n}\hat \beta_k^T\left( {\mathbf{I} - \frac{1}{n}\mathbf{J}} \right){\hat \beta_k}. $$
(5.55)

Under the null hypothesis that all of the \( {\beta_{ik}},\;i = 1,2, \cdots, n \), are equal, we may assume that the means of the corresponding estimated parameters are equal, i.e.,

$$ E\left( {{{\hat \beta }_{1k}}} \right) = E\left( {{{\hat \beta }_{2k}}} \right) = \cdots = E\left( {{{\hat \beta }_{nk}}} \right) = {\mu_k}. $$
(5.56)

Thus,

$$ E\left( {{{\hat \beta }_k}} \right) = {\mu_k}\mathbf{1}\,, $$
(5.57)

where \( \mathbf{1} \) is a column vector with unity for each element. From (5.57) and the fact that \( {\mathbf{1}^T}\left( {\mathbf{I} - \frac{1}{n}\mathbf{J}} \right) = \mathbf{0} \) and \( \left( {\mathbf{I} - \frac{1}{n}\mathbf{J}} \right)\mathbf{1} = \mathbf{0} \), we can further express \( V_k^2 \) as

$$ V_k^2 = \frac{1}{n}{\left[ {{{\hat \beta }_k} - E\left( {{{\hat \beta }_k}} \right)} \right]^T}\left( {\mathbf{I} - \frac{1}{n}\mathbf{J}} \right)\left[ {{{\hat \beta }_k} - E\left( {{{\hat \beta }_k}} \right)} \right]. $$
(5.58)

Furthermore, let \( {\mathbf{e}_k} \) be a column vector with unity for the \( \left( {k + 1} \right) \)th element and zero for all other elements. Then

$$ {\hat \beta_{ik}} = {\mathbf{e}}_k^T\hat \beta (i) = \mathbf{e}_k^T{\left( {{{\mathbf{X}}^T}{\mathbf{W}}(i){\mathbf{X}}} \right)^{ - 1}}{{\mathbf{X}}^T}{\mathbf{W}}(i){\mathbf{Y}} $$
(5.59)

and

$$ {\hat \beta_k} = {\left( {{{\hat \beta }_{1k}},{{\hat \beta }_{2k}}, \cdots, {{\hat \beta }_{nk}}} \right)^T} = {\mathbf{BY}}\,, $$
(5.60)

where

$$ \mathbf{B} = \left( {\begin{array}{*{20}c} {\mathbf{e}_k^T \left( {{\mathbf{X}}^T {\mathbf{W}}\left( 1 \right){\mathbf{X}}} \right)^{ - 1} {\mathbf{X}}^T {\mathbf{W}}\left( 1 \right)} \\ {\mathbf{e}_k^T \left( {{\mathbf{X}}^T {\mathbf{W}}\left( 2 \right){\mathbf{X}}} \right)^{ - 1} {\mathbf{X}}^T {\mathbf{W}}\left( 2 \right)} \\ \vdots \\ {\mathbf{e}_k^T \left( {{\mathbf{X}}^T {\mathbf{W}}\left( n \right){\mathbf{X}}} \right)^{ - 1} {\mathbf{X}}^T {\mathbf{W}}\left( n \right)} \\ \end{array}} \right). $$
(5.61)

Substituting (5.60) into (5.58), we obtain

$$\begin{array}{rl} V_k^2 & = \frac{1}{n}{\left( {\mathbf{Y} - E\left( \mathbf{Y} \right)} \right)^T}{\mathbf{B}^T}\left( {\mathbf{I} - \frac{1}{n}\mathbf{J}} \right)\mathbf{B}\left( {\mathbf{Y} - E\left( \mathbf{Y} \right)} \right) \\ & = {\varepsilon^T}\left( {\frac{1}{n}{\mathbf{B}^T}\left( {\mathbf{I} - \frac{1}{n}\mathbf{J}} \right)\mathbf{B}} \right)\varepsilon, \\ \end{array}$$
(5.62)

where \( \varepsilon \sim N\left( {\mathbf{0},\,{\sigma^2}\mathbf{I}} \right) \) and \( \frac{1}{n}{{\mathbf{B}}^T}\left( {{\mathbf{I}} - \frac{1}{n}{\mathbf{J}}} \right){\mathbf{B}} \) is positive semidefinite.

Similar to the method employed above, the distribution of \( {\gamma_1}V_k^2/{\gamma_2}{\sigma^2} \) can be approximated by a chi-square distribution with \( \gamma_1^2/{\gamma_2} \) degrees of freedom, where

$$ {\gamma_i} = tr{\left( {\frac{1}{n}{{\mathbf{B}}^T}\left( {{\mathbf{I}} - \frac{1}{n}{\mathbf{J}}} \right){\mathbf{B}}} \right)^i}\;,\quad i = 1,2. $$
(5.63)

Since \( {\sigma^2} \) is unknown, we cannot use \( {\gamma_1}V_k^2/{\gamma_2}{\sigma^2} \) as a test statistic directly. However, we know that the distribution of \( \delta_1^2{\hat \sigma^2}/{\delta_2}{\sigma^2} \) can be approximated by a chi-square distribution with \( \delta_1^2/{\delta_2} \) degrees of freedom, where \( {\hat \sigma^2} \) is an unbiased estimator of \( {\sigma^2} \), and \( {\delta_i} = tr{\left( {{{\left( {{\mathbf{I}} - {\mathbf{L}}} \right)}^T}\left( {{\mathbf{I}} - {\mathbf{L}}} \right)} \right)^i},i = 1,2 \). So, for the statistic

$$ {F_3}(k) = \frac{{V_k^2/{\gamma_1}}}{{{{\hat \sigma }^2}}}, $$
(5.64)

under the assumption in (5.56), its distribution can be approximated by an \( F \)-distribution with \( \gamma_1^2/{\gamma_2} \) degrees of freedom in the numerator and \( \delta_1^2/{\delta_2} \) degrees of freedom in the denominator. Therefore, we can take \( {F_3} \) as a test statistic. A large value of \( {F_3} \) supports the alternative hypothesis \( {H_1} \). For a given significance level \( \alpha \), find the upper \( 100\alpha \) percentage point \( {F_\alpha }\left( {\gamma_1^2/{\gamma_2},\delta_1^2/{\delta_2}} \right) \). If \( {F_3} \ge {F_\alpha }\left( {\gamma_1^2/{\gamma_2},\delta_1^2/{\delta_2}} \right) \), reject \( {H_0} \); otherwise, accept \( {H_0} \).
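The variation test admits a similar sketch under the same illustrative conventions (the inputs `sigma2_hat`, `delta1` and `delta2` are assumed to be carried over from the goodness-of-fit step above): it builds \( \mathbf{B} \) from (5.61), \( V_k^2 \) from (5.55), and \( F_3(k) \) from (5.64):

```python
import numpy as np
from scipy import stats

def gwr_variation_test(y, X, W_list, k, sigma2_hat, delta1, delta2):
    """Approximate F test for spatial variation of the k-th set of
    parameters, following (5.55)-(5.64); k = 0 tests the intercepts.
    """
    n = X.shape[0]
    # B stacks the rows e_k^T (X^T W(i) X)^{-1} X^T W(i), as in (5.61)
    B = np.zeros((n, n))
    for i, W in enumerate(W_list):
        XtW = X.T @ W
        B[i, :] = np.linalg.solve(XtW @ X, XtW)[k, :]
    centre = np.eye(n) - np.ones((n, n)) / n       # I - J/n
    C = B.T @ centre @ B / n
    gamma1, gamma2 = np.trace(C), np.trace(C @ C)
    beta_k = B @ y                                 # (5.60)
    Vk2 = float(beta_k @ centre @ beta_k) / n      # (5.55)
    F3 = (Vk2 / gamma1) / sigma2_hat               # (5.64)
    p_value = 1.0 - stats.f.cdf(F3, gamma1**2 / gamma2, delta1**2 / delta2)
    return F3, p_value
```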

The simulation results in Leung et al. (2000b) have shown that the test power of the proposed statistics is rather high and their \( p \)-values are rather robust to the variation of the parameter in the weighting matrix.

5.3.3 Local Variations of Regional Industrialization in Jiangsu Province, P.R. China

The GWR technique is employed in Huang and Leung (2002) to explore the relationships between the level of industrialization (the share of industrial output in the total output of industry and agriculture) and various factors over the study area. Many aspects, such as social, economic, human, geographical, historical and financial factors, are related to the process of industrialization. The determinant factors of regional industrialization include the share of urban labor in total population (UL), GDP per capita (GP), fixed capital investment per unit of GDP (IG), and the share of township and village enterprises output in gross output value of industry and agriculture (TVGIA). UL is an indicator of the level of urbanization. GP represents the level of economic development. UL and GP set up the context of industrialization in an area. On the other hand, IG and TVGIA are considered factors directly related to the process of industrialization.

Before investigating possible spatial variations in the determinants of industrialization across Jiangsu Province, the global regression equation representing the average relationships between the level of industrialization and various factors over 75 spatial units is obtained as follows:

$$\begin{array}{l} {\text{Y}} = 41.211 + 0.440\,{\text{UL}} + 0.0008066\,{\text{GP}} + 0.381\,{\text{IG}} + 0.391\,{\text{TVGIA}} \\ \qquad\ \ (14.353)\qquad (4.190)\qquad\quad (3.302)\qquad\quad (4.268)\qquad\ \ (7.598) \\ R = 0.913,\quad {R^2} = 0.834,\quad {\text{Adjusted }}{R^2} = 0.824,\quad {\text{Significance level}} = 0.001 \\ \end{array}$$
(5.65)

The numbers in brackets are t-statistics of the estimated parameters. The R-squared value of the above model is 0.834, which means that the equation explains 83.4% of the variance of the level of industrialization in 1995.

To consider the spatial variation of relationships between the level of industrialization and various determinants, the GWR model is applied. To estimate parameters \( \beta_{ik}^* \), \( i = 1,2, \ldots, n;\,\;k = 1,2, \ldots, p \), the study adopts the commonly used Gaussian function

$$ {W_{ij}} = \exp \left( { - \theta \,d_{ij}^2} \right),\quad i,j = 1,2, \ldots, n $$
(5.66)

to calculate the weight \( {W_{ij}} \) in the weighting matrix. Here, \( {d_{ij}} \) is the geometric distance between the central points of locations \( i \) and \( j \). However, \( \theta \) is a nonnegative parameter and different values of \( \theta \) will result in different weights. Thus, the estimated parameters of GWR are not unique. The best \( \theta \) is chosen by the following procedure:

Assume that there are many different possible values of \( \theta \). Then, for each \( \theta \), the weighting matrix \( {\mathbf{W}_i} \), i = 1, 2, …, n, is obtained from (5.66). Consequently, many weighting matrices can likewise be obtained. A weighted OLS calibration is then used to obtain many sets of \( \beta_i^* \), i = 1, 2, …, n, in (5.29). It should be noted that the observations at location i are not included in the estimation of its parameters. Thus, many different values of \( Y_{ \ne i}^*\left( \theta \right) \), the fitted value of \( {Y_i} \), can be estimated at this stage, and the corresponding residual sum of squares, \( \sum\nolimits_i {{{\left[ {{Y_i} - Y_{ \ne i}^*\left( \theta \right)} \right]}^2}} \), can also be calculated. Finally, the best value of \( \theta \) is selected by minimizing this residual sum of squares.
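A minimal sketch of this cross-validation loop (Python/NumPy; `coords` holds the central points of the spatial units, and all names are illustrative rather than taken from Huang and Leung 2002) reads:

```python
import numpy as np

def cv_score(theta, X, y, coords):
    """Leave-one-out CV score for the bandwidth parameter theta, using
    the Gaussian weighting W_ij = exp(-theta * d_ij^2) of (5.66)."""
    n = len(y)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-theta * d2)
    score = 0.0
    for i in range(n):
        w = W[i].copy()
        w[i] = 0.0                     # exclude location i from its own fit
        XtW = X.T * w                  # X^T W(i)
        beta_i = np.linalg.solve(XtW @ X, XtW @ y)
        score += (y[i] - X[i] @ beta_i) ** 2
    return score

# best_theta = min(candidate_thetas, key=lambda t: cv_score(t, X, y, coords))
```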

Applying the above procedure to the analysis of industrialization in Jiangsu province, the best value of \( \theta \) is obtained. Figure 5.1 shows the CV score against the parameter \( \theta \). The minimum CV score is obtained when \( \theta \) equals 0.9. That is,

Fig. 5.1 The CV score against the parameter \( \theta \)

$$ \sum\limits_{i = 1}^n {{{\left[ {{Y_i} - Y_{ \ne i}^*\left( {0.9} \right)} \right]}^2}} = \mathop {\min }\limits_\theta \sum\limits_{i = 1}^n {{{\left[ {{Y_i} - Y_{ \ne i}^*\left( \theta \right)} \right]}^2}}. $$
(5.67)

Thus, the weighting matrix \( {\mathbf{W}_i} \), i = 1, 2, …, n, is determined, where \( {W_{ij}} = \exp \left( { - 0.9\,d_{ij}^2} \right) \).

Spatial distributions of the parameter estimates are shown in Figs. 5.2–5.7. Based on the spatial distributions of the parameter estimates, there appear to be significant local variations in the relationships between various factors and industrial development across Jiangsu province. Figure 5.2 shows the spatial distribution of the intercept terms in Jiangsu province in 1995. In principle, the intercept term measures the fundamental level of industrialization excluding the effects of all factors on regional industrialization across Jiangsu province. It is henceforth referred to as “the basic level of regional industrialization.” There is a clear spatial variation with higher constant parameters in the southern region and lower ones in the northern region. Thus the basic level of regional industrialization in Jiangsu province displayed a ladder-step distribution which varies from high in the south to low in the north. It also confirms the existence of significant regional disparity in the level of regional industrialization.

Fig. 5.2 Spatial distribution of the regression constant in Jiangsu

The spatial distribution of the UL parameter in Jiangsu is shown in Fig. 5.3. It can be observed that the central areas had greater UL parameter estimates while the southern areas had medium parameter estimates, whereas the northern areas had lower parameter estimates. It means that the share of urban labor in total population had the most important effect on industrialization in the central region. On the other hand, the parameter estimate of UL in the global model is 0.440, which actually belongs to the relationship in the central areas of the GWR analysis. Therefore, the relationship of the global model was essentially similar to those of the local models in the central region. This is possibly due to the fact that the condition of industrialization in the central region lies between that of the southern and northern regions.

Fig. 5.3 Spatial distribution of the UL parameter in Jiangsu

The spatial variation in the GP parameter in Fig. 5.4 depicts the differing effect of GDP per capita on the level of industrialization across Jiangsu in 1995. The effect of GDP per capita on regional industrialization was similar in most areas, but some areas in the northern region exhibited a certain extent of spatial variation in 1995. It means that GDP per capita played a more important role in some northern areas than in other areas.

Fig. 5.4 Spatial distribution of the GP parameter in Jiangsu

The spatial distribution of the IG parameter in Fig. 5.5 shows a trend differing from those of the constant and the GP parameters. The fixed capital investment per unit of GDP had the smallest effect on regional industrialization in the southern areas. In contrast, it exerted the greatest effect on the development of regional industrialization in the central and northern areas. It means that capital investment per unit of GDP was more important in the central and northern regions than in the southern region. It also indicates that the development of regional industrialization in the southern region did not rely very much on the amount of capital investment. It should be observed that the IG parameter in the global model is 0.381. Clearly, the global model represents an average relationship across the study areas.

Fig. 5.5 Spatial distribution of the IG parameter in Jiangsu

The spatial distribution of the TVGIA parameter in Fig. 5.6 is very similar to that of the UL parameter in Fig. 5.3. The TVGIA factor had a greater effect on regional industrialization in some central and northern areas. It is apparent that TVEs were more important to industrialization in the central and northern areas. The parameter estimate of TVGIA in the global model is 0.391, which is located in the second last group with larger TVGIA parameter values in Fig. 5.6. Thus, the global model mainly represents the central and northern areas belonging to that group in Fig. 5.6.

Fig. 5.6 Spatial distribution of the TVGIA parameter in Jiangsu

Another important spatial distribution obtained from the GWR analysis is the spatial variation in the goodness-of-fit statistic, R-square, shown in Fig. 5.7. It shows that the R-square value varies from 0.665 to 0.963. As previously analyzed, the global model explains 83.4% of the variance of the level of industrialization, which is between the minimum and the maximum values of R-square. Therefore, some local models have a better fit than the global model, while others do not. It can be observed that the northern areas usually have higher R-square values. It can then be inferred that the relationships between the selected factors and the level of regional industrialization are much better captured by the regression model in the northern region. However, the development of regional industrialization in the southern and the central regions may be affected by other factors or by areas outside Jiangsu province. It is reasonable to suggest that the economic development of Shanghai plays a very important role in the regional industrialization of the southern and the central areas in Jiangsu since they are close in terms of geographical location. But the GWR analysis did not consider the external effect coming from areas outside Jiangsu. This may be the reason for the smaller R-square values in the central and the southern regions. Such relationships between Shanghai and Jiangsu are not considered since no consistent data are available.

Fig. 5.7 Spatial distribution of the R-square value in Jiangsu

The parameter estimates of various factors affecting regional industrialization in Jiangsu province show different spatial variations, indicating possible spatial nonstationarity. Thus, the GWR technique appears to be a useful method to unravel spatial nonstationarity. However, from the statistical viewpoint, two critical questions still remain. One is whether the GWR model describes the relationship significantly better than the OLR model. The other is whether each set of parameter estimates \( \beta_{ik}^* \), \( i = 1,2, \ldots, n;\;k = 1,2, \ldots, p \), exhibits significant spatial variation over the study areas (Leung et al. 2000b).

From the result of Table 5.1, it is clear that at the significance level of 0.0081, the GWR model performs better than the OLR model in the analysis of regional industrialization of Jiangsu province. Thus, the relationships of regional industrialization and the factors affecting it exhibit significant spatial nonstationarity over the county-level areas in Jiangsu province.

Table 5.1 Test statistics of the GWR model

In terms of the spatial variation of the estimated parameters, the test result shows that the constant parameter and the GP parameter exhibit significant spatial nonstationarity over the whole study area. Statistically, the other three factors, UL, IG and TVGIA, did not have significant spatial variation. Therefore, spatial variation of the effect of economic factors on regional industrialization is mainly represented by the basic level of industrialization and GDP per capita among county-level areas in Jiangsu.

In the GWR analysis, it is assumed that spatial relationships between two areas show the distance-decay effect. However, with the advancement of information technology, friction of distance may be weakened. Nevertheless, in developing countries such as China, distance decay still plays a crucial role in the interaction between areas. Therefore, in the study of regional economic development in China, the GWR technique appears to be an effective tool to explore variations among different localities.

5.3.4 Discovering Spatial Pattern of Influence of Extreme Temperatures on Mean Temperatures in China

It has been recognized that the increase in global mean temperature is closely related to temperature extremes. Extensive studies have been carried out on the extreme temperature events in different regions of the world (Beniston and Stephenson 2004; Bonsal et al. 2001; DeGaetano 1996; DeGaetano and Allen 2002; Heino et al. 1999; Prieto et al. 2004; Robeson 2004) in general and China (Gong et al. 2004; Qian and Lin 2004; Yan et al. 2001; Zhai and Pan 2003; Zhai et al. 1999) in particular. For China as a whole, the frequency of extremely low temperature exhibits a significant decreasing trend while that of extremely high temperature shows a slightly decreasing or insignificant trend, which may be a main cause of the increase in mean temperature.

In the study of extreme temperatures, attention has been concentrated on their temporal trends. Spatial characteristics have generally been analyzed on a station-by-station basis (Beniston and Stephenson 2004; Bonsal et al. 2001; Gong et al. 2004; Prieto et al. 2004; Qian and Lin 2004); such analysis, however, does not take into account the spatial autocorrelation of the data among the stations. For a large territory like China, where temperature varies considerably from north to south and east to west, different spatial characteristics may be found in different areas so that spatial non-stationarity may be commonplace. Therefore, the GWR model would be a useful technique to unravel local relationships if they exist. Wang et al. (2005) give such a study.

The original data of the study consist of daily observed mean temperature and maximal and minimal temperatures of 40 years from 1961 to 2000 collected at 110 observatories on the mainland of China. At each observatory, the mean temperature in a day was obtained by averaging the observed temperature values at 2, 5, 8 and 20 h of the 24-h period, while the maximal and minimal temperatures were, respectively, the largest and smallest values of the continuously measured temperature in a whole day. Based on the daily observed temperatures, a data set is obtained to discover the spatial patterns of influence of extreme temperatures on mean temperature via the GWR model and the associated statistics (Leung et al. 2000b; Mei et al. 2004). It contains the mean temperature, mean maximal and mean minimal temperatures. The GWR technique with the associated tests is applied to unravel spatial nonstationarity by taking the mean temperature as the response and the mean maximal and mean minimal temperatures as the explanatory variables. The model to be fitted is

$$ {y_i} = {\beta_0}\left( {{u_i},{v_i}} \right) + {\beta_1}\left( {{u_i},{v_i}} \right){x_{i1}} + {\beta_2}\left( {{u_i},{v_i}} \right){x_{i2}} + {\varepsilon_i},\quad i = 1,2, \ldots, 110, $$
(5.68)

where \( \left( {{y_i};{x_{i1}},{x_{i2}}} \right) \), \( i = 1,2, \ldots, 110 \), are the observations of mean temperature and mean maximal and mean minimal temperatures at the 110 observatories located at longitude \( {u_i} \) and latitude \( {v_i} \). Based on the Gaussian kernel function, the distance between any two observatories is computed according to the longitudes and latitudes of the observatories to formulate the weight. The optimal bandwidth value is selected by the cross-validation approach.

For the data set, the selected bandwidth value is \( {h_o} = 0.42 \) (kilometer \( \times {10^3} \)) and the \( p \)-values for testing the significance of variation of the three coefficients are, respectively, \( {p_o} = 0.0004239 \), \( {p_1} = 0.0007347 \) and \( {p_2} = 0.0000159 \), which shows that the variation of each coefficient across the mainland of China is very significant.

Based on Fig. 5.8, the contribution rate of mean maximal temperature to mean temperature over 40 years varies rather significantly over the mainland of China. In the northeastern region where the latitude is greater than about \( 45^\circ \), it is discovered that the rates (the largest) range from about 0.6 to 1.182 from north to south. That is, the sharpest increase in mean temperature with the increase of mean maximal temperature is discovered in the coldest area of China. On the other hand, the smallest contribution rates, which vary from about 0.2 to 0.4, are detected around Bohai Bay, the southwestern region and the northern part of Xinjiang province. The remaining part of mainland China, from northwest to southeast, shows roughly homogeneous contribution rates ranging from about 0.6 to 0.8. It is interesting to observe that the contribution rates of mean maximal temperature to mean temperature appear in apparent regional clusters.

Fig. 5.8 Spatial distribution of the estimates for the coefficient \( {\beta_1}\left( {{u_i},{v_i}} \right) \) of mean maximal temperature over 40 years

From Fig. 5.9, the contribution rates of mean minimal temperature to mean temperature over 40 years reveal a significant increasing trend from north to south over the mainland of China. Specifically, when mean minimal temperature increases by one unit, the increase of mean temperature is greater in the southern areas than in the northern areas. The smallest rates, roughly from 0.25 to 0.39, are observed in the northern region where the latitude is greater than about \( 44^\circ \). The largest rates, ranging from 0.47 to 0.62, are unraveled mainly to the south of the Yangzi river where the latitude is less than \( 30^\circ \). The rates in the remaining areas range from about 0.32 to 0.47.

Fig. 5.9 Spatial distribution of the estimates for the coefficient \( {\beta_2}\left( {{u_i},{v_i}} \right) \) of mean minimal temperature over 40 years

Apparently, the influence of mean maximal temperature on mean temperature exhibits spatial non-stationarity that appears as several obvious spatial clusters. The influence is the most intense in the northeastern region and the least intense in the southwestern region and around Bohai Bay, while the influence is moderate from northwest to southeast. In contrast, the influence of mean minimal temperature on mean temperature is more intense in the southern than in the northern region, showing an increasing trend from north to south. This is actually the answer to the spatial non-stationarity problem raised in Sect. 1.5 of Chap. 1.

5.4 Testing for Spatial Autocorrelation in Geographically Weighted Regression

It should be observed that one of the important assumptions for the GWR technique to be applied to the varying-parameter model in (5.41) is that the disturbance terms are independent and identically distributed. However, the existence of spatial autocorrelation, which is one of the main characteristics of spatial data sets, may invalidate certain standard methodological results. For example, spatial autocorrelation among the disturbance terms in the OLR model can lead to inefficient least-squares estimators and misleading statistical inference results. Furthermore, the standard assumption of constant variance of the disturbance terms may fail to hold in the presence of spatial autocorrelation (Cliff and Ord 1973, 1981; Krämer and Donninger 1987; Anselin 1988; Griffith 1988; Anselin and Griffith 1988; Cordy and Griffith 1993). As is evident in the literature, most statistical tests in regression analysis are based on the notion of residual sum of squares, more specifically on the estimator of variance of the disturbances, as is adopted in the well known OLR technique (Hocking 1996; Neter et al. 1996), the locally weighted regression technique (Cleveland 1979; Cleveland and Devlin 1988; Cleveland et al. 1988), and the GWR technique (Leung et al. 2000b; Brunsdon et al. 1999) for the varying-parameter regression model in (5.41). Heteroscedasticity in the disturbances caused by spatial autocorrelation thus makes such testing methods invalid. Since autocorrelated disturbances pose such serious problems on the use of regression techniques, it is then extremely important to be able to test for their presence.

For the OLR technique, this problem has long been investigated. Substantial effort has been devoted to the tests for spatial autocorrelation in the OLR model. Two basic types of test methods are commonly used in the literature. One is the generalized form of Moran's \( {I_0} \) (Moran 1950) or Geary's \( c \) (Geary 1954) applied to the OLR residuals, as suggested by Cliff and Ord (1972, 1973, 1981); in order not to confuse it with the notation of the identity matrix \( \mathbf{I} \), Moran's statistic is denoted by \( {I_0} \) instead of the conventional \( I \) in this discussion. The other is the likelihood-function-based methods such as the Lagrange multiplier form of test (Burridge 1980) or the likelihood ratio test (Griffith 1988; Anselin 1988). Both types rely upon the asymptotic distribution of the statistics under the null hypothesis of no spatial autocorrelation. Recently, based on the theoretical results of Imhof (1961) and the algebraic results of Koerts and Abrahamse (1968), Tiefelsdorf and Boots (1995, with corrections 1996) as well as Hepple (1998) have independently derived the exact distributions of Moran's \( {I_0} \) and Geary's \( c \) for the OLR residuals under the null hypothesis of no spatial autocorrelation among the normally distributed disturbances. Based on the test statistics of Moran's \( {I_0} \) and Geary's \( c \), Leung et al. (2000c) first extend the exact test method developed by Tiefelsdorf and Boots (1995) and Hepple (1998) for the OLR residuals to the GWR case.

A statistical procedure is developed by Leung et al. (2000c) to test for spatial autocorrelation among the residuals of the GWR model. They focus on the test of spatial autocorrelation among the disturbance terms \( {\varepsilon_1},{\varepsilon_2}, \cdots, {\varepsilon_n} \) of the model in (5.41) when the GWR technique is employed to calibrate it. Similar to the case of the OLR model, the null hypothesis for testing spatial autocorrelation in the varying-parameter model can still be formulated as:

\( {{\text{H}}_0} \): There is no spatial autocorrelation among the disturbances, or alternatively

$$ Var\left( \varepsilon \right) = E\left( {\varepsilon {\varepsilon^T}} \right) = {\sigma^2}\mathbf{I} $$

where \( \varepsilon = {\left( {{\varepsilon_1},{\varepsilon_2}, \cdots, {\varepsilon_n}} \right)^T} \) is the disturbance vector.

The alternative hypothesis is that there exists (positive or negative) spatial autocorrelation among the disturbances with respect to a specific spatial weight matrix \( \mathbf{W} \) which is defined by the underlying spatial structure such as the spatial contiguity or adjacency between the geographical units where observations are made. The simplest form of \( \mathbf{W} \) can be the one that assigns 1 to two units that come in contact and 0 otherwise. It can also incorporate information on distances, flows, and other types of linkages.

Since the disturbance vector \( \varepsilon = {\left( {{\varepsilon_1},{\varepsilon_2}, \cdots, {\varepsilon_n}} \right)^T} \) is not observable, the autocorrelation among the residuals is tested instead, i.e., the errors which result from comparing each local GWR estimate of \( y \) with its actual value. When the model in (5.41) is calibrated by the GWR technique, we obtain the results from (5.48) to (5.52).

Spatial autocorrelation based on Moran’s \( {I_0} \) and Geary’s c

For the residuals \( \hat \varepsilon = {\left( {{{\hat \varepsilon }_1},{{\hat \varepsilon }_2}, \cdots, {{\hat \varepsilon }_n}} \right)^T} \) in (5.49) and (5.50), and a specific spatial weight matrix \( \mathbf{W} = \left( {{w_{ij}}} \right) \), Moran's \( {I_0} \) takes the form of

$$ {I_0} = \frac{n}{s}\frac{{\sum\limits_{i = 1}^n {\sum\limits_{j = 1}^n {{w_{ij}}{{\hat \varepsilon }_i}{{\hat \varepsilon }_j}} } }}{{\sum\limits_{i = 1}^n {\hat \varepsilon_i^2} }} = \frac{n}{s}\frac{{{{\hat \varepsilon }^T}\mathbf{W}\hat \varepsilon }}{{{{\hat \varepsilon }^T}\hat \varepsilon }}, $$
(5.69)

where \( s = \sum\nolimits_{i = 1}^n {\sum\nolimits_{j = 1}^n {{w_{ij}}} } \). The spatial weight matrix is commonly used in its row-standardized form. That is, the elements of each row are normalized to sum to 1, and this may make \( \mathbf{W} \) asymmetric. Nevertheless, if \( \mathbf{W} \) is asymmetric, we can construct from it a new symmetric spatial weight matrix as

$$ {\mathbf{W}^*} = \left( {w_{ij}^*} \right) = \frac{1}{2}\left( {\mathbf{W} + {\mathbf{W}^T}} \right). $$
(5.70)

Since \( {{\hat \varepsilon }^T}{\mathbf{W}^T}\hat \varepsilon = {{\hat \varepsilon }^T}\mathbf{W}\hat \varepsilon \), we have

$$ \frac{{{{\hat \varepsilon }^T}{\mathbf{W}^*}\hat \varepsilon }}{{{{\hat \varepsilon }^T}\hat \varepsilon }} = \frac{{{{\hat \varepsilon }^T}\mathbf{W}\hat \varepsilon }}{{{{\hat \varepsilon }^T}\hat \varepsilon }}. $$
(5.71)

Thus, without loss of generality, we can assume that \( \mathbf{W} \) is symmetric. Also, the term \( n/s \) in (5.69) is purely a scaling factor which can be omitted from the test statistic without affecting the \( p \)-value of the statistic. Hence, we can write Moran’s \( {I_0} \) as

$$ {I_0} = \frac{{{{\hat \varepsilon }^T}\mathbf{W}\hat \varepsilon }}{{{{\hat \varepsilon }^T}\hat \varepsilon }}, $$
(5.72)

where \( \mathbf{W} \) is a specific symmetric spatial weight matrix of order \( n \).

It is known that a large value of \( {I_0} \) supports the alternative hypothesis that there exists positive autocorrelation among the residuals and a large negative value of \( {I_0} \) supports the alternative hypothesis that there exists negative autocorrelation among the residuals. For these two alternatives, the \( p \)-values of \( {I_0} \) are, respectively, \( p = P\left\{ {{I_0} \ge r} \right\} \) and \( p = P\left\{ {{I_0} \le r} \right\} \), where \( r \) is the observed value of \( {I_0} \). It should be noted that the above two alternatives belong to the one-tailed test. For spatial autocorrelation which corresponds to a two-tailed test, considering the complexity of the distribution of \( {I_0} \), we may simply take the \( p \)-value as \( 2P\left\{ {{I_0} \ge r} \right\} \) if \( P\left\{ {{I_0} \ge r} \right\} \le 1/2 \), or \( 2\left( {1 - P\left\{ {{I_0} \ge r} \right\}} \right) \) if \( P\left\{ {{I_0} \ge r} \right\} > 1/2 \).

Thus, for a given significance level \( \alpha \), if \( p \ge \alpha \), one fails to reject the null hypothesis \( {{\text{H}}_0} \) and concludes that there is no spatial autocorrelation among the residuals. If \( p < \alpha \), one, depending on the assumed alternative hypothesis, rejects \( {{\text{H}}_0} \) and concludes that there exists positive or negative autocorrelation among the residuals. Leung et al. (2000c) show how the p-values can be computed via the Imhof result (Imhof 1961).

Similarly, for the residual vector \( \hat \varepsilon = {\left( {{{\hat \varepsilon }_1},{{\hat \varepsilon }_2}, \cdots, {{\hat \varepsilon }_n}} \right)^T} \) and a specific spatial weight matrix \( \mathbf{W} = \left( {{w_{ij}}} \right) \), Geary's \( c \) is obtained as

$$ c = \frac{{\left( {n - 1} \right)}}{2s}\frac{{\sum\limits_{i = 1}^n {\sum\limits_{j = 1}^n {{w_{ij}}{{\left( {{{\hat \varepsilon }_i} - {{\hat \varepsilon }_j}} \right)}^2}} } }}{{\sum\limits_{i = 1}^n {\hat \varepsilon_i^2} }}. $$
(5.73)

With respect to a given spatial weight matrix \( \mathbf{W} \), a small value of \( c \) supports the alternative hypothesis that there exists positive spatial autocorrelation among the residuals and a large value of \( c \) supports the one stating that there exists negative spatial autocorrelation. For simplicity, we still use \( r \) to represent the observed value of \( c \). The \( p \)-values of \( c \) for testing \( {{\text{H}}_0} \) against the above two alternatives are, respectively, \( P\left\{ {c \le r} \right\} \) and \( P\left\{ {c \ge r} \right\} \). They can again be computed by the Imhof method.
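Both statistics are simple quadratic and cross-product forms of the residual vector, as the following sketch shows (Python/NumPy; the symmetrization step follows (5.70), the scaling factor \( n/s \) of Moran's \( {I_0} \) is omitted as in (5.72), and the function names are our own):

```python
import numpy as np

def moran_i0(resid, W):
    """Moran's I_0 of (5.72) for the GWR residuals."""
    Ws = 0.5 * (W + W.T)               # symmetrize W, as in (5.70)
    return float(resid @ Ws @ resid) / float(resid @ resid)

def geary_c(resid, W):
    """Geary's c of (5.73) for the GWR residuals."""
    n, s = len(resid), W.sum()
    num = (W * (resid[:, None] - resid[None, :]) ** 2).sum()
    return (n - 1) / (2 * s) * num / float(resid @ resid)
```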

To circumvent the computational overhead of the Imhof method, particularly for large samples, the three-moment \( {\chi^2} \) approximation to the null distributions of the test statistics is derived in Leung et al. (2000c). Based on their simulation runs on the Imhof and approximation tests, the following observations are made:

1. The statistics of Moran's \( {I_0} \) and Geary's \( c \) formed by the GWR residuals are quite powerful in exploring spatial autocorrelation among the disturbances of the varying-parameter model, especially for exploring positive autocorrelation. This also implies that, in deriving the \( p \)-values of the test statistics, it is reasonable to assume that the fitted value of \( {y_i} \) is an unbiased estimate of \( E\left( {{y_i}} \right) \) for all \( i \). However, the test statistics are not so sensitive to moderate negative autocorrelation. Some improvement on the proposed testing methods will be necessary in order to overcome this shortcoming.

2. The three-moment \( {\chi^2} \) approximation to the \( p \)-values of \( {I_0} \) and \( c \) is very accurate. Compared with the computational overhead in obtaining the \( p \)-values by the Imhof method, this approximation method is very time-saving, especially for cases with large sample size.

3. The \( p \)-values of \( {I_0} \) and \( c \) are quite robust to the variation of the parameter \( \theta \) in the weighting function for calibrating the model. This makes the testing methods applicable in practice since \( \theta \) could still be predetermined by the cross-validation procedure without considering spatial autocorrelation. Although there is some loss in the significance of spatial autocorrelation, the testing methods still give useful indications which are sufficient to achieve certain practical purposes, especially for exploring positive autocorrelation.

For both the Imhof method and the three-moment \( {\chi^2} \) approximation method proposed in Leung et al. (2000c), the assumption that the disturbance terms are normally distributed plays an important role in deriving the \( p \)-values of \( {I_0} \) and \( c \). Although it is a common assumption in regression analysis, this condition is not easy to satisfy in practice. Therefore, it will be useful to investigate the null distributions of the test statistics for the GWR model under some more general conditions. Moreover, some improvements on the proposed methods are still needed to make them more powerful in order to test for moderate negative autocorrelation.

It should be noted that the measures of spatial autocorrelation in Leung et al. (2000c), both Moran's \( {I_0} \) and Geary's \( c \), are global statistics. Therefore, as shown in the simulations, global association among the GWR residuals can be efficiently tested by the proposed methods, but they may be insensitive to local spatial autocorrelation. A more practical approach may be to use some local statistics to test more general association among the GWR residuals. The LISA method, i.e., local indicators of spatial association (Anselin 1995), seems to be a promising method for this purpose. Though it will be more difficult to develop formal statistical testing methods such as those proposed above, the problem deserves to be investigated in further research.

5.5 A Note on the Extensions of the GWR Model

As a further refinement of the basic GWR model, the mixed GWR model, which is a combination of the ordinary linear regression model and the spatially varying coefficient model, was first proposed by Brunsdon et al. (1999) to model the situation in which the impact of some explanatory variables on the response is spatially homogeneous and that of the remaining explanatory variables varies over space.

A spatially varying coefficient regression model that the GWR technique calibrates is of the form

$$ {y_i} = \sum\limits_{j = 1}^p {{\beta_j}\left( {{u_i},{v_i}} \right){x_{ij}} + {\varepsilon_i},\quad i = 1,2, \cdots, n}, $$
(5.74)

where (\( {y_i};{x_{i1}}, \cdots, {x_{ip}} \)) are observations of the response \( y \) and explanatory variables \( {x_1},{x_2}, \cdots, {x_p} \) at location \( \left( {{u_i},{v_i}} \right) \), and \( {\varepsilon_1},{\varepsilon_2}, \cdots, {\varepsilon_n} \) are independent random errors with mean zero and common variance \( {\sigma^2} \). Generally, one takes \( {x_1} \equiv 1 \) to accommodate a spatially varying intercept in the model. The GWR technique (Brunsdon et al. 1996; Fotheringham et al. 2002) calibrates the model in (5.74) with the locally weighted least-squares procedure in which the weights at each focal spatial point are generated by a given kernel function and the distance between this focal point and each of the observational locations \( \left( {{u_i},{v_i}} \right) \), \( i = 1,2, \cdots, n \). A mixed GWR model (Brunsdon et al. 1999; Fotheringham et al. 2002) takes some of the coefficients \( {\beta_j}\left( {u,v} \right) \) (\( j = 1,2, \cdots, p \)) to be constant and, after properly adjusting the order of the explanatory variables, is of the form

$$ {y_i} = \sum\limits_{j = 1}^q {{\beta_j}{x_{ij}}} + \sum\limits_{j = q + 1}^p {{\beta_j}\left( {{u_i},{v_i}} \right){x_{ij}} + {\varepsilon_i},\quad i = 1,2, \cdots, n}. $$
(5.75)

By first smoothing the spatially varying coefficients \( {\beta_j}\left( {u,v} \right) \) (\( j = q + 1, \cdots, p \)) with the GWR technique and then estimating the constant coefficients \( {\beta_j}\left( {j = 1, \cdots, q} \right) \) with the ordinary least-squares method, a two-step calibration procedure has been proposed by Fotheringham et al. (2002).
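A hedged sketch of this two-step idea, under our reading of the procedure, is given below; it reuses the illustrative `gwr_hat_matrix` helper from Sect. 5.3, and the partition of the regressors into `Xg` (constant coefficients) and `Xv` (spatially varying coefficients), like all other names, is an assumption of the sketch rather than code from Fotheringham et al. (2002):

```python
import numpy as np

def mixed_gwr(y, Xg, Xv, W_list):
    """Two-step calibration sketch for the mixed GWR model (5.75)."""
    n = len(y)
    I = np.eye(n)
    # Step 1: hat matrix of the pure GWR smoother on the varying part
    S = gwr_hat_matrix(Xv, W_list)
    # Step 2: OLS of the GWR-residualized response on the residualized Xg
    Xg_t, y_t = (I - S) @ Xg, (I - S) @ y
    b_const = np.linalg.solve(Xg_t.T @ Xg_t, Xg_t.T @ y_t)
    # Step 3: local weighted fits of the remainder give the varying part
    y_rem = y - Xg @ b_const
    betas = np.array([np.linalg.solve(Xv.T @ W @ Xv, Xv.T @ W @ y_rem)
                      for W in W_list])
    return b_const, betas              # constant and varying coefficients
```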

As an extension of the mixed GWR model, it is of interest and practical use to consider another kind of regression model that combines a geographical expansion model with a spatially varying coefficient model. That is, some regression coefficients in a spatially varying coefficient model are assumed to be certain global parametric functions of spatial coordinates. Leung et al. (2008b) coin this model the semi-parametric spatially varying coefficient model for the reason that some regression coefficients are parametric functions of spatial coordinates and the others are nonparametric.

Motivated by the geographical expansion method (Casetti 1982, 1997; Jones and Casetti 1992), we can assume that some coefficients in the spatially varying coefficient model in (5.74) are certain parametric functions of spatial coordinates, say \( {\beta_j}\left( {u,v;{\theta_{j1}}, \cdots, {\theta_{j{l_j}}}} \right) \) \( \left( \,{j = 1, \cdots, q} \right) \), and the semi-parametric spatially varying coefficient model can be defined as

$$ {y_i} = \sum\limits_{j = 1}^q {{\beta_j}\left( {{u_i},{v_i};{\theta_{j1}}, \cdots, {\theta_{j{l_j}}}} \right){x_{ij}}} + \sum\limits_{j = q + 1}^p {{\beta_j}\left( {{u_i},{v_i}} \right){x_{ij}} + {\varepsilon_i},\quad i = 1,2, \cdots, n}. $$
(5.76)

For simplicity in estimation and sufficiency in application, each of the parametric coefficients \( {\beta_j}\left( {{u_i},{v_i};{\theta_{j1}}, \cdots, {\theta_{j{l_j}}}} \right) \) \( \left( {j = 1, \cdots, q} \right) \) is taken to be a linear combination of some known functions of spatial coordinates \( \left( {u,v} \right) \), that is,

$$ {\beta_j}\left( {u,v;{\theta_{j1}}, \cdots, {\theta_{j{l_j}}}} \right) = \sum\limits_{k = 1}^{l_j} {{\theta_{jk}}{g_{jk}}\left( {u,v} \right)}. $$
(5.77)

Here, for each \( j = 1,2, \cdots, q \), \( {g_{j1}}\left( {u,v} \right),{g_{j2}}\left( {u,v} \right), \cdots, {g_{j{l_j}}}\left( {u,v} \right) \) are known linearly independent functions.
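Because (5.77) makes the parametric part linear in the unknown \( {\theta_{jk}} \) once each regressor is multiplied by its basis functions, the expanded design columns can be built directly. The sketch below is purely illustrative (the helper name, the `basis_fns` structure and the example expansion are our own; the actual estimation follows the two-step procedure of Leung et al. 2008b):

```python
import numpy as np

def expanded_design(Xq, coords, basis_fns):
    """Design columns for the parametric part of (5.76)-(5.77): regressor
    x_j contributes one column g_jk(u, v) * x_j per basis function g_jk.
    basis_fns[j] is the list of known functions g_j1, ..., g_jl_j."""
    u, v = coords[:, 0], coords[:, 1]
    cols = [g(u, v) * Xq[:, j]
            for j, fns in enumerate(basis_fns)
            for g in fns]
    return np.column_stack(cols)       # theta_jk is then estimable linearly

# e.g. a first-order expansion for regressor j:
# basis_fns[j] = [lambda u, v: np.ones_like(u),
#                 lambda u, v: u,
#                 lambda u, v: v]
```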

The semi-parametric spatially varying coefficient model so constructed includes several commonly used spatial regression models as special cases. The following are typical cases:

1. When \( q = 0 \), the model in (5.76) is the spatially varying coefficient model that the GWR technique calibrates.

2. When \( q = p \), the model in (5.76) becomes a kind of geographical expansion model. In particular, when all of the \( {\beta_j}\left( {u,v;{\theta_{j1}}, \cdots, {\theta_{j{l_j}}}} \right) \) \( \left( {j = 1, \cdots, p} \right) \) are polynomial functions of the spatial coordinates \( u \) and \( v \), the resulting models become the most commonly used expansion models in geographical research.

3. Let \( {l_1} = {l_2} = \cdots = {l_q} = 1 \) and \( {g_{j1}}\left( {u,v} \right) \equiv 1 \) for each \( j = 1,2, \cdots, q \). Then the semi-parametric spatially varying coefficient model becomes the mixed GWR model. Furthermore, if \( q = p \), the model degenerates into an ordinary linear regression model. Based on the local linear fitting procedure in Wang et al. (2008) and the OLS method, Leung et al. (2008b) derive a two-step estimation procedure for the model, with its effectiveness supported by some simulation studies.

5.6 Discovery of Spatial Non-Stationarity Based on the Regression-Class Mixture Decomposition Method

5.6.1 On Mixture Modeling of Spatial Non-Stationarity in a Noisy Environment

In the study of spatial relationships, we generally assume that a single regression model can be applied to a large or complicated spatial data set manifesting certain spatial structures or patterns. Though parameter-varying regression in general and GWR in particular intend to study spatial non-stationarity, they still assume a single model for the whole data set. Local variations are captured by the varying parameters. Unfortunately, conventional regression analysis is usually not appropriate for the study of very large data sets, especially those with noise contamination, for the following reasons:

1. Regression analysis handles a data set as a whole. Even with the computer hardware available today, there are no effective means, such as processors and storage, for manipulating and analyzing a large amount of data.

2. More importantly, it might be unrealistic to assume that a single model can fit a large data set. It is highly likely that we need multiple models to fit a large data set. That is, spatial patterns hidden in a data set may take on different forms that cannot be accurately represented by a single model.

3. Classical regression analysis is based on stringent model assumptions. However, the real world, a large data set in particular, does not behave in accordance with these assumptions. In a noisy environment, it is very common that inliers (patterns) are out-numbered by outliers so that many robust methods fail.

To overcome the above difficulties, we may want to view a complicated data set as a mixture of many populations. If we view each spatial pattern described by a regression model as a population, then the data set is a mixture of a finite number of such populations. Spatial knowledge (patterns/relationships) discovery can then be treated as the identification of these models through mixture modeling.

Mixture modeling is the modeling of a statistical distribution by a mixture of distributions, known as components or classes. Finite mixture densities have served as important models for the analysis of complex phenomena in statistics (McLachlan and Basford 1988). This model deals with the unsupervised discovery of clusters within data (McLachlan 1992). In particular, mixtures of normal populations are most frequently studied and applied in practice. In estimating mixture parameters, the maximum likelihood (ML) method, the maximum likelihood estimator (MLE) in particular, has become the most extensively adopted approach (Redner and Walker 1984). Although the use of the expectation maximization (EM) algorithm greatly reduces the computational difficulty for the MLE of mixture models, the EM algorithm still has drawbacks. The slow convergence of the generated sequence of iterates in some applications is a typical example. Other methods such as the method of moments and the moment generating function (MGF) method generally involve the problem of simultaneously estimating all of the mixture parameters. This is clearly a very difficult estimation task in large data sets. Therefore, the development of an efficient method to unravel patterns in mixtures is important.

In addition to the efficiency of an estimation method, another important feature that needs to be addressed is robustness. To be useful in practice, a method needs to be very robust, especially for large data sets. It means that the performance of a method should not be significantly affected by small deviations from the assumed model and it should not deteriorate drastically due to noise and outliers. Discussions on and comparison with several popular clustering methods from the point of view of robustness are summarized in Dave and Krishnapuram (1997). Obviously, robustness in spatial knowledge discovery is also necessary. Some attempts have been made in recent years (Hsu and Knoblock 1995; John and Langley 1995) and the problem needs to be further studied.

To have an efficient and robust method for the mining of regression classes in large data sets, especially under contamination with noise, Leung et al. (2001a) introduce a new concept named “regression-class” which is defined by a regression model. The concept is different from the existing conceptualization of class (cluster) based on commonsense or a certain distance measure. As a generalization of classes, a regression class contains more useful information. Their model assumes that there is a finite number of this kind of regression classes in a large data set. Instead of considering the whole data set, sampling is used to identify the corresponding regression classes. A novel framework, formulated in a recursive paradigm, for mining multiple regression classes in a data set is constructed. Based on a highly robust model-fitting (MF) estimator and an effective Gaussian mixture decomposition algorithm (GMDD) in computer vision (Zhuang et al. 1992, 1996), the proposed method, coined regression-class mixture decomposition (RCMD), involves the parameters of only one regression class at a time in the mining process. Thus, it greatly reduces the difficulty of parametric estimation and achieves a high degree of robustness. The method is suitable for small, medium, and large data sets and has many promising applications in a variety of disciplines including computer vision, pattern recognition, and economics.

It is necessary to point out that identifying some regression classes is different from the conventional classification problem, which is concerned with modeling the conditional distribution of a response/dependent variable Y given a set of carriers/independent variables \( X \). It also differs from other models, such as piecewise regression and regression tree, in which different subsets of X follow different regression models. The RCMD method not only can solve the identity problem of regression classes, but may also be extended to other models such as piecewise regression. It can be employed to discover local variations taking different functional forms.

5.6.2 The Notion of a Regression Class

Intuitively, a regression class (“reg-class” for short) is equated with a regression model (Leung et al. 2001a). To state it formally, for a fixed integer \( i \), a reg-class \( {G_i} \) is defined by the following regression model with random carriers

$$ {G_i}:\;\;Y = {f_i}(\mathbf{X},{\beta_i}) + {e_i}, $$
(5.78)

where \( Y \in \mathbf{R} \) is the response variable; the explanatory variable that consists of carriers or regressors, \( \mathbf{X} \in {\mathbf{R}^p} \), is a random (column) vector with a probability density function (p.d.f.) \( p\left( \cdot \right) \); the error term \( {e_i} \) is a random variable with a p.d.f. \( \psi \left( {u;{\sigma_i}} \right) \) having a parameter \( {\sigma_i} \) and \( E\left( {{e_i}} \right) = 0 \); and \( \mathbf{X} \) and \( {e_i} \) are independent. Here, \( {f_i}\left( { \cdot, \cdot } \right):{\mathbf{R}^p} \times {\mathbf{R}^{{q_i}}} \to \mathbf{R} \) is a known regression function, and \( {\beta_i} \in {\mathbf{R}^{{q_i}}} \) is an unknown regression parameter (column) vector. Although the dimension \( {q_i} \) of \( {\beta_i} \) may be different for different \( {G_i} \), we usually take \( {q_i} = p \) for simplicity. Henceforth, we assume that \( {e_i} \) is distributed according to a normal distribution, i.e.,

$$ \psi (u;{\sigma_i}) = \frac{1}{{{\sigma_i}}}\phi (\frac{u}{{{\sigma_i}}}), $$
(5.79)

where \( \phi \) (⋅) is the standard normal p.d.f.

For convenience of discussion, let

$$ {r_i}(\mathbf{x},y;{\beta_i}) \equiv y - {f_i}(\mathbf{x},{\beta_i}). $$
(5.80)

Definition 5.1

A random vector \( \left( {\mathbf{X},Y} \right) \) belongs to a regression class \( {G_i} \) (denoted as \( \left( {\mathbf{X},Y} \right) \in {G_i} \)) if it is distributed according to the regression model \( {G_i} \).

Thus, under Definition 5.1, a random vector \( \left( {\mathbf{X},Y} \right) \in {G_i} \) implies that \( \left( {\mathbf{X},Y} \right) \) has a p.d.f.

$$ {p_i}(\mathbf{x},y;{\theta_i}) = p(\mathbf{x})\psi ({r_i}(\mathbf{x},y;{\beta_i});{\sigma_i}),\,\,{\theta_i} = {(\beta_i^T,{\sigma_i})^T}. $$
(5.81)

For practical purposes, the following definition associated with Definition 5.1 may be used:

Definition 5.2

A data point \( \left( {\mathbf{x},y} \right) \) belongs to a regression class \( {G_i} \) (denoted as \( \left( {\mathbf{x},y} \right) \in {G_i} \)) if it satisfies \( {p_i}(\mathbf{x},y;{\theta_i}) \ge {b_i} \), i.e.,

$$ {G_i} \equiv {G_i}({\theta_i}) \equiv \left\{ {(\mathbf{x},y):} \right.{p_i}(\mathbf{x},y;{\theta_i})\left. { \ge {b_i}} \right\}, $$
(5.82)

where the constant \( {b_i} > 0 \) is determined by \( P[{p_i}(\mathbf{X},Y;{\theta_i}) \ge {b_i},(\mathbf{X},Y) \in {G_i}]\; = a \), and \( a \) is a probability threshold specified a priori and close to one.

Assume that there are \( m \) reg-classes \( {G_1},{G_2}, \ldots, {G_m} \) in a data set under study and that \( m \) is known in advance (\( m \) can actually be determined at the end of the mining process when all plausible reg-classes have been identified). The objective of knowledge discovery in a spatial mixture distribution is to find all \( m \) reg-classes, to identify the parameter vectors, and to make prediction or interpretation by the models. To lower the computation cost, we randomly sample from the data set to search for the reg-classes. Assume that \( \left\{ {({\mathbf{x}_1},{y_1}),...,({\mathbf{x}_n},{y_n})} \right\} \) are the observed values of a random sample of size \( n \) taken from the data set. Thus, they can be considered as realized values of \( n \) independently and identically distributed (i.i.d.) random vectors with a common mixture distribution

$$ p(\mathbf{x},y;\theta ) = \sum\limits_{i = 1}^m {{\pi_i}} {p_i}(\mathbf{x},y;{\theta_i}), $$
(5.83)

i.e., they consist of random observations from \( m \) reg-classes with prior probabilities \( {\pi_1}, \ldots, {\pi_m} \) (\( {\pi_1} + \cdots + {\pi_m} = 1 \), \( {\pi_i} \ge 0 \), \( 1 \le i \le m \)), where \( {\theta^T} = (\theta_1^T, \ldots, \theta_m^T) \).

5.6.3 The Discovery of Regression Classes under Noise Contamination

In a noisy data set, regression classes are distributed amidst a large number of outliers. Thus, how to unravel reg-classes under noise contamination becomes a challenge in the discovery of relevant relationships in the overall data set. Leung et al. (2001a) scrutinize the problem under two situations.

The case in which \( {\pi_1},...,{\pi_m} \) are known

In this case, all unknown parameters consist of the aggregate vector \( \theta = {(\theta_1^T,...,\theta_m^T)^T} \). If the vector \( \theta_0^T = (\theta_1^{0T}, \ldots, \theta_m^{0T}) \) of true parameters is known a priori, and outliers are absent (\( {\varepsilon_i} \equiv 0 \), \( 1 \le i \le m \)), then the posterior probability that \( ({\mathbf{x}_j},{y_j}) \) belongs to \( {G_i} \) is given by

$$ {\tau_i}({\mathbf{x}_j},{y_j};\theta_i^0) = \frac{{{\pi_i}{p_i}({\mathbf{x}_j},{y_j};\theta_i^0)}}{{\sum\nolimits_{k = 1}^m {{\pi_k}{p_k}({\mathbf{x}_j},{y_j};\theta_k^0)} }},\,1 \le i \le m. $$
(5.84)

A partitioning of the sample \( Z = \{ ({\mathbf{x}_1},{y_1}),...,({\mathbf{x}_n},{y_n})\} \) into \( m \) reg-classes can be made by assigning each \( ({\mathbf{x}_j},{y_j}) \) to the population to which it has the highest estimated posterior probability of belonging; that is, \( ({\mathbf{x}_j},{y_j}) \) is assigned to \( {G_i} \) if

$$ {\tau_i}({\mathbf{x}_j},{y_j};\theta_i^0) > {\tau_k}({\mathbf{x}_j},{y_j};\theta_k^0),\quad 1 \le k \le m,\ k \ne i. $$
(5.85)

This is just the Bayesian decision rule:

$$ d = d(\mathbf{x},y;{\theta_0}) = { \arg }\mathop {{ \max }}\limits_{1 \le i \le m} [{\pi_i}{p_i}(\mathbf{x},y;\theta_i^0)],\,\mathbf{x} \in {\mathbf{R}^p},y \in \mathbf{R},1 \le d \le m, $$
(5.86)

which classifies the sample \( Z \) and “new” observations with minimal error probability. As \( {\theta_0} \) is unknown, the so-called “plug-in” decision rule is often used:

$$ d = d(\mathbf{x},y;{\hat \theta_0}) = { \arg }\mathop {{ \max }}\limits_{1 \le i \le m} [{\pi_i}{p_i}(\mathbf{x},y;\hat \theta_i^0)], $$
(5.87)

where \( {\hat \theta_0} \) is the MLE of \( {\theta_0} \) constructed by the sample Z from the mixture population,

$$ {\hat \theta_0} = \arg \mathop {\max }\limits_{\theta \in \Theta } l(\theta ), $$
(5.88)
$$ l(\theta ) = \ln \prod\limits_{j = 1}^n {{p_0}({\mathbf{x}_j},{y_j};\theta )} = \sum\limits_{j = 1}^n {\ln {p_0}({\mathbf{x}_j},{y_j};\theta )}, $$
(5.89)

where \( \Theta \) is the parameter space.
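Note that the carrier density \( p(\mathbf{x}) \) in (5.81) is common to every reg-class, so it cancels in the posterior probabilities (5.84), and the plug-in rule (5.87) reduces to comparing prior-weighted normal densities of the residuals. A minimal sketch (Python/SciPy; the packaging of each reg-class as a triple of regression function, parameter vector and scale is our own convention, not Leung et al.'s code):

```python
import numpy as np
from scipy.stats import norm

def regclass_plugin_rule(x, y, classes, priors):
    """Posterior probabilities tau_i of (5.84) and the plug-in decision
    rule (5.87); `classes` is a list of (f_i, beta_i, sigma_i) triples
    and `priors` the mixing proportions pi_i. The carrier density p(x)
    cancels in the ratio and is therefore omitted."""
    dens = np.array([pi * norm.pdf(y - f(x, beta), scale=sigma)
                     for pi, (f, beta, sigma) in zip(priors, classes)])
    tau = dens / dens.sum()
    return tau, int(np.argmax(tau))    # index of the most probable class
```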

Consider the case in which \( p_i\left( {\mathbf{x},y;{\theta_i}} \right) \) is contaminated, i.e., it belongs to the \( {\varepsilon_i} \)-contaminated neighborhood:

$$ {\rm B}({\varepsilon_i}) = \left\{ {p_i^\varepsilon (\mathbf{x},y;{\theta_i}):p_i^\varepsilon (\mathbf{x},y;{\theta_i}) = (1 - {\varepsilon_i}){p_i}(\mathbf{x},y;{\theta_i}) + {\varepsilon_i}{h_i}(\mathbf{x},y)} \right\}, $$
(5.90)

where \( {h_i}(\mathbf{x},y) \) is any p.d.f. of outliers in \( {G_i} \), and \( {\varepsilon_i} \) is the unknown fraction of outliers present in \( {G_i} \).

The effect of outliers on the MLE \( {\hat \theta_0} \) under ε-contaminated models can now be studied. Under this situation, Z is the random sample from the mixture p.d.f.:

$$ {p_\varepsilon }(\mathbf{x},y;{\theta_0}) = \sum\limits_{i = 1}^m {{\pi_i}p_i^\varepsilon (\mathbf{x},y;\theta_i^0)}. $$
((5.91))

Let \( \nabla_\theta^k \) be the operator of k-th order differentiation with respect to \( \theta \), \( \mathbf{0} \) be the matrix with all elements equal to zero, and \( \mathbf{1} \) be the matrix with all elements equal to one. Denote

$$\begin{array}{rl} {I_\varepsilon }(\theta; {\theta_0}) & = - {E_\varepsilon }[\ln {p_0}(X,Y;\theta )] \\ & = - \int\!\!\!\!\int_{{R^{p + 1}}} {\ln {p_0}(\mathbf{x},y;\theta ){p_\varepsilon }(\mathbf{x},y;{\theta_0})d\mathbf{x}dy}, \end{array}$$
((5.92))
$$ {B_i}(\theta ) = \int\!\!\!\!\int_{{R^{p + 1}}} {[{h_i}(\mathbf{x},y) - {p_i}(\mathbf{x},y;\theta_i^0)]\ln {p_0}(\mathbf{x},y;\theta )d\mathbf{x}dy}, $$
((5.93))
$$ {J_\varepsilon }({\theta_0}) = - \int\!\!\!\!\int_{{R^{p + 1}}} {{p_\varepsilon }(\mathbf{x},y;{\theta_0})\nabla_\theta^2}\ln {p_0}(\mathbf{x},y;\theta )\left| {_{\theta = {\theta_0}}} \right.d\mathbf{x}dy. $$
((5.94))

It can be observed that \( {I_0}({\theta_0};{\theta_0}) \) is the Shannon entropy of the hypothetical mixture \( p_0(\mathbf{x},y;{\theta_0}) \). Furthermore, \( {J_0}({\theta_0}) \) is the Fisher information matrix

$$ {J_0}({\theta_0}) = \int\!\!\!\!\int_{{R^{p + 1}}} {{p_0}(\mathbf{x},y;{\theta_0}){\nabla_\theta }\ln {p_0}(\mathbf{x},y;\theta ){{[{\nabla_\theta }\ln {p_0}(\mathbf{x},y;\theta )]}^T}\left| {_{\theta = {\theta_0}}} \right.d\mathbf{x}dy}, $$
((5.95))

and under regularity conditions

$$ {\nabla_\theta }{I_0}(\theta; {\theta_0})\left| {_{\theta = {\theta_0}}} \right. = \mathbf{0},\,\nabla_\theta^2{I_\varepsilon }(\theta; {\theta_0})\left| {_{\theta = {\theta_0}}} \right. = {J_\varepsilon }({\theta_0}). $$
((5.96))

Theorem 5.1

If the family of p.d.f.'s \( p(\mathbf{x},y;\theta ) \) satisfies the regularity conditions (Kendall 1987), the functions \( {I_0}(\theta; {\theta_0}) \) and \( {B_i}(\theta ) \) are thrice differentiable with respect to \( \theta \in \Theta \), and the point \( {\theta_\varepsilon } = \mathop {\arg \min }\limits_{\theta \in \Theta } {I_\varepsilon }(\theta; {\theta_0}) \) is unique, then the MLE \( \hat \theta \) under \( \varepsilon \)-contamination is almost surely (a.s.) convergent, i.e.,

$$ \hat \theta \mathop \to \limits^{a.s.} {\theta_\varepsilon }\quad \left( {n \to \infty } \right) $$
((5.97))

and \( {\theta_\varepsilon } \in \Theta \) satisfies the asymptotic expansion:

$$ {\theta_\varepsilon } = {\theta_0} + {[{J_\varepsilon }({\theta_0})]^{ - 1}}\sum\limits_{i = 1}^m {{\varepsilon_i}{\pi_i}} {\nabla_\theta }{{\rm B}_i}\left( {{\theta_0}} \right) + {\rm O}\left( {{{\left\| {{\theta_\varepsilon } - {\theta_0}} \right\|}^2}} \right){\mathbf{1}}. $$
((5.98))

(See Leung et al. (2001a) for the proof)

Remark 5.1

It can be observed from Theorem 5.1 that, in the presence of outliers in the sample, the estimator \( \hat \theta \) can become inconsistent. It should be noted that \( \left| {{\nabla_\theta }B_i(\theta )} \right| \) depends on the contaminating density \( h_i(\mathbf{x},y) \), \( 1 \le i \le m \), and may take arbitrarily large values.

From Theorem 5.1, we have the following result:

Corollary 5.1

In the setting of Theorem 5.1, \( \hat \theta \) has the influence function

$$ IF(\mathbf{x},y;\hat \theta ) = {[{J_0}({\theta_0})]^{ - 1}}{\nabla_\theta }\ln {p_0}(\mathbf{x},y;\theta )\left| {_{\theta = {\theta_0}}} \right.. $$

(See Leung et al. (2001a) for the proof)

Remark 5.2

The influence function (IF) is an important concept in robust statistics. It provides the richest quantitative information on robustness by describing the (approximate and standardized) effect on the estimator \( \hat \theta \) of an additional observation at any point \( \left( {\mathbf{x},y} \right) \). Roughly speaking, the IF measures the effect of infinitesimal perturbations on the estimator.
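The IF of Corollary 5.1 can be approximated empirically by a sensitivity curve: add a single observation at \( \left( {\mathbf{x},y} \right) \) and record the change in the estimate, scaled by the sample size. A small sketch for the slope of one Gaussian reg-class (data simulated purely for illustration) shows the unbounded influence of a gross outlier on the MLE:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 50)
y = x + rng.normal(0.0, 0.1, 50)         # one reg-class: y = x, sigma = 0.1

def slope(xs, ys):
    """MLE of the slope in a no-intercept Gaussian regression."""
    return xs @ ys / (xs @ xs)

base = slope(x, y)

def sensitivity(x0, y0):
    """n times the change in the estimate when (x0, y0) is added:
    the finite-sample analogue of the influence function."""
    xs, ys = np.append(x, x0), np.append(y, y0)
    return len(xs) * (slope(xs, ys) - base)

print(sensitivity(1.0, 1.0))             # on the line: negligible effect
print(sensitivity(1.0, 5.0))             # gross outlier: large effect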

The case in which \( {\pi_1},...,{\pi_m} \) are unknown

Here we adopt the method in McLachlan and Basford (1988). Let \( \pi = {({\pi_1},...,{\pi_m})^T} \), \( \varphi = {({\pi^T},{\theta^T})^T} \), and

$$ l(\varphi ) = \ln \prod\limits_{j = 1}^n {{p_\varepsilon }} ({\mathbf{x}_j},{y_j};\theta ) = \sum\limits_{j = 1}^n {\ln } [\sum\limits_{i = 1}^m {{\pi_i}p_i^\varepsilon ({\mathbf{x}_j},{y_j};{\theta_i}} )], $$
((5.99))
$$ {\tau_i}({\mathbf{x}_j},{y_j};\varphi ) = \frac{{{\pi_i}p_i^\varepsilon ({\mathbf{x}_j},{y_j};{\theta_i})}}{{\sum\nolimits_{k = 1}^m {{\pi_k}p_k^\varepsilon ({\mathbf{x}_j},{y_j};{\theta_k})} }},\,\,1 \le i \le m. $$
((5.100))

It should be noted that \( {\pi_m} = 1 - \sum\nolimits_{i = 1}^{m - 1} {{\pi_i}} \). Therefore, for \( 1 \le k \le m - 1 \), the MLE of \( {\pi_k} \), \( {\hat \pi_k} \), satisfies

$$ {\nabla_{{\pi_k}}}l(\varphi ) = \sum\limits_{j = 1}^n {\left[ {\frac{{{\tau_k}({\mathbf{x}_j},{y_j};\varphi )}}{{{\pi_k}}} - \frac{{{\tau_m}({\mathbf{x}_j},{y_j};\varphi )}}{{{\pi_m}}}} \right]} = 0. $$
((5.101))

By simple computation, the likelihood equation for \( \varphi \), \( {\nabla_\varphi }l(\varphi ) = \mathbf{0} \), can thus be rewritten as

$$ {\nabla_{{\theta_k}}}l(\varphi )\left| {_{{\theta_k} = {{\hat \theta }_k}}} \right. = \sum\limits_{j = 1}^n {{\tau_k}({\mathbf{x}_j},{y_j};\hat \varphi ){\nabla_\theta }\ln p_k^\varepsilon } ({\mathbf{x}_j},{y_j};{\theta_k})\left| {_{{\theta_k} = {{\hat \theta }_k}}} \right. = \mathbf{0}, $$
((5.102))
$$ {\hat \pi_k} = \sum\limits_{j = 1}^n {{\tau_k}({\mathbf{x}_j},{y_j};\hat \varphi )} /n, \,\,1 \le k \le m. $$
((5.103))

There is a difficulty with the mixtures in that if \( p_i(\mathbf{x},y;{\theta_i}) \) and \( p_j(\mathbf{x},y;{\theta_j}) \) belong to the same parametric family, then \( p(\mathbf{x},y;\varphi ) \) will have the same value when the cluster labels \( i \) and \( j \) are interchanged in \( \varphi \). That is, although this class of mixtures may be identifiable, \( \varphi \) is not. However, this lack of identifiability of \( \varphi \) due to the interchanging of cluster labels is of no concern in practice, as it can easily be overcome by the imposition of an appropriate constraint on \( \varphi \) (McLachlan and Basford 1988).
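The likelihood equations (5.102) and (5.103) are precisely the fixed-point conditions iterated by the EM algorithm for mixtures. A sketch for linear-Gaussian reg-classes, assuming for simplicity that the contamination vanishes (\( {\varepsilon_i} = 0 \)), so that the M-step reduces to weighted least squares:

import numpy as np
from scipy.stats import norm

def em_regclasses(x, y, betas, sigmas, pi, n_iter=100):
    """EM iteration of (5.100)-(5.103) for linear-Gaussian reg-classes."""
    X = np.column_stack([np.ones_like(x), x])       # intercept and slope
    betas = [np.asarray(b, dtype=float) for b in betas]
    sigmas = list(sigmas)
    pi = np.asarray(pi, dtype=float)
    for _ in range(n_iter):
        # E-step: posterior probabilities tau_i of (5.100)
        dens = np.stack([norm.pdf(y - X @ b, scale=s)
                         for b, s in zip(betas, sigmas)])
        w = pi[:, None] * dens
        tau = w / np.maximum(w.sum(axis=0), 1e-300)
        # M-step: weighted least squares solves (5.102) for each class
        for i in range(len(pi)):
            W = tau[i]
            b = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))
            betas[i] = b
            sigmas[i] = np.sqrt((W * (y - X @ b) ** 2).sum() / W.sum())
        pi = tau.mean(axis=1)                       # update (5.103)
    return betas, sigmas, pi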

However, it may be very difficult to obtain \( {\hat \theta_0} \) because too many parameters are involved. As a matter of fact, the ML method for directly estimating the parameters of mixture densities has many practical implementation difficulties (Zhuang et al. 1996). For example: (1) when there are a large number of clusters in the mixture, the total number of parameters to be estimated can be very large in proportion to the available data samples; and (2) there may be singularities in the log-likelihood function, since the likelihood need not be bounded from above (Vapnik 1995).

One of the main aims of robust statistics is to develop methods that can resist the effect of outliers in data sets. However, almost all robust methods tolerate less than 50% outliers. When there are multiple reg-classes in a data set, such methods cannot identify the classes, because the proportion of outliers with respect to any single class commonly exceeds 50%. Recently, several more robust methods have been developed for computer vision. For example, MINPRAN (Stewart 1995) is perhaps the first technique that reliably tolerates more than 50% outliers without assuming a known inlier bound. The method assumes that the outliers are randomly distributed within the dynamic range of the sensor and that the noise (outlier) distribution is known. Adjustments of MINPRAN to suit other, non-uniform outlier distributions have also been proposed. However, these assumptions restrict the generality of MINPRAN in practice.

Another highly robust estimator is the MF estimator (Zhuang et al. 1992), which was developed for a simple regression problem without carriers. It does not need assumptions such as those of MINPRAN; indeed, no requirement is imposed on the distribution of outliers, so it is more readily applicable to complex data sets. Extending the ideas of the MF estimator and GMDD, Leung et al. (2001a) derive the RCMD estimator to unravel regression classes.

5.6.4 The Regression-Class Mixture Decomposition (RCMD) Method for Knowledge Discovery in Mixed Distributions

Since a mixture density is a composition of simply structured densities or data structures, then with respect to any particular density or structure, all others can readily be classified as part of the outlier category, in the sense that their observations obey different statistics. Thus, a mixture density can be viewed as a contaminated density with respect to each cluster in the mixture. When all of the observations for a single density are grouped together, the remaining observations (other clusters and true outliers) can be considered to form an unknown outlier density. Following this idea, the mixture p.d.f. in (5.91) can be rewritten with respect to G i as

$$\begin{array}{rl} {p_\varepsilon }(\mathbf{x},y;\theta ) & = {\pi_i}(1 - {\varepsilon_i}){p_i}(\mathbf{x},y;{\theta_i}) + {\pi_i}{\varepsilon_i}{h_i}(\mathbf{x},y) + \sum\limits_{j \ne i}^m {{\pi_j}p_j^\varepsilon (\mathbf{x},y;{\theta_j})} \\ & \equiv {\pi_i}(1 - {\varepsilon_i}){p_i}(\mathbf{x},y;{\theta_i}) + [1 - {\pi_i}(1 - {\varepsilon_i})]{g_i}(\mathbf{x},y). \end{array}$$
((5.104))

Ideally, a sample point \( ({\mathbf{x}_k},{y_k}) \) from the above mixture p.d.f. is classified as an inlier if it is realized from \( {p_i}(\mathbf{x},y;{\theta_i}) \) or as an outlier coming from the p.d.f. \( {g_i}(\mathbf{x},y) \) otherwise.

The given data set \( Z = \{ ({\mathbf{x}_1},{y_1}),...,({\mathbf{x}_n},{y_n})\} \) is now regarded as generated by the mixture p.d.f. \( {p_\varepsilon }(\mathbf{x},y;\theta ) \), i.e., each point comes from \( {p_i}(\mathbf{x},y;{\theta_i}) \) with probability \( {\pi_i}(1 - {\varepsilon_i}) \) and from the unknown outlier density \( {g_i}(\mathbf{x},y) \) with probability \( 1 - {\pi_i}(1 - {\varepsilon_i}) \).

Let D i be the subset of all inliers with respect to G i and \( {\bar D_i} \) be its complement. From the Bayesian classification rule, we have

$$ {D_i} = \left\{ {({\mathbf{x}_j},{y_j})} \right.:{p_i}({\mathbf{x}_j},{y_j};{\theta_i}) > \frac{{1 - {\pi_i} + {\pi_i}{\varepsilon_i}}}{{{\pi_i}(1 - {\varepsilon_i})}}\left. {{g_i}({\mathbf{x}_j},{y_j})} \right\}, {\bar D_i} = Z - {D_i}. $$
((5.105))

Define

$$ d_i^0 = \min \left\{ {{p_i}({\mathbf{x}_j},{y_j};{\theta_i}):({\mathbf{x}_j},{y_j}) \in {D_i}} \right\}, $$
$$ d_i^1 = \max \left\{ {{p_i}({\mathbf{x}_j},{y_j};{\theta_i}):({\mathbf{x}_j},{y_j}) \in {{\bar D}_i}} \right\}. $$

Ideally, the likelihood of any inlier being generated by \( {p_i}(\mathbf{x},y;{\theta_i}) \) is greater than the likelihood of any outlier being generated by \( {g_i}(\mathbf{x},y) \). Thus, we may assume that \( d_i^0 > d_i^1 \). The Bayesian classification therefore becomes

$$ {D_i} = \left\{ {({\mathbf{x}_j},{y_j})} \right.:{p_i}({\mathbf{x}_j},{y_j};{\theta_i}) > \frac{{1 - {\pi_i} + {\pi_i}{\varepsilon_i}}}{{{\pi_i}(1 - {\varepsilon_i})}}\left. {{\delta_i}} \right\}, $$
((5.106))

where we can choose \( {\delta_i} \in [{\pi_i}(1 - {\varepsilon_i})d_i^1/(1 - {\pi_i} + {\pi_i}{\varepsilon_i}),\ {\pi_i}(1 - {\varepsilon_i})d_i^0/(1 - {\pi_i} + {\pi_i}{\varepsilon_i})] \). So, if we assume that \( {g_i}({\mathbf{x}_1},{y_1}) =... = {g_i}({\mathbf{x}_n},{y_n}) = {\delta_i} \), we obtain equivalent results. Under this assumption, (5.104) becomes

$$ {p_\varepsilon }(\mathbf{x},y;\theta ) = {\pi_i}(1 - {\varepsilon_i}){p_i}(\mathbf{x},y;{\theta_i}) + (1 - {\pi_i} + {\pi_i}{\varepsilon_i}){\delta_i}. $$
((5.107))

The log-likelihood function of observing Z corresponding to (5.89) under ε-contamination becomes

$$ l({\theta_i}) = n\ln [{\pi_i}(1 - {\varepsilon_i})] + \sum\limits_{j = 1}^n {\ln [{p_i}({\mathbf{x}_j},{y_j};{\theta_i}) + \frac{{1 - {\pi_i} + {\pi_i}{\varepsilon_i}}}{{{\pi_i}(1 - {\varepsilon_i})}}{\delta_i}]}. $$
((5.108))

Thus, in order to estimate \( {\theta_i} \) from Z, we need to maximize \( l({\theta_i}) \) for each \( {\delta_i} \) subject to \( {\sigma_i} > 0 \). The maximization of \( l({\theta_i}) \) at \( {\delta_i} \) with respect to \( {\theta_i} \) is equivalent to maximizing the G i model-fitting function

$$ {l_i}({\theta_i};{t_i}) \equiv \sum\limits_{j = 1}^n {\ln [{p_i}({\mathbf{x}_j},{y_j};{\theta_i}) + {t_i}]} $$
((5.109))

at t i with respect to \( {\theta_i} \), provided that \( {t_i} = (1 - {\pi_i} + {\pi_i}{\varepsilon_i}){\delta_i}/[{\pi_i}(1 - {\varepsilon_i})] \). Similar to Zhuang et al. (1996), we shall henceforth refer to each “t i ” (≥ 0) as a partial model. Since each t i corresponds to a value \( {\delta_i} \) of the unknown outlier distribution \( {g_i}(\mathbf{x},y) \), we use only partial information about the model, without knowledge of the whole shape of \( {g_i}(\mathbf{x},y) \).

Leung et al. (2001a) introduce a new concept as follows:

Definition 5.3

For a reg-class G i and the data set \( Z = \{ ({\mathbf{x}_1},{y_1}),...,({\mathbf{x}_n},{y_n})\} \), the \( t \)-level set of G i is defined as

$$ {G_i}({\theta_i};t) = \left\{ {({\mathbf{x}_j},{y_j})} \right.:{p_i}({\mathbf{x}_j},{y_j};{\theta_i}) > \left. t \right\}, $$
((5.110))

the t-level support set of an estimator \( {\hat \theta_i} \) for \( {\theta_i} \) is defined as \( {G_i}({\hat \theta_i};t) \).

According to this concept, \( {G_i}({\theta_i};t) \) is the subset of all inliers with respect to G i at a partial model t. Maximizing (5.109) may be approximately interpreted as maximizing the “likelihood” over the t-level set of G i . It should be noted that the size of \( {G_i}({\theta_i};t) \) decreases as the partial model level t increases. Moreover, the t-level support set of an estimator \( {\hat \theta_i} \) reflects the extent to which the data set supports this estimator at partial model level t.

Definition 5.4

The RCMD estimator of the parametric vector \( {\theta_i} \) for a reg-class G i is defined by

$$ \hat \theta_i^t = { \arg }\mathop {{ \max }}\limits_{{\theta_i}} {l_i}({\theta_i};{t_i}),\,{\theta_i} = {(\beta_i^T,{\sigma_i})^T},\,{\sigma_i} > 0. $$

When m = 1 and the random carriers disappear in (5.78), the RCMD estimator becomes a univariate MF estimator. In particular, when X is distributed uniformly (i.e., p(x) ≡ constant in some domain) and \( {e_i}\sim N(0,\sigma_i^2) \), the maximization of \( {l_i}({\theta_i};{t_i}) \) is equivalent to maximizing

$$ {\bar l_i}({\theta_i};{\bar t_i}) \equiv \sum\limits_{j = 1}^n {\ln \left[ {\psi [{y_j} - {f_i}({\mathbf{x}_j},{\beta_i});{\sigma_i}] + {{\bar t}_i}} \right]}, $$
((5.111))

where \( {\bar t_i} = {t_i}/c \) and \( c \) is the constant value of \( p(\mathbf{x}) \). For simplicity, we still denote \( {\bar t_i} \) and \( {\bar l_i} \) by t i and \( {l_i} \), respectively. That is, the above expression is rewritten as

$$ {l_i}({\theta_i};{t_i}) \equiv \sum\limits_{j = 1}^n {\ln \left[ {\psi [{y_j} - {f_i}({\mathbf{x}_j},{\beta_i});{\sigma_i}] + {t_i}} \right]}. $$
((5.112))

In this case, the corresponding expressions in (5.110) and (5.82) become, respectively,

$$ {G_i}({\theta_i};{t_i}) = \{ ({\mathbf{x}_j},{y_j}):\psi [{r_i}({\mathbf{x}_j},{y_j};{\beta_i});{\sigma_i}] > {t_i}\}, $$
((5.113))
$$ {G_i}({\theta_i}) = \{ (\mathbf{x},y):|{r_i}(\mathbf{x},y;{\beta_i})| \le 3{\sigma_i}\}, $$
((5.114))

which is based on the 3 σ-criterion of the normal distribution (i.e., \( a \) in (5.82) is 0.9972). Leung et al. (2001a) show the convergence of \( \hat \theta_i^t \).
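In code, the Gaussian model-fitting function (5.112) and the sets (5.113) and (5.114) are a few lines each. A sketch for a linear (or linearizable polynomial) reg-class, where theta packs the coefficients \( {\beta_i} \) followed by \( {\sigma_i} \):

import numpy as np
from scipy.stats import norm

def residuals(theta, x, y):
    """Residuals r_i for a polynomial regression function f_i; theta
    holds the coefficients (constant term first) followed by sigma."""
    beta, sigma = np.asarray(theta[:-1]), theta[-1]
    return y - np.polyval(beta[::-1], x), sigma

def l_i(theta, t, x, y):
    """Model-fitting function (5.112)."""
    r, sigma = residuals(theta, x, y)
    return np.log(norm.pdf(r, scale=sigma) + t).sum()

def t_level_set(theta, t, x, y):
    """t-level set (5.113): points whose density under the class exceeds t."""
    r, sigma = residuals(theta, x, y)
    return norm.pdf(r, scale=sigma) > t

def inlier_set(theta, x, y):
    """Inlier set (5.114), based on the 3-sigma criterion."""
    r, sigma = residuals(theta, x, y)
    return np.abs(r) <= 3.0 * sigma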

The RCMD method can be summarized as follows:

At each selected partial model \( t_i^{(s)} \), s = 0,1,…,S, \( {l_i}({\theta_i};t_i^{(s)}) \) is maximized with respect to \( {\beta_i} \) and \( {\sigma_i} \), either by an iterative algorithm starting from a randomly chosen initial \( \beta_i^{(0)} \) or by a genetic algorithm (GA). Having solved \( {\max_{{\beta_i},{\sigma_i}}}{l_i}\left( {{\theta_i};t_i^{(s)}} \right) \) for \( {\hat \beta_i}(t_i^{(s)}) \) and \( {\hat \sigma_i}(t_i^{(s)}) \), the candidate reg-class \( {G_i}({\hat \theta_i}(t_i^{(s)})) \) is computed and subjected to a test of normality. If the test statistic is not significant (usually at level α = 0.01), the hypothesis that the respective distribution is normal is accepted and a valid reg-class, \( {G_i}({\hat \theta_i}(t_i^{(s)})) \), has been found; otherwise, we proceed to the next partial model, provided the upper bound \( t_i^{(S)} \) has not been reached. The identity of each \( {G_i}({\hat \theta_i}(t_i^{(s)})) \) is thus based on its t-level set.

Throughout, a valid reg-class is subtracted from the current data set after it has been detected, and the next reg-class is identified in the new, size-reduced data set by the recursive process. Individual reg-classes continue to be extracted recursively until there are no more valid reg-classes, or until the size of the remaining data set becomes too small for estimation. Thus, the RCMD method can handle an arbitrary number of reg-class models with single reg-class extraction: the parameters of each reg-class are estimated progressively, and the data points are partitioned into inliers and outliers with respect to that reg-class, as in the sketch below. The RCMD procedure is depicted in Fig. 5.10, and the iterative and GA-based algorithms are detailed in Leung et al. (2001a).
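A compact sketch of this recursion is given below. It replaces the iterative and GA maximizations of Leung et al. (2001a) with a multi-start general-purpose optimizer and uses the Shapiro–Wilk statistic as the normality test; the grid of partial models, significance level, and stopping sizes are illustrative choices, not those of the original algorithm:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, shapiro

def rcmd(x, y, t_grid=(0.01, 0.05, 0.1, 0.5, 1.0),
         alpha=0.01, min_size=10, n_starts=20, seed=0):
    """Recursive single reg-class extraction for linear reg-classes."""
    rng = np.random.default_rng(seed)
    classes = []
    x, y = np.asarray(x, float).copy(), np.asarray(y, float).copy()
    while len(x) >= min_size:
        found = False
        for t in t_grid:                       # walk the partial models
            def neg_l(th):                     # negative of (5.112)
                b0, b1, s = th
                if s <= 1e-6:                  # keep sigma positive
                    return np.inf
                return -np.log(norm.pdf(y - b0 - b1 * x, scale=s) + t).sum()
            best = None
            for _ in range(n_starts):          # multi-start maximization
                th0 = np.append(rng.normal(0.0, 1.0, 2),
                                rng.uniform(0.05, 1.0))
                res = minimize(neg_l, th0, method="Nelder-Mead")
                if best is None or res.fun < best.fun:
                    best = res
            b0, b1, s = best.x
            r = y - b0 - b1 * x
            inl = np.abs(r) <= 3.0 * s         # candidate reg-class (5.114)
            if inl.sum() >= min_size and shapiro(r[inl]).pvalue > alpha:
                classes.append((b0, b1, s))    # valid reg-class found:
                x, y = x[~inl], y[~inl]        # subtract it and recurse
                found = True
                break
        if not found:                          # no valid reg-class at any t
            break
    return classes, (x, y)                     # reg-classes and leftovers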

Fig. 5.10

Flowchart of the RCMD method

5.6.5 Numerical Results and Observations

The effectiveness of the RCMD method for data mining is demonstrated by some numerical simulations here.

Example 5.1

Assume that there are nine points in a data set, where five points fit the regression model \( Y = {\beta_1}X + {e_1} \), \( {e_1} \) ∼ \( N(0,\sigma_1^2) \), \( {\beta_1} = 1,{\sigma_1} = 0.1 \), and the other four fit the regression model \( Y = {\beta_2}X + {e_2} \), \( {e_2} \) ∼ \( N(0,\sigma_2^2) \), \( {\beta_2} = 0 \), \( {\sigma_2} = 0.1 \) (Fig. 5.11a). To unravel the two regression classes, we select t 1  = 0.1; the objective function is the G 1 model-fitting function

Fig. 5.11

Results obtained by the RCMD method for two reg-classes and one reg-class. (a) Scatterplot for two reg-classes. (a’) Scatterplot for one reg-class. (b) Objective function plot. (b’) Objective function plot. (c) Contour plot of objective function. (c’) Contour plot of objective function

$$ {l_1}({\theta_1};{t_1}) = \sum\limits_{j = 1}^9 {\ln \left[ {\frac{1}{{\sqrt {2\pi } \sigma }}\exp ( - \frac{{{{(y{}_j - {x_j}\beta )}^2}}}{{2{\sigma^2}}}) + 0.1} \right]}, $$

which is depicted in Fig. 5.11b. It can be observed that this function has two obvious peaks, each corresponding to one of the reg-classes. Using the iterative algorithm or the genetic algorithm, the two reg-classes are easily discovered, as is clearly shown in the contour plot of this function (Fig. 5.11c). For example, using the GA procedure, we find \( {\hat \beta_1} = 1.002{,}\ {\hat \sigma_1} = 0.109 \), and \( {l_{\max }} = - 2.167 \). Using a more exact maximization method, we obtain \( {\hat \beta_1} = 1.00231{,}\ {\hat \sigma_1} = 0.109068 \), and \( {l_{\max }} = - 2.016715 \). The difference between the estimated values and the true parameters is in fact very small. On the other hand, if there is only one reg-class in the data set (see Fig. 5.11a'), the objective function is still very sensitive to this change and finds the only reg-class. As can be observed in the 3D and contour plots, there is only one peak, which represents the reg-class (Fig. 5.11b', c').
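The twin-peak behaviour is easy to reproduce. The nine coordinates below are hypothetical stand-ins (the original points are not listed), with five points scattered about \( y = x \) and four about \( y = 0 \); evaluating \( {l_1}({\theta_1};0.1) \) on a grid exposes the two ridges:

import numpy as np
from scipy.stats import norm

# five points scattered about y = x and four about y = 0 (both sigma ~ 0.1)
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 0.2, 0.4, 0.6, 0.8])
y = np.array([0.2, 0.2, 0.6, 0.6, 1.0, 0.1, -0.1, 0.1, -0.1])

# evaluate the G_1 model-fitting function on a (beta, sigma) grid
B, S = np.meshgrid(np.linspace(-0.5, 1.5, 201), np.linspace(0.02, 0.5, 100))
L = np.log(norm.pdf(y[None, None, :] - B[..., None] * x[None, None, :],
                    scale=S[..., None]) + 0.1).sum(axis=-1)

i, j = np.unravel_index(L.argmax(), L.shape)
print(B[i, j], S[i, j])   # the global peak lies near beta = 1, sigma = 0.1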

5.6.6 Comments About the RCMD Method

5.6.6.1 About the Partial Models

From the expression of \( {l_i}({\theta_i};{t_i}) \) in (5.109), it can be observed that maximizing \( {l_i}({\theta_i};{t_i}) \) is equivalent to minimizing

$$ \frac{1}{{2\sigma_i^2}}\sum\limits_{j = 1}^n {{{[{y_j} - {f_i}({\mathbf{x}_j},{\beta_i})]}^2}} + n\ln (\sqrt {2\pi } {\sigma_i}) - \sum\limits_{j = 1}^n {\ln p({\mathbf{x}_j}} ), $$
((5.115))

when \( {t_i} = 0 \). Obviously, the minimization of this expression with respect to \( {\theta_i} = {(\beta_i^T,{\sigma_i})^T} \) can be accomplished directly by minimizing with respect to \( {\beta_i} \) and then \( {\sigma_i} \), which results in the ordinary least squares (OLS) estimates of \( {\beta_i} \). These are not robust, and in the presence of outliers they give poor estimates.

However, when \( {t_i} > 0 \), the situation is quite different: the parameter estimation is fairly robust and the estimated result can be greatly improved. The introduction of a partial model “\( {t_i} > 0 \)” not only accounts for the presence of outliers, but does so in a simplified form that performs well in practice. This is a key advantage of the RCMD method.

Example 5.1 also demonstrates that the partial model t plays an important role in the mining of multiple reg-classes: if t is selected within a certain range, the maximization of the objective function \( l\left( {\theta, t} \right) \) is meaningful. From (5.110), there is a range of t for which the t-level set is nonempty, and within this range the reg-classes contained in the data set can be identified.

Figure 5.12 illustrates this for Example 5.1. Even when t is very small (\( {10^{ - 3}} \)), the RCMD method is still effective; it becomes invalid, however, when t equals zero. For the data in Example 5.1, the method remains valid as t ranges from a very small positive number up to approximately 5. Once t exceeds 5, the greater t becomes, the more difficult it is for the RCMD method to identify the reg-classes.

Fig. 5.12

Effect of partial model t on the mining of reg-classes. (a) t = 0.001. (b) t = 0.01. (c) t= 0.1. (d) t = 1. (e) t = 5. (f) t = 50

5.6.6.2 About Robustness

The RCMD estimator is asymptotically stable, though it may be biased (see Theorem 2 in Leung et al. (2001a)); in practice it can be improved by other methods. As shown in the numerical examples in Leung et al. (2001a), the RCMD method also has a very high degree of robustness: it can resist more than 50% outliers in a data set without assuming the type of distribution of the outliers. Besides, the method possesses the exact fit property of many robust regression methods, namely that if the majority of the data follow a linear relationship exactly, the method should recover exactly that equation. As an illustration, the five data points of reg-class 1 in Example 5.1 are replaced by five points lying exactly on the straight line \( y = x \) (see Fig. 5.13a). Applying the RCMD method without the intercept to this data set yields almost exactly the fit \( y = x \), with the scale estimate σ tending to zero (Fig. 5.13b). The RCMD method has thus successfully found the pattern fitting the majority of the data.

Fig. 5.13

Exact fit property of the RCMD method. (a) Scatterplot, with five points lying exactly on the line \( y = x \). (b) Objective function plot

5.6.6.3 About Overlapping of Reg-Classes

In case reg-classes overlap, Leung et al. (2001a) propose a further classification rule for data in the overlap of two reg-classes. Once the parameters of two reg-classes G i and G j have been identified by the RCMD method, we can adopt the following rule for the assignment of data points in \( {G_i} \cap {G_j} \): a data point \( ({\mathbf{x}_k},{y_k}) \in {G_i} \cap {G_j} \) is assigned to G i if

$$ {p_i}({\mathbf{x}_k},{y_k};{\hat \theta_i}) > {p_j}({\mathbf{x}_k},{y_k};{\hat \theta_j}). $$
((5.116))

Combining (5.114) and (5.116), we can reclassify the data set into reg-classes. That is, although the points in the overlapping region are removed from the data set when the first reg-class has been detected, to which reg-class these points eventually belong will be determined only after all reg-classes have been found. Thus, based on the rule in (5.116), the final result in the partitioning of reg-classes is almost independent of the extraction order.
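The reassignment rule (5.116) is a one-line argmax once the reg-class densities are available; a sketch for linear-Gaussian reg-classes, with one estimated (intercept, slope, sigma) triple per class:

import numpy as np
from scipy.stats import norm

def reassign_overlap(x, y, thetas):
    """Rule (5.116): assign each point to the reg-class under which its
    density p_i(x, y; theta_i) is largest; thetas holds one estimated
    (intercept, slope, sigma) triple per reg-class."""
    dens = np.stack([norm.pdf(y - b0 - b1 * x, scale=s)
                     for b0, b1, s in thetas])
    return dens.argmax(axis=0)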

For substantiation, the RCMD method has been successfully applied to the problem of switching regression models, mixtures of linear and non-linear structures, the detection of curves, and the mining of reg-classes in large data sets contaminated with noise (Leung et al. 2001a).

The extension of the RCMD method to the mining of irregular geometric features in spatial databases has been discussed in Sect. 5.2.

5.6.7 A Remote Sensing Application

To demonstrate the practicality of the RCMD algorithm, a real-life mining of line objects in remotely sensed data has also been performed (Leung et al. 2001a). In this application, runways are identified in a remotely sensed image from LANDSAT Thematic Mapper (TM) data acquired over a suburb of Hangzhou, China. The region contains the runways and parking apron of a civilian aerodrome. The image consists of a finite rectangular 95 × 60 lattice of pixels (see Fig. 5.14a). To identify the runways, Band 5 is used as the feature variable. A feature subset of the data, depicted in Fig. 5.14b, is first extracted by a simple technique that selects a pixel when its gray-level value is above a given threshold (e.g., 250). The RCMD method is then applied to the lattice coordinates of the points in this subset to identify the two runways, which can be viewed as two reg-classes. At the t = 0.05 level, the two line equations identified by the RCMD method are \( y = \;0.774x\; + \;34.874 \) and \( y = \;0.341x\; + \;22.717 \), respectively. The result accords almost completely with the data points in Fig. 5.14b. In other words, line-type objects such as runways and highways in remotely sensed images can be detected easily and accurately. Compared with existing techniques such as the window method, the RCMD method avoids the problem of selecting appropriate window sizes and yet obtains the same results.
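The pipeline itself can be sketched by reusing the rcmd() routine above: threshold the Band 5 gray levels to extract the bright pixels, then fit line-type reg-classes to their lattice coordinates. The file name, the threshold, and the choice of which lattice axis serves as the carrier are placeholders, not details of the original study:

import numpy as np

# band5 is a hypothetical (95, 60) array of Band 5 gray levels
band5 = np.load("band5.npy")                  # placeholder file name
rows, cols = np.nonzero(band5 > 250)          # bright-pixel feature subset

# fit line-type reg-classes to the lattice coordinates; t = 0.05 as in
# the application reported by Leung et al. (2001a)
classes, leftovers = rcmd(cols.astype(float), rows.astype(float),
                          t_grid=(0.05,))
for b0, b1, s in classes:
    print(f"line object: y = {b1:.3f} x + {b0:.3f}")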

Fig. 5.14

Identification of line objects in remotely sensed data

5.6.8 An Overall View about the RCMD Method

It appears that RCMD is a promising method for a large variety of applications. As an effective means for data mining, the RCMD method has the following advantages:

1. The number of reg-classes does not need to be specified a priori.

2. The proportion of noise in the mixture can be large. Neither the number of outliers nor their distributions is part of the input. The method is thus very robust.

3. The computation is quite fast and effective, and can be implemented by parallel computing.

4. Mining is not limited to the straight lines and planes imposed by some previous methods. The method can also extract many curves that can be linearized (such as polynomials), and it can deal with high-dimensional problems.

5. It estimates the regression and scale parameters simultaneously, as the MLE does, using all of the information provided by the samples. The effect of the scale parameters on the regression parameters is thus taken into account, which is more effective than estimating the regression and scale parameters separately.

Though the RCMD method appears to be rather successful in the mining of reg-classes, at least in simulation experiments, there are problems that should be further investigated. As discussed in the literature, the singularity of the likelihood function for a mixture is one such issue. Singularity means that the value of the likelihood function becomes infinite as the standard deviation of any one component approaches zero (Titterington et al. 1987). Since the RCMD method is based on the MLE, it is natural to wonder whether singularity will occur in the objective function in (5.109). In theory, the function \( {l_i}({\theta_i};{t_i}) \) is not immune to singularities, but in practice this case rarely occurs. It should be observed that singularities occur only at the edge of the parameter (search) space, and with good starting values they are less likely to happen. The study in Caudill and Acharya (1998) indicates that the incidence of singularity decreases as the sample size increases and as the angle of separation between two linear reg-classes increases. Obviously, we need to study this aspect further within the RCMD framework, though many researchers think that the issue of singularity in MLE may have been overblown.

The second issue that deserves further study is the problem of sample size in the RCMD method. In RCMD, we analyze a very large data set by examining a sample taken from it. If a small fraction of the reg-classes contains rare but important response variables, complications may arise. In this situation, retrospective sampling may need to be considered (O'Hara Hines 1997). In general, how to select a suitable sample size in RCMD is a problem that needs theoretical and experimental investigation.