Appendix 1: Bandwidth choice and measurement errors for distances
The choice of the bandwidth is important for several reasons. One reason is that it affects the bias of the kernel density estimator (Silverman 1986). Several methods for selecting an optimal bandwidth are discussed in the literature and are available in standard statistical software packages. These methods usually take the statistical properties of the sample dataset into account. A frequently used selection method, Silverman’s rule of thumb, suggests, for example, that the bandwidth should increase with the interquartile range and decrease with the size of the sample. Such a bandwidth, which would range between 50 and 90 km for our data, depending on the functions under study, would greatly oversmooth our kernel densities because our aggregation of establishment pairs across distance intervals of 500 m (see Sect. 3) reduces our sample size significantly.
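For illustration, Silverman's rule of thumb for a Gaussian kernel is commonly written as \(h = 0.9\,\min (\hat{\sigma },\, \mathrm{IQR}/1.34)\, n^{-1/5}\). The following is a minimal sketch of that formula, not the code used for this paper; the function name `silverman_bandwidth` and the toy data are assumptions made for the example.

```python
import statistics


def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(sample)
    sd = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartile cut points
    iqr = q3 - q1
    # Fall back on the standard deviation if the interquartile range is zero.
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-1 / 5)


# Hypothetical distances in km: ten copies of the same values have almost the
# same spread but a larger n, so the suggested bandwidth is smaller.
toy = [float(i) for i in range(100)]
h_small_sample = silverman_bandwidth(toy)
h_large_sample = silverman_bandwidth(toy * 10)
```

The sketch shows why a reduction of the effective sample size, as caused by aggregating establishment pairs, pushes the rule of thumb toward wider bandwidths.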
Another reason is that the bandwidth should account for the measurement errors in our distance data. Our approximation of interregional distances is subject to measurement errors from three sources. First, Euclidean distances do not take into account the curvature of the earth. This bias can be expected to be negligible for a small country like Germany where the maximum distance is below 1,000 km. Second, Euclidean distances do not take into account the density and quality of the available infrastructure. The actual traveling time per km may differ between high- and low-density areas. On the one hand, the denser road networks in high-density areas will offer more direct connections. On the other hand, congestion may reduce the speed. While Combes and Lafourcade (2005) show that Euclidean distances and economic distances are very highly correlated with each other (0.97), we have no reliable information on the magnitudes of the errors that result from approximating economic distances by Euclidean distances.
Third, not all establishments are located at their municipalities’ centroids. The corresponding error may be positive or negative, depending on where exactly the establishments are situated relative to the centroids. If the municipalities’ areas were perfectly circular, the magnitude of this error would range from zero to the sum of the radii of the two municipalities. The variance of this error thus tends to be higher for larger municipalities, ceteris paribus. To give an idea of the possible magnitudes of the errors in interregional distances, Fig. 2 depicts the distribution across municipalities of their lower and upper bounds, calculated under the assumption that all municipalities are circular. For any two municipalities \(r\) and \(s\), these errors are bounded between \(-(\tau _{r}+\tau _{s})\) and \((\tau _{r}+\tau _{s})\), where \(\tau \) denotes the radius of a municipality. For comparison, Fig. 2 also depicts a normal distribution whose tails roughly encompass the highest possible approximation errors. The standard deviation of this distribution is 5 km. A bandwidth of 5 km will thus be sufficient to account for the highest possible measurement errors that result from approximating firms’ locations within municipalities by the municipalities’ centroids. Adding to these 5 km another 15 km to account for the mismatch between economic and Euclidean distances, we choose a bandwidth of 20 km for all pairs of functions as a baseline. Columns (3)–(6) of Table 9 (“Appendix 3”) report the results of robustness checks for bandwidths of 10, 15, 25 and 30 km. They indicate that the classification of industries is fairly robust to the choice of the bandwidth.
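The bounds on the centroid approximation error can be illustrated with a short sketch. It assumes, purely for the example, that the radius \(\tau \) of a circular municipality is derived from its area as \(\tau = \sqrt{\text {area}/\pi }\); the function names and the toy areas are hypothetical.

```python
import math


def circular_radius_km(area_km2):
    """Radius of a circle with the given area (assumed circular municipality)."""
    return math.sqrt(area_km2 / math.pi)


def centroid_error_bounds_km(area_r_km2, area_s_km2):
    """Lower and upper bound, -(tau_r + tau_s) and +(tau_r + tau_s), in km."""
    total = circular_radius_km(area_r_km2) + circular_radius_km(area_s_km2)
    return -total, total


# Hypothetical municipalities with radii of 2 km and 3 km.
lower, upper = centroid_error_bounds_km(math.pi * 2**2, math.pi * 3**2)
```

For these two toy municipalities, the true distance between two establishments can deviate from the centroid distance by at most 5 km in either direction.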
Appendix 2: Counterfactual references
This appendix discusses the methods of constructing counterfactual references for the uni- and bivariate \(K\) densities as well as for their changes over time.
Significance of localization or colocalization
A question that arises naturally from inspecting the extent of localization or colocalization at a given point in time is whether this is the result of systematic, purposeful location decisions by firms, motivated by the wish to locate functions close to each other, or just the consequence of a series of independent location decisions that happened to generate some accidental spatial clustering of functions. We follow Duranton and Overman (2005) in using Monte Carlo methods to construct a counterfactual reference for each \(K\) density. We construct this reference under the null hypothesis that the location patterns of the functions under study are the results of random, industry-specific location decisions.
For the measure of localization of a single function in (1), this counterfactual reference indicates what the density distribution for the localization of this function might have looked like in 1992 (or 2007) if firms from the respective industry had had no incentives or disincentives for colocating establishments performing this function with each other. We construct this reference for function \(j\) in industry \(i\) by repeatedly resampling (without replacement) all establishments that perform function \(j\) (including their function-\(j\) workers) randomly among the population of all industrial sites occupied by establishments from industry \(i\) in West Germany in the same year, irrespective of whether or not function \(j\) was actually performed at this site. By resampling only among the sites occupied by the industry rather than among those occupied by any manufacturing industry, we focus on the motives for clustering establishments within this industry. We do not want to call a function localized just because its industry is more localized than manufacturing as a whole. By resampling without replacement, we make sure that each feasible industrial site is occupied by at most one establishment in the counterfactual distribution. And by resampling the existing establishments together with their actual number of function-\(j\) workers, we retain not only the number of establishments performing this function but also the size distribution of the function across establishments. We thereby exclude the effects of managerial decisions on optimal lot sizes from our analysis. We do not want to call a function localized just because its large minimal optimal lot size requires concentrating employment in only a few sites.
Repeating this random resampling 1,000 times, we obtain 1,000 counterfactual spatial distributions of the actual establishments for function \(j\) in industry \(i\), from which we estimate 1,000 counterfactual weighted univariate \(K\) densities in the same way as we estimate the actual \(K\) density (see Eq. 1). We use these 1,000 counterfactual \(K\) densities, in turn, to construct a two-sided 90 % confidence interval, which we take to cover, for each distance, \(d\), the range of densities consistent with no localization of the function in question. The robustness check reported in Column (2) of Table 9 indicates that 1,000 repetitions are enough. Increasing the number of repetitions to 2,000 does not change the classification of industries into domestically fragmenting, domestically localizing or other industries notably.
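The resampling procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for this paper: Eq. 1 is not reproduced in this appendix, so the sketch assumes a Gaussian kernel over pairwise Euclidean distances with each pair weighted by the product of the two establishments' worker counts, and it reads the 90 % interval as the pointwise 5th and 95th percentiles. All function and parameter names are hypothetical.

```python
import math
import random


def weighted_k_density(points, workers, grid, bandwidth):
    """Assumed estimator: weighted Gaussian kernel density of pair distances."""
    pairs = []
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            pairs.append((math.dist(points[a], points[b]),
                          workers[a] * workers[b]))
    norm = bandwidth * math.sqrt(2 * math.pi) * sum(w for _, w in pairs)
    return [
        sum(w * math.exp(-0.5 * ((g - d) / bandwidth) ** 2) for d, w in pairs) / norm
        for g in grid
    ]


def counterfactual_band(sites, workers, grid, bandwidth,
                        reps=1000, level=0.90, seed=0):
    """Pointwise band from `reps` random reallocations of the establishments."""
    rng = random.Random(seed)
    sims = []
    for _ in range(reps):
        # One distinct industry site per establishment (without replacement);
        # each establishment keeps its own number of function-j workers.
        drawn = rng.sample(sites, len(workers))
        sims.append(weighted_k_density(drawn, workers, grid, bandwidth))
    k = round((1 - level) / 2 * reps)  # number of draws cut off in each tail
    lower, upper = [], []
    for column in zip(*sims):  # all simulated densities at one grid distance
        ordered = sorted(column)
        lower.append(ordered[k])
        upper.append(ordered[reps - 1 - k])
    return lower, upper
```

In this sketch, `sites` would hold the coordinates of all sites occupied by the industry and `workers` the function-\(j\) employment of the establishments that actually perform the function. An actual \(K\) density lying above `upper` at a given distance would then fall outside the range consistent with no localization.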
For the measure of colocalization of two functions in (2), we construct a similar counterfactual reference that indicates what the density distribution for the colocalization of the two functions might have looked like in 1992 (or 2007) if firms had had no incentives or disincentives for colocating them, given the size distributions of the two functions across establishments. We randomly resample each of the two functions independently of each other 1,000 times in the same way as described above and estimate from these 2 \(\times \) 1,000 random distributions 1,000 counterfactual weighted bivariate \(K\) densities in the same way as we estimate the actual bivariate \(K\) density (see Eq. 2). From these, we construct a two-sided 90 % confidence interval, which we take to cover, for each distance, \(d\), the range of densities consistent with no colocalization of the two functions in question.
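A corresponding sketch for the bivariate case is given below. As before, it is only an illustration: Eq. 2 is not reproduced in this appendix, so the Gaussian kernel and the product weights are assumptions, and all names are hypothetical. The sketch also assumes that, because the two functions are drawn independently, both may be assigned to the same site.

```python
import math
import random


def bivariate_k_density(points_j, workers_j, points_k, workers_k, grid, bandwidth):
    """Assumed estimator: weighted kernel density of distances between j and k."""
    pairs = [
        (math.dist(p, q), a * b)
        for p, a in zip(points_j, workers_j)
        for q, b in zip(points_k, workers_k)
    ]
    norm = bandwidth * math.sqrt(2 * math.pi) * sum(w for _, w in pairs)
    return [
        sum(w * math.exp(-0.5 * ((g - d) / bandwidth) ** 2) for d, w in pairs) / norm
        for g in grid
    ]


def counterfactual_bivariate(sites, workers_j, workers_k, grid, bandwidth,
                             reps=1000, seed=0):
    """Return `reps` counterfactual bivariate densities, one list per draw."""
    rng = random.Random(seed)
    sims = []
    for _ in range(reps):
        drawn_j = rng.sample(sites, len(workers_j))  # function j, no replacement
        drawn_k = rng.sample(sites, len(workers_k))  # function k, independent of j
        sims.append(bivariate_k_density(drawn_j, workers_j,
                                        drawn_k, workers_k, grid, bandwidth))
    return sims
```

The pointwise 5th and 95th percentiles of the returned densities at each distance would then give the two-sided 90 % interval.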
Significance of changes in localization or colocalization over time
To assess the significance of the changes of localization or colocalization over time, we construct a counterfactual reference that indicates how the density distribution for the localization (or colocalization) might have changed between 1992 and 2007 if there had been no incentives or disincentives for colocating establishments performing this function (these functions) at either point in time. This counterfactual reference should account not only for the changes in the location patterns of the industry as a whole but also for the changes in the location patterns of each function that are due to changes in their size distributions or optimal lot sizes. We do not, for example, want to conclude that a function became more localized just because its industry as a whole became more localized, or because the optimal lot sizes for individual functions increased over time. We construct the confidence interval for the change of an estimated \(K\) density over time from the differences between corresponding counterfactual \(K\) densities for 2007 and 1992. These counterfactual \(K\) densities are constructed independently of each other for each point in time in the same way as those for the levels of localization or colocalization (see above). If the difference between the estimated \(K\) densities for 2007 and 1992 lies above the upper bound of the 90 % confidence interval of the distribution of these 1,000 counterfactual differences at short distances, we will say that the respective function became more localized (or the two functions became more colocalized) over time. And if it lies below the lower bound of this confidence interval, we will say that the function became more dispersed (or the two functions became more codispersed) over time.
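The decision rule for changes over time can be sketched for a single distance as follows. The sketch assumes that the \(r\)-th counterfactual density for 1992 is paired with the \(r\)-th counterfactual density for 2007 and that the 90 % interval is given by the 5th and 95th percentiles of the resulting differences. The function names and labels are hypothetical, and which distances count as short is not decided here.

```python
def percentile_band(values, level=0.90):
    """Two-sided band that cuts off (1 - level) / 2 of the draws in each tail."""
    ordered = sorted(values)
    k = round((1 - level) / 2 * len(ordered))
    return ordered[k], ordered[len(ordered) - 1 - k]


def classify_change(actual_1992, actual_2007, sims_1992, sims_2007, level=0.90):
    """Compare the actual change at one distance with the counterfactual changes."""
    diffs = [b - a for a, b in zip(sims_1992, sims_2007)]
    lower, upper = percentile_band(diffs, level)
    change = actual_2007 - actual_1992
    if change > upper:
        return "more localized"
    if change < lower:
        return "more dispersed"
    return "no significant change"
```

For the bivariate measure, the same rule applies with the labels read as more colocalized and more codispersed.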
Appendix 3: Additional tables
Table 4 Industry classification
Table 5 Correspondence between functions and occupations
Table 6 Shares of establishments and workers by functions
Table 7 Point estimates of the threshold distances, \(d_{jk}^{*}\), for localization and colocalization
Table 8 Localization and colocalization of functions by industries in West Germany 1992 and 2007
Table 9 Robustness of the classifications of industries