Using a Cobb–Douglas production function framework and multilevel (mixed) models, we control for firm inputs such as employment and capital, and also for regional-level indicators such as localization, education level, population density and economic openness. We are also explicitly interested in the extent to which firms in different industries are affected differently by related and unrelated variety. To analyze the relationship between related variety, unrelated variety and firm productivity, we use data from the ORBIS–Amadeus database, a commercial database provided by Bureau van Dijk (Moody’s Analytics). The ORBIS–Amadeus database provides financial and economic information on firms in almost all European countries, including balance sheet and income statement items, and a wide range of performance indices. ORBIS–Amadeus also includes information on a firm’s location, ownership and age and assigns NACE codes to companies (the European standard industry classification, named after the French Nomenclature statistique des Activités économiques dans la Communauté Européenne), which can be used to classify firms by sector. The effort to standardize and harmonize the financial information in the ORBIS–Amadeus database and make it comparable across countries has resulted in its widespread use in research (Kalemli-Özcan 2015). We employ a cross-sectional dataset of firms in 13 European Union countries for the year 2010. Only firms with known value added for that year are included in the model. Our sample consists of 200,463 firms.
We distinguish between several classifications in our models. First, we categorized firms into 184 NUTS-2 regions, second into 9 sectors and third into 1525 sector-by-region groups (clusters). The countries included in the sample are Belgium, Czech Republic, Germany, Spain, Finland, France, Italy, Portugal, Sweden, UK, Slovakia, Poland and Hungary. The 9 sectors are (1) Manufacturing 1 (food and textiles), (2) Manufacturing 2 (chemicals and plastic), (3) Manufacturing 3 (metals), (4) Manufacturing 4 (transportation and storage), (5) information and communication, (6) real estate services, (7) professional, scientific and technical activities, (8) administrative and support service activities and (9) other service activities. Following Wintjes and Hollanders (2010) and Cortinovis and Van Oort (2015), regions were ranked according to their capacities in terms of knowledge accessibility, knowledge absorption and knowledge diffusion (see “Appendix 1”). We categorized our regions into three technological regimes (high, medium and low). Furthermore, firms were categorized according to their technological and knowledge intensity, following Faggio et al. (2017), into four categories: (1) high–medium-technology manufacturing, (2) medium–low technology manufacturing, (3) knowledge-intensive business services and (4) less knowledge-intensive business services. Although service industries are frequently studied in agglomeration analyses (Henderson et al. 1995; Kolko 2010; Jacobs et al. 2013), they are mostly absent from relatedness studies that rely on data from input–output methodologies or manufacturing surveys. We further categorized firms by size into four categories (micro, small, medium and large), according to the European Union categorization (see Footnote 1).
Our study employs three (constructed) firm-level variables from the ORBIS–Amadeus database. The dependent variable is firm labor productivity, measured as the logarithm of value added per employee for each firm \( i \) in 2010. Employment is measured as the logarithm of the number of employees per firm. Capital is measured as the logarithm of the sum of tangible fixed assets and depreciation, expressed in thousands of euros (firm specific).
Agglomeration variables are measured at the regional level, and besides related and unrelated variety, two other variables are introduced. Localization economies, or sectoral specialization, are measured as the concentration (location quotient) of employees per region (measured at the region-by-sector level). Localization economies concern sector-specific externalities that are thought to be important for productivity, especially in later stages of firm and industry life cycles (Kemeny and Storper 2015). Next, density is measured as population per km\(^{2}\). This dimension of agglomeration is not directly related to localization economies (specialization) or diversity economies, but to pure urban size effects (Puga 2002; Ciccone 2002; Ciccone and Hall 1996). In general, the literature suggests that higher density enables better interaction, enhancing growth (Rice et al. 2006), although recent research that controls for composition effects moderates the suggested impact (Henderson 2003).
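The text does not spell out the location quotient formula, so the following is only a minimal sketch under the standard definition, which is assumed here: the share of a sector in regional employment divided by the share of that sector in total employment. The function name, region labels, sector labels and figures are hypothetical and are not taken from the ORBIS–Amadeus data.

```python
def location_quotients(emp):
    """Compute location quotients from employment counts.

    emp maps (region, sector) -> number of employees.
    LQ = (sector share in the region) / (sector share in the whole sample).
    Assumes every region and every sector has positive total employment.
    """
    total = sum(emp.values())
    region_tot, sector_tot = {}, {}
    for (region, sector), n in emp.items():
        region_tot[region] = region_tot.get(region, 0) + n
        sector_tot[sector] = sector_tot.get(sector, 0) + n
    return {
        (region, sector): (n / region_tot[region]) / (sector_tot[sector] / total)
        for (region, sector), n in emp.items()
    }


# Hypothetical two-region, two-sector example.
emp = {
    ("A", "metals"): 30, ("A", "services"): 70,
    ("B", "metals"): 10, ("B", "services"): 90,
}
lq = location_quotients(emp)
print(lq[("A", "metals")])  # region A is relatively specialized in metals
```

A value above 1 indicates that the sector is over-represented in the region relative to the sample as a whole, and a value below 1 indicates under-representation.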
Several variables are introduced at the regional level to control for further determinants of regional and cluster heterogeneity (Kim 1997). The degree of openness of European regions (a measure related to firm competition and regional competitiveness) is calculated as the total value of imports and exports in a region divided by the region’s GDP. This volume-of-trade indicator is based on make and use tables (input–output tables) for 2000 at the NUTS-2 level, covering 14 sectors and 59 product categories, including services. The database was developed by the Netherlands Environmental Assessment Agency (PBL); see Thissen et al. (2017). The volume of trade goes up with the size of the region at a declining rate. It is strongly dependent on global economic development, with competition on global markets driving up productivity and attracting new investments and collaborations. High potential may also spill over to nearby regions or into the regional network of specialized and subcontracting industries and regions. The average educational level of regions is measured by the percentage of the total population with tertiary or higher education. The relationship with (employment and productivity) growth is thought to be positive, as more skilled people can be more productive, and agglomeration may attract more of these people (Moretti 2004; Rauch 1993; for a more critical interpretation see Shapiro 2006).
To measure related and unrelated variety, we use an entropy measure proposed by Frenken et al. (2007) at the NUTS-2 level. When measuring regional variety to study the effects on firm productivity, decomposition is useful: variety at a high level of sector aggregation is expected to reflect the possibility of switching between input substitutes (unrelated variety), while variety at a low level of sector aggregation is expected to indicate possible knowledge spillovers arising from cognitive similarity (related variety). We also use the ORBIS–Amadeus micro-data as the source for the calculation of related and unrelated variety at the NUTS 1 level. Since small firms are underrepresented in this database, firm-level data are weighted by turnover values. In this fashion, we best capture the large and sectorially heterogeneous regional economies. We compute entropy using employment data, which are available at the four-digit level from the ORBIS–Amadeus database. Unrelated variety per region is indicated by the entropy of the two-digit distribution; related variety is indicated by the weighted sum of the entropy at the four-digit level within each two-digit class.
Marginal variety, meaning the increase in variety when moving from one digit level to the next, can be computed at all SIC digit levels in the dataset. Formally, let \( P_{i} \) be the share of four-digit sector \( i \) in total regional employment, where each four-digit sector \( i \) falls under exactly one two-digit sector \( S_{g} \), with \( g = 1, \ldots ,G \). Summing over all four-digit sectors, one can measure variety as
$$ V = \mathop \sum \limits_{i} P_{i} \log_{2} \left( {\frac{1}{{P_{i} }}} \right). $$
This measure is variety in a general form. The higher its value, the higher the industrial diversification of a region. This measure can be split into related and unrelated variety. First, by summing the four-digit shares \( P_{i} \) within each two-digit sector, one can derive the two-digit shares \( P_{g} \):
$$ P_{g} = \mathop \sum \limits_{{i \in S_{g} }} P_{i} . $$
Then, unrelated variety is measured as the two-digit level entropy and is given by:
$$ {\text{UV}} = \mathop \sum \limits_{g = 1}^{G} P_{g} \log_{2} \left( {\frac{1}{{P_{g} }}} \right). $$
To measure related variety, we compute the marginal increase in entropy when moving from the two-digit to the four-digit level. Formally, related variety is the weighted sum of entropy within each two-digit sector and is given by:
$$ {\text{RV}} = \mathop \sum \limits_{g = 1}^{G} P_{g} H_{g} $$
where (see Footnote 2)
$$ H_{g} = \mathop \sum \limits_{{i \in S_{g} }} \frac{{P_{i} }}{{P_{g} }}\log_{2} \left( {\frac{1}{{P_{i} /P_{g} }}} \right) $$
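The decomposition above can be checked numerically. The following is a minimal sketch and not the authors' implementation. It assumes that the two-digit class of a sector is given by the first two characters of its four-digit code, and it ignores the turnover weighting described earlier. The function name, sector codes and employment figures are hypothetical.

```python
import math
from collections import defaultdict


def entropy_decomposition(employment):
    """Return (V, UV, RV) in bits for one region.

    employment maps a four-digit sector code (string) -> employees.
    Assumption: the two-digit class is the first two characters of the code.
    """
    total = sum(employment.values())
    # Four-digit shares P_i; empty sectors are skipped (0 * log(1/0) := 0).
    p_i = {code: n / total for code, n in employment.items() if n > 0}
    # Two-digit shares P_g, obtained by summing P_i within each class S_g.
    p_g = defaultdict(float)
    for code, share in p_i.items():
        p_g[code[:2]] += share
    v = sum(p * math.log2(1 / p) for p in p_i.values())
    uv = sum(p * math.log2(1 / p) for p in p_g.values())
    # RV = sum_g P_g * H_g, with H_g the entropy of the shares P_i / P_g.
    rv = 0.0
    for code, share in p_i.items():
        within = share / p_g[code[:2]]
        rv += p_g[code[:2]] * within * math.log2(1 / within)
    return v, uv, rv


# Hypothetical region: two four-digit sectors in class "10", one in class "25".
region = {"1011": 50, "1012": 50, "2511": 100}
v, uv, rv = entropy_decomposition(region)
print(v, uv, rv)
```

In this example total variety equals the sum of unrelated and related variety, which is the additive property that motivates using entropy for the decomposition.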
The method of hierarchical or multilevel modeling allows the micro-level and macro-level to be modeled simultaneously. There are two distinct advantages to multilevel models (Van Oort et al. 2012). First, multilevel models offer a natural way to assess contextuality, or the extent to which a link exists between the macro-level and the micro-level. Applying multilevel analysis to empirical work on agglomeration begins from the simple observation that firms sharing the same external environment are more similar in their performance than firms that do not, because of shared agglomeration or other externalities. Hence, we can assess the extent to which variance in firm-level productivity can be attributed to between-firm variance, between-area variance, between-sector variance or between-cluster (sector-by-region) variance. With multilevel analysis, we are able to assign variability to the appropriate context. Second, multilevel analysis allows us to incorporate unobserved heterogeneity into the model by including random intercepts and allowing relationships to vary across contexts through the inclusion of random coefficients. Whereas “standard” regression models are designed to model the mean, multilevel analyses focus on modeling variances explicitly. This kind of complexity can be captured in a multilevel framework through the inclusion of random coefficients.
A Cobb–Douglas production function is expressed as follows:
$$ Y_{ijk} = A(T_{ijk} )^{{\beta_{n} }} K_{ijk}^{{\beta_{1} }} L_{ijk}^{{\beta_{2} }} e^{{e_{ijk} }} $$
(1)
where \( Y_{ijk} \) is labor productivity (value added per worker) of the \( i \)th firm nested within the \( j \)th sector-by-region group, which is nested within the \( k \)th region, \( A(T_{ijk} ) \) is TFP, \( K_{ijk} \) the capital stock and \( L_{ijk} \) the labor force. TFP of firm \( i \) depends on its regional and sectorial environment (see Fig. 1), and it is expressed in terms of related and unrelated variety (sectorial), localization and urbanization economies, openness of the region and level of education of the inhabitants (regional). Taking logs of Eq. (1), with \( y_{ijk} \) denoting the logarithm of labor productivity, and applying a three-level model with regions at the higher level, sectors by regions at the meso-level and firms within sectors by regions at the lower level, we obtain Eq. (2):
$$ \begin{aligned} y_{ijk} & = \beta_{0} + \beta_{1} {\text{CAP}}_{ijk} + \beta_{2} {\text{EMPL}}_{ijk} + \beta_{3} {\text{REL}}_{ijk} + \beta_{4} {\text{UNREL}}_{ijk} + \beta_{5} {\text{LOC}}_{ijk} \\ & \quad + \,\beta_{6} {\text{URB}}_{ijk} + \beta_{7} {\text{OPEN}}_{ijk} + \beta_{8} {\text{EDUC}}_{ijk} + u_{00k} + u_{0jk} + e_{0ijk} \\ \end{aligned} $$
(2)
where \( u_{00k} \) is the region-level random effect, whose variance captures the variation between regions, \( u_{0jk} \) is the sector-by-region random effect, whose variance captures the variation between sectors by regions within regions, and \( e_{0ijk} \) is the firm-level residual, whose variance captures the variation within sectors by regions. Figure 1 presents the multilevel structure of the analyses.
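To make the variance partitioning concrete, the share of unexplained variance attributable to each level can be written as a variance partition coefficient. This is a standard textbook expression, added here only as an illustration. The symbols \( \sigma_{k}^{2} \), \( \sigma_{jk}^{2} \) and \( \sigma_{e}^{2} \) are notation assumed here for the variances of the region-level, sector-by-region-level and firm-level terms, and do not appear in the original equations:

$$ \rho_{\text{region}} = \frac{{\sigma_{k}^{2} }}{{\sigma_{k}^{2} + \sigma_{jk}^{2} + \sigma_{e}^{2} }},\quad \rho_{\text{cluster}} = \frac{{\sigma_{jk}^{2} }}{{\sigma_{k}^{2} + \sigma_{jk}^{2} + \sigma_{e}^{2} }},\quad \rho_{\text{firm}} = \frac{{\sigma_{e}^{2} }}{{\sigma_{k}^{2} + \sigma_{jk}^{2} + \sigma_{e}^{2} }}. $$

The three shares sum to one, so a larger \( \rho_{\text{region}} \) means that a larger part of the remaining variation in firm productivity lies between regions rather than within them.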
In Eq. (2), we assume that the firm-level predictor variables are uncorrelated with the sector and regional-level error terms and that the sector-level predictor variables are uncorrelated with the regional-level error terms. However, both theoretically and empirically, such an assumption is difficult to meet, and not correcting for this would lead to inconsistent parameter estimates. Following Snijders and Berkhof (2008), we therefore remove the correlation between the lower-level predictor variables and higher-level error terms by including sector and regional means of the firm-level predictor variables in the regression model, a procedure known as the Mundlak (1978) correction.
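The data step behind the Mundlak correction amounts to appending group means of the firm-level predictors as additional regressors. The following is a minimal sketch of that step only, not of the estimation itself. It assumes the "sector" means are taken at the sector-by-region (cluster) level, which the text does not state explicitly. The field names `region`, `cluster`, `log_cap` and `log_empl`, the helper function and the values are hypothetical.

```python
from collections import defaultdict


def add_group_means(rows, var, group_key):
    """Return copies of rows with the group mean of `var` added.

    The new field is named '<var>_mean_<group_key>'.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[group_key]] += row[var]
        counts[row[group_key]] += 1
    name = f"{var}_mean_{group_key}"
    return [
        {**row, name: sums[row[group_key]] / counts[row[group_key]]}
        for row in rows
    ]


# Hypothetical firm records (logged capital and employment).
firms = [
    {"region": "R1", "cluster": "R1-metals", "log_cap": 2.0, "log_empl": 1.0},
    {"region": "R1", "cluster": "R1-metals", "log_cap": 4.0, "log_empl": 3.0},
    {"region": "R1", "cluster": "R1-ict", "log_cap": 6.0, "log_empl": 2.0},
    {"region": "R2", "cluster": "R2-ict", "log_cap": 1.0, "log_empl": 1.0},
]

# Add cluster-level and region-level means for each firm-level predictor.
for var in ("log_cap", "log_empl"):
    for key in ("cluster", "region"):
        firms = add_group_means(firms, var, key)

print(firms[0]["log_cap_mean_cluster"], firms[0]["log_cap_mean_region"])
```

The resulting mean columns would then enter the regression alongside the original firm-level predictors, so that the coefficients on the latter reflect within-group variation.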
Note that although the Mundlak correction addresses part of the endogeneity problem, the multilevel modeling framework does not control for endogeneity arising from reverse causality between firms’ productivity and variety. Specifically, firms’ location choices resulting in lower or higher levels of related and unrelated variety in some places could be induced by the migration of firms from places of low productivity to places of high productivity (Melo et al. 2009). One solution to this problem would be to instrument related and unrelated variety, but unfortunately, finding credible instruments is hard. Therefore, the results should be interpreted as conditional associations rather than causal relationships.