Measuring regional diversification
To study the relationship between R&D subsidies and regional diversification, we focus on 141 German labor-market regions (LMR), as defined by Kosfeld and Werner (2012). Our data cover the years from 1991 to 2010. As is common in the literature, we use patent data to approximate technological activities (Boschma et al. 2015; Rigby 2015; Balland et al. 2019). Despite well-discussed drawbacks (Griliches 1990; Cohen et al. 2000), patents contain detailed information about the invention process, such as the date, location, and technology, all of which are fundamental for our empirical analysis. We extract patent information from the OECD REGPAT Database, which covers patent applications at the European Patent Office (EPO). Based on inventors’ residences, we assign patents to the corresponding LMR. For smaller regions in particular, annual patent counts are known to fluctuate strongly, which challenges robust estimation. We therefore aggregate our data into four 5-year periods (1991–1995, 1996–2000, 2001–2005, 2006–2010).
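The aggregation into 5-year periods can be illustrated with a minimal sketch. The data layout (a dictionary keyed by region, technology, and year) and the names `period_index`, `aggregate_to_periods`, and `LMR_001` are assumptions made for illustration and are not taken from the REGPAT data structure.

```python
from collections import defaultdict

# The four 5-year periods used in the analysis (inclusive year bounds).
PERIODS = [(1991, 1995), (1996, 2000), (2001, 2005), (2006, 2010)]


def period_index(year):
    """Return the 0-based index of the period containing `year`, or None."""
    for idx, (start, end) in enumerate(PERIODS):
        if start <= year <= end:
            return idx
    return None


def aggregate_to_periods(annual_counts):
    """Sum annual patent counts keyed by (region, technology, year) into periods."""
    totals = defaultdict(int)
    for (region, tech, year), count in annual_counts.items():
        idx = period_index(year)
        if idx is not None:  # years outside 1991-2010 are ignored
            totals[(region, tech, idx)] += count
    return dict(totals)


# Hypothetical annual counts for one small region and one IPC class.
annual = {
    ("LMR_001", "A61K", 1991): 0,
    ("LMR_001", "A61K", 1993): 2,
    ("LMR_001", "A61K", 1996): 1,
    ("LMR_001", "A61K", 2011): 5,
}
period_counts = aggregate_to_periods(annual)
```

In this toy example, the two observations from 1991 and 1993 are pooled into the first period, which smooths the year-to-year fluctuation that is typical for small regions.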
Technologies are classified according to the International Patent Classification (IPC). The IPC is organized hierarchically, with eight classes at the highest level and more than 71,000 classes at the lowest level. We aggregate the data to the four-digit IPC level, which differentiates between 630 distinct technology classes. The four-digit level represents the best trade-off between a maximum number of technologies and sufficiently large patent counts in each of these classes.
Previous studies relied on the location quotient (LQ), also called revealed technological advantage (RTA), to identify diversification processes. LQ values larger than one signal the existence of technological competences in a region, and values below one signal their absence. Successful diversification is then identified when the LQ grows from below one to above one between two periods (Boschma et al. 2015; Rigby 2015; Cortinovis et al. 2017; Balland et al. 2019). We refrain from this approach for two important reasons. Firstly, because the LQ is a relative measure, it allows technologies to “artificially” emerge in a region simply because patent numbers decrease in other regions. Secondly, the LQ is normalized at the regional and technology levels, which can interfere with the inclusion of regional and technology fixed effects in panel regressions.
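The first concern can be made concrete with a small numerical sketch. It uses the standard definition of the LQ (a region's share of patents in a technology divided by the overall share of that technology); the regions, technologies, and counts are hypothetical, and the function assumes non-zero totals.

```python
def location_quotient(counts, region, tech):
    """LQ = (share of `tech` in the region's patents) / (share of `tech` overall).

    `counts` maps (region, technology) to a patent count. Totals are assumed
    to be non-zero in this sketch.
    """
    region_total = sum(n for (r, _), n in counts.items() if r == region)
    tech_total = sum(n for (_, k), n in counts.items() if k == tech)
    grand_total = sum(counts.values())
    regional_share = counts.get((region, tech), 0) / region_total
    overall_share = tech_total / grand_total
    return regional_share / overall_share


# Region A patents exactly the same in both periods; only region B changes.
period_1 = {("A", "x"): 2, ("A", "y"): 8, ("B", "x"): 10, ("B", "y"): 10}
period_2 = {("A", "x"): 2, ("A", "y"): 8, ("B", "x"): 0, ("B", "y"): 10}

lq_before = location_quotient(period_1, "A", "x")
lq_after = location_quotient(period_2, "A", "x")
```

Here the LQ of region A in technology x crosses the threshold of one although A's own patenting is unchanged, which an LQ-based entry definition would record as diversification.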
We therefore rely on an alternative and more direct approach to assess diversification processes by concentrating on absolute changes in regional patent numbers. More precisely, we create the binary dependent variable Entry, which takes a value of 1 if we do not observe any patents in technology k in region r in period t but observe a positive number of patents in the subsequent period \(t+1\). We checked the data extensively for random fluctuations between subsequent periods, which can inflate the number of observed entries. The aggregation of regional patent information into 5-year periods, however, eliminated such cases almost completely.
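A minimal sketch of this definition for a single region-technology pair is given below. The function name and the convention of returning `None` when the technology is already present in period t (so that no entry is possible) are assumptions made for illustration.

```python
def entry_indicators(counts_by_period):
    """Return Entry for each pair of consecutive periods (t, t+1).

    Entry is 1 if there are no patents in t and at least one in t+1,
    0 if there are no patents in either period, and None if the
    technology is already present in t (no entry is possible).
    """
    flags = []
    for current, following in zip(counts_by_period, counts_by_period[1:]):
        if current > 0:
            flags.append(None)
        else:
            flags.append(1 if following > 0 else 0)
    return flags


# Hypothetical patent counts of one technology in one region over four periods.
flags = entry_indicators([0, 0, 3, 1])
```

With four periods, each region-technology pair yields at most three transitions; in the toy series, the technology enters between the second and third period.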
Information on R&D subsidies
Our main explanatory variable, Subsidies, represents the sum of R&D projects in technology class k and region r at time t. The so-called Foerderkatalog of the German Federal Ministry of Education and Research (BMBF) serves as our data source. The BMBF data cover the largest parts of project-based R&D support at the national level in Germany (Czarnitzki et al. 2007; Broekel and Graf 2012) and have been used in a number of previous studies (Broekel and Graf 2012; Broekel et al. 2015a, b; Cantner and Kösters 2012; Fornahl et al. 2011). The data provide detailed information on granted individual and joint R&D projects, such as the starting and ending dates, the location of the executing organization, and a technological classification called Leistungsplansystematik (LPS).
The LPS is a classification scheme developed by the BMBF and consists of 47 main classes. The main classes are, similarly to the IPC, disaggregated into more fine-grained subclasses, which comprise 1395 unique classes at the most detailed level. To create the variable Subsidies, we need to match the information on R&D subsidies with the patent data. Both are based on different classification schemes (IPC and LPS), which prevents a direct matching. Moreover, there is no existing concordance of the two classifications.
We therefore develop such a concordance. To build the concordance, we reduce the information contained in the Foerderkatalog by excluding classes that are irrelevant for patent-based innovation activities. This primarily refers to subsidies in the fields of social sciences, general support for higher education, gender support, and labor conditions. Next, we utilize a matched-patent-subsidies-firm database created by the Halle Institute for Economic Research. This database includes 325,497 patent applications by 5398 German applicants between 1999 and 2017. It also contains information on 64,156 grants of the Foerderkatalog with 10,624 uniquely identified beneficiaries. In this case, beneficiaries represent so-called executive units (“Ausführende Stelle”) (see Broekel and Graf 2012).
In this database, grant beneficiaries and patent applicants are linked by name-matching. Hence, the IPC classes of beneficiaries' patents can be linked to the LPS classes of their grants. In principle, this information allows for a matching at the most fine-grained level of the IPC and LPS. In this case, however, the majority of links are established by a single instance of IPC classes coinciding with LPS classes, i.e., there is only one organization with a patent in IPC class k and a grant in LPS class l. Moreover, the concordance is characterized by an excessive number of zeros, as only a few of the 71,000 (IPC) \(\times\) 1395 (LPS) possible matches are realized.
To render the concordance more robust, we therefore establish the link on a more aggregated level, which also makes the concordance correspond to the data employed in this study. More precisely, we aggregate the IPC classes to the four-digit level and the LPS to the 47 main classes defined in BMBF (2014). It is important to note that not all LPS main classes are relevant for patent-based innovation (e.g., arts and humanities). We eliminate such classes and eventually obtain 30 LPS main classes that are matched to 617 out of 630 empirically observed IPC classes. For these, we calculate the share of organizations \(S_{l,k}\) with grants in LPS l that also patent in IPC k:
$$\begin{aligned} S_{l,k}=\frac{n_{l,k}}{\sum _{x=1}^{X_l} n_{x}} \end{aligned}$$
(1)
with \(n_{l,k}\) being the number of organizations with at least one patent in k and grant in l. \(X_l\) is the total number of organizations with grants in l. On this basis, we calculate the number of subsidized projects, \( \hbox {Subsidies}_{l,k}\), assigned to region r and technology k by multiplying the number of grants in l acquired by regional organizations with patents in k with \(S_{l,k}\). Following the discussion in Sect. 2, we calculate Subsidies in three versions: on the basis of all subsidized projects (\( \hbox {Subsidies}_{k,r}\)), for individual projects (\( \hbox {Subsidies}^{\mathrm{Single}}_{k,r}\)), and considering only joint projects (\( \hbox {Subsidies}^{\mathrm{Joint}}_{k,r}\)) in technology class k and region r.
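The construction of the concordance weights can be sketched as follows. The sketch follows the verbal definition (the share of organizations with grants in l that also patent in k). The organization identifiers, the LPS codes `L01` and `L02`, the function names, and the summation over LPS classes in `weighted_subsidies` are assumptions for illustration rather than the exact procedure applied to the matched database.

```python
from collections import defaultdict


def concordance_shares(grants_by_org, patents_by_org):
    """Return {(l, k): share of organizations with grants in l that patent in k}."""
    orgs_with_grant = defaultdict(set)  # l -> organizations with a grant in l
    for org, lps_classes in grants_by_org.items():
        for l in lps_classes:
            orgs_with_grant[l].add(org)

    shares = {}
    for l, orgs in orgs_with_grant.items():
        co_counts = defaultdict(int)  # k -> n_{l,k}
        for org in orgs:
            for k in patents_by_org.get(org, set()):
                co_counts[k] += 1
        for k, n_lk in co_counts.items():
            shares[(l, k)] = n_lk / len(orgs)
    return shares


def weighted_subsidies(regional_grants, shares, k):
    """Weight grant counts per LPS class by S_{l,k} and sum over l (assumption).

    `regional_grants` maps an LPS class l to the number of grants in l held by
    the region's organizations that patent in technology k.
    """
    return sum(n * shares.get((l, k), 0.0) for l, n in regional_grants.items())


# Hypothetical organizations with their LPS grant classes and IPC patent classes.
grants_by_org = {"org1": {"L01"}, "org2": {"L01", "L02"},
                 "org3": {"L01"}, "org4": {"L02"}}
patents_by_org = {"org1": {"A61K"}, "org2": {"A61K", "G06F"}, "org4": {"G06F"}}

shares = concordance_shares(grants_by_org, patents_by_org)
subsidies_g06f = weighted_subsidies({"L01": 3, "L02": 2}, shares, "G06F")
```

In the toy data, three organizations hold grants in `L01` and two of them patent in `A61K`, so the corresponding share is 2/3; organizations without any patents (here `org3`) lower the shares of their LPS classes.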
Relatedness density
Our second most important explanatory variable is relatedness. We follow the literature in constructing this variable as a density measure (Hidalgo et al. 2007; Rigby 2015; Boschma et al. 2015). More precisely, relatedness density reveals how well technologies fit into the regional technology landscape. It is constructed in two steps.
Firstly, we measure technological relatedness between each pair of technologies. The literature suggests four major approaches: (1) entropy-based (Frenken et al. 2007), (2) input–output linkages (Essletzbichler 2015), (3) spatial co-occurrence (Hidalgo et al. 2007), and (4) co-classification (Engelsman and van Raan 1994). We follow the fourth approach and calculate technological relatedness between two technologies (four-digit patent classes) based on their co-classification pattern (co-occurrence of patent classes on patents). The cosine similarity gives us a measure of technological relatedness between each technology pair (Breschi et al. 2003).
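A minimal sketch of a co-classification-based cosine measure is shown below. It uses one common variant, the cosine between the binary patent-incidence vectors of two classes, i.e., the number of co-classified patents divided by the square root of the product of the two class frequencies. Whether this matches the exact variant of Breschi et al. (2003) is an assumption, and the patents listed are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_relatedness(patents):
    """Return {(i, j): cosine similarity} for class pairs that co-occur.

    `patents` is an iterable of sets of four-digit IPC classes, one set per
    patent. Keys are alphabetically ordered pairs (i < j); pairs that never
    co-occur are omitted and have a similarity of zero.
    """
    occurrences = Counter()     # N_i: number of patents listing class i
    co_occurrences = Counter()  # C_ij: number of patents listing both i and j
    for classes in patents:
        ordered = sorted(set(classes))
        occurrences.update(ordered)
        co_occurrences.update(combinations(ordered, 2))

    return {
        (i, j): c_ij / sqrt(occurrences[i] * occurrences[j])
        for (i, j), c_ij in co_occurrences.items()
    }


# Hypothetical patents with their four-digit IPC classes.
patents = [
    {"A61K", "A61P"},
    {"A61K", "A61P"},
    {"A61K", "C07D"},
    {"A61P"},
]
relatedness = cosine_relatedness(patents)
```

The measure is bounded between zero and one and corrects for the size of classes, so that large classes do not appear related to everything merely because they occur on many patents.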
Secondly, we determine which technologies belong to regions’ technology portfolios at a given time. Straightforwardly, we use patent counts with positive numbers indicating the presence of a technology in a region. Following Hidalgo et al. (2007), we measure relatedness density on this basis as:
$$\begin{aligned} \hbox {Density}_{k,r} = \frac{\sum _{m} x_{m} \; \rho _{k,m}}{\sum _{m} \rho _{k,m}} \times 100 \end{aligned}$$
(2)
where Density stands for relatedness density. \(\rho _{k,m}\) indicates the technological relatedness between technologies k and m, while \(x_{m}\) is equal to 1 if technology m is part of the regional portfolio (Patents \(> 0\)) and 0 otherwise (Patents \(=\) 0). Consequently, we obtain a 141 \(\times\) 630 matrix that contains the relatedness density of each of the 630 IPC classes in each of the 141 LMRs, indicating how closely each class is related to the region's existing technology portfolio.
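Equation (2) can be illustrated with a short sketch for a single region. The relatedness values, the technology labels, the exclusion of \(m = k\) from both sums, and the convention of returning zero when a technology is unrelated to all others are assumptions made for illustration.

```python
def relatedness_density(k, portfolio, technologies, rho):
    """Relatedness density of technology k for a region, following Eq. (2).

    `portfolio` is the set of technologies with positive patent counts in the
    region, and `rho(k, m)` returns the relatedness between k and m.
    """
    others = [m for m in technologies if m != k]  # assumption: m = k is excluded
    total = sum(rho(k, m) for m in others)
    if total == 0:
        return 0.0  # assumption: a technology unrelated to all others gets 0
    present = sum(rho(k, m) for m in others if m in portfolio)
    return 100 * present / total


# Hypothetical symmetric relatedness values between four technologies.
RHO = {
    frozenset(("A", "B")): 0.5,
    frozenset(("A", "C")): 0.25,
    frozenset(("A", "D")): 0.25,
    frozenset(("B", "C")): 0.5,
}


def rho(k, m):
    return RHO.get(frozenset((k, m)), 0.0)


technologies = ["A", "B", "C", "D"]
portfolio = {"B"}  # the region currently patents only in technology B

density_a = relatedness_density("A", portfolio, technologies, rho)
density_d = relatedness_density("D", portfolio, technologies, rho)
```

In this example, half of technology A's total relatedness is directed at technology B, which is present in the region, so A has a density of 50. Technology D is related only to the absent technology A and therefore has a density of zero.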
Control variables
In addition to R&D subsidies and relatedness density, the empirical literature has identified a number of other determinants of regional technological diversification. Knowledge spillovers from adjacent regions can potentially impact regional diversification processes (Boschma et al. 2013). We account for these potential spatial spillovers and include technological activities in neighboring regions (\(Neighbor \; Patents_{k,r}\)) as a spatially lagged variable. The variable counts the number of patents in technology k of all neighboring regions s of region r. Regions s and r are neighbors if they share a common border.
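The spatially lagged variable can be sketched as a sum over a contiguity list. The region identifiers, the adjacency structure, and the patent counts below are hypothetical.

```python
def neighbor_patents(region, tech, neighbors, patents):
    """Sum the patents in `tech` over all regions sharing a border with `region`.

    `neighbors` maps a region to the set of regions with a common border, and
    `patents` maps (region, technology) to a patent count.
    """
    return sum(patents.get((s, tech), 0) for s in neighbors.get(region, ()))


# Hypothetical contiguity structure: r1 borders r2 and r3.
neighbors = {"r1": {"r2", "r3"}, "r2": {"r1"}, "r3": {"r1"}}
patents = {
    ("r1", "A61K"): 4,
    ("r2", "A61K"): 3,
    ("r3", "A61K"): 5,
    ("r3", "G06F"): 2,
}

neighbor_a61k_r1 = neighbor_patents("r1", "A61K", neighbors, patents)
neighbor_g06f_r2 = neighbor_patents("r2", "G06F", neighbors, patents)
```

The region's own patents are not part of the sum, so the variable captures only activity across the border.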
We also control for a number of time-varying regional and technology characteristics that influence regional diversification processes. Firstly, regional diversification depends on the development stage of regions (Petralia et al. 2017): economically well-performing regions have more opportunities to diversify into new and more advanced activities than less developed regions. We follow existing approaches and use the gross domestic product per capita (\(Regional\;GDP_{r}\), log transformed) to control for the economic performance of regions (Petralia et al. 2017; Balland et al. 2019). Secondly, the size of the region also plays a role. Regions with a larger workforce tend to be more successful in terms of diversification (Boschma et al. 2015; Balland et al. 2019). We therefore include the number of employees in a region (\(Regional\;Employment_{r}\), log transformed) in our empirical model. Both variables, \(Regional\;GDP_{r}\) and \(Regional\;Employment_{r}\), are obtained from the German “Arbeitskreis Volkswirtschaftliche Gesamtrechnungen der Länder” (August 2018). Thirdly, we also consider the number of regional patents (\(Regional\;Patents_{r}\)) to control for the size of the regional patent stock, which also serves as a measure of regions’ overall technological capabilities. Fourthly, diverse regions with larger sets of capabilities have more opportunities to move into new fields than regions with narrow sets (Hidalgo et al. 2007). The regional diversity (\(Regional\;Diversity_{r}\)) variable captures this and is defined as the number of technologies k with positive patent counts in a region. Lastly, the size of technologies is controlled for by considering the number of patents in a given technology (\(Technology\;Size_{k}\)). Descriptive statistics and correlations for all variables are reported in Table 1.
Table 1 Summary statistics and correlation matrix
Empirical model
We follow an established approach in the literature on regional diversification to set up our empirical model (Boschma et al. 2015; Balland et al. 2019). More precisely, we rely on panel regressions to explain the status of technological diversification in a region. Our basic model is specified as follows:
$$\begin{aligned} Entry_{k,r,t}= & {} \beta _{1}\hbox {Subsidies}_{k,r,t-1} + \beta _{2}Density_{k,r,t-1}\nonumber \\&+ X_{k,r,t-1} + R_{r,t-1} + T_{k,t-1} + \tau _{k} + \pi _{r} + \omega _{t} + \epsilon _{k,r,t} \end{aligned}$$
(3)
Entry indicates the status of diversification into technology k of region r at time t. Accordingly, all estimations are conducted at the region-technology level. Subsidies summarizes the number of subsidized R&D projects. In alternative models, it is replaced with the number of individual (\(\hbox {Subsidies}^{\mathrm{Single}}\)) and joint projects (\(\hbox {Subsidies}^{\mathrm{Joint}}\)). Density is the relatedness density, and X, R, and T are vectors of control variables at the technology-region, region, and technology level. All estimations include technology (\(\tau\)), region (\(\pi\)), and time (\(\omega\)) fixed effects capturing time-invariant unobserved heterogeneity. We assume that our dependent variable responds to variation in the explanatory variables with a time delay. R&D subsidies, for example, are unlikely to cause immediate effects visible in innovation activities as approximated by patents. Rather, they unfold their influence in subsequent years. Consequently, we lag the explanatory variables by one time period, which corresponds to 5 years.
As Entry is a binary variable, a logit regression is applicable. Nevertheless, logit regressions with many fixed effects and few time periods can suffer from the well-known incidental parameters problem, which causes biased estimates (Neyman and Scott 1948). Therefore, we instead rely on a linear probability model (LPM) to assess the probability that technology k emerges in region r. We nevertheless report the results of the three-way fixed effects logit regression in our robustness checks (see Table 7 in the “Appendix”). An entry model implies restricting the observations to those cases in which an entry is possible. Accordingly, we reduce the sample to all potential cases of entry, which corresponds to technology k being absent from the regional technology portfolio in \(t-1\) (zero patents).
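The sample restriction and the one-period lag of the explanatory variables can be sketched as follows. The data layout, the function name `build_entry_sample`, and the two regressors shown are assumptions for illustration; the sketch constructs the estimation sample only and does not estimate the LPM itself.

```python
def build_entry_sample(patents, regressors):
    """Build region-technology-period rows for the entry model.

    `patents` maps (technology, region) to a list of patent counts per period.
    `regressors` maps (technology, region) to a list of dicts with the
    explanatory variables per period. A row for period t is kept only if the
    technology was absent in t-1, and it carries the regressors of t-1.
    """
    rows = []
    for (k, r), counts in patents.items():
        for t in range(1, len(counts)):
            if counts[t - 1] > 0:
                continue  # technology already present in t-1: no entry possible
            row = {
                "technology": k,
                "region": r,
                "period": t,
                "entry": 1 if counts[t] > 0 else 0,
            }
            row.update(regressors[(k, r)][t - 1])  # lagged by one period
            rows.append(row)
    return rows


# Hypothetical counts and regressors for two technologies in one region.
patents = {("A61K", "r1"): [0, 0, 2, 5], ("G06F", "r1"): [3, 0, 1, 0]}
regressors = {
    ("A61K", "r1"): [{"subsidies": 0.0, "density": 10.0},
                     {"subsidies": 1.5, "density": 20.0},
                     {"subsidies": 2.0, "density": 30.0},
                     {"subsidies": 2.5, "density": 35.0}],
    ("G06F", "r1"): [{"subsidies": 1.0, "density": 40.0},
                     {"subsidies": 0.5, "density": 25.0},
                     {"subsidies": 0.0, "density": 30.0},
                     {"subsidies": 0.0, "density": 30.0}],
}
sample = build_entry_sample(patents, regressors)
```

Because the first period serves only as the source of lagged values, each region-technology pair contributes at most three observations, and pairs in which the technology is present throughout contribute none.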