Introduction

Large cities tend to make more productive use of available resources through the generation of agglomeration economies (Melo et al. 2009; Rosenthal and Strange 2004; Combes et al. 2010). Most mechanisms that give rise to agglomeration economies rest on some type of heterogeneity of workers or firms. “It is very difficult to conceive how interactions within an ‘army of clones’ could generate sufficient benefits to justify the existence of modern cities” (Duranton and Puga 2004a). Although heterogeneity in the skills of workers is traditionally difficult to measure, a recent study confirms a positive relationship between population size and the diversity of professions across US cities (Bettencourt et al. 2014). Indeed, a number of quantities related to productivity and innovation scale super-linearly with city size, including wages, the number of patents (Bettencourt et al. 2007) and formal employment rates (O’Clery et al. 2016). However, a positive relationship between city size and economic growth is far from guaranteed, especially in emerging economies (McCann and Acs 2011). Alternative factors such as urban infrastructure and institutional capacity may be more important determinants of a city’s ability to generate economic growth (Camagni et al. 2015; Castells-Quintana 2017; Glaeser 2014; Frick and Rodríguez-Pose 2018).

Agglomeration economies are productivity gains that occur when many firms and workers concentrate their activities in space, i.e. “agglomerate”. Several mechanisms may give rise to such gains, one of them being the quantity and quality of matchings between firms and workers (Duranton and Puga 2004a). A large literature looks at cities as mainly unified labour markets (Goldner 1955; Ihlanfeldt 1997; Bertaud 2004) and it has been argued that the capacity to maintain a unified labour market is the limit to city size and to the growth of megacities in the long run (Prud’Homme 1996). Complete labour mobility within a large city or metropolitan area is a necessary condition for the optimal exploitation of the resulting agglomeration economies. No matter their location within the city or metropolitan area, households should be able to reach “within a reasonable time” all the areas where jobs are offered (Bertaud 2004). However, to our knowledge, no rigorous basis has been provided to define such “reasonable time” and to use it as a technical criterion to determine which municipalities should be considered part of a metropolitan area. Studies by Barrios et al. (2012) and Feser and Sweeney (2002), for example, examined the spatial extent of agglomeration economies in Ireland and the US due to access by firms to a diverse pool of skills, finding evidence of economies at distances of up to 50 kilometres and 50 miles respectively; however, these studies do not consider travelling times.

This paper aims at estimating the commuting times that allow firms to mobilise the diversity of skills available in an urban area. It builds on a theoretical model, proposed by O’Clery et al. (2016), that describes the mechanisms by which a developing city becomes more productive by increasing its formal employment rate. (Formal employment is that which takes place in firms that comply with labour legislation.) The model assumes that skill heterogeneity, which increases with the size of the working-age population or the city size, is the main source of agglomeration economies. Complementarities across skills imply that complex sectors that use a larger diversity of skills are more productive than those where a more limited set of skills is sufficient to produce. In large cities, skills are more diverse and matches between firms and workers are more likely. As a result, more complex industries are more likely to appear in large cities, absorbing workers from outside the formal economy. New sectors do not emerge randomly as cities grow. If they did, cities would tend to become similar to each other. But the opposite happens: in India, for instance, Bangalore thrives in the high-tech industries, while Bharuch is strong in the petrochemical and Ludhiana in the garment industry (Duttagupta 2019). Regional specialisation may occur even in closely related industries: in Brazil, the region of Vale dos Sinos, in the Rio Grande do Sul state, concentrates on the production of women’s shoes, while the region of Franca, in the state of São Paulo, specialises in the production of men’s footwear (TheBrazilBusiness.com 2019). Firms and industries build on existing strengths in their locations, meaning that development processes are strongly path-dependent (Nelson and Winter 1982; Hausmann et al. 2007a; Neffke et al. 2011), leading cities to specialise in distinct industries or niches within industries.

The model proposed by O’Clery et al. (2016) to predict the growth of formal employment for a city captures the skill-proximity of the city from current industries to potential new complex industries. Distinct from education and work experience, capabilities and skills are difficult to capture. In order to quantify the complexity of sectors in terms of their diverse skill requirements, the model presented here utilises the network-based economic complexity framework proposed by Hidalgo and Hausmann (2009). Similarly, the skill proximity or “distance” between sectors is not directly observable. We estimate the skill-proximity between industry pairs based on worker transitions between sectors (Neffke and Henning 2013). The model has robust and strong predictive power, and fares better than alternative explanations of formal employment growth across cities.

The main contribution of this paper is to exploit this model in order to investigate the relationship between commuting times in the catchment area of firms and their ability to source the diverse skills needed to increase formal employment. On the one hand, the strength of agglomeration economies increases with the diversity of the skills that firms can access in their catchment areas. On the other hand, while a larger population size leads to higher skill diversity, at some point longer commuting times limit the ability of firms to make effective use of such diversity.

We focus our analysis on Colombia, partly due to data availability, but also because Colombia is a particularly interesting emerging economy due to persistently low rates of formal employment and profound regional differences rooted in a distinct geographical and historical context (Acemoglu et al. 2015). As a result, cities are very heterogeneous in their formality rate, industry composition and in their degree of connection to other cities and to the rest of the world. By some indicators, Colombia is one of the most geographically fragmented countries in the world (Gallup et al. 2003).

We seek to find the commuting time radius that maximises the strength of agglomeration economies in terms of enabling formal firms to access the skills they need to grow their activities in complex sectors. If a longer commuting time radius effectively allows firms in each municipality to reach workers living in other municipalities, we should observe that firms are able to generate more formal employment than if they were tapping skills from only their own municipality. We corroborate this hypothesis and find that a time radius of between 45 and 75 minutes maximises the strength of the agglomeration economies. As a result, we can conclude that around a third of the 96 urban municipalities form part of the labour market of other municipalities.

A range of policy implications can be derived from our findings. One important issue is how to take advantage of the tendency of big cities to absorb neighbouring municipalities, particularly in some countries where urbanisation is growing fast. Our results provide arguments for coordinating transportation, planning and industry re-localisation policies across urban municipalities of different sizes, especially if they are part of the same urban area. Our analysis sheds light on the importance of improving the transportation infrastructure to connect small municipalities to urban centres.

The rest of this paper is organised as follows. In the “Related literature” section we review the main literature strands related to our research question and methods. The “Formal employment and industry complexity in cities” section introduces our definitions of cities, formal employment and industrial complexity, and investigates their relationships. On this basis, the “A network-based model for formal employment growth” section is focused on the underlying ideas and construction of the “complexity potential” metric, the main explanatory variable. The “Model and regression results” section explores the relationship between “formal employment growth” and “complexity potential” (and other controls) for a range of city definitions based on commuting time. We deduce from these results the commuting radius within which firms can make the best use of the skills available around them. Finally, conclusions and policy implications are discussed in the “Conclusions and policy implications” section.

Related literature

This paper contributes to a growing literature that emphasises the role that large cities have in better facilitating matching between employers and employees, and knowledge spillovers between firms and industries via R&D collaboration or the mobility of skilled individuals (Friedrichs 1993; Duranton and Puga 2001 and 2004b; Rosenthal and Strange 2006). Cities thrive because they act as a cauldron, enabling firms to combine skills in a process of incremental diversification and sophistication of production. The emphasis on skills and the processes of mixing them is not new [see, e.g., Romer (1986); Lucas Jr (1988); Romer (1990); Kremer (1993); Jones (1995); Jones (2002); Barro and Martin (2003); Jones and Romer (2010); Weil (2012)]. Our emphasis here, however, is on the role of skills in the formation of increasingly complex industries. Employing a simple analogy, we consider skills as letters in a game of Scrabble. The more letters (or skills) a firm can access, the greater the number and sophistication of words (products or services) it can build. Under this model, as cities acquire new skills (or letters in Scrabble), the number of possible industries (or words in the game) grows exponentially. This perspective, derived from the economic complexity and evolutionary economic geography literatures (Hidalgo and Hausmann 2009; Hausmann and Hidalgo 2011; Frenken and Boschma 2007), implies that large cities are able to disproportionately benefit from agglomeration economies resulting from skill complementarities.

Beyond skill abundance and heterogeneity, a second key mechanism is at play. Building on locally embedded skills and capabilities, cities move into new economic activities in a path-dependent manner (Nelson and Winter 1982; Hausmann et al. 2007a; Frenken and Boschma 2007; Neffke et al. 2011). This process can be modelled using an industry network, where the nodes represent the industries and the edges represent their level of relatedness (Hidalgo et al. 2007; Neffke et al. 2011). The diversification of a city into new (complex) industries can be described via a dynamical process on the industry network: cities move into new industries located at short “distance” from current industries in the network (Hidalgo et al. 2007; O’Clery et al. 2016). As proposed by Neffke and Henning (2013), here we have used the number of worker transitions (job switches) in order to estimate the skill-proximity between industry pairs. There are, however, a variety of possible approaches to quantifying “relatedness” between industries, including co-location (Hausmann et al. 2007a; Hausmann et al. 2014a), occupational similarity (Farjoun 1994; Chang 1996) and input-output (customer-supplier) relationships (Fan and Lang 2000). Industry networks have been used to model a wide range of local growth and diversification processes, including firm and sector entry (Neffke et al. 2011; Hausmann et al. 2014b).

The focus of this paper concerns the location of capabilities within and around a city, and in particular, the distance between firms and workers with the skills they need. This often overlooked aspect of firm-worker matching is key to the success of the complex economic activities that drive the growth of the formal sector. For example, cities may benefit from the skills and capabilities available in their surrounding catchment areas, and neighbouring cities. Boston, for instance, profits from a high density of world-class universities in neighbouring Cambridge and has become a leading hub for technology and education start-ups in recent years. The importance of distance, location and accessibility in the growth of cities has been well-studied by those interested in commuting zones (Duranton and Overman 2005; Duranton 2015a), transportation links (Venables 2007; Duranton and Turner 2012) and industry clusters (Porter 2011).

A parallel literature, also related to this work, seeks to apply mathematical and data-driven methods to model and analyse transportation networks (Roth et al. 2012; Roth et al. 2011) and mobility patterns (Song et al. 2010; Simini et al. 2012; Prieto Curiel et al. 2018) in cities. Our work is related to studies pertaining to accessibility in cities (see Batty (2009) for a review) and its relationship to the structure of the transport network (Strano et al. 2015; Piovani et al. 2018). A related strand is concerned with social mixing and the fragmentation of social classes in terms of mobility radius and behaviour (Pappalardo et al. 2016; Lotero et al. 2016).

While cities facilitate employment creation, distances between jobs and housing can hinder that process. The ‘spatial mismatch’ hypothesis, originally put forward by John F. Kain in the early sixties, suggested that poor employment outcomes of Afro-Americans are partly due to residential segregation following the decay of the inner cities and the relocation of many jobs to the suburbs (see reviews by Kain (1992) and Ihlanfeldt and Sjoquist (1998)). More recently, Black et al. (2014) found that labour force participation rates of married women are negatively correlated with metropolitan area commuting times, and that metropolitan areas with larger increases in average commuting time had slower growth in participation rates (during the period 1980-2000). These findings are consistent with the idea that, in order to accommodate domestic responsibilities, working women commute shorter distances than men do (Madden 1981; White 1986; Rosenthal and Strange 2012). Consistently, female entrepreneurs locate businesses in less agglomerated locations than do their male counterparts (Rosenthal and Strange 2012).

This paper contributes to the literature that aims at defining metropolitan areas (see Duranton (2015a) for a historical review). The first definitions of metropolitan areas were proposed in the US (the so-called Standard Metropolitan Statistical Areas, SMSA) and other developed countries in the 1950s and 1960s. Currently, the most widely applied methodology is the one established by the OECD in collaboration with the European Union based on decades of research (OECD 2012; 2013). Under this approach, non-contiguous urban cores are considered parts of the same metropolitan area if more than 15% of the resident population of one core commutes to work in the other core. Similarly, an individual municipality is considered part of a metropolitan area if more than 15% of its resident population works in the core area. This methodology has been applied to over 30 countries, including Colombia (Sanchez-Serra 2016). Previously for Colombia, Duranton (2015a) had developed a similar methodology assuming that a municipality belongs to a metropolitan area if at least 10% of its labour force commutes to the rest of the metropolitan area. In Arcaute et al. (2015) cities are defined over the whole range of commuting thresholds in order to test the robustness of scaling exponents to city definitions. In a similar vein, in this work we do not impose any threshold on commuter flows. Instead, we let the model reveal the commuting time that allows firms to make the most, in terms of formal employment generation, of the skills available in municipalities other than their own. We compare our results to the OECD and Duranton city definitions for Colombia mentioned above (see Appendix).

The choice of thresholds to define metropolitan areas, or urban agglomerations in general (as in Uchida et al. (2010)), should be based on the extent to which such agglomerations benefit from externalities. More specifically, it should take into account the benefits accrued to firms due to access to a large and diverse pool of skills in cities. This view is related to the importance of “labour pooling” in cities, which refers to the sharing of workers and skills by firms in similar industries (Marshall and Marshall 1920). Previous studies by Barrios et al. (2012) and Feser and Sweeney (2002) have examined the spatial extent of labour pooling externalities for firms in Ireland and the US, finding evidence of spillovers at a distance of 50 kilometres and 50 miles respectively (these studies do not provide travelling times).

Our findings may also contribute to the literature that aims at understanding why metropolitan areas that have more fragmented governance structures are less productive than those with unified or better-integrated governance structures (Ahrend et al. 2014).

Finally, our paper is also related to recent literature that aims to assess how urban structure affects productivity. Recent evidence for Latin American cities shows that urban form matters for city productivity; characteristics such as roundness and smoothness seem to facilitate the interaction of firms and workers (Ferreyra and Roberts 2018). The relevant link with our work is that trip patterns within a large city or metropolitan area depend on several aspects of the urban structure, such as density, road patterns, contiguity and whether the city is monocentric or polycentric, among many others (Bertaud 2004). Consequently, labour productivity is also influenced by urban structure, and particularly by the degree of mono- versus polycentricity of the metropolitan area (Meijers and Burger 2010). In this paper, we take into account only travelling times between the centroids of the municipalities that may be part of a metropolitan area. Our approach implicitly assumes that municipalities conform to the monocentric urban pattern, and metropolitan areas to the polycentric urban pattern. Since we use actual travelling times across the centroids, no further assumptions are made with respect to other urban patterns.

Formal employment and industry complexity in cities

In this section we introduce the methodology behind the measurement of “cities”, “formal employment” and “industrial complexity”, and investigate their relationships. These concepts will be needed in “A network-based model for formal employment growth” below to construct the variable named “complexity potential”, which will be the main explanatory variable of the growth of formal employment used to extract optimal city definitions in “Model and regression results”.

Constructing cities

We do not assume a priori a definition for cities in terms of their constituent municipalities. Similarly, we do not predefine the number of cities. Instead, we develop an algorithm to iteratively aggregate urban municipalities into cities based on commuting times. In this way, we produce “city constructs”.

Colombia has 96 municipalities with an urban population larger than 50,000 inhabitants according to DANE (the National Statistics Office); we construct cities as combinations of one or more of these 96 urban municipalities, using commuting time thresholds from zero to 200 minutes. The car commuting time between the centroids of two municipalities is available from the Google Maps API (Footnote 1). Figure 1 shows the geographical location of these 96 urban municipalities, connected by car commuting times (derived from Google Maps).

Fig. 1
Municipality commuting network, formed by the location of 96 urban municipalities in Colombia. The size of each disc represents the population of each municipality. Pairs of municipalities are connected by an edge shaded by commuting time

We first create an undirected, fully-connected network of urban municipalities with edge weights corresponding to commuting times (the municipality commuting network) as illustrated in Fig. 1. We then create a new network for each value of our commute time threshold parameter τ. That is, we remove edges from the fully connected network if the commuting time is larger than τ. For each value of τ, the “cities” are found by detecting the connected components of the corresponding thresholded network.
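The thresholding step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_cities`, the municipality labels and the commuting times are hypothetical, and connected components are found with a simple union-find structure.

```python
from itertools import combinations


def build_cities(municipalities, commute_minutes, tau):
    """Return the connected components of the network that keeps only
    edges whose commuting time is at most tau minutes."""
    parent = {m: m for m in municipalities}

    def find(m):
        # Walk up to the component representative, shortening the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in combinations(municipalities, 2):
        # Edges with a commuting time larger than tau are removed.
        if commute_minutes[frozenset((a, b))] <= tau:
            parent[find(a)] = find(b)

    cities = {}
    for m in municipalities:
        cities.setdefault(find(m), set()).add(m)
    return sorted(sorted(c) for c in cities.values())


# Hypothetical commuting times in minutes (illustrative values only).
municipalities = ["A", "B", "C", "D"]
commute_minutes = {
    frozenset(("A", "B")): 20,
    frozenset(("B", "C")): 40,
    frozenset(("A", "C")): 70,
    frozenset(("A", "D")): 180,
    frozenset(("B", "D")): 170,
    frozenset(("C", "D")): 150,
}

cities_by_tau = {
    tau: build_cities(municipalities, commute_minutes, tau)
    for tau in (0, 30, 60, 200)
}
```

In this toy network, A and C end up in the same city at τ = 60 even though their direct commuting time is 70 minutes, because both are within the threshold of B. This is a property of connected components rather than of any particular data set.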

Figure 2 shows the connected components of the thresholded network for τ = 30, 60, 90 and 120 min respectively. If τ=0, the number of cities is, by definition, the number of urban municipalities. As τ increases, the number of cities decreases as some municipalities become part of a multi-municipality city. Note that, by definition, we do not consider any municipality as the core to which other municipalities are added based on commuting times or flows (as do other methods).

Fig. 2
The four maps show the connected components of the municipality commuting network when considering a threshold τ = 30, 60, 90 and 120 minutes respectively, which represent the commuting time between connected municipalities. As τ increases, and municipalities are connected within new city definitions, the number of cities decreases

Certain intermediate municipalities are crucial in the merging process as τ increases. For example, Barranquilla and Cartagena (two large cities in the North of Colombia) merge as a result of two earlier merges between Sabanalarga and both Barranquilla and Cartagena. This means that Sabanalarga, located between Barranquilla and Cartagena, reduces the value of τ for which Barranquilla and Cartagena are merged. A similar situation occurs for the city of Pamplona, located between Bucaramanga and Cúcuta. Figure 3 illustrates this process.

Fig. 3
Dendrogram showing how municipalities are successively merged into cities based on the commuting time between their centroids. For instance, if the threshold τ=50 minutes, then Bogotá and Soacha are considered to be part of the same city

As τ increases, and municipalities are merged, cities increase their working-age population (i.e., the pool of potential workers). For example, in the case of Medellín and its surrounding municipalities (Fig. 4), the working-age population increases from 1.7 million at τ=0 (single municipality) to nearly 2.6 million by τ=49 minutes by which time eight urban municipalities are merged.

Fig. 4
In the case of Medellín and its surrounding municipalities, the working-age population increases from 1.7 million at τ=0 (single municipality) to nearly 2.6 million by τ=49 minutes by which time eight urban municipalities are merged

As mentioned in the “Related literature” section, other aggregations of municipalities in Colombia have been constructed by Duranton (2015b) and by Sanchez-Serra (2016), the latter applying the OECD methodology (OECD 2013). Using the Rand measure, which is a metric of similarity between two data clusterings (Rand 1971), it is possible to compare those aggregations with the ones produced by us using different values of τ. All the different approaches produce very similar aggregations for the 96 municipalities in our database. With values of τ between 25 and 46 minutes we have the highest agreement with the definitions used in those previous works. See the Appendix for more details.
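As an illustration of how two aggregations of municipalities can be compared, the following sketch computes the standard Rand (1971) index, i.e. the share of municipality pairs on which two clusterings agree (both place the pair in the same city, or both place it in different cities). The function name and the city assignments are hypothetical; the exact computation used in the paper is described in its Appendix.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of item pairs treated the same way by both clusterings."""
    items = sorted(labels_a)
    agree = 0
    total = 0
    for x, y in combinations(items, 2):
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / total


# Hypothetical city assignments for five municipalities.
ours = {"m1": 0, "m2": 0, "m3": 1, "m4": 1, "m5": 2}
other = {"m1": 0, "m2": 0, "m3": 0, "m4": 1, "m5": 2}

score = rand_index(ours, other)
```

A value of 1 means the two aggregations are identical; repeating the calculation over a grid of τ values would show where agreement with an external definition peaks.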

Formal employment

We define formal employment as the number of workers hired by firms throughout the year in compliance with Colombian labour legislation on minimum wages and social security contributions. We use information on formal employment by municipality and industry (330 industries including both manufacturing and services) derived from administrative records of the Colombian Ministry of Health and Social Protection (PILA).

Formally, \({femp}_{c,i}\) is the formal employment of industry i in city c. Our variable of interest will be the rate of formal employment by city, defined as the ratio of formal employment to the size of the working-age population of the city. Formally, if \({wpop}_{c}\) is the working-age population of city c, then the rate of formal employment is \({FOR}_{c}=\sum _{i} {femp}_{c,i}/ {wpop}_{c}\).
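The definition of the formality rate can be made concrete with a short sketch. The figures below are invented for illustration, and pooling a multi-municipality city by summing formal employment and working-age population over its members is an assumption made here, not a procedure stated in the text.

```python
def formality_rate(femp_by_industry, wpop):
    """FOR_c: total formal employment over working-age population."""
    return sum(femp_by_industry.values()) / wpop


def merge_city(members, femp, wpop):
    """Pool member municipalities into one city (assumed additive)."""
    pooled = {}
    for m in members:
        for industry, jobs in femp[m].items():
            pooled[industry] = pooled.get(industry, 0) + jobs
    return pooled, sum(wpop[m] for m in members)


# Hypothetical formal employment by industry and working-age population.
femp = {
    "mun_x": {"textiles": 3000, "banking": 1000},
    "mun_y": {"textiles": 500, "mining": 250},
}
wpop = {"mun_x": 20000, "mun_y": 5000}

rate_x = formality_rate(femp["mun_x"], wpop["mun_x"])
city_femp, city_wpop = merge_city(["mun_x", "mun_y"], femp, wpop)
rate_city = formality_rate(city_femp, city_wpop)
```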

The left panel of Fig. 5 shows that the rate of formal employment increases with city size as measured by the working-age population (shown here for τ=0). This observation is consistent with the idea that large cities, host to a wide array of diverse skills, are able to foster the sophisticated economic activities that create formal employment. This phenomenon has been previously documented for Colombian, Mexican, Brazilian and US cities (O’Clery et al. 2016). A notable outlier is the municipality of Yopal, located 200 kilometres northeast of Bogotá and with a working-age population of roughly 90,000, whose formality rate is remarkably high (the third largest in the country). Two large industries in Yopal are the extraction of crude petroleum and natural gas and the service activities incidental to oil and gas extraction. The oil industries, together with the building of civil engineering works industry, represent nearly 23% of the formal employees of Yopal.

Fig. 5
The left panel shows the relationship between formality rate and working-age population of the city (log scale). The Pearson’s correlation coefficient between formality rate and (log) working-age population is ρ=0.57. The right panel shows the relationship between city complexity and working-age population (log scale). The Pearson’s correlation coefficient between city complexity and (log) working-age population is ρ=0.715. In both cases, the city definition corresponds to τ=0, i.e., when no municipalities are merged

Industrial complexity

Building on the idea that high levels of formal employment result from the presence of sophisticated economic activities, we would expect that large cities are host to complex industries encompassing both the production of goods and services. Here we adopt the economic complexity framework of Hausmann et al. (2007a) in order to quantify the industrial complexity of cities.

Industry complexity is a measure of the diversity of capabilities needed by an industry. The output of highly complex industries is produced by teams of workers with diverse skills pooling their collective and complementary know-how. Capabilities or skills, however, are multidimensional variables that are not directly observable (and should not be confused with years or type of education). What is observable is the outcome of localised capabilities, namely the diversity and uniqueness of the goods and services produced.

Originally proposed by Hausmann et al. (2007a), the product complexity index (PCI) is constructed based on patterns of product exports across countries. Industry complexity is an adaptation of this methodology, developed as part of a project to build an online data-explorer (Footnote 2) for Colombia at the Harvard Center for International Development, and is based on patterns of industry employment across cities. The general idea of this approach is that complex goods (or services) are both rare, and produced in places where many other things are produced. Hence, the computation of industry complexity is based on an iterative model which weighs industry diversity by city ubiquity and so on. See the Appendix for the details of the data and algorithm used.

Similarly, city complexity is a measure of the range of capabilities available in a city, and is computed as the mean of the industry complexity of the industries present in the city. As anticipated, we find that large cities tend to be more complex (right panel of Fig. 5), meaning that they have a larger and more diverse skill base necessary for the formation of complementary teams. Some outliers are observed such as Caldas (Antioquia), a small city which has higher city complexity relative to other cities of similar size. This is due to the presence of complex industries such as the fabricated metal products manufacturing industry and other relatively complex industries. Barranquilla also has high city complexity due to the presence of complex industries such as the manufacture of pharmaceuticals.
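The averaging step in the definition of city complexity can be sketched as follows. The complexity values and industry names are hypothetical, and the set of industries "present" in a city is taken as given, since the presence criterion is not specified in this section.

```python
def city_complexity(industry_complexity, present):
    """Mean industry complexity over the industries present in a city."""
    values = [industry_complexity[i] for i in present]
    return sum(values) / len(values)


# Hypothetical industry complexity scores.
industry_complexity = {
    "pharmaceuticals": 1.5,
    "retail": -0.5,
    "metal_products": 0.5,
    "agriculture": -1.0,
}

# Industries assumed to be present in an illustrative city.
present_in_city = ["pharmaceuticals", "retail", "metal_products"]

complexity = city_complexity(industry_complexity, present_in_city)
```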

A network-based model for formal employment growth

In the previous section we introduced the theory and metrics behind the measurement of “cities”, “formal employment” and “industry complexity”. We showed that large cities generate more formal employment and have higher levels of industry complexity. Our task now is to explain the mechanism that produces such relationships. For that purpose, here we review and extend the model developed by O’Clery et al. (2016) to measure the “skill-distance” between a city’s current industrial base and new complex industries needed to increase its formality rate. This approach relies on a methodology proposed by Neffke and Henning (2013) to estimate the similarity between industry pairs in terms of skills or capabilities.

Labour flow industry network

An industry is considered skill-related to another industry if the number of job switches (i.e., worker flows) between these industries is larger than what would be expected from randomising all switches among all pairs (Neffke and Henning 2013).

Formally, if \(\phi_{i,j}\) is the number of job switches between industry i and industry j (during a given time period), then the “skill-relatedness” can be computed as a matrix with entries:

$$ S_{i,j}=\frac{\phi_{i,j}/\sum_{j}{\phi_{i,j}}}{\sum_{i}{\phi_{i,j}}/\sum_{i,j}{\phi_{i,j}}}. $$
(1)

Because the skill-relatedness distinguishes the flow of workers from industry i to industry j from the flow from j to i, the matrix \(S_{i,j}\) is made symmetric by averaging it with its transpose, and the values are re-scaled so that they range from -1 to 1:

$$ A_{i,j}=\frac{S_{i,j}+S_{j,i}-2}{S_{i,j}+S_{j,i}+2}. $$
(2)

We can consider \(A_{i,j}\) as the adjacency matrix of an undirected weighted network where the nodes are the set of industries and the weight of the edge between nodes i and j, given by the value of \(A_{i,j}\), represents how far the labour flow between those two industries is from the random expectation. Note that only positive values of this matrix (that is, more job switches than expected) are preserved (Footnote 3). Full detail on these methodological considerations can be found in Neffke and Henning (2013).
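Equations 1 and 2 can be traced through a small numerical sketch. The flow matrix below is invented, the function names are hypothetical, and the code assumes every industry has at least one outgoing and one incoming switch so that no denominator is zero.

```python
def skill_relatedness(phi):
    """Eq. 1: share of i's switches going to j, relative to j's share
    of all switches."""
    n = len(phi)
    row = [sum(phi[i]) for i in range(n)]
    col = [sum(phi[i][j] for i in range(n)) for j in range(n)]
    total = sum(row)
    return [
        [(phi[i][j] / row[i]) / (col[j] / total) for j in range(n)]
        for i in range(n)
    ]


def symmetrise(S):
    """Eq. 2: average S with its transpose and map the result to [-1, 1]."""
    n = len(S)
    return [
        [(S[i][j] + S[j][i] - 2) / (S[i][j] + S[j][i] + 2) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical job switches between three industries (row i to column j).
phi = [
    [0, 8, 2],
    [6, 0, 4],
    [2, 8, 0],
]

S = skill_relatedness(phi)
A = symmetrise(S)

# Keep only pairs with more switches than expected (positive entries).
n = len(A)
edges = {(i, j) for i in range(n) for j in range(i + 1, n) if A[i][j] > 0}
```

Here industries 0 and 2 exchange fewer workers than the random benchmark, so their entry in A is negative and no edge is retained between them.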

The industry network is visualised in Fig. 6, where industries (the nodes of the network) which belong to the same sector (official industry classification) have the same colour, and pairs of industries which have more switches than expected (\(A_{i,j}>0\)) are connected with an edge (so that the edges of the network represent pairs of industries with a high number of job switches or high skill-relatedness). The size of the node is proportional to its industry complexity. We observe natural clustering of industries based on shared skills or required know-how, in many cases along sectoral lines. For example, on the left-hand side, it is apparent that both social services (purple) and financial services (red) tend to share workers, while manufacturing industries (blue) exhibit a number of distinct clusters.

Fig. 6
Visualisation of the skill-relatedness network for Colombia, where nodes correspond to industries and edges correspond to positive values of the adjacency matrix given in Eq. 2. The node size is proportional to industry complexity, and colours correspond to the sector groups given in the legend

We can quantify the observed clustering of industries within sectors by comparing edge connectivity within sectors relative to what would be expected if edges were distributed randomly. The positive entries of the adjacency matrix A are shown in Fig. 7. Nodes are ordered by sector, and sectors are delineated by vertical and horizontal lines. The diagonal blocks correspond to within-sector edges. We observe clear within-sector clustering (high density of edges) in many cases, e.g., social services, transport and communications. Our measure of the density of edges between (distinct) sectors G and G′ is given by:

$$ \mu(G, G') =\frac{e_{G,G'}}{e}\frac{n(n-1)}{n_{G} n_{G'}}, $$
(3)

where n and nG are the total number of industries and the number of industries in sector G respectively, and e and eG,G′ are the total number of edges and the number of edges between nodes in sectors G and G′ respectively. In essence, this computes the ratio of actual edges to possible edges within the subgraph induced by the industries in sectors G and G′, divided by the ratio of total edges to all possible edges (i.e., a complete graph). A value of μ(G,G′)>1 means that there are more switches between sectors G and G′ than expected.

Fig. 7

The positive entries of the adjacency matrix A, where a positive value indicates more job switches between two industries than the random expectation. The metric μ(G) reveals, for each sector G, how many times more likely it is to observe job switches within the sector relative to overall connectivity. For instance, μ(Social Services)=7.9 means that workers are nearly 8 times more likely to switch between different social service industries than if edges were allocated at random. For off-diagonal blocks, the shading corresponds to μ(G,G′) for sector pairs G and G′. We observe, for example, that switches between construction and mining and oil industries are denser than average connectivity in the network

For a single sector G, this expression is slightly modified:

$$ \mu(G) =\frac{e_{G,G}}{e}\frac{n(n-1)}{n_{G} (n_{G}-1)}. $$
(4)

As before, a value of μ(G)=μ(G,G)>1 indicates that switches within sector G are more frequent than expected.
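The two density measures in Eqs. 3 and 4 can be sketched in a few lines of Python. The example assumes the network is stored as a list of undirected edges, each listed once, together with a mapping from industry to sector; the industries, sectors and edges are hypothetical.

```python
def sector_density(edges, sector, g, h):
    """mu(G, G') as written in Eq. 3 when g != h, and mu(G) as in Eq. 4 when g == h."""
    n = len(sector)   # total number of industries
    e = len(edges)    # total number of edges
    n_g = sum(1 for s in sector.values() if s == g)
    n_h = sum(1 for s in sector.values() if s == h)
    # Count edges whose endpoints lie in the two sectors (or both in g if g == h).
    e_gh = sum(1 for u, v in edges if {sector[u], sector[v]} == {g, h})
    pairs = n_g * (n_g - 1) if g == h else n_g * n_h
    return (e_gh / e) * (n * (n - 1) / pairs)


# Hypothetical network: three industries in sector X, two in sector Y.
sector = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y"}
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("c", "d")]

mu_x = sector_density(edges, sector, "X", "X")
mu_y = sector_density(edges, sector, "Y", "Y")
mu_xy = sector_density(edges, sector, "X", "Y")
```

Here both sectors are internally dense (values above 1), while the single edge linking X and Y gives a between-sector value below 1.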

We find that μ(Construction)=39.6, and hence workers are significantly more likely to switch between different construction industries than other industries. On the other hand, μ(Manufacturing)=0.8 and hence manufacturing industries do not form a single tightly connected cluster, but instead form connections with other industries that share specific skills and competencies.

Considering off-diagonal blocks, we observe significant connections between sectors. For example, switches between construction and mining and oil industries are denser than average connectivity in the network. It is clear that, while some of the structure of the network follows sector groupings of industries, much relatedness between industries is not captured by official sector classifications.

Local industrial structure

We can think of the industry network as an ‘economic landscape’ that describes the possible diversification paths open to a city. Specifically, the location of the set of industries present in a city constrains its future development: cities tend to move into new economic activities that are proximate (in a skill - and consequently network - sense) to those already present (Nelson and Winter 1982; Hausmann et al. 2007a; Frenken and Boschma 2007; Neffke et al. 2011). Before we explore growth paths, we need to formally define “industry presence” with respect to changing city definitions.

For a given city definition, we will use a commonly deployed measure which captures the relative importance of an industry in the city given its overall distribution across the country. The location quotient LQc,i, also known as revealed comparative advantage, is defined as:

$$ {LQ}_{c,i} = \frac{femp_{c,i}/{wpop}_{c}}{\sum_{c} {femp}_{c,i}/ \sum_{c} {wpop}_{c}}. $$
(5)

An industry i is “present” in city c if the corresponding location quotient LQc,i>1, which means that a larger share of the population of city c works in industry i than at the national level (more in the Appendix).
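A minimal Python sketch of Eq. 5 and the presence rule follows. The two cities, their working-age populations and their employment figures are invented for illustration, and the function name is an assumption.

```python
def location_quotients(femp, wpop):
    """Eq. 5: city share of working-age population employed in i over the national share."""
    total_pop = sum(wpop.values())
    industries = {i for c in femp for i in femp[c]}
    national = {i: sum(femp[c].get(i, 0) for c in femp) for i in industries}
    return {
        c: {
            i: (femp[c].get(i, 0) / wpop[c]) / (national[i] / total_pop)
            for i in industries
        }
        for c in femp
    }


# Hypothetical data: formal employment by city and industry, and working-age population.
wpop = {"city_a": 1000, "city_b": 3000}
femp = {
    "city_a": {"glass": 50, "retail": 100},
    "city_b": {"glass": 30, "retail": 500},
}

lq = location_quotients(femp, wpop)
# An industry is "present" when its location quotient exceeds 1.
present = {c: {i for i, v in lq[c].items() if v > 1} for c in lq}
```

In this example glass is present only in city_a and retail only in city_b, even though city_b employs more retail workers in absolute terms than city_a employs in any industry.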

As a case study, we will explore the industrial structure of Medellín, and its surrounding municipalities. Medellín is the second largest city in Colombia and it is located in the Aburrá Valley. Several municipalities can be considered part of Medellín, including Envigado, Itagüí, La Estrella and Bello, which are connected to the core centre of the city with the Medellín Metro system. A clear definition of the limits of Medellín, however, is not obvious. For example, Rionegro is a municipality located to the east of Medellín. Although there is a considerable distance between Rionegro and Medellín, Rionegro is host to the José María Córdova International Airport, the second busiest airport of Colombia, which functions as an aerial hub for Medellín.

The set of industries present in Medellín is shown in Fig. 8. This set changes as the definition of the city changes (as we increase the commuting time threshold τ). Municipalities such as Copacabana, La Estrella, Envigado, Bello and Rionegro become part of the city and, as a result, some new industries develop location quotients higher than 1, thus becoming “present”.

Fig. 8

The top panel highlights the network location of industries present in Medellín. The value of τ at which an industry is “added” is shown via the width of the node rim. By time τ=49 minutes, 30 industries are considered to be present which were not at τ=0. The panel below shows the complexity potential of Medellín and its surrounding municipalities as the commuting time τ increases. We observe that, for example, when τ=49 minutes, Rionegro is considered to be part of Medellín and so they merge into a single (and higher complexity potential) city

Several examples are worth mentioning. The glass industry becomes present when Envigado joins Medellín. The firm in question is Peldar, which employs around 200 people in Envigado (and 1,200 nationwide) and is owned by one of the largest business groups in Colombia. Peldar manufactures crystal products and glass containers that are highly regarded in Colombia. The vehicle industry is also added to the industrial portfolio of the city as a result of Envigado, which hosts the Colombian subsidiary of Renault, the only big exporter of cars from Colombia. Renault has 1,100 workers, making it the second largest employer in Envigado and by far the largest exporter: US$317m in 2017 (or nearly 85% of the municipality’s exports). The electric motors and transformers industry is added when the small municipality of Copacabana is included in the city. This is due to one firm (Rymel) that produces electrical equipment and installations. Not all of the high complexity industries that are added produce manufactured goods. When the commuting radius is extended to τ=49 minutes, Rionegro becomes part of the city and “activities of airports” are added to the industry mix. Although direct employment generated by the Rionegro airport is small (180 workers), it makes the Medellín area a hub for air travel and, more significantly, gives the city an important advantage for exports of high-value products.

Complexity potential

As Fig. 8 suggests, clusters of skill-related industries tend to be present in a city due to the fact that industries that use similar skills tend to co-locate, and new industries emerge that share similar skills to existing competencies. The question arises: can we capture, in a measure, the likelihood that a city moves into new industries more complex than those already present?

The complexity potential of a city, introduced by O’Clery et al. (2016), is a measure of the possibilities for the city to move to more complex industries that are not yet present in the city, taking into account the existing skills of the local labour force. To compute the complexity potential in the initial period (2008), we need to estimate the distance between city c and each missing industry i. This distance weighting factor dc,i, also known as “density” in the literature (Hausmann et al. 2007b; Hausmann et al. 2007a; Hausmann et al. 2014a), is defined as

$$ d_{c,i}=\frac{\sum_{j \in N_{c}} A_{i,j}}{\sum_{j} A_{i,j}}, $$
(6)

where Nc is the set of industries that is present in city c. Effectively, it is the ratio of edge weights connecting i to industries present in city c to the total edge weight (for edges connected to node i). Computed below for the set of currently missing industries, this quantity can be thought of as the likelihood of an industry appearance based on current presences in industries with similar skills. The density of a city-industry pair varies with commuting threshold τ through variation in the set of present industries in a city, Nc.
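To make Eq. 6 concrete, the sketch below computes the density of one missing industry for two alternative sets of present industries, mimicking what happens when a wider commuting threshold brings additional industries into the city. The adjacency weights and the treatment of isolated industries are assumptions of the example.

```python
def density(A, present, i):
    """Eq. 6: share of industry i's total edge weight that links to present industries."""
    total = sum(A[i])
    if total == 0:
        return 0.0  # assumption: an industry with no edges gets zero density
    return sum(A[i][j] for j in present) / total


# Hypothetical symmetric adjacency matrix (positive entries only) for four industries.
A = [
    [0.0, 0.5, 0.0, 0.2],
    [0.5, 0.0, 0.3, 0.0],
    [0.0, 0.3, 0.0, 0.6],
    [0.2, 0.0, 0.6, 0.0],
]

# Industry 3 is missing. With industries 0 and 1 present, only the weak link counts.
d_narrow = density(A, {0, 1}, 3)
# If a wider city definition also makes industry 2 present, the density rises.
d_wide = density(A, {0, 1, 2}, 3)
```

The adjacency matrix is unchanged between the two calls; only the set of present industries differs, which is the channel through which τ affects the density.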

The complexity potential for a city is the weighted mean of the complexity of its missing industries, where the weight corresponds to the density above:

$$ {CP}_{c} = \frac{1}{|M_{c}|} \sum_{i\in M_{c}}d_{c,i} C_{i}, $$
(7)

where Mc denotes the set of “missing” industries for city c (those with LQc,i<1), and Ci∈[0,1] is the normalised complexity of industry i. The complexity potential varies with the commuting threshold τ through variation in the density (and not in the complexity, which we fix).
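Eq. 7 can be sketched as follows, taking the densities and complexities of the missing industries as given. The industry labels and all numerical values are hypothetical, and returning zero for a city with no missing industries is an assumption of the example.

```python
def complexity_potential(density_by_industry, complexity, missing):
    """Eq. 7: mean over missing industries of density times normalised complexity."""
    if not missing:
        return 0.0  # assumption: no missing industries means no remaining potential
    weighted = sum(density_by_industry[i] * complexity[i] for i in missing)
    return weighted / len(missing)


# Hypothetical normalised complexities of two missing industries.
complexity = {"industry_a": 0.8, "industry_b": 0.4}
missing = ["industry_a", "industry_b"]

# Densities under a narrow city definition and under a wider one.
cp_narrow = complexity_potential(
    {"industry_a": 0.5, "industry_b": 0.25}, complexity, missing
)
cp_wide = complexity_potential(
    {"industry_a": 0.75, "industry_b": 0.5}, complexity, missing
)
```

Because the complexities are held fixed, the increase from cp_narrow to cp_wide comes entirely from the higher densities.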

The complexity potential of a city varies when more municipalities are added and the set of industries considered present changes. Continuing with our Medellín case study, Fig. 8 shows that the complexity potential increases when first Bello and then Envigado, Itagüí and La Estrella are added (the three municipalities located in the South of Medellín). Finally, it increases again at τ=49 which is when Rionegro (which hosts the airport) is added. The fact that the complexity potential of Medellín grows as more municipalities are added implies that the industries that are added increase the density - or likelihood of an appearance - of complex industries (i.e., the added industries increase proximity to complex industries in the industry network).

In the next section, we will investigate the extent to which the skills embedded in surrounding municipalities, such as those observed in Medellín, are predictive of a city’s ability to grow its formal employment rate.

Model and regression results

As mentioned, complete labour mobility is a necessary condition for the optimal exploitation of agglomeration economies in a city or metropolitan area. For this, households should be able to reach “within a reasonable time” all the locations where jobs are offered. Our interest in this paper is to shed light on the “reasonable time” of commuting that allows firms to maximise their reach of skills in the surrounding municipalities. This should be the time that maximises the impact of complexity potential on formal employment growth. Within this time radius, one or more municipalities effectively operate as an integrated labour market and can, therefore, be considered as a single city.

This can be tested by regressing the change in formal employment rates on complexity potential for a range of city definitions based on commuting time. Formally:

$$ \Delta {FOR}_{c} = \beta_{0} + \beta_{1} \text{log} {CP}_{c} + \beta_{2} {FOR}_{c} + X_{c} $$
(8)

where ΔFORc is the change in the formal employment rate between 2008 and 2013, FORc is the formal employment rate in 2008, and the controls Xc include the working-age population in 2008, GDP per capita and others. More details on all variables may be found in the Appendix.
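The mechanics of estimating Eq. 8 can be illustrated with a small ordinary least squares routine. This is only a sketch under strong assumptions: the data are synthetic and generated without noise from known coefficients, the controls Xc are omitted, no standard errors are computed, and the solver (normal equations with Gauss-Jordan elimination) is chosen for self-containment rather than being the estimation software used for the reported results.

```python
import math


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    M = [
        [sum(row[a] * row[b] for row in X) for b in range(k)]
        + [sum(row[a] * yi for row, yi in zip(X, y))]
        for a in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(k):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]


# Synthetic cities: complexity potential and initial formal employment rate.
cp = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
for_2008 = [0.30, 0.10, 0.25, 0.15, 0.35, 0.20]
true_beta = [0.05, 0.04, -0.10]  # hypothetical beta_0, beta_1, beta_2

# Design matrix with a constant, log complexity potential and the initial rate.
X = [[1.0, math.log(c), f] for c, f in zip(cp, for_2008)]
delta_for = [sum(b * x for b, x in zip(true_beta, row)) for row in X]

beta = ols(X, delta_for)  # recovers true_beta up to rounding error
```

In the procedure described in the text, a regression of this form would be re-run for each commuting threshold τ, with the cities and their variables rebuilt each time, and the estimate of β1 tracked across thresholds.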

Figure 9 plots the coefficient of complexity potential, β1, against both commuting threshold τ (left) and the number of distinct cities obtained by varying τ (right). We observe that the coefficient increases and becomes more significant as τ increases, but only up to a certain limit. The coefficient is largest between 45 and 75 minutes, equivalent to between 62 and 43 cities. This region is shown via the two dark vertical lines. It then declines but remains significant up to about τ=90 minutes, which is equivalent to 35 individual cities. After this point, the confidence interval strays into negative values. We can interpret these results as suggesting that a commuting radius of between 45 and 75 minutes enables firms to source the diverse skills they need to move into complex industries, and increase the city-level formal employment rate. This implies that Colombia has between 43 and 62 cities that operate as an integrated labour market.

Fig. 9

On the left-hand panel, we show the coefficient of complexity potential, β1, allowing the set of cities to change as a function of commuting-time τ. The coefficient is shown in red, with the confidence intervals shown in blue. The right-hand panel considers how the coefficient varies with the number of distinct cities.

Tables showing the full regression results for both τ=0 and τ=45 are available in the Appendix. Our results are robust to a host of tests, such as the inclusion of various demand shocks and other control variables. We run a placebo test for robustness by randomising the agglomeration of municipalities into cities (thus ignoring geographic or commuting proximity). We find that, in this case, complexity potential has no explanatory power, further supporting our model and the robustness of our results.

Conclusions and policy implications

Political or bureaucratic definitions of municipalities are not adequate to describe today’s metropolitan sprawls as cities circulate workers across traditional boundaries. While previous attempts have been made to delineate Colombia’s cities based on commuting flows (Duranton 2015a; Sanchez-Serra 2016), none has used a criterion based on economic outcomes emerging from agglomeration economies to identify distinct cities as composites of groups of municipalities.

Building on an existing model, we argue that formal employment creation is mainly influenced by the availability of diverse and sophisticated skills including those which may lie in periphery municipalities. Building cities by aggregating urban municipalities within a given commuting time of each other, we show that a commuting radius of between 45 and 75 minutes, or 62 to 43 cities, is the optimum scale at which firms take advantage of the diversity of skills within the city boundary. Beyond a radius of about 90 minutes, the relationship between complexity potential and employment growth is no longer statistically significant, suggesting that firms cannot effectively make use of labour skills beyond that radius.

The most important policy implication of these results is that, in order for large cities to take advantage of the greater diversity of labour skills that comes with size, adequate transportation means are necessary to limit travelling times (across the constituent municipalities). There is no reason to discourage cities from expanding or absorbing neighbouring municipalities, provided commuting times are kept within reasonable limits (our study suggests up to 75 minutes). Similarly, isolated mid-sized or small cities may be able to generate more formal employment if adequate investments are made to connect them to large cities.

While the objective in transportation investments is usually reducing cargo transportation costs, particularly in developing or emerging economies, our results suggest that passenger transportation is probably more important. Since existing political borders across municipalities may make decisions difficult to coordinate, some external mechanism or institution could be created to encourage coordinated transportation and infrastructure investments alongside urban planning and industry re-localization programs.

Appendix

Appendix A: Formal employment and population variables

The working-age population wpopc of city c is defined as the population 15 or older in the city, and is provided by DANE (Colombian National Administrative Department of Statistics) by municipality and year.

The formal employment of industry i in a city c (fempc,i) is defined as employment covered by the health social security system and/or the pension system (the self-employed are not included). Formal employment by industry and city in 2008 and 2013 is computed as the number of formal employees in an average month, and is collected by the Colombian Ministry of Health (PILA).

The formal occupation rate of an industry i in a city c (FORc,i) is computed as the formal employment in the city-industry divided by wpopc:

$$ {FOR}_{c,i}=\frac{femp_{c,i}}{wpop_{c}}. $$
(9)

Appendix B: Industry presence and complexity variables

The location quotient, LQc,i (also known as revealed comparative advantage), reflects the relative importance of an industry in a city given its overall distribution,

$$ {LQ}_{c,i} = \frac{femp_{c,i}/{wpop}_{c}}{\sum_{c} {femp}_{c,i}/ \sum_{c} {wpop}_{c}}. $$
(10)

Notice that LQc,i is defined with respect to the working-age population, not with respect to the employment.

We say that an industry is present in city c if the corresponding location quotient LQc,i>1, which means that a larger share of the population of city c works in industry i than at the national level. An implication is that industries may become ‘present’ without other industries losing importance or presence.

The diversity of a city, kc,0, is the number of industries present in city c. For its computation, let M be a matrix with entry Mc,i=1 if industry i is present in city c (i.e., LQc,i>1) and zero otherwise. The diversity of city c, expressed as kc,0, is

$$ k_{c,0}=\sum_{i} M_{c,i}. $$
(11)

The ubiquity of industry i expressed as ki,0 is the number of cities in which industry i is present

$$ k_{i,0}=\sum_{c} M_{c,i}. $$
(12)

Industry complexity was originally proposed by Hausmann et al. (2007a) for export products and it is computed on the basis of “diversity”, which is the number of industries present in a city, and “ubiquity”, the number of cities where an industry is present. It is a measure of the range of capabilities needed by industry and it is computed iteratively as follows.

Firstly, the diversity of a city is weighted by the ubiquity of the industries present in it. Formally, let the average ubiquity of the industries present in city c be

$$ k_{c,1}=\frac{1}{k_{c,0}} \sum_{i} M_{c,i} k_{i,0}. $$
(13)

Similarly, the ubiquity of an industry is weighted by the diversity of the cities where it is present, so the average diversity of the cities hosting industry i is

$$ k_{i,1} = \frac{1}{k_{i,0}} \sum_{c} M_{c,i} k_{c,0}. $$
(14)

If this calculation is done iteratively (so that the average diversity of the industries present in city c is weighted by the average ubiquity of those industries and vice versa) to step n, the two previous expressions become:

$$ k_{c,n}=\frac{1}{k_{c,0}}\sum_{i} M_{c,i} k_{i,n-1}\text{, and } $$
(15)

$$ k_{i,n}=\frac{1}{k_{i,0}}\sum_{c} M_{c,i} k_{c,n-1}. $$
(16)

For industry i, the n-th step can be conveniently expressed in closed form as

$$ k_{i,n}=\sum_{j} \tilde{M}_{ij} k_{j,n-2}, $$
(17)

where the matrix \(\tilde {M}_{ij}\) is defined as:

$$ \tilde{M}_{ij}=\sum_{c} \frac{M_{c,i} M_{c,j}} {k_{c,0}k_{i,0}}. $$
(18)

Hence, if kn is a vector whose i-th element is ki,n then expression 17 becomes:

$$ \mathbf{k}_{n}=\tilde{M}\mathbf{k}_{n-2}. $$
(19)

By computing k0, values of kn can be easily computed for n≥2. A long-run steady-state value of kn occurs when further steps, and further weighting between the diversity of cities and the ubiquity of industries, do not change the results much. This is given by a vector kn such that kn≈kn−2. This can be found by computing the eigenvectors of the matrix \(\tilde {M}\). Notice, however, that the rows of \(\tilde {M}\) sum to one (which also means that a vector of all ones is an eigenvector of \(\tilde {M}\), associated with its largest eigenvalue) and so the eigenvector associated with the second largest eigenvalue of \(\tilde {M}\) is taken as the industry complexity (Ci). For more details on the calculation and interpretation of the industry complexity see Hausmann et al. (2014c) and Mealy et al. (2018).
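The construction in Eqs. 11 to 19 can be checked on a toy presence matrix. The sketch below builds the diversity, the ubiquity and the matrix \(\tilde {M}\) of Eq. 18, and applies one step of Eq. 19. The presence matrix is hypothetical, and the code assumes that every city hosts at least one industry and every industry is present in at least one city. Extracting the eigenvector for the second largest eigenvalue is left to a linear algebra library and is not shown.

```python
def reflections_matrix(M):
    """Return diversity k_c0, ubiquity k_i0 and the matrix M_tilde of Eq. 18."""
    n_c, n_i = len(M), len(M[0])
    k_c0 = [sum(M[c]) for c in range(n_c)]                             # Eq. 11
    k_i0 = [sum(M[c][i] for c in range(n_c)) for i in range(n_i)]      # Eq. 12
    M_tilde = [
        [
            sum(M[c][i] * M[c][j] / (k_c0[c] * k_i0[i]) for c in range(n_c))
            for j in range(n_i)
        ]
        for i in range(n_i)
    ]
    return k_c0, k_i0, M_tilde


# Hypothetical presence matrix: rows are cities, columns are industries.
M = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
k_c0, k_i0, M_tilde = reflections_matrix(M)

# Each row of M_tilde sums to one, so the all-ones vector is an eigenvector.
row_sums = [sum(row) for row in M_tilde]

# One application of Eq. 19: k_2 = M_tilde k_0.
k_i2 = [sum(M_tilde[i][j] * k_i0[j] for j in range(len(k_i0))) for i in range(len(k_i0))]
```

For this matrix k_i2 agrees with the result of applying Eqs. 15 and 16 in sequence, which is the equivalence that Eq. 17 expresses.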

City complexity is a measure of the range of skills or capabilities available in a city. It can be computed jointly with industry complexity or, equivalently, computed as the average of the industry complexity of the industries present in the city.

Appendix C: Control variables

GDP per capita (GDPpc) is available from 2011 onwards and is calculated from GDP by municipality as estimated by DANE.

Oil-producing city is a binary variable Oc, which takes the value Oc=1 if the city has more than one oil well in production per thousand inhabitants. Oil well data refer to 2014, as reported by Ecopetrol (the Colombian hydrocarbon company) from its own internal records.

Government spending shock is the change between 2008 and 2013 in total government spending (in 2008 prices) per working-age person. It is computed from municipality-level government spending data compiled by CEDE (Footnote 4).

Sectoral demand shocks is a Bartik-style measure (Bartik 1991) that quantifies, for each city, the mix of nationwide sectoral demand shocks facing the city. It is computed as

$$ {sds}_{c} = {FOR}_{c}\sum_{i} \frac{femp_{c,i}(2008)}{femp_{c}(2008)} g_{i,c} $$
(20)

where gi,c= log[fempi(2013)]− log[fempi(2008)] is the change in (log) employment of industry i, excluding employment in industry i in city c. In other words, here \({femp}_{i}=\sum _{j \in J} {femp}_{i,j}\) with the set J containing all cities except city c. The measure can be interpreted as the expected change in the formal occupation rate of the city given the nationwide sectoral demand shocks (exogenous to the city).
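A leave-one-out computation of Eq. 20 might look like the sketch below. It assumes that FORc is the city's overall formal occupation rate in 2008 (total formal employment over working-age population) and that industries with no employment outside the city are skipped; both are assumptions of the example, as are the data.

```python
import math


def sectoral_demand_shock(femp08, femp13, wpop, c):
    """Eq. 20: initial rate times the share-weighted log growth of each industry elsewhere."""
    femp_c = sum(femp08[c].values())
    for_c = femp_c / wpop[c]  # assumption: overall formal occupation rate in 2008
    shock = 0.0
    for i, emp in femp08[c].items():
        # Employment in industry i in all cities except c (leave-one-out).
        rest08 = sum(femp08[k].get(i, 0) for k in femp08 if k != c)
        rest13 = sum(femp13[k].get(i, 0) for k in femp13 if k != c)
        if rest08 > 0 and rest13 > 0:  # assumption: skip industries absent elsewhere
            g = math.log(rest13) - math.log(rest08)
            shock += (emp / femp_c) * g
    return for_c * shock


# Hypothetical data: industry x doubles outside city_a, industry y is flat.
wpop = {"city_a": 1000, "city_b": 2000, "city_c": 2000}
femp08 = {
    "city_a": {"x": 100, "y": 100},
    "city_b": {"x": 200, "y": 50},
    "city_c": {"x": 200, "y": 50},
}
femp13 = {
    "city_a": {"x": 120, "y": 110},
    "city_b": {"x": 400, "y": 50},
    "city_c": {"x": 400, "y": 50},
}

sds_a = sectoral_demand_shock(femp08, femp13, wpop, "city_a")
```

City_a's own employment growth between the two years does not enter its shock, which is what makes the measure exogenous to the city.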

Fig. 10

Here we use the Rand Index to compare our method for defining cities (which varies with the commuting time τ) with the metropolitan areas constructed by Duranton and with those constructed by the OECD (three versions: red corresponds to non-aggregated municipalities, blue and yellow correspond to aggregations based on different commuting flow thresholds). We find best agreement with the Duranton method for all values of τ.

Fig. 11

The relationship between the growth of the formal occupation rate and complexity potential, a measure of the availability of skills to develop more complex industries, for τ=0 (the number of cities is 96).

Appendix D: Algorithm

Table 1 This set of regressions explores the determinants of the change in the formal occupation rate between 2008 and 2013 for a set of 96 single-municipality cities

The algorithm to merge municipalities into cities, and to compute their corresponding variables for a given commuting time τ, proceeds as follows.

For each commuting time τ=1,…,200,

  • We construct a matrix Gτ of size 96×96, and set \(G^{\tau }_{i,j}=1\) if the commuting time between municipality i and j is less than time τ.

  • We then detect the connected components in the network given by adjacency matrix Gτ. Each component corresponds to a set of municipalities (or a single municipality) from the initial set of 96 - these components form the cities.

  • We aggregate employment by city and industry (according to the cities identified above). We then compute the location quotients and complexity potential for each city, as well as the resulting formal employment rates in 2008 and 2013 and their changes between the two periods. We also construct a range of other variables according to our city definition, namely working-age population, GDP per capita, a binary variable for at least one oil well per 1,000 people, the government spending shocks, and the Bartik-style sectoral demand shocks.

  • Finally, we run the regression in Eq. 8. Note: the industry complexity Ci and the matrix of industry proximities Ai,j are fixed, and only the set of present and missing industries (Nc and Mc) vary for each τ/set of cities.
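The first two steps above, thresholding the commuting-time matrix and detecting connected components, can be sketched with a small union-find routine. The four municipalities and their commuting times are hypothetical, the matrix is assumed to be symmetric, and union-find is one of several equivalent ways to find connected components.

```python
def merge_municipalities(times, tau):
    """Group municipalities linked by chains of commuting times below tau minutes."""
    n = len(times)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if times[i][j] < tau:  # edge in G^tau
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical commuting times in minutes between four municipalities.
times = [
    [0, 20, 70, 120],
    [20, 0, 40, 120],
    [70, 40, 0, 120],
    [120, 120, 120, 0],
]

cities_30 = merge_municipalities(times, 30)
cities_45 = merge_municipalities(times, 45)
```

At τ=45 municipalities 0 and 2 end up in the same city even though they are 70 minutes apart, because both are within 45 minutes of municipality 1. This chaining is a direct consequence of defining cities as connected components.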

Appendix E: Comparing between two aggregations

We use the Rand measure RM (Rand 1971) to compare two aggregations of the 96 municipalities in our database. The Rand measure divides the number of “agreements” between the two partitions (pairs of municipalities which are merged together in both partitions, plus pairs of municipalities which are kept apart in both) by the total number of pairs, so that the Rand measure takes values between 0 and 1, where higher values are interpreted as more similar partitions.
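As a small illustration of the Rand measure, the sketch below compares two partitions of four municipalities, each given as a mapping from municipality to city label. The municipality names and labels are hypothetical.

```python
from itertools import combinations


def rand_measure(p, q):
    """Share of municipality pairs on which partitions p and q agree."""
    agree = 0
    total = 0
    for a, b in combinations(sorted(p), 2):
        same_in_p = p[a] == p[b]
        same_in_q = q[a] == q[b]
        # Agreement: merged in both partitions, or kept apart in both.
        if same_in_p == same_in_q:
            agree += 1
        total += 1
    return agree / total


# Hypothetical partitions: q merges m3 with m1 and m2, p keeps it separate.
p = {"m1": 0, "m2": 0, "m3": 1, "m4": 2}
q = {"m1": 0, "m2": 0, "m3": 0, "m4": 1}

rm = rand_measure(p, q)
```

Of the six pairs, the two partitions disagree only on (m1, m3) and (m2, m3), so the measure is 4/6.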

We compare the partitions produced by our method at different values of τ against the three OECD definitions of metropolitan areas, obtaining RMi(τ) for i=1, 2 and 3, and against the partition of Colombian municipalities constructed by Duranton (Duranton 2015a).

Table 2 This table presents regressions for the 62 cities corresponding to τ=45
Table 3 Here we assign municipalities randomly to 62 cities in five successive placebo cases and replicate the last regression (column 5) of the previous tables in each case

Results show a high level of agreement (RMi>0.9 for any τ<80), followed by decreasing values for larger commuting times (Fig. 10). Yet, although there is a high level of agreement, the two aggregations provide different information, since RM<0.943 for all values of τ, meaning that constructing cities based on the commuting time between municipalities and based on metropolitan areas yields different units of observation.

Appendix F: Additional regression results

Ignoring commuting time (τ=0)

In Fig. 11 we show the relationship between the growth of the formal occupation rate and complexity potential, a measure of the availability of skills to develop more complex industries, for τ=0 (the number of cities is 96). Table 1 presents the regression results (including relevant additional control variables). The variable of interest (complexity potential) is significant in all the regressions. The coefficient of complexity potential is very stable, suggesting that a 10% increase in complexity potential is followed over a five-year period by an increase of about 0.28-0.39 percentage points in the formal occupation rate.

City definition at τ=45

Here we choose τ=45 as a plausible “reasonable commuting time”, consistent with the results summarised in Fig. 9 and discussed in the main text. As Table 2 indicates, complexity potential is again significant in all the specifications. Furthermore, its coefficient is larger and more significant than in the previous set of regressions. Within the 45-minute time radius, a 10% increase in complexity potential is followed over a five-year period by an increase of about 0.48-0.52 percentage points in the formal occupation rate.

Placebo cities

An alternative approach to test the significance of the results is to construct placebo cities, that is, random collections of municipalities without regard to their (geographic or commuting time) proximity. Does “complexity potential” remain significant in such a case, which would imply that factors other than commuting time are at play? We randomly allocate the 96 urban municipalities to form sets of 62 cities, and re-compute the necessary variables as above. We present five random draws, all of which include the same controls as in the previous tables. As Table 3 shows, the coefficient of complexity potential is very low and never robustly significant, suggesting that information on the diversity and sophistication of the skills in the constructed disconnected cities bears no relation to the growth of the formal occupation rate.