The Inclusive Sustainable Transformation Index

In this paper, we put forth an index of Inclusive Sustainable Transformation that captures the extent to which a country has developed a modern industry or services-based economy that at the same time protects the environment and is gender inclusive. This index distinguishes itself from other indicators that track the structural characteristics of the economy by ensuring that the comparisons between countries account for differences in the level of development, in line with New Structural Economics thinking. The index evaluates how well a country scores given its available resources. In addition, by addressing data availability problems using multiple imputation techniques, the index is able to compare performances on a wide range of topics for almost 200 countries over 25 years, including a large group of developing countries that are often left out. In addition to monitoring the progress made towards the establishment of an inclusive and environmentally friendly, modern economy, the index is a useful tool for policy makers and analysts. By decomposing the total score back into its components, it can help identify areas that require additional attention, as well as ‘best practices’ in countries at similar levels of development.


Introduction
With the adoption of the Sustainable Development Goals (SDGs) and the Paris Agreement on climate change, the year 2015 was a major inflection point in the long struggle for a global consensus on these international priorities. While both agreements set out clear policy goals, the question of how to monitor and evaluate their progress largely remained unanswered. In recognition of this, in December of 2014, UN Secretary-General Ban Ki-moon called for a 'comprehensive program of action on data.' However, the Open Working Group that proposed the 17 SDGs recommended tracking as many as 169 individual targets. It is therefore highly improbable that funding is or will be available to carry out the necessary data collection in all countries (Jerven 2014). 1 Under such circumstances, it is better to adopt a more focused approach, which is still based on objective data analysis and monitoring but covers only the factors considered to be the most critical indicators of the SDGs. To that end, this paper is centered on three elements that underlie as much as half of the SDGs: (1) environmentally friendly and (2) socially inclusive (3) structural transformation.
Structural change is the foundation of sustainable and inclusive growth and the condition for achieving the SDGs (Monga 2013). Rarely has a country evolved from a low- to a high-income status without continuous structural transformation from an agrarian or resource-based economy towards an industry- or services-based economy. Industrialization is essential for lifting people out of poverty, creating jobs, advancing technology, and generating prosperity around the world. However, industrial development often comes at a high environmental cost. For example, industrial production is currently the largest emitter of greenhouse gases, representing almost 30% of global emissions. Fortunately, it is possible to transform conventional industrial development patterns to prevent dangerous anthropogenic interference with the atmosphere and other environmental destruction. It is equally important to ensure that these economic opportunities are open to all, regardless of gender or other individual characteristics. After all, for economic growth to be truly impactful, it cannot be limited to (less than) half the population.
The global community needs monitoring tools that provide the right incentives to governments, the private sector, and other development stakeholders to actively promote this kind of structural transformation. To that end, this paper outlines the construction of the Inclusive Sustainable Transformation (IST) index, which captures the extent to which a country has developed a modern economy that protects the environment and is gender inclusive.
In addition to its more directed focus, the IST index sets itself apart from other indicators of sustainable development in three ways. First and foremost, it accounts for a country's development status when assigning scores. New Structural Economics tells us that the feasible and desired characteristics of countries change with their level of development (Lin 2012a, b). This idea is explicitly incorporated in the IST index, which assesses a country's progress towards sustainable development relative to that of countries with a similar level of development. Moreover, the conditional nature of this index aligns with the idea that sustainable development is a continuous process of improvement for all countries, rather than a fixed path with clearly defined end goals.
Second, instead of using the latest available data for each indicator like the SDG indexes of Kroll (2015) and Sachs et al. (2016, 2017), the IST index keeps track of the availability of the underlying indicators. To deal with the gaps in their coverage, we introduce a new way of imputing the missing data: Multiple Imputation using State Space models (MISS). The MISS algorithm substantially increases the reliability of the imputed values and ensures that the confidence intervals of the index reflect the extent to which imputed data was used. This allows us to compute the IST index for almost 200 countries from 1990 to 2016.
Finally, the paper uses a (conditional) cumulative density function (CDF) to rescale the indicators and compare the performance of different countries. The CDF tells us what the probability is of finding a country (with a similar level of development) that performs worse; the higher this probability, the more progress a country has made relative to its peers. By transforming the indicators in this way, we avoid the discontinuity problems associated with using discrete methods like rankings or thresholds (cf. Kroll 2015) while retaining a straightforward interpretation. Moreover, it allows us to take the level of development into account without having to impose a fixed grouping of countries based on their level of development. In short, it enables a straightforward comparison of the IST scores over time and between countries, including those with a completely different developmental status.
The remainder of this paper is organized as follows. The next section defines structural transformation, its importance for growth and how industrialization contributes to it. Section 3 surveys the theoretical challenges for building development indexes and presents the IST methodology. Sections 4 and 5 discuss which indicators were chosen and give an overview of the resulting IST index and its subcomponents.

The Need for Structural Transformation
The importance of structural transformation as a process for generating prosperity and for improving the quality of life around the world cannot be overstated. This process typically involves improving the productivity in the agricultural sector to increase food supply, free up labor and provide savings. These resources can then support the process of industrialization, urbanization and the development of a high-performing service sector that can absorb a growing fraction of the educated labor force. Prolonged growth and economic prosperity require a shift of resources out of traditional agriculture and other low-productivity primary activities into more productive sectors of manufacturing and services in both urban and rural areas. The ensuing expansion and upgrading of 'modern' sectors (including non-traditional agriculture) are at the core of the sustained productivity gains that characterize economic development. Indeed, the consensus among economists is that rising productivity accounts for the bulk of long-term growth (Lin and Monga 2014).
Structural transformation (or structural change) is, therefore, the central focus of economic policy for countries at all levels of development. It has five main features: (i) a steadily declining share of agriculture in economic output and employment; (ii) a rising share of urban economic activity in industry and modern services; (iii) an increasingly sophisticated share of manufactured goods in production and exports; (iv) migration of rural workers to urban settings; and (v) a demographic transition that typically involves a spurt in population growth before reaching a new equilibrium.
Sustaining high economic performance, improving living standards, and sharing prosperity widely to maintain social cohesiveness and peace require constant movement of resources to new, more productive industries, sectors, and firms, as well as continuous infrastructural and institutional improvement. Throughout this process, the country's structure of factor endowments (the relative composition of natural resources, labor, human capital and physical capital) will be innately different at every level of development. Because of this, the optimal industrial structure and comparative advantage of any given economy will evolve along with its level of development (Lin 2012a, b; Lin and Monga 2013).

Industrialization as a Source of Growth
The modernization of agriculture and sustainable industrialization are essential features of the structural transformation process. Productivity increases in agriculture provide food, labor, and savings that fuel the process of urbanization and industrialization (Timmer and Akkus 2008). The development of a competitive industrial sector yields an even higher payoff. Economists have established at least since the early 1960s that manufacturing has always played a significant role in the total output of more affluent countries, and that countries with higher incomes are typically those with substantially larger transport and machinery sectors (McMillan and Rodrik 2011). In fact, only in singular circumstances such as an extraordinary abundance of land or resources have countries succeeded in developing without industrializing. Industrialization also promotes inclusive development by expanding the fiscal space for social investments.
Within the industrial sector, manufacturing in particular has transformed the dynamics of the world economy. The globalization of manufacturing is driven by many factors, including profound changes in geopolitical relations among world nations, the widespread growth of digital information, the decline of transportation costs, the development of physical and financial infrastructure, computerized manufacturing technologies, and the proliferation of bilateral and multilateral trade agreements. These developments have permitted the decentralization of supply chains into independent but coherent global networks that allow transnational firms to locate different parts of their businesses around the world. The creative design of products, the sourcing of materials and components, and the manufacturing of products can now be done more cheaply and more efficiently from virtually any region of the planet while final goods and services are customized and packaged to satisfy the needs of customers in faraway markets. The globalization of manufacturing has thus allowed developed economies to benefit from lower wages in developing countries such as China, India, Bangladesh, Costa Rica, Mexico, and Brazil while creating job and learning opportunities in these formerly poor nations.

Does Manufacturing Still Matter?
In recent decades, innovation, technological developments and new sources of economic growth have led some economists to question whether manufacturing still matters. Manufacturing's share of global value added has steadily declined over the past 30 years as the global value added of services has grown. 2 However, these trends are mainly observed in high-income countries and can be explained by several factors. First, productivity increases and rising standards of living in advanced economies have pushed up wages and forced many industries to delocalize their production to lower-cost nations. Second, increasing levels of efficiency in the world economy have reduced the relative prices of consumer goods while at the same time the demand for services such as healthcare, security, and transportation has increased. Finally, and perhaps even more importantly, manufacturing jobs have a multiplier effect on employment in services, as the development of industries everywhere automatically generates a wide variety of economic activities, from transportation to housing, from hospitality to entertainment. 3
Concerns about the future of manufacturing as a viable source of economic growth have been investigated empirically by Hausmann et al. (2011). They found that over 70% of the income variations among nations can be explained by differences in manufactured product export data alone. The analysis of the composition and scale of a nation's manufacturing sector revealed that sophisticated economies export a large variety of 'exclusive' goods that few other countries can produce. These economies have typically accumulated productive knowledge and developed manufacturing capabilities that others do not have. It therefore appears that national income and economic sophistication (economic complexity) rise in tandem.
Even basic manufacturing expertise can gradually generate new knowledge and lead to new, more advanced products, provided that the right strategic and business decisions are made on industrial and technological upgrading. In the words of Hausmann and Hidalgo (2012, p. 13), economic development is 'a social learning process, but one that is rife with pitfalls and dangers. Countries accumulate productive knowledge by developing the capacity to make a larger variety of products of increasing complexity. This process involves trial and error. It is a risky journey in search of the possible. Entrepreneurs, investors, and policy-makers play a fundamental role in this economic exploration. Manufacturing, however, provides a ladder in which the rungs are more conveniently placed, making progress potentially easier.' In sum, manufacturing still generates economies of scale, sparks industrial and technological upgrading, fosters innovation, and has significant multiplier effects.

Composing an Index of Inclusive and Sustainable Transformation
Both the Sustainable Development Goals (SDGs) and the Paris Climate agreement require that all signatories continuously assess and report on the progress made toward their objectives. However, the monitoring of this progress is a major challenge, especially given the differences in the level of development and the production structure of the countries that joined the agreements. As Ahluwalia (2015, p. 5) notes, the best one can expect under these circumstances is for economists to 'help to define a set of measurable indicators reflecting various aspects of inclusiveness and sustainability, taking into account availability of data on these indicators, and the scope for improving data availability over time. We could then set targets for each of these indicators and hope that they would be accepted by different stakeholders as representing significant improvement in each dimension.'
In the case of the SDGs, each goal requires a multidimensional policy framework for action. As a result, to keep track of the progress on all 17 goals, 169 target indicators were identified. With this many aspects to consider, assessing overall progress has become even more challenging, as we can expect numerous conflicting narratives on whether progress has been made. In short, there is a need for synthetic indicators that can capture the essence of empirical analyses and convey policy-relevant messages to development stakeholders who would otherwise be overwhelmed trying to make sense of the data generated about each indicator.

Measuring with Indexes: Beyond the Utopian Quest for Legitimate Indicators
There is no shortage of composite indexes to track economic development over time and across countries. In fact, there are so many of them that it has become almost impossible for policymakers to make sense of the stories they tell and to identify the specific, actionable policy levers that yield clear economic and social gains. In his critical review of some popular composite indexes of development, Ravallion (2011, p. 2-3) categorized them into two broad types. First, indexes such as the gross domestic product (GDP), for which the choice of the component series and the aggregate function 'are informed and constrained by a body of theory and practice from the literature.' Second, indexes such as the Human Development Index that are based on 'a set of indicators that are assumed to reflect various dimensions of some unobserved (theoretical) concept.' He saw the former as more appropriate indexes, while the latter lacked the necessary analytical legitimacy: 'neither the menu of the primary series nor the aggregation function is pre-determined from theory and practice, but are […] key decision variables that the analyst is free to choose, largely unconstrained by economic or other theories intended to inform measurement practice' (p. 3). To illustrate his point, Ravallion (2011) contrasts an index where the variables and weights are based on a regression model calibrated with survey data with an index where they are set by an analyst who has some concept of economic welfare in mind. Ravallion (2011) refers to the latter as a 'mashup' index.
Such a distinction may seem like an elegant conceptualization of the problem at hand, but it is an artificial one. First and foremost, the contention that GDP should be the model index, one legitimized by 'pure' theoretical reasoning and rigorous analytical modeling, is invalidated by the body of academic research that has highlighted its many shortcomings (beyond being a randomly aggregated set of variables that form a series of accounting identities). GDP as an index is not beyond suspicion. Calling it a 'capitalist conspiracy', as Venezuelan President Hugo Chavez did, may have been an extreme form of criticism. Yet, as a model index, GDP carries many shortcomings and paradoxes (see Coyle 2014). While GDP measures product, it ignores central facts such as quality, costs, sustainability, or purpose (Stiglitz et al. 2010). 4 Moreover, the expectation that economists and other social scientists can construct development indicators that pass the test of 'pure theories' simply because such indicators would be 'based on a regression model calibrated to survey data' is unrealistic. It is well known that regression models are based on a host of assumptions; without them, legitimate inferences cannot be drawn from the model. While there are statistical procedures for testing some of these assumptions, the tests often cannot detect substantial failures. As pointed out by Freedman (2010, p. 14), 'model testing may become circular; breakdowns in assumptions are detected, and the model is redefined to accommodate. In short, hiding the problems can become a major goal of model building.' However appealing it may be at face value, the dichotomy between 'credible' indexes based on some theory or calibrated from regression analyses, and 'mashup' ones singled out as 'randomly elaborated,' is problematic because what constitutes an acceptable theoretical basis is always debatable.
Such a distinction assumes the existence of a rationally neutral analyst who can observe and monitor performance with distance, detachment, and balance. Philosophers have long provided good arguments about the impossibility of this type of rational actor. From Darwin to Marx, Nietzsche, Freud, Wittgenstein or Heidegger, there is an accumulated body of evidence that the so-called sovereign rational subject (the detached observer imagined by Kant) actually does not exist. 5 It follows that no development economist is intellectually autonomous, self-transparent and capable of identifying causal relationships and causal mechanisms in a detached manner. It is impossible to deny the role of the pre-conceptual and non-conceptual at the very core of the rational. The notion that subject and object can be separated from one another (the supreme dogma of empiricism) is merely an illusion.
Therefore, any economic theory or model, especially one built on regression analyses, should acknowledge the limits of its generated knowledge. Any index out there reflects an explicit or implicit theoretical analysis of the dynamics of economic development. The real criteria for assessing pertinence and effectiveness should be whether an index provides useful information to strengthen intellectual and policy arguments and whether it helps focus the attention on social and economic goals deemed to be of importance to society.

Rescaling, Weighting, and Aggregation
The first step in the construction of the IST index is the selection of the indicators of inclusive and sustainable development, which will be outlined in detail in the next section. The remainder of this section discusses the subsequent steps, namely the way in which these indicators are rescaled, weighted and aggregated, as well as how the level of development is taken into account and how missing observations are handled.
Before the indicators can be combined, they first have to be put on a comparable footing. There are different ways of transforming these variables, many of which are described in the OECD handbook on composite indicators (OECD-JRC 2008). The distribution of values of the indicators in the IST dataset can differ markedly between indicators and different levels of development. This argues against using more common transformations like z-scores or Min-Max as done for example in the Quality of Growth index (Mlachila et al. 2014). Instead, we use the (empirical) cumulative density function (CDF), which expresses the probability of finding a country with a lower score. 6 The main advantage of the CDF is that it provides a clear interpretation: a CDF of zero means that the country performs worse than all other countries with the same level of development, while a CDF of one indicates the opposite. Any score in between can be interpreted as the fraction of countries that score worse. Moreover, as we will see below, the CDF also enables us to account for the level of development in a straightforward fashion, without discontinuity problems.
The estimation of the CDF is done year-by-year, using only the information on the distribution of the indicator available at that point in time. This allows the frame of reference to change over time, as the older patterns in the distribution of the indicators are left out. As a result, the transformed scores represent a country's position relative to its peers in that year. The CDF will not change if all countries improve (or deteriorate) at the same rate. Similarly, a country's score will decrease if it remains unchanged while its peers improve. In this way, the index underlines the idea that inclusive and sustainable industrial development is a continuous process of improvement for all countries, rather than a fixed path with a clearly defined end goal. An additional benefit of this relative approach is that it undermines what Ravallion (2011) termed rank-seeking behavior: increasing your score until it just exceeds a benchmark. Unless a government is willing to decrease its level of development deliberately, each aspect of the index will need continuous improvement for the country to keep up with its peers.
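As a minimal sketch of this unconditional, year-by-year rescaling (not the authors' implementation), the Python snippet below scores each country by the share of other countries with a strictly lower value, so that the worst performer gets zero and the best gets one. The function name `ecdf_score`, the country labels and all numbers are hypothetical, and the level of development is ignored here.

```python
import numpy as np

def ecdf_score(values):
    """Share of *other* countries with a strictly lower value (0 = worst, 1 = best)."""
    v = np.asarray(values, dtype=float)
    # lower[i] counts the countries j with v[j] < v[i]
    lower = (v[None, :] < v[:, None]).sum(axis=1)
    return lower / (len(v) - 1)

# Hypothetical indicator values for four countries in two years.
panel = {
    2015: {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0},
    2016: {"A": 2.5, "B": 2.0, "C": 3.0, "D": 4.0},  # only country A improves
}

scores = {}
for year, obs in panel.items():
    names = list(obs)
    # The reference distribution is rebuilt from that year's data only.
    s = ecdf_score([obs[n] for n in names])
    scores[year] = dict(zip(names, s))

print(scores[2015]["B"], scores[2016]["B"])
```

In this toy panel, country B keeps the same raw value in both years but its score falls because a peer overtakes it, which is the relative behavior described above.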
After transformation, the indicators are combined into the IST index using a simple average. This means that each component receives equal weight in the final index. Moreover, it imposes perfect substitutability between all goals: a high score on manufacturing compensates perfectly for a lower environmental score. In the robustness section of this paper (Sect. 5.1), we abandon the assumption of perfect substitutability and instead use a geometric average, which penalizes countries for imbalances in their scores. 7 As shown there, however, the effect of using a geometric average on the index values is small.
As the weights sum up to one, the IST index has the same range as the transformed indicators: an IST score of one means that the country outperforms all of its peers; the opposite holds true for zero. Anything in between can be interpreted as the average fraction of countries that perform worse.
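The difference between the two aggregation rules can be illustrated with a short Python sketch. The score profiles and function names are hypothetical; the geometric version assumes strictly positive scores, which is an assumption of this sketch and not a statement about how the paper treats zeros.

```python
import numpy as np

def arithmetic_index(scores):
    """Equal-weight simple average: components are perfect substitutes."""
    return float(np.mean(scores))

def geometric_index(scores):
    """Equal-weight geometric average: imbalanced profiles are penalized."""
    s = np.asarray(scores, dtype=float)
    # exp(mean(log(s))) equals the n-th root of the product; needs s > 0.
    return float(np.exp(np.mean(np.log(s))))

balanced = [0.5, 0.5, 0.5]    # hypothetical transformed scores
imbalanced = [0.9, 0.5, 0.1]  # same simple average, uneven profile

print(arithmetic_index(balanced), arithmetic_index(imbalanced))
print(geometric_index(balanced), geometric_index(imbalanced))
```

Both profiles have the same simple average, while the geometric average of the uneven profile is lower, reflecting the penalty for imbalance.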

Differentiating by Level of Development
A central tenet of New Structural Economics is that the structural characteristics of countries are not one-size-fits-all (Lin 2012a). Instead, the economic structure that best helps growth will change as the country develops and new characteristics become feasible. Simply put, one cannot expect relatively less developed countries such as India, Burundi or Ethiopia to have the same environmental, institutional and economic characteristics as rich countries like Denmark or Japan. Rather than relying on the individual user to consider this when using the IST index, we want to embed this thinking directly into the index. It should indicate the performance of a country relative to countries with a similar level of development.
By way of illustration, Panel a of Fig. 1 plots the investment in Research and Development in percent of GDP (R&D) versus the level of development (dev) as measured by the log of the Gross National Income (GNI) per capita in 2012. It shows that the distribution of investment in R&D depends strongly on the level of development. The higher dev, the more the distribution shifts towards the maximum values of R&D.
An intuitive way of taking development into account is to group countries by their development level and only compare their characteristics within these groups. In line with NSE theory, PPP converted per capita income can be used as an indicator of the level of development and the capacity of an economy. 8 For example, the income classification employed by the World Bank identifies four different income levels: low-income, lower-middle and upper-middle income and high-income countries. 9 Panel b shows the joint distribution of y and dev and Panel c illustrates how this translates into the conditional CDF for the four income-levels. In general, countries with a high investment in R&D will get a higher score. However, for a given percentage of investment in R&D, a low-income country will receive a higher score than a middle or high-income country. The use of discrete income groups to differentiate levels of development has one important drawback: it creates discontinuities for countries lying on the border. A small change in the level of development can change the group a country belongs to, which can have significant consequences for those variables that strongly depend on it. This can lead to a situation where a small improvement nevertheless leads to a decrease in the index value, only because the country is now compared to an entirely different set of countries. It also biases the comparison between countries that lie on either side of the cut-off point. This problem can be avoided by using a continuous way of controlling for the level of development such as the conditional cumulative density function. The conditional CDF gives us the probability of finding a country with a lower score that has the same level of development: $F_{y|dev}(a) = p(y \le a \mid dev)$.
As illustrated in Panel d, the transformed values using the conditional CDF are very similar to those using fixed thresholds, but without the discontinuity problems for countries whose level of development is close to the thresholds.
The conditional cumulative density $F_{y|dev}$ is estimated using a multivariate kernel density estimator. An essential feature of this type of estimator is that it assigns a higher weight to information in the vicinity of the point of interest, both in terms of the variable and the level of development. The bandwidth of the estimator determines what counts as the vicinity: the larger the bandwidth is, the more the performance of dissimilar countries is taken into account. By estimating the bandwidth, we can adjust it to match each indicator. 10 If an indicator is strongly dependent on the level of development, only the information of countries with a very similar level of development is taken into account. However, if the level of development does not affect y, the increase in the bandwidth ensures that more information is used, allowing us to estimate the CDF with greater certainty.
Finally, we ensured that a higher score is always an improvement in outcomes. If an indicator measures something positive (e.g., the share of renewable energy), the transformed indicator $\hat{y}_i$ indicates the probability of finding countries with a similar level of development ($dev_i$) that score lower: $\hat{y}_i = F_{y|dev}(y_i \mid dev_i)$. Conversely, for indicators that measure something negative (e.g., CO2 emissions), $\hat{y}_i$ shows the probability of finding a country that scores higher: $\hat{y}_i = 1 - F_{y|dev}(y_i \mid dev_i)$.
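The conditional transformation can be sketched in Python as follows. This is a simplified stand-in for the multivariate kernel density estimator described above, not the authors' estimator: it smooths only over the level of development with a Gaussian kernel, uses a fixed bandwidth `h_dev` instead of an estimated one, and runs on synthetic data. All names and numbers are hypothetical.

```python
import numpy as np

def conditional_cdf(a, dev0, y, dev, h_dev=0.5):
    """Kernel-weighted estimate of P(y <= a | dev = dev0).

    Countries whose development level is close to dev0 get more weight.
    h_dev is a fixed, hypothetical bandwidth (the paper estimates it per indicator).
    """
    y = np.asarray(y, dtype=float)
    dev = np.asarray(dev, dtype=float)
    w = np.exp(-0.5 * ((dev - dev0) / h_dev) ** 2)  # Gaussian kernel weights
    return float(np.sum(w * (y <= a)) / np.sum(w))

def transformed_score(y_i, dev_i, y, dev, higher_is_better=True, h_dev=0.5):
    """Return F(y_i | dev_i) for 'good' indicators and 1 - F(y_i | dev_i) for 'bad' ones."""
    f = conditional_cdf(y_i, dev_i, y, dev, h_dev)
    return f if higher_is_better else 1.0 - f

# Synthetic data: R&D spending (% of GDP) rises with log GNI per capita.
rng = np.random.default_rng(0)
dev = rng.uniform(6.0, 11.0, size=500)
rd = np.clip(0.4 * (dev - 6.0) + rng.normal(0.0, 0.3, size=500), 0.0, None)

# The same 1% R&D investment, evaluated at two development levels.
poor = transformed_score(1.0, 7.0, rd, dev)
rich = transformed_score(1.0, 10.5, rd, dev)
print(poor, rich)
```

Because the weights decay smoothly with the distance in development, a small change in `dev_i` shifts the reference group only slightly, which is what avoids the threshold discontinuities of fixed income groups.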

Addressing Missing Values
The final issue that needs to be addressed before the index can be computed is how to deal with missing values. The IST index comprises a large number of indicators, and as can be seen in Table 1, the coverage of those indicators can be markedly different. When we limit the dataset to the period 1990-2016, slightly more than half of all observations are missing. A closer examination of the pattern in the missing data reveals that this cannot be easily solved. Figure 2 illustrates this by mapping the data availability using black (available) and white (missing) rectangles. Each row in this table corresponds to one of the 27 variables (the list of abbreviations can be found in "List of Abbreviations" in the Appendix). The columns, in turn, show the different combinations in which the indicators are available, with the width of each column signaling how often this combination occurs in the dataset. The columns are sorted in decreasing order of occurrence. The first column tells us that the most common combination is that only the national electrification rate (O7.1) is available. The second most prevalent combination (4% of the data) is the total absence of any information. In contrast, there are no observations that are covered by all indicators as there are always at least two indicators missing.
Our first step in tackling the missing data problem is to limit the number of countries for which we compute the index. Specifically, we exclude those countries for which the average data availability over the entire period is less than 25%. This removes mostly small island nations and city-states like Gibraltar, Nauru, or Vatican City, leaving 198 countries in the dataset. However, as Fig. 2 already suggested, limiting the dataset in this way will not solve the missing data problem, as there is no simple pattern in the missing observations. While removing the countries with the worst coverage does reduce the number of missing observations by one quarter, the total fraction of missing values remains high (43%).
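The country filter described above can be sketched as follows. The data layout (one years-by-indicators array per country, with NaN for missing cells), the country labels and the values are hypothetical; only the 25% threshold comes from the text.

```python
import numpy as np

# Hypothetical data: country -> array of shape (years, indicators), NaN = missing.
data = {
    "Country A": np.array([[1.0, 2.0, np.nan, 4.0],
                           [1.5, np.nan, np.nan, 4.2]]),
    "Country B": np.array([[np.nan, np.nan, np.nan, 0.3],
                           [np.nan, np.nan, np.nan, np.nan]]),
}

def availability(panel):
    """Share of non-missing cells over all years and indicators."""
    return float(np.mean(~np.isnan(panel)))

MIN_AVAILABILITY = 0.25  # countries below this average availability are excluded

kept = {c: p for c, p in data.items() if availability(p) >= MIN_AVAILABILITY}

# Share of missing cells that remains after the filter.
missing_share = float(np.mean([np.isnan(p).mean() for p in kept.values()]))
print(list(kept), missing_share)
```

Even in this toy example the retained country still has gaps, which mirrors the point that dropping poorly covered countries reduces but does not remove the missing data problem.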
In a similar vein, we could leave out the variables that have the lowest availability. For example, by excluding the participation in global value chains (O2.3), gender equality (O5.2) and the wage gap (O5.3), municipal waste (O6.4), waste recovery (O6.5), the Ocean Health index (O8.2) and the protection of terrestrial and marine protected areas (O8.4), the number of missing observations can be cut by half. However, changing the variables also affects the meaning of the index, as it amounts to imposing a zero weight on a quarter of the indicators in our dataset. Moreover, all but the most extensive reductions in the scope and span of the index would still leave differences in availability, making it hard to tell whether changes in the index are due to differences in availability of the underlying indicators.
Rather than drastically limiting the scope of the IST index, we opt instead to solve the missing data problem using multiple imputation, i.e., filling in the gaps with the most likely value of each indicator. Unlike simple imputation (e.g., linear interpolation), multiple imputation draws many different possible values for each missing observation. This gives us a large number of imputed datasets, each of which yields a different index. The final result is computed as the average of these indexes. The variation of the index over different imputations also gives us an indication of the reliability of the results. The more data is missing and the worse the available data is at filling in the gaps, the greater the dispersion in the imputed values and the wider the confidence intervals of the index will be. In summary, multiple imputation allows us to compute the IST index despite the significant data availability problems without having to reduce the scope of the index or omit variables ex-ante.
Fig. 2 Availability of the indicators in the IST index. Note: Available data is represented by a black rectangle and missing values by a white rectangle. The width of the columns indicates the prevalence of the different combinations, which are listed in decreasing order.
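As an illustration of how results from many imputed datasets can be combined, the sketch below averages one country's index over the imputations and reports an empirical percentile interval. The function name, the toy draws and the use of a percentile interval are assumptions for this example and need not match the procedure used for the IST index.

```python
import math
import statistics

def pool_index(index_draws, level=0.95):
    """Pool one country's index over imputed datasets.

    Returns the mean over imputations and an empirical percentile interval
    (an illustrative choice of interval).
    """
    draws = sorted(index_draws)
    n = len(draws)
    tail = (1 - level) / 2
    lower = draws[math.floor(tail * (n - 1))]
    upper = draws[math.ceil((1 - tail) * (n - 1))]
    return statistics.fmean(draws), (lower, upper)

# Hypothetical index values from five imputed datasets.
well_covered = [0.50, 0.51, 0.49, 0.50, 0.52]    # little missing data
poorly_covered = [0.40, 0.60, 0.45, 0.55, 0.50]  # much missing data

mean_good, ci_good = pool_index(well_covered)
mean_poor, ci_poor = pool_index(poorly_covered)
print(mean_good, ci_good)
print(mean_poor, ci_poor)
```

The two toy countries have almost the same pooled score, but the wider spread of the second set of draws translates into a wider interval, which is the sense in which the dispersion over imputations signals reliability.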
To estimate the most likely value of the missing observations, we use a method similar to the Multivariate Imputation by Chained Equations method (MICE) of van Buuren and Groothuis-Oudshoorn (2011). MICE uses the values of the available indicators to predict what the missing values could have been. For example, it will use the information on the number of patents applied for to estimate what the level of investment in research and development could have been. While this works well in a cross-section analysis of countries, MICE ignores time-patterns in the data. As we have a panel dataset where many of the indicators depend strongly on their previous values, we expand the MICE method to a state-space model to also take this time-dependency into account. The so-called Multiple Imputation by State Space models (MISS) improves the quality of the imputations, thereby significantly decreasing the size of the confidence intervals. Technical details on the imputation and a comparison of the two techniques can be found in "Multiple Imputation Using State-Space Model (MISS)" of Appendix.
To test the robustness of our imputation algorithm, we also ran the model on a reduced dataset where the data availability of each country was at least 50%. This reduced the total number of countries by half compared to our baseline model (which only imposed 25% availability). However, the imputation of the missing values remained virtually unaffected. For all but one variable, the correlation between the mean imputed values was in excess of 0.99. The change in the sample did affect the size of the confidence intervals, but not in a uniform way. One-third of the sample had confidence intervals that were at least 10% smaller; a third had larger confidence intervals, while the remainder had confidence intervals that were more or less the same. 11

Indicators of Inclusive Sustainable Transformation
Our selection of indicators measuring IST is based to a large extent on the literature dealing with the measurement of progress towards the Sustainable Development Goals. Almost half of the development goals are directly linked to the idea of inclusive and sustainable transformation, including Goal 5 which is 'to achieve gender equality and empower all women and girls'; Goal 8 promoting 'sustained inclusive and sustainable economic growth, full and productive employment and decent work for all'; Goal 9 which is to 'build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation'; and Goal 12 which aims to 'ensure sustainable consumption and production patterns.' 12 For primary sources of indicators, we look at two UN reports on the monitoring of the SDGs: Indicators and a Monitoring Framework for the SDGs (SDSN 2015) and the report of the Inter-Agency and Expert Group on SDG Indicators (ECOSOC 2016). This list of indicators is supplemented with those discussed in the papers by Sachs et al. (2016, 2017) and Kroll (2015) on the readiness of (developed) countries for the SDGs; the WEF and IMD's reports on (sustainability adjusted) global competitiveness; and the Human Development Index. From these reports, we retained those indicators that were available for a large group of countries over the past decade. Since the focus is on structural transformation rather than economic growth and development in general, there are differences between the indicators listed in these sources and those selected for the IST index. For example, with respect to inclusiveness, only the indicators related to labor markets are retained, leaving out, e.g., the proportion of seats held by women in parliament.
"Comparison of the Indicators Included in the IST Index with Other Indicators of Sustainable Development" of Appendix maps our selection of indicators on the different sources, highlighting variables like participation in global value chains or the economic complexity index that are unique to the IST index. Nevertheless, three-quarters of the indicators are used in one or more of the reports mentioned above, and most are described in great detail in the ECOSOC and SDSN reports.
The IST index can be subdivided into eight categories that each contain between two and five indicators, which are listed in Table 1. In addition to a short description of each indicator, this table lists the availability of each indicator over time, the number of countries it covers and the source of the data. 13 The first two components of the IST index look at the strength of the manufacturing and export sectors. The former includes the value added by the manufacturing sector per capita, as well as its share in GDP. The latter considers the export of manufactured goods and commercial services and also includes the volume of exports to compensate for sudden shifts in the terms of trade. Because of their increasing importance to global trade, we also include an index that captures the extent to which countries participate in global value chains. Finally, included in both components is the contribution of medium and high tech firms to manufacturing value added and to exports, respectively.
The third IST component measures the technological expertise embedded in a country's economy. To that end, we track the overall investment in research and development, the number of patents per capita and the complexity of a country's export basket. The last is measured using the economic complexity index of Hausmann et al. (2011), which combines information on the diversity of goods a country produces with the ubiquity of those goods (i.e., the number of countries that are capable of producing them).
The fourth and fifth components deal with the strength and inclusiveness of the labor market. For the former, we include indicators of the number of people working in manufacturing, 14 their labor productivity and their level of education. Gender equality is measured as the ratio of male and female employment, gender differences in wages as well as the existence and strength of institutional policies promoting equal opportunities for men and women.
The final three components consider the environmental performance. The first one looks at pollution. Air pollution levels are measured by CO2 emissions, the abundance of fine particulate matter in the air and the consumption of ozone-depleting substances. This component also includes the total municipal waste that is generated but counterbalances it with the percentage of waste that is recycled or composted. The second environmental component looks at the structure of the energy market, in particular, the fraction of the population that has access to modern energy (i.e., electricity) and the share of renewable energy in the total energy consumption. The eighth and final component evaluates the management of environmental resources. This includes the percentage of the population with access to drinkable water, the annual change in forest area (as a percentage of land area), the percentage of terrestrial and marine area that is environmentally protected, and an index tracking the health of the ocean. 15
Looking at the list of indicators, it is clear that there is often a significant overlap in what is measured. For example, the manufacturing subcomponent (O1) contains value added by the manufacturing sector both as a percentage of GDP (O1.1) and per capita (O1.2). It could be argued that it would be better to reduce the number of indicators, given the relatively marginal contribution of this second indicator to measuring the strength of the manufacturing sector. However, there are a number of reasons for allowing this apparent surplus of indicators. First, while one indicator can proxy the overall state of affairs, the inclusion of different indicators enables us to build a complete image of the current situation. Second, the overall contribution of some variables might be relatively limited, but their inclusion can be more important for specific groups of countries. One such example is the national electrification rate (O7.2), which matters much more for developing than developed countries. Finally, the inclusion of indicators with different availabilities enhances the performance of the multiple imputation algorithm. For instance, while the number of patents per capita (O3.2) is available for a longer period, it covers fewer countries than the expenditure on research and development (O3.1). In general, even if the selection is disputed, we will show in the robustness section that our results remain virtually unaffected by the exclusion of any one of the indicators.
13 There are two differences between the current selection of indicators and the selection of indicators of the working paper published by the African Development Bank, Ghent University and Peking University's Center for New Structural Economics. First of all, the data on patents for indicator O3.2 now comes from the World Bank's World Development Indicators as opposed to the OECD. Secondly, labor productivity (O4.2) is now measured per worker as opposed to per hour. Both changes were made because they significantly increase the number of countries covered.
14 Services are excluded, as we were not able to distinguish traditional from modern services.
As is the case for all components of the index, the environmental variables are also scored conditional on the level of development of the country. This might seem incongruous with variables that have global environmental consequences, most notably CO2 emissions. However, the goal of the index is not to measure environmental impact, for which there already exist numerous indicators and indexes of high quality that, for example, also take consumption and offshoring of polluting activities into account (e.g., ecological footprint). Instead, the IST index measures how well the environment is protected given the available means, even if the consequences of pollution are not limited to the country in question. Furthermore, as the index is focused on structural transformation, it does not include a measure of the sustainability of agriculture, although there is some overlap with the indicators that are included. 16 Finally, while our index only considers the level of development, the capability to score well on a number of indicators depends on more than just the level of development. For example, the location of a country determines access to certain sources of renewable energy, while the lack of access to the sea significantly increases the cost of trade. That being said, many of these problems can be overcome with the right investments, meaning that even in these cases the level of development remains an important factor. Moreover, technological progress is likely to continue to increase its importance as an impediment towards sustainable, inclusive structural transformation.

The Inclusive Sustainable Transformation Index
Using the dataset and methodology described above, we computed the IST index from 1990 to 2016 for 198 countries. As the availability of indicators drops to one in four in 2016, we will focus the discussion of the index on the year before. However, except for an increase in the confidence bands in the final year, the results are very similar in 2016. Figure 3 shows the worldwide distribution of the IST index in 2015. Countries that score above average are colored blue, and those that score below average are colored red, with the darker colors corresponding to respectively higher or lower values. While in theory the values of the IST index can lie between zero and one, we find that the actual values of the index lie between 0.3 and 0.7. The values on the individual components lie much closer to the theoretical extremes, indicating that countries that score very high on one component will compensate for this with lower scores on other components. Overall, the values of the index tend to be slightly negatively skewed, with below average scores centered on 0.45, and above average scores having a fatter tail. In other words, most countries that score below average tend to do so only slightly, and there are more countries with a very high than with a very low score.
By taking the level of development into account when comparing the structural characteristics of countries, the IST index can identify good performance despite a lower level of development. For example, Vietnam's IST score is higher than that of all other countries on the Asian, American, and African continents. At the same time, the overall picture mostly confirms our expectations. Except for Greece, European countries score highly, and while Central and North America also score above average, South America scores below. Countries in Southeast Asia also tend to score above average, and while those in Africa show more mixed results, many Southeast African countries tend to perform well. The high scores for some of the high-income European countries like Austria (0.64), Sweden (0.64) and Finland (0.63) are due in part to the fact that they outperform other high-income countries like Saudi Arabia (0.34), Oman (0.37) and Bermuda (0.38), particularly on the environmental and equality components. However, some of the best scores on the European continent are achieved by lower- and upper-middle-income countries in Eastern Europe, namely Hungary (0.66), Slovenia (0.65) and Slovakia (0.65).
When using the index to make comparisons between countries or over time, it is important to keep in mind that the scores always reflect a country's position relative to those with a similar level of development in that year. As a result, a decrease in the index does not necessarily mean that a country's absolute achievement deteriorated: it could also mean that other countries made (more) progress. That being said, the IST index is relatively stable: its variation between countries is more than twice as large as its variation over time. Even so, there are countries whose score has changed dramatically over the past decade. To illustrate, Fig. 4 shows the evolution of the IST score of China (panel a) and the United Arab Emirates (panel b), together with their 95% confidence intervals. For both countries, the index changes only gradually and the confidence intervals indicate that the year-to-year changes are all insignificant. Nevertheless, when considering the evolution over a more extended period, there are 46 countries where changes in the index are big enough that the 95% confidence intervals no longer overlap, including China and the United Arab Emirates.
Before comparing our index with those suggested by Sachs et al. (2017) and Kroll (2015), it is important to note that these indexes do not consider the time dimension. Both use only the latest available values for each indicator, which in some cases date back to 2012. Their correlation with the 2015 values of IST is 0.54 and 0.43, respectively. These rise slightly when we control for GDP per capita: the respective partial correlation coefficients rise to 0.56 and 0.58. As the IST conditions on the level of development, the correlation between IST and per capita GNI is low (0.17), but its correlation with the HDI is higher (0.34). This is the principal difference with the indexes of Sachs et al. and Kroll, which are strongly correlated with the level of development (0.60 and 0.67 with GNI/cap; 0.92 and 0.79 with HDI), even though Kroll only compares developed countries.
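For reference, the standard first-order partial correlation between two indexes x and y after controlling for a third variable z (here GDP per capita) can be written as follows. The symbols are notation introduced for this illustration; the text does not state which estimator was used.

```latex
r_{xy \cdot z} \;=\; \frac{r_{xy} - r_{xz}\, r_{yz}}
{\sqrt{\left(1 - r_{xz}^{2}\right)\left(1 - r_{yz}^{2}\right)}}
```

Under this formula the partial correlation can exceed the raw correlation, for instance when one index is strongly related to z while the other is only weakly related to it.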
Having considered the overall IST scores, the next step is to look at the underlying indicators to better understand how certain scores came about. To that end, Fig. 5 shows the transformed scores of the indicators in the IST index for two low, two middle and two high-income countries in 2015 using a radar chart. The countries in the left column have one of the lowest scores in their development group, while the countries on the right have one of the highest scores. These graphs further illustrate that each component of the index is scored relative to the level of development. Take, for example, the Ocean Health Index (O8.2). Tanzania's transformed score is twice that of Saudi Arabia (0.79 vs. 0.39), even though it has a lower score on the Ocean Health Index (55.6 vs. 66.3). The reason is that its higher level of development means that Saudi Arabia is compared to a group of countries that protect their oceans better, like Germany, which has a perfect score on the OHI.
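The idea that a lower raw value can yield a higher transformed score can be sketched with a kernel-weighted empirical CDF, in which peers at a similar income level receive more weight. The Gaussian kernel, the bandwidth, the function name and the toy data are all assumptions made for illustration and are not the estimator used for the IST index.

```python
import math

def conditional_score(raw, log_income, peers, bandwidth=0.5):
    """Kernel-weighted share of peers with a raw value at or below `raw`.

    `peers` is a list of (raw_value, log_income) pairs. Peers whose income
    is close to `log_income` dominate the comparison (Gaussian kernel).
    """
    weighted_below = total_weight = 0.0
    for peer_raw, peer_income in peers:
        w = math.exp(-0.5 * ((peer_income - log_income) / bandwidth) ** 2)
        total_weight += w
        if peer_raw <= raw:
            weighted_below += w
    return weighted_below / total_weight

# Hypothetical peers: low-income countries with low raw values,
# high-income countries with high raw values.
peers = [(40, 6.5), (45, 7.0), (50, 6.8), (60, 7.2),
         (70, 10.5), (80, 10.8), (90, 11.0), (100, 10.6)]

poor_score = conditional_score(55, 6.9, peers)   # raw 55, low income
rich_score = conditional_score(66, 10.7, peers)  # raw 66, high income
print(round(poor_score, 2), round(rich_score, 2))
```

In this toy setting the low-income country outscores the high-income one despite its lower raw value, because each is judged almost exclusively against peers at its own level of development.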
These radar charts can be a useful tool for economic and development policy, as they highlight those policy areas that require more attention, as well as specific problems that need to be addressed. Moreover, they can guide countries towards policies that work for countries with similar levels of development. For example, panel a links Tanzania's low score to its poor performance on the manufacturing component. Rather than trying to emulate the economic structure of high-income countries, Tanzania could look at the economic policies of a country like Swaziland that scores highly on Manufacturing value added and Manufacturing value added per capita (over 0.94). In contrast, Germany scores highly on almost all components and, as a result, ends up with the highest score in 2015. Nevertheless, Germany's score could be even higher if it managed to decrease the total amount of municipal waste that is created. For help in achieving this goal, Germany could take a closer look at Belgium, Korea or Iceland. All three countries have similar levels of development but score exceptionally well on this component: 0.86, 0.91 and 0.94, respectively, versus Germany's 0.11.

Robustness Checks
In this final section, we determine the IST index's sensitivity to our modeling choices. First, we look at how the results change when an alternative measure of the level of development is used. To that end, we use the UN's Human Development Index (HDI), a composite index that combines GNI per capita with life expectancy and education level. With the HDI as our measure of development, the individual indicators from Table 1 are transformed and the results are combined into a second index: IST_HDI. While there are some differences between our baseline index and IST_HDI, their overall correlation is high (Fig. 6, panel a). The development category where the most prominent changes take place is the lower-middle income group. While the holistic nature of the HDI might make it a more appealing choice as a measure of development, there are two reasons why we use GNI per capita instead. First of all, there is an overlap between HDI and the indicator of human capital included in the index (O4.3). Secondly and most importantly, the HDI is only available every five years between 1990 and 2010. 17
As a second robustness check, we consider the effect of using a geometric average to combine the indicators. Unlike the arithmetic average, which imposes perfect substitutability between the components, the geometric mean penalizes countries with asymmetric component scores. The effect on the ranking by the index is minimal, as the correlation between IST_GEO and the baseline IST is 0.94. Nevertheless, as panel b of Fig. 6 shows, the scores are lower using the geometric average. Some countries find their score significantly decreased, like Libya, which sees a 44% decrease (from 0.32 to 0.18). While the distribution of IST is negatively skewed, IST_GEO has a more symmetric distribution. On the other hand, it also has much wider confidence intervals.
Given that the results are so similar, we opt for the arithmetic average as it provides a more straightforward interpretation (i.e., the average fraction of countries that score better).
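The difference between the two aggregation rules can be seen in a small example. The component scores below are hypothetical; the two functions implement the textbook arithmetic and geometric means.

```python
import math

def arithmetic_mean(scores):
    """Simple average: a low score is fully offset by an equally high one."""
    return sum(scores) / len(scores)

def geometric_mean(scores):
    """Geometric mean; requires strictly positive scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

balanced = [0.5, 0.5, 0.5, 0.5]  # hypothetical country with even scores
uneven = [0.9, 0.9, 0.1, 0.1]    # hypothetical country with asymmetric scores

for name, scores in (("balanced", balanced), ("uneven", uneven)):
    print(name, round(arithmetic_mean(scores), 3), round(geometric_mean(scores), 3))
```

Both toy countries have the same arithmetic mean, but the geometric mean of the uneven profile is markedly lower, which is the penalty on asymmetric component scores described above.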
We also check how the index changes when variables are omitted. In this regard, the index is recomputed 26 times using all but one of the indicators. Regardless of which variable is left out, the index and its standard deviations are almost identical: the correlation of both exceeds 0.98 each time (Fig. 6, panel c). Finally, as some categories contain more variables than others, we also check how the results change when each category, rather than each indicator, receives equal weight in the final index. Similar to the leave-one-out estimations, the results are practically identical.
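The leave-one-out procedure can be sketched as follows. The simulated scores, the number of countries and indicators, and the function names are assumptions for illustration. Because the toy scores share a common country effect by construction, the high correlations it produces only demonstrate the mechanics and say nothing about the IST data.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def index(rows, skip=None):
    """Unweighted mean of indicator scores per country, optionally omitting one."""
    out = []
    for row in rows:
        kept = [s for j, s in enumerate(row) if j != skip]
        out.append(sum(kept) / len(kept))
    return out

rng = random.Random(0)
n_countries, n_indicators = 50, 8
rows = []
for _ in range(n_countries):
    base = rng.uniform(0.2, 0.8)  # hypothetical country-level effect
    rows.append([min(1.0, max(0.0, base + rng.gauss(0, 0.1)))
                 for _ in range(n_indicators)])

baseline = index(rows)
loo_corr = [pearson(baseline, index(rows, skip=j)) for j in range(n_indicators)]
print(round(min(loo_corr), 3))
```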

Conclusion
The universal adoption of the Sustainable Development Goals and the successful conclusion of the Paris Climate Summit were seen as turning points in the pursuit of shared global prosperity. However, the monitoring of these goals has become a significant challenge, especially since economies around the world are at different levels of development and have different production structures. This paper proposes the Inclusive Sustainable Transformation (IST) index as a contribution to the monitoring of these global objectives.
The IST index measures the extent to which a country has developed a modern economy that protects the environment and is gender inclusive. In contrast with other development indicators, the level of development is taken into account when the structural characteristics of countries are compared. This is in line with New Structural Economics thinking, which posits that a country's optimal development strategy depends on its level of development. To make this conditional comparison, we employ a continuous method of transformation (a conditional CDF) that does not bring about structural breaks in the index. Our results show that taking the level of development into account can reveal patterns that are otherwise hidden, revealing a number of countries that performed much better or worse than expected.
Given the ambitious scope of the index, both in terms of countries covered and indicators included, missing data is a big concern. However, we address this problem using multiple imputation. This allows us to estimate the relative performance of close to 200 countries and provides us with an estimate of how the reliability of the index is affected by missing data.
Rather than only measuring a country's overall progress, we focus on how the different components of the index contribute to the overall score. To that end, radar graphs show the disaggregated results and allow us to quickly identify those policy areas that are leading or lagging. By breaking the IST scores down to their different components, policymakers and analysts can identify 'best practices' among countries with a similar level of development on a wide range of policies.
future values, we make use of a state-space model. As is the case with MICE, the self-referential nature of MISS can be solved by iteratively running the algorithm.
As there are entire books devoted to state-space models and how to estimate them (e.g., Kim and Nelson 1999; Durbin and Koopman 2012), we will keep the explanation short and refer the interested reader to these sources. A state-space model is a dynamic model that contains unobserved variables called state variables. It typically consists of two equations. The measurement equation describes how the observed variables are related to the unobserved and to-be-estimated state variables. The state equation describes the dynamic pattern in the state variables, i.e., how they depend on their previous values. When estimating the state-space model, the most likely values of the unobserved variable are determined as a weighted average of the information in the observed variables and that in the past and future values of the state variable. The weights are determined by how reliable the observed data is versus how strongly the variable depends on its past values.
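To make the weighting of observed data against past and future values concrete, the sketch below runs a Kalman filter and a backward (Rauch-Tung-Striebel) smoother on the simplest state-space model, a local-level model with a random-walk state. This is not the MISS algorithm: it ignores the other indicators, and the variances `q` and `r`, the diffuse starting values and the toy series are assumptions made for illustration.

```python
def kalman_smooth(y, q=0.1, r=0.1, x0=0.0, p0=1e6):
    """Local-level model: y_t = x_t + noise(r), x_t = x_{t-1} + noise(q).

    `y` may contain None for missing observations. Returns the smoothed
    state means and variances.
    """
    n = len(y)
    x_pred, p_pred = [0.0] * n, [0.0] * n  # one-step-ahead predictions
    x_filt, p_filt = [0.0] * n, [0.0] * n  # filtered estimates
    x, p = x0, p0
    for t in range(n):
        # Prediction: a random walk keeps the mean and adds state noise.
        x_pred[t], p_pred[t] = x, p + q
        if y[t] is None:
            # Missing observation: carry the prediction forward unchanged.
            x, p = x_pred[t], p_pred[t]
        else:
            gain = p_pred[t] / (p_pred[t] + r)
            x = x_pred[t] + gain * (y[t] - x_pred[t])
            p = (1 - gain) * p_pred[t]
        x_filt[t], p_filt[t] = x, p
    # Backward pass: let later observations revise earlier estimates.
    x_smooth, p_smooth = x_filt[:], p_filt[:]
    for t in range(n - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + g * (x_smooth[t + 1] - x_pred[t + 1])
        p_smooth[t] = p_filt[t] + g * g * (p_smooth[t + 1] - p_pred[t + 1])
    return x_smooth, p_smooth

# Hypothetical indicator observed only at the start and end of a period.
series = [1.0, None, None, None, None, 2.0]
means, variances = kalman_smooth(series)
print([round(m, 3) for m in means])
print([round(v, 3) for v in variances])
```

In this example the gaps are filled along a path between the two observed values, and the smoothed variance is largest in the middle of the gap, where both observations are furthest away.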
In this case, the unknown state variable is the to-be-imputed variable $X^{i}_{t}$. As was the case in the MICE model, we use the (imputed) values of the other variables $X^{-i}$

Comparison with MICE
As Fig. 7 illustrates, the effect of MISS on the imputation of missing values can be substantial. Especially for variables that are available every 5 years (panel a) or that depend strongly on their previous values (panel b), the range of imputed values is drastically reduced when using the state-space technique. This decrease in the variance of the imputed values in turn leads to a smaller variance in the transformed indicators and the IST index.

Monte Carlo Simulation
In order to get a better understanding of how the model performs when the number of missing values increases, we ran a Monte Carlo simulation on a generated dataset whose characteristics mimic the dataset of the IST index. Specifically, we first generated nine variables (1000 observations each) that both have an autoregressive part and are moderately correlated with two other variables. [Equation not recoverable from the source; its accompanying definition refers to an inverse and to an upper triangular matrix filled with 0.5.] After normalizing the data, we subsequently randomly deleted 10% of the observations of the first variable, 20% of the second, and so on until the last variable only has 10% of its original observations left. The MISS algorithm was then used to try to fill in the gaps in the dataset. The MISS estimator ran for 1100 iterations of which the first 1000 were discarded as burn-in, and the entire Monte Carlo simulation was repeated a hundred times.
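The deletion scheme of the simulation can be sketched as follows. Because the data-generating equation is not available here, the sketch uses independent AR(1) series with an assumed coefficient of 0.8 as a stand-in and omits both the cross-correlation between variables and the normalization step. Only the number of variables, the number of observations and the 10% to 90% deletion pattern are taken from the text.

```python
import random

def ar1_series(n, rng, phi=0.8):
    """AR(1) series with standard normal shocks (phi is an assumed value)."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0, 1)
        out.append(x)
    return out

def delete_fraction(series, fraction, rng):
    """Replace a randomly chosen `fraction` of the observations with None."""
    n_drop = round(fraction * len(series))
    drop = set(rng.sample(range(len(series)), n_drop))
    return [None if i in drop else v for i, v in enumerate(series)]

rng = random.Random(42)
data = [ar1_series(1000, rng) for _ in range(9)]

# Variable k (counting from 1) loses k * 10% of its observations.
masked = [delete_fraction(s, (k + 1) / 10, rng) for k, s in enumerate(data)]
missing = [sum(v is None for v in s) for s in masked]
print(missing)
```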

[Figure legend: Multivariate Imputation using Chained Equations (MICE); Multiple Imputation using State-Space models (MISS)]
The results are shown in Table 2. The first two rows compare the imputed values of the MISS algorithm with the original values. The bias is the difference between the original values of the variable and the average value returned by the MISS algorithm. The first row of Table 2 shows the average bias over the Monte Carlo simulations, while the second row shows the standard deviation of the biases. This reveals that even when 90% of the data is missing, the MISS algorithm returns the right coefficients on average. However, the standard deviation of the bias does increase as the number of missing values increases. In line with expectations, row three shows that the confidence with which the MISS algorithm can fill in the missing values gradually decreases as the fraction of missing values increases. Nevertheless, the average standard deviation of the imputed values remains well below 1.4, which is what you would get if these values were filled using random draws from a normal distribution, meaning that they remain informative.
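One reading of the 1.4 benchmark, consistent with the normalization of the data, is the following. Assume that a normalized variable X has unit variance and that a missing value is replaced by an independent draw Z from a standard normal distribution. The symbols X and Z are introduced here for illustration only.

```latex
\operatorname{Var}(X - Z) = \operatorname{Var}(X) + \operatorname{Var}(Z) = 1 + 1 = 2
\quad\Longrightarrow\quad
\operatorname{sd}(X - Z) = \sqrt{2} \approx 1.41
```

Under these assumptions, an uninformative fill-in would miss the true value with a standard deviation of about 1.41, so a dispersion well below that level indicates that the imputations carry information.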
Overall, the Monte Carlo simulations support the earlier finding that when the dataset is reduced to only those countries with more than 50% availability, the results of the MISS algorithm remain the same.
Sachs et al. (2017) Kroll (