Introduction

Environmental, Social and Governance (ESG) investing has enjoyed rapid growth and by some measures has already reached $35 trillion, more than one third of global total assets under management (GSIA 2020). Further rapid growth is expected, as ESG funds’ assets under management could exceed $50 trillion by 2025 (Bloomberg 2021). This trend presents an opportunity for investment managers, and potentially for society as a whole.

ESG scores are a key tool for implementing investment managers’ ESG strategies (Amel-Zadeh and Serafeim 2021). Among the most popular strategies are ESG integration, which employs ESG factors alongside financial factors for portfolio selection, negative screening, whereby assets with the worst (or “worst-in-class”) ESG characteristics are excluded, and “best-in-class”, which increases portfolio allocations towards assets with better ESG characteristics. Typically, investment managers rely on ESG scores by one or several data providers to measure ESG performance. ESG scores therefore are central to ESG investing and, by extension, to a substantial and rising share of investment allocations globally. The use of ESG scores faces some well-known challenges, however. One key challenge for investment managers is the very low correlation of scores between the different major data providers (Berg et al. 2021a, b). Investment managers may thus arrive at different portfolio selections using the same strategy but different ESG data providers.

In this paper, we take an investment manager perspective and aim to circumvent the inconsistencies of ESG scores by deconstructing them and focusing on the underlying data points. Our main question is: Can an investment manager construct a portfolio of equities (from a broad investable universe) that achieves a given financial performance, but with better underlying ESG characteristics? Cutting straight through to the underlying characteristics not only circumvents potential inconsistencies of ESG scoring and weighting methods, but also affords asset managers the flexibility to focus on specific aspects within the ESG sphere, mitigating the ambiguity that the amalgamation of a large set of diverse ESG factors inherently creates. Achieving given risk-return characteristics isolates the effect of implementing different degrees of ESG screening. More importantly, it also resembles the problem that many large investors face in implementing their ESG strategy. For most investment funds, the primary mandate remains financial performance, while ESG considerations must either support or be neutral to the primary mandate.

Our key result is that constructing a portfolio with an ESG objective based on underlying ESG data points comes at virtually no cost in terms of financial performance. Investment managers can construct portfolios with specific ESG benefits, without relying on potentially confounding ESG scores, while still being able to match a given desired financial performance. For instance, with the 33% screening threshold, the portfolio score improves by 18 pp (percentage points) on average across ESG categories (from 62 to 80, with a maximum possible score equal to 100), while the increase is equal to only 11 pp when the overall ESG score is targeted. The screening process has a substantial impact on regional and sectoral exposures of the portfolio, which an otherwise passive investor may not be able to accept. Therefore, we implement a “best-in-class” strategy by excluding firms with the lowest scores and reinvesting the proceeds in firms with the highest scores in the same region-sector. The scores associated with this strategy improve further, though not markedly, while the portfolios have the same regional and sectoral exposures as the benchmark. The main cost of the screening strategy is the tracking error relative to the MSCI ACWI, which is equal to 1.6% per year with a 33% screening (exclusion) threshold.

Related literature

We build on the literature on ESG investing and its financial performance, which has grown very rapidly, as demonstrated by Friede et al. (2015). For a long time, firms with low ESG scores and “sin stocks” were expected to enjoy superior performance (Fabozzi et al. 2008 and Hong and Kacperczyk 2009). Recent analysis has also been spurred by the financial outperformance of firms with high ESG scores during the great financial crisis (Lins et al. 2017) as well as the COVID-19 shock (Garel and Petit-Romec 2021), although there is some evidence also to the contrary (Demers et al. 2021; Pástor et al. 2022; Scatigna et al. 2021). One possible explanation for these diverging research conclusions is the heterogeneity and inconsistency of data, including diverging imputation methods employed in ESG scoring to address data gaps (Kotsantonis and Serafeim 2019). Deconstructing ESG scores into their underlying elements is aimed at shedding light on this discussion.

Depending on their preferences, investors may be willing to trade off financial returns for nonfinancial benefits. Bonnefon et al. (2022) distinguish between two main views of investors’ ethical preference: “value-alignment” investors who have an aversion to companies that do not operate in line with investors’ own values; and “impact-seeking” investors who value investments that generate a positive societal impact. In this paper, we assume that institutional investors caring about ESG characteristics are primarily interested in value alignment. While impact investing is gaining traction, it is still a niche and usually not within the mandate of institutional investors (IFC, 2021). Further, ESG scores are an imperfect measure of impact. Elmalt et al. (2021), for instance, find that the ESG score (and the E score) does not capture differences across firms in their carbon emissions. ESG investing, on the other hand, is closer to value-alignment and ESG scores represent a way to compress a broad number of factors predominantly reflective of moral and social values into a single numeric measure.

In equilibrium, ESG screening should result in lower expected returns for firms with high ESG scores if investors have preferences for firms with high ESG quality. Pedersen et al. (2021) propose the concept of an ESG-efficient frontier—the highest attainable Sharpe ratio for each ESG level. Conceptually, the approach in this paper attempts to maximize the ESG level, while keeping the performance of the portfolio as close as possible to that of a diversified benchmark. Pástor et al. (2021) obtain that in an equilibrium model with ESG preferences, green assets have negative alphas and brown assets positive alphas. However, as pointed out by Pástor et al. (2022), green assets have delivered higher realized returns in the recent period because of the demand pressure driven by investors’ climate concerns.

Our paper is also part of the broader literature on the disclosure of nonfinancial information and has implications for the ongoing global efforts to improve ESG-type corporate disclosures. The impact of mandatory sustainability disclosure has been analyzed in several papers, including Khan et al. (2016) and Ioannou and Serafeim (2019). In particular, Khan et al. (2016) find that firms with high ratings on material sustainability topics significantly outperform firms with poor ratings on these topics in terms of risk-adjusted return. In contrast, firms with good ratings on immaterial sustainability issues do not significantly outperform firms with poor ratings on the same issues. Jouvenot and Krueger (2020) evaluate the effect of a law in the United Kingdom that mandates publicly listed firms to disclose their greenhouse gas (GHG) emissions in a standardized way in their annual reports. The authors find that firms respond to the law by reducing GHG emissions by approximately 16%. Similarly, Mésonnier and Nguyen (2020) investigate the impact of a French regulation that requires institutional investors (except banks) to report annually on both their climate-related exposure and climate change mitigation policy. They find that investors subject to the disclosure requirements curtailed their financing of fossil energy companies by some 40% compared to investors in the control group.

Finally, our paper is related to the literature that investigates quality issues in ESG data. As the development of ESG ratings is relatively recent, there is a large discrepancy between ESG ratings produced by different data providers, raising issues about their reliability and comparability. Berg et al. (2021a, b) identify three sources of divergence in ESG ratings (due to divergence in scope, in measurement, and in weights) and find that differences in measurement explain most of the differences between ESG ratings, meaning that the same ESG attribute is measured using different underlying indicators. Gibson et al. (2021), Billio et al. (2021), and Serafeim and Yoon (2021) analyze the disagreement across data providers and evaluate its impact on future stock returns. High disagreement regarding the ESG quality of a firm tends to be associated with a lower subsequent stock return. Berg et al. (2021a, b) document large and repeated changes in the historical ESG scores. While they find a positive relation between ESG scores and stock returns when updated data are used, the authors do not observe such a relationship with the initial data. Sahin et al. (2021) document the large proportion of missing information, which calls the reliability of ESG scores into question.

Data

To define the scope of the analysis, we make a few fundamental choices. The first is the scope of underlying ESG data points. Our analysis uses Refinitiv (formerly Thomson Reuters Asset4) data, which is one of the few major ESG data providers that make the underlying data points publicly available.Footnote 1 This database has been used in a number of academic papers.Footnote 2 We use all ESG data points, comprising 186 comparable measures, which are subsequently amalgamated into the three fundamental pillars of E, S, and G. The second critical decision pertains to selecting the specific ESG characteristics that underlie our analysis. This decision inherently carries a subjective element, contingent upon the preferred ESG scheme of an investment manager. In our analysis, we consider the ESG categories in the Refinitiv dataset, which are likely representative of the thematic objectives commonly sought by investment managers. The third decision pertains to designing the investment strategy. We assume an institutional investor who is otherwise passive, i.e., who carefully tracks the composition of an internationally diversified benchmark and evaluates the portfolio’s performance relative to this benchmark. The benchmark’s constituents, therefore, encompass our universe of investable assets. We select a comprehensive equity index, specifically the MSCI All Country World Index (ACWI), to showcase the broad geographical and sectoral applicability of our findings. The choice of such a broad benchmark index is made feasible by the extensive coverage of listed firms in the Refinitiv ESG data.

Construction of the scores

The methodology adopted by Refinitiv for scoring firms is relatively complex, as it combines a vast amount of different types of data and different aggregation schemes (see Refinitiv 2021). At the same time, it is strongly data-driven and transparent—not least due to the disclosure of both the underlying methodology and data points. First, the database is based on 450 data points (or metrics), which can be Boolean indicators and numeric indicators, such as ratios and analytics. Of these 450 metrics, 186 comparable measures are actually used for the ESG scoring. Other data points cover different topics of interest but are not directly used for the ESG scoring. The 186 comparable measures are then aggregated, using different weightings, into 10 categories. The 10 categories, in turn, are aggregated further to compute the three (E, S, and G) pillars.

The definition and characteristics of the pillars and categories are summarized in Table 1 (Refinitiv 2021). ESG pillar scores are obtained by multiplying category scores with their category weights. For the E and S pillars, category weights vary across industries depending on the materiality of the associated indicators. Some indicators are material for some industries but are not included in the calculation of the scores for the other industries.Footnote 3 For the G pillar, the weights of the three categories are the same across all industries, as indicated in the table. The overall ESG score is based on combining the three pillars, with weights that are specific to the industry of the assessed firm. All scores are between 0 and 100, with 100 being the best possible score.
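As a minimal sketch of the aggregation step described above, the snippet below computes a pillar score as a weighted combination of category scores. The category scores and weights shown are hypothetical placeholders rather than Refinitiv's actual industry weights, and dividing by the sum of the weights (so that the pillar score stays between 0 and 100) is an assumption of this illustration.

```python
def pillar_score(category_scores, category_weights):
    """Weighted average of category scores, each between 0 and 100.

    Both arguments are dicts keyed by category name. Weights are
    rescaled to sum to one within the pillar (an assumption here).
    """
    total_weight = sum(category_weights[c] for c in category_scores)
    if total_weight <= 0:
        raise ValueError("category weights must sum to a positive number")
    weighted_sum = sum(category_scores[c] * category_weights[c]
                       for c in category_scores)
    return weighted_sum / total_weight


# Hypothetical firm: E-pillar category scores and industry weights
e_scores = {"Resource use": 60.0, "Emission": 40.0, "Innovation": 0.0}
e_weights = {"Resource use": 0.4, "Emission": 0.4, "Innovation": 0.2}
e_pillar = pillar_score(e_scores, e_weights)
```

In this made-up case, a zero Innovation score pulls the E pillar below both of the other category scores, which is the kind of effect discussed in the section on the proportion of zeros.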

Table 1 Definition and Characteristics of Pillars and Categories

An important aspect of the Refinitiv database is the data collection process. For the list of firms covered by Refinitiv, analysts collect information on individual ESG measures using numerous publicly available sources (including annual reports, CSR reports, company websites, or news sources). Essentially, the data collected by Refinitiv reflect the disclosure policy of the firms, except for some particular situations such as controversies, which also reflect reports from global media. A potential drawback of the database is that data are subject to backward changes of up to five years as new information is disclosed, so changes to the historical data are relatively likely.Footnote 4

The Refinitiv database has two important advantages, which make it highly suitable for economic analyses. First, it provides the raw underlying data for the 186 comparable measures used to calculate the 10 category scores. These data points allow us to identify what generates the particular distribution of the category scores, as we detail in Section “Disclosure of ESG Information.” Second, the methodology used to build the scores is transparent, which also allows us to precisely interpret the scores.

Data coverage

We use all data available in the Refinitiv database, from 2010 to 2019. Our analysis of ESG data ends in 2019 for two reasons. First, there is a substantial time lag for a complete update of the database for a given financial year. At the time of our last download (March 2021), some data were already available for 2020, but for a substantial number of firms the data were still missing. Second, as we describe in the Section “ESG Screening at the Category Level,” we evaluate the financial performance of a portfolio built at the end of year t using stock returns in year t + 1, so that the performance of the portfolio built with 2019 data is based on financial returns at the end of 2020. Our sample is defined as the complete set of firms included in the Refinitiv database for which a market capitalization is available in a given year. At the time of our last download, the database contained 10,142 firms that had been evaluated at some point in time.

Table 2 reports summary statistics on the number of firms for which both market capitalization and ESG scores are available. All numbers in the table are relative to the firms covered by Refinitiv ESG in 2020: The total number of firms worldwide in the database is 10,142 as of 2020. The proportion of firms with available market capitalization in 2010 was equal to 74.8%. Among these 7,590 firms, 3,911 (51.5%) also have a Refinitiv ESG score in the respective year. The proportion of firms with an ESG score remains fairly stable in our sample until 2014, at slightly above 50% of the firms in the database. Starting in 2015, the coverage improves steadily, with a maximum in 2019, with 82.3% of firms in our sample. In general, other scores (3 pillars and 10 categories) have a coverage that is essentially the same as the aggregate ESG score.

Table 2 Global Coverage of Refinitiv Database

In 2019, the regional coverage in terms of market capitalization is the following: 39% of firms are from North America, 20% from Europe, 14% from the Pacific, and 23% from Emerging countries. The table also reveals that among the firms in the Refinitiv database, the proportion of firms with ESG scores varies substantially across regions. On average, it is relatively low in North America at the beginning of the sample, at below 50% until 2014. In Emerging countries, the proportion of firms with an available ESG score has been above 50% since 2011. In Europe and the Pacific, the ESG score coverage is relatively large over the full sample.

Disclosure of ESG information

As discussed in the introduction, there is currently no generally agreed regime for ESG-type disclosures. We describe the two issues raised by ESG data (missing values in numeric indicators and proportion of zeros in Boolean indicators) and their implications for category scores.

Missing values in numeric indicators

For numeric indicators, a score (based on the relative percentile ranking) is calculated only if the firm has reported this information and a missing value is assigned when Refinitiv cannot find the information in publicly available reports. To compute the proportion of valid (or non-missing) values for a given numeric indicator in a given year, we calculate the number of firms for which a given indicator is available in that year; then, for this given indicator, we identify the industries for which the indicator is material (i.e., the industries for which a given indicator is used for the ESG rating). Finally, we calculate the number of firms with valid data in these industries and divide by the total number of firms in the given industry. Table 3 (Panel A) reports summary statistics on the proportion of valid values for numeric indicators, for each pillar and category, which we interpret as a proxy for the disclosure policy of the firms. Overall, the proportion of valid values is relatively low, close to 40% on average for all numeric indicators over the sample. In fact, there is a large gap between the E and S pillars (20 and 30%) and the G pillar (80%). Within a given pillar, this proportion is usually homogeneous. Indicators related to Emission reduction have a proportion of valid values equal to 22% worldwide on average. These results reveal the lack of disclosure, in particular regarding measures taken by firms to protect the environment (approximately 10% of firms provide data on their Renewable energy use ratio) or to reduce greenhouse gas emissions (25% of firms report data on their CO2 equivalent Scope 3 indirect emissions). The proportion of valid values is particularly low for the Innovation category in the E pillar and the Product responsibility in the S pillar (close to 10% on average worldwide).
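The calculation of the proportion of valid values can be sketched as follows. This is only an illustration under stated assumptions: the record layout (one `(industry, value)` pair per firm, with `None` marking a missing value), the function name, and the sample figures are all hypothetical and not taken from the Refinitiv data.

```python
def valid_share(records, material_industries):
    """Share of firms reporting a numeric indicator, restricted to the
    industries in which the indicator is material.

    records: list of (industry, value) pairs, one per firm;
             value is None when the firm did not disclose the indicator.
    material_industries: set of industries where the indicator is used
             for the ESG rating.
    """
    in_scope = [value for industry, value in records
                if industry in material_industries]
    if not in_scope:
        return float("nan")  # indicator is not material anywhere
    n_valid = sum(value is not None for value in in_scope)
    return n_valid / len(in_scope)


# Hypothetical indicator (e.g., a renewable energy use ratio) in one year
records = [
    ("Utilities", 0.12), ("Utilities", None),
    ("Energy", 0.30), ("Energy", None),
    ("Banks", None),  # not material for banks, so ignored
]
share = valid_share(records, {"Utilities", "Energy"})
```

Restricting the denominator to material industries matters: counting the bank in this toy example would lower the share from one half to two fifths.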

Table 3 Proportion of Valid Values and Zero Values

Proportion of zeros at category level

Boolean indicators usually do not have missing values in the Refinitiv database. When the information concerning a Boolean question is not found in the public disclosure of a firm, Refinitiv assigns a default value to a Boolean indicator. The default value is 0 when answering ‘yes’ to the question is positive from a sustainability point of view (e.g., “Does the firm conduct corporate social responsibility reporting?”) and 1 in the rare case when answering ‘yes’ is negative (e.g., “Is the structure of the company board classified?”). For convenience, we assume a default value of 0 in the following.Footnote 5 We note that this methodological choice would incentivize companies to improve their nonfinancial information disclosure if they actually improve their policy.

Because of the choice to assign a value of 0 to missing Boolean indicators, the proportion of valid values is equal to 100% for most categories, with a very large proportion of firms with a value of 0 for some Boolean indicators. For instance, in 2019 the proportion of 0 is equal to 74% for the reporting on firm’s environmental expenditures and 95% for the reporting on the total individual compensation of all executives and board members. When we turn to category scores, this attribution approach may have a considerable impact because some categories (Human rights and CSR strategy) are exclusively based on Boolean indicators. As a consequence, for these categories, a substantial proportion of firms report a score equal to 0.

In Table 3 (Panel B), we report the proportion of firms with a category score equal to 0 for the three ESG pillars and the ten categories at the world level. The distribution of scores is also displayed in Figs. 1, 2 and 3. As the table reveals, the problem is particularly acute for the E pillar, because the peak of scores equal to 0 also contaminates the E pillar score itself. Even if there are numeric indicators for the three E categories, they also have a large proportion of missing values, so that the score of the categories is often based on Boolean indicators only and therefore may result in a score equal to 0. This problem is substantial for Innovation, as 56% of firms worldwide have an Innovation score equal to 0 in 2019. Emission and Resource use scores are also affected by this issue but to a lesser extent, with a proportion of 0 equal to 28 and 29% worldwide in 2019.

Fig. 1 Distribution of the E Score and underlying categories, 2019. This figure displays the cross-sectional distribution of scores for the E pillar and its categories for 2019

Fig. 2 Distribution of the S Score and underlying categories, 2019. This figure displays the cross-sectional distribution of scores for the S pillar and its categories for 2019

Fig. 3 Distribution of the G Score and underlying categories, 2019. This figure displays the cross-sectional distribution of scores for the G pillar and its categories for 2019

Regarding the S categories, we also find a large fraction of firms with a score equal to 0 for the Human rights and Product responsibility categories (42 and 10% of firms, respectively). Finally, as the CSR strategy category is based on Boolean indicators only, it reports approximately 34% of scores equal to 0 in 2019.

Scores at category level

The large frequency of scores equal to 0 for some categories may introduce some distortion in the resulting average score across categories and therefore across pillars. For this reason, we now consider the temporal evolution of scores across categories. Table 4 confirms the large differences in the average score across ESG categories. Categories based on Boolean indicators only (Human rights in the S pillar and CSR strategy in the G pillar) or on a small proportion of numeric indicators with a large proportion of missing values (Innovation in the E pillar) are associated with low average scores. On average, scores are lower for the E pillar than for the S and G pillars.

Table 4 Average Scores by Region

The table also reveals a substantial heterogeneity across regions. Overall, European firms have higher scores, in particular for E and S categories. Firms in North America and Emerging countries have lower E scores. On average, scores tend to improve over time. Pacific and Emerging countries benefit from large increases in the ESG score, in particular because of the E and S pillars. In contrast, the ESG score does not improve in North America, mainly because of the decrease in the E pillar.

As reported in Table 5, we find interesting results across sectors. The Emission score is much higher for firms in energy, utilities, and basic materials (45, 47, and 44, respectively, in 2019), although these industries emit large quantities of greenhouse gases. In contrast, firms in health care, financial, and technology sectors have very low Emission scores (20, 30, and 32, respectively), although they have low carbon intensity. This difference has two sources. First, a large fraction of energy and utilities companies report on their emission policy, for instance whether they have environmental partnerships, a policy to improve emission reduction, or targets or objectives to be achieved on emission reduction. Large carbon emissions can therefore be at least partly compensated, at the Emission score level, by policy measures taken by the company.Footnote 6 Second, firms in health care or the financial sector often do not report information on these topics, potentially because they do not consider these issues to be material. As a result, they exhibit low Emission scores, even if they generate low carbon emissions. Overall, the average E score ranges between 17.6 for health care firms and 44.8 for utilities in 2019. This contrast due to reporting biases is less pronounced for the S and G pillars. The average S score ranges between 42.9 and 46.3 in 2019 across sectors. The average G score is between 40.6 and 53.5.

Table 5 Average Scores by Sector

ESG Screening at the category level

Our analysis identifies two issues with the implementation of an ESG-based screening investment strategy at the category level. First, the proportion of scores equal to 0 is substantial for 6 out of 10 categories. Setting a low value of the screening threshold (for instance, excluding 1% or 5% of the firms with the lowest scores and reinvesting in the remaining firms proportionately) would result for these categories in the exclusion of some firms that would have a score equal to 0 while other firms with a score of 0 would be kept in the portfolio. Therefore, the screening at the category level is well suited for relatively large screening levels (say, 25% or 33%), as we illustrate below.Footnote 7

Second, given the large heterogeneity of scores across regions or sectors, the screening process will imply significant regional and sectoral biases in the ESG portfolio relative to the market exposures. Such biases would be an issue for investors seeking to hold an otherwise passive portfolio. To address this issue, we proceed as follows. We assume a benchmark portfolio, which reproduces the structure of the targeted market and provides representative weights for the companies. We construct an ESG portfolio based on excluding firms with the lowest scores associated with a given ESG category. In the first strategy, the proceeds of the excluded firms are reinvested proportionately in the remaining firms. As this approach generates large regional and sectoral biases, we consider a second strategy, in which the screening is performed at the region-sector level: The proceeds of the exclusion of low score firms in a given region-sector are reinvested in high score firms in the same region-sector. This strategy is akin to what is often called a “best-in-class” approach, whereby investment managers select the firms with the highest scores within their sector and often also region.
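The region-sector strategy can be illustrated with a short sketch. Several details are assumptions made for the example rather than the procedure used in the paper: within each region-sector group, firms are excluded from the lowest score upward while their cumulative weight stays within a share `theta` of the group's weight, the top-scored firm of each group is always kept, and the proceeds are reinvested in the remaining firms of the group in proportion to their benchmark weights. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def region_sector_screening(firms, theta):
    """firms: dict firm_id -> (group, score, benchmark_weight).
    Returns dict firm_id -> portfolio weight, with each region-sector
    group keeping the same total weight as in the benchmark."""
    groups = defaultdict(list)
    for firm_id, (group, score, weight) in firms.items():
        groups[group].append((score, firm_id, weight))

    weights = {}
    for members in groups.values():
        members.sort()  # ascending score
        group_weight = sum(w for _, _, w in members)
        budget = theta * group_weight
        excluded, cum = set(), 0.0
        for _, firm_id, w in members[:-1]:  # never exclude the top firm
            if cum + w > budget + 1e-12:
                break
            cum += w
            excluded.add(firm_id)
        kept_weight = group_weight - cum
        for _, firm_id, w in members:
            # Rescale survivors so the group weight is unchanged
            weights[firm_id] = (0.0 if firm_id in excluded
                                else w * group_weight / kept_weight)
    return weights


# Hypothetical benchmark with two region-sector groups
firms = {
    "A": ("EU-Energy", 20, 0.10), "B": ("EU-Energy", 70, 0.30),
    "C": ("US-Tech", 10, 0.15), "D": ("US-Tech", 50, 0.15),
    "E": ("US-Tech", 90, 0.30),
}
rs_weights = region_sector_screening(firms, theta=0.25)
```

By construction, the weight of each region-sector in the resulting portfolio equals its benchmark weight, which is the property that makes the strategy acceptable to an otherwise passive investor.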

As a large internationally diversified stock market benchmark, we use the MSCI ACWI, which covers developed and emerging markets. For this index, the list of constituents and the corresponding market weights are available, which we use to define the reference weights for regions and sectors. From now on, we consider the subset of firms in the Refinitiv database that also belong to the MSCI ACWI.

As Table 6 reports, the coverage of the MSCI ACWI (and its regional subindexes) with valid scores is particularly high. It is above 95% in developed markets from 2010 on. For emerging markets, the coverage is above 90% from 2010 on and above 95% from 2017 on. The focus on an index of relatively large firms mitigates the potential size bias pointed out in several papers, notably Drempetic et al. (2020).

Table 6 Coverage of MSCI Constituents with Refinitiv Database

Global screening

The global screening is based on all the firms in the benchmark with scores available at the end of year t. For a screening threshold of \(\theta\) (say, 25%), we identify all the firms with the lowest scores until their cumulative market cap represents a proportion \(\theta\) of the market cap of the benchmark portfolio.

For a given score S, we denote by \({q}_{\theta ,t}^{\left(S\right)}\) the threshold corresponding to probability \(\theta\). The list of firms to be excluded is given by \({I}_{Ex,t}={\left\{{1}_{\left\{{S}_{i,t}\le {q}_{\theta ,t}^{\left(S\right)}\right\}}\right\}}_{i=1}^{{N}_{t}}\), where \({N}_{t}\) is the number of firms available in year t. The threshold \({q}_{\theta ,t}^{\left(S\right)}\) is defined such that the sum of the market weights of excluded firms, \({\sum }_{i=1}^{{N}_{t}}{w}_{i,t}^{(b)}{ 1}_{\left\{{S}_{i,t}\le {q}_{\theta ,t}^{\left(S\right)}\right\}}\), is as close as possible to the targeted probability, \(\theta\), where \({w}_{i,t}^{(b)}\) is the weight of firm i in the benchmark.

The proceeds of the exclusion are reinvested in the remaining firms, whose list is given by \({I}_{In,t}={\left\{{1}_{\left\{{S}_{i,t}>{q}_{\theta ,t}^{\left(S\right)}\right\}}\right\}}_{i=1}^{{N}_{t}}\) in proportion of their market weight. The vector of weights in the pure exclusion portfolio p is therefore given by:

$${w}_{i,t}^{\left(p\right)}= 0 \quad {\text {for}} \quad i\in {I}_{Ex,t} \quad\mathrm{with }\quad{\sum }_{j\in {I}_{Ex,t}}{w}_{j,t}^{\left(b\right)} \approx \theta$$
$${w}_{i,t}^{\left(p\right)}={w}_{i,t}^{\left(b\right)}\left(\frac{1}{{\sum }_{j\in {I}_{In,t}}{w}_{j,t}^{\left(b\right)}}\right) \quad\mathrm{ for}\,\,\, i\in {I}_{In,t}$$

The portfolio composition is consistent with the portfolio of an otherwise passive investor, as the relative weights of the included firms (\({I}_{In,t}\)) are the same as in the benchmark.
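The global exclusion and proportional reinvestment can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the dictionary inputs, and the sample data are hypothetical, and treating all firms with the same score as one block (so that tied firms are either all excluded or all kept) is an assumption of the sketch.

```python
def global_screening(scores, bench_weights, theta):
    """Pure exclusion portfolio.

    scores, bench_weights: dicts keyed by firm id (weights sum to 1).
    theta: targeted share of benchmark market cap to exclude.
    The score threshold q is chosen so that the excluded market weight
    is as close as possible to theta.
    """
    levels = sorted(set(scores.values()))
    best_q, best_gap, cum = None, theta, 0.0  # "exclude nothing" has gap theta
    for q in levels[:-1]:  # never exclude the top score level
        cum += sum(w for f, w in bench_weights.items() if scores[f] == q)
        gap = abs(cum - theta)
        if gap < best_gap:
            best_q, best_gap = q, gap

    excluded = {f for f in scores
                if best_q is not None and scores[f] <= best_q}
    kept_weight = sum(w for f, w in bench_weights.items()
                      if f not in excluded)
    # Remaining firms keep their relative benchmark weights
    return {f: 0.0 if f in excluded else w / kept_weight
            for f, w in bench_weights.items()}


# Hypothetical five-firm benchmark
scores = {"A": 0, "B": 0, "C": 30, "D": 55, "E": 80}
bench = {"A": 0.05, "B": 0.05, "C": 0.15, "D": 0.25, "E": 0.50}
weights = global_screening(scores, bench, theta=0.25)
```

In the example, firms A, B, and C account for exactly 25% of the market cap and are excluded; D and E are scaled up by 1/0.75, so their weight ratio is the same as in the benchmark.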

Stock market returns of the subsequent year are used to compute the financial performance of the portfolio, so a portfolio built at the end of year t is evaluated at the end of year t + 1. We consider investors with a preference for some particular dimension of the ESG pillars (for instance for Emission reduction or Human rights).

Table 7 reports summary statistics for screening portfolios based on the 2010–2019 sample. MSCI ACWI represents the market index, including all firms, even those with no ESG score. The row labeled ‘Benchmark’ represents the portfolio based on MSCI ACWI constituents for which Refinitiv ESG scores are available. As the table reveals, for the world index, we lose only 2.4% of the market cap on average due to the lack of Refinitiv scores among firms within the MSCI ACWI.

Table 7 Summary Statistics on Exclusion Portfolio—Global Exclusion / Reinvestment

The first two columns represent the proportion of firms and the proportion of the market value with scores equal to zero, while the next two columns indicate for a given threshold how many firms are actually excluded and which fraction of the market cap is excluded. The comparison of these columns allows us to evaluate the impact of zero scores on the composition of the screening portfolio. First, we note that, as low score firms also tend to have a low market cap, we in fact exclude a rather large fraction of smaller firms. For the 10% screening criterion (Panel A), we exclude 9.9% of the market cap but 26.2% of the firms with the lowest ESG scores. Similarly, for the 25% screening (Panel B), we exclude 24.7% of the market value but 50.4% of the firms. These proportions are equal to 33 and 60.2%, respectively, for the 33% screening (Panel C).

Second, we turn to categories with a large fraction of firms with a score equal to zero. For the Innovation category, we find that 40.3% of firms in the MSCI index (26.1% of the market cap) have a score equal to zero. Consequently, the lowest screening threshold that we can apply to build a screening portfolio is the 26.1% quantile (to avoid arbitrary selection of firms with a score equal to zero). Similarly, for the Human Rights category, we cannot exclude less than 21.9% of the market cap (40% of the firms). Consequently, for these two categories, the impact of the screening process is much larger than for other scores because it actually corresponds to an approximately 25% screening. For the CSR strategy category, the lower bound for screening is 9.5% of the market cap. These results clearly illustrate the impact of the scoring methodology on the screening strategy. For these categories, because of the large proportion of firms with scores equal to zero, a screening strategy with a low screening threshold cannot be implemented.
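The lower bound on the screening threshold follows directly from the market-cap share of firms with a score of zero. A minimal sketch, with a hypothetical function name and made-up data, is:

```python
def min_screening_threshold(scores, bench_weights):
    """Smallest implementable screening threshold for a category:
    the benchmark market-cap share of firms whose score is exactly 0.
    Any lower threshold would require an arbitrary choice among
    firms that all share the same zero score."""
    total = sum(bench_weights.values())
    zero_cap = sum(w for f, w in bench_weights.items() if scores[f] == 0)
    return zero_cap / total


# Hypothetical category in which two of four firms score zero
cat_scores = {"A": 0, "B": 0, "C": 35, "D": 60}
cat_weights = {"A": 0.10, "B": 0.16, "C": 0.34, "D": 0.40}
lower_bound = min_screening_threshold(cat_scores, cat_weights)
```

Here half of the firms have a zero score but they represent only 26% of the market cap, mirroring the pattern in which the share of firms with zero scores exceeds their share of market value.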

The gain on the score (difference between the portfolio score and the benchmark score) is substantial, usually between 4 and 7 points for the 10% threshold. In relative terms (gain divided by benchmark score), the gain is between 6 and 10%. One factor that limits the score gain is that the portfolio is market cap-weighted. As mentioned above, large firms tend to have higher scores than small firms (Footnote 8). Therefore, the benchmark portfolio is already tilted in favor of firms with relatively high scores.
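As an illustration of how the absolute and relative score gains are computed, the following sketch assumes that the proceeds of the exclusion are reinvested proportionally in the remaining firms, so that the portfolio score is the cap-weighted average score of the firms kept. The helper names and the three-firm example are hypothetical.

```python
def weighted_score(scores, weights):
    """Cap-weighted average score (weights are renormalized to sum to one)."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total


def score_gain(scores, weights, excluded):
    """Return (absolute gain in points, gain relative to the benchmark score)."""
    drop = set(excluded)
    kept_scores = [s for i, s in enumerate(scores) if i not in drop]
    kept_weights = [w for i, w in enumerate(weights) if i not in drop]
    benchmark = weighted_score(scores, weights)
    portfolio = weighted_score(kept_scores, kept_weights)
    gain = portfolio - benchmark
    return gain, gain / benchmark


# Hypothetical example: the lowest-score firm carries 10% of the market cap.
gain, relative = score_gain([20, 60, 80], [0.1, 0.4, 0.5], excluded=[0])
print(round(gain, 2), round(relative, 3))
```

Because the excluded firm is small, its removal shifts the cap-weighted score only moderately, which is the mechanism described above: the benchmark is already tilted toward large, high-score firms.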

The score gain is the highest for the E categories. For the same 10% proportion of excluded firms, the Resource use and Emission scores deliver the highest score gains, above 7 pp. We note, however, that the gain on the E pillar score is much smaller than the average gain in the individual E categories. The reason is that category scores are summed at the firm level first, so that the large proportion of zeros observed for the Innovation score has a limited impact on the distribution of the E pillar score. As category scores are not perfectly correlated across firms, it is more difficult to improve the E pillar score than its components separately. We observe the same result for the S and G pillar scores and for the aggregate ESG score.

The last columns of Table 7 provide statistics on the financial performance of the screening portfolios, including the Sharpe ratio, the Treynor ratio, the Omega ratio, the annualized risk-adjusted return, and the annualized tracking error. If we denote by \({R}_{i,t+1}\) the return of firm i between dates t and t + 1, the ex-post return of the portfolio p is equal to \({R}_{p,t+1}={\sum }_{i=1}^{{N}_{t}}{w}_{i,t}^{(p)}{R}_{i,t+1}\). Similarly, \({R}_{b,t+1}={\sum }_{i=1}^{{N}_{t}}{w}_{i,t}^{(b)}{R}_{i,t+1}\) is the ex-post return of the MSCI ACWI portfolio.

  1. The ex-post Sharpe ratio represents the average excess return of the portfolio divided by its volatility over the sample period, \(S{R}_{p}=({\overline{R} }_{p}-{\overline{R} }_{f})/{\sigma }_{p}\), where \({\overline{R} }_{p}\) and \({\overline{R} }_{f}\) denote the average return of the portfolio and the average risk-free rate, respectively, and \({\sigma }_{p}\) denotes the volatility of the portfolio return.

  2. The Treynor ratio represents the average excess return of the portfolio divided by its market beta (systematic risk) over the sample period, \({TR}_{p}=({\overline{R} }_{p}-{\overline{R} }_{f})/{\beta }_{p}\), where \({\beta }_{p}\) is the parameter of the market return in the linear regression of the portfolio return on the market return.

  3. The Omega ratio of Keating and Shadwick (2002) is defined as the probability-weighted ratio of gains versus losses for some threshold return target, which can be written as \(\Omega (r)={\int }_{r}^{\infty }[1-F(x)]dx/{\int }_{-\infty }^{r}F(x)dx\), where \(F(.)\) denotes the cumulative distribution function for the returns and r denotes the return target.

  4. The annualized risk-adjusted return (alpha) represents the parameter of the intercept estimated in the linear regression of the portfolio return on the five Fama-French factors (Fama and French 2015; Footnote 9).

  5. The tracking error is the annualized volatility of the difference (\({R}_{p,t+1}-{R}_{b,t+1}\)) between the ex-post return of the portfolio and the ex-post return of the MSCI ACWI.
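The measures listed above can be sketched on sample return series as follows. This is only an illustration under stated assumptions: the function names are hypothetical, returns are expressed per period (the Sharpe and Treynor ratios are not annualized here), the Omega ratio is computed from its sample analogue using the identity \({\int }_{r}^{\infty }[1-F(x)]dx=E[\mathrm{max}(X-r,0)]\), and the default of 12 periods per year assumes monthly data. The five-factor alpha is omitted because it requires the Fama-French factor series.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(rp, rf):
    """(mean portfolio return - mean risk-free rate) / portfolio volatility."""
    return (mean(rp) - mean(rf)) / stdev(rp)


def market_beta(rp, rm):
    """OLS slope in the regression of the portfolio return on the market return."""
    mp, mm = mean(rp), mean(rm)
    cov = sum((p - mp) * (m - mm) for p, m in zip(rp, rm))
    var = sum((m - mm) ** 2 for m in rm)
    return cov / var


def treynor_ratio(rp, rm, rf):
    """Mean excess return divided by the market beta."""
    return (mean(rp) - mean(rf)) / market_beta(rp, rm)


def omega_ratio(returns, target=0.0):
    """Sample analogue: sum of gains above the target over sum of losses below it."""
    gains = sum(max(x - target, 0.0) for x in returns)
    losses = sum(max(target - x, 0.0) for x in returns)
    return math.inf if losses == 0.0 else gains / losses


def tracking_error(rp, rb, periods_per_year=12):
    """Annualized volatility of the return difference (assumes i.i.d. scaling)."""
    diffs = [p - b for p, b in zip(rp, rb)]
    return stdev(diffs) * math.sqrt(periods_per_year)


# Hypothetical per-period returns, for illustration only.
rp = [0.02, -0.01, 0.03, 0.00]   # portfolio
rm = [0.01, -0.02, 0.02, 0.01]   # market / benchmark
rf = [0.001] * 4                 # risk-free rate
print(sharpe_ratio(rp, rf), treynor_ratio(rp, rm, rf),
      omega_ratio(rp), tracking_error(rp, rm))
```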

The ex-post Sharpe ratio of the screening portfolios is close to, and not significantly different from, that of the MSCI ACWI. The CSR strategy score generates the lowest Sharpe ratio (0.59 versus 0.65 for the MSCI ACWI), while the Community score improves the Sharpe ratio to 0.69. The same result applies to the Treynor ratio and the Omega ratio, with marginal differences between the MSCI ACWI and the screening portfolios. We also compute the annualized risk-adjusted return (alpha) of the various portfolios. In most cases, the alphas of the screening portfolios are positive, although not significantly different from 0 or from the alpha of the MSCI ACWI.

Finally, screening portfolios exhibit ex-post tracking errors of 1.16 and 1.6% relative to the MSCI ACWI for the 10 and 33% thresholds, respectively. A significant portion of this tracking error stems from the exclusion of MSCI ACWI constituents that lack ESG scores, as required by our screening method. Table 7 makes this clear: the ‘Benchmark’ portfolio, which mirrors the MSCI ACWI but comprises only firms with ESG scores, has an annual tracking error of 0.9% relative to the MSCI ACWI. Consequently, the screening process itself accounts for only 0.26 and 0.7 pp of the tracking error in the two screening portfolios, reflecting the moderate impact of ESG score-related exclusions.

One reason why the Sharpe ratio of screening portfolios differs from the benchmark Sharpe ratio may be that the screening process implies some changes in the regional and sectoral exposures. Average scores suggest that firms in North America are likely to be underweighted in favor of European firms and that health care firms are likely to be underweighted in favor of financial firms or utilities for almost all categories. As an illustration of the impact of screening on risk exposures, we consider the screening based on the E score with the 33% threshold. On average over the sample, the screening would imply an overweighting of 6.4 pp (from 23.7 to 28.7%) of European firms and an underweighting of 3.5 pp (from 11 to 7.5%) of firms in Emerging countries. Similarly, the screening would imply an overweighting of 1.9 pp (from 11.7 to 13.6%) of financial firms and an underweighting of 1.7 pp (from 17.8 to 16.1%) of health care firms.

Such an impact on regional and sectoral exposures would be an issue for investors seeking to improve the ESG quality of their portfolio but without altering their risk exposures. We address this issue in the next section.

Best-in-class screening at the region-sector level

We now consider a screening strategy that maintains the same regional and sectoral exposures as in the MSCI ACWI. Therefore, for a screening threshold \(\theta\), we exclude in each region r and sector s the firms with the lowest scores until their cumulative market cap represents a proportion \(\theta\) of the market cap of the region r and sector s in the benchmark portfolio. We denote by \({R}_{i}\) and \({S}_{i}\) the region and sector of firm i (the sector \({S}_{i}\) should not be confused with the score \({S}_{i,t}\) of firm i at date t). The set of firms in a given region r and sector s is denoted by \({I}_{t}(r,s)=\{{1}_{\left\{{R}_{i}=r,{S}_{i}=s\right\}}{\}}_{i=1}^{{N}_{t}}\), for any r and s. The list of firms to be excluded in this region-sector is the subset \({I}_{Ex,t}(r,s)=\{{1}_{\left\{{R}_{i}=r,{S}_{i}=s,{S}_{i,t}\le {q}_{\theta ,t}^{\left(S\right)}\right\}}{\}}_{i=1}^{{N}_{t}}\). A proportion \({\theta }_{t}(r,s)={\sum }_{i\in {I}_{Ex,t}(r,s)}{w}_{i,t}^{\left(b\right)}/{\sum }_{i\in {I}_{t}(r,s)}{w}_{i,t}^{\left(b\right)}\) of the market value is excluded in region r and sector s.

The proceeds are reinvested in the firms with the highest scores in the same region-sector until their cumulative market cap represents a proportion \({\theta }_{t}(r,s)\) of the market cap of the region r and sector s. The set of firms to be overweighted in a region r and sector s is \({I}_{Ov,t}(r,s)=\{{1}_{\left\{{R}_{i}=r,{S}_{i}=s,{S}_{i,t}>{q}_{1-{\theta }_{t}\left(r,s\right),t}^{(S)}\right\}}{\}}_{i=1}^{{N}_{t}}\). The list of all the other firms is the subset \({I}_{I,t}(r,s)=\{{1}_{\left\{{R}_{i}=r,{S}_{i}=s,{q}_{\theta ,t}^{\left(S\right)}<{S}_{i,t}\le {q}_{1-{\theta }_{t}\left(r,s\right),t}^{(S)}\right\}}{\}}_{i=1}^{{N}_{t}}\). The vector of weights is given by:

$${w}_{i,t}^{\left(p\right)}=0 \,\,{\text{for}}\,\, i\in {I}_{Ex,t} \,\,{\text{with}}\,\,{\sum }_{i\in {I}_{Ex,t}}{w}_{i,t}^{\left(b\right)}\approx \theta$$
$${w}_{i,t}^{\left(p\right)}={w}_{i,t}^{\left(b\right)} \,\,{\text{for}}\quad i\in {I}_{I,t}$$
$${w}_{i,t}^{\left(p\right)}={w}_{i,t}^{\left(b\right)}\left(1+\frac{{\sum }_{j\in {I}_{Ex,t}({R}_{i},{S}_{i})}{w}_{j,t}^{\left(b\right)}}{{\sum }_{j\in {I}_{Ov,t}({R}_{i},{S}_{i})}{w}_{j,t}^{\left(b\right)}}\right)\,\, \mathrm{ for} \,\,i\in {I}_{Ov,t}$$

where \({\sum }_{j\in {I}_{Ov,t}({R}_{i},{S}_{i})}{w}_{j,t}^{\left(b\right)}\approx {\sum }_{j\in {I}_{Ex,t}({R}_{i},{S}_{i})}{w}_{j,t}^{\left(b\right)}\).
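The reweighting rule above can be sketched as follows. This is a minimal illustration rather than the procedure used to build Table 8: the function name `best_in_class_weights`, the dictionary layout, and the example firms are hypothetical, and ties in scores (such as blocks of zero scores) are not treated specially. Within each region-sector, the lowest-score firms are excluded until their cumulative benchmark weight reaches \(\theta\) times the group weight, and the excluded weight is redistributed to the highest-score firms in proportion to their benchmark weights.

```python
from collections import defaultdict

EPS = 1e-12  # tolerance for floating-point comparisons


def best_in_class_weights(firms, theta):
    """Return portfolio weights that preserve each region-sector weight.

    firms: list of dicts with keys 'region', 'sector', 'score', 'weight',
    where 'weight' is the benchmark weight.
    """
    groups = defaultdict(list)
    for idx, firm in enumerate(firms):
        groups[(firm["region"], firm["sector"])].append(idx)

    new_w = [firm["weight"] for firm in firms]
    for members in groups.values():
        target = theta * sum(firms[i]["weight"] for i in members)
        ranked = sorted(members, key=lambda i: firms[i]["score"])

        # Exclude the lowest scores until the excluded weight reaches the target.
        excluded, cum_ex = [], 0.0
        for i in ranked:
            if cum_ex >= target - EPS:
                break
            excluded.append(i)
            cum_ex += firms[i]["weight"]

        # Overweight the highest scores until their weight matches cum_ex.
        over, cum_ov = [], 0.0
        for i in reversed(ranked):
            if cum_ov >= cum_ex - EPS or i in excluded:
                break
            over.append(i)
            cum_ov += firms[i]["weight"]

        if not over:  # nothing to reinvest in: keep benchmark weights (a choice)
            continue
        for i in excluded:
            new_w[i] = 0.0
        scale = 1.0 + cum_ex / cum_ov  # factor applied to overweighted firms
        for i in over:
            new_w[i] = firms[i]["weight"] * scale
    return new_w


# Hypothetical two-group universe.
example_firms = [
    {"region": "EU", "sector": "Fin", "score": 10, "weight": 0.125},
    {"region": "EU", "sector": "Fin", "score": 50, "weight": 0.25},
    {"region": "EU", "sector": "Fin", "score": 90, "weight": 0.125},
    {"region": "NA", "sector": "Tech", "score": 20, "weight": 0.125},
    {"region": "NA", "sector": "Tech", "score": 40, "weight": 0.125},
    {"region": "NA", "sector": "Tech", "score": 60, "weight": 0.125},
    {"region": "NA", "sector": "Tech", "score": 80, "weight": 0.125},
]
print(best_in_class_weights(example_firms, theta=0.25))
```

By construction, the total weight of each region-sector is unchanged, which is what keeps the regional and sectoral exposures of the portfolio equal to those of the benchmark.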

This approach is therefore akin to a best-in-class strategy, in which investors reweight their portfolio from worst-in-class to best-in-class firms. For this reason, a region-sector approach is more likely to have an impact on the cost of financing of the reweighted firms: the cost of financing of excluded firms would tend to increase, while the cost of financing of overweighted firms would tend to decrease (Footnote 10). In Table 8, we consider the case where the reallocation is performed at the region-sector level with 25 and 33% thresholds.

Table 8 Summary Statistics on Sector-Region Exclusion and Reinvestment Portfolio

Starting with the 25% threshold (Panel A), we find that imposing regional and sectoral exposures results in scores that are slightly higher than those obtained without exposure restrictions. The gain in the overall ESG score is equal to 10.3 pp, while it is equal to 14, 11.3, and 11.7 pp for the E, S, and G scores, respectively. For ESG categories, the gain increases on average to 16.7 pp, so that gains are approximately 2.5 pp higher than with the global screening. This result suggests that reinvesting in the best-in-class firms at the region-sector level is more effective than reinvesting in all remaining firms proportionately.

For the 33% threshold (Panel B), even with constrained sectoral and regional exposures, reallocation strategies allow investors to benefit from substantial increases in ESG scores (Footnote 11). Gains are equal to 16.3, 13.3, and 14 pp for the E, S, and G pillars, respectively. For ESG categories, the gain increases to 19 pp.

Targeting specific ESG categories usually yields more substantial gains than targeting the broader ESG pillars, for the same reallocation threshold. For instance, in targeting the Emission score, the 33% threshold would allow investors to improve their score from 66.9 to 84.4 in an otherwise passive portfolio. These findings could be particularly relevant for investors who wish to target certain ESG objectives, such as a lower portfolio carbon footprint relative to the benchmark. Without relying on the overall ESG score, these investors could exclude firms with the lowest Emission scores (thus focusing on the particular underlying ESG category) and achieve financial performance similar to that of the benchmark.

The Sharpe ratios of the reallocation portfolios based on category scores are very similar to the Sharpe ratio of the MSCI ACWI (0.68 vs. 0.69). Most annualized alphas are substantially higher than the alpha of the MSCI ACWI. In addition, the tracking error relative to the benchmark is lower than 1.4% per year on average, which includes 0.9% due to the constituents of the MSCI ACWI with no ESG score.

In Fig. 4, we display how the increase in the portfolio score is affected by the reallocation threshold. We vary the reallocation threshold from 5 to 66% and consider the various pillars and categories. The Innovation, Human rights, Product responsibility, and CSR strategy categories achieve substantial increases in the category scores, even with a modest reallocation threshold, because all firms with zero scores are excluded simultaneously, resulting in a larger-than-expected reallocation. For other categories, the score gain increases steadily, up to 20–30 pp for the 66% threshold.

Fig. 4 Impact of the Exclusion Threshold on the Score Gain. This figure displays the gain in the score of the exclusion portfolios when the threshold is increased from 5 to 66%, for the various pillars and categories.

Figure 5 reports the Sharpe ratio of the various portfolios, as well as the Sharpe ratio of the MSCI ACWI (horizontal line). The negative impact of the reallocation is marginal for the E categories and moderate for the S categories (for thresholds up to 40%). For the G pillar, the Sharpe ratio of the CSR strategy category decreases significantly, while the Sharpe ratio of the other categories increases. Plots for the Treynor ratio and the Omega ratio reveal similar patterns and are not shown to save space.

Fig. 5 Impact of the Exclusion Threshold on the Sharpe Ratio. This figure displays the Sharpe ratio of the exclusion portfolios when the threshold is increased from 5 to 66%, for the various pillars and categories.

Figure 6 indicates that the annual tracking error usually increases as the reallocation becomes more severe. However, it remains below 2.5%, even for reallocation thresholds as high as 66%, while the tracking error of the portfolio with no reallocation (but with only firms with an ESG score) is equal to 0.9%. Consequently, a reallocation strategy based on a rather high threshold (such as the 33% threshold) can be implemented at a relatively low financial cost, with a tracking error close to 1.5% per year on average.

Fig. 6 Impact of the Exclusion Threshold on the Tracking Error. This figure displays the annual tracking error of the exclusion portfolios when the threshold is increased from 5 to 66%, for the various pillars and categories.

Finally, Figs. 7 and 8 present the evolution of the risk exposures of the screening portfolios to the Fama-French factors. We do not report the exposures to the market factor because they all remain very close to the benchmark exposure. The figures reveal that the exposures of the screening portfolios might be different from the benchmark exposures. As the top plot of Fig. 7 reveals, the exposure of the benchmark portfolio to the SMB factor is negative, reflecting that firms in the benchmark are larger than the pool of firms in the U.S. market. Screening portfolios usually have an even more negative exposure, which reflects that excluded firms may be smaller than overweighted firms. This evidence is consistent with the size bias of Drempetic et al. (2020), although this effect remains limited overall. We also notice that screening portfolios have a positive exposure to the HML factor, and that this exposure is larger than that of the benchmark.

Fig. 7 Impact of the Exclusion Threshold on the Exposure to Fama-French Factors. This figure displays the exposure of the exclusion portfolios to the SMB and HML factors when the threshold is increased from 5 to 66%, for the various pillars and categories.

Fig. 8 Impact of the Exclusion Threshold on the Exposure to the Fama-French Factors (cont’d). This figure displays the exposure of the exclusion portfolios to the RMW and CMA factors when the threshold is increased from 5 to 66%, for the various pillars and categories.

It is worth noting that constraining regional and sectoral exposures not only results in higher gains on the ESG score, but also allows investors to hold a portfolio that is otherwise passive, as it is not exposed to additional regional and sectoral risk relative to the market portfolio. This result suggests that a portfolio with a higher ESG category score, and with minimal regional and sectoral risk exposures relative to the market portfolio, can be attained at almost no cost in terms of financial performance.

Conclusion

Devising investment strategies based on an amalgamation of three fundamentally different topics underpinning ESG investing has been a practical hurdle, especially given the potential for weak scores in one pillar to be offset by strong scores in another pillar. We use the Refinitiv ESG database to deconstruct ESG scores and analyze underlying indicators in depth. For most underlying categories of ESG factors, an investment strategy based on excluding firms with low category scores allows investors to substantially improve the score of their portfolio without suffering from a lower risk-adjusted performance compared to a broad global stock market benchmark. However, the screening process also results in significant regional and sectoral biases relative to this benchmark. Implementing a best-in-class approach imposing the same regional and sectoral exposures as the benchmark slightly increases the gain on the targeted score with no material impact on the risk-adjusted performance and minimal increase in the tracking error of the portfolio.

Shifting focus from aggregate ESG pillar scores and ratings to more granular characteristics (to the extent they may be available from the various ESG data vendors) has three further key advantages beyond the financial ones documented in our analysis. First, a focus on specific categories would enable investors to overcome the “aggregate confusion” created by consolidated ESG scores or ratings and directly focus on factors that are most relevant to their investment mandates. For example, an investor seeking to protect the environment and universal human values could target themes such as emission reduction or human rights. Second, focusing on specific themes would help them better track the sustainability performance trajectory of their investments vis-à-vis their stated sustainable investment objectives. This would also help initiate divestments or reweigh investments within the portfolio. Finally, over time, the focus on themes would also enable investors to develop their own ESG assessment models using actual and observed third-party vendor data, thereby overcoming vendor-specific concerns.