ESG controversies and controversial ESG: about silent saints and small sinners

Based on an extensive international dataset containing Thomson Reuters environmental, social and corporate governance (ESG) ratings, as well as Thomson Reuters' newest controversies and combined scores, for an average of 2500 companies in the years 2002–2018, this article contributes to the existing discourse on the relationship between corporate social performance and corporate financial performance (CFP) by examining the Fama and French (J Financ Econ 116(1):1–22, 2015) five-factor risk-adjusted performance of positively screened best and worst portfolios, based on a 10% cutoff, for equally, value- and rank-weighted strategies in the European, US and global markets. Furthermore, the controversies score allows us to examine the mid-to-long-term effects of scandals on CFP without having to rely on the event study methodology. Even though a value-weighted strategy does not show any significant abnormal returns, we observe a significant outperformance for equally weighted worst ESG portfolios and best controversies strategies. These results strongly indicate that this outperformance is driven, on the one hand, by low-rated smaller companies (“small sinners”) and, on the other hand, by firms with a clean record with regard to controversies (“silent saints”). The findings hold for several robustness checks such as adjusting the cutoff rates or splitting the dataset across time.


Introduction
The interaction between corporate social performance (CSP) measured by ESG scores (which evaluate the performance of companies in their environmental, social or corporate governance pillars) and their corporate financial performance (CFP) has been the subject of academic research for many years with various findings. This paper is the first to examine the mid-to-long-term effects of controversies, as the new dimension of ESG, on the CFP of listed companies in a portfolio context. Furthermore, it determines the impact of different weighting strategies for high- and low-rated ESG and controversy portfolios.
Since the 1970s, the relationship between CSP and CFP has been investigated in a large body of academic research. Revelli and Viviani (2015) report in their meta-analysis that the consideration of CSP in a portfolio leads to neither an under- nor an outperformance when compared with non-ESG-based investment strategies. Friede et al. (2015) conclude from their meta-analysis that approximately 90% of the more than 2000 considered studies report a nonnegative relationship between CSP and CFP. This heterogeneity of the results can generally be ascribed to three issues, namely the question of how to measure CSP, the methods of stock selection and the question of how to define and measure CFP.
Addressing the first concern, some companies such as Sustainalytics, MSCI-KLD or Asset4 specialize in issuing ESG-based ratings and, as external and independent rating providers, therefore represent a transparent and reliable source of objective corporate social responsibility (CSR) measurements. Nevertheless, Capelle-Blancard and Monjon (2012) as well as Revelli and Viviani (2015) argue that the academic discordance can mainly be ascribed to data-driven results. Furthermore, Dorfleitner et al. (2015) and Chatterji et al. (2016) report a lack of homogeneous ESG measurement concepts, even among the large international ESG rating institutions.
To address the CSP measurement issue, our analysis includes three distinct ratings that represent industry-based percentile-ranked scores, which enable a simple implementation of a best-in-class approach and therefore do not discriminate against any industry group. The first one, the Thomson Reuters ESG score (in the following referred to as TR score), evaluates the CSR in various pillars; the second, the Thomson Reuters Controversies score (in the following referred to as controversies score), measures the amount of ESG-based controversies a company encounters during a fiscal year; and the third, the Thomson Reuters combined score (in the following referred to as combined score), aggregates ESG-related controversies and the TR score of a company.
Despite the fact that the controversies score finds its application within other financial research [see, for example, Park (2018) and Vasilescu and Wisniewski (2019)], we still contribute to the literature as we are the first ones to consider the extreme event of an ESG-based scandal within the context of portfolio selection.
The heterogeneity of academic results is strengthened even further by the use of various stock selection criteria. The most common and easiest way in which an investor can implement a socially responsible investment (SRI) strategy is represented by socially responsible (SR) mutual funds. These funds claim to construct a portfolio based on SR selection criteria, such as selecting stocks with a high ESG rating (positive screening) or excluding the so-called sin stocks (tobacco, alcohol, arms or gambling industry) from their investment decisions (negative screening). The majority of the literature devoted to these types of investment strategies reports no financial performance differences between SR and conventional mutual funds (see, e.g., Statman 2000; Bauer et al. 2005; Bello 2005; Kreander et al. 2005; Cortez et al. 2009; Utz and Wimmer 2014). However, socially or ethically motivated value-driven investors in particular have to pay close attention to the shifting level of social responsibility of these SR funds. Wimmer (2013) finds that these funds are optimized towards their financial rather than their social performance and therefore the overall level of social performance of an SR fund is only persistent in the short run. Utz and Wimmer (2014) argue that, viewed from an individual stock level, SR mutual funds and conventional funds do not differ greatly in terms of portfolio composition. This leads to the conclusion that SR mutual funds do not sustainably satisfy the needs of value-driven investors.
To overcome the stock selection problem, our analysis does not include SR funds, but rather selects stocks based on an ESG ranking. This allows us to measure the CSR of a firm directly and to construct long-term ESG-persistent portfolios by implementing a monthly rebalanced positive screening process following the ESG-based portfolio formation method of Kempf and Osthoff (2007). We construct a best and a worst portfolio based on 10% cutoffs for the ESG and controversy out- and underperformers in the sample, respectively. Additionally, the best-minus-worst zero-cost investment strategy simply buys the outperformers and short sells the underperformers. Besides testing the standard approach of value-weighted portfolios, we also construct equally weighted ones to better control for disparities between large and small firms. Furthermore, we implement a rank weighting, which, given an ESG-based stock selection, allocates a higher weight to a stock the more extreme its score becomes.
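As an illustration of the portfolio formation described above, the following is a minimal Python sketch for a single month, not the implementation used in this paper. The function names `select_best_worst` and `portfolio_return`, the rounding rule for the number of selected stocks and the toy data are assumptions made for the example only.

```python
import numpy as np

def select_best_worst(scores, cutoff=0.10):
    """Return (best, worst) index arrays for the top and bottom `cutoff` share."""
    scores = np.asarray(scores, dtype=float)
    # Assumption: round to the nearest integer and keep at least one stock.
    n_pick = max(1, int(round(cutoff * scores.size)))
    order = np.argsort(scores, kind="stable")  # ascending by score
    return order[-n_pick:], order[:n_pick]

def portfolio_return(returns, members, market_caps=None):
    """One-period portfolio return: equally weighted unless caps are given."""
    r = np.asarray(returns, dtype=float)[members]
    if market_caps is None:
        return float(r.mean())
    caps = np.asarray(market_caps, dtype=float)[members]
    return float(np.sum(caps / caps.sum() * r))

# Toy cross-section for one month: 20 stocks with scores 0, 5, ..., 95.
scores = np.arange(20) * 5.0
returns = np.full(20, 0.01)
returns[[0, 1, 18, 19]] = [0.03, 0.05, 0.02, 0.00]
caps = np.ones(20)
caps[19] = 3.0  # one large firm in the best group

best, worst = select_best_worst(scores)
ew_best = portfolio_return(returns, best)
ew_worst = portfolio_return(returns, worst)
ew_best_minus_worst = ew_best - ew_worst       # long best, short worst
vw_best = portfolio_return(returns, best, caps)
```

In a monthly rebalanced setting, the selection step would be repeated each month with the scores available at that time, and the resulting return series would then enter the performance regressions.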
Regarding the definition and measurement of CFP, researchers tend to use methods from two different directions. Whereas the first group, which represents an accounting-based view, defines CFP as the shift in earnings per share (EPS), operating profitability [return on equity (ROE), return on assets (ROA) or return on sales (ROS)] or net income, the second employs a stock-market-oriented perspective by applying (risk-adjusted) performance measurements such as abnormal returns, Sharpe Ratio or Tobin's Q. A common method in the accounting-based direction comprises the implementation of a particular type of regression analysis. Qiu et al. (2016), for instance, regress the ROS of companies on their respective ESG score. Mervelskemper and Streit (2017) follow the valuation approach of Ohlson (1995) and add an ESG dimension to the model, resulting in a regression of the market-to-book value of equity ratio on an ESG score. Van der Laan et al. (2008) implement a firm-fixed-effects regression to measure the influence of different CSP rating dimensions on the ROA and the EPS. In the stock-market-based perspective, factor models represent a common way in which to measure CFP as they have evolved from simple single-index models (like the CAPM) into more appropriate approaches like the Fama and French (2015) five-factor model. Kempf and Osthoff (2007) and Halbritter and Dorfleitner (2015), for example, align themselves with this group by implementing a Carhart (1997) four-factor model to estimate the abnormal returns of ESG portfolios. With a Fama and MacBeth (1973) regression, Halbritter and Dorfleitner (2015) also incorporate a cross-sectional approach as they regress the excess return of a certain company on its ESG score. Pintekova and Kukacka (2019) analyze the share prices of companies based on the Thomson Reuters combined score using a within-group fixed-effects model.
Aouadi and Marsat (2018) utilize a fixed-effects model with dummy variables to estimate the relationship between Tobin's Q and an ESG score. Other studies, such as Auer (2016) and Auer and Schuhmacher (2016), who implement a Sharpe Ratio approach, rely on financial ratios. Event studies represent another noteworthy methodology, which is especially useful when analyzing the short-term impact of certain events (for example, the occurrence of a scandal). Among others, Lundgren and Olsson (2009) examine the effects of environmental scandals on firm value by applying a t test to the cumulative standardized abnormal return, whereas Krüger (2015) utilizes the cumulative abnormal return to show the impact of positive and negative ESG-related news separately on firm value. As these examples show, there is a wide variety of methods and models for different purposes. A stock-market-oriented perspective is especially suitable for an analysis from an investor's point of view as these methods better reflect the investors' perception of the impact of CSR on the future value of the company (see, e.g., Hillman and Keim 2001; Gentry and Shen 2010; Pintekova and Kukacka 2019). Therefore, we align with the stock-market-oriented perspective and use the Fama and French (2015) five-factor model to calculate the risk-adjusted abnormal return. Furthermore, the use of the controversies score allows us to directly measure the mid-to-long-term effects of controversies on CFP without having to rely on the event study methodology.
Despite this academic disagreement, interest in SRI strategies has risen rapidly in recent years. Global assets under management (AUM), according to the Global Sustainable Investment Review GSIA (2018), grew significantly from $22.89 trillion in 2016 to $30.68 trillion in 2018, whereas, as reported by the U.S. Forum for Sustainable and Responsible Investments USSIF (2018), the AUM experienced a sharp increase from $8.7 trillion in 2016 to $12.0 trillion at the beginning of 2018 in the US market alone, which corresponds to a growth of almost 40% over two years. Furthermore, as mentioned by Crilly et al. (2012), the increasing pressure exerted by various stakeholder groups forces companies to invest financial resources in CSR. Moreover, many investors pay close attention to the CSR or CSP of firms, whether they are value-driven investors trying to satisfy their altruistic needs or investors attempting to achieve abnormal returns by investing in firms with high ESG ratings.
Interestingly, within our results, we find a significant outperformance of up to almost 9% p.a. for the worst TR score portfolios for equally weighted strategies as well as 7% p.a. for the equally weighted best controversies score portfolios. These results show that investors should focus on low-rated smaller companies ("small sinners") and firms with a clean record with regard to controversies ("silent saints"). The implementation of a rank-weighted strategy instead of an equally weighted one shows an improvement in alpha across nearly all tested strategies. Regarding the value-weighted strategies, no significant out- or underperformance can be found. These findings apply to different markets and hold true for various robustness checks.
This paper is organized as follows. "Literature overview" section provides a short overview of the recent state of literature, while the data and methodology are discussed in "Data and methodology" section. "Results" section presents our results. "Robustness checks" section implements several robustness checks, and "Conclusion" section concludes.

Literature overview
This section provides an overview of the three perspectives regarding the relationship between CSP and CFP.
The first one indicates a positive relationship between the ESG score of a company and its CFP (see, e.g., Kempf and Osthoff 2007; Statman and Glushkov 2009; Auer 2016; Pintekova and Kukacka 2019) and is often referred to as doing good while doing well. This hypothesis holds true if the costs of socially responsible activities are overestimated or the respective benefits exceed the expectations of managers and investors. This can be explained through the managerial myopia theory (see, e.g., Narayanan 1985; Stein 1988), according to which, on the one hand, managers tend to prefer decisions with a short-term profit over those that maximize long-term shareholder value and, on the other hand, short-term focused investors undervalue long-term benefits. Whereas the costs of socially responsible activities occur immediately, their benefits arise in the future. Therefore, the corresponding benefits are harder to predict and less attractive to short-term focused investors. Among others, Derwall et al. (2005) and Edmans (2011), who link the doing good while doing well hypothesis with the managerial myopia theory, conclude that short-term investors are unable (or unwilling) to price the long-term benefits of those activities correctly and therefore undervalue stocks of companies with high levels of engagement in environmental or social aspects, leading to higher returns in the long run for the respective stocks when compared with other stocks. This idea of benefit manifestation in the long run is consistent with the findings of Dorfleitner et al. (2018), who conclude that the benefits of socially responsible activities (measured by the abnormal stock returns) are produced by unexpected additional cash flows which occur in the mid-to-long term.
Pintekova and Kukacka (2019) divide ESG-based activities into a primary and a secondary sector, where the primary sector comprises socially responsible activities that are closely related to the core business of the respective company. Their results corroborate the doing good while doing well view if the ESG-based activity is located in the primary sector.
The second approach reverses the above-mentioned relationship, which produces a view of doing good but not well (see, e.g., Boyle et al. 1997; Barnea and Rubin 2010; Renneboog et al. 2008; Hong and Kacperczyk 2009). Several arguments support this hypothesis. First of all, based on the idea of Barnea and Rubin (2010), socially responsible activities that represent lavish expenditures of managers motivated by personal benefits, such as public appreciation, rather than by the altruistic motive of non-financial utility lead to a significant decrease in shareholder value and inferior financial performance. Thus, an agency problem occurs. As described by Krüger (2015), investors will react negatively (positively) to the announcement of socially responsible activities of firms with a high (low) amount of liquidity, since such activities can be seen as wasteful investments. Furthermore, as stated by Heinkel et al. (2001) and Hong and Kacperczyk (2009), socially responsible investors and institutions which are subject to social norm pressures (such as pension funds, universities and religious organizations) exclude "sin stocks" from their investment decisions, resulting in lower demand and thus a lower price and a higher return in comparison with stocks which have a high ESG rating. Another reason supporting the doing good but not well hypothesis is the trade-off theory stated by Aupperle et al. (1985). In the case of socially responsible investments, the theory argues that ESG-based activities exhaust financial resources which are then lacking in other places. Thus, companies with a low level of expenditure on CSR achieve a competitive advantage in the long run, which may be especially relevant for smaller firms that are on a tighter budget. For small companies, the trade-off theory is strengthened even further by the findings of Aouadi and Marsat (2018). Examining the connection between firm visibility, CSP and CFP, they conclude that the ESG rating plays a role only for high-attention firms (firms that are larger, more present in the media and more closely followed by analysts). In conclusion, if smaller firms invest in CSR, this could be seen as a waste of precious financial resources and therefore reduce firm value.
A third view suggests that there is no clear positive or negative relationship between the CSP and the CFP of a firm. Among others, the studies of Halbritter and Dorfleitner (2015) and Auer and Schuhmacher (2016) indicate that there is no statistical difference in the risk-adjusted returns of portfolios consisting of either high ESG-rated or low ESG-rated firms. This third point of view does not necessarily imply the absence of a connection between CSP and CFP. Instead, it may indicate, on the one hand, that the market prices CSP properly, which leads to an absence of abnormal risk-adjusted returns, or, on the other hand, that the benefits resulting from ESG-based activities are offset by their respective drawbacks such as their costs or the occurrence of agency problems.
Whatever the relationship between CFP and CSP reveals itself to be in a specific context, the question of informationally efficient markets still arises. As the stock selection of corresponding investment strategies is frequently based on the evaluation of certain ESG-based ratings, one may argue that, as these scores are publicly available, financially motivated investors could not generate a risk-adjusted excess return over conventional or non-ESG-based investments due to market efficiency. Fama (1965, 1970) describes, with the efficient market hypothesis (EMH), a framework in which, if the semistrong form holds true, all information regarding the CSR of a company such as sustainability reports, ESG ratings and even ESG-based scandals should be correctly incorporated into the price of the respective stock shortly after being made public. Therefore, an outperformance of an ESG-based stock selection strategy would not be possible. However, Grossman (1976) and Grossman and Stiglitz (1980), for example, argue that a perfectly information-efficient market could not exist, as there would be no incentive for investors to gather information or to actively manage a portfolio whatsoever, because they could not generate any excess returns.
In the case of SRI, Mynhardt et al. (2017) examine the efficiency of socially responsible indices by calculating a Hurst coefficient. The results indicate that most socially responsible indices are significantly less efficient than conventional ones. With a few exceptions, the Hurst coefficient of most of these indices differs from that of an efficient market (where the Hurst coefficient would be exactly 0.5), ranging either from 0.3 to 0.45 (signaling fat tails with an anti-persistent return series which is negatively correlated) or from 0.55 to 0.6 (indicating fat tails with a tendency to a persistent return series with a slight positive correlation). This raises the question of whether ESG-based information is priced immediately and correctly and is considered in its entirety. This appears to be especially crucial in terms of ESG-based scandals: whereas the occurrence of a scandal is publicly perceived and undoubtedly priced immediately, the impact of the absence of scandals has often been overlooked, as companies with a low number of scandals "fly under the radar". In this regard, the controversies score represents a good opportunity to decrease this inefficiency and can add significant value to ESG investing, as it is comparable to credit default ratings, which also evaluate the absence of an infrequent event. Dorfleitner et al. (2018) also address the aspect of information inefficiency in the context of SRI as they argue that the future financial benefits of socially responsible activities are not immediately perceivable and therefore the economic nature of CSR remains fairly opaque. Within their results, they conclude that ESG-based activities lead to significant earnings surprises and unexpected additional cash flows in the long run. Edmans (2011) shows something similar with respect to the intangible asset of being one of the best companies to work for, owing to particularly high employee satisfaction.

Data
Due to its transparent scoring methodology, we choose Thomson Reuters, as the world's largest ESG rating database, for our data source (see, e.g., Cheng et al. 2014; Durand and Jacqueminet 2015). Our dataset includes all Thomson Reuters scores (in the following referred to as TR scores), controversies and combined scores for the European, US, as well as the global market (including the US and European market) in the period under review from 2002 to 2018. These three scores represent the starting point for further calculations and are explained in more detail below.
First, the controversies score, which pertains to Thomson Reuters' latest scoring methodology, adds a new dimension to previous approaches by capturing negative media stories from global media sources. This score is a percentile ranking that takes into account ESG-based scandals which concern any of the controversy topics and occur during a company's fiscal year. Its rating methodology consists of 23 ESG controversy topics such as "controversies privacy" or "business ethics controversies" (see Thomson Reuters 2019). This score is also benchmarked against the respective industry groups.
Thus, if a scandal occurs, it has a negative impact on the evaluation of the company involved. Ongoing legal disputes, lawsuits and fines may also affect the ensuing years and may still be visible in subsequent controversy ratings. The score is calculated as follows:

$$\text{score} = \frac{\#\,\text{comp. with a worse value} + \dfrac{\#\,\text{comp. with the same value, including the current one}}{2}}{\#\,\text{comp. with a value}} \tag{1}$$

In brief: the fewer scandals that affect a company, the higher its score is.
The TR score evaluates a company's environmental, social and corporate governance (ESG) performance with regard to ten main categories based on publicly available company-reported data. Each of these categories (for instance, resource use, innovation and emissions in the environmental pillar, human rights and workforce in the social pillar and management in the corporate governance pillar) receives an individually calculated category score and a related category weighting within its associated pillar. These data result in three so-called pillar scores, one for each ESG pillar. To calculate the overall ESG score, these pillar scores are aggregated and, in the last step, the TR score is ranked by percentile and benchmarked against the industry. Therefore, the TR score offers an easy way to implement a best-in-class approach (see Thomson Reuters 2019).
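Both the controversies score and the TR score are percentile ranks. As a minimal sketch, the percentile-rank formula (the number of companies with a worse value plus half the number of companies with the same value, including the current one, divided by the number of companies with a value) can be written in Python as follows. The function name `percentile_score` is hypothetical, and the sketch assumes that a higher input value means a better company; for controversies, the input would first have to be oriented so that fewer scandals yield a higher value. Any rescaling of the result (for example to a 0 to 100 range) is left out.

```python
def percentile_score(values):
    """Percentile-rank score per company; ties count with half weight."""
    n = len(values)
    scores = []
    for v in values:
        worse = sum(1 for u in values if u < v)
        same = sum(1 for u in values if u == v)  # includes the company itself
        scores.append((worse + same / 2) / n)
    return scores

# Four companies in one industry group, two of them tied at the bottom.
example_scores = percentile_score([3, 1, 1, 2])
```

Because tied companies share the same score, a large group of firms without any controversy in a given year would all receive an identical score under this formula.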
Next, the combined score comprises both the TR and the controversies score and thus offers a broadly diversified scoring with regard to performance-based ESG data and controversies collected from worldwide media sources (see Thomson Reuters 2019). If the controversies score is greater than or equal to 50, it has no impact and the combined score equals the TR score. If the TR score is less than the controversies score, the combined score also equals the TR score. Only if the controversies score is below 50 and the TR score is greater than the controversies score does the combined score equal the average of both scores.
In order to determine our data universe, we only consider companies for which all three ratings are present. Moreover, penny stocks are removed. As a result, we obtain a monthly dataset with over 529,000 observations in total, with an average of approximately 2500 companies in a single month during our time period of 2002–2018 (192 months), more precisely between 900 and 4700 at each point in time. For all observed companies, we have a comparable dataset of the three ratings (TR, combined and controversies). Table 1 shows the descriptive statistics of our data universe.
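The rule for the combined score stated above can be sketched in a few lines of Python. This is only an illustration of the three cases as described in the text; the function name `combined_score` is hypothetical and the example values are made up.

```python
def combined_score(tr, controversies):
    """Combined score from a TR score and a controversies score (0 to 100)."""
    # Controversies score of 50 or more: no effect on the combined score.
    # TR score not above the controversies score: no effect either.
    if controversies >= 50 or tr <= controversies:
        return tr
    # Controversies score below 50 and below the TR score: average of both.
    return (tr + controversies) / 2

clean_firm = combined_score(80, 60)     # controversies score >= 50
low_rated_firm = combined_score(30, 40) # TR score below controversies score
scandal_firm = combined_score(80, 20)   # penalized by controversies
```

By this rule the combined score can never exceed the TR score, which is in line with its lower mean value in the descriptive statistics.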
Concerning the TR rating, the mean value of the rating universe corresponds almost exactly to 50 with a standard deviation of approximately 17. The controversies score is approximately the same as the TR score in terms of mean value and standard deviation. As can be expected with regard to the calculation, the combined score has a lower mean value than the TR and controversies score with a standard deviation of 15.
Regarding the correlation between the three scores it is noteworthy that the correlation between the controversies score and the TR score is negative (− 0.3107). Thus, companies with a high TR score tend to have a low controversies score.
One explanation for this may be that companies that tend to have high ESG scores are affected more greatly by controversies, as reflected by the saying "the higher you fly, the harder you fall".
Furthermore, as would be expected from the composition, the correlation between TR score and combined score is positive (0.7774) as well as between controversies score and combined score (0.3077).
The analysis in this paper is carried out from the perspective of a US investor, so all data are converted into US dollars. The total returns and market capitalization of the considered companies are obtained from Thomson Reuters Eikon. Delisted or insolvent companies are considered until the last available rating or financial information. Thus, our results are not influenced by a potential survivorship bias. For more detailed insights, some descriptive statistics for the European and US markets are displayed in Table 2. While for the European market we consider over 158,000 observations based on an average of approximately 820 companies (between 400 and 1000), for the US market, our data consist of over 191,000 observations at an average of approximately 1000 companies (between 400 and 2300).

Methodology
As a first step, we construct several portfolios by sorting stocks according to each score. To calculate the monthly returns, we select the best-rated and worst-rated stocks, respectively, and combine them in a portfolio, one for each of the three scores. Following this procedure, we consider a best-only and a worst-only strategy as well as a best-minus-worst strategy, which is long in the best-rated companies and short in the worst-rated ones. As a next step, we consider three different weighting approaches upon which to construct the portfolios. We include the common value-weighted and equally weighted strategies and also a rank-weighted strategy that we present in detail below in "A different approach: rank-weighted portfolios" section.
For each of the value-weighted, equally weighted and rank-weighted strategies (the latter being examined in "Rank-weighted portfolios" section), we obtain nine stock portfolios in each of the European, US and global markets, i.e., 27 per market in total. In order to determine the performance of our portfolios, we apply the Fama and French (2015) five-factor model, which is based on the regression

$$R_{it} - R_{Ft} = a_i + b_i \,(R_{Mt} - R_{Ft}) + s_i \, SMB_t + h_i \, HML_t + r_i \, RMW_t + c_i \, CMA_t + e_{it}$$

In this model, the return of portfolio $i$ for period $t$ is represented by $R_{it}$, while $R_{Ft}$ denotes the risk-free return. $R_{Mt}$ denotes the return of the market portfolio, $SMB_t$ represents the small-minus-big factor (returns of small stocks minus returns of big stocks) and $HML_t$ is the performance difference between companies with a high and a low book-to-market value. The factor $RMW_t$ indicates the difference between the returns of stocks with robust and weak profitability. $CMA_t$ describes the returns of conservative (i.e., low-investment) minus aggressive (i.e., high-investment) stocks. Moreover, $b_i$, $s_i$, $h_i$, $r_i$ and $c_i$ are the regression coefficients, which are estimated by OLS regression, $e_{it}$ denotes a (zero-mean) residual and $a_i$ the intercept.
Since a Breusch and Pagan (1979) test applied to all portfolios indicates that the residuals of the regressions are subject to heteroskedasticity and a Godfrey (1978) and Breusch (1978) test as well as a Durbin and Watson (1971) test show autocorrelation for most of the models, we use the approach of Newey and West (1987) to calculate standard errors.
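To make the estimation step concrete, the following Python sketch runs an OLS regression of excess portfolio returns on five factors and computes Newey and West (1987) standard errors with Bartlett kernel weights. It is a minimal illustration, not the code used in this paper: the function name `ols_newey_west`, the lag length of 4, the absence of a small-sample correction and the simulated factor data are all assumptions of the example.

```python
import numpy as np

def ols_newey_west(y, X, lags):
    """OLS coefficients with Newey-West (Bartlett kernel) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    Xe = X * resid[:, None]              # row t holds x_t * e_t
    S = Xe.T @ Xe                        # lag-0 (heteroskedasticity) term
    for lag in range(1, lags + 1):
        weight = 1.0 - lag / (lags + 1.0)  # Bartlett weight
        G = Xe[lag:].T @ Xe[:-lag]         # lag-`lag` autocovariance term
        S += weight * (G + G.T)
    cov = XtX_inv @ S @ XtX_inv          # sandwich estimator
    return beta, np.sqrt(np.diag(cov))

# Simulated monthly data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
T = 192
factors = rng.normal(0.0, 0.03, size=(T, 5))  # Mkt-RF, SMB, HML, RMW, CMA
true_b = np.array([1.0, 0.3, -0.2, 0.1, 0.0])
excess = 0.005 + factors @ true_b + rng.normal(0.0, 0.01, size=T)

X = np.column_stack([np.ones(T), factors])    # first column: intercept a_i
coef, se = ols_newey_west(excess, X, lags=4)
alpha, alpha_t = coef[0], coef[0] / se[0]
```

The first coefficient is the monthly alpha of the portfolio, and its ratio to the corresponding standard error gives the t statistic on which the significance statements rest. With `lags=0` the estimator reduces to heteroskedasticity-robust (White) standard errors.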

A different approach: rank-weighted portfolios
Besides equally weighted and value-weighted portfolios, we also consider a new portfolio composition strategy following a similar approach to Frazzini and Pedersen (2014), which reflects the great importance of ESG ratings for investors who may wish to reward differences in the scores through corresponding weights. Consequently, we build portfolio weights based on the respective score placements. Our approach assigns higher weights to better scores in the best portfolios and, vice versa, higher weights to worse scores in the worst portfolios. In addition, the best portfolios constructed this way have, by construction, a higher ESG rating than value-weighted or equally weighted strategies, whereas the worst portfolios have lower ratings. First, we determine the best and worst stocks. Next, we rank the selected companies in ascending or descending order. In the best portfolios, the company with the highest score receives the (numerically) highest rank. In contrast, the company with the worst score receives the highest rank in the worst portfolios. To calculate the weight $w_{c,t}$ of a company $c \in C_t \subseteq C$, where $C$ is the set of all companies within the respective data and $C_t$ is the set of all companies within the portfolio at time $t$, we use

$$w_{c,t} = \frac{\mathrm{Rk}_t(c)}{\sum_{j=1}^{N_t} j} = \frac{2\,\mathrm{Rk}_t(c)}{N_t\,(N_t + 1)}$$

and for each $t \in T$ there holds

$$\sum_{c \in C_t} w_{c,t} = 1,$$

where $\mathrm{Rk}_t(c)$ denotes the rank of a company $c$ at $t$ and $N_t = |C_t|$ the cardinality of the portfolio selection at $t$ in the monthly period under review. If a company $\hat{c} \in C \setminus C_t$ does not appear in the portfolio selection at time $t$, its weight is, by definition,

$$w_{\hat{c},t} = 0.$$

Table 3 presents some measures of all 27 equally weighted 10% portfolio strategies. Concerning the Sharpe ratio, the Sortino ratio and the Treynor ratio, it is noteworthy that all controversies best and TR worst portfolios show higher values than the respective market portfolio, which is a first indication that the performance of these portfolios is high.
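Returning briefly to the rank weighting defined at the start of this section, the following Python sketch shows one way to turn the scores of the selected companies into weights that are proportional to their ranks and sum to one. It is an illustration under stated assumptions: the function name `rank_weights` is hypothetical, ties are broken by input order, and the normalization by the sum of ranks is the reading of the weighting scheme adopted for this example.

```python
import numpy as np

def rank_weights(scores, best=True):
    """Rank-proportional weights for the companies inside one portfolio.

    best=True:  the highest score gets the highest rank (best portfolio).
    best=False: the lowest score gets the highest rank (worst portfolio).
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    # Assumption: ties are broken by input order via a stable sort.
    order = np.argsort(scores if best else -scores, kind="stable")
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)   # ranks 1, ..., N_t
    return ranks / ranks.sum()           # sum of ranks is N_t (N_t + 1) / 2

w_best = rank_weights([90.0, 95.0, 99.0], best=True)
w_worst = rank_weights([5.0, 2.0, 9.0], best=False)
```

In the toy best portfolio the company with a score of 99 receives half of the portfolio weight, while in the toy worst portfolio the company with a score of 2 does. Companies outside the selection simply do not appear and thus carry a weight of zero.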
Furthermore, most best and worst portfolios have a higher risk than their respective market in terms of maximum drawdown (MDD), while the controversies best-minus-worst portfolios have a much lower risk in all three markets. Additionally, the MDD is lower than that of the corresponding market for the following portfolios: combined best-minus-worst (US, global), controversies best (Europe, global), TR worst (global) and combined worst (Europe). To examine a potential outperformance of the strategies in more detail, we consider the alphas of the respective portfolios.

Equally and value-weighted portfolios
The results of the Fama and French (2015) five-factor regressions are presented in Table 4 for equally weighted portfolios and in Table 5 for value-weighted portfolios. Some results immediately catch the eye: Regarding the equally weighted strategy, the worst portfolios based on the TR and combined scores, as well as the best portfolios of the controversies score, indicate positive and significant outperformance. For the controversies score best portfolios, consistently positive and significant alphas can be observed for all portfolios. These portfolios show strongly significant returns of up to almost 7% p.a. In contrast to this, the controversies score worst and best-minus-worst portfolios do not exhibit any striking features. Surprisingly, when considering combined score portfolios, a best portfolio strategy does not lead to a significant outperformance. However, the worst portfolio shows a consistently strong and significant outperformance of up to about 7.6% p.a., which can be observed in all three markets. As a result of this, the calculations indicate a significant underperformance of the best-minus-worst portfolios. Therefore, this effect cannot be caused by the controversies score, but instead appears to be determined by the second component of the combined score, namely the TR score.
When taking a closer look at the ESG portfolios, we notice the following. While the best portfolios, apart from a slight significance in the global market, do not show any outperformance, a strongly significant outperformance of up to almost 9% (8.86%) p.a. can be observed for the worst TR score portfolios in all three markets. These results resemble those of the combined score portfolios.
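The per-annum figures quoted in this section follow from compounding the monthly alphas over twelve months, as in the footnote on the global controversies score best portfolio. A minimal sketch of that conversion is given below; the function names are hypothetical and chosen for illustration only.

```python
def annualize_monthly(monthly_rate: float) -> float:
    """Compound a monthly rate over 12 months: (1 + r)^12 - 1."""
    return (1.0 + monthly_rate) ** 12 - 1.0


def monthly_from_annual(annual_rate: float) -> float:
    """Inverse conversion: the monthly rate that compounds to annual_rate."""
    return (1.0 + annual_rate) ** (1.0 / 12.0) - 1.0


if __name__ == "__main__":
    # Monthly alpha of 0.0056, the value used in the paper's footnote.
    print(round(annualize_monthly(0.0056), 4))  # 0.0693
```

Compounding, rather than multiplying by twelve, slightly raises the annual figure (0.0693 instead of 0.0672 in this case).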
In contrast, consider the results of the value-weighted portfolios in Table 5. Apart from very few exceptions, neither best nor worst portfolios based on the three ratings obtain persistently positive and significant alphas within the European, US or global market. There is thus no recognizable systematic benefit of either best or worst strategies. Apart from some isolated outliers, the results lead us to conclude that the value-weighted strategy does not result in any excess return for investors, which is consistent with the findings of Halbritter and Dorfleitner (2015). It should also be pointed out that the adjusted R² values of all long and short portfolios are consistently high, which indicates a strong explanatory power of our underlying factor model.
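The contrast between the two weighting schemes can be illustrated with a small sketch. The company names and market capitalizations below are hypothetical and serve only to show how strongly equal weighting tilts a portfolio toward small firms compared with value weighting.

```python
def equal_weights(companies):
    """Every company in the portfolio receives the same weight 1/N."""
    n = len(companies)
    return {c: 1.0 / n for c in companies}


def value_weights(market_caps):
    """Each company is weighted by its share of total market capitalization."""
    total = sum(market_caps.values())
    return {c: cap / total for c, cap in market_caps.items()}


if __name__ == "__main__":
    # Hypothetical market capitalizations (arbitrary currency units).
    caps = {"Large": 900.0, "Mid": 90.0, "Small": 10.0}
    print(equal_weights(caps))   # each company gets one third
    print(value_weights(caps))   # "Small" gets only 1% of the portfolio
```

In this example the smallest firm carries a third of the equally weighted portfolio but only one percent of the value-weighted one, which is why results that differ between the two schemes point to a size effect.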
There is a clearly recognizable difference between Tables 4 and 5: since the results of the value-weighted and the equally weighted portfolios differ markedly, the significant outperformance of the equally weighted portfolios appears to be strongly driven by small companies. In particular, the TR portfolios support this finding, as the equally weighted portfolios based on low TR scores achieve strong outperformance. These results provide some evidence for the trade-off hypothesis (see Aupperle et al. 1985), as investors appear to reward smaller companies for not investing their money in ESG improvements. They may consider this spending a wasteful investment and prefer companies that invest in growth and innovation. As no or even negative significant results were shown for value-weighted best portfolios, we can conclude that, for large companies, the benefits of expenditures improving CSP are already reflected in the stock price of these companies.

⁶ The annualized performance of the global controversies score best portfolio is 1.0056^12 − 1 = 0.0693.

Table 4
Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 on a monthly basis. The regressions are calculated individually for each equally weighted portfolio based on a 10% cutoff of each score, market and portfolio set. The best (worst) portfolios consist of the 10% best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones. Monthly alphas, all estimated coefficients of the five Fama and French (2015) factors and adj. R² are reported. In order to estimate standard errors, we use the Newey and West (1987) procedure. ***, ** and * indicate a significance level of 1%, 5% and 10%.

Looking at the data, it becomes apparent that an equally weighted portfolio strategy based on a high controversies score leads to a high outperformance. This suggests that small companies in particular generate a sustained stock performance if they have a "clean coat" with regard to controversies. Thus, one might say that they "fly under the radar".

Table 5
Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 on a monthly basis. The regressions are calculated individually for each value-weighted portfolio based on a 10% cutoff of each score, market and portfolio set. The best (worst) portfolios consist of the 10% best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones. Monthly alphas, all estimated coefficients of the five Fama and French (2015) factors and adj. R² are reported. In order to estimate standard errors, we use the Newey and West (1987) procedure. ***, ** and * indicate a significance level of 1%, 5% and 10%.

Last but not least, the above observations are also reflected in the combined score portfolios. On the one hand, the effect of the TR worst portfolios also occurs in the combined score worst portfolios, which are by definition strongly influenced by the TR score. On the other hand, it is not surprising that the returns of these portfolios are slightly lower than those of the corresponding TR worst portfolios, which can be explained by the influence of the controversies score.
To discuss these results against the background of the current literature, we treat value-weighted and equally weighted strategies separately. In line with previous studies such as Halbritter and Dorfleitner (2015), we confirm that a market-weighted ESG strategy does not result in persistent significant outperformance; for this strategy, there is no clear out- or underperformance of best or worst portfolios.
The hypothesis of a positive relationship between the CSP and the CFP of a company (see, e.g., Kempf and Osthoff 2007) can only partly be confirmed. Evidently, there is no performance loss when investing in ESG portfolios, but the data suggest that there is also no persistent outperformance for companies with high ESG ratings. For these portfolios, we therefore strongly support the results of Revelli and Viviani (2015), namely that neither weaknesses nor strengths can be detected for value-weighted positive CSP strategies.
However, the picture changes when considering equally weighted portfolios. Remarkably, no significant negative performance is detected when investing in best ESG portfolios with an equally weighted strategy. Thus, there are no ESG-based performance losses for investors. Statman and Glushkov (2009) find that investors can achieve positive abnormal returns with socially responsible top-minus-bottom strategies using equally weighted portfolios. Given the results of our best-minus-worst portfolios, however, there is no reason for investors to pursue this strategy nowadays because, in particular, the worst portfolios based on the TR score reveal a significant outperformance. This also contradicts Auer (2016), who claims that investors should eliminate firms with the worst ESG ratings, whereas we find evidence that these firms offer some potential for (ESG-neutral) investors. Moreover, this finding contradicts even Kempf and Osthoff (2007), who use a long-short strategy and obtain an outperformance. In contrast, "doing well while doing good" did not manifest itself at all in our results.
Proponents of market efficiency would expect an immediate reaction on the stock market in the face of a controversy. From this perspective, no long-term outperformance can be expected, so it is surprising that there are several corresponding findings for the controversies score portfolios. Although the occurrence of controversies may be immediately priced by the market, as indicated by the absence of underperformance of the worst controversies score portfolio, the absence of controversies appears to be incorrectly evaluated for small companies. The significant outperformance of the best-rated companies therefore indicates a less efficient market regarding ESG-based information, as discussed by Edmans (2011), Mynhardt et al. (2017) and Dorfleitner et al. (2018). Smaller companies without an unwanted boost in public perception due to a controversy remain "silent saints", so to speak, and "fly under the radar". The controversies score enables a valuation of controversies that do not take place and may therefore be a good tool to enhance ESG investment, as it reveals companies with few scandals and a specific potential for an increase in market value and stock price.
An additional consideration of the Fama and French factor coefficients yields some interesting insights regarding the differences between value and equal weighting. First, the market betas are generally around 1, but tend to be lower for value-weighted portfolios. This is not surprising, as smaller companies may have higher market betas and these companies are represented with higher weights in the equally weighted portfolios. Second, the controversies best, TR worst and combined worst equally weighted portfolios have significantly positive SMB_t factor coefficients that are larger in absolute value than those of the respective value-weighted portfolios, which is again explained by the higher weights of smaller companies. Third, the remaining factors show no systematically deviating patterns.
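For reference, the Fama and French (2015) five-factor regression underlying the alphas and factor coefficients discussed above has the standard form shown below. The coefficient symbols follow common usage and are notation assumed here, since the excerpt itself names only SMB_t explicitly.

```latex
R_{p,t} - R_{f,t} = \alpha_p
  + \beta_p \,(R_{M,t} - R_{f,t})
  + s_p \,\mathit{SMB}_t
  + h_p \,\mathit{HML}_t
  + r_p \,\mathit{RMW}_t
  + c_p \,\mathit{CMA}_t
  + \varepsilon_{p,t}
```

Here R_{p,t} is the return of portfolio p in month t, R_{f,t} the risk-free rate and R_{M,t} the market return. The intercept α_p is the monthly abnormal return, β_p the market beta, and s_p the loading on the size factor referred to in the second observation above.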

Portfolios based on market capitalization
To further investigate whether the observed strong outperformance of equally weighted portfolios with low TR ratings and high controversies scores is driven by company size, we divide our dataset at the median of the market capitalization and create new portfolios based on companies with high and low market capitalizations. Table 6 displays these portfolios based on a 10% cutoff for the European, US and global markets. It is apparent from this table that the main results remain consistent, namely a significant outperformance of portfolios based on small companies with low TR scores as well as portfolios based on small companies with fewer controversies and therefore a high controversies score.
It can also be seen from Table 6 that even the value-weighted calculations based on firms with low market capitalization mostly show significant and positive alphas for the controversies best and TR worst portfolios, which corroborates our results.
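The size split and the subsequent screening can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: companies exactly at the median are assigned to the high-capitalization group, the number of selected companies is rounded down (with a minimum of one), and all function names are hypothetical.

```python
from statistics import median


def split_by_median_cap(market_caps):
    """Split companies into low- and high-capitalization groups at the median.

    Assumption: companies exactly at the median go to the high group.
    """
    cut = median(market_caps.values())
    low = {c for c, cap in market_caps.items() if cap < cut}
    high = {c for c, cap in market_caps.items() if cap >= cut}
    return low, high


def best_and_worst(scores, cutoff=0.10):
    """Return the top and bottom `cutoff` share of companies by score.

    Assumption: the group size is rounded down, with a minimum of one.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)  # best score first
    k = max(1, int(len(ordered) * cutoff))
    return ordered[:k], ordered[-k:]


if __name__ == "__main__":
    caps = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}
    print(split_by_median_cap(caps))
```

Applying `best_and_worst` separately to the scores of each size group yields portfolios analogous to those in Table 6.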

Table 6
Alphas of equally and value-weighted 10% portfolios: regression based on high and low market capitalization. This table shows the alphas of the Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 on a monthly basis. The regressions are calculated individually for each equally and value-weighted portfolio based on a 10% cutoff of each score, market and portfolio. The calculations are performed on the basis of our dataset divided by the median of the market capitalization. The best (worst) portfolios consist of the 10% best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones. Monthly alphas are reported. In order to estimate standard errors, we use the Newey and West (1987) procedure. ***, ** and * indicate a significance level of 1%, 5% and 10%.

Rank-weighted portfolios
Table 7 displays best and worst rank-weighted portfolios based on a 10% cutoff for the European, US and global market. When considering these portfolios, nearly all returns of the best and worst portfolios are higher than with the corresponding equally weighted strategies. Based on these calculations, the returns improve by up to 42.86%⁷ for the best, by up to 32.24%⁸ for the worst and by up to 84.28%⁹ for the best-minus-worst portfolios, compared with the corresponding equally weighted portfolios. Note that the rank-weighted portfolios also exhibit lower p values, i.e., stronger statistical significance, which indicates a real potential for investors. On the one hand, there are a number of promising investment strategies for investors who attach strong importance to ESG scores. As previously mentioned, the controversies score in particular represents a huge potential for investors, and together with a rank-weighted portfolio strategy the corresponding alphas increase even further. This score thus offers a way to detect companies with a specific management culture that apparently leads to higher future cash flows and therefore to higher and more significant alphas.

Surprisingly, companies with a high controversies score do not necessarily have a high ESG score. This noteworthy observation remains open for future research.
On the other hand, investors pursuing exactly the opposite strategy also benefit from rank-weighted portfolios. This is particularly evident in the outperformance of the TR worst portfolios. Evidently, stronger weightings for firms with very low TR scores lead to significant outperformance, which can be traced back to a trade-off interpretation (see Aupperle et al. 1985). In summary, rank-weighted portfolios represent a useful tool for investors who wish to profit from ESG ratings, either by investing in high-ranked companies or by investing in low-ranked firms. In a nutshell: buy the "saints" or invest in the "small sinners".
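The rank-weighting idea described earlier can be sketched as follows. Note the assumption: weights are taken to be proportional to the rank, i.e., w_{c,t} = Rk_t(c) / (N_t (N_t + 1) / 2), which is one simple scheme consistent with the stated properties (higher rank, higher weight; weights sum to one) but is not confirmed as the authors' exact formula. The function name and the tie handling (input order) are likewise hypothetical.

```python
def rank_weights(scores, best=True):
    """Rank-proportional weights for the companies in a portfolio C_t.

    scores: dict mapping company -> score for the N_t selected companies.
    best=True : the highest score receives the highest rank N_t.
    best=False: the lowest score receives the highest rank N_t.
    Assumption: weight = rank / (1 + 2 + ... + N_t); ties keep input order.
    """
    # Sort so that the company that should get rank 1 comes first.
    ordered = sorted(scores, key=scores.get, reverse=not best)
    n = len(ordered)
    rank_sum = n * (n + 1) / 2.0
    return {c: (i + 1) / rank_sum for i, c in enumerate(ordered)}


if __name__ == "__main__":
    scores = {"A": 90.0, "B": 80.0, "C": 70.0}  # hypothetical scores
    print(rank_weights(scores, best=True))   # A gets the largest weight
    print(rank_weights(scores, best=False))  # C gets the largest weight
```

Under this assumption, a best portfolio of three companies assigns weights of 1/6, 2/6 and 3/6 in ascending order of score, so its average rating exceeds that of the equally weighted portfolio of the same companies.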

Robustness checks
To check our results for robustness, we run some further regressions. First of all, we construct the equally weighted portfolios based on the 20% (instead of 10%) best and worst companies. Again we use the Fama and French (2015) five-factor regression model. The results are presented in Table 8 and indicate that all previous results remain materially the same for the 20% equally weighted selection, i.e., an outperformance of the controversies score best and the TR and combined score worst portfolios.
Table 7
Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 on a monthly basis. The regressions are calculated individually for each rank-weighted portfolio based on a 10% cutoff of each score, market and portfolio set. The best (worst) portfolios consist of the 10% best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones. Monthly alphas and adj. R² are reported. In order to estimate standard errors, we use the Newey and West (1987) procedure.

Moreover, with regard to the rank-weighted strategy, the 20% portfolios are also examined. Following the same procedure, this leads to the results displayed in Table 9. In this case, too, all previous results remain approximately unchanged. Compared with the 20% equally weighted portfolios, most of the alphas are higher. For instance, we observe an almost 20% increase in the alpha of the controversies best portfolio in the global market, from 0.0046 to 0.0055, both being significant at the 1% level.
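The "almost 20%" figure is the relative change between the two alphas quoted above. As a worked check, with the subscripts introduced here purely as labels for the rank-weighted and equally weighted 20% portfolios:

```latex
\frac{\alpha_{\text{rank}} - \alpha_{\text{equal}}}{\alpha_{\text{equal}}}
  = \frac{0.0055 - 0.0046}{0.0046}
  \approx 0.196
```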
Table 8
Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 on a monthly basis. The regressions are calculated individually for each equally weighted portfolio based on a 20% cutoff of each score, market and portfolio set. The best (worst) portfolios consist of the 20% best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones. Monthly alphas and adj. R² are reported. In order to estimate standard errors, we use the Newey and West (1987) procedure.

Table 9
Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 on a monthly basis. The regressions are calculated individually for each rank-weighted portfolio based on a 20% cutoff of each score, market and portfolio set. The best (worst) portfolios consist of the 20% best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones. Monthly alphas and adj. R² are reported. In order to estimate standard errors, we use the Newey and West (1987) procedure.

Table 10
Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 divided into bull and bear market periods. The regressions are calculated individually for each equally weighted portfolio based on each score, market and portfolio set. The best (worst) portfolios consist of the best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones.

Table 11
Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 on a monthly basis divided into two subperiods. The first subperiod dates from April 2002 to March 2010 and the second from April 2010 until April 2018. The regressions are calculated individually for each equally weighted portfolio based on a 10% and 20% cutoff of each score, market and portfolio set. The best (worst) portfolios consist of the 10% and 20% best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones. Monthly alphas and adj. R² are reported. In order to estimate standard errors, we use the Newey and West (1987) procedure.

Table 12
Alphas of equally and value-weighted 20% portfolios: regression based on high and low market capitalization. This table shows the alphas of the Fama and French (2015) five-factor regression for portfolios from 2002 to 2018 on a monthly basis. The regressions are calculated individually for each equally and value-weighted portfolio based on a 20% cutoff of each score, market and portfolio. The calculations are performed on the basis of our dataset divided by the median of the market capitalization. The best (worst) portfolios consist of the 20% best (worst) rated companies regarding a particular score. The best-worst portfolios are long in the best-performing companies and short in the worst-performing ones. Monthly alphas are reported. In order to estimate standard errors, we use the Newey and West (1987) procedure. ***, ** and * indicate a significance level of 1%, 5% and 10%.

As a next step, we divide our portfolios into bull and bear market periods to monitor how the portfolio strategies perform in different market phases. The results are shown in Table 10. The data suggest that the majority of the strategies work in bull markets. One caveat cannot be ignored: our investigation period contained mostly bullish phases and only a few bearish periods, which are comparatively short. Since we are nevertheless able to detect a number of positive significant results in bearish market phases, for example, the best controversies portfolio in the US market or most portfolios in the global market, this indicates that the strategies are robust to various market movements. Furthermore, we split our portfolios into two subperiods (Table 11). The first subperiod dates from April 2002 to March 2010 and the second from April 2010 until April 2018. The findings show, for the US and global portfolios in particular, that the abnormal returns are maintained even under this sample split. Finally, we also winsorize the returns at the 1% level and re-run all regressions. The results remain unchanged.
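The winsorization step can be sketched as below. The exact convention is not stated in the text, so this sketch assumes a symmetric, two-sided treatment in which the most extreme 1% of observations in each tail are replaced by the nearest remaining value; the function name is hypothetical.

```python
def winsorize(values, level=0.01):
    """Clip the most extreme `level` share of observations in each tail.

    Assumption: two-sided and symmetric; the number of clipped observations
    per tail is int(level * n), rounded down.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = int(level * n)                       # observations clipped per tail
    lower, upper = ordered[k], ordered[n - 1 - k]
    return [min(max(v, lower), upper) for v in values]


if __name__ == "__main__":
    # 200 hypothetical return observations: the two smallest and the two
    # largest values are pulled in to the next remaining value.
    sample = [float(i) for i in range(200)]
    clipped = winsorize(sample, level=0.01)
    print(min(clipped), max(clipped))  # 2.0 197.0
```

Unlike trimming, this keeps the number of monthly observations unchanged, so the regressions can be re-run on the same sample length.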
In addition, we also construct equally and value-weighted portfolios based on 20% (instead of 10%) best and worst companies with high and low market capitalization. The results of these regressions are displayed in Table 12. All previous major results remain materially unchanged for the 20% portfolios.
In order to include transaction costs, it is necessary to account for the turnover rate of the considered portfolios. For the 10% cutoff and US portfolios, we observe an average monthly turnover of 6.74% for the best and 8.55% for the worst TR portfolios, 11.82% and 9.15% for the controversies score portfolios, and 8.84% and 9.69% for the combined score portfolios. This remains at a similar level for the other markets under review, so that the average monthly turnover rate stands at approximately 10%. For all other cutoffs, the turnover rate is materially the same. Thus, in line with Frazzini et al. (2018), these portfolio strategies lead to expected annual trading costs between 90 and 150 bps, which implies that the significant alphas remain positive even after transaction costs.
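The following sketch illustrates the two quantities involved. Both definitions are assumptions made for illustration: turnover is computed as one-way turnover (half the sum of absolute weight changes between two rebalancing dates), and the net alpha is obtained by simply subtracting an annual cost in basis points from the compounded monthly alpha. The text does not state which turnover definition the authors use.

```python
def one_way_turnover(prev_weights, new_weights):
    """Half the sum of absolute weight changes between two rebalancing dates.

    Assumption: one-way turnover; companies missing from a dict have weight 0.
    """
    names = set(prev_weights) | set(new_weights)
    return 0.5 * sum(
        abs(new_weights.get(c, 0.0) - prev_weights.get(c, 0.0)) for c in names
    )


def net_annual_alpha(monthly_alpha, annual_cost_bps):
    """Compounded annual alpha minus annual trading costs given in bps."""
    gross = (1.0 + monthly_alpha) ** 12 - 1.0
    return gross - annual_cost_bps / 10_000.0


if __name__ == "__main__":
    # Hypothetical two-stock portfolio in which B is replaced by C.
    print(one_way_turnover({"A": 0.5, "B": 0.5}, {"A": 0.5, "C": 0.5}))  # 0.5
    # Monthly alpha of 0.0056 with the upper cost bound of 150 bps.
    print(net_annual_alpha(0.0056, 150) > 0)  # True
```

With a monthly alpha of 0.0056 (about 6.9% p.a.), even the upper cost bound of 150 bps leaves a net annual alpha above 5% under these assumptions.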

Conclusion
In this paper, we examine a dataset that includes over 4700 companies and the associated TR, controversies and combined scores in the Thomson Reuters Eikon universe in the investigation period from 2002 to 2018. All calculations are performed for the European, US and global markets. This paper is the first to investigate positive screened portfolios based on the controversies score, which measures the number of ESG-based controversies a company has faced. The calculations based on the Fama and French (2015) five-factor model show that there is still potential for an investor to achieve a significant outperformance. Even though a value-weighted investing strategy does not show any significant over- or underperformance and therefore confirms many findings of the previous literature (see Halbritter and Dorfleitner 2015), we find some noteworthy results.
First of all, the inclusion of the controversies score in an ESG-based portfolio selection approach offers a simple way to quantify and evaluate the absence of a certain event, namely an ESG-based scandal, which might help to improve the information efficiency of the market with regard to the absence of such events. Furthermore, from an investor's standpoint, having a "clean coat" with regard to controversies is especially profitable for smaller companies, as the absence of scandals may be overlooked and incorrectly incorporated in market prices. Thus, one might say that the respective companies "fly under the radar".
In addition, equally weighted portfolio strategies based on worst TR and combined scores show significant outperformance, which leads to the conclusion that for the respective (small) companies there are indications in favor of the trade-off theory. Moreover, the results hold true for various robustness checks such as the variation of cutoff levels or the splitting of the period under review. Besides the two standard approaches to portfolio formation, namely value and equal weighting, we discover new potential for investors in the rank-weighted strategy, which leads to improvements in terms of both alpha and level of significance within most of the investigated portfolios. For investors who attach great importance to ESG ratings, this represents an enormous opportunity to reward better scoring placements of companies and additionally to gain higher returns.
In light of these findings, it should be kept in mind that there are hidden opportunities for investors that can be exploited in order to benefit from ESG-based ratings. The empirical results and arguments provided above show that it is worth remaining vigilant concerning this issue.