1 Introduction

The majority of life cycle assessment (LCA) studies are relative in the sense of involving a comparison (Heijungs et al. 2019). Comparative LCA studies usually deal with the comparison of alternative products that fulfill a similar function (such as an electric car and a gasoline car) or of alternative production processes that produce the same product (such as coal-based electricity and nuclear electricity). They are done to support decisions regarding the best-performing (minimum-impact) product or process. But there is another class of comparisons, which is more methodological. Such comparisons focus on alternative methods for calculating LCA results. Here we refer to such studies as “meta-comparisons,” because they take a more abstract and higher vantage point.

This article provides a critical analysis of papers that engage in meta-comparative LCA, comparing methods for LCA. Because we will compare methods for meta-comparison, our paper may be classified as meta-meta-comparative (see https://xkcd.com/1447/). We believe that ours is the first paper studying meta-comparisons (although we acknowledge that Pizzol et al. (2011) observed that “it is not straightforward to compare the methods,” and that Dong et al. (2016) wrote that “there is a lack of an agreed approach that can differentiate various [life cycle impact assessment] methods”), and that even the term “meta-comparison” has not been used before in the context of LCA.

As a first defining feature, we emphasize that our paper is not about so-called “meta-analysis of LCA” (Brandão et al. 2012) or meta-regression of LCA (Menten et al. 2013). Such studies try to draw generic lessons for product groups from a limited number of studies. Meta-comparative LCA, by contrast, aims to draw lessons on methods for LCA.

With the title’s term “methods for LCA,” we have several groups of studies in mind. Below, we briefly review the literature in a number of major topics:

  • comparing life cycle impact assessment (LCIA) methods;

  • comparing inventory (LCI) methods;

  • developing streamlined LCA methods; and

  • comparing software and databases.

As a sidenote, we emphasize that our meaning of “methods for LCA” is much broader than LCIA methods: it includes methods for the full LCA calculation.

A major group is provided by the studies that compare competing LCIA methods. An early example is Baumann and Rydberg (1994), who compare three LCIA methods that employ different principles. Most later analyses focus specifically on comparing characterization methods (Dreyer et al. 2003; Van der Werf and Petit 2002; Landis and Theis 2008; Weidema 2015; Chen et al. 2021), but some authors concentrate on normalization methods (Lautier et al. 2010; Myllyviita et al. 2014) or weighting methods (Huppes et al. 2012; Myllyviita et al. 2014), or address the full LCIA pathway (Notarnicola et al. 1998; Brent and Hietkamp 2003; Bovea and Gallardo 2006). This group also includes studies that compare LCA results of an established impact assessment method with an updated version (Dekker et al. 2020).

A second group is formed by the studies that compare process-based, IO-based and hybrid inventories. Here we mention Hendrickson et al. (1997), Suh and Huppes (2005), Junnila (2006), Islam et al. (2016), and Crawford et al. (2018) as key representatives. In this group, we also include studies that investigate the effects of other inventory choices, such as allocation (Huijbregts 1998; Curran 2007) and algorithm (Heijungs et al. 2015).

A third group comprises the studies that use a streamlined method to approximate an LCA. For instance, Huijbregts et al. (2006) propose the use of cumulative energy demand (CED) of products as a proxy for the impact scores of a host of other impact categories, including global warming, stratospheric ozone depletion, acidification, eutrophication, photochemical ozone formation, land use, resource depletion, and human toxicity. This idea has been further refined by, among others, Röös et al. (2013), Scipioni et al. (2013), and Steinmann et al. (2017a). A more abstract version of this type of analysis is the study on the use of a product property (such as life span or weight) as a predictor (Padey et al. 2013; Eddy et al. 2015). The idea has also been employed to predict characterization factors (e.g., Birkved and Heijungs 2011) or to predict entire LCA scores from chemical properties (e.g., Wernet et al. 2008; Eckelman 2016).

A final group of studies compares software and databases for LCA, using the same settings (system boundaries, LCIA methods, etc.). Examples include Speck et al. (2015), Herrmann and Moltesen (2015), and Iswara et al. (2020). It also includes studies that compare algorithms for LCA, like Peters (2007) and Heijungs et al. (2015). Notice that software is sometimes mixed up with the implemented data. An example is the study by Martínez et al. (2015), which sets out to compare software, but effectively compares different LCIA methods.

Some of these studies are of an analytical nature: they dissect the logical and/or mathematical structure of the contrasting methods and expose differences in assumptions, principles, and value choices. Examples are Van der Werf and Petit (2002), Amani et al. (2011), Núñez et al. (2016), and Crawford et al. (2018).

Other studies employ a quasi-empirical set-up. They use different LCA methods to calculate results for a number of products, and then check the degree of agreement between these results. Within this approach, there is quite some diversity in the details. One extreme is presented by Dreyer et al. (2003), who apply several LCA methods to just one product, and use a contribution analysis to assess the degree of correspondence. Another extreme is Laurent et al. (2012), who use up to 3954 products and calculate correlation coefficients and other statistics. Many studies are in-between: for instance, Cavalett et al. (2013) base their analysis on two products, and Röös et al. (2013) use 53 products. Notice that we speak of quasi-empirical studies: empirical studies work with observed data, whereas in quasi-empirical studies the data are constructed from an available unit process set or LCI database.

Our focus in this article is on these quasi-empirical studies. We will analyze a number of such studies and seek to find out the approaches used and the strengths and weaknesses of those approaches given the specific purpose of the analysis.

Our motivation for this purpose is that there is not only little methodological guidance for comparative LCA (the only sources we are aware of are Heijungs and Suh (2002) and Jung et al. (2014)), but that the situation for meta-comparative LCA is even more obscure (Pizzol et al. 2011; Dong et al. 2016). For instance, Huijbregts et al. (2006) report regression coefficients and \({R}^{2}\) statistics, while Dekker et al. (2020) report \(t\) statistics, root mean square errors (RMSEs), and Spearman correlation coefficients, and Simões et al. (2011) compare the LCIA results with a primarily verbal approach. In some cases, the set of products used as a benchmark displays a span of several orders of magnitude in the scores, which induces some researchers (e.g., Bösch et al. 2007; Steinmann et al. 2017a) to study logarithmic relationships. Some authors speak of “significant correlations” (e.g., Berger and Finkbeiner 2011) without using a hypothesis test; others explicitly set a “significance alpha” (e.g., Pascual-González et al. 2016). A further complication is that the studies use different words and symbols for the techniques and indicators (for instance, the “correlation coefficient” denotes \({r}^{2}\) for Huijbregts et al. (2006) but \(r\) for Röös et al. (2013)), that even a single article is sometimes inconsistent in its symbol use (e.g., Pascual-González et al. (2016) show several figures with an \(\Omega\)-axis, which in their text is probably \({I}_{k}\)), that equations are sometimes absent altogether (e.g., in Sousa et al. 2000), that equations in a few cases contain mistakes (e.g., Kaufmann et al. (2010) show an Eq. (1) in which a term \(\underset{j}{\mathrm{max}}\left\{{E}_{i}{m}_{i}\right\}\) occurs), and that many more things can go wrong (e.g., Ligthart and Ansems (2019) report cases with “\(p<0.00\)”). Some authors do not specify equations but refer to specific software. For instance, Dekker et al. (2020) write that the “statistical analysis was done with R-studio version 3.4.0,” which further complicates finding out the details, especially when no code is provided as supplementary information. Altogether, it appears that meta-comparative LCA has been practiced a lot, but that there is no guidance, let alone agreement, on the methodological basis for carrying out such studies.

Some of the approaches have been criticized, for a variety of reasons. Hanes et al. (2013) criticize the use of log-transformed variables, and Heijungs (2017) comments on the absence of random sampling, which would rule out the use of confidence intervals and \(p\) values. Valente et al. (2019) check the relationship between global warming and acidification for a number of hydrogen production systems, but they find a disappointing goodness-of-fit. Another possible critique is that many statistical techniques require assumptions, for instance, normal distributions or independence, and that such assumptions are often not mentioned or not checked. The role of confounding variables (Pourhoseingholi et al. 2012) is in general also left unchecked.

Altogether, it appears that methodological guidance is needed to facilitate meta-comparative LCA, in order to eventually improve LCA and LCIA practices, reduce uncertainties, evaluate robustness of outcomes, and improve decision support. In Sect. 2, we will analyze a large number of such meta-comparative studies. Section 3 will examine the major techniques in terms of desirable and undesirable properties. Section 4 will then propose an innovative technique, which will be illustrated with the data set from Dekker et al. (2020). Section 5 summarizes and concludes.

2 Review of existing approaches

In this section, we analyze a number of meta-comparative LCA studies in order to extract the approach taken. In this process, we will focus on the quasi-empirical studies, which we define to be studies that calculate LCA results (LCI, midpoint LCIA, endpoint LCIA, weighting) for a number of products with two or more different LCA methods, which are next submitted to a quantitative analysis in order to draw conclusions on the agreement or disagreement between the methods.

2.1 Notation and terminology

Because every study employs its own notation and terminology, we will introduce a uniform set of principles here. The analysis is done on the basis of a sample of \(n\) products. The scores for one indicator will be denoted by \({x}_{i}\) (\(i=1,\dots ,n\)) and for the other indicator, it will be \({y}_{i}\). For instance, the \(x\) scores may be the values for the predictor or streamlined or old characterization method, and the \(y\) scores the values for the predicted or full or new method.

Within this format, we discern five major purposes of the studies:

  • streamlining;

  • proxy;

  • reduction;

  • comparison; and

  • sensitivity.

Streamlining includes those studies that attempt to mimic a full result (\({y}_{i}\)) by means of another, more easily determined, result (\({x}_{i}\)). The interesting question is then to what extent \({x}_{i}\) resembles \({y}_{i}\). A particular characteristic of such studies is that \(x\) and \(y\) have the same unit. For instance, both are expressed in kg CO2-equivalent. For an example, we refer to Frischknecht et al. (2007), who study to what extent the results of an LCA are influenced by ignoring capital goods. In those studies in which a value is predicted, we use a hat to indicate the predicted value. For instance, \({y}_{i}\) is the observed value for product \(i\), and \({\widehat{y}}_{i}\) is the predicted value.

The group of proxy studies includes studies that attempt to establish or test a relationship between a proxy indicator (\({x}_{i}\)) and the real indicator (\({y}_{i}\)). In this case, the purpose is not necessarily to mimic \({y}_{i}\), but rather to find out to what extent choices (“product A is the best”), rankings (“product A is better than product B”), or subdivisions of scores (“\(60\%\) of the score for product A is caused by transport”) are stable when \(x\) scores are used instead of \(y\) scores. Here the \(x\) and \(y\) scores may have different units; for instance, \(x\) is in MJ of primary energy and \(y\) in kg CO2-equivalent. An example is the paper by Huijbregts et al. (2006), where the cumulative energy demand is the predictor (\(x\)) and a variety of impact categories (global warming, stratospheric ozone depletion, acidification, etc.) is the variable-to-be-predicted (\(y\)).

With reduction studies, we embrace studies that seek to reduce the number of indicators to a smaller subset. A typical example is provided by Steinmann et al. (2016), who attempt to reduce the “hundreds of indicators” to “a nonredundant key set of indicators representative of the overall environmental impact.”

The group of comparison studies comprises studies that do not seek to predict or provide a proxy, but that merely try to find out how different the results are. Baumann and Rydberg (1994) provide a typical example here. This type also includes the study of updates. For instance, Dekker et al. (2020) use ReCiPe 2008 for the \(x\) and ReCiPe 2016 for the \(y\). In some cases, the two variables will have equal units, but there may also be situations in which this is not the case. A variation on this is found in studies like Junnila (2006), which compare process-based (\(x\)) and input–output-based (\(y\)) results, without necessarily declaring that one is better than the other.

Finally, there are studies that primarily study one specific product and apply several methods (e.g., several LCIA methods) to study how robust the result is for methodological choices. We refer to these as sensitivity studies. A typical example is Cavalett et al. (2013), who compare gasoline and ethanol “using different LCIA methods.”

These five purposes are summarized in Table 1.

Table 1 Proposed differentiated use of statistical techniques per purpose

Many quasi-empirical, meta-comparative studies employ overall descriptive statistics that are computed from the \(\left({x}_{i},{y}_{i}\right)\) data, such as a correlation coefficient or \(p\) values. The five types of studies may require different types of statistics. After all, for the streamlining group, we expect that \({y}_{i}\) is close to \({x}_{i}\) for the majority of products, but for the proxy group, we might be more interested in a robust ranking of the products. As such, there is no universally best meta-comparison indicator; instead, the appropriate indicator depends on the purpose.

Because ranking can be important for certain applications, we need to introduce the idea more precisely. Order statistics of a data vector refer to a rearrangement of the data vector, such that the elements are ordered from small to large. The \(i\) th order statistic of a data vector with elements \({x}_{1},\dots ,{x}_{n}\) is indicated by \({x}_{\left(i\right)}\). Altogether, we have \({x}_{\left(1\right)}\le {x}_{\left(2\right)}\le \cdots \le {x}_{\left(n-1\right)}\le {x}_{\left(n\right)}\). Using this notation, we can easily indicate the smallest value of \(x\) by \({x}_{\left(1\right)}\) and the largest value by \({x}_{\left(n\right)}\). Ranks refer to the place of a particular value \({x}_{i}\) in the vector of order statistics. Often, symbols like \({R}_{i}\) are used to indicate ranks, but as we need to be able to distinguish the ranks of the \(x\)- and \(y\)-series, we prefer the notation \({R}_{{x}_{i}}\) and \({R}_{{y}_{i}}\). In ranking, a choice has to be made about how to handle ties. Ties occur when two or more data points have the same value. We will adopt the midrank convention, in which all data with the same value will receive an average rank (Agresti 2002).

Ranking implies a preference. If \({x}_{i}>{x}_{j}\) and a lower value of \(x\) is preferable (“less is better”), we have \({R}_{{x}_{i}}<{R}_{{x}_{j}}\). We further write in that case that \(i\prec j\), meaning that product \(i\) has a lower preference than product \(j\). The symbol \(\sim\) indicates indifference.
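As a concrete illustration of the midrank convention, the following sketch (ours, not taken from any of the reviewed studies; the data values are hypothetical) uses scipy's rankdata function, whose "average" method implements exactly this convention:

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([3.2, 1.5, 4.8, 1.5, 2.0])  # hypothetical scores with one tie
ranks = rankdata(x, method="average")    # midrank convention for ties
print(ranks)                             # [4.  1.5 5.  1.5 3. ]: the two 1.5s share rank (1+2)/2
```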

In some cases, we will need to work with the average value of \(x\) or \(y\) over the entire set of products. For this, we use the bar-notation:

$$\overline{x }=\frac{1}{n}\sum_{i=1}^{n}{x}_{i}$$
(1)

and similar for \(\overline{y }\). Likewise, the standard deviation will be indicated by \(s\), with possible subscripts for \(x\) and \(y\):

$${s}_{x}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}{\left({x}_{i}-\overline{x }\right)}^{2}}$$
(2)

and similar for \({s}_{y}\). The variances are then simply the squared standard deviations: \({s}_{x}^{2}\) and \({s}_{y}^{2}\).

Some authors do not analyze the raw data, but use the logarithm of the values of \({x}_{i}\) and \({y}_{i}\). In the analysis below, we will pay particular attention to this aspect. At some places, however, we will ignore the use of logarithms and just provide formulas with \(x\) and \(y\), in which, if needed, \(\mathrm{log}\left(x\right)\) and \(\mathrm{log}\left(y\right)\) may be inserted, where \(\mathrm{log}\) may denote the base-\(10\) or the natural logarithm.

In some studies, there are multiple \(x\) variables. We will then write \(k\) for the number of \(x\) variables. We will indicate the values for product \(i\) as \({x}_{i1},{x}_{i2},\dots ,{x}_{ik}\). The data can then be conceived as building a data matrix \(\mathbf{X}\).

In some of the studies analyzed, hypothesis tests are used. The test statistic will be indicated by symbols like \(z\), \(t,\) and \(F\) for the standard normal, Student \(t,\) and Fisher \(F\) distribution, and where needed degrees of freedom will be indicated by \(df\), \(d{f}_{1}\), etc. The resulting \(p\) values will be indicated by \(p\), and the critical value for significance by \(\alpha\). The null hypothesis is indicated by \({H}_{0}\) and the alternative hypothesis by \({H}_{1}\). We use the convention that \({H}_{0}\) and \({H}_{1}\) are complementary (see, e.g., Ott and Longnecker 2015), for instance \(H_0\!\!:\mu \ge 0\) versus \(H_1\!\!:\mu <0\). In this, we deviate from some other texts (e.g., Agresti and Franklin 2013), who use \(H_0\!\!:\mu =0\) versus \(H_1\!\!:\mu <0\).

In a situation of sampling, we should distinguish population parameters and sample statistics. In general, we will use Greek letters, like \(\sigma\) and \(\beta\), for parameters, and their Roman equivalents, like \(s\) and \(b\) for their realized values in a sample. An exception is the mean, for which the parameter is \(\mu\) and the statistic \(\overline{x }\). The sample statistic as a random variable will be denoted by Roman capitals, such as \(\overline{X }\), \(S,\) and \(B\).

In the following sections, we will analyze the approaches taken by the major studies from the literature. We will often, without notice, change the original symbols to agree with the uniform principles outlined above. We will also sometimes add, or remove, some other details, such as indices and summation symbols.

2.2 Review of studies

There is no objective bibliometric way to identify meta-comparative LCA studies. For the purpose of our review, we selected studies on the basis of our own knowledge of the literature, including the references in and to those papers. This resulted in a collection of around 100 papers, most of which were published in peer-reviewed journals. Restricting attention to quasi-empirical methods reduces this number slightly. Table 2 provides an overview of the selected articles, with an indication of their main characteristics.

Table 2 Overview of articles on meta-comparative LCA that apply a quasi-empirical procedure

The table reveals that meta-comparative LCA in fact has been done quite often, by many authors, on different topics, and using an array of techniques. Nevertheless, we can discern a number of trends:

  • LCIA, and in particular the characterization, is the most popular topic;

  • comparison is the most popular purpose; and

  • the most popular statistical techniques are correlation/regression and the presentation of differences or contribution analyses.

In the next few sections, we discuss the statistical techniques in more detail.

2.3 Individual measures of difference

If the score of product \(i\) for one method is indicated by \({x}_{i}\) and for the other method by \({y}_{i}\), we can form various measures of difference. We first discuss the one-by-one measures, and then move to overall indicators.

Several studies (e.g., Junnila 2006; Weidema 2015) list the \(x\) and \(y\) scores without any further processing. Valente et al. (2018) look at the difference between the two scores:

$${d}_{i}={x}_{i}-{y}_{i}$$
(3)

A variation in the form of ratios is used by Herrmann and Moltesen (2015) as well as by Huijbregts et al. (2008):

$${r}_{i}=\frac{{x}_{i}}{{y}_{i}}$$
(4)

Frischknecht et al. (2007) use an indicator of the type:

$${\delta }_{i}=\frac{{x}_{i}-{y}_{i}}{{x}_{i}}$$
(5)

This indicator expresses the relative error of using \({y}_{i}\) instead of \({x}_{i}\). Crawford (2008) also uses this indicator, giving it the name “GAP.”

Several studies (e.g., Simões et al. 2011; Monteiro and Freire 2012; Cavalett et al. 2013) visualize the results with the largest indicator set to \(100\%\):

$${x}_{\mathrm{rel},i}=\frac{{x}_{i}}{{\mathrm{max}}_{j=1}^{n}{x}_{j}}\times 100\%\quad \mathrm{and}\quad {y}_{\mathrm{rel},i}=\frac{{y}_{i}}{{\mathrm{max}}_{j=1}^{n}{y}_{j}}\times 100\%$$
(6)

Valente et al. (2018) do a similar thing, but they use the largest value of both methods as a reference, inserting \({\mathrm{max}}_{j=1}^{n}\left({x}_{j},{y}_{j}\right)\) in the denominator of both expressions.

There are also studies where one product is used as a reference. For instance, Notarnicola et al. (1998) use the score for steel in the denominator.
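To make these measures concrete, the following sketch (ours; the reviewed studies do not provide code, and the scores are hypothetical) computes the individual measures of Eqs. (3) to (6), as well as the variant of Valente et al. (2018):

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0, 8.0])  # hypothetical scores, method 1
y = np.array([2.5, 4.0, 1.2, 7.0])  # hypothetical scores, method 2

d = x - y                           # differences, Eq. (3)
r = x / y                           # ratios, Eq. (4)
delta = (x - y) / x                 # relative errors, Eq. (5)
x_rel = x / x.max() * 100           # relative to the largest x score, Eq. (6)
y_rel = y / y.max() * 100           # relative to the largest y score, Eq. (6)

# Variant of Valente et al. (2018): one common reference for both methods.
common = max(x.max(), y.max())
x_rel2, y_rel2 = x / common * 100, y / common * 100
```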

Peters (2007) compares two algorithms for solving an IO-based LCI. He calculates an “error,” which compares the two methods, as well as a “tolerance”. Unfortunately, the precise details are not specified. We guess that the error is defined as \(\frac{{x}_{i}-{y}_{i}}{{x}_{i}}\), but what exactly is used here for \({x}_{i}\) (sector outputs, emissions) is unclear.

2.4 Aggregated measures of difference

The indicators above express differences per product. As such, they are less suitable for studies that address a large number of products, such as Huijbregts et al. (2006) and Pascual-González et al. (2016). In this section, we discuss the overall indicators, in which some form of aggregation or averaging over all products (\(i=1,\dots ,n\)) is made.

Dekker et al. (2020) use a number of indicators. These include the root mean square error (RMSE), defined as follows:

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({x}_{i}-{y}_{i}\right)}^{2}}$$
(7)

and its normalized version:

$${\mathrm{RMSE}}_{n}=\frac{\mathrm{RMSE}}{\overline{x} }$$
(8)

In these formulas, we have used Dekker’s reference to Timsina and Humphreys (2006), correcting a typo. Birkved and Heijungs (2011) use a “root mean square error of prediction” (RMSEP), for which they give in their appendix C a formula that is probably wrong (e.g., it contains no root and no square). Given the general idea of a RMSE, we correct it here as follows:

$$\mathrm{RMSEP}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left(\widehat{{y}_{i}}-{y}_{i}\right)}^{2}}$$
(9)

Wernet et al. (2008) argue that RMSEP is less suitable when \(y\) varies over an order of magnitude or more, and prefer to use the mean of the absolute values of the relative prediction error:

$$\mathrm{MRE}=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|\widehat{{y}_{i}}-{y}_{i}\right|}{{y}_{i}}$$
(10)

Interestingly, they apply this for \(30\) test sets, and report the mean \(\mathrm{MRE}\) (which is therefore a mean of means), the median \(\mathrm{MRE}\), as well as the standard deviation of the \(\mathrm{MRE}\). Despite their reservations, they do report the RMSE in their Supplementary material, in a similar way.
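For concreteness, a sketch (ours, with hypothetical scores) of the aggregated measures of Eqs. (7) to (10); here we let the \(x\) scores play the role of the predictions \(\widehat{{y}_{i}}\), as in a streamlining study:

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0, 8.0])          # hypothetical scores, method 1
y = np.array([2.5, 4.0, 1.2, 7.0])          # hypothetical scores, method 2
y_hat = x                                   # here x serves as the prediction of y

rmse = np.sqrt(np.mean((x - y) ** 2))       # RMSE, Eq. (7)
rmse_n = rmse / x.mean()                    # normalized RMSE, Eq. (8)
rmsep = np.sqrt(np.mean((y_hat - y) ** 2))  # RMSEP, Eq. (9)
mre = np.mean(np.abs(y_hat - y) / y)        # mean relative error, Eq. (10)
```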

A primitive form of statistical analysis is performed by Hochschorner and Finnveden (2003), who study the differences between \({x}_{i}\) and \({y}_{i}\), defining these as “significantly better,” “probably better,” etc. A more sophisticated form is presented by Dekker et al. (2020), who use a “two-sided \(t\) test,” for which no further details are provided. A study of their R code (supplied by the authors) reveals that the independent samples \(t\) test for equality of the mean, without assuming equality of variance, was used. The null hypothesis tested is as follows:

$${H}_{0}:{\mu }_{x}={\mu }_{y}$$
(11)

where \(\mu\) indicates the population mean, and the computational details are provided in the Supplementary Information of this article. When the \(p\) value is smaller than a pre-determined significance level (such as \(5\%\) or \(1\%\)), the test results in a “significant” result, in this case, a significant difference between the two means. Dekker et al. (2020) choose \(5\%\) for this.
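A minimal sketch of this test, using scipy's implementation of the unequal-variance (Welch) variant and hypothetical data:

```python
import numpy as np
from scipy.stats import ttest_ind

x = np.array([2.0, 5.0, 1.0, 8.0, 3.0])  # hypothetical scores, method 1
y = np.array([2.5, 4.0, 1.2, 7.0, 3.5])  # hypothetical scores, method 2

# Independent samples t test for equal means, without assuming equal variances.
t_stat, p_value = ttest_ind(x, y, equal_var=False)
print(p_value < 0.05)  # True would signal a "significant" difference at the 5% level
```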

Visual presentations of the difference take different forms. Dekker et al. (2020) show box plots of \(x\) and \(y\) next to each other. Huijbregts et al. (2008) also use box plots, but of the ratio \(\frac{{x}_{i}}{{y}_{i}}\). Chen et al. (2021) use scatter plots, using the horizontal axis for \(i\) and showing \(x\) and \(y\) with different colors on the vertical axis. Mendoza Beltrán et al. (2016) show box plots and partly overlapping histograms. Several authors (Huijbregts 1998; Cherubini et al. 2018) construct a “comparison indicator,” which combines a comparison of methods with a stochastic treatment of the numerical data, and present a histogram of the comparison indicator.

2.5 Contribution analysis

Several studies (e.g., Junnila 2006; Bovea and Gallardo 2006; Dewulf et al. 2007; Weidema 2015) address a few products and concentrate on the contributions made by different parts, without constructing an indicator. Most studies do this in a quantitative manner, but a few papers (e.g., Brent and Hietkamp 2003) use a more qualitative approach. The results are often presented in tables and/or bar graphs; see Bueno et al. (2016) for a good example.

Contribution analysis can proceed in different ways. Monteiro and Freire (2012) split the scores by life cycle stage (materials, transport, maintenance, etc.). Pizzol et al. (2011) by contrast specify the contribution from different stressors (aluminum, antimony, etc.) to an overall impact score (human health). Weidema (2015) shows how aggregated impact categories (such as ecotoxicity) are built-up of subscores (such as freshwater, marine, and terrestrial). Halleux et al. (2006) go even further, and show how a single index is built-up in terms of endpoint impacts (resources, ecosystem quality, human health).

2.6 Measures of correlation

In meta-comparative LCA, we expect that a product with a relatively low \(x\) value will also have a relatively low \(y\) value. The degree to which the \(x\) and \(y\) values run together can be expressed in various ways. In this section, we discuss approaches using correlation, and in the next one, regression. Note that in many of the reviewed articles, the word “correlation” is used in an overall sense, also embracing regression. For instance, Bösch et al. (2007), Kaufman et al. (2010), and Berger and Finkbeiner (2011) speak of “correlations” which they determine using regression analysis.

The Pearson product-moment correlation coefficient, or correlation coefficient for short, indicated by \(r\), is given in the Supplementary Information of this article. It measures the degree of linear correlation. If \(r=1\), there is a perfect linear positive correlation, indicating a perfect match with a straight line with positive slope \(b\):

$${y}_{i}=a+b{x}_{i}$$
(12)

with \(b>0\). If \(0<r<1\), there is a certain scatter around this line; the closer \(r\) is to \(1\), the better the agreement with the straight line. A negative value of \(r\) represents a case of anti-correlation, reflecting a straight line with a negative slope. Notice that a correlation coefficient reveals little about the value of \(b\) (only its sign, positive or negative). Correlation coefficients have been reported by, among others, Laurent et al. (2012), Röös et al. (2013), and Dong et al. (2016).

Several authors (e.g., Bösch et al. 2007; Wernet et al. 2008; Berger and Finkbeiner 2011; Valente et al. 2018) prefer to report the square of the correlation coefficient, indicated by \({R}^{2}\), and known as the coefficient of determination:

$${R}^{2}={r}^{2}$$
(13)

The reason is probably that these studies use regression analysis (see below) as a way to determine correlations. We will discuss \({R}^{2}\) in more detail below. Note that Huijbregts et al. (2006) refer to \({r}^{2}\) as the “correlation coefficient”, and that Wernet et al. (2008) are somewhat vague on this.

Laurent et al. (2012), Kalbar et al. (2017), and Dekker et al. (2020) use the Spearman correlation coefficient, or rank correlation coefficient, to check for the consistency of ranking (note that Kalbar et al. (2017) use the term “nonlinear” correlation coefficient, which is a bit misleading). It is based on the Pearson correlation of the ranked variables, and sometimes indicated by \(\rho\) or \({r}_{S}\) (see Supplementary Information). Like the Pearson correlation, the Spearman correlation is between \(-1\) and \(1\); however, its interpretation is slightly different. While the Pearson correlation indicates the agreement with a straight line, the Spearman correlation indicates consistency of ranking. If for all products the rank according to \(x\) agrees with the rank according to \(y\), we have \({r}_{S}=1\). Otherwise, the value will be less than \(1\). A Spearman correlation of \(1\) can be interpreted as a ranking-preserving signal. If for at least one pair of products \(\left(i,j\right),\) the ranking according to \(x\) differs from the ranking according to \(y\) (for instance, \({R}_{{x}_{i}}<{R}_{{x}_{j}}\) but \({R}_{{y}_{i}}>{R}_{{y}_{j}}\)) then the Spearman correlation coefficient is less than \(1\).

Because, as observed, several texts prefer to use the square of the Pearson correlation instead of the raw version, there are also authors who square the Spearman correlation. An example can be found in the paper by Wernet et al. (2008).

It is important to note that several textbooks supply a simplified version of the formula to calculate the Spearman correlation (see Supplementary Information). It is, however, only valid when there are no ties, i.e., when all \(x\) values and \(y\) values occur only once. We do not know which of the two formulas for calculating the Spearman correlation has been used by the meta-comparative studies that use this as an indicator.

A third type of correlation coefficient is known as Kendall’s \(\tau\). For reasons of consistency, we will use the symbol \({r}_{K}\) here; see Supplementary Information for details. We are aware of only one paper using it, namely Kalbar et al. (2017).
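All three correlation coefficients are available in scipy; a minimal sketch with hypothetical data (in the Spearman case, ties are handled by the midrank convention of Sect. 2.1):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.array([2.0, 5.0, 1.0, 8.0, 3.0])  # hypothetical scores, method 1
y = np.array([2.5, 4.0, 1.2, 7.0, 3.5])  # hypothetical scores, method 2

r, _ = pearsonr(x, y)      # Pearson r: agreement with a straight line
r_s, _ = spearmanr(x, y)   # Spearman r_S: consistency of ranking
r_k, _ = kendalltau(x, y)  # Kendall r_K: concordant vs. discordant pairs
```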

Correlations can be visually supported by scatter plots. Laurent et al. (2012) provide examples of such plots. Note that the straight line indicated is not a regression line, but a “45 degree” line, indicating equality. Dekker et al. (2020) show two lines: the equality line and a regression line. We discuss the regression line in more detail in the next section.

All these types of correlations can be subject to a hypothesis test, testing the null hypothesis that the population value of the correlation coefficient is \(0\); see the Supplementary Information for details. Röös et al. (2013) and Dong et al. (2016) test Pearson correlation coefficients, using \(\alpha =0.05\) as a criterion for significance, and Pascual-González et al. (2016) use \(\alpha =0.001\). Berger and Finkbeiner (2011) also identify “significant correlations,” but do not inform the reader about their criterion for significance.

A variation is the use of confidence intervals. Laurent et al. (2012) are the only case we found where confidence intervals for correlation coefficients are used.

Kalbar et al. (2017) use the Spearman correlation with a hypothesis test, but they do not specify the precise form taken. Wernet et al. (2008) indicate the critical value of their (squared) Spearman correlation, using \(\alpha =0.01\), but they also do not indicate the precise procedure (\(z\) or \(t\)).

For Kendall’s correlation coefficient, there are various forms available (see Supplementary Information). Kalbar et al. (2017) use a significance test, although precise details are not presented, except for a general reference to Matlab.

Spearman’s and Kendall’s correlation coefficients are examples of indicators that focus on the ranking of products according to the \(x\)- and \(y\)-scales. Rankings also play a prominent role in the analysis by Heijungs (2017).

2.7 Simple regression

Closely related to, but different in a number of ways from, correlation is regression analysis. Here, we restrict the discussion to simple regression, with only one \(x\) variable, which is the type of analysis used by a number of authors (e.g., Huijbregts et al. 2006; Curzons et al. 2007; Berger and Finkbeiner 2011), although in some cases a logarithmic transformation has been applied prior to the analysis (see below).

In a simple regression analysis, the data (\(x\) and \(y\)) is used to estimate:

$${y}_{i}=a+b{x}_{i}+{e}_{i}$$
(14)

where \(a\) is the intercept (or constant), \(b\) is the slope (or regression coefficient), and \({e}_{i}\) is a residual (or error) term that indicates the deviation of \({y}_{i}\) from the regression line. With such a regression line, we predict with a given \({x}_{i}:\)

$$\widehat{{y}_{i}}=a+b{x}_{i}$$
(15)

which deviates from the observed value \({y}_{i}\) by an error:

$${e}_{i}={y}_{i}-\widehat{{y}_{i}}={y}_{i}-\left(a+b{x}_{i}\right)$$
(16)

Details on the estimation procedure are in the Supplementary Information. The goodness-of-fit of a regression line is usually reported as the coefficient of determination, \({R}^{2}\) (see Supplementary Information), which can be interpreted as a fraction of explained variance. For instance, if \({R}^{2}=0.9\), the \(x\) variable accounts for \(90\%\) of the variance in the \(y\) variable; the remaining \(10\%\) is due to random (unexplained) variation.
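A minimal sketch of a simple regression, computing the quantities of Eqs. (14) to (16) and \({R}^{2}\) from hypothetical data:

```python
import numpy as np
from scipy.stats import linregress

x = np.array([2.0, 5.0, 1.0, 8.0, 3.0])  # hypothetical scores, predictor
y = np.array([2.5, 4.0, 1.2, 7.0, 3.5])  # hypothetical scores, response

fit = linregress(x, y)                   # ordinary least squares for y = a + b*x
a, b = fit.intercept, fit.slope
y_hat = a + b * x                        # predictions, Eq. (15)
e = y - y_hat                            # residuals, Eq. (16)
R2 = fit.rvalue ** 2                     # coefficient of determination
```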

The standard error of the regression, also known as residual standard error (see Supplementary Information), is another measure of the goodness-of-fit. It is used by Huijbregts et al. (2006) for calculating an “uncertainty factor,” \(k\):

$$k=\frac{97.5p}{2.5p}$$
(17)

Although not described as such, we think that \(97.5p\) refers to the \(97.5\) percentile of the distribution of residuals \({e}_{i}\), which is further assumed to be log-normally distributed. The \(k\) values have been reported in their Table 4, with values ranging between \(1.2\) and \(42000\).

Pascual-González et al. (2016) employ another measure of the quality of the fit, namely the relative error (see Supplementary Information), which they further express as a percentage. Pascual-González et al. (2015) use a variant of this, called the average relative error, which is a generalization for multiple \(y\) variables.

Birkved and Heijungs (2011) use, besides \({R}^{2}\), a related statistic, which is indicated by \({Q}^{2}\) (see Supplementary Information), and which is based on leave-one-out cross validation (LOOCV). In general, cross validation is a technique in which part of the sample (let us say, \(m<n\) data points) is used to “train” the model (i.e., to estimate the coefficients), and the rest of the data (the remaining \(n-m\) data points) is used to compute a goodness-of-fit measure. In LOOCV, we use \(m=n-1\), and loop over all \(n\) data points to find \({Q}^{2}\) that is averaged over the entire sample. For a more detailed description, we refer to James et al. (2015).

Also Wernet et al. (2008) mention the leave-one-out principle for cross validation, but they give a result (indicated as \({q}^{2}\)) with the name “coefficient of determination,” and their Supplementary Information provides a formula for \({q}^{2}\) which indeed looks more like the usual \({R}^{2}\).
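Because the cited papers do not provide enough detail to reproduce their exact procedure, the following sketch shows one plausible form of a LOOCV-based \({Q}^{2}\) for a simple regression (our reconstruction, not the original code):

```python
import numpy as np

def q_squared(x, y):
    # One plausible LOOCV Q^2 for a simple regression.
    n = len(x)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                # train on all points except i
        b, a = np.polyfit(x[mask], y[mask], 1)  # slope and intercept
        press += (y[i] - (a + b * x[i])) ** 2   # squared prediction error for point i
    return 1 - press / np.sum((y - y.mean()) ** 2)
```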

Regression models can also be the subject of a significance test, in several ways. For a simple regression, this can take the form of a \(t\) test or an \(F\) test, both of which test the null hypothesis \({H}_{0}\!\!:\beta =0\), and which give identical \(p\) values. The details are in the Supplementary Information. As far as we know, this test has not been carried out before in the context of meta-comparative LCA. Although Zhang and Bakshi (2007) write that “statistical regression and hypothesis testing is used to determine whether a statistical correlation exists,” they do not report \(t\) or \(F\) or \(p\) values.

Like with the correlation coefficient, the appropriateness of a two-tailed test can be doubted. More fundamentally, if \(y\) is supposed to mimic \(x\), a test for a unit regression coefficient seems even more appropriate. Such a test could take the form:

$$H_0\!\!:\beta=1$$
(18)

which looks more like the unit root test of time series econometrics (Gujarati 2003; Hill et al. 2011).

Regression analyses (and correlations; see Sect. 2.6) are often supported by scatter plots, with one axis showing the \(x\) values and the other the \(y\) values. In many cases, the regression line (\(y=a+bx\)) is shown in addition. Examples can be found in, among others, Berger and Finkbeiner (2011) and Curzons et al. (2007). It is sometimes only the presence of such regression lines that makes clear that the authors indeed apply regression, while their paper uses the term correlation (see, for instance, Kaufman et al. 2010). The paper by Berger and Finkbeiner (2011) is further a good piece of evidence showing to what extent correlation and regression can be mixed up, using phrases like “strong linear regressions” in a paper which has just “correlation analysis” in the title.

2.8 Multivariate analyses

Correlation and simple regression are bivariate techniques, investigating the relationship between two variables, indicated here with \(x\) and \(y\). Several extensions have been employed for meta-comparative LCA.

Steinmann et al. (2017a) use multiple regression. Although they do not specify the precise details, we can make some educated guesses here. The regression model in this case is as follows:

$${y}_{i}=a+{b}_{1}{x}_{i1}+{b}_{2}{x}_{i2}+\cdots +{b}_{k}{x}_{ik}+{e}_{i}$$
(19)

with \(k\) slope coefficients \({b}_{1},{b}_{2},\dots ,{b}_{k}\). There exist standard matrix-based techniques to find the optimal values of these coefficients, as well as their standard errors. Multiple regression also yields \({R}^{2}\) values. The slope coefficients have no interpretation of a correlation, but their difference from \(0\) can still be tested with a \(t\) test. As these \(b\)-coefficients have different units and scales, they cannot be compared with each other. One way to allow for such comparisons is by transforming them into standardized regression coefficients (see Supplementary Information). Steinmann et al. (2017a) indeed use such standardized regression coefficients to express the relative importance of the different \(x\) variables in contributing to \(y\).
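Although Steinmann et al. (2017a) do not publish their code, a plain numpy sketch of a multiple regression with standardized coefficients (hypothetical data; the standardization multiplies each slope by \({s}_{{x}_{j}}/{s}_{y}\)) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((50, 3))                    # hypothetical: 50 products, k = 3 predictors
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 50)

A = np.column_stack([np.ones(len(y)), X])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = coef[0], coef[1:]                   # intercept and slopes, Eq. (19)
b_std = b * X.std(axis=0, ddof=1) / y.std(ddof=1)  # standardized coefficients
```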

Multiple regression is also used by Park et al. (2001) and Park and Seo (2003). These authors also report significance tests on the basis of the \(F\)-statistic (see Supplementary Information).

The multiple regression model requires that the \(x\) variables are independent. One way to test for dependence among the \(x\) variables is through variance inflation factors (VIFs; see Supplementary Information). All VIFs should be \(1\) for full independence, although values up to \(5\) can be argued to be still reasonable. Steinmann et al. (2017a) use the VIF to remove redundant \(x\) variables.
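A plain numpy sketch of the standard VIF definition, \(\mathrm{VIF}_{j}=1/(1-{R}_{j}^{2})\), where \({R}_{j}^{2}\) results from regressing \({x}_{j}\) on the remaining \(x\) variables (our code, not that of the cited study):

```python
import numpy as np

def vifs(X):
    # Variance inflation factors for the columns of X (n products, k variables).
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid.var() / X[:, j].var()  # R^2 of x_j on the other columns
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```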

Another approach to study mutual dependence and to control for redundancy is principal component analysis (PCA). The purpose of a PCA differs in an important way, as it does not predict a \(y\) from one or more \(x\) variables, but rather studies the degree to which different \(x\) variables provide added value. Examples of such studies include Le Téno (1999), Curzons et al. (2007), Gutiérrez et al. (2010a), Pozo et al. (2012), Steinmann et al. (2016), Lasvaux et al. (2016), and Balugani et al. (2021). Because their aim is not to compare but to reduce, we will exclude those studies from our analysis. However, we will discuss one aspect, because it resembles the previous discussions. The PCA technique proposes a rotated, orthogonal coordinate system, in which the first principal components (PCs) describe a large fraction of the variance. For instance, Steinmann et al. (2016) show a scree plot in which the first PC explains \(83.3\%\) of the variance, and the second PC adds another \(3.1\%\). Such numbers can be interpreted similarly to the \({R}^{2}\) of a regression, and may therefore seem to support the idea of a proxy indicator. However, the PCs are themselves weighted combinations of the original \(x\) variables, and therefore even a proxy by only one PC in general needs information from all \(x\) variables.
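The explained-variance fractions of a PCA can be read off from the singular values of the centered data matrix; a generic sketch with hypothetical data (some studies standardize the columns first, which we omit here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 6))                 # hypothetical: 100 products, 6 indicators
Xc = X - X.mean(axis=0)                  # column-centered data matrix
s = np.linalg.svd(Xc, compute_uv=False)  # singular values
explained = s**2 / np.sum(s**2)          # fraction of variance per principal component
```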

A useful distinction of ways of analysis has been made by Cattell (1952). For our purpose, we restrict the discussion to Q and R techniques (also: Q and R analyses; Legendre and Legendre 1998):

  • the Q technique addresses similarities between “objects” (products), for instance to find out which products are comparable; and

  • the R technique addresses similarities between “descriptors” (variables), for instance to reduce the number of impact categories.

The PCA studies mentioned are examples of R analyses. There are also a few meta-comparative LCA studies that use Q techniques. For instance, Gutiérrez et al. (2009) use multidimensional scaling (MDS), and Gutiérrez et al. (2010a) use cluster analysis to group similar products. We do not further discuss these Q techniques, because their aim falls outside the scope of this article.

Several more advanced variations on regression analysis have been used. We mention Birkved and Heijungs (2011), who use partial least squares regression (PLSR), which is a multivariate technique that is based on the combination of PCA and regression. We also mention Balugani et al. (2021), who use robust ordinal regression, a technique that focuses on ordinal rankings instead of the numerical values. Pascual-González et al. (2015) combine multiple regression and mixed integer linear programming (MILP). Eddy et al. (2015) apply kriging, which can also be regarded as a variation to regression. The advanced nature of these methods, combined with their only occasional use, forces us to keep these further undiscussed.

In the context of multiple regression, Steinmann et al. (2017a) use the Akaike information criterion (AIC) to assess the goodness-of-fit. AIC, like \({R}^{2}\), is a measure of the quality of the model, but it penalizes the use of an excessive number of \(x\) variables. In that respect, it resembles the more familiar adjusted \({R}^{2}\), \({R}_{\mathrm{adj}}^{2}\). Both \(\mathrm{AIC}\) and adjusted \({R}^{2}\) are described in the Supplementary Information. A difference is that \({R}^{2}\), and by extension \({R}_{\mathrm{adj}}^{2}\), has a stand-alone interpretation, while \(\mathrm{AIC}\) makes only sense in a comparison of regression models.

Pascual-González et al. (2016) investigate the correlation between multiple \(x\) variables, defining a “correlation index” as the relative number of variables correlated with a specific variable. We interpret this in our notation as follows:

$${I}_{l}=\frac{1}{k}\sum_{\begin{array}{c}j=1\\ j\ne l\end{array}}^{k}\Theta \left(0.001-p\left({r}_{jl}\right)\right)$$
(20)

where \(p\left({r}_{jl}\right)\) is the \(p\) value associated with the correlation coefficient of variables \({x}_{j}\) and \({x}_{l}\) and \(\Theta \left(x\right)\) is the Heaviside step function.

Finally, Kalbar et al. (2017) mention the use of partial correlation coefficients (see Supplementary Information). The result \({r}_{12\cdot 3}\) measures the correlation between variables \(1\) and \(2\), corrected for a confounding variable \(3\) that is correlated with both \(1\) and \(2\).

2.9 Machine learning techniques

Modern developments in machine learning and artificial intelligence have enriched the toolbox of predictions by computationally intensive techniques. Here we briefly describe a few approaches that have been used in the context of meta-comparative LCA, without providing the full details.

Several authors have used neural networks (also called artificial neural networks, ANN) to establish relationships between predictors and results. Marvuglia et al. (2015) do this for the relation between chemical properties and characterization factors, and Park and Seo (2006) use ANN to streamline the design of products using simple product characteristics, such as mass, percentage of plastics, and lifetime. A similar approach is taken by Sousa et al. (2000).

Wernet et al. (2008) and Park and Seo (2003) use both regression analysis and neural networks, and their studies can therefore be interpreted as meta-meta-comparative LCA.

Shariar Hossain et al. (2014) use several clustering techniques. Such analyses can be interpreted as Q-mode analysis (see previous section), and therefore can be seen as answering a different type of question.

Hou et al. (2020) also apply a number of ML techniques, ranging from neural networks to nearest-neighbor methods.

3 Critical discussion

The previous section gave a neutral overview of the different indicators that have been used for meta-comparative LCA. It is clear that there is a tremendous choice of methods: differences, regression, correlation, neural networks, \(t\) tests, and \(p\) values. Further choices, such as the use of logarithmic transformations, complicate the situation even more. In this section, we will add a few critical discussions.

3.1 Lack of detail, inconsistencies, and other issues

We start our critical section by pointing out that many of the cited papers are incomplete, unclear, or contain mistakes. This is a pity, because the approaches are often interesting, but they are described too unclearly to reproduce, and it is therefore difficult to come to a full appreciation or an adoption in software. Here we give a few examples, without claiming to be complete.

Pascual-González et al. (2016) show several figures with an “\(\Omega\)” on the axis, without defining its meaning in the text. They also use the bar-symbol (\(\overline{x }\) and \(\overline{y }\)) for the mean in their Eq. (1), while in their Eq. (4) the mean is \(\mu\). Dekker et al. (2020) use a “two-sided \(t\) test,” but do not specify details like paired vs. independent-samples, or equal vs. unequal variance.

If \(p\) values are reported, the null hypothesis is hardly ever mentioned. The paper by Röös et al. (2013) is one of the few that actually reports it, but many other papers just list \(p\) values. When \(p\) values are used to decide if something is “significant,” the significance level (\(\alpha\)) is often not mentioned. A proper use of null hypothesis tests further necessitates the distinction between population parameters (such as \(\rho\) and \(\beta\)) and sample statistics (such as \(r\) and \(b\)). Such refinements are almost completely lacking.

We already commented on the confused use of the terms “correlation” and “regression.” A simple regression analysis yields an \({R}^{2}\) and a Pearson correlation analysis an \(r\), which are trivially related. But for multiple regression and Spearman correlation, the situation becomes harder. Despite their similarities, the two types of analysis are fundamentally different. Correlation is about the moving together of two variables, without any assumption of causality or priority. Regression, by contrast, assumes that one variable (\(y\)) depends on another variable (\(x\)), implying a causal structure. Regression also assumes that the \(x\) variable is fixed and measured without error, while \(y\) has random error. Correlation assumes that \(x\) and \(y\) (or perhaps more appropriately written, \({x}_{1}\) and \({x}_{2}\)) are both random. The difference between correlation and regression has repercussions for their applicability. Comparisons of methods (like Dekker et al. 2020) may benefit from a correlation analysis, while for streamlining and proxy studies (like Huijbregts et al. 2006), regression is more appropriate. A clear definition of the approach used is therefore a requirement for a correct judgment of the quality of the studies.

3.2 Measures of difference

The (undocumented) choice by Dekker et al. (2020) for an independent samples \(t\) test instead of a paired \(t\) test is remarkable, because there is a natural pairing of an \(x\) value and a \(y\) value for every product \(i\). A paired \(t\) test first defines the following:

$${d}_{i}={x}_{i}-{y}_{i}$$
(21)

and then constructs the test statistic:

$$t=\frac{\overline{d}}{{s }_{d}/\sqrt{n}}$$
(22)

which is tested with \(df=n-1\) and yields in general a much smaller \(p\) value than an independent samples \(t\) test.
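In scipy, this paired version is directly available; a minimal sketch with hypothetical data:

```python
import numpy as np
from scipy.stats import ttest_rel

x = np.array([2.0, 5.0, 1.0, 8.0, 3.0])  # hypothetical scores, method 1
y = np.array([2.5, 4.0, 1.2, 7.0, 3.5])  # hypothetical scores, method 2

# Paired t test on the differences d_i = x_i - y_i, Eqs. (21)-(22), df = n - 1.
t_stat, p_value = ttest_rel(x, y)
```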

Both the independent samples \(t\) test and its paired version effectively result in a \(t\) value that scales with \(\sqrt{n}\), and as such can result in highly significant differences, even when these are small, given a large sample size \(n\). And because the sample size can be increased arbitrarily by including more products in the test set, such significant results in the end carry little meaning (Heijungs et al. 2016).

3.3 Measures of association

In our discussion of correlation, we found quite a few papers that report \(p\) values for correlation coefficients. Traditionally, a two-tailed test is used, but in fact it makes sense to consider applying a one-tailed test, because the hypothesis can be argued to be about a positive correlation:

$$H_0\!\!:\rho\leq0\;\textrm{versus}\;H_1\!\!:\rho>0$$
(23)

Note that the number of tails, and therefore the directionality of the null hypothesis, is not always mentioned by the cited articles.

Like with the \(t\) test for equality of means discussed, a test for zero correlation will tend to suggest a rejected null hypothesis when the sample size is large (Heijungs et al. 2016). The interpretation of such a rejected null hypothesis is that there is evidence of some relation between \(x\) and \(y\), but in general, a significant result does not at all imply that the relation is strong. In that respect, a better practice is the one by Huijbregts et al. (2006) and Bösch et al. (2007), who put the emphasis on high values of \({R}^{2}\), even though no significance test is performed. Alternatively, we might propose to test for a correlation of \(1\), instead of \(0\):

$$H_0\!\!:\rho =1$$
(24)

The tests mentioned so far are only applicable to test for a zero correlation. The Fisher transformation allows for a conversion of \(r\) into another variable, \({r}^{^{\prime}}\):

$${r}^{^{\prime}}=\frac{1}{2}\mathrm{ln}\left(\frac{1+r}{1-r}\right)$$
(25)

This transformation is applied to the observed correlation coefficient and to the hypothesized correlation, \({\rho }^{^{\prime}}\). However, the Fisher transform is undefined for \(\rho =1\), so this procedure is of no help.
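For completeness, a sketch (ours) of the Fisher-transformation \(z\) test for \(H_0\!\!:\rho ={\rho }_{0}\) with \({\rho }_{0}<1\), using the standard error \(1/\sqrt{n-3}\) of \({r}^{\prime}\); as noted, it breaks down for \({\rho }_{0}=1\):

```python
import numpy as np
from scipy.stats import norm

def fisher_test(r, rho0, n):
    # z test for H0: rho = rho0, via the Fisher transformation, Eq. (25).
    r_prime = np.arctanh(r)        # equals 0.5 * ln((1 + r) / (1 - r))
    rho0_prime = np.arctanh(rho0)  # undefined for rho0 = 1
    z = (r_prime - rho0_prime) * np.sqrt(n - 3)
    return 2 * norm.sf(abs(z))     # two-sided p value
```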

A much more useful test would be as follows:

$${H}_{0}\!\!:\beta =1$$
(26)

where \(\beta\) is the population value of the regression slope coefficient. Where in a simple regression, the quantity \({t}_{b}=\frac{b}{S{E}_{b}}\) is used to assess the null hypothesis \(H_0\!\!:\beta =0\), we use the following:

$$t=\frac{b-1}{S{E}_{b}}$$
(27)

to assess this modified null hypothesis. We have not identified any study that applies this hypothesis test.
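Since we found no study applying this test, the following is our own sketch of how it could be carried out, with hypothetical data:

```python
import numpy as np
from scipy.stats import linregress, t as t_dist

x = np.array([2.0, 5.0, 1.0, 8.0, 3.0])  # hypothetical scores, method 1
y = np.array([2.5, 4.0, 1.2, 7.0, 3.5])  # hypothetical scores, method 2

fit = linregress(x, y)
t_stat = (fit.slope - 1) / fit.stderr    # Eq. (27): test H0: beta = 1
df = len(x) - 2
p_value = 2 * t_dist.sf(abs(t_stat), df) # two-sided p value
```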

3.4 The use of statistical theory

Any data set can be used to compute a mean and a standard deviation, and any paired data set can be used to compute correlation coefficients and a regression line of best fit. Further analysis typically involves distribution theory, which poses several requirements:

  • the underlying data generating process must satisfy certain characteristics (e.g., it must be normally distributed), or the sample must be sufficiently large to allow for an asymptotic result (e.g., \(n\) must be larger than \(30\)); and

  • the analyzed sample must be a random sample.

If these conditions are not satisfied, several results (in particular standard errors, \(t\), \(F\) and \(p\) values, and results of significance tests) are not reliable.

Similar remarks can be made for other types of indicators that rely on distribution theory, including standard errors, confidence intervals, variance inflation factors, and the Akaike information criterion.

Heijungs (2017) commented on the unjustified use of distribution theory by Steinmann et al. (2017a), after which Steinmann et al. (2017b) analyzed their case in more detail. This remains, however, an exception, and the use of \(t\), \(F\), and \(p\) values in quasi-empirical meta-comparative LCA should be interpreted with caution.

3.5 Issues of scale and units

All quasi-empirical studies choose a certain unit of product as the basis of the \(x\) and \(y\) scores. For instance, Röös et al. (2013) calculate \(n=53\) sets of scores, each on the basis of 1 kg of product. Huijbregts et al. (2006) have a more mixed portfolio: these authors calculated results for \(226\) energy products per MJ, \(750\) materials per kg, etc. Such choices are pretty arbitrary. Because LCA results scale linearly with the amount of product, we would hope that the results of the meta-comparison are insensitive to the exact numerical choice. Would the results of Röös et al. (2013) change if they would choose \(100\) kg of product? And more subtly, would the results by Huijbregts et al. (2006) change if we would continue to use \(1\) kg for the materials, but switch to kWh for the energy products?

Clearly, there is no universal answer to this question. Some results will depend on a change of scale or units, but that does not mean that the final conclusion will change. A further complication is that a change of scale or unit will always affect the \(y\) variable (because it reflects the emission or impact per unit of product), but not always the \(x\) variable. For instance, in streamlining studies or comparisons, the \(x\) variable will also depend on the scale and unit. But for proxy studies, the situation may be different. Consider, for instance, the case of a regression model:

$$y=a+{b}_{1}{x}_{1}+{b}_{2}{x}_{2}+e$$
(28)

where \(y\) is the carbon footprint, \({x}_{1}\) is the mass of the product, and \({x}_{2}\) the lifetime. Here, \(y\) and \({x}_{1}\) are sensitive to changes of units and scale, but \({x}_{2}\) is not, and the precise way this affects \(a\), \({b}_{1}\), and \({b}_{2}\) is not a priori clear. To facilitate our analysis, we will focus on situations in which both \(x\) and \(y\) (or all \(x\) and \(y\) variables) depend on the scale and unit in the same way.

Suppose that, for some of the products, we change the LCA basis from \(1\) unit to \(k\) units. For instance, we change from \(1\) MJ to \(1\) GJ, so \(k=1000\); or from \(1\) MJ to \(1\) kWh, so \(k=3.6\). For these products, we find:

$$\left\{\begin{array}{c}{x}_{i}^{^{\prime}}=k{x}_{i}\\ {y}_{i}^{^{\prime}}=k{y}_{i}\end{array}\right.$$
(29)

For this subset of products, we then also find that the difference between \(x\) and \(y\) scales with \(k\):

$${d}_{i}^{^{\prime}}={x}_{i}^{^{\prime}}-{y}_{i}^{^{\prime}}=k{x}_{i}-k{y}_{i}=k\left({x}_{i}-{y}_{i}\right)=k{d}_{i}$$
(30)

but the relative difference is not affected:

$${\delta }_{i}^{^{\prime}}=\frac{{x}_{i}^{^{\prime}}-{y}_{i}^{^{\prime}}}{{x}_{i}^{^{\prime}}}={\delta }_{i}$$
(31)

The overall scores, such as RMSE and correlation and regression coefficients, are more complicated to analyze. But Fig. 1 gives an illustration of the effect of rescaling one data point by a factor of \(5\), keeping all other points at their original position. The effect can be quite large, because rescaling can create or annihilate outliers at will. The figure shows that a neatly behaving data point can become an outlier through an essentially arbitrary change of scale or unit. When we inflate this point by a factor of \(100\), \({R}^{2}\) even becomes \(0.9999\), suggesting an extremely good approximation. As a concrete example of such inflation, we point to a database that contains LCA data for potatoes (in tonnes) as well as for potato harvesters (in units). These two are perhaps comparable in terms of impacts. But if we would deflate the potatoes to the scale of kg or even g, the harvester would suddenly turn into an outlier.

Fig. 1

Six data points with \({R}^{2}=0.59\) (and \(r<0\)) (left) and the same data with one data point rescaled with \({x}^{^{\prime}}=5x\) and \({y}^{^{\prime}}=5y\), yielding \({R}^{2}=0.83\) (and \(r>0\))

3.6 The use of logarithms

Some authors use logarithmically transformed data, at least in part. In several cases, the graphs have logarithmic axes, but it is often not clear if the statistical analyses (correlation coefficients, etc.) are based on the raw data or on their logarithms. For instance, Bösch et al. (2007) write in the caption of their figures “logarithmic scales,” but they do not specify if their \({R}^{2}\) values are based on logarithmic scales as well. Similar remarks apply to Laurent et al. (2012) and Dekker et al. (2020). Huijbregts et al. (2008) are more explicit: they indicate that “the data… were log-transformed,” and they provide a regression equation which, in our notation, amounts to the following:

$$\widehat{y}={10}^{a}{x}^{b}$$
(32)

The use of logarithms can create further issues. We mention the following:

  • the base of the logarithms (\(10\), \(e\), etc.) is not always stated; and

  • the terminology can be confusing.

Strictly speaking, the use of logarithms requires a specification of the base. But the precise choice matters little, because for any \(b,x>0:\)

$${\mathrm{log}}_{b}x=\frac{\mathrm{ln}\left(x\right)}{\mathrm{ln}\left(b\right)}$$
(33)

so that a change of logarithmic base rescales all transformed values by a constant factor (\(\frac{1}{\mathrm{ln}\left(b\right)}\) when moving from natural to base-\(b\) logarithms; for instance, \(\frac{1}{\mathrm{ln}\left(10\right)}\approx 0.434\)), which has an interpretation similar to a change of unit.

For an example of confusing terminology, we refer to Huijbregts et al. (2006), who applied “log-linear regression analysis,” which might (see Hill et al. 2011) suggest a model of the type:

$$\mathrm{log}\left({y}_{i}\right)=a+b{x}_{i}+{e}_{i}$$
(34)

However, their Fig. 1 contains equations of the type:

$$\mathrm{log}\left({y}_{i}\right)=a+b\ \mathrm{log}\left({x}_{i}\right)+{e}_{i}$$
(35)

which might be regarded as representing a log–log regression; indeed, the horizontal and vertical axes of their figure are both logarithmic. Adding to the confusion, the caption of the figure mentions a “linear regression,” which might suggest:

$${y}_{i}=a+b{x}_{i}+{e}_{i}$$
(36)

Zhang and Bakshi (2007) are clearer in writing about a “linear regression of log transformed data,” and they moreover provide a formula of the form:

$$\log\left(y_i\right)=a+b\ \log\left(x_i\right)+e_i$$
(37)

But these authors show a mix of graphs: linear–linear (their Fig. 1), log–log (their Fig. 2), and linear–log (their Fig. 3), with the result that the details of their analysis remain confusing.
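To make the ambiguity concrete, the following minimal sketch (in Python, with synthetic data) fits the three candidate models of Eqs. (34)–(36) to the same data and obtains three different coefficient pairs:

import numpy as np

# Synthetic positive scores for two hypothetical methods
rng = np.random.default_rng(0)
x = rng.lognormal(0.0, 1.0, 50)
y = 2.0 * x**0.9 * rng.lognormal(0.0, 0.2, 50)

# Eq. (34), log-linear: log(y) = a + b*x
b34, a34 = np.polyfit(x, np.log10(y), 1)

# Eq. (35), log-log: log(y) = a + b*log(x)
b35, a35 = np.polyfit(np.log10(x), np.log10(y), 1)

# Eq. (36), linear: y = a + b*x
b36, a36 = np.polyfit(x, y, 1)

# Three different models, three different coefficient pairs: the label
# "log-linear regression" alone does not pin down which analysis was done.
print((a34, b34), (a35, b35), (a36, b36))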

Fig. 2

Six data points with \({R}^{2}=0.93\) on a normal scale (left) and the same data on a logarithmic scale with \({R}^{2}=0.63\)

Fig. 3

Plots of the comparison of ReCiPe 2008 (horizontally) and ReCiPe 2016 (vertically) endpoints. The angle of the solid line indicates the mean slope, and its length the mean resultant length. The dashed line indicates the \(y=x\) line. I = individualist perspective; H = hierarchist perspective; E = egalitarian perspective

Part of the confusion is inherent in the terminology. Gujarati (2003) defines a number of terms in this respect, including log-linear, log–log, double-log, semilog, log-lin, and lin-log. But because other texts (e.g., Hill et al. 2011) use deviating terms, words alone do not suffice, and explicitly specifying the relationship (as done by Zhang and Bakshi (2007) and Huijbregts et al. (2006)) seems imperative.

An important question is to what extent logarithms affect the results of the analysis. Obviously, logarithms can make a graph look more convincing. But they change some of the numerical indicators as well. In particular, outliers can make a large difference. Figure 2 gives an illustration of this phenomenon.

The reasons given for using logarithms also vary. Huijbregts et al. (2006) introduce a logarithm “to account for [the] skewed distributions,” and Steinmann et al. (2017a) do so “because the footprints varied up to 10 orders of magnitude.” Bösch et al. (2007) merely mention “logarithmic scales,” without giving a reason.

The use of logarithms is in any case problematic when negative or zero values of \(x\) or \(y\) occur. Laurent et al. (2012) mention this in their Supplementary Information and discard these data points. Most impact scores are presumably non-negative, but zeros certainly can occur, and negative values may show up as well, for instance as an artifact of allocation.

3.7 The intercept of a regression line

The default linear regression model is based on the equation:

$$\widehat{y}=a+bx$$
(38)

where \(a\) is the intercept and \(b\) the slope. This contradicts one of the basic principles of LCA, namely the proportionality of LCA results (emissions, impacts, etc.) to the quantity of product that is expressed by the functional unit. If the quantity of product is \(z\), it follows that the LCA result \(x\) is given by the following:

$$x=pz$$
(39)

and that another LCA result \(y\) is given by the following:

$$y=qz$$
(40)

where \(p\) and \(q\) are the per-unit impacts of the product on variable \(x\) and \(y\), respectively. As a consequence:

$$y=\frac{q}{p}x$$
(41)

which amounts to the following:

$$a=0\mathrm{\ and\ }b=\frac{q}{p}$$
(42)

Many quasi-empirical meta-comparative LCA studies use a sample of \(x\) and \(y\) results to estimate not only the slope \(b\), but also the intercept \(a\). For instance, Berger and Finkbeiner (2011) report \(y=1000000x+256.16\), and Huijbregts et al. (2010) find \(\mathrm{log}\left(\mathrm{EF}\right)=0.9\mathrm{log}\left(\mathrm{CED}\right)-0.6\) (notations slightly adapted).

For the linear case, the intercept should be \(0\). But a standard regression analysis estimates the intercept on the basis of the data. It is, however, possible to force the intercept to zero (see Supplementary Information). We have not identified regression studies that use a zero-intercept regression line for comparative LCA. That is remarkable, because there is near-unanimous agreement that LCA results scale proportionally (Heijungs 2020). The reason is probably that the default regression analysis includes the estimation of an intercept, and that turning off this feature requires a deliberate action.
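To illustrate the difference, here is a minimal sketch (in Python, with synthetic data); the closed-form zero-intercept estimator \(b={\sum }_{i}{x}_{i}{y}_{i}/{\sum }_{i}{x}_{i}^{2}\) is the standard regression-through-the-origin result:

import numpy as np

# Hypothetical x and y scores for a sample of products
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, 30)
y = 0.8 * x * rng.lognormal(0.0, 0.2, 30)   # proportional relation with noise

# Default regression: both intercept a and slope b are estimated
b_default, a_default = np.polyfit(x, y, 1)

# Zero-intercept regression: minimize sum((y - b*x)^2), giving
# b = sum(x*y) / sum(x^2)
b_zero = np.sum(x * y) / np.sum(x**2)

print(a_default, b_default, b_zero)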

For the logarithmic case, the situation is a bit more complicated. If \(y=\frac{q}{p}x\), we have the following:

$$\mathrm{log}\left(y\right)=\mathrm{log}\left(\frac{q}{p}x\right)=\mathrm{log}\left(\frac{q}{p}\right)+\mathrm{log}\left(x\right)$$
(43)

In a log–log regression with the following:

$$\mathrm{log}\left({y}_{i}\right)=a+b\ \mathrm{log}\left({x}_{i}\right)+{e}_{i}$$
(44)

this would mean that \(a\) is to be estimated while \(b=1\) is given. Again, we have not found this type of analysis in our sample of studies.
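A minimal sketch of such a fixed-slope fit (in Python, with synthetic data): with \(b=1\) fixed, least squares estimates \(a\) as the mean of \(\mathrm{log}\left({y}_{i}\right)-\mathrm{log}\left({x}_{i}\right)\), so that \({10}^{a}\) is the geometric mean of the ratios \({y}_{i}/{x}_{i}\):

import numpy as np

# Hypothetical positive x and y scores
rng = np.random.default_rng(0)
x = rng.lognormal(0.0, 1.0, 30)
y = 2.5 * x * rng.lognormal(0.0, 0.1, 30)

# Log-log regression with the slope fixed at b = 1: only a is estimated
a_hat = np.mean(np.log10(y) - np.log10(x))
print(10**a_hat)   # geometric mean of the ratios y/x, an estimate of q/p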

The above critique was based on simple regression, but it also holds for the case of multiple regression.

3.8 The least-squares principle

Although the regression model \(y=bx+e\) makes much more sense than the traditional \(y=a+bx+e\), the estimation of \(b\) is still problematic. The reason is that the usual procedures rely on a least-squares principle, minimizing the following:

$$\sum_{i=1}^{n}{e}_{i}^{2}=\sum_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$$
(45)

The fact that LCA results (\({x}_{i}\) and \({y}_{i}\)) can be moved to an arbitrary scale and unit creates a degree of arbitrariness in the optimal value of \(b\). As an example, we refer to Fig. 2, in which the left panel yields \(b=0.34\) and the right panel \(b=0.38\), the only difference being a shift of one data point by a factor of \(5\) in both \(x\) and \(y\). Points at the far end will tend to dominate the sum of squares, and there is no unique way to define the scales.
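This dominance can be made explicit for the zero-intercept estimator, which has the standard closed form:

$$\widehat{b}=\frac{\sum_{i=1}^{n}{x}_{i}{y}_{i}}{\sum_{i=1}^{n}{x}_{i}^{2}}$$

If a single data point \(j\) is rescaled by a factor \(k\) (so that \({x}_{j}^{^{\prime}}=k{x}_{j}\) and \({y}_{j}^{^{\prime}}=k{y}_{j}\)), its contribution to both sums is multiplied by \({k}^{2}\):

$$\widehat{b}^{^{\prime}}=\frac{\sum_{i\ne j}{x}_{i}{y}_{i}+{k}^{2}{x}_{j}{y}_{j}}{\sum_{i\ne j}{x}_{i}^{2}+{k}^{2}{x}_{j}^{2}}$$

so that, for large \(k\), \(\widehat{b}^{^{\prime}}\) approaches \({y}_{j}/{x}_{j}\): a single inflated data point effectively determines the slope on its own.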

4 A new approach

In this section, we introduce a novel approach for meta-comparative LCA. We also demonstrate its use on a real-world dataset.

4.1 Directional statistics

Summarizing the results so far, we postulate a proportional relationship between \(x\) and \(y\) of the form \(y=bx\), and we also acknowledge that a sampled product \(i\) with coordinates \(\left({x}_{i},{y}_{i}\right)\) might have been rescaled as \(\left(k{x}_{i},k{y}_{i}\right)\). For the first reason, a regression model of the form \(y=a+bx\) is inappropriate, as the intercept \(a\) must be \(0\). For the second reason, a least-squares regression is inappropriate, as the sum of squares \(\sum_{i=1}^{n}{\left({\widehat{y}}_{i}-{y}_{i}\right)}^{2}\) depends on the arbitrary rescaling of individual data points.

To overcome these problems, we propose an entirely different approach. The relation \(y=bx\) with scalable \(x\) and \(y\) can be rewritten in a scale-independent form as follows:

$$b=\frac{y}{x}$$
(46)

Given a sample of data points \(\left({x}_{i},{y}_{i}\right)\), we then analyze the sample of slope coefficients:

$${b}_{i}=\frac{{y}_{i}}{{x}_{i}}$$
(47)

Changes of scale and unit of individual data points do not affect such ratios, because, trivially,

$${b}_{i}^{^{\prime}}=\frac{k{y}_{i}}{k{x}_{i}}=\frac{{y}_{i}}{{x}_{i}}={b}_{i}$$
(48)


The main question, then, is how to average a sample of \({b}_{i}\) values. For this, we conceive of these values as representing an angle \({\theta }_{i}\), given by the following:

$${\theta }_{i}=\mathrm{arctan}\left({b}_{i}\right)$$
(49)

and turn to the field that is alternatively called directional statistics (Mardia and Jupp 2000; Ley and Verdebout 2017) and circular statistics (Batschelet 1981; Jammalamadaka and SenGupta 2001; Pewsey et al. 2013). For an accessible summary review, see Lee (2010). In the Supplementary Information, we review the basic concepts of directional statistics, and focus below on its application to meta-comparative LCA.
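As an illustration of the basic construction (the precise estimators we use are given in the Supplementary Information), a minimal sketch in Python with synthetic data could look as follows:

import numpy as np

def mean_direction(x, y):
    # Map each data point to an angle theta_i = arctan(y_i / x_i) and average
    # the angles via the mean resultant vector, the standard construction in
    # directional (circular) statistics.
    theta = np.arctan2(y, x)           # equals arctan(y/x) for x > 0
    C, S = np.mean(np.cos(theta)), np.mean(np.sin(theta))
    theta_bar = np.arctan2(S, C)       # mean direction
    R_bar = np.hypot(C, S)             # mean resultant length (concentration)
    return theta_bar, R_bar

# Hypothetical scores of a few products under two methods
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.8, 3.3, 4.1])
theta_bar, R_bar = mean_direction(x, y)
print(np.tan(theta_bar), R_bar)        # average slope and its concentration

Note that rescaling any single point \(\left({x}_{i},{y}_{i}\right)\) to \(\left(k{x}_{i},k{y}_{i}\right)\) with \(k>0\) leaves all angles, and hence the mean direction, unchanged.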

4.2 Example application

We reprocessed the dataset that was used by Dekker et al. (2020), which consists of the scores of \(n=154\) food products on different impact categories according to ReCiPe 2008 (\(x\)) and ReCiPe 2016 (\(y\)). The resulting directional statistics for the two endpoint indicators, human health and ecosystems, for the three perspectives (individualist, hierarchist, and egalitarian) are shown in Table 3.

Table 3 Results of using directional statistics on a test set of \(154\) food products (see also Fig. 3)

Figure 3 shows how the data points are concentrated or dispersed over the unit circle. It also includes several of the descriptive statistics of Table 3.

These results should be compared with the (logarithmic) regressions by Dekker et al. (2020) (their Fig. 1). For instance, for human health, hierarchist perspective, Dekker et al. (2020) reported “no significant differences,” which we can understand to mean that the dashed line is close to the solid line. For the individualist perspective, the 2016 data (\(y\)) were “significantly smaller” than the 2008 data (\(x\)), which is confirmed by a solid line that is much flatter than the dashed diagonal. But the figures reveal much more, because the dispersion of data points over the unit circle is in some cases (e.g., human health, individualist) much larger than in other cases (e.g., ecosystems, individualist). Let us take an in-between case: human health, hierarchist. The \({R}^{2}\) of a linear regression is \(0.98\); for a logarithmic regression, it is \(0.97\) (see Fig. 4). However, the directional plot of Fig. 3 shows a much more diverse picture, with a much larger variation than both \({R}^{2}\) values suggest, and, in the logarithmic case, a much larger deviation between the solid and the dashed line.

Fig. 4

Linear and logarithmic plots of the human health, hierarchist comparison. The solid line is the regression line, the dashed line the \(y=x\) line

4.3 Applicability and extensions

Directional statistics offers a method for meta-comparative LCA that is closer to the principles of LCA, in particular the arbitrary size, scale, and unit of the functional unit. It is also insensitive to the huge range of variation that is seen when we analyze a large number of very different products. But it is not a panacea for all problems in meta-comparative LCA.

In Table 1, we discerned five purposes:

  • streamlining;

  • proxy;

  • reduction;

  • comparison; and

  • sensitivity.

Some of these will, we believe, benefit from the use of directional statistics, while for others, the robustness of rankings, using for instance Spearman’s or Kendall’s correlation coefficient, will be more suitable. In Table 4, we present our ideas in this respect. We emphasize that these ideas are preliminary and sometimes speculative. The field of meta-comparative LCA is, from a methodological side, still underexplored, and the present article should be considered a first step.

Table 4 Proposed differentiated use of statistical techniques per purpose

5 Conclusion

We have seen that using regression analysis, either linear or logarithmic, is incompatible with a basic axiom of LCA (namely that the impact of \(k\) units of product is equal to \(k\) times the impact of \(1\) unit of product), and that it is vulnerable to arbitrary choices (namely of the unit and scale of the training set). In analyzing the cause of these problems, we have seen that fitting a best line through a number of data points introduces an unwanted dependence on scale and unit. By moving from a regression line to directional statistics, these deficits are resolved.

In other words, we have found a powerful recipe to compare methods for LCA on the basis of a quasi-empirical sample of data (a condensed code sketch follows the list):

  • find for every product \(i\) the score on both methods (\({x}_{i}\) and \({y}_{i}\));

  • construct the average direction (\(\mathrm{tan}\left(\overline{\theta }\right)\)) according to the formulas in the Supplementary Information (based on directional statistics);

  • for comparisons and streamlining: assess if \(\mathrm{tan}\left(\overline{\theta }\right)\) is close enough to \(1\); and

  • for proxies and streamlining, use \(\widehat{y}=\mathrm{tan}\left(\overline{\theta }\right)x\) to predict the \(y\) score from the \(x\) score.
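Condensed into code, the recipe could look as follows (a minimal sketch with hypothetical scores; the \(5\%\) tolerance in the third step is our illustrative choice, not a prescription):

import numpy as np

# Step 1: the scores of every product i under both methods (hypothetical)
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # method 1
y = np.array([0.6, 0.9, 2.2, 4.4, 7.5])   # method 2

# Step 2: the average direction tan(theta_bar) via the mean resultant vector
theta = np.arctan2(y, x)
theta_bar = np.arctan2(np.mean(np.sin(theta)), np.mean(np.cos(theta)))
slope = np.tan(theta_bar)

# Step 3 (comparison/streamlining): is the average slope close enough to 1?
print(abs(slope - 1.0) < 0.05)

# Step 4 (proxy/streamlining): predict the y score from the x score
y_hat = slope * x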

A classical regression model returns, besides the estimates of the coefficients, supplementary statistics, such as the standard error of the estimates, \({R}^{2},\) and the AIC. For some of these, analogous concepts have been developed in the theory of directional statistics (Mardia and Jupp 2000; Jammalamadaka and SenGupta 2001). However, as discussed by Heijungs (2017), such statistics should be used with care, because quasi-empirical comparisons in LCA are typically not based on random samples.

In a completely different context, namely the comparison of methods for clinical measurements, others have observed that “the correct statistical approach is not obvious” and that popular methods like correlation and regression are inappropriate (Bland and Altman 2010). In fact, that critique dates back to the early 1980s, when Altman and Bland (1983) described the comparison of means, correlation, and regression as “incorrect methods of analysis.” Because their topic markedly differs from ours, we cannot blindly copy the recommendations of these authors, but clearly, the comparison of methods has a longer history than LCA alone.

The subject of meta-comparative LCA is important, as is shown by our list of almost 100 articles. But the methodology of meta-comparative LCA is underexplored and deserves a more thorough investigation than a single article can offer.