Introduction

Cause-effect testing (CT) is vital in rendering an effective measurement for policy making, strategy adoption, hypothetical decision, etc. (Chen 2020). Usually researchers adopt similar or resemblance indexes (Johnston 1976; Nipperess et al. 2010; Clarke et al. 2006), distance functions (Chen 2021a, b), or statistical methods: regressions, covariance, correlated coefficients and so on, for analysing the relationship between variables. These measures normally are symmetric and assumption dependent. On the whole, they are hard to reveal the causes and effects between variables and they require some assumptions regarding the distribution of data. Though they could solve many related problems, to tackle cause-effect problems, these methods are lacking a persuasive extended method and approach. The motivation of this research is to provide an intuitive approach that could effectively identify the cause-effect relations between variables. The methodology is detailed in Section 2.

Our CT method is useful and informative in revealing the cause-effect relations between environmental variables. We locate various environmental variables that are related to waste producing and its consequences — including water and air pollution, and effect on biodiversity. The variables are divided into three categories: sectoral economic activities, water and air pollutants, and biodiversity. For agriculture, the wastes come from various sources: dairy manure, chicken manure, and all sorts of fertilizers. These wastes would affect the water and soil quality (Warman 1986). Construction activities, such as construction of buildings, bridges, infrastructures, etc, shall also produce some environmental wastes (Zunjarrao 2018). Electricity sector also produces wastes due to the unwanted by-products of electricity generators. Manufacturing also produces wastes, such as toxic chemicals. Mining and quarrying activities also produce solid wastes that shall affect the environment (Basommi et al. 2016; Liakopoulos et al. 2010). Above-mentioned activities might change water and air quality, which in turn might change the biodiversity: birds (Dutta 2017; Glaser et al. 1969), fishes, etc. (Palomares et al. 2020), and eutrophication (Malone1 and Newton 2020).

Based on our method, we analyse various environmental variables whose data are fetched from the European Union. In Section 3, we specify these data and environmental variables. The implemented results are shown and summarised in Section 4. These results identify the causes and effects between the variables. It should facilitate the policy makers or other researchers to focus on the causal relations. The main characteristics regarding our method are it is much intuitive and persuasive in revealing the cause-effect relations and it depends only on non-parametric statistic and real data, which provides some advantages over other typical approaches. The details are described in Section 5.

Methodology

We call a vector whose elements consist of a series of non-overlapped consecutive natural numbers starting from 1, a ranked vector. For example, (3, 5, 1, 2, 4) is a ranked vector. Given any two (annual) ranked vectors \(\mathbf {A}^*=(a_1,a_2,\cdots , a_t)\) and \(\mathbf {B}^*=(b_1,b_2,\cdots , b_t)\), we would like to produce a decision procedure that could decide whether \(\mathbf {A}^*\)-related variable is the cause of \(\mathbf {B}^*\)-related variable. This set of procedures will utilise non-parametric ranking testing techniques. Let \(P_{\mathbf {B}^*}(n)\) denote the position index h such that \(b_h=n\). Let \(P_{\mathbf {B}^*}(1:t)\equiv (P_{\mathbf {B}^*}(1), P_{\mathbf {B}^*}(2),\cdots ,P_{\mathbf {B}^*}(t),)\) denote the position vector of \((1,2,\cdots ,t)\) in \(\mathbf {B}^*\).

Taking the time scale, i.e. \(\{1,2,\cdots , t\}\) into consideration, we could specify several characteristics of cause and effect: \(\mathbf {A}^*\) to \(\mathbf {B}^*\), as follows:

  1. 1.

    (indication for positivity) k precedes \(P_{\mathbf {B}^*}(a_k)\) is a witness that a positive cause and effect is likely to exist;

  2. 2.

    (strength of positivity) If k precedes \(P_{\mathbf {B}^*}(a_k)\) in a shorter span of time, the positive strength of the cause-effect relation is higher;

  3. 3.

    (indication for negativity) If \(P_{\mathbf {B}^*}(a_k)\) precedes k, then it is a witness that a negative cause and effect is likely to exist;

  4. 4.

    (strength of negativity) If \(P_{\mathbf {B}^*}(a_k)\) precedes k in a shorter span of time, the negative strength of the cause-effect relation is higher;

  5. 5.

    (additivity) The individual strength of the cause-effect relations shall decide the overall strength of cause-effect relation.

In order to implement these characteristics, we come up with the following definitions. The first four characteristics are captured by the following function \(\sigma\).

Definition 2.1

Define

$$\begin{aligned} \sigma (a_k):=sign(P_{\mathbf {B}^*}(a_k)-k)*\frac{1}{2^{|P_{\mathbf {B}^*}(a_k)-k|}}, \end{aligned}$$

where \(sign(v)=1\), if \(v\ge 0\) and \(sign(v)=-1\), otherwise.

If \(sign(P_{\mathbf {B}^*}(a_k)-k)\ge 0\), then it is an indication for positivity; if \(sign(P_{\mathbf {B}^*}(a_k)-k)< 0\), then it is an indication for negativity. The strength of the positivity and negativity is calculated by the factor \(\frac{1}{2^{|P_{\mathbf {B}^*}(a_k)-k|}}\). The higher the length \(|P_{\mathbf {B}^*}(a_k)-k|\) is, the lower the value \(\frac{1}{2^{|P_{\mathbf {B}^*}(a_k)-k|}}\) is. Moreover, the last characteristic: additivity, is calculated by the following SCE function.

Definition 2.2

(Strength of cause and effect, SCE) Define

$$\begin{aligned} SCE(\mathbf {A}^*,\mathbf {B}^*):=\frac{\sum\limits _{i=1}^{t}\sigma (a_i)}{t}. \end{aligned}$$

\(SCE(\mathbf {A}^*,\mathbf {B}^*)\) reflects the (average) strength of A being the cause and B being the effect. Let \(sign(\mathbf {v})\) denote the vector \((sign(v_1),sign(v_2),\cdots ,sign(v_n)), |\mathbf {v}|=(|v_1|,|v_2|,\cdots ,|v_n|)\), and \(2^{\mathbf {v}}\) denote the vector \((2^{v_1},2^{v_2},\cdots , 2^{v_n})\), where \(\mathbf {v}=(v_1,v_2,\cdots ,v_n)\). Then we have a succinct form that

$$\begin{aligned} SCE(\mathbf {A}^*,\mathbf {B}^*)=\frac{1}{t}\cdot sign([P_{\mathbf {B}^*}(1:t)-P_{\mathbf {A}^*}(1:t)])\bullet 2^{-[P_{\mathbf {B}^*}(1:t)-P_{\mathbf {A}^*}(1:t)]}. \end{aligned}$$

This former part measures the positive and negative displacement between rank vector \(\mathbf {A}^*\) and \(\mathbf {B}^*\), while the latter part measures the magnitudes of the displacement.

Implementing procedures In this article, we shall treat a ranked vector as a time series and the elements (ordinal numbers) as the comparative magnitudes between the variables. Let two numerical time series \(\mathbf {X}=(x_1,x_2,\cdots ,x_t), \mathbf {Y}=(y_1,y_2,\cdots ,y_t)\in \mathbb {R}^t\), which are related to variable (or factor) X and variable (or factor) Y respectively, be arbitrary, where no two elements are identical within a time series. Let Perm(1 : t) denote all the permutations (in the form of vectors) of \((1,2,\cdots , t)\).

  1. 1.

    Convert time series \(\mathbf {X}\) of factor (or variable) X into a ranked vector \(\mathbf {X^*}\); and convert time series \(\mathbf {Y}\) of factor (or variable) Y into a ranked vector \(\mathbf {Y^*}\). A simple way for such conversion is the equipped orders for \(\mathbf {X}\) and \(\mathbf {Y}\); other complex way for such rank conversion could also be done via some assigned function or weights, depending on the purpose;

  2. 2.

    Compute \(SCE(\mathbf {X^*},\mathbf {Y^*})\), which is the empirical strength of cause and effect;

  3. 3.

    Specify the statistic SCE. The statistic is exactly the one described in Definition 2.2 and the sample space S is the set of all the rank product set, i.e. the product set of all the permutations of \((1,2,\cdots , t)\), or \(S=Perm(1:t)\times Perm(1:t)\);

  4. 4.

    Construct the non-parametric distribution for the chosen statistic SCE, i.e. one computes the set \(\{SCE(\mathbf {H}^*,\mathbf {K}^*):(\mathbf {H}^*,\mathbf {K}^*)\in S\}\), tabulates the results and converts the results into a probability density function. Name this distribution PDF(t);

  5. 5.

    Specify the critical level \(\alpha\) and their corresponding interpretations. Compute the critical value \(cv_{\alpha }\) of PDF(t) for the critical level \(\alpha\):

  6. 6.

    Conduct the cause-effect testing: (cause) X and (effect) Y. The null hypothesis (\(H_0\)) is “factor X does not causes factor Y”, and the alternative hypothesis (\(H_1\)): “factor X causes factor Y”. If \(SCE(\mathbf {X}^*,\mathbf {Y}^*)\ge cv_{\alpha }\), then one shall reject \(H_0\) and accept that factor X causes factor Y. If \(SCE(\mathbf {X}^*,\mathbf {Y}^*)< cv_{\alpha }\), then one shall accept that factor X does not cause factor Y.

Example 1

Suppose one is interested in specifying the cause (factor A) and effect (factor B). He collects the time series data of A, or \(\mathbf {A}=(29,-8,-18,4,12)\); similarly, the time series data of B, or \(\mathbf {B}=(-1.7\%,1.6\%,4\%,2.1\%,0.3\%)\). Then by its natural inequality ordering on real numbers, one has

\(\mathbf {A}^*=(a_1,a_2,a_3,a_4,a_5)=(5,2,1,3,4)\) and \(\mathbf {B}^*=(b_1,b_2,b_3,b_4,b_5)=(1,3,5,4,2)\). Hence \(t=5,\sigma (a_1)=sign(3-1)*\frac{1}{2^{3-1}}=\frac{1}{4}\); similarly, \(\sigma (a_2)=\frac{1}{8}\), \(\sigma (a_3)=-\frac{1}{4},\sigma (a_4)=-\frac{1}{4},\sigma (a_5)=-\frac{1}{2}\) and \(SCE(\mathbf {A}^*,\mathbf {B}^*)=-\frac{5}{8}\cdot \frac{1}{5}=-\frac{1}{8}\). By the same token, one could compute that \(\sigma (b_1)=\frac{1}{4}, \sigma (b_2)=\frac{1}{4},\sigma (b_3)=-\frac{1}{4},\sigma (b_4)=\frac{1}{2},\sigma (b_5)=-\frac{1}{8}\) and thus \(SCE(\mathbf {B}^*,\mathbf {A}^*)=\frac{5}{8}\cdot \frac{1}{5}=\frac{1}{8}\). Obviously, in this case, neither of them could be qualified as cause or effect, even though \(SCE(\mathbf {B}^*,\mathbf {A}^*)\) is marginally positive. A much more systematic approach for settling the argument could be achieved via statistical inference as exemplified in the following.

Example 2

Let us continue the results in Example 1 and perform the statistical inference as described from Step 3 to Step 6 of the above-mentioned implementing procedures. The sample space \(S=Perm(1:5)\times Perm(1:5)\), in which \(|S|=5!^2=14400\). For example the rank vectors \((3,1,2,5,4),(4,3,5,2,1)\in S\). Now compute the set \(\{SCE(\mathbf {H}^*,\mathbf {K}^*):(\mathbf {H}^*,\mathbf {K}^*)\in S\}\), tabulate it and convert it into a probability density function PDF(5) as shown in Fig. 1.

Now choose the critical level \(\alpha =0.05\) and find the critical value \(cv_{0.05}=0.6\), which is larger than \(SCE(\mathbf {A}^*,\mathbf {B}^*)=-\frac{1}{8}\). Henceforth, statistically speaking, factor A could not serve as a cause for factor B.

Fig. 1
figure 1

PDF(5) for SCE: This probability density function is the distribution of \(\{SCE(\mathbf {H}^*,\mathbf {K}^*):(\mathbf {H}^*,\mathbf {K}^*)\in Perm(1:5)\times Perm(1:5) \}\)

Data, variables and procedures

There are mainly three types of environmental variables (according to the data availability) in this research:

  1. 1.

    Waste produced by sectoral activities (Eurostat 2022a, b):

    • (V1) Waste produced by Agriculture, forestry and fishing sector: in tonne;

    • (V2) Waste produced by Construction sector: in tonne;

    • (V3) Waste produced by Electricity, gas, steam and air conditioning supply sector: in tonne;

    • (V4) Waste produced by Manufacturing sector: in tonne;

    • (V5) Waste produced by Mining and quarrying sector: in tonne;

    • (V6) Waste produced by Wholesale of waste and scrap: in tonne.

  2. 2.

    Water and air pollutants:

    • (U1) Biochemical oxygen demand (BOD) in rivers: in milligrammes of oxygen per litre of water (Eurostat 2022c, d);

    • (U2) Exposure to air pollution by particulate matter less than 10 µm (Eurostat 2022e, f);

    • (U3) Marine area affected by eutrophication: in km (Eurostat 2022g, h);

    • (U4) Marine area affected by eutrophication: in percentage;

    • (U5) Phosphate in rivers: in milligrammes per litre (Eurostat 2022i, j).

  3. 3.

    Biodiversity:

    • (W1) Common bird index by type of species — EU aggregate: with index Year 2000=100 (Eurostat 2022l, m);

    • (W2) Common bird index by type of species — EU aggregate (Common farmland species): with index Year 2000=100;

    • (W3) Common bird index by type of species — EU aggregate (Common forest species): with index Year 2000=100;

    • (W4) Fish stock biomass: with index Year 2003=100 (Eurostat 2022n, o);

    • (W5) Grassland butterfly index — EU aggregate: with index Year 2000=100 (Eurostat 2022p).

The source data is the aggregate data of the European Union. Due to the availability of the data and for our analytical convenience, we curtail some variables’ data — the unified common time series lie between 2008, 2010, 2012, 2014, 2016, and 2018. In some sense, the collective data (refined and shown in Appendix Table 4) represent the real situations before the onset of COVID-19.

Results and analyses

In this section, we present the results run mostly by R program codes. The. refined raw data (due to the common availability of data) regarding the variables V1–V6, U1–U5 and W1–W5 are listed in Appendix Table 4. Since the sample size (biennial data from 2008 to 2018) is small, it is suitable for non-parametric statistical analysis. The non-parametric statistics takes the samples per se to construct their sample space, which plays a role such as population for the parametric statistics. The time series data of these variables are plotted in Figs. 2, 3 and 4.

Fig. 2
figure 2

sectoral waste for various sectoral economic activities

Fig. 3
figure 3

European water and air pollutants

Fig. 4
figure 4

European biodiversity

Now we could implement the procedures described in Section 2. First of all, we convert the refined source data into rank vectors by the natural ordering on real numbers. The results are presented in Table 1. Then we compute the empirical strength of cause and effect for three sets of cause-effect relations:

  1. 1.

    (cause) V1–V6 and (effect) U1–U5;

  2. 2.

    (cause) V1–V6 and (effect) W1–W5;

  3. 3.

    (cause) U1–U5 and (effect) W1–W5.

The calculated results are presented in the top and bottom blocks of Table 2.

Table 1 Ranked vectors conversion for refined source data: In correspondence to the 6 points of time (2008, 2010, 2012, 2014, 2016, and 2018), we arrange 6 ranks, which are decided by the natural ordering on real numbers, for the target variables V1–V6, U1–U5 and W1–W5. These ranks preserve their original natural ordering on real numbers
Table 2 Empirical strength of cause and effect: The top block shows the SCE for V1–V6 (cause) and U1–U5, W1–W5 (effect); the bottom block shows the SCE for U1–U5 (cause) and W1–W5 (effect)

In order to conduct CT, we need to specify the critical levels (cl) and compute their critical values (cv) for both left-tail and right-tail under PDF(6) and PDF(5). The results are shown in Table 3 (Fig. 5).

Fig. 5
figure 5

PDF(6) for SCE: This probability density function is the distribution of \(\{SCE(\mathbf {H}^*,\mathbf {K}^*):(\mathbf {H}^*,\mathbf {K}^*)\in Perm(1:6)\times Perm(1:6)\}\)

Table 3 Various critical levels and critical values: The upper half is designated for critical levels (cl) and critical values (cv) under PDF(6), while the lower part is designated for critical levels and values under PDF(5). L and R denote the left-tail and right-tail respectively

We could summarise the results as follows:

  1. 1.

    By comparing Tables 2, 3 and the provided critical values, one could find that variable U is much more qualified than variable V for being the cause of variable W, i.e. water and air pollution has a more direct effect on shaping the European biodiversity, in particular birds and butterflies.

  2. 2.

    By inspecting Tables 2 and 3, V2 is a statistically reasonable cause for effect U3 and U4, i.e. waste produced by construction sector is a cause for marine area affected by eutrophication.

  3. 3.

    By inspecting Tables 2 and 3, V5 causes insignificant effect to the environment, i.e. waste produced by mining and quarrying sector has the least effect on the environment among all the studied sectors;

  4. 4.

    By inspecting Tables 2 and 3, by and large V3, V5 and V6 are insignificant in causing environmental transformation, while V1 and V2 play some roles in contributing the environmental changes;

  5. 5.

    By inspecting Tables 2 and 3, U1 has an extremely direct effect on W2 and W5, i.e. BOD is the extremely direct cause for the biodiversity of European farmland birds and grassland butterflies; meanwhile U2 and U5 are also statistically significant causes for the biodiversity of the two species, i.e. air pollution and phosphate in rivers also play vital roles for the birds’ and butterflies’ biodiversity;

  6. 6.

    By inspecting Tables 2 and 3, only U3 and U4 are statistically remote cause for W3 and W4, i.e. marine area affected by eutrophication is, comparatively speaking, likely a cause for the biodiversity of European forest birds and fishes.

Conclusion

In this research, we initiate a non-parametric cause-effect testing method for various potential independent and dependent environmental variables, which are instanced by environmental data from the European Union. This study identifies the statistically significant causes for water and air pollution or biodiversity. The results are summarised in the previous section. These analysed results should provide much insightful knowledge for the policy makers to single out the causes and effects. The results also indicate that the quality of the water directly causes the change of the European biodiversity.

Now let us specify some properties of this CT method:

  • The method is data unit free, namely, all the data are converted into ranks — there is no need to consider scaling problem among different batches of data. This property ensures its flexibility in data analysis and its tolerance of different data types; even the signs of the data are also irrelevant in this method, since they are converted into ranks;

  • We devise a statistic for measuring the strength of cause and effect between variables. This statistic serves the cornerstone of our statistical justification for identifying the causes and effects between environmental variables;

  • Unlike similarity indexes or distance functions, this CT method takes cause and effect directly into consideration — it is very straightforward and the result is much more persuasive than other indirect inferences;

  • Unlike typical statistical regression, t statistic-based testing or F statistic-based testing, there is no explicit assumption needed for this CT method regarding the statical inferences.

Nonetheless, some improvements or notices for this research are worth mentioning:

  • Due to the compatibility and availability of the European environmental data and the capacity of our computer resources, we only sample biennial data from years 2008 to 2018. If one wants to have a much comprehensive result, he shall also resort to much larger data set, even including the data after year 2019;

  • In order not to complicate the method per se, we randomly assign the ranks between the tied values (though they rarely occur in a larger sample). To alleviate such restriction, one could easily extend the whole method by involving larger sample space for the SCE statistic;

  • For convenience and data availability, we only consider the aggregate data, i.e. the European Union. In order to further pin down the locations that bear stronger cause and effect, one shall further involve and analyse the national data.