Testing tests before testing data: an untold tale of compound events and binary dependence

In any statistical investigation, we deal with the application of probability theory to real problems, and the conclusions are inferences based on observations. To obtain plausible inferences, statistical analysis requires careful understanding of the underlying probabilistic model, which constrains the extraction and interpretation of information from observational data, and must be preliminarily checked under controlled conditions. However, these very first principles of statistical analysis are often neglected in favor of superficial and automatic application of increasingly available ready-to-use software, which might result in misleading conclusions, confusing the effect of model constraints with meaningful properties of the process of interest. To illustrate the consequences of this approach, we consider the emerging research area of so-called 'compound events', defined as a combination of multiple drivers and/or hazards that contribute to hydro-climatological risk. In particular, we perform an independent validation analysis of a statistical testing procedure applied to binary series describing the joint occurrence of hydro-climatological events or extreme values, which is claimed to be superior to classical analysis based on the Pearson correlation coefficient. To this end, we propose a theoretically grounded model relying on the Pearson correlation coefficient and marginal rates of occurrence, which enables accurate reproduction of the observed joint behavior of binary series, and offers a sound simulation tool useful for informing risk assessment procedures. Our discussion of compound events highlights the dangers of renaming known topics, using imprecise definitions, and overlooking or misusing existing statistical methods. On the other hand, our model-based approach shows that consistent statistical analyses should rely on informed stochastic modeling in order to avoid the proposal of flawed methods and the untimely dismissal of well-devised theories.


Introduction
Like every human activity and aspect of life, science exhibits cyclical dynamics that make topics and theories alternately fashionable or neglected (e.g., Landauer 1997). Sometimes existing topics are revitalized by changing their name and/or slightly modifying definitions and context. Often, this generates confusion and leads to the dismissal or neglect of part of the literature and results concerning those topics or strictly related subjects. Here, we consider the case of the so-called 'compound events' and their probabilistic and statistical treatment as an example of this phenomenon. Broadly speaking, compound events are sets of single events that are gathered and analyzed together for some reason. For example, analyzing the combination of multiple factors such as direct surface runoff, river discharge, storm surge, and waves can improve flood risk assessment in coastal regions (e.g. Camus et al. 2021).
In the context of the study of compound events, De Michele et al. (2020) proposed a procedure to assess the mutual dependence of pairs of binary time series describing the rainfall occurrence (wet/dry conditions) observed at two rain gauge stations. This testing procedure aims to overcome the possible limitations of classical analysis based on Pearson correlation coefficient, and it attempts to provide a more detailed description of the dependence between pairs of binary time series. Hereinafter, we define such a test as JOP (joint occurrence probability) test.
In the spirit of seeking theoretical consistency in statistical analyses (Lombardo et al. 2012, 2014, 2019; Serinaldi and Kilsby 2016a; Serinaldi et al. 2015, 2020a), this study discusses the results of a so-called neutral (independent) validation/falsification analysis (see e.g. Popper 1959; Boulesteix et al. 2018, and references therein) to independently check the theoretical properties and interpretation of the JOP test. We also discuss the supposed limitations of classical Pearson correlation analysis for binary processes that led to the introduction of the JOP procedure. Our inquiry relies on theoretical arguments, Monte Carlo simulations, and re-analysis of rainfall data previously studied by De Michele et al. (2020). We also propose a procedure to simulate binary pairs with negative correlation.
This study is organized as follows. In Sect. 2, we discuss the effects of introducing new research topics based on ambiguous definitions, with an emphasis on 'compound events'. Section 3 summarizes the JOP algorithm, introducing the concepts and symbols required for a self-contained discussion. We also present and motivate our alternative model-based approach relying on Pearson correlation. In Sect. 4, we set up and perform Monte Carlo simulations to assess the effective significance and power of the JOP procedure, accounting for the interaction of serial and cross correlation as well as cyclical patterns of the rate of occurrence (seasonality). A subset of the rainfall data studied by De Michele et al. (2020) is re-analyzed in Sect. 5, and the JOP outcomes are compared with those corresponding to simulated time series with properties mimicking rainfall data. In Sect. 5.2, we discuss theoretical aspects enabling the correct interpretation of JOP outcomes in terms of Pearson cross correlation and marginal rates of occurrence, which are shown to provide enough information to reproduce the behavior of rainfall co-occurrence. In Sect. 6, we discuss the lessons learned from our analysis in the broader perspective of a change of paradigm in the stochastic analysis of hydro-climatological data, moving from statistical testing to stochastic modeling. Conclusions are summarized in Sect. 7.

Ambiguity
The increasing interest in compound events in hydro-climatology is partly related to relatively recent advancements in stochastic methods, which have enabled unprecedented analysis and modeling of multiple variables via convenient and easy-to-use techniques and software (e.g. Heffernan and Tawn 2004; Nelsen 2006; Salvadori et al. 2007; Davison et al. 2012; Asadi et al. 2015; Papalexiou 2018; Papalexiou and Serinaldi 2020; Papalexiou et al. 2021, and references therein). Despite such interest, a rigorous mathematical definition of compound events is still lacking in risk analysis; the literature is fragmented, and the very different concepts of compound, interconnected, interacting, and cascading risks are often used as synonyms, thus causing redundancy and confusion (Pescaroli and Alexander 2018). Koutsoyiannis et al. (2018) emphasized that "absence of a definition entails domination of intuition over logic, dark over light, or uncritical acting over critical thinking". Clarity and precision in the definition of new forms of risk are key issues because "the way we understand and describe risk strongly influences the way risk is analyzed and hence it may have serious implications for risk management and decision making" (Aven 2016). In this respect, Pescaroli and Alexander (2018) attempted to highlight the differences and complementary aspects of compound, interconnected, interacting, and cascading risks in order to propose a holistic framework that can support focused research, avoiding redundancy. In this framework, compound risk mainly refers to the concurrence of natural events, while "institutional definitions tend to focus more narrowly on the hazard component of disaster risk".
For example, the cross-border effects of the eruption of the Icelandic volcano Eyjafjallajökull in April 2010 were due to the coincidence of such a geophysical event with weather conditions that caused the ash to spread over an area with a high concentration of essential transportation nodes (Pescaroli and Alexander 2018). On the other hand, storms and related floods, or offshore earthquakes and related tsunamis, are classified as interacting events, as the former causes the latter in a chain of events reflecting the physical relationship between these phenomena. More generally, moving from compound risk to cascading risks in the framework proposed by Pescaroli and Alexander (2018) means introducing additional degrees of complexity, including interactions between human, environmental, and technological systems, and therefore those among hazard, exposure, and vulnerability. In this context, stochastic methods are the privileged approach for the quantitative assessment of compound risks, such as the simultaneous occurrence of a volcanic eruption and particular weather conditions. In fact, the inherent rarity and uncertainty of such events, reflecting the lack of knowledge about their possible physical relationships, prevent the use of deterministic methods. It is therefore easy to recognize that the analysis and modeling of compound events heavily rely on traditional stochastic methods devised for the multivariate analysis of 'joint variables' and 'joint events'. In particular, joint events have been the subject of the entire theoretical and applied literature dealing with multivariate statistics and spatio-temporal dependent processes.
However, as recognized by Pescaroli and Alexander (2018), the lack of a precise and clear definition of compound events has generated confusion, with a two-fold consequence: (i) a growing, fashionable literature on compound events that seems to deal with a new topic, but addresses technical problems that do not differ from those already discussed in the existing literature on multivariate random processes; in fact, this literature commonly uses the same stochastic approaches (often in their simplest versions) previously developed for joint events, such as multivariate distributions, generalized linear models, Markov chains, and their variants (e.g., Hao et al. 2018, for a review); and (ii) the apparent novelty of the topic seems to prevent the acknowledgment of its similarities to other topics. This causes a sort of damnatio memoriae (i.e., condemnation of memory) affecting existing theories and methods, which are therefore neglected when approaching problems that were actually solved in the past.

Inconsistency
Before discussing in depth the case of joint occurrence processes, which is the specific subject of this study, we report some preliminary examples of the possible consequences of introducing seemingly new topics without precise definitions, which prevents the construction of a consistent theory.
As mentioned above, compound events still lack an agreed mathematical definition; however, they are often introduced using the definition proposed by Leonard et al. (2014): "A compound event is an extreme impact that depends on multiple statistically dependent variables or events". This definition is problematic. While interacting factors are reasonably related either functionally (if the physical dynamics are clear) or statistically (if the physical dynamics are complex and not well known), for compound events it is often not even possible to establish whether their forcing variables are correlated or not. As compound events are extremes according to the foregoing definition, they occur rarely and can result from a set of statistically independent and individually non-dangerous forcing events that occur together with a frequency given by the product of their marginal probabilities. Therefore, the statistical dependence invoked in the definition above is questionable, as we can have compound events (extreme impacts) with statistically independent forcing factors.
Moreover, statistically significant dependence/correlation does not mean or imply physical dependence, which is the dependence of interest. Indeed, the former depends on the sample size, and almost every test assigns statistical significance to physically negligible differences for very large samples (Nicholls 2001). As discussed below, talking about 'statistically significant dependence/correlation' might be meaningless. In other words, we can have 'an extreme impact that depends on multiple statistically independent variables or events'. We also note that dependence is sometimes incorrectly considered necessary for the occurrence of joint or compound events, as independence is confused with mutual exclusiveness. Ironically, mutually exclusive events are in fact dependent on each other. Imprecise definitions generate paradoxes as well. For example, the joint occurrence of relatively low precipitation and high temperature characterizing the 2014 California drought was studied by AghaKouchak et al. (2014) as a compound event, although those variables are statistically independent, as shown by Serinaldi (2016). In this case, either the 2014 California drought is not a compound event (as precipitation and temperature are statistically independent, contradicting the definition of Leonard et al. (2014)), or it is a compound event according to a different but unspecified definition.
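The dependence of statistical significance on sample size can be illustrated with a short numerical sketch (a minimal illustration added here, not part of the original analysis): for a fixed, physically negligible sample correlation, the p value of the standard test for zero Pearson correlation shrinks as N grows.

```python
import math

def pearson_pvalue(r, n):
    """Two-sided p-value for H0: rho = 0, using the t statistic
    t = r * sqrt((n - 2) / (1 - r^2)) with a Gaussian approximation
    (adequate for the large samples considered here)."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))  # 2 * P[Z > |t|]

# A physically negligible correlation (r = 0.02) is far from significant in
# moderate samples but becomes 'statistically significant' for very large N.
for n in (100, 1_000, 10_000, 100_000):
    print(n, pearson_pvalue(0.02, n))
```

The same correlation of 0.02 is judged compatible with independence at N = 100 and incompatible at N = 100,000, even though its physical meaning is unchanged.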
Another example of the consequences of inconsistency in data analysis is the following statement: "We note that for cascading hazards with independent drivers, we cannot simply multiply the probabilities of the hazard drivers. The reason is that the events are physically related, primarily through the actual impact (e.g., debris flow as a result of rain over burned area); however, the relationship cannot be simply described using the commonly used statistical metrics. More research should focus on developing methods for characterization of such events." (AghaKouchak et al. 2020). This statement concerns the deadliest debris flow event in California's history, which occurred in 2018 in Montecito as a result of heavy rainfall over a burned area previously affected by wildfires in 2017. In this context, 'cascading events' refer to events delayed in time. While wildfires and heavy rainfall look like independent events (as there is no evidence to establish a link between them), the extreme debris flow obviously results from the earlier joint occurrence of both forcing events. Why, then, does the product of the probabilities of the hazard drivers (heavy rainfall and wildfire) not describe the probability of the target event (debris flow)? Because such a product is not the correct probabilistic model for the problem at hand.
Let us derive suitable probabilistic models. As the problem involves three hazards, a probabilistic model must involve three random variables as well, that is, R (rainfall), W (wildfire), and Q (debris flow). Assuming that we are only interested in the occurrence probability, we can take the state space of each variable to be binary, that is, each hazard either occurs or does not. Let $R_1$ ($R_0$), $W_1$ ($W_0$), and $Q_1$ ($Q_0$) denote the event (E) corresponding to the occurrence (non-occurrence) of R, W, and Q, respectively. For a generic binary variable, the events $E_0$ and $E_1$ are mutually exclusive and exhaustive, i.e. $P[E_0 \cap E_1] = 0$ and $P[E_0] + P[E_1] = 1$. Recalling the total probability theorem and the definition of conditional probability (e.g., VanMarcke 2010, pp. 26-27), and assuming that R and W are independent of each other (in agreement with AghaKouchak et al. 2020), the occurrence probability of Q (i.e. the probability of the event $Q_1$) is

$$P[Q_1] = \sum_{i \in \{0,1\}} \sum_{j \in \{0,1\}} P[Q_1 \mid R_i, W_j]\, P[R_i]\, P[W_j] \qquad (1)$$

This model highlights that the occurrence probability of debris flow Q is not (and cannot be) the simple product of the occurrence probabilities of the forcing factors R and W. The conditional probability $P[Q_1 \mid R_i, W_j]$ actually accounts for the (unknown) physical relationship mentioned by AghaKouchak et al. (2020). If R, W, and Q are assumed to be random variables of continuous type (e.g. rainfall intensity, wildfire extent, debris flow height), the model in Eq. 1 takes the following form, giving the distribution function of Q:

$$F_Q(q) = \int_{\mathcal{X}_W} \int_{\mathcal{X}_R} F_{Q \mid R, W}(q \mid r, w)\, f_R(r)\, f_W(w)\, \mathrm{d}r\, \mathrm{d}w \qquad (2)$$

where F denotes the distribution function, f is the probability density function, and $\mathcal{X}_R$ and $\mathcal{X}_W$ are the state spaces of R and W, respectively.
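The point of Eq. 1 can be made numerically with a minimal sketch (the probability values below are purely hypothetical, chosen only for illustration): even with independent R and W, $P[Q_1]$ differs from the naive product $P[R_1]P[W_1]$ because the conditional probabilities carry the physical link.

```python
from itertools import product

# Hypothetical marginal occurrence probabilities (illustrative values only)
p_R = {1: 0.10, 0: 0.90}  # heavy rainfall occurs / does not occur
p_W = {1: 0.05, 0: 0.95}  # antecedent wildfire occurs / does not occur

# Hypothetical conditional probabilities P[Q1 | Ri, Wj]: debris flow is far
# more likely when rain falls over a burned area; the (unknown) physical
# relationship lives entirely in these numbers
p_Q1_given = {(1, 1): 0.60, (1, 0): 0.01, (0, 1): 0.00, (0, 0): 0.00}

# Total probability theorem with R and W independent (Eq. 1):
# P[Q1] = sum_{i,j} P[Q1 | Ri, Wj] * P[Ri] * P[Wj]
p_Q1 = sum(p_Q1_given[(i, j)] * p_R[i] * p_W[j]
           for i, j in product((0, 1), repeat=2))

print(p_Q1)  # differs from the naive product p_R[1] * p_W[1]
```

Here the naive product $P[R_1]P[W_1]$ describes only the probability of the compound forcing, not the probability of the impact Q.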
Even though the probabilistic formalization in Eqs. 1 and 2 does not require any definition of compound events, the framework proposed by Pescaroli and Alexander (2018) provides a useful interpretation. Here, rainfall and wildfire are compound events as they concur to set the conditions for debris flow. They are not necessarily statistically dependent, thus not fulfilling the definition by Leonard et al. (2014); they are delayed in time and not interacting, as there is no evidence for physical laws linking them. On the other hand, rainfall and debris flow are interacting, as the former triggers the latter, while wildfire and debris flow are indirectly related: the former is not the direct trigger of the debris flow, but it is the direct cause of the antecedent soil conditions that amplified the rainfall effects. In this framework, $P[R_i, W_j]$ describes the probability of a compound event (i.e. the occurrence of rainfall and wildfire), while $P[Q_1 \mid R_i, W_j]$ provides a probabilistic representation of the causal chain when we do not use (or do not know) a physical model describing the interacting dynamics of rainfall, soil conditions, and debris flow. We also note the profound difference between the definitions of 'cascading events/risks' given by AghaKouchak et al. (2020) and Pescaroli and Alexander (2018).
This example highlights that the use of inconsistent and ambiguous definitions can generate not only redundancy and confusion, but also incorrect risk assessment: vague definitions hinder the correct formalization of the problem under investigation. On the other hand, the correct use of basic principles of probabilistic reasoning can prevent the untimely call for new statistical metrics and methods. As further discussed below, before dismissing seemingly old-fashioned but theoretically solid concepts, one should make sure that their apparent inconsistencies are not due to their misuse.

Rainfall occurrence: new compound or classical joint?
Occurrence processes have been widely studied in many disciplines and from different perspectives for a long time, as they are related to the occurrence or non-occurrence of natural or technological phenomena of interest, such as rainfall (Koutsoyiannis 2006; Ng and Panu 2010; Marchand 2012; Lombardo et al. 2017; Olson and Kleiber 2017; Serinaldi and Lombardo 2017b), floods (Bogachev and Bunde 2012; Serinaldi and Kilsby 2016b), droughts (Coats et al. 2016), wind storms (Mailier et al. 2006), tornados (Valente and Laurini 2020), earthquakes (Anagnos and Kiremidjian 1988; Ogata 1999), disease incidence (Diggle et al. 2005), neuron activation (Macke et al. 2009), and failure in multiversion software (Nicola and Goyal 1990), to mention just a few. Broadly speaking, the stochastic modeling of occurrence processes falls within the realm of the general theory of point processes (see e.g., Cox and Isham 1980; Møller and Waagepetersen 2003; Lowen and Teich 2005; Diggle 2013). Focusing on rainfall, occurrence (rain/no rain, or wet/dry periods) recorded at a rain gauge network can be described as a binary spatio-temporal stochastic process defined over a fixed set of spatio-temporal coordinates, that is, the rain gauge locations and the time steps identified by the gauges' temporal resolution along the recording period. As these processes are multidimensional, their description, modeling, and simulation include spatio-temporal dependence as a distinctive characteristic. Commonly, the study and modeling of dependence involve only second-order statistical properties, such as serial and cross (spatio-temporal) correlation functions (see e.g., Cox and Isham 1980; Emrich and Piedmonte 1991; Demirtas 2006; Macke et al. 2009; Diggle 2013; Serinaldi and Lombardo 2017a, b; Papalexiou 2018; Papalexiou et al. 2018; Papalexiou and Serinaldi 2020; Papalexiou et al. 2021, among others).
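For concreteness, a two-state first-order Markov chain (the simplest occurrence model of this kind, also used within the JOP algorithm discussed below) can be simulated in a few lines; the parameter values below are illustrative, not fitted to any data.

```python
import random

def markov_chain_occurrence(n, p01, p11, seed=3):
    """Simulate a binary (dry = 0 / wet = 1) occurrence series with a
    two-state first-order Markov chain; p01 = P[wet | previous dry] and
    p11 = P[wet | previous wet]. The stationary wet probability is
    p01 / (1 + p01 - p11)."""
    rng = random.Random(seed)
    p_wet_stationary = p01 / (1.0 + p01 - p11)
    x = [1 if rng.random() < p_wet_stationary else 0]  # start from stationarity
    for _ in range(n - 1):
        p_wet = p11 if x[-1] == 1 else p01
        x.append(1 if rng.random() < p_wet else 0)
    return x

# 100 years of synthetic daily occurrences with persistent wet spells
x = markov_chain_occurrence(365 * 100, p01=0.2, p11=0.6)
print(sum(x) / len(x))  # close to the stationary wet probability 0.2/0.6
```

Setting p11 > p01 introduces the positive serial dependence (wet days clustering into spells) that motivates the Markov-chain correction in step 3 of the JOP algorithm.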
Based on the remarks and examples in Sect. 2.2, the following research questions arise:
- Is standard analysis based on the Pearson correlation coefficient actually unsuitable or insufficient to describe the joint occurrence of rainfall at pairs of locations? Does it have limitations? If yes, which ones?
- Is JOP better and more informative than methods based on the Pearson correlation coefficient?
In the next sections, we try to answer these questions, highlighting the effects of re-framing joint occurrence processes as ambiguously defined compound events.
Changing perspective: from testing to modeling

JOP testing procedure
Let $X_i = \{X_{i,n}\}_{n \in \mathbb{N}}$ and $X_j = \{X_{j,n}\}_{n \in \mathbb{N}}$ be two discrete-time stochastic processes with state space $\{0, 1\}$, where $n\,(= 0, 1, 2, \ldots)$ denotes discrete time. Then, we define $\{x_{i,n}\}_{n \in \mathbb{N}}$ and $\{x_{j,n}\}_{n \in \mathbb{N}}$, for simplicity $\{x_i\}$ and $\{x_j\}$, as two sequences of binary random numbers sampled from $X_i$ and $X_j$, respectively, taking values $l$ and $k$ with probabilities $p^{X_i}_l$ and $p^{X_j}_k$, where $l, k \in \{0, 1\}$. For example, $\{x_i\}$ and $\{x_j\}$ can be seen as daily time series describing the occurrence ('1') or non-occurrence ('0') of rainfall at two locations i and j (e.g. two rain gauge stations), whereby rainfall occurrence corresponds to the daily precipitation amount exceeding a given threshold value (e.g. 0 mm, as in De Michele et al. (2020)).
For each n, we can have four possible combinations of the values assumed by the pair $(X_i, X_j)$: $(x_{i,n} = 1, x_{j,n} = 1)$ or wet-wet condition, $(x_{i,n} = 0, x_{j,n} = 0)$ or dry-dry, $(x_{i,n} = 1, x_{j,n} = 0)$ or wet-dry, and $(x_{i,n} = 0, x_{j,n} = 1)$ or dry-wet. Following De Michele et al. (2020), let $P_{lk}$ denote the probability of the outcome $(l, k)$, with $l, k \in \{0, 1\}$, that is

$$P_{lk} = P[X_i = l, X_j = k] \qquad (3)$$

If the processes $X_i$ and $X_j$ are independent of one another, the probability $P_{lk}$ is the product of the marginal probabilities of occurrence, $P^I_{lk} := p^{X_i}_l p^{X_j}_k$. The JOP testing procedure aims at assessing whether $P[X_i = l, X_j = k] \neq P[X_i = l]\,P[X_j = k]$, meaning that the locations i and j tend to exhibit the joint condition (l, k), e.g. wet-wet, with frequency lower or higher than that expected under the assumption of independence. To support our discussion and for the sake of completeness, we recall the JOP algorithm, which consists of the following six steps to be applied to each pair of time series $\{x_i\}$ and $\{x_j\}$, and each year of data (De Michele et al. 2020):

1. Calculate the sample estimate, $\hat{P}_{lk}$, of $P_{lk}$ for each of the four combinations (i.e., 00, 11, 01, and 10).
2. Compare $\hat{P}_{lk}$ with the confidence interval (CI) under the hypothesis of independence (i.e., binomial distribution)

$$\left[l^{L}_{\hat{P},\alpha},\, u^{L}_{\hat{P},\alpha}\right] = \left[\hat{P}^{I}_{lk} - z_{\alpha/2}\sqrt{\frac{\hat{P}^{I}_{lk}(1-\hat{P}^{I}_{lk})}{N}},\; \hat{P}^{I}_{lk} + z_{\alpha/2}\sqrt{\frac{\hat{P}^{I}_{lk}(1-\hat{P}^{I}_{lk})}{N}}\right] \qquad (4)$$

where $\hat{P}^{I}_{lk}$ is the sample estimate of $P^I_{lk}$, $z_{\alpha/2}$ is the $\alpha/2$ quantile of the standard Gaussian distribution, N is the sample size, and $\alpha$ is the significance level. The interval in Eq. 4 is also known as the Wald interval (see e.g., Miao and Gastwirth 2004). If $\hat{P}_{lk} \in [l^{L}_{\hat{P},\alpha}, u^{L}_{\hat{P},\alpha}]$, then $X_i$ and $X_j$ are independent with respect to combination lk.
3. Otherwise, if $\hat{P}_{lk} \notin [l^{L}_{\hat{P},\alpha}, u^{L}_{\hat{P},\alpha}]$, calculate the CI under the hypothesis of independence via Monte Carlo simulations, assuming that $X_i$ and $X_j$ can be modeled by two-state Markov chains, to account for the effect of possible serial dependence. If $\hat{P}_{lk} \in [l^{MC}_{\hat{P},\alpha}, u^{MC}_{\hat{P},\alpha}]$, then $X_i$ and $X_j$ are independent with respect to combination lk.
4. If $\hat{P}_{lk} > u^{MC}_{\hat{P},\alpha}$ (or $\hat{P}_{lk} < l^{MC}_{\hat{P},\alpha}$), check the behavior of the other combinations (denoted as $l'k'$) different from the tested lk, as $\sum_{l,k} \hat{P}_{lk} = 1$ and the increase of one $\hat{P}_{lk}$ causes the decrease of one or more $\hat{P}_{l'k'}$. If there is at least one combination such that $\hat{P}_{l'k'} > u^{MC}_{\hat{P},\alpha}$ (or $\hat{P}_{l'k'} < l^{MC}_{\hat{P},\alpha}$), then $X_i$ and $X_j$ are judged dependent with respect to lk (or $l'k'$).
5. Conversely, make the same check shrinking the CI, i.e. considering a significance level equal to $1.5\alpha$, as the reduction (or increase) of $\hat{P}_{l'k'}$ might not be significant at level $\alpha$. If there is a combination such that $\hat{P}_{l'k'} > u^{MC}_{\hat{P},1.5\alpha}$ (or $\hat{P}_{l'k'} < l^{MC}_{\hat{P},1.5\alpha}$), then $X_i$ and $X_j$ are judged dependent with respect to lk (or $l'k'$).
6. If not, check the p value of $\hat{P}_{lk} > u^{MC}_{\hat{P},\alpha}$. If the p value is < 0.0025, then $X_i$ and $X_j$ are considered dependent with respect to lk; otherwise, they are considered independent.
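Step 2 of the algorithm above can be sketched as follows. This is a simplified, self-contained illustration (toy data, and a Wald-type interval built around the sample independence estimate $\hat{P}^I_{lk}$), not the full JOP implementation.

```python
from statistics import NormalDist

def independence_ci(p_ind, n, alpha):
    """Wald-type confidence interval around the independence probability
    estimate, with z the Gaussian quantile at level 1 - alpha/2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * (p_ind * (1.0 - p_ind) / n) ** 0.5
    return p_ind - half, p_ind + half

def p_hat(x_i, x_j, l, k):
    """Sample estimate of P_lk from two binary series of equal length."""
    return sum(1 for a, b in zip(x_i, x_j) if (a, b) == (l, k)) / len(x_i)

# Toy binary series (hypothetical data; N = 10 only for illustration)
x_i = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
x_j = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
p11 = p_hat(x_i, x_j, 1, 1)
p_ind = (sum(x_i) / len(x_i)) * (sum(x_j) / len(x_j))  # p1_i * p1_j
lo, hi = independence_ci(p_ind, len(x_i), alpha=0.005)
print(p11, (lo, hi), lo <= p11 <= hi)  # inside the CI -> no rejection
```

With realistic annual samples (N = 365) the interval is much narrower; the toy N here merely shows the mechanics of the comparison.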
The following preliminary remarks are needed to better investigate the properties of the JOP algorithm described above:
- Similar to many other works developed under the umbrella of 'compound events', JOP does not refer to any true risk analysis involving hazard, exposure, vulnerability, interactions between human, environmental, and technological systems, disaster escalation, and appropriate cost-benefit and/or multi-criteria analysis. Actually, JOP simply deals with the study of the joint behavior of two binary random processes, that is, a classical topic of spatio-temporal statistics. Thus, referring to the 'compound' context does not add any information to the problem definition and formalization. Recognizing this fact is fundamental to reviewing the existing literature and properly addressing the available theories and operational tools before proposing something new (see discussion in Sect. 2.2).
- De Michele et al. (2020) suggest using quite a small significance level $\alpha = 0.005$ to reduce the probability of Type I errors. However, while this value is valid for steps 2 and 3 of the JOP algorithm, it does not provide the effective statistical significance of the overall procedure, as step 4 requires multiple comparisons involving probabilities different from the target $P_{lk}$ (resulting in a multiple testing procedure). Moreover, step 5 implies shrinking the CIs of the other combinations, while step 6 resorts to p values. Therefore, the effective significance of the overall procedure is essentially unknown.
- Type I and Type II errors cannot both be arbitrarily small, because the probability of the latter tends to increase as that of the former decreases, and vice versa. Therefore, statistical hypothesis tests should be applied balancing the two types of errors or focusing on one of them. However, this is possible only if we know the effective significance and have insight into the power of the test, i.e. the rejection rate under the alternative hypothesis (here, some form of dependence between the two processes $X_i$ and $X_j$). Thus, before applying the JOP test to real-world data, we need to know its properties and performance under controlled conditions.

Pearson-based modeling
Investigating the significance and power of the JOP testing procedure requires suitable models enabling the simulation of processes with prescribed properties, such as serial dependence, marginal (at-site) rate of occurrence, cross dependence, and seasonality. On the other hand, the availability of such models, capable of reproducing the key properties of an observed process, makes tests rather superfluous, especially if the tested null hypothesis $H_0$ is unrealistic, such as the lack of cross dependence among rainfall time series recorded over areas affected by similar or contrasting weather patterns. Referring to Sect. 6 for a deeper discussion of the inherent inconsistencies of statistical testing procedures, a model-based method does not focus on rejecting a hypothesis that is often trivial and already untenable, but on reproducing something that is theoretically sound and empirically consistent with the observations. Therefore, our model-based approach starts from existing (prior) knowledge of key features of rainfall fields, including the existence of spatial homogeneity at given spatio-temporal scales, and the possible dual behavior (wet/dry) of different areas, related for instance to fluctuations in the difference of atmospheric pressure at sea level (such as the North Atlantic Oscillation or the Southern Oscillation). Thus, we discard the hypothesis of independence, which is too restrictive and unrealistic according to (conditioned on) our prior knowledge. Instead, following the principle of parsimony, we adopt a modeling strategy that includes dependence in the simplest but most effective and tenable way, by means of cross correlation, and includes independence as a special case. Finally, we check whether the model is able to reasonably reproduce the observed summary statistics of the binary pairs, including $P_{lk}$.
If the agreement between observations and simulations is plausible, then the model and its components (here, cross correlation and marginal rates of occurrence) provide a satisfactory description, and can be used for practical purposes, bearing in mind that "the sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work - that is, correctly to describe phenomena from a reasonably wide area. Furthermore, it must satisfy certain esthetic criteria - that is, in relation to how much it describes, it must be rather simple." (von Neumann 1955).
Binary sequences with specified serial and cross correlation, and given rates of occurrence, are simulated by the BetaBit algorithm (Serinaldi and Lombardo 2017a) and its extension BetaBitST, which can be generalized within the CoSMoS framework (Papalexiou 2018; Papalexiou et al. 2018; Papalexiou and Serinaldi 2020; Papalexiou et al. 2021). In Appendix A, we suggest an extension of BetaBitST enabling the simulation of negatively cross-correlated binary random processes, which are used in the discussion below.
These algorithms implement models for serially and cross-correlated binary processes based on the Pearson correlation coefficient and the rates of occurrence. We note that these models enable: (i) precise reproduction of the characteristics of correlated binary processes, including the four probabilities $P_{lk}$ and any dependence or similarity measure, such as those listed by Wijaya et al. (2016) and Brusco et al. (2021); (ii) uncertainty analysis of any summary statistic characterizing the correlated binary processes; and (iii) power analysis and validation/falsification of any statistical test involving serially and mutually dependent binary processes. On the other hand, leaving aside the inconsistencies discussed later in Sect. 6, statistical tests can only establish whether the evidence supports rejection (or non-rejection) of some generally trivial hypothesis $H_0$ of limited usefulness. In other words, modeling does not require any statistical test, while statistical tests necessarily require modeling in order to be validated before being used in real-world applications.
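A generic way to simulate such binary pairs, sketched below, is to clip a pair of correlated standard Gaussian variables at the quantiles matching the target rates of occurrence. This is only an illustration of the Gaussian-threshold idea underlying this family of algorithms, not the BetaBit implementation itself; note that the parent (Gaussian) correlation maps monotonically, but not identically, onto the binary Pearson correlation, and may be negative.

```python
import math
import random
from statistics import NormalDist

def correlated_binary_pair(n, p_i, p_j, rho_gauss, seed=42):
    """Two binary series obtained by thresholding correlated standard
    Gaussians; rho_gauss is the parent correlation, p_i and p_j the target
    marginal rates of occurrence."""
    rng = random.Random(seed)
    nd = NormalDist()
    z_i = nd.inv_cdf(1.0 - p_i)  # exceedance threshold giving P[x_i = 1] = p_i
    z_j = nd.inv_cdf(1.0 - p_j)
    x_i, x_j = [], []
    for _ in range(n):
        g1 = rng.gauss(0.0, 1.0)
        g2 = rho_gauss * g1 + math.sqrt(1.0 - rho_gauss ** 2) * rng.gauss(0.0, 1.0)
        x_i.append(1 if g1 > z_i else 0)
        x_j.append(1 if g2 > z_j else 0)
    return x_i, x_j

x_i, x_j = correlated_binary_pair(100_000, p_i=0.3, p_j=0.4, rho_gauss=0.7)
print(sum(x_i) / len(x_i), sum(x_j) / len(x_j))  # marginal rates ~0.3 and ~0.4
```

A negative rho_gauss yields a joint wet-wet frequency below the independence value, mimicking the negatively cross-correlated case treated in Appendix A.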

Monte Carlo experiments
In this section, we investigate the effective significance and power of the JOP test, as well as the combined effects of serial correlation and seasonally varying rates of occurrence $p^{X_i}_1$ and $p^{X_j}_1$ on the JOP outcomes and their interpretation.

Significance of JOP test and effects of serial correlation
Firstly, we investigate the effective significance of the JOP test, i.e. the actual rejection rate when the null hypothesis, $H_0: P_{lk} = P^I_{lk}$, is true. As mentioned in Sect. 3.1, the JOP test accounts for possible serial dependence in the processes $X_i$ and $X_j$ by assuming that they can be modeled by two-state Markov chains. Here, we use binary processes characterized by (i) an exponentially decaying autocorrelation function (ACF) corresponding to the first-order autoregressive model, AR(1), with parameter $\rho_1$ (e.g. Jacobs and Lewis 1983; Jentsch and Reichmann 2019), which is similar to the ACF of Markov chains, and alternatively (ii) a power-law decaying ACF corresponding to fractional Gaussian noise (fGn) with parameter H (see e.g. Lowen and Teich 2005; Gong et al. 2011, and Appendix B for ACF formulas). Complementing AR(1) with fGn allows for a more comprehensive assessment of JOP behavior for a process implying long-range persistence and remarkable over-dispersion (Serinaldi and Lombardo 2017a, b, 2020). Simulations consist of generating independent time series $\{x_i\}$ and $\{x_j\}$ of size N = 365 with $\rho_1$ ranging from 0 to 0.9 in steps of 0.1, $H \in \{0.55, 0.65, 0.75, 0.85\}$, and marginal rates of occurrence $p^{X_i}_1$ and $p^{X_j}_1$ taking values between 0.1 and 0.5, and then applying the JOP test. This setting mimics the analysis of daily records on an annual basis in terms of sample size. The values of $\rho_1$ and H cover a wide range of serial dependence scenarios, from independence to fairly strong persistence, while the values of $p^{X_i}_1$ and $p^{X_j}_1$ are consistent with the possible combinations of average probabilities of occurrence of daily rainfall in various climates (e.g., Harrold et al. 2003; Robertson et al. 2004; Serinaldi 2009; Olson and Kleiber 2017). For each combination of parameters, simulation and JOP testing are repeated 10,000 times in order to compute the empirical rejection rate. For each simulated series, Monte Carlo CIs are computed using 5000 auxiliary simulated series with rates of occurrence, $\rho_1$, and H estimated on the parent series, and the same sample size (365).
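The logic of such an experiment can be sketched with a drastically simplified analogue: simulate many pairs under a true $H_0$, apply the test, and record the fraction of rejections. The stand-in below uses a single Wald-type comparison on serially independent Bernoulli series instead of the full JOP procedure, so it only illustrates how an effective significance is estimated, not the JOP results themselves.

```python
import random
from statistics import NormalDist

def wald_rejects(x_i, x_j, alpha):
    """Reject H0 (independence) if the sample wet-wet probability falls
    outside a Wald-type CI around the product of the sample marginal rates
    (a simplified stand-in for the full JOP procedure)."""
    n = len(x_i)
    p_ind = (sum(x_i) / n) * (sum(x_j) / n)
    p11 = sum(a * b for a, b in zip(x_i, x_j)) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * (p_ind * (1.0 - p_ind) / n) ** 0.5
    return not (p_ind - half <= p11 <= p_ind + half)

def effective_significance(n_rep=2000, n=365, p=0.3, alpha=0.05, seed=1):
    """Empirical rejection rate under a true H0 (two independent Bernoulli
    series). Because the marginals are estimated, the comparison tends to be
    conservative, so the rate typically falls below the nominal alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        x_i = [1 if rng.random() < p else 0 for _ in range(n)]
        x_j = [1 if rng.random() < p else 0 for _ in range(n)]
        rejections += wald_rejects(x_i, x_j, alpha)
    return rejections / n_rep

print(effective_significance())
```

Even this bare-bones version rejects far less often than the nominal level suggests, which illustrates how a procedure can be conservative by construction.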
For nominal rejection rates α = 0.5%, 5%, and 10%, the effective rejection rates are much lower than the nominal ones, with maximum values equal to 0.04%, 0.39%, and 1.43%, respectively, for binary processes with AR(1) ACF, and 0.02%, 0.2%, and 0.48% for binary processes with fGn ACF (see Fig. 1). Moreover, rejection rates tend to decrease as ρ_1 and H increase. The cases corresponding to binary processes with AR(1) ACF and nominal significance equal to 5% and 10% show that the effective rejection rates decrease as the rates of occurrence p_1^{Xi} and p_1^{Xj} increase. This behavior is less evident for the other cases because of the very low effective rejection rates. These results indicate that the JOP test almost never rejects the null hypothesis when it is true (that is, the test is very conservative), and that it is generally not possible to control the effective rejection rate by modulating the nominal one. In fact, the suggested use of the nominal α = 0.5% to avoid Type I errors does not seem to be grounded in theoretical or empirical evidence, and this choice can be problematic in terms of the power of the test, as discussed below.

Effects of seasonal variation under mutual independence
Laplace CIs and Monte Carlo CIs (relying on first-order Markov chains) in steps 2 and 3 of the JOP algorithm (Sect. 3.1) implicitly assume that the theoretical values of p_1^{Xi} and p_1^{Xj} (and p_0^{Xi} and p_0^{Xj}) are constant over the period of analysis. However, the seasonal patterns of rainfall occurrence (e.g., Harrold et al. 2003; Robertson et al. 2004; Serinaldi 2009; Mehrotra et al. 2012; Olson and Kleiber 2017) make this assumption quite unrealistic for daily data over annual time windows, which is the time interval suggested by De Michele et al. (2020) for JOP analysis. Therefore, it is important to understand the effect of seasonally varying p_1^{Xi} and p_1^{Xj} on the outcome of the JOP procedure. To this aim, we performed a Monte Carlo experiment with the following settings: sample size N = 365, AR(1) ACF with ρ_1 ranging from 0 to 0.9 by steps of 0.1, fGn ACF with H ∈ {0.55, 0.65, 0.75, 0.85}, and cyclically varying rates of occurrence p_1^{Xi} = p_1^{Xj} = 0.16·cos(2πn/365) + 0.25, n = 1, …, N. For the AR(1) correlation structure, Fig. 2 shows that the seasonal fluctuations of p_1^{Xi} and p_1^{Xj} yield rejection rates higher than those obtained without seasonality (Fig. 1). There is quite a clear dependence of the rejection rate on ρ_1 for all α values. Seasonality tends to generate spurious correlation that inflates the rejection rate. On the other hand, increasing serial correlation tends to conceal the seasonal patterns of p_1^{Xi} and p_1^{Xj}, and therefore the spurious cross correlation. In agreement with the results described in Sect. 4.1, long-range dependence causes a decrease of the effective rejection rate compared to the case of the AR(1) dependence structure. Long-range dependence conceals the effects of seasonality, preventing the appearance of spurious correlations and the inflation of the rejection rate. These results show that the type of correlation structure is an additional factor influencing the JOP outcome.
Therefore, establishing a clear relationship between effective and nominal rejection rates is even more problematic due to the dependence of the effective rejection rate on the type and intensity of serial correlation.
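The spurious cross correlation induced by synchronous seasonality can be illustrated with a minimal simulation (a sketch; the JOP test itself is not implemented here): two mutually independent Bernoulli series sharing the cyclic rate p(n) = 0.16·cos(2πn/365) + 0.25 exhibit a clearly positive sample correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
n_years, N = 50, 365
n = np.arange(1, n_years * N + 1)
p = 0.16 * np.cos(2 * np.pi * n / 365) + 0.25   # shared cyclic rate

# two *mutually independent* binary series with the same seasonal rate
x = (rng.random(n.size) < p).astype(int)
y = (rng.random(n.size) < p).astype(int)

rho_spurious = np.corrcoef(x, y)[0, 1]
# theoretical value: Var(p)/[E(p) - E(p)^2] = (0.16**2/2)/0.1875, about 0.07
```

Despite mutual independence, the common seasonal cycle alone produces a nonzero correlation of the order of Var(p)/Var(X).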

Power of JOP test and compound effects of serial and cross correlation
Besides significance, a complementary aspect to be investigated before applying a testing procedure is its power, i.e. its rejection rate under the assumption that the alternative hypothesis is true (here, H_1: P ≠ P^I). In other words, the power of the test is the probability that we reject H_0 when it is false. Based on the results of Sect. 4.1, we performed a Monte Carlo experiment with the following settings: sample size N = 365, cross correlation ρ_c ranging from 0 to 0.9 by steps of 0.1, constant rates of occurrence p_1^{Xi} = p_1^{Xj}, and AR(1) serial correlation with ρ_1 ∈ {0.1, 0.5, 0.8}. Results of the power analysis are reported in Fig. 3. As the JOP test is very conservative, its power is rather low when α = 0.5%. The most interesting result is, however, the concealing effect resulting from the interaction of serial and cross correlations. Serial correlation remarkably reduces the power, that is, the ability of the test to detect the existing true cross correlation. This effect already appears for weak/moderate AR(1) serial correlation structures (with short-range dependence), and becomes more prominent as the temporal persistence increases. This means that the treatment of serial correlation in the JOP test (step 3 of the test) does not prevent negative effects on the test power. Overall, the JOP test does not allow a clear control of significance, power, and effects of nuisance factors, thus making the reliability assessment of numerical results difficult, if not impossible.
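For comparison, a power experiment for a classical φ-based independence test (the chi-squared statistic Nφ² with one degree of freedom) can be set up in a few lines. This is not the JOP test; pairs are generated by clipping correlated Gaussian variates, serially independent for simplicity, and all parameter values are illustrative.

```python
import numpy as np
from statistics import NormalDist

def phi_corr(x, y):
    """Sample Pearson (phi) correlation of two binary series."""
    return np.corrcoef(x, y)[0, 1]

def rejection_rate(r_gauss, p1=0.25, N=365, M=1000, seed=1):
    """Fraction of M simulations in which the chi-squared statistic
    N*phi^2 exceeds the chi2_1 critical value 3.841 (alpha = 5%)."""
    rng = np.random.default_rng(seed)
    thr = NormalDist().inv_cdf(1.0 - p1)
    z1 = rng.standard_normal((M, N))
    z2 = r_gauss * z1 + np.sqrt(1.0 - r_gauss**2) * rng.standard_normal((M, N))
    x, y = (z1 > thr).astype(int), (z2 > thr).astype(int)
    rej = sum(N * phi_corr(x[k], y[k]) ** 2 > 3.841 for k in range(M))
    return rej / M

size = rejection_rate(0.0)    # effective significance, close to the nominal 5%
power = rejection_rate(0.6)   # power under clear cross correlation
```

Under independence the rejection rate stays close to the nominal level, while a moderate cross correlation is detected almost surely at N = 365; this is the kind of controllable behavior that, according to our experiments, the JOP test lacks.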

Combined effects of cross correlation and seasonality
We also checked the power of the JOP test when seasonal patterns of the rate of occurrence combine with cross correlation. In this case, Monte Carlo experiments are set up as follows: sample size N = 365, cross correlation ρ_c ranging from 0 to 0.9 by steps of 0.1, cyclically varying rates of occurrence p_1^{Xi} = p_1^{Xj} = 0.16·cos(2πn/365) + 0.25, n = 1, …, N, and ρ_1 ∈ {0.1, 0.5, 0.8}. The analysis is repeated 1000 times for each parameter combination.
Comparing Fig. 4 to Fig. 3, we can note that the synchronous cyclic rate of occurrence inflates the power, as it combines with cross correlation, strengthening the frequency of simultaneous occurrence of the pairs '00' and '11'. However, a closer look reveals an opposite effect when ρ_1 = 0.8. This behavior is more evident for α = 0.5% and moderate/high values of cross correlation, and denotes quite a complex interaction between serial correlation and the cyclic rate of occurrence. Of course, a deeper analysis including seasonal patterns of p_1^{Xi} and p_1^{Xj} with different amplitudes, periods, and phases, and different values of ρ_1 for X_i and X_j, would be feasible but nonexhaustive in any case, as the number of possible combinations of parameters and conditions is almost infinite. Moreover, additional investigation is also not very informative, as the JOP analysis suffers from shortcomings more serious than the effects of confounding factors on significance and power. These drawbacks are discussed below in detail.

Rainfall data re-analysis
For illustration, De Michele et al. (2020) reported graphical representations of JOP outcomes, and the corresponding interpretation, for some pairs of European daily rainfall time series. We re-analyzed 10 of those pairs accounting for the results of the foregoing Monte Carlo experiments. In particular, we performed the JOP analysis on an annual basis and using monthly stratification. Similar to seasonal analysis, here 'monthly stratification' means concatenating all the observations of each specific calendar month (January, February, etc.) to create twelve time series of size equal to n_m × n_y, where n_m is the number of days of a specific month (28/29, 30, or 31) and n_y is the number of years of record. Annual analysis allows direct comparison and independent checking of the original results, while monthly stratification filters out the concealing effects of seasonality and guarantees approximately constant (seasonally filtered) rates of occurrence. The annual analysis for the pair SOUID100810-SOUID105863 (Fig. 5) was reported as an example of 'dry-dry type of dependence', as there are several years with P̂_00 falling outside the CIs, and the JOP test suggests rejection of H_0: P_00 = P^I_00. Yet, P̂_11 values are systematically greater than those expected under independence, but generally within the CIs, i.e., P̂_11 values are close to but lower than the upper limit of the Laplace and Monte Carlo CIs. Conversely, P̂_10 and P̂_01 values are systematically lower than those expected under independence. On the other hand, results of the annual analysis for the pair SOUID105474-SOUID110384 (Fig. 6) were reported as a typical example of 'dry-wet type of dependence', as there are several years with P̂_01 values falling outside the CIs, and the JOP test suggests rejection of H_0: P_01 = P^I_01. Yet, compared with the first example, in this case P̂_10 values are systematically greater than those expected under independence, while P̂_11 and P̂_00 values are systematically lower.
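The monthly stratification described above amounts to a simple group-by on the calendar month; a minimal sketch on a toy occurrence series (dates and rate are placeholders, not the actual station records):

```python
import numpy as np
from datetime import date, timedelta

rng = np.random.default_rng(0)
start = date(2001, 1, 1)
n_days = (date(2010, 12, 31) - start).days + 1      # 10 years of daily data
dates = [start + timedelta(days=k) for k in range(n_days)]
x = (rng.random(n_days) < 0.25).astype(int)          # toy binary occurrence series

# concatenate all observations of each calendar month across years,
# producing twelve series of size n_m * n_y
months = np.array([d.month for d in dates])
strata = {m: x[months == m] for m in range(1, 13)}
```

For a 2001-2010 record, the January series has 31 × 10 = 310 values, while February has 28 × 8 + 29 × 2 = 282 values because of the leap years 2004 and 2008.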

Annual and monthly JOP analysis
Based on the above examples and the others shown in Figs. S1-S8, De Michele et al. (2020) concluded that the JOP procedure can discriminate different types of dependence corresponding to wet-wet, dry-dry, wet-dry, dry-wet conditions or their possible combinations. However, filtering the seasonal effect out, the results of the monthly analysis tell a different story. For almost all pairs of stations and months, the probabilities P̂_11, P̂_00, P̂_10, and P̂_01 fall outside the CIs, and the JOP test rejects H_0: P = P^I, with few exceptions corresponding to the pairs (SOUID100039 (Salzburg, Austria), SPE00120602 (Valladolid, Spain)) and (SOUID110002 (Nes Pa Hedmark, Norway), SOUID109403 (Tromso, Norway)). Therefore, filtering the seasonal effect out leads to the following conclusions: 1. Under dependence, all four probabilities P̂ simultaneously deviate from the values expected under independence, consistently with only two admissible scenarios (overall positive or negative correlation). 2. The pairs SOUID100810-SOUID105863 and SOUID105474-SOUID110384 are two examples of positively and negatively correlated bivariate binary series, respectively. 3. The JOP analysis should be performed on series (or sub-samples) fulfilling, at least approximately, the implicit assumption of stationary p_1^{Xi} and p_1^{Xj}, in order to avoid the concealing effects of their seasonal fluctuations, which in turn lead to confusing numerical artefacts with physical properties.

Revising the interpretation of JOP output: a matter of total probability theorem
Elementary properties of bivariate binary processes provide the theoretical explanation of the empirical results in points (1) and (2) listed at the end of Sect. 5.1. Let X_i and X_j be two mutually independent binary processes with marginal rates of occurrence p_1^{Xi} = 1 − p_0^{Xi} and p_1^{Xj} = 1 − p_0^{Xj}. The four joint probabilities P^I describe a valid joint distribution if they fulfill not only the relationship mentioned by De Michele et al. (2020), but also at least three additional relationships that uniquely define the four probabilities P^I, ensuring that they are consistent with the marginal probabilities p_1^{Xi} and p_1^{Xj}. By the total probability theorem, we have for example:

P_10 = p_1^{Xi} − P_11,  (8)
P_01 = p_1^{Xj} − P_11,  (9)
P_00 = 1 − p_1^{Xi} − p_1^{Xj} + P_11.  (10)

Now, let us consider two binary processes with marginal rates of occurrence equal to those of the foregoing independent processes, and some unknown form of dependence that impacts the joint probabilities. Without loss of generality, let us assume that such an unknown form of dependence results in P_11 > P^I_11 (i.e. the so-called '11' dependence in JOP terminology). In order to have a valid joint distribution consistent with the marginal probabilities, it must be P_10 < P^I_10, P_01 < P^I_01, and P_00 > P^I_00, according to the constraints in Eqs. 8, 9, and 10, respectively. Recalling the definition of the Pearson correlation coefficient of two binary variables, known as the Pearson φ coefficient or mean square contingency (Cramér 1946, pp. 281-283),

ρ_c ≡ φ = (P_11 P_00 − P_10 P_01) / √(p_1^{Xi} p_0^{Xi} p_1^{Xj} p_0^{Xj}),  (11)

it follows that such a case corresponds to positive cross correlation, ρ_c > 0, which is the scenario in Equation 5. Conversely, let us assume that the unknown form of dependence results in P_11 < P^I_11. According to Eqs. 8, 9, and 10, it must be P_10 > P^I_10, P_01 > P^I_01, and P_00 < P^I_00. Moreover, ρ_c < 0, and this case corresponds to the scenario in Equation 6. Starting the same reasoning from any of the probabilities P other than P_11 yields the same conclusions.
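The chain of implications above can be verified numerically: fixing the marginals and sweeping P_11 over its admissible range, the constraints in Eqs. 8-10 force all four probabilities to move together, and the sign of ρ_c matches that of P_11 − P^I_11 (a sketch with illustrative marginals).

```python
import math

def joint_from_p11(pi, pj, p11):
    """Complete the joint distribution of two binary variables from
    the marginals and P_11, via the total probability theorem (Eqs. 8-10)."""
    return p11, pi - p11, pj - p11, 1.0 - pi - pj + p11

pi, pj = 0.3, 0.4
den = math.sqrt(pi * (1 - pi) * pj * (1 - pj))
p11_indep = pi * pj                            # P^I_11 under independence

for p11 in [0.05, 0.12, 0.125, 0.20, 0.28]:    # admissible: [0, min(pi, pj)]
    P11, P10, P01, P00 = joint_from_p11(pi, pj, p11)
    phi = (P11 * P00 - P10 * P01) / den        # Eq. 11
    if P11 > p11_indep:    # '11' dependence forces P10, P01 down, P00 up, phi > 0
        assert P10 < pi - p11_indep and P01 < pj - p11_indep and phi > 0
    elif P11 < p11_indep:  # and conversely for P11 below its independence value
        assert P10 > pi - p11_indep and P01 > pj - p11_indep and phi < 0
```

The numerator of Eq. 11 actually simplifies to P_11 − p_1^{Xi} p_1^{Xj} once Eqs. 8-10 are substituted, which makes the sign relationship immediate.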
Using the somewhat misleading JOP language, '11' dependence would necessarily imply '00' dependence, meaning that both P_11 and P_00 are necessarily greater than P^I_11 and P^I_00, respectively. In this case, we also necessarily have a form of '01' and '10' dependence corresponding to the systematic inequalities P_10 < P^I_10 and P_01 < P^I_01. On the other hand, '01' dependence would imply '10' dependence (i.e. simultaneous P_10 > P^I_10 and P_01 > P^I_01), and necessarily a form of '11' and '00' dependence corresponding to the systematic inequalities P_00 < P^I_00 and P_11 < P^I_11. In the 'new world' of vaguely defined compound events, JOP implicitly assumes that only the relationship in Equation 7 links the four joint probabilities P, allowing three of them to move arbitrarily. This misconception leads to the incorrect belief that the probabilities P can be tested individually without accounting for the theoretical constraints driving their joint behavior. On the other hand, going back to the 'classical world' of well-devised probability theory, we can recover the missed elementary properties of joint binary stochastic processes, which allow meaningful analysis and correct interpretation. In the case of the JOP procedure, probability theory says that two binary processes can only be positively/negatively correlated (or independent, of course), and the four probabilities P must fulfill the inequalities in Eqs. 5 and 6, which are the only admissible scenarios. Any analysis that does not recognize this behavior is theoretically inconsistent.
As far as point (3) is concerned, the misinterpretation of JOP output is further exacerbated by performing the analysis on time series with cyclostationary (seasonally varying) p_1^{Xi} and p_1^{Xj}, with no preliminary check of its effect, as well as of significance and power. Therefore, the JOP analysis not only relies on an algorithm that neglects a fundamental theorem of probability theory, but it is also applied neglecting its own implicit assumption of stationarity for p_1^{Xi} and p_1^{Xj}. Filtering out the seasonal variability of p_1^{Xi} and p_1^{Xj}, our monthly analysis reveals the expected behavior consistent with the theory, by recognizing the two scenarios in Eqs. 5 and 6 and indicating that all the probabilities P ≠ P^I under dependence. As a consequence, the identification of multiple types of dependence (i.e. '01', '10', '01&10', etc.) has neither theoretical nor empirical basis, as it comes from the incorrect interpretation of numerical output resulting from a theoretically flawed algorithm.
To further support the foregoing statements, here we present a simple Monte Carlo simulation showing that the results of the annual analysis are spurious numerical artefacts, which cannot be used to draw conclusions about supposed types of dependence different from the overall (positive or negative) dependence summarized by ρ_c. We simulated two pairs of binary time series {x_i} and {x_j} with size N = 365 · 50, AR(1) serial dependence with ρ_1 = 0.3, cross correlation ρ_c ∈ {−0.3, 0.3}, and cyclically varying p_1^{Xi} = p_1^{Xj} = 0.16·cos(2πn/365) + 0.25, n = 1, …, N. This setting is similar to that used in Sect. 4.2, and mimics 50 years of daily data. Figures 7 and S9 show JOP results for the pairs with positive and negative cross correlation, respectively. According to the original interpretation of the JOP annual analysis, one should conclude that the positively cross-correlated pair exhibits '11' dependence, while the negatively cross-correlated pair exhibits '10' dependence. However, the monthly analysis clearly shows that (i) all four probabilities P̂ generally fall outside their CIs, (ii) the null hypothesis (H_0: P = P^I) is almost always rejected, (iii) the lack of systematic rejection for some of the probabilities P is a spurious effect of the interaction of cross correlation and seasonality, and (iv) the results fall within the two scenarios in Eqs. 5 and 6. Note that we purposely used relatively low values of ρ_c to emphasize the confounding effect of cyclically varying p_1^{Xi} and p_1^{Xj}, and to obtain JOP results resembling those corresponding to the observed series with supposed '11' and '01' dependence. Therefore, some cases of lack of rejection in the monthly analysis are expected as well. Modulating ρ_c, p_1^{Xi}, and p_1^{Xj}, we can also build synthetic examples so that JOP yields other supposed types of dependence ('00', '00&11', etc.).
Of course, the cases '00&11' and '01&10' can be obtained using higher absolute values of ρ_c or reducing the amplitude of the seasonal variation of p_1^{Xi} and p_1^{Xj}. In fact, both strategies reduce the concealing effect of cyclostationary p_1^{Xi} and p_1^{Xj} on the annual JOP analysis. In all cases, the monthly analysis better highlights the expected theoretical patterns in Eqs. 5 and 6, revealing the existence of only two types of admissible scenarios alternative to independence.

Rainfall re-analysis from a modeling perspective
To further stress the inconsistency of testing dependence by using the single probabilities P, we show how a model-based approach involving the cross correlation ρ_c and the marginal probabilities p_1^{Xi} and p_1^{Xj} can effectively describe the patterns of P for the pairs of rainfall time series analyzed in Sect. 5.1. In fact, unlike the single probabilities P, the three quantities ρ_c, p_1^{Xi}, and p_1^{Xj} provide sufficient information to describe a bivariate binary process and to set up a model that can be used to obtain CIs under dependence. Precisely, these Monte Carlo CIs are obtained by simulating cross-correlated binary time series with parameters ρ̂_c, p̂_1^{Xi}, and p̂_1^{Xj} estimated from the observed binary series describing rainfall occurrence. For direct comparison with the results in Sect. 5.1, the analysis was performed on an annual basis and via monthly stratification. Figures 8 and 9 (and S10-S17), corresponding to Figs. 5 and 6 (and S1-S8), report the Laplace CIs and Monte Carlo CIs under dependence. In all cases, the observed probabilities P̂ fall in the center of the Monte Carlo CIs, as ρ_c, p_1^{Xi}, and p_1^{Xj} summarize all the information required to describe and reproduce all P̂ patterns. This confirms that the supposed multiple types of dependence identified by JOP (testing the single P̂ at an inappropriate time scale) do not really exist, and are only the expected effect of the overall correlation of two binary random processes.
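Ignoring serial correlation for brevity, a pair of cross-correlated binary series can be simulated exactly by sampling from the four-cell joint distribution implied by (ρ_c, p_1^{Xi}, p_1^{Xj}); the function name and parameter values below are ours, for illustration only.

```python
import numpy as np

def simulate_pair(rho_c, pi, pj, n, rng):
    """Draw n iid pairs (X, Y) of binary variables with marginals pi, pj
    and phi correlation rho_c, by inverting Eq. 11 for P_11 and then
    sampling the four cells '11', '10', '01', '00'."""
    p11 = pi * pj + rho_c * np.sqrt(pi * (1 - pi) * pj * (1 - pj))
    cells = np.array([p11, pi - p11, pj - p11, 1.0 - pi - pj + p11])
    assert (cells >= 0).all(), "rho_c outside its admissible range"
    u = rng.choice(4, size=n, p=cells)       # 0:'11', 1:'10', 2:'01', 3:'00'
    x = np.isin(u, [0, 1]).astype(int)       # X = 1 in cells '11' and '10'
    y = np.isin(u, [0, 2]).astype(int)       # Y = 1 in cells '11' and '01'
    return x, y

rng = np.random.default_rng(3)
x, y = simulate_pair(rho_c=0.4, pi=0.3, pj=0.4, n=200_000, rng=rng)
p11_hat = np.mean((x == 1) & (y == 1))
rho_hat = np.corrcoef(x, y)[0, 1]
```

The empirical P̂_11 and φ correlation of the simulated pair reproduce the target values, because the triplet fixes the whole joint distribution through Eqs. 8-11.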
It is worth noting that the interpretation of the JOP test relies on the condition P > P^I (equation 2 in De Michele et al. (2020)), meaning that a pair of stations is classified as wet-wet if P_11 > P^I_11, for instance. The theoretical discussion in Sect. 5.2 and the modeling results highlight that the foregoing condition and the corresponding interpretation are misleading, as the four probabilities P are inherently related, and their values necessarily fulfill the conditions corresponding to the two scenarios in Eqs. 5 and 6. Therefore, the actual condition to be checked would be at most P ≠ P^I rather than P > P^I. Actually, the structure of a possible statistical test should check the four probabilities P simultaneously for the right direction (i.e. P ≷ P^I) according to the expected theoretical inequalities in Eqs. 5 and 6. The direction and magnitude of such inequalities are identified by the sign and modulus of ρ_c, respectively, while the asymmetries between P^I_11 and P^I_00, and between P^I_10 and P^I_01, are the effect of the different values of p_1^{Xi} and p_1^{Xj}. Note that JOP does not (and cannot) provide information about the magnitude of the difference between P and P^I. Such a difference could be statistically significant even if its value seemed numerically negligible. In fact, the outcome of a statistical test depends on the width of the distribution of the test statistic under the null hypothesis H_0. If the sample size is large enough, any statistical test always rejects the null hypothesis for very small departures from the expected value of the test statistic under H_0. In other words, JOP cannot provide information about the physical significance of the difference between P and P^I. If the magnitude of such a difference is required for design purposes, the analysis cannot rely on statistical significance, but should involve the actual values assumed by the probabilities P.
A model-based approach is effective because it simultaneously involves a triplet of summary statistics (ρ_c, p_1^{Xi}, p_1^{Xj}) that provides sufficient information to correctly describe and reproduce the joint behavior of two binary processes, including the patterns of the four probabilities P. Of course, three out of the four probabilities P (e.g. P_00, P_01, and P_11) provide the same amount of information when considered simultaneously. In fact, ρ_c, p_1^{Xi}, and p_1^{Xj} can be written in terms of three probabilities P. In this respect, we stress again that the theoretical inconsistency of JOP is not related to the use of P per se, but to the incorrect assumption that three out of the four probabilities P are independent of each other, and can be taken and tested separately.
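The equivalence claimed above, i.e. that any three of the four probabilities carry the same information as the triplet (ρ_c, p_1^{Xi}, p_1^{Xj}), can be made concrete (a sketch; the function name is ours, and the input probabilities are illustrative):

```python
import math

def triplet_from_probs(p00, p01, p11):
    """Recover (rho_c, pi, pj) from three of the four joint probabilities;
    the fourth follows from Eq. 7 (the four probabilities sum to one)."""
    p10 = 1.0 - p00 - p01 - p11
    pi = p11 + p10                     # total probability, Eq. 8
    pj = p11 + p01                     # total probability, Eq. 9
    den = math.sqrt(pi * (1 - pi) * pj * (1 - pj))
    rho_c = (p11 * p00 - p10 * p01) / den   # Eq. 11
    return rho_c, pi, pj

rho_c, pi, pj = triplet_from_probs(p00=0.5098, p01=0.1902, p11=0.2098)
```

The mapping is invertible, so taking the three probabilities jointly is legitimate; the flaw of JOP lies in treating them as if they could vary independently of one another.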
As mentioned in Sect. 3.2 and confirmed by the foregoing rainfall re-analysis, using the triplet (ρ_c, p_1^{Xi}, p_1^{Xj}) has some advantages: (i) it provides a straightforward summary of bivariate binary processes, clearly distinguishing dependence and marginal properties, and avoiding possible misinterpretations; and (ii) it allows for easily setting up models that enable the simulation of multivariate binary processes, and thus the uncertainty analysis of any characteristic of interest, including P. As far as we know, no other triplet of summary statistics is so effective in terms of descriptive clarity, efficiency of analysis, and ease of model building.

[Fig. 9 Results of annual and monthly ρ_c analysis for the pair of rainfall series (SOUID105474, SOUID110384). Lines represent confidence intervals at the 99.5% confidence level. Monte Carlo confidence intervals ('MC CIs') are obtained from models incorporating cross correlation.]

Pearson correlation: properties and use
Since the JOP procedure relies on the probabilities P, it is claimed that it ''has no problems of existence (like the Pearson correlation coefficient) and has the advantage (respect the existing techniques available in literature) to distinguish between the different types of dependence within the network [with respect to wet-wet, dry-dry, wet-dry, and dry-wet cases], of interest for compound events perspective'' (De Michele et al. 2020). We discussed the issue of 'types of dependence' in the previous sections.
Here, we focus on the properties of Pearson correlation and its possible limitations in describing mutual dependence of binary processes.
Concerning the problem of existence, ρ_c is not defined if one of the probabilities in the denominator of Eq. 11 is equal to zero. This can happen in two cases: (i) one of the marginal probabilities (p_1^{Xi}, p_1^{Xj}, p_0^{Xi}, and p_0^{Xj}) is theoretically equal to zero, or (ii) the estimate of one of them is equal to zero. The first case only occurs if a row or column of the theoretical contingency table describing the joint distribution of the two binary processes contains only zeros, which in turn means that one of the processes assumes the value '1' or '0' with probability equal to one. This case is theoretically meaningless because defining ρ_c would mean attempting to quantify dependence between a random variable and a constant. In this respect, the analysis of P leads to identical conclusions from a different perspective. For example, if p_1^{Xi} = 0, then P_11 = P_10 = 0, X_i ≡ 0, P_01 = p_1^{Xj}, and P_00 = p_0^{Xj}. This means that the problem is purely univariate, and no dependence analysis applies or makes sense, as there are no longer two random processes to analyze. The same argument holds for the other cases (p_0^{Xi} = 0, p_1^{Xj} = 0, and p_0^{Xj} = 0), mutatis mutandis. The second case can happen if we have two true binary random processes with marginal probabilities theoretically different from zero, but the estimate ρ̂_c is not defined because one of the estimates p̂_1^{Xi}, p̂_1^{Xj}, p̂_0^{Xi}, p̂_0^{Xj} is equal to zero. If the binary process is actually bivariate, an undefined ρ̂_c means that one of the time series {x_i} or {x_j} is a sequence of 0's or 1's, which is not representative of the actual marginal properties of the binary processes. Therefore, an undefined ρ̂_c indicates that the sample size is insufficient to explore the sample space, and the calculation must be done using longer time series.
In other words, the emergence of undefined ρ̂_c values indicates that the analysis is performed improperly on a sample that is not representative of the underlying process. This case can be common when one analyzes too short time series of zero-inflated and potentially persistent (over-dispersed) occurrence processes, such as daily rainfall, especially in arid climates. For this type of data, an undefined ρ̂_c is a warning message recalling that statistics should be used wisely, accounting for the nature of the processes, the available data, and the aim of the analysis. Another characteristic of the Pearson correlation coefficient of continuous or discrete (including binary) processes is the dependence of its admissible values on the marginal distributions (e.g. Yule 1912; Lancaster 1957; Embrechts et al. 2002). In particular, for bivariate binary processes, the greatest possible positive (negative) value of ρ_c is determined by the marginal rates of occurrence through the Fréchet bounds on P_11 (see Prentice 1986; Emrich and Piedmonte 1991; Demirtas and Hedeker 2011; Serinaldi and Lombardo 2017a, and references therein). These range restrictions are required to ensure that the joint mass function of the binary outcomes is non-negative for all outcomes (Emrich and Piedmonte 1991).
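The admissible range of ρ_c implied by the marginals follows directly from the Fréchet bounds max(0, p_i + p_j − 1) ≤ P_11 ≤ min(p_i, p_j); a small helper (ours, not from the paper) makes the restriction explicit.

```python
def phi_bounds(pi, pj):
    """Admissible (min, max) of the phi correlation of two binary
    variables with rates of occurrence pi and pj, obtained by plugging
    the Frechet bounds on P_11 into Eq. 11."""
    den = (pi * (1 - pi) * pj * (1 - pj)) ** 0.5
    lo = (max(0.0, pi + pj - 1.0) - pi * pj) / den
    hi = (min(pi, pj) - pi * pj) / den
    return lo, hi

lo, hi = phi_bounds(0.1, 0.3)   # asymmetric marginals narrow the range
# equal marginals always allow the full positive range, hi = 1
```

For instance, with p_i = p_j the upper bound equals 1, whereas strongly asymmetric marginals restrict both bounds well inside (−1, 1).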
The boundaries of ρ_c can be problematic in three main circumstances: (i) when ρ_c is used to directly quantify the degree of correlation of two binary processes; (ii) when ρ_c values are used to compare the degree of correlation of different pairs of binary processes with different marginal rates of occurrence (in this case, two ρ_c values are not generally comparable, as they have different admissible ranges); and (iii) when we want to simulate negatively correlated binary processes. Point (iii) is addressed in Appendix A, while issues (i) and (ii) are directly related, but do not play any relevant role in the context of JOP testing and model-based analysis. In fact, neither JOP nor our model-based approach aims at quantifying dependence or comparing dependence measures. The JOP procedure checks whether independence can be discarded or not, and the JOP outcome does not depend on the values of the probabilities P, but on the significance of their difference from P^I. On the other hand, our method allows us to build models that match/reproduce the observed properties, implicitly accounting for their admissible ranges. In other words, for fixed marginal rates of occurrence, the significance of a test and the coverage of confidence intervals are consistent with the admissible ranges of the target statistic, such as ρ_c or P. For example, for p_1^{Xi} = 0.1 and p_1^{Xj} = 0.3, we have ρ_c ∈ (−0.218, 0.509), and such a range is consistent with all the admissible values of P. This explains why our models based on ρ_c, p_1^{Xi}, and p_1^{Xj} are capable of precisely reproducing all observed P values.
To summarize, the properties of ρ_c for bivariate binary processes can be a shortcoming in applications involving statements on its values. Since the JOP test only makes statements on rejection/no rejection, and the distribution of the test statistic under H_0 automatically accounts for the admissible range, it follows that such a range does not matter in this type of analysis, and therefore JOP does not provide any practical improvement compared to a test based on ρ_c. On the contrary, JOP leads to incorrect conclusions about the nature of dependence because it tests the four probabilities P independently, missing their mutual relationships.

Avoiding statistical tests to avoid misinterpretation
Most of the problems arising from the use and interpretation of the JOP test are due to the inherent limits (and inconsistencies) characterizing the hybrid Fisher-Neyman-Pearson statistical testing procedures, which are widely discussed in the theoretical and applied literature (Gigerenzer et al. 1989; Flueck and Brown 1993; McBride et al. 1993; Meehl 1997; Gill 1999; Johnson 1999; Nicholls 2001; Levine et al. 2008). Ambaum (2010) focuses on a common error called the error of the transposed conditional (also known as the converse inequality argument or inverse probability problem (Pollard and Richardson 1987; Gill 1999; Krämer and Gigerenzer 2005; Beninger et al. 2012)), which consists of confusing conditional and conditioning events, so that ''we want to devise a way to assign a probability to the validity of the null hypothesis, given the observed correlation. But what we end up doing is calculating the probability for a correlation at least as big as the observed correlation when the null hypothesis is assumed to be true'' (Ambaum 2010). Even though these probabilities are related by Bayes' theorem, they are different. Moreover, rejection of H_0 (and thus a low p value) ''is not indicative of much at all except that the observed correlation is not very probable if the null hypothesis were true... But... any quantitative information [on the posterior probability for H_1 given the observations] assuming the null hypothesis is quite irrelevant'' (Ambaum 2010). Indeed, stating that 'correlation is significant' makes little sense in any practical terms, as this statement is a category error, occurring when a property is wrongly ascribed to something that cannot have this property. In this case, the hypothesis test, and the related p values and rejection rules, refer to unrelated processes (i.e. H_0), and they do not (and cannot) say anything about related processes (i.e. H_1).
For example, the outcome of the JOP test (or any test checking a hypothesis H_0) can only be a statement of 'rejection' or 'no rejection' of independence. However, rejection does not allow any statement about possible dependence, because dependence is not a property of the model H_0 assumed to perform the test. In other words, 'rejection' can be due to something that is unknown and different from dependence. Analogously, 'no rejection' resulting from the annual JOP analysis might be related to the violation of the implicit assumption of stationary rates of occurrence, i.e. the introduction of an exogenous factor (cyclical nonstationarity of p_1^{Xi} and p_1^{Xj}) that is not included in the list of assumptions of the model H_0. To make statements on dependence, we need a model incorporating dependence, thus enabling the comparison of model output and observations. If we call this model H_1, in principle we can build a test checking H_1. However, also in this case, such a test can only provide statements ('rejection' or 'no rejection') on H_1, and nothing else. Moreover, as shown in Sect. 5.3, if the model is informative, testing H_1 is not very useful, as the direct comparison of model output and observations is sufficient to assess the model performance and the suitability of the underlying assumptions.
The foregoing inconsistencies affecting statistical tests contributed to the misinterpretation of the JOP output, as it does not provide information about anything other than the independence of pairs of binary processes. Conversely, as discussed in Sect. 3.2, our model-based approach is actually a change of paradigm in this context, as it moves from ubiquitous uninformative statistical testing to informed, parsimonious, and scientifically tenable stochastic modeling, relying on solid theory and extensive validation/falsification under challenging conditions, rather than on untenable/unrealistic null hypotheses.

Conclusions
This study presented an inquiry into the potential problems and inconsistencies related to the introduction of new topics, or the relabeling of known concepts, without the necessary rigor and attention to existing theories. We considered the emerging research area of so-called 'compound events', focusing on a specific test proposed by De Michele et al. (2020) to study the joint occurrence of rainfall at pairs of locations. This method, denoted as the JOP test throughout the text, relies on a separate analysis of the single probabilities P (i.e. P_00, P_11, P_10, and P_01), and was proposed as an improved technique overcoming the supposed shortcomings of classical analysis based on Pearson correlation. However, our neutral validation led to an opposite conclusion. In fact, the probabilities P are inherently related, they are not sufficient to describe the effects of dependence when taken individually, and their individual analysis can lead to incorrect interpretations if their relationship is not accounted for. As a consequence, the four types of dependence originally introduced to interpret the JOP output are numerical artefacts resulting from overlooking the available theory, the expected effects of dependence and marginal rates of occurrence, and the interaction with confounding factors, such as seasonality and serial correlation. We also discussed why the properties of Pearson correlation are not a true shortcoming in the context of statistical testing, and why the JOP test cannot provide any improvement in this respect. Therefore, the rather convoluted JOP algorithm cannot provide the information we expect on the joint behavior of bivariate binary processes.
On the other hand, the classical Pearson correlation coefficient and the marginal probabilities of occurrence provide a rigorous description and clear interpretation of the behavior of bivariate binary rainfall observations, and allow for setting up models capable of faithfully reproducing the observed fluctuations of the single probabilities P. Moreover, these models can be used to check and validate any statistical test under challenging conditions.
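The inherent relationship among the single probabilities P can be made explicit with the standard identity linking the Pearson (phi) correlation of two binary variables to their marginal occurrence rates: given the marginals and the correlation, all four joint probabilities are fully determined. A minimal sketch (function and variable names are illustrative, not taken from the original study):

```python
import math

def joint_probs(p_i, p_j, rho):
    """Joint occurrence probabilities of a bivariate binary pair.

    Standard identity for binary variables with marginal rates p_i, p_j
    and Pearson (phi) correlation rho:
        P11 = p_i*p_j + rho*sqrt(p_i*(1-p_i)*p_j*(1-p_j))
    The remaining cells follow from the marginal constraints.
    """
    s = math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))
    P11 = p_i * p_j + rho * s
    P10 = p_i - P11             # occurrence at location i only
    P01 = p_j - P11             # occurrence at location j only
    P00 = 1 - p_i - p_j + P11   # joint non-occurrence
    return P00, P01, P10, P11

# Given (p_i, p_j, rho), all four P are fixed, so analyzing them
# separately cannot add information beyond the marginals and rho.
P00, P01, P10, P11 = joint_probs(0.3, 0.4, 0.5)
assert abs(P00 + P01 + P10 + P11 - 1.0) < 1e-12

# The mapping is invertible: rho is recovered exactly from the cells.
rho_back = (P11 - 0.3 * 0.4) / math.sqrt(0.3 * 0.7 * 0.4 * 0.6)
assert abs(rho_back - 0.5) < 1e-12
```

Note that rho must lie within the Fréchet bounds implied by the marginals for all four probabilities to be non-negative.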
We presented the case of the JOP algorithm within the wider perspective of a rather common attitude in many human activities, consisting of revitalizing a research topic or revamping businesses by relabeling concepts or rebranding products, without introducing substantial changes. While a rigorous approach to compound, interacting, interconnected, and cascading risks is surely of interest, the concept of 'compound events' is often ill-defined, superficially used, and conceals the re-proposal of classical problems already addressed in the past. When new names do not correspond to new concepts, the main effect is a hiatus between new and existing literature, whereby the former neglects the latter, which in turn provides theories and solutions for the same (or very similar) subjects under different names. In this context, a method like the JOP test should have been considered only after double checking that the classical analysis based on Pearson correlation, and the related testing and modeling, does not provide satisfactory results. This would have been easier to recognize by noting that the study of the rainfall occurrence process at two locations concerns the theory of bivariate binary processes. Proper study of such processes does not require any 'compound events' relabeling, which in turn might introduce confusion and theoretically inconsistent statements, transforming 'compound events' into 'confounding events'. Finally, we argue that the use of different names to denote the same well-defined concept already causes a lack of communication among different disciplines, slowing the advance of science. When new names do not even correspond to well-defined concepts, their ambiguity can cause the paradoxes discussed in this study. The best way to advance science is to introduce new concepts rather than new names.
Appendix A
We can circumvent the problem by focusing on a dual positively correlated binary process with ρ_{X+}, p_{Xi,1+}, and p_{Xj,1+}, and the corresponding parent Gaussian process with ρ_{Y+}. Such dual binary processes are linked by the relationships in Eq. (A.3). These relationships derive from observing that ρ_c in Eq. 11 depends on P, and the sign of ρ_c changes from positive to negative when the off-diagonal probability mass becomes greater than that along the diagonal, and vice versa. Therefore, for bivariate binary pairs with negative correlation, we can define a dual process with positive correlation whose properties are obtained by circular rotation of the probability masses P, writing for instance Eq. (A.4). Figure 10 provides a visualization of the counterclockwise rotation, along with an illustration of parent Gaussian samples identifying the probability masses in each of the four quadrants corresponding to the four possible states of a binary pair. Therefore, the simulation procedure for a process with desired ρ_{X−}, p_{Xi,1−}, and p_{Xj,1−} consists of the steps listed in Eq. (A.5). Of course, a dual positively correlated bivariate binary process can also be obtained by clockwise rotation.
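The core of the simulation idea above, thresholding a parent bivariate Gaussian process so that the four quadrant masses reproduce the four binary states, can be sketched as follows. This is a simplified illustration, not the full procedure of the appendix: the parent Gaussian correlation is treated as a given value here, whereas the complete method would derive it from the desired binary correlation (Eq. A.3), and all names and numbers are assumptions for the sketch:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Target marginal occurrence rates of the two binary series, and an
# assumed (illustrative) correlation of the parent Gaussian pair.
p_i, p_j, rho_Y = 0.3, 0.4, 0.6
n = 200_000

# Bivariate standard Gaussian parent process.
Y = rng.multivariate_normal([0.0, 0.0],
                            [[1.0, rho_Y], [rho_Y, 1.0]], size=n)

# Thresholds chosen so that the exceedance probability of each parent
# component equals the desired marginal rate of occurrence.
t_i = NormalDist().inv_cdf(1 - p_i)
t_j = NormalDist().inv_cdf(1 - p_j)
X_i = (Y[:, 0] > t_i).astype(int)
X_j = (Y[:, 1] > t_j).astype(int)

# Empirical joint probabilities: the four quadrant masses of the
# parent Gaussian sample map onto the four binary states.
P11 = np.mean((X_i == 1) & (X_j == 1))
P00 = np.mean((X_i == 0) & (X_j == 0))

assert abs(X_i.mean() - p_i) < 0.01 and abs(X_j.mean() - p_j) < 0.01
assert P11 > p_i * p_j  # positive dependence inflates joint occurrence
```

For negatively correlated pairs, the appendix instead simulates the dual positively correlated process and recovers the target by circular rotation of the quadrant masses.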

Declarations
Competing interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.