Abstract
Learning from data that contain missing values is a common phenomenon in many domains. Relatively few Bayesian Network structure learning algorithms account for missing data, and those that do tend to rely on standard approaches that assume missing data are missing at random, such as the Expectation-Maximisation algorithm. Because missing data are often systematic, there is a need for more pragmatic methods that can effectively deal with data sets containing missing values that are not missing at random. The absence of approaches that deal with systematic missing data impedes the application of BN structure learning methods to real-world problems where missingness is not random. This paper describes three variants of greedy search structure learning that utilise pairwise deletion and inverse probability weighting to maximally leverage the observed data and to limit potential bias caused by missing values. The first two variants can be viewed as sub-versions of the third and best-performing variant, but are important in their own right in illustrating the successive improvements in learning accuracy. The empirical investigations show that the proposed approach outperforms the commonly used and state-of-the-art Structural EM algorithm, both in terms of learning accuracy and efficiency, and both when data are missing at random and missing not at random.
1 Introduction
The field of Bayesian Network (BN) structure learning covers a set of approaches that focus on recovering the conditional or causal relationships between variables from data. Structure learning can be divided into two main categories known as constraint-based and score-based methods. Constraint-based methods such as PC (Spirtes et al. 2000) and IAMB (Tsamardinos et al. 2003) recover a graph by ruling out the structures that violate the conditional independencies discovered from data, and orienting edges by determining colliders. Score-based algorithms such as GES (Chickering 2002) and GOBNILP (Cussens 2011) recover a graph by exploring the search space of possible graphs and returning the graph with the highest objective score. While numerous BN structure learning algorithms have been proposed in the literature over the past few decades, most of them do not efficiently learn from data that contain systematic missing values. This hinders the application of structure learning to real-world problems, since missing data represent a common issue in most applied areas, including medicine and healthcare (Constantinou et al. 2016), clinical epidemiology (Pedersen et al. 2017), traffic flow prediction (Tian et al. 2018), anomaly detection (Zemicheal and Dietterich 2019), and financial analysis (John et al. 2019). Therefore, there is a growing need for structure learning algorithms that account for potential data bias due to systematic missing values, without a significant impact on the computational efficiency of structure learning.
According to Rubin (1976), missing data problems can be categorised into three classes: Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). Specifically, MCAR denotes that the missing values are purely random and independent of other observed variables or parameters. This type of missingness is usually caused by technical error and does not bias the analysis. The definition of MAR, on the other hand, is somewhat counterintuitive in its name, and assumes the missing values are dependent on observed data. For example, in an investigation of the relationship between age and frequency of smoking, missing data are MAR if younger respondents are more likely to not disclose their smoking frequency. Lastly, missingness is said to be MNAR if it is neither MCAR nor MAR. In the above example, the missingness is MNAR if the data on respondents' age also contain missing values.
Methods that deal with missing data typically include naïve approaches such as complete case analysis (a.k.a. listwise deletion) and multiple imputation (Rubin 2004). Complete case analysis involves removing the data cases that contain missing values and hence restricting learning to complete data cases. While this approach is easy to implement, it can be sample inefficient and may yield bias when missingness is not MCAR (Graham 2009). Multiple imputation, on the other hand, fills in, rather than ignores, the missing values, and takes the uncertainty of imputation into consideration by repeating imputation over different possible values (Azur et al. 2011). However, multiple imputation is built on the assumption of MAR, which means it may also produce biased outcomes when data are MNAR.
One of the earliest advanced approaches for dealing with missing data is the Expectation-Maximisation (EM) algorithm, which was later adopted by the structure learning community. The Structural EM algorithm (Friedman 1997) is an iterative process which consists of two steps: the Expectation (E) step and the Maximisation (M) step. In the E step, Structural EM makes inferences on the missing values and computes the expected sufficient statistics based on the graph learned in the previous iteration. The M step follows, where the current state of the learned graph is revised based on the sufficient statistics obtained at the E step. An advantage of Structural EM is that it can be combined with different structure learning algorithms. A disadvantage, however, is that it is computationally inefficient due to the inference process that takes place at the E step. Therefore, in practice, the E step of the Structural EM algorithm is usually implemented with single imputation, i.e., imputing the expectation of the missing values derived from the observed values. Ruggieri et al. (2020) compared the performance of the original Structural EM to that of the imputation-based Structural EM, and found that the latter achieves better performance in most of the simulation scenarios.
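The iterative interplay between the E and M steps can be condensed into a short loop. The sketch below assumes hypothetical `learn_structure` and `impute` callables standing in for a score-based structure learner and a single-imputation routine; neither name comes from the paper, and the initial mode-based fill is only an illustrative choice.

```python
import pandas as pd

def structural_em(data: pd.DataFrame, learn_structure, impute, max_iter=10):
    """Sketch of an imputation-based Structural EM loop.
    `learn_structure(completed)` is any structure learner over complete data;
    `impute(data, graph)` fills the missing cells given the current graph."""
    # Initial E step: a naive single imputation (column mode).
    completed = data.fillna(data.mode().iloc[0])
    graph = learn_structure(completed)
    for _ in range(max_iter):
        completed = impute(data, graph)          # E step (single imputation)
        new_graph = learn_structure(completed)   # M step
        if new_graph == graph:                   # structure unchanged: stop
            break
        graph = new_graph
    return graph
```

In a full implementation, `impute` would draw on the conditional distributions entailed by the current graph rather than a naive fill.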
An increasing number of algorithms have recently been proposed to improve structure learning from data containing missing values. In the case of score-based learning, two model selection methods have been proposed based on the likelihood function called Node-Average Likelihood (NAL), for discrete (Balov 2013) and conditional Gaussian BNs (Bodewes and Scutari 2021). While these methods are consistent under MCAR, they are not consistent under MAR or MNAR. In constraint-based learning, Strobl et al. (2018) treated missing values as a type of selection bias and showed that performing test-wise deletion during conditional independence (CI) tests represents a sound solution for the FCI algorithm (Spirtes et al. 2000). In the context of constraint-based learning, test-wise deletion is a process that deletes the data cases with missing values amongst the variables involved in a given CI test. Gain and Shpitser (2018) later showed that replacing the standard CI test in PC with a CI test based on Inverse Probability Weighting (IPW) (Horvitz and Thompson 1952) enables PC to be applied to data sets which contain systematic missing values without loss of consistency. IPW is an approach that alleviates bias in data distributions by reweighting the data cases, and we describe it in detail in Sect. 3. However, IPW CI testing assumes sufficient information about the missingness, such as the parents of missingness and the total ordering of the missing indicators, which is unlikely to be known in practice. Tu et al. (2019) addressed this issue by first predicting the parents of missingness using constraint-based learning, for every observed variable that contained missing values, and then applying the IPW CI tests using the sufficient information obtained during the constraint-based learning phase.
In this paper, we propose three variants of the greedy search Hill-Climbing algorithm and investigate how they handle missing data values under different assumptions of missingness. These variants can be viewed as fusions between greedy search score-based learning and the pairwise deletion and IPW methods discussed above, which have previously been applied to constraint-based learning. The contribution of this paper is a novel structure learning algorithm suitable for structure learning from data that contain systematic missingness. The empirical results show that, under systematic missingness, the proposed algorithm outperforms the current state-of-the-art Structural EM algorithm, both in terms of learning accuracy and efficiency.
The paper is organised as follows: Sect. 2 provides necessary preliminary information that includes notation and background information, Sect. 3 describes the proposed algorithm, Sect. 4 presents the results, and we provide our concluding remarks in Sect. 5.
2 Preliminaries
In this paper, we consider discrete variables which we denote with uppercase letters (e.g., U, V), and the assignment of variable states with lowercase letters (e.g., u, v). We denote a set of variables with bold uppercase letters (e.g., \(\varvec{U}, \varvec{V}\)), and the assignment of a set of variable states with bold lowercase letters (e.g., \(\varvec{u}, \varvec{v}\)).
2.1 Bayesian network
A BN \(\left<\mathcal {G}, P\right>\) is a probabilistic graphical model that can be represented by a Directed Acyclic Graph (DAG) \(\mathcal {G}=\left( \varvec{V}, \varvec{E}\right) \) and a joint distribution P defined over \(\varvec{V}\), where \(\varvec{V}=\left\{ V_1, \ldots , V_n\right\} \) represents a set of random variables and \(\varvec{E}\) represents a set of directed edges between pairs of variables. A BN entails the Markov Condition, which states that every variable \(V_i\) in \(\mathcal {G}\) is independent of all its non-descendants conditional on its parents. Given the Markov Condition, the joint distribution P can be factorised as follows:
\(P\left( \varvec{V}\right) = \prod _{i=1}^{n}P\left( V_i\mid \varvec{Pa}_i\right) \)
where \(\varvec{Pa}_i\) represents the parent-set of \(V_i\) in \(\mathcal {G}\). Since this study focuses on discrete BNs, we assume that every variable follows an independent multinomial distribution given its parents. We also assume that the set of observed variables \(\varvec{V}\) is causally sufficient (Spirtes et al. 2000); that is, we assume there are no unobserved common causes between any of the variables in \(\varvec{V}\). In practice, this means that even though measurement error can be viewed as a hidden variable problem, where nodes that contain any form of error must have a hidden parent that causes that error, we assume causal sufficiency such that the graphs reconstructed by SEM are DAGs that contain the observed variables only.
Because an observed distribution can be represented by multiple different DAGs, we work under the assumption that multiple DAGs can be statistically indistinguishable. A collection of DAGs that are statistically indistinguishable, and express the same joint distribution, is known as a set of Markov equivalent DAGs, often represented by a Completed Partially Directed Acyclic Graph (CPDAG) (Spirtes et al. 2000). A CPDAG can be obtained from a DAG by (a) preserving all its v-structures, (b) preserving all the directed edges that would create a cycle or a new v-structure if reversed, and (c) converting the remaining directed edges to undirected edges.
2.2 Hill-climbing algorithm
For simplicity, we focus on the Hill-Climbing (HC) structure learning algorithm (Heckerman et al. 1995), a classic score-based learning algorithm that greedily searches the space of neighbouring graphs. It typically starts from an empty graph and explores the search space of graphs via the edge additions, deletions and reversals that maximally improve the objective score. HC terminates when no neighbouring graph increases the objective score. HC is an approximate learning algorithm that returns a local maximum solution. However, it is acknowledged to be a computationally efficient algorithm that often outperforms other, more complex algorithms (Gámez et al. 2011; Constantinou et al. 2021). The pseudocode of the standard HC structure learning algorithm is provided in Algorithm 1.
As with most other structure learning algorithms, HC is usually paired with a decomposable score function to evaluate each graph explored relative to the input data. A score function \(S\left( \mathcal {G}, D\right) \) is decomposable if it can be written as the sum over a set of local scores, each of which corresponds to a variable and its parents in \(\mathcal {G}\). While all score-based algorithms can use a decomposable score, this property is particularly efficient in the case of HC search, since HC explores one or two graphical modifications at a time; i.e., one in the case of edge addition or removal, and two in the case of edge reversal. Therefore, the objective score for each neighbouring graph \(\mathcal {G}_{nei}\) can be obtained efficiently by recomputing only the local scores of the up to two nodes whose parent-set has changed, and obtaining the local scores of the remaining nodes, whose parent-sets remain intact, from the current best graph \(\mathcal {G}\).
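To illustrate how decomposability keeps the search cheap, the following sketch implements a minimal hill-climber over parent-sets using additions and deletions only (edge reversal is omitted for brevity); `local_score` is a placeholder for any decomposable score such as BIC, and only the node whose parent-set changes is re-scored at each move.

```python
def hill_climb(nodes, local_score, max_parents=3):
    """Greedy search over DAGs with add/delete moves, re-scoring only the
    node whose parent-set changes. `local_score(node, parents)` can be any
    decomposable score; edge reversal is omitted for brevity."""
    parents = {v: frozenset() for v in nodes}
    cache = {v: local_score(v, parents[v]) for v in nodes}

    def reachable(src, dst):
        # is there a directed path src -> ... -> dst in the current graph?
        stack, seen = [src], set()
        while stack:
            x = stack.pop()
            if x == dst:
                return True
            seen.add(x)
            stack.extend(w for w in nodes if x in parents[w] and w not in seen)
        return False

    while True:
        best_gain, best_move = 0.0, None
        for u in nodes:
            for v in nodes:
                if u == v:
                    continue
                if u not in parents[v] and len(parents[v]) < max_parents \
                        and not reachable(v, u):            # add u -> v
                    gain = local_score(v, parents[v] | {u}) - cache[v]
                    if gain > best_gain:
                        best_gain, best_move = gain, (u, v, True)
                elif u in parents[v]:                       # delete u -> v
                    gain = local_score(v, parents[v] - {u}) - cache[v]
                    if gain > best_gain:
                        best_gain, best_move = gain, (u, v, False)
        if best_move is None:       # no neighbour improves the score
            return parents
        u, v, add = best_move
        parents[v] = parents[v] | {u} if add else parents[v] - {u}
        cache[v] = local_score(v, parents[v])
```

The acyclicity check rejects an addition \(u\rightarrow v\) whenever a directed path from v back to u already exists.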
Many score functions offer the decomposability property; the most common include the Bayesian Information Criterion (BIC) (Schwarz 1978), the Bayesian Dirichlet equivalent (BDe) (Heckerman et al. 1995) and the quotient Normalized Maximum Likelihood (qNML) (Silander et al. 2018). In this paper, we employ BIC as the score function in all of our experiments. The formal definition of BIC is:
\(BIC\left( \mathcal {G}, D\right) = \sum _{i=1}^{n}\left[ \log P\left( D\mid \hat{\Theta }_i\right) - \frac{\log N}{2}\left| \hat{\Theta }_i\right| \right] \)
where N is the sample size, \(\varvec{Pa}_i\) is the parent set of \(V_i\) in \(\mathcal {G}\), \(\hat{\Theta }_i\) is the maximum likelihood estimate of the parameters over the local distribution of \(V_i\), and \(\left| \hat{\Theta }_i\right| \) is the number of free parameters in \(\hat{\Theta }_i\). If \(\mathcal {G}\) is defined over a set of discrete multinomial variables \(\varvec{V} = \left\{ V_1, \ldots , V_n\right\} \), then the BIC score takes the following form:
\(BIC\left( \mathcal {G}, D\right) = \sum _{i=1}^{n}\left[ \sum _{j=1}^{q_i}\sum _{k=1}^{r_i}N_{ijk}\log \frac{N_{ijk}}{N_{ij}} - \frac{\log N}{2}q_i\left( r_i - 1\right) \right] \)
where \(N_{ijk}\) is the number of cases in data set D in which the variable \(V_i\) takes its \(k^{th}\) value and the parents of \(V_i\) take the \(j^{th}\) configuration. Similarly, \(N_{ij}\) is the number of cases in data set D where the parents of \(V_i\) take their \(j^{th}\) configuration and, therefore, \(N_{ij} = \sum _{k=1}^{r_i}N_{ijk}\). Lastly, \(r_i\) represents the number of distinct values of \(V_i\) and \(q_i\) represents the number of configurations of the parents of \(V_i\).
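For concreteness, the local BIC term for a single node can be computed directly from the \(N_{ijk}\) and \(N_{ij}\) counts. The sketch below is illustrative (the function name is ours, not from the paper) and assumes a complete discrete data set held in a pandas DataFrame.

```python
import numpy as np
import pandas as pd

def bic_local(data: pd.DataFrame, v: str, parents: list) -> float:
    """Local BIC term for a discrete variable v: the N_ijk log(N_ijk / N_ij)
    log-likelihood minus the (log N / 2) * q_i * (r_i - 1) penalty."""
    N = len(data)
    r_i = data[v].nunique()
    # q_i: number of parent configurations (product of parent cardinalities)
    q_i = int(np.prod([data[p].nunique() for p in parents])) if parents else 1
    if parents:
        groups = (col for _, col in data.groupby(parents)[v])
    else:
        groups = [data[v]]
    loglik = 0.0
    for col in groups:
        n_ij = len(col)                 # N_ij for this parent configuration
        n_ijk = col.value_counts()      # N_ijk over the states k
        loglik += float((n_ijk * np.log(n_ijk / n_ij)).sum())
    return loglik - 0.5 * np.log(N) * q_i * (r_i - 1)
```

Summing `bic_local` over all nodes gives the full decomposable score used by the search.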
2.3 Missing data assumptions
We adopt the graphical descriptions of missing data introduced by Mohan et al. (2013) and Mohan and Pearl (2021). In this paper, we denote the set of fully observed variables (i.e., variables without missing values) as \(\varvec{V}_o\), and the set of partially observed variables (i.e., variables with at least one missing value) as \(\varvec{V}_m\). For every partially observed variable \(V_i\in \varvec{V}_m\), we define an auxiliary variable \(R_i\), called the missing indicator, to reflect the missingness in \(V_i\), where \(R_i\) takes the value 0 when \(V_i\) is recorded and the value 1 when \(V_i\) is missing.
Further, we define the missingness graph (m-graph (Mohan et al. 2013)) \(\mathcal {G}\left( \mathbb {V}, \varvec{E}\right) \) that captures the relationships between the observed variables \(\varvec{V}\) and the missing indicators \(\varvec{R}\), where \(\mathbb {V}=\varvec{V}_o\cup \varvec{V}_m\cup \varvec{R}\). Based on the m-graph, we define missing data as MCAR if \(\varvec{R}\perp \left( \varvec{V}_o\cup \varvec{V}_m\right) \), MAR if \(\varvec{R}\perp \varvec{V}_m\mid \varvec{V}_o\), and MNAR otherwise. Figure 1 presents three possible m-graphs assuming three observed variables with structure \(V_1\rightarrow V_2\rightarrow V_3\), depicting the MCAR, MAR and MNAR assumptions respectively.
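The three mechanisms are straightforward to reproduce in simulation. The snippet below samples the chain \(V_1\rightarrow V_2\rightarrow V_3\) and injects MCAR, MAR and MNAR missingness; all probabilities are illustrative choices of ours, not values from the paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Sample the chain V1 -> V2 -> V3 (each child copies its parent 80% of the time)
v1 = rng.integers(0, 2, n)
v2 = np.where(rng.random(n) < 0.8, v1, 1 - v1)
v3 = np.where(rng.random(n) < 0.8, v2, 1 - v2)
df = pd.DataFrame({"V1": v1, "V2": v2, "V3": v3})

def mask(values, prob):
    """Set entries to NaN with the given (scalar or per-case) probability."""
    out = values.astype(float)
    out[rng.random(len(out)) < prob] = np.nan
    return out

# MCAR: the missingness of V2 depends on nothing
mcar = df.assign(V2=mask(df["V2"].to_numpy(), 0.3))
# MAR: the missingness of V2 is driven by the fully observed V1
mar = df.assign(V2=mask(df["V2"].to_numpy(), np.where(df["V1"] == 1, 0.5, 0.1)))
# MNAR: the missingness of V1 is driven by V2, which is itself partially observed
mnar = df.assign(V1=mask(df["V1"].to_numpy(), np.where(df["V2"] == 1, 0.5, 0.1)),
                 V2=mask(df["V2"].to_numpy(), 0.2))
```

Under the MAR frame, the missingness rate of V2 is visibly higher among cases with \(V_1=1\), which is exactly the dependence on observed data that the definition describes.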
To ensure the population distributions are recoverable from the observed data, some assumptions need to be employed for the missing indicators. These are:
Assumption 1
Variables in \(\varvec{R}\) can be the parent neither of the observed variables in \(\varvec{V}\) nor of other variables in \(\varvec{R}\).
Assumption 2
No partially observed variable can be the parent of its own missing indicator.
Assumption 1 states that a missing indicator in \(\varvec{R}\) can only be an effect (leaf) node in an m-graph, whereas Assumption 2 states that the missingness of a variable is independent of that variable's value. When both Assumptions 1 and 2 hold, the joint distribution of the observed variables is recoverable from the observed data (Mohan et al. 2013, Theorem 2).
3 Handling systematic missing data with hill-climbing
This section describes the three HC variants that we explore in extending the learning process towards dealing with systematic missing data. Specifically, Sect. 3.1 describes HC with pairwise deletion, which we call HC-pairwise; Sect. 3.2 describes HC with both pairwise deletion and Inverse Probability Weighting, which we call HC-IPW; and Sect. 3.3 describes an improved version of HC-IPW, called HC-aIPW, that prunes fewer data samples than HC-IPW. The first two HC variants can be viewed as sub-versions of HC-aIPW, but are important in their own right in illustrating the successive improvements in learning accuracy.
3.1 Hill-climbing with pairwise deletion
Recall that, at each iteration, HC moves to the neighbouring graph that maximally improves the objective score, and that performing HC search with a decomposable scoring function means that there is no need to recompute the local score of variables whose parent-set remains unchanged across graphs. Therefore, an efficient (but not necessarily effective) way of applying HC to missing data is to ignore data cases that contain missing values in the variables considered when exploring local score changes to a DAG. We refer to this process as pairwise deletion, where "pair" refers to the current pair of candidate DAGs (the current best DAG and the neighbouring DAG); the deletion process itself may involve more than two variables. When comparing the current best DAG against a neighbouring DAG, the necessary variables are the nodes with unequal parent-sets between the two graphs, plus the parents of those nodes in the two graphs. Formally, when exploring a neighbouring DAG \(\mathcal {G}_{nei}\) from the current best DAG \(\mathcal {G}\), the set of necessary variables \(\varvec{W}\) between \(\mathcal {G}\) and \(\mathcal {G}_{nei}\) can be described as:
\(\varvec{W} = \varvec{V}_d\cup \bigcup _{V_i\in \varvec{V}_d}\left( \varvec{Pa}_i\cup \varvec{Pa}_i^{nei}\right) \)
where \(\varvec{V}_d\) is the set of variables that have different parent-sets between \(\mathcal {G}\) and \(\mathcal {G}_{nei}\), and \(\varvec{Pa}_i\) and \(\varvec{Pa}_i^{nei}\) are the parent-sets of \(V_i\) in \(\mathcal {G}\) and \(\mathcal {G}_{nei}\) respectively. For simplicity, we refer to the data set obtained after applying pairwise deletion as the pairwise deleted data set.
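A direct reading of this definition is to collect the nodes whose parent-sets differ plus their parents in both graphs, then drop only the cases that are incomplete on those columns. The helper names below are our own illustrative choices.

```python
import pandas as pd

def necessary_variables(pa_cur: dict, pa_nei: dict) -> set:
    """Set W for a move from G (pa_cur) to G_nei (pa_nei): nodes whose
    parent-sets differ, plus their parents in both graphs."""
    v_d = {v for v in pa_cur if pa_cur[v] != pa_nei[v]}
    w = set(v_d)
    for v in v_d:
        w |= pa_cur[v] | pa_nei[v]
    return w

def pairwise_delete(data: pd.DataFrame, w: set) -> pd.DataFrame:
    # drop only the cases with missing values amongst the variables in W
    return data.dropna(subset=sorted(w))
```

For the move in Example 1 (adding \(V_1\rightarrow V_3\) to a graph containing \(V_1\rightarrow V_2\)), W is \(\left\{ V_1, V_3\right\} \), so cases missing only \(V_2\) are retained.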
Example 1
Assume that, during HC, the current state of DAG \(\mathcal {G}\) is a graph containing three variables \(\left\{ V_1, V_2, V_3\right\} \) and the edge \(V_1\rightarrow V_2\), as illustrated in Table 1. Given DAG \(\mathcal {G}\), there are six possible edge operations, each of which produces a neighbouring graph \(\mathcal {G}_{nei}\). Operation add \(V_1\rightarrow V_3\), for example, can be evaluated by assessing the change in the local score of \(V_3\), i.e., \(S\left( V_3\mid V_1\right) - S\left( V_3\right) \), since \(V_3\) is the only variable with different parents between \(\mathcal {G}\) and \(\mathcal {G}_{nei}\). When the data set contains missing values, we can apply pairwise deletion given \(\left\{ V_1, V_3\right\} \) in order to obtain a complete data set that enables us to assess the neighbouring graph resulting from this edge operation. However, there is a risk that this action may lead to biased estimates when missingness is not MCAR.
Because pairwise deletion leads to edge operations that are assessed on different subsets of the data, it is possible to get stuck in an infinite loop where previous neighbouring graphs are constantly revisited and reselected as higher scoring graphs. This can happen when, for example, DAG \(\mathcal {G}_2\) returns a higher score than \(\mathcal {G}_1\) based on pairwise deleted data set \(D_1\), \(\mathcal {G}_3\) returns a higher score than \(\mathcal {G}_2\) based on pairwise deleted data set \(D_2\), and \(\mathcal {G}_1\) returns a higher score than \(\mathcal {G}_3\) based on pairwise deleted data set \(D_3\). In this example, HC with pairwise deletion would rank the graphical scores as \(\mathcal {G}_1<\mathcal {G}_2<\mathcal {G}_3<\mathcal {G}_1\) and never converge to a maximal solution. We address this issue by restricting the HC search to neighbours not previously identified as the optimal graph. We call this variant of HC HC-pairwise, and present its pseudocode in Algorithm 2.
When data are MCAR, on the basis of \(\varvec{R}\perp \left( \varvec{V}_o\cup \varvec{V}_m\right) \), the distribution entailed by any pairwise deleted data set is an unbiased estimate of the underlying true distribution:
\(P\left( \varvec{V}\right) = P\left( \varvec{V}\mid \varvec{R}_s = \varvec{0}\right) \)
where \(\varvec{R}_s\) can be any subset of \(\varvec{R}\).
From this, we derive Proposition 1, which states that, when the missingness is MCAR, the DAG learned by HC-pairwise is a local maximum graph, at least when BIC is used as the objective function. We define the local maximum graph as the graph with an objective score not lower than the scores of all its valid neighbouring graphs, when these scores are derived from the fully observed data set; i.e., it is independent of the missingness generated.
Proposition 1
Assume data D is MCAR and the sample size \(N\rightarrow \infty \). Then, for any DAG \(\mathcal {G}\) and any of its neighbouring DAGs \(\mathcal {G}_{nei}\),
\(S\left( \mathcal {G}_{nei}, D_{pw}\right)> S\left( \mathcal {G}, D_{pw}\right) \iff S\left( \mathcal {G}_{nei}, D_f\right) > S\left( \mathcal {G}, D_f\right) \)
where \(D_{pw}\) is the pairwise deleted data set, derived from D by removing the data cases with missing values amongst the necessary variables \(\varvec{W}\), and \(D_f\) is the corresponding fully observed data set.
3.2 Hill-climbing with inverse probability weighting
Although HC-pairwise will progressively learn a better DAG after each iteration when missingness is MCAR, this property does not necessarily hold when missingness is MAR or MNAR, since systematic bias in the data might produce
\(P\left( \varvec{V}\mid \varvec{R}_s = \varvec{0}\right) \ne P\left( \varvec{V}\right) \)
To diminish data biases caused by potential dependencies between missing and observed data, we further explore applying the IPW method to the pairwise deleted data set.
According to Mohan et al. (2013, Theorem 2) and Tu et al. (2019), when Assumptions 1 and 2 hold, the joint distribution of the variables \(\varvec{V}\) can be fully recovered from the observed part of the data set (i.e., the data after applying pairwise deletion) by
\(P\left( \varvec{V}\right) = c\,P\left( \varvec{V}\mid \varvec{R} = \varvec{0}\right) \prod _{R_i\in \varvec{R}}\beta _{R_i}, \quad \beta _{R_i} = \frac{P\left( \varvec{Pa}_{R_i}\mid \varvec{R}_{\varvec{Pa}_{R_i}} = \varvec{0}\right) }{P\left( \varvec{Pa}_{R_i}\mid R_i = 0, \varvec{R}_{\varvec{Pa}_{R_i}} = \varvec{0}\right) } \qquad (7)\)
where c is a normalising constant, \(\varvec{Pa}_{R_i}\) is the set of parents of missing indicator \(R_i\), and \(\varvec{R}_{\varvec{Pa}_{R_i}}\) is the set of missing indicators of the partially observed variables in \(\varvec{Pa}_{R_i}\). We further discuss and provide the derivation of Eq. (7) in Appendix B.
Since the term c in Eq. (7) is a constant, we can apply pairwise deletion to the missing data cases of the variables \(\varvec{V}\) and weight the pairwise deleted data set by \(\prod _{R_i\in \varvec{R}}\beta _{R_i}\). This produces a weighted data set that approximates the unbiased distribution \(P\left( \varvec{V}\right) \). We call this HC variant HC-IPW; it can be viewed as an extension of HC-pairwise that incorporates both the pairwise deletion and IPW methods. Unlike HC-pairwise, the HC-IPW algorithm can be used under the assumption that the input data are MAR or MNAR, in addition to MCAR, to diminish data bias caused by systematic missing values.
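To make the weighting concrete, the sketch below estimates \(\beta _{R_i}\) for a single partially observed variable as the ratio of the two conditional distributions in Eq. (7). It is a simplified, hypothetical helper (not the paper's implementation) and assumes the parents of \(R_i\) are known and recorded in the rows being weighted.

```python
import pandas as pd

def ipw_weights(data: pd.DataFrame, var: str, pa_r: list) -> pd.Series:
    """Per-case weight beta_{R_i} for partially observed `var`, whose
    missingness is caused by the variables in `pa_r`. Ratio form:
    P(Pa | R_Pa = 0) / P(Pa | R_i = 0, R_Pa = 0)."""
    pa_obs = data.dropna(subset=pa_r)          # cases with R_Pa = 0
    both_obs = pa_obs.dropna(subset=[var])     # cases with R_i = 0, R_Pa = 0
    num = pa_obs.groupby(pa_r).size() / len(pa_obs)
    den = both_obs.groupby(pa_r).size() / len(both_obs)
    ratio = num / den
    # map each case's parent configuration to its weight
    keys = pd.MultiIndex.from_frame(data[pa_r]) if len(pa_r) > 1 \
        else pd.Index(data[pa_r[0]])
    return pd.Series(ratio.reindex(keys).to_numpy(), index=data.index)
```

Intuitively, configurations of \(\varvec{Pa}_{R_i}\) that are over-represented among the complete cases receive weights below 1, and under-represented configurations receive weights above 1.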
It should be noted that when \(\varvec{Pa}_{R_i}\) contains partially observed variables, Eq. (7) implies that \(\varvec{Pa}_{R_i}\subseteq \varvec{V}\); otherwise, the columns of \(\varvec{Pa}_{R_i}\) in the pairwise deleted data set may contain missing values that will render the calculation of \(\beta _{R_i}\) invalid. The following example shows that it might be impossible to recover the underlying true distribution if any \(\varvec{Pa}_{R_i}\not \subseteq \varvec{V}\).
Example 2
Consider that Fig. 1c is the true m-graph, the current best DAG \(\mathcal {G}\) in the HC search is the one shown in Fig. 2a, and Fig. 2b presents one of its neighbouring DAGs, \(\mathcal {G}_{nei}\). Since the difference in score between \(\mathcal {G}_{nei}\) and \(\mathcal {G}\) is \(S\left( V_3\mid V_1\right) - S\left( V_3\right) \), we need to ensure that missingness does not bias the estimate of the distribution \(P\left( V_1, V_3\right) \) when computing this score difference. Suppose we apply pairwise deletion directly on the necessary variables \(\left\{ V_1, V_3\right\} \) and use Eq. (7) to recover \(P\left( V_1, V_3\right) \). This results in the following equation:
\(P\left( V_1, V_3\right) = c\,P\left( V_1, V_3\mid R_1 = 0\right) \frac{P\left( V_2\mid R_2 = 0\right) }{P\left( V_2\mid R_1 = 0, R_2 = 0\right) }\)
However, the problem in the above equation is that we cannot compute the weight term \(\frac{P\left( V_2\mid R_2 = 0\right) }{P\left( V_2\mid R_1 = 0, R_2 = 0\right) }\) for data cases that contain missing values in \(V_2\).
To avoid this, when assessing the edge operations from \(\mathcal {G}\) to \(\mathcal {G}_{nei}\) in HC-IPW, the pairwise deletion for Eq. (7) should be performed on the sufficient variables \(\varvec{U}\), a variable set that contains the necessary variables \(\varvec{W}\) plus the parents of the missing indicators of all variables in \(\varvec{U}\):
\(\varvec{U} = \varvec{V}_d\cup \bigcup _{V_i\in \varvec{V}_d}\left( \varvec{Pa}_i\cup \varvec{Pa}_i^{nei}\right) \cup \bigcup _{V_i\in \varvec{U}}\varvec{Pa}_{R_i} \qquad (8)\)
where \(\varvec{V}_d\) is the set of variables that have different parent-sets between \(\mathcal {G}\) and \(\mathcal {G}_{nei}\), and \(\varvec{Pa}_i\) and \(\varvec{Pa}_i^{nei}\) are the parent-sets of \(V_i\) in \(\mathcal {G}\) and \(\mathcal {G}_{nei}\) respectively. It is worth noting that Eq. (8) represents a recursive process that iterates over the parents of the missing indicators of all involved variables; i.e., not only \(\varvec{W}\) but also \(\varvec{U}\backslash \varvec{W}\) must be included in \(\varvec{U}\) in order to resolve the issue illustrated in Example 2.
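This recursive closure can be computed with a simple fixed-point loop. In the sketch below, `pa_r` is a hypothetical mapping from each partially observed variable to the learned parents of its missing indicator.

```python
def sufficient_variables(w: set, pa_r: dict) -> set:
    """Close the necessary variables W under parents-of-missingness:
    repeatedly add Pa_{R_i} for every variable already in the set."""
    u = set(w)
    frontier = set(w)
    while frontier:
        added = set()
        for v in frontier:
            added |= pa_r.get(v, set()) - u   # new parents of missingness
        u |= added
        frontier = added                      # recurse on the new variables
    return u
```

Applied to Example 3 below, starting from \(\varvec{W} = \left\{ V_5, V_2, V_6\right\} \), the loop pulls in \(V_1\) (parent of \(R_6\)) and then \(V_4\) (parent of \(R_1\)).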
Another potential issue with Eq. (7) is that the parents \(\varvec{Pa}_{R_i}\) of each missing indicator \(R_i\) are generally unknown. Tu et al. (2019) used constraint-based learning to discover the parents of each missing indicator, and this approach has been proven to be sound when both Assumptions 1 and 2 hold. We have, therefore, adopted the constraint-based approach proposed by Tu et al. (2019) to discover the parents of the missing indicators when applying HC-IPW. The idea is that this approach can exclude variable \(V_j\) as a parent of \(R_i\) if \(R_i\) is found to be independent of \(V_j\) given some variable set \(\varvec{S}\), using the pairwise deleted data set for \(\left\{ V_j\right\} \cup \varvec{S}\). Algorithm 3 provides the pseudocode.
Algorithm 4 describes the HC-IPW algorithm, where the lines coloured in blue represent the difference in pseudocode between HC-IPW and HC-pairwise. Note that when computing the objective score for HC-IPW, the weighted statistics \(\widetilde{N}_{ijk}, \widetilde{N}_{ij}\) are used instead of the standard \(N_{ijk}, N_{ij}\) used in HC, and these are defined as follows:
\(\widetilde{N}_{ijk} = \sum _{s=1}^{N_{pw}}\beta ^s\,1_{ijk}\left( d^s\right) \qquad (9)\)
\(\widetilde{N}_{ij} = \sum _{k=1}^{r_i}\widetilde{N}_{ijk} \qquad (10)\)
where \(1_{ijk}\) is the indicator function of the event \(\left( V_i=k, \varvec{Pa}_i=j\right) \), which returns 1 when the combination \(V_i=k, \varvec{Pa}_i=j\) appears in the input data case and 0 otherwise, \(d^s\) is the \(s^{th}\) record in the pairwise deleted data set \(D_{pw}\), and \(\beta ^s\) is the weight corresponding to \(d^s\). We then define the BIC score for the pairwise deleted data set \(D_{pw}\) given \(\beta \) as follows:
\(BIC\left( \mathcal {G}, D_{pw}, \beta \right) = \sum _{i=1}^{n}\left[ \sum _{j=1}^{q_i}\sum _{k=1}^{r_i}\widetilde{N}_{ijk}\log \frac{\widetilde{N}_{ijk}}{\widetilde{N}_{ij}} - \frac{\log N_{pw}}{2}q_i\left( r_i - 1\right) \right] \)
where \(N_{pw}\) represents the sample size of \(D_{pw}\), \(\widetilde{N}_{ijk}\) and \(\widetilde{N}_{ij}\) represent the weighted statistics as defined in Eqs. (9) and (10), and \(\beta \) is used for computing the weighted \(\widetilde{N}_{ijk}\) and \(\widetilde{N}_{ij}\).
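Putting the weighted statistics and the weighted BIC together, a local score on the pairwise deleted data set can be computed by replacing the raw counts with beta-weighted sums. This is an illustrative sketch with our own function name, not the paper's implementation.

```python
import numpy as np
import pandas as pd

def weighted_bic_local(data: pd.DataFrame, beta: pd.Series,
                       v: str, parents: list) -> float:
    """Local BIC term on a pairwise deleted data set, with each case
    weighted by beta: weighted counts replace N_ijk and N_ij."""
    n_pw = len(data)
    r_i = data[v].nunique()
    q_i = int(np.prod([data[p].nunique() for p in parents])) if parents else 1
    cols = parents + [v]
    # weighted N_ijk: sum of beta over the cases falling in each (j, k) cell
    n_ijk = beta.groupby([data[c] for c in cols]).sum()
    if parents:
        # weighted N_ij: sum the weighted N_ijk over the states k
        n_ij = n_ijk.groupby(level=list(range(len(parents)))).transform("sum")
    else:
        n_ij = n_ijk.sum()
    loglik = float((n_ijk * np.log(n_ijk / n_ij)).sum())
    return loglik - 0.5 * np.log(n_pw) * q_i * (r_i - 1)
```

With all weights equal to 1, this reduces to the standard BIC term on the pairwise deleted data set.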
The following proposition shows that HC-IPW converges to a local optimum when BIC is used as the score function, when both Assumptions 1 and 2 hold, and when the sample size \(N\rightarrow \infty \).
Proposition 2
Given Assumptions 1 and 2, assume data D is partially observed and the sample size \(N\rightarrow \infty \). Then, for any DAG \(\mathcal {G}\) and any of its neighbouring DAGs \(\mathcal {G}_{nei}\),
\(S\left( \mathcal {G}_{nei}, D_{pw}, \beta \right)> S\left( \mathcal {G}, D_{pw}, \beta \right) \iff S\left( \mathcal {G}_{nei}, D_f\right) > S\left( \mathcal {G}, D_f\right) \)
where \(D_{pw}\) is the pairwise deleted data set, derived from D by removing the data cases with missing values amongst the sufficient variables \(\varvec{U}\), \(\beta = \prod _{R_i\in \varvec{R}_{\varvec{U}}}\beta _{R_i}\), and \(D_f\) is the corresponding fully observed data set.
3.3 Hill-climbing with adaptive inverse probability weighting
Although HC-IPW diminishes the potential data bias caused by systematic missing values, it achieves this by removing a greater number of data cases than HC-pairwise whenever \(\varvec{Pa}_{\varvec{R}_{\varvec{W}}}\) contains partially observed variables, which is likely to happen when the missingness is MNAR. This can be a problem when data cases are limited. We illustrate this phenomenon with an example.
Example 3
Suppose graph (a) in Fig. 3 represents the ground truth m-graph, in which the shaded variables \(V_1, V_4\) and \(V_6\) are partially observed, with their missingness caused by \(V_4, V_5\) and \(V_1\) respectively, as illustrated by the missing indicators \(R_1, R_4\) and \(R_6\) corresponding to the missingness of \(V_1, V_4\) and \(V_6\). Let us assume graph (b) represents the current state of the optimal DAG in the HC-pairwise/HC-IPW search process, and that graphs (c) and (d) represent two of the possible neighbouring graphs. When HC-pairwise compares \(\mathcal {G}\) with \(\mathcal {G}_{n1}\), it applies pairwise deletion to the cases in which the necessary variables \(\varvec{W} = \left\{ V_5, V_2, V_6\right\} \) contain missing values. Since only \(V_6\) is partially observed out of the three necessary variables, HC-pairwise removes data cases only when the value of \(V_6\) is missing. In contrast, when HC-IPW is applied to this case, and assuming it correctly learns the parents of missingness via Algorithm 3, it computes the weights of the pairwise deleted data set through pairwise deletion based on the sufficient variables \(\varvec{U} = \left\{ V_5, V_2, V_6\right\} \cup \left\{ V_1, V_4, V_5\right\} \). Thus, HC-IPW removes data cases whenever any of the variables in \(\varvec{U}\) contains a missing value (in this example, \(V_1, V_4\) and \(V_6\) do). Therefore, HC-IPW performs learning on a smaller set of data cases compared to HC-pairwise.
When \(\varvec{Pa}_{\varvec{R}_{\varvec{W}}}\) (refer to Eq. 4 and Algorithm 4) does not contain any partially observed variables, the HC-IPW algorithm will perform learning on the same number of data cases as HC-pairwise. This can happen in cases such as when comparing the neighbouring DAG \(\mathcal {G}_{n2}\) against \(\mathcal {G}\) in Fig. 3, where the set of necessary variables \(\varvec{W}\) in HC-pairwise is \(\left\{ V_4, V_2, V_5\right\} \) and the set of sufficient variables \(\varvec{U}\) in HC-IPW is \(\left\{ V_4, V_2, V_5\right\} \cup \left\{ V_5\right\} \). In this case, because \(V_5\) is fully observed, applying pairwise deletion given \(\varvec{W}\) and \(\varvec{U}\) results in the same pairwise deleted data set.
Because the effectiveness of a scoring function increases with sample size, the scoring effectiveness of HC-IPW can decrease considerably when the missingness is MNAR for multiple variables, since both the number of partially observed variables and MNAR missingness increase the number of data cases removed during the learning process. It is on this basis that we investigated a third variant, the adaptive IPW-based HC (HC-aIPW), which can be viewed as an extension of HC-IPW. The pseudocode of HC-aIPW is shown in Algorithm 5, where the highlighted section represents the part of the code that differs from HC-IPW.
In essence, HC-aIPW aims to maximise the number of samples taken into consideration during the learning process. When there are partially observed variables in \(\varvec{Pa}_{\varvec{R}_{\varvec{W}}}\), HC-aIPW applies pairwise deletion given \(\varvec{W}\) and computes the difference in score between the current optimal DAG and the neighbouring DAG using the original pairwise deleted data set and the standard scoring function. This is the only difference between HC-aIPW and HC-IPW. When there are no partially observed variables in \(\varvec{Pa}_{\varvec{R}_{\varvec{W}}}\), HC-aIPW uses the same IPW procedure as HC-IPW to compute the difference in score between the current optimal DAG and the neighbouring DAG given the weighted pairwise deleted data set.
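This adaptive rule condenses to a single branch per candidate move. In the sketch below, `score_pairwise` and `score_ipw` are hypothetical stand-ins for the two scoring paths described above, and `pa_r` maps each partially observed variable to the learned parents of its missing indicator.

```python
def score_move(data, w, pa_r, partially_observed, score_pairwise, score_ipw):
    """HC-aIPW's adaptive choice: if the parents of the missing indicators
    of W are themselves partially observed, fall back to plain pairwise
    deletion (keeping more cases); otherwise apply the IPW correction."""
    pa_of_r_w = set().union(*[pa_r.get(v, set()) for v in w]) if w else set()
    if pa_of_r_w & set(partially_observed):
        return score_pairwise(data, w)      # maximise the usable sample
    return score_ipw(data, w, pa_r)
```

Under the Example 3 setup, the move to \(\mathcal {G}_{n1}\) takes the pairwise-deletion branch (the parent \(V_1\) of \(R_6\) is partially observed), while the move to \(\mathcal {G}_{n2}\) takes the IPW branch.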
4 Experiments
The learning accuracy of each of the three algorithms described in Sect. 3 is investigated and evaluated with reference to the Structural EM algorithm when applied to the same data. The Structural EM algorithm represents a state-of-the-art score-based approach for structure learning from missing data, and also explores the search space of graphs using HC. Since all the involved algorithms are based on HC, we measure their learning accuracy with reference to the results obtained when applying standard HC on complete, rather than incomplete, data. Results from complete data give us the empirical maximum performance we can achieve on these data sets using HC, before making part of the data missing. The HC and Structural EM algorithms used in this paper are those available in the bnlearn R package (Scutari 2010). It is worth noting that the Structural EM algorithm implemented in the bnlearn R package is based on single imputation rather than belief propagation. Therefore, the results presented in this paper approximate the difference between the proposed methods and Friedman’s Structural EM. The implementations of the three HC variants described in Sect. 3 are available online at https://github.com/Enderlogic/HCmissingdata.
4.1 Generating synthetic data and missingness
To illustrate the performance of the algorithms under different settings, we consider three types of ground truth DAGs: sparse networks, dense networks and real-world networks. We have constructed 50 random sparse and 50 random dense DAGs. Each network contains 20 to 50 nodes with two to six states per node. A sparse DAG \(\mathcal {G}\) with n variables is generated from a randomly ordered variable set \(V_1<V_2<\ldots <V_n\), where directed edges are sampled from lower ordered variables to higher ordered variables with probability \(2/\left( n - 1\right) \). Dense DAGs are generated with the same procedure, but the probability of drawing an edge between variables increases to \(4/\left( n - 1\right) \). The conditional probability distribution of variable \(V_i\) in sparse and dense DAGs is parameterised, given any configuration of its parents, by drawing a random number from the Dirichlet distribution \(\text {Dir}\left( \varvec{\alpha }\right) \), where \(\varvec{\alpha }= \underbrace{\left\{ 1, \ldots , 1\right\} }_{r_i}\), and \(r_i\) is the number of states in \(V_i\). For real-world DAGs, we use the six real-world BNs investigated by Constantinou et al. (2021). The structure and parameters of these BNs are set by either real data observations or prior knowledge as defined in the original studies. The properties of these BNs are provided in Table 2.
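The random DAG construction above can be sketched in a few lines; `random_sparse_dag` is an illustrative helper written for this description (not the released implementation), with the edge probability defaulting to the sparse setting \(2/(n-1)\):

```python
import random

def random_sparse_dag(n, edge_prob=None):
    """Sample a random DAG over n nodes.

    Variables are placed in a random order, and a directed edge is drawn
    from each lower-ordered variable to each higher-ordered variable with
    probability 2/(n-1) (sparse) or 4/(n-1) (dense).  Acyclicity holds by
    construction because all edges follow the ordering.
    """
    if edge_prob is None:
        edge_prob = 2 / (n - 1)          # sparse default; use 4/(n-1) for dense
    order = list(range(n))
    random.shuffle(order)                # random variable ordering
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < edge_prob:
                edges.append((order[i], order[j]))  # lower -> higher order
    return edges
```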
We generate complete and incomplete synthetic data using the DAGs introduced above. The complete data sets are provided as input to the standard HC algorithm, whereas the corresponding incomplete data sets are provided as input to the Structural EM algorithm and the three HC variants described in Sect. 3. We generate five complete data sets per DAG with sample sizes \(N\in \left\{ 100, 500, 1000, 5000, 10000\right\} \). Each complete data set is then used to construct three further data sets with missing values; one per missingness assumption, MCAR, MAR or MNAR. For the MCAR case, we randomly select 50% of the variables to represent the partially observed variables, and we then remove observed data of these variables with probability p, where p represents a random value between 0.1 and 0.6. For the MAR case, we had to ensure missingness is dependent on a subset of the fully observed variables, and this is done as follows:

1.
Randomly select 50% of the variables as partially observed variables (same process as in MCAR);

2.
Randomly assign a fully observed variable as the parent of missingness of a partially observed variable (repeat for all partially observed variables);

3.
Remove observations in partially observed variables with probability \(p=0.6\) when the parent of their missingness is at its highest occurring state; otherwise, remove the observation with probability \(p=0.1\).
Generating MNAR data also involves the above three-step procedure, but step 2 is modified as follows:

2.
Randomly select 50% of the partially observed variables and randomly assign a fully observed variable as the parent of their missingness. For the remaining 50% partially observed variables, randomly assign another partially observed variable as the parent of their missingness.
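The three-step MAR procedure above can be sketched as follows; `make_mar` is a hypothetical helper written for illustration (assuming a pandas DataFrame of discrete data), not part of the released implementation:

```python
import numpy as np
import pandas as pd

def make_mar(data, rng=None):
    """Inject MAR missingness following the paper's three-step procedure:
    half the variables become partially observed, each is assigned one
    fully observed parent of missingness, and its values are removed with
    p = 0.6 when the parent is at its highest occurring state, else p = 0.1.
    """
    if rng is None:
        rng = np.random.default_rng()
    data = data.copy()
    cols = list(data.columns)
    # Step 1: randomly select 50% of the variables as partially observed
    partial = list(rng.choice(cols, size=len(cols) // 2, replace=False))
    full = [c for c in cols if c not in partial]
    for v in partial:
        # Step 2: assign a fully observed variable as the parent of missingness
        parent = rng.choice(full)
        top = data[parent].mode().iloc[0]           # highest occurring state
        # Step 3: remove values with p = 0.6 or p = 0.1 depending on the parent
        p = np.where(data[parent] == top, 0.6, 0.1)
        mask = rng.random(len(data)) < p
        data.loc[mask, v] = np.nan
    return data
```

The MNAR variant would differ only in step 2, drawing the parent of missingness from the partially observed variables for half of the selected columns.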
4.2 Evaluation metrics
The structure learning performance is assessed using two metrics that are fully oriented towards graphical discovery. The first metric is the classic \(F_1\) score, composed of Precision and Recall. The formal definition of the \(F_1\) score is \(F_1 = 2 \cdot \text {Precision} \cdot \text {Recall} / \left( \text {Precision} + \text {Recall}\right) \), where \(\text {Precision} = TP / \left( TP + FP\right) \), \(\text {Recall} = TP / \left( TP + FN\right) \), TP is the number of edges that exist in both the learned graph and the true graph, FP is the number of edges that exist in the learned graph but not in the true graph, and FN is the number of edges that exist in the true graph but not in the learned graph.
The second metric considered is the Structural Hamming Distance (SHD), which measures graphical differences between the learned graph and the true graph (Tsamardinos et al. 2006). Specifically, the SHD score represents the number of edge operations needed to convert the learned graph into the true graph, where the edge operations involve arc addition, deletion and reversal. Therefore, in contrast to the \(F_1\) score, a lower SHD score indicates a better performance. Because the SHD score is sensitive to the number of edges and variables present in the true graph, we divide the SHD score by the number of edges in the true DAG to reduce bias.
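The edge-based \(F_1\) computation can be sketched as follows, treating graphs as sets of edges. Note that the paper evaluates on CPDAGs, where the matching must also handle undirected edges; that detail is omitted in this sketch:

```python
def f1_score(learned_edges, true_edges):
    """F1 over edge sets: precision = TP/(TP+FP), recall = TP/(TP+FN),
    F1 = 2 * precision * recall / (precision + recall)."""
    learned, true = set(learned_edges), set(true_edges)
    tp = len(learned & true)        # edges present in both graphs
    fp = len(learned - true)        # edges only in the learned graph
    fn = len(true - learned)        # edges only in the true graph
    if tp == 0:
        return 0.0                  # no true positives -> F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```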
Because the experiments are based on observational data, multiple DAGs can be statistically indistinguishable due to being part of the same Markov Equivalence class. On this basis, we compare the CPDAGs between the learned and true graphs to measure both the \(F_1\) and SHD graphical scores.
4.3 Results when the true DAG is sparse
Figure 4 presents the average accuracy of the algorithms when the true DAGs are sparse. Each averaged score is derived from 50 CPDAGs, corresponding to each of the 50 randomly generated sparse DAGs. Appendix C provides the mean and standard deviation of the scores. The results suggest that the two evaluation metrics are generally consistent in ranking the algorithms from best to worst performance. Both metrics suggest that all three proposed HC variants outperform the Structural EM algorithm when the sample size is greater than 1,000, under all three missingness scenarios MCAR, MAR and MNAR. Interestingly, the HC-aIPW algorithm almost matches the performance of HC applied to complete data (denoted as HC-complete in Fig. 4), particularly in experiments with sample size 10,000, and this observation is consistent across all three missingness assumptions.
The three variants, HC-pairwise, HC-IPW and HC-aIPW, produce very similar results under MCAR, because missingness under MCAR has no pattern that could be identified by the HC-IPW and HC-aIPW variants. That is, when HC-IPW and HC-aIPW do not discover any parent of missingness, they follow the search process of HC-pairwise. Under MAR, however, both HC-IPW and HC-aIPW outperform HC-pairwise as well as Structural EM when the sample size is larger than 100, and the improvement in performance increases with sample size. From this observation, we can conclude that the IPW method successfully eliminates most of the distributional bias. Interestingly, although the construction of the Structural EM algorithm is based on the MAR assumption, its performance under MAR is considerably lower than its performance under MCAR. A possible explanation is that the single imputation process the bnlearn R package employs during the E step of Structural EM, instead of belief propagation, is unable to capture the uncertainty of the missing values.
Lastly, the results under MNAR suggest that HC-IPW generally performs worse than HC-pairwise across most sample sizes. This observation can be explained by the reduced sample size on which HC-IPW operates, relative to HC-pairwise, as discussed in Sect. 3.3. Specifically, when the parents of missingness of the necessary variables \(\varvec{W}\) contain partially observed variables (i.e., the MNAR case), HC-IPW applies pairwise deletion by taking into consideration a higher number of variables compared to those considered by HC-pairwise. This means that, compared to HC-pairwise, the HC-IPW algorithm typically evaluates edge operations based on smaller samples when missingness is MNAR, which tends to yield less accurate results. From this, we can also conclude that the negative effect of HC-IPW further pruning samples has not been offset by the data bias adjustments applied by the IPW method. On the other hand, the HC-aIPW algorithm, which is designed to apply the IPW method only when no additional samples would be deleted compared to HC-pairwise, generally outperforms all other algorithms under MNAR, particularly at higher sample sizes.
Figure 5 presents the relative execution time between (a) the four algorithms applied to data with missing values, and (b) the HC algorithm applied to the complete data. Because the three HC variants are implemented in Python, we measure their execution time relative to our Python version of HC. On the other hand, Structural EM is implemented in the bnlearn R package and makes use of the HC implementation of that package. Therefore, the execution time of Structural EM is measured relative to the HC implementation in the bnlearn R package. The mean and standard deviation of the results can be found in Appendix D.
Overall, the results show that HC-pairwise is the most efficient of the algorithms that handle missing values. Specifically, HC-pairwise increases execution time relative to HC by approximately 50%, while HC-IPW and HC-aIPW are anywhere between 8 and 15 times slower than HC depending on sample size, and the relative difference in execution time tends to increase with sample size. This is because a higher number of parents of missingness are likely to be detected at larger sample sizes, and these discoveries increase the execution time of the IPW-based variants. Still, both the HC-IPW and HC-aIPW variants are more efficient than Structural EM, which increases execution time relative to HC by 100 to 700 times.
4.4 Results when the true DAG is dense
In this subsection we investigate the performance of the algorithms when applied to data sets sampled from dense networks. The performance of each algorithm is depicted in Fig. 6, and detailed results are provided in Appendix C. An important distinction between sparse and dense networks is that learning from data sampled from dense networks makes it more likely that local parts of the graph will involve learning from partially observed variables. In other words, the effect of missing values is more severe on dense, compared to sparse, networks as shown in Sect. 4.3.
The results show that the HC-aIPW algorithm continues to perform best in the case of denser graphs, in terms of overall performance and over the different missingness and sample size assumptions. Specifically, HC-aIPW achieves the highest accuracy in 11 and 8 cases in terms of the \(F_1\) and SHD measures respectively, out of the 15 experiments conducted in this subsection. In contrast, the Structural EM algorithm performs best in only two experiments, and only in terms of the SHD score. However, compared with the results in Sect. 4.3, the divergence in score between Structural EM and the HC-based variants is much smaller.
The performance across the three HC-based variants appears to be similar to that obtained under sparse graphs. When data are MCAR, HC-IPW and HC-aIPW produce scores that are similar to those produced by HC-pairwise, and this is expected since no observed variables should be detected as parents of missingness indicators when missingness is MCAR. When data are MAR, both HC-IPW and HC-aIPW outperform HC-pairwise since, unlike HC-pairwise, they can detect and reduce bias caused by missing values. Lastly, when data are MNAR, HC-IPW performs worst amongst all algorithms, particularly when the sample size is lowest, because it tends to remove a large number of data cases when computing the local scores. On the other hand, HC-aIPW (which aims to resolve this specific drawback of HC-IPW) performs best in almost all MNAR experiments. The consistency of the results across sparse and dense networks suggests that the performance of HC-aIPW, relative to the other algorithms considered in this study, is not sensitive to the sparsity of the network that generates the input data.
4.5 Results when the true DAG is a real-world network
Lastly, we apply the algorithms to data sets sampled from the six real-world networks. Figure 7 shows the average performance of the algorithms across all six real-world networks and over all five sample sizes. When the missingness is MCAR, the three HC-based variants achieve similar accuracy, as expected, and generally outperform the Structural EM algorithm when the sample size is larger than 500. When the missingness is MAR or MNAR, the performance of HC-aIPW improves over the other algorithms, especially when the sample size is larger than 500. These results are consistent with those obtained from the randomised sparse and dense networks presented in Sects. 4.3 and 4.4 respectively.
5 Conclusion
Learning an accurate BN structure from incomplete data remains a challenging task. Most BN structure learning algorithms do not support learning from incomplete data, and this is partly explained by the considerable increase in computational complexity when dealing with incomplete data, which adds to a problem that is NP-hard even when data are complete. This challenge is even greater when missing values are systematic rather than random.
In this paper, we have investigated three novel HC-based variants that employ pairwise deletion and IPW strategies to deal with random and systematic missing data. The HC-pairwise and HC-IPW variants can be viewed as sub-versions of HC-aIPW, which is the most complete and best performing variant described in this paper. All three variants have been applied to different cases of data missingness, and their performance was compared to the state-of-the-art Structural EM algorithm that is available in the bnlearn R package. Moreover, all performances under missingness have been compared to HC applied to the corresponding complete data sets. The empirical results show:

1.
Pairing HC with pairwise deletion (i.e., the HC-pairwise variant) is enough to learn graphs that are more accurate than those produced by the Structural EM algorithm, at a lower computational cost.

2.
Combining HC with both pairwise deletion and IPW techniques (i.e., the HC-IPW variant) further improves learning accuracy under MCAR and MAR in general, but decreases accuracy under MNAR due to the aggressive pruning of data cases employed by HC-IPW (refer to Sect. 3.3). Moreover, HC-IPW is considerably slower than HC-pairwise, although it remains an order of magnitude faster than Structural EM.

3.
HC-aIPW takes advantage of both strategies, as in HC-IPW, but relaxes the pruning strategy on the data cases and returns the overall best performance, especially under MNAR, which represents the most difficult case of missingness.

4.
All three HC variants described in this paper outperform Structural EM in most cases. Importantly, the performance of HC-aIPW on missing data approaches the performance of HC on complete data when the sample size is 10,000 and the ground truth graph is sparse, and this observation is consistent under all three cases of missingness.
Future research will investigate the application of these learning strategies to search algorithms that are more complex than HC, such as Tabu, or to score-based algorithms such as GES (Chickering 2002), which explores the CPDAG rather than the DAG space. Another possible research direction would be to combine the IPW method with the NAL score (Balov 2013), which is a scoring function intended for missingness under MCAR, and further investigate the possibility of a new decomposable scoring function under the systematic missingness cases of MAR and MNAR.
Data availability
The data used for the simulation results are available upon request to the corresponding author.
References
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49.
Balov, N. (2013). Consistent model selection of discrete Bayesian networks from incomplete data. Electronic Journal of Statistics, 7, 1047–1077.
Bodewes, T., & Scutari, M. (2021). Learning Bayesian networks from incomplete data with the nodeaverage likelihood. International Journal of Approximate Reasoning, 138, 145–160.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov), 507–554.
Constantinou, A. C., Fenton, N., Marsh, W., & Radlinski, L. (2016). From complex questionnaire and interviewing data to intelligent Bayesian network models for medical decision support. Artificial Intelligence in Medicine, 67, 75–93.
Constantinou, A. C., Liu, Y., Chobtham, K., Guo, Z., & Kitson, N. K. (2021). Largescale empirical validation of Bayesian Network structure learning algorithms with noisy data. International Journal of Approximate Reasoning, 131, 151–188.
Cussens, J. (2011). Bayesian network learning with cutting planes. In Proceedings of the 27th conference on uncertainty in artificial intelligence (UAI 2011), AUAI Press, pp. 153–160.
Friedman, N. (1997). Learning belief networks in the presence of missing values and hidden variables. In ICML, Vol. 97, pp. 125–133.
Gain, A., & Shpitser, I. (2018). Structure learning under missing data. In International conference on probabilistic graphical models, PMLR, pp. 121–132.
Gámez, J. A., Mateo, J. L., & Puerta, J. M. (2011). Learning Bayesian networks by hill climbing: Efficient methods based on progressive restriction of the neighborhood. Data Mining and Knowledge Discovery, 22(1), 106–148.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576.
Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197–243.
Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685.
John, C., Ekpenyong, E. J., & Nworu, C. C. (2019). Imputation of missing values in economic and financial time series data using five principal component analysis approaches. CBN Journal of Applied Statistics, 10(1), 51–73.
Mohan, K., & Pearl, J. (2021). Graphical models for processing missing data. Journal of the American Statistical Association, pp. 1–16.
Mohan, K., Pearl, J., & Tian, J. (2013). Graphical models for inference with missing data. In Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., & Weinberger, K.Q. (Eds.) Advances in Neural Information Processing Systems, Curran Associates, Inc., vol 26, https://proceedings.neurips.cc/paper/2013/file/0ff8033cf9437c213ee13937b1c4c455Paper.pdf.
Pedersen, A. B., Mikkelsen, E. M., Cronin-Fenton, D., Kristensen, N. R., Pham, T. M., Pedersen, L., & Petersen, I. (2017). Missing data and multiple imputation in clinical epidemiological research. Clinical Epidemiology, 9, 157.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (Vol. 81). New York: Wiley.
Ruggieri, A., Stranieri, F., Stella, F., & Scutari, M. (2020). Hard and soft EM in Bayesian network learning from incomplete data. Algorithms, 13(12), 329.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3).
Silander, T., Leppä-Aho, J., Jääsaari, E., & Roos, T. (2018). Quotient normalized maximum likelihood criterion for learning Bayesian network structures. In International conference on artificial intelligence and statistics, PMLR, pp. 948–957.
Spirtes, P., Glymour, C. N., Scheines, R., & Heckerman, D. (2000). Causation, prediction, and search. Cambridge: MIT press.
Strobl, E. V., Visweswaran, S., & Spirtes, P. L. (2018). Fast causal inference with nonrandom missingness by testwise deletion. International Journal of Data Science and Analytics, 6(1), 47–62.
Tian, Y., Zhang, K., Li, J., Lin, X., & Yang, B. (2018). LSTM-based traffic flow prediction with missing data. Neurocomputing, 318, 297–305.
Tsamardinos, I., Aliferis, C. F., Statnikov, A. R., & Statnikov, E. (2003). Algorithms for large scale Markov blanket discovery. FLAIRS conference, 2, 376–380.
Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), 31–78.
Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., & Zhang, K. (2019). Causal discovery in the presence of missing data. In The 22nd international conference on artificial intelligence and statistics, PMLR, pp. 1762–1770.
Zemicheal, T., & Dietterich, T.G. (2019). Anomaly detection in the presence of missing values for weather data quality control. In Proceedings of the 2nd ACM SIGCAS conference on computing and sustainable societies, pp. 65–73.
Acknowledgements
This research was supported by the EPSRC Fellowship project EP/S001646/1 on Bayesian Artificial Intelligence for Decision Making under Uncertainty.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Yang Liu. The first draft of the manuscript was written by Yang Liu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Editor: Manfred Jaeger.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Proofs of propositions
In this section, we provide proofs of the propositions discussed in Sect. 3. We define the variables used in the proofs as follows: \(\varvec{V}_d\) is the set of variables with different parent sets between a given DAG \(\mathcal {G}\) and its neighbouring DAG \(\mathcal {G}_{nei}\), \(\varvec{W}\) is the set of necessary variables defined in Eq. (4), \(\varvec{U}\) is the set of sufficient variables defined in Eq. (8), and N and \(N_{pw}\) are the sample sizes of the partially observed data set D and the pairwise deleted data set \(D_{pw}\) respectively.
Proposition 1
Assume data D is MCAR and sample size \(N\rightarrow \infty \); then, for any DAG \(\mathcal {G}\) and one of its neighbouring DAGs \(\mathcal {G}_{nei}\),
where \(D_{pw}\) is the pairwise deleted data set which is derived from D by removing the data cases with missing values amongst the necessary variables \(\varvec{W}\), and \(D_f\) is the corresponding fully observed data set.
Proof
Equation (12) follows from Eq. (5) given the MCAR assumption and the large sample limit. Equation (13) holds because the missing rate of data D, i.e., \(N_{pw} / N\), does not depend on the sample size N and remains constant as N increases. \(\square \)
Proposition 2
Given Assumptions 1 and 2, assume data D is partially observed and sample size \(N\rightarrow \infty \); then, for any DAG \(\mathcal {G}\) and one of its neighbouring DAGs \(\mathcal {G}_{nei}\),
where \(D_{pw}\) is the pairwise deleted data set which is derived from D by removing data cases with missing values among sufficient variables \(\varvec{U}\), \(\beta = \prod _{R_i\in \varvec{R}_{\varvec{U}}}\beta _{R_i}\), and \(D_f\) is the corresponding fully observed data set.
Proof
In the above equations, \(\beta = \prod _{R_i\in \varvec{R}_{\varvec{U}}}\beta _{R_i}\), and \(\widetilde{N}_{ijk}\) and \(\widetilde{N}_{ij}\) are defined by Eqs. (9) and (10). Equation (14) is a consequence of the recoverability of \(P\left( \varvec{U}\right) \) given Eq. (7). \(\square \)
Appendix B Derivation of Eq. (7)
Based on Mohan et al. (2013), Theorem 2, given Assumptions 1 and 2, the joint distribution \(P\left( \varvec{V}\right) \) can be fully recovered from the observed data via the following equation:
where \(\varvec{Pa}_{R_i}\) is the set of parents of the missing indicator \(R_i\), and \(\varvec{R}_{\varvec{Pa}_{R_i}}\) is the set of missing indicators of the partially observed variables in \(\varvec{Pa}_{R_i}\). Then,
In the above equation, the term c depends only on the missing indicators \(\varvec{R}\) and remains constant with respect to the observed variables \(\varvec{V}\). The product \(\prod _{R_i\in \varvec{R}}\beta _{R_i}\) represents the relative probability of a data case from the pairwise deleted data set being observed in the complete data set. For example, if a pairwise deleted data case has \(c\prod _{R_i\in \varvec{R}}\beta _{R_i} = 0.8\), then its occurrence rate is assumed to drop by 20% in the complete data set compared to its occurrence rate in the pairwise deleted data set. Therefore, we use Eq. (7) to reweight the pairwise deleted data and estimate the underlying true distribution given the pairwise deleted data set.
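The reweighting idea can be illustrated with a minimal sketch for the special case of a single partially observed variable X whose missingness depends on one fully observed parent Z; `ipw_weights` is a hypothetical helper, and the full Eq. (7) generalises this by taking the product of such probabilities over all missing indicators:

```python
import numpy as np
import pandas as pd

def ipw_weights(data, partial_var, parent):
    """Weight each pairwise-deleted case by 1/beta, where beta is the
    estimated probability of observing `partial_var` given its parent of
    missingness `parent` (a one-variable sketch of the Eq. (7) reweighting).
    """
    observed = data[partial_var].notna()
    # beta(z) = estimated P(partial_var observed | parent = z)
    beta = observed.groupby(data[parent]).mean()
    kept = data[observed].copy()           # pairwise deletion
    kept["weight"] = 1.0 / kept[parent].map(beta)
    return kept
```

Cases whose parent state suffers heavy missingness receive proportionally larger weights, so the weighted sample approximates the distribution of the complete data.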
Appendix C Supplementary results from the structure learning experiments
Refer to Tables 3, 4, 5, 6, 7 and 8.
Appendix D Supplementary results of execution time
See results in Table 9.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, Y., Constantinou, A.C. Greedy structure learning from data that contain systematic missing values. Mach Learn 111, 3867–3896 (2022). https://doi.org/10.1007/s10994-022-06195-8
Keywords
 Expectation-maximisation
 Inverse probability weighting
 Missing data
 Scorebased learning
 Structure learning