Ignoring Non-ignorable Missingness

The classical missing at random (MAR) assumption, as defined by Rubin (Biometrika 63:581–592, 1976), is often not required for valid inference ignoring the missingness process. Nor are other assumptions that are sometimes believed to be necessary but that result from misunderstandings of MAR. We discuss three strategies that allow us to use standard estimators (i.e., ignore missingness) in cases where missingness is usually considered to be non-ignorable: (1) conditioning on variables, (2) discarding more data, and (3) being protective of parameters.

Strategy 3 modifies the estimator or the variables included in the model to "protect" the parameters of interest. Which approach to take does not depend on any parametric assumptions regarding the missingness process but is determined by conditional independence assumptions between missingness (or selection) indicators and the variables of interest.
Our plan is as follows. In Sect. 1 we introduce modifications to Rubin's MAR assumption and define R-MAR, an assumption that allows valid frequentist inference by IL approaches. In Sect. 2 we briefly describe three strategies for valid inference when R-MAR is violated, and these strategies become the topics of Sects. 3, 4, and 5. Whereas C-MAR, the modified MAR requirement when conditioning on variables (Strategy 1, Sect. 3), has been discussed in the literature, confusion about this requirement persists. The idea of discarding more data to relax R-MAR (Strategy 2, Sect. 4) is new, to our knowledge. We show how it relates to sequential estimation based on Mohan, Pearl, and Tian's (2013) ordered factorization theorem and that it is preferable to sequential estimation when IL methods are applicable. Section 5 gives an overview of protective estimation (Strategy 3), where estimators and models are selected to protect specific parameters from being inconsistently estimated due to missing data. In Sect. 6 we end with some concluding remarks.

MAR and its Modifications
We assume throughout that we have a parametric model for our variables of interest with parameters θ and that we would like to make inferences regarding (possibly a subset of) these parameters. We also assume that the model is correctly specified so that we can focus on the impact of missing data. Let U_i be the vector of all variables in the model for unit i, e.g., with realized values u_i = (x_i, z_i, y_i)′. A separate missingness or selection process determines which of the variables are observed for which units. The vector of selection indicators S_i has elements equal to 1 if the corresponding variable is observed for unit i and 0 if it is missing, e.g., S_i = (S_xi, S_zi, S_yi)′ with realized values s_i = (s_xi, s_zi, s_yi)′. We will occasionally say that a variable is "selected" for a unit, meaning that it is not missing.
Using the notation u_i^obs for the sub-vector of u_i containing the variables that are observed, i.e., the variables for which the corresponding elements in s_i are 1, Rubin's (1976) MAR assumption can be written as P(s_i | u_i) = P(s_i | u_i^obs).
In words, the probability that the missing variables are missing, given the realized values of all observed variables, is unchanged, regardless of what values are substituted for the missing variables (Rubin, 1976; Seaman et al., 2013). Rubin (1976) calls the missingness ignorable if MAR holds and if the parameters of the missingness process are distinct from the parameters of the model of interest. Under these assumptions, direct maximum likelihood inference (without frequentist claims) and Bayesian inference are valid. Rubin (1976) did not define missing completely at random (MCAR), but it later became understood (e.g., Mealli & Rubin, 2015) to mean P(s_i | u_i) = P(s_i). An important paper by Seaman et al. (2013) points out that Rubin's MAR definition has been widely misunderstood by not recognizing that it refers to the realized selection indicators and the realized data. Instead, MAR has been interpreted as a conditional independence statement for random variables, S_i ⊥⊥ U_i^mis | U_i^obs, where U_i^mis is the sub-vector of U_i containing the variables that are not observed and U_i^obs is the corresponding sub-vector of variables that are observed. Definitions of MAR based on random variables rather than their realized values can be interpreted as a stricter requirement than Rubin's MAR, namely that Rubin's MAR should hold in repeated samples. Seaman et al. (2013) show that frequentist likelihood inference ignoring the missingness process requires that MAR always holds (in repeated samples) and call that assumption "everywhere MAR." Mealli and Rubin (2015) adopt the same definition and suggest the term "always" instead of "everywhere." We will use the acronym A-MAR for always MAR.
A problem with these MAR assumptions is that different units have different variables in U_i^obs (see also Schafer & Graham, 2002), and it rarely makes sense to assume that a variable X_i affects selection of other variables only if it is observed, S_xi = 1. An exception would be if X only affects selection when it has been realized or revealed to the individual (e.g., failing an educational assessment). Generally, a more plausible MAR condition therefore is what we call realistic MAR (R-MAR), where missingness cannot depend on any variable that can be missing. Potthoff et al. (2006) call this assumption MAR+, Greenland and Finkle (1995) refer to it as "stratified MCAR," and Mohan et al. (2013) define their MAR assumption this way. If both X and Y can be missing and Z is always observed, the assumption becomes

R-MAR: (S_xi, S_yi) ⊥⊥ (X_i, Y_i) | Z_i.     (1)

Note that it is now valid to write the assumption as a conditional independence statement. In contrast, the conditional independence statement S_i ⊥⊥ U_i^mis | U_i^obs is problematic, as pointed out by Seaman et al. (2013), because U_i^mis is a function of S_i (in the sense that S_i determines which elements of U_i are missing) and can therefore not be conditionally independent of it.
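To make the R-MAR condition concrete, the following sketch (Python; the distributions, coefficients, and selection probabilities are hypothetical illustrations of ours, not from the paper) generates data in which selection of X and Y depends only on the always-observed Z, and checks that, within a narrow stratum of Z, the distribution of Y among selected units matches its distribution among all units:

```python
import numpy as np

# Hypothetical data-generating process: Z always observed; X and Y can be missing.
rng = np.random.default_rng(0)
n = 200_000
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = 1.0 + 0.8 * x + 0.4 * z + rng.normal(size=n)

# R-MAR: selection of X and Y depends only on the always-observed Z.
p_sel = 1.0 / (1.0 + np.exp(-z))
s_x = rng.random(n) < p_sel
s_y = rng.random(n) < p_sel

# Within a narrow stratum of Z, selection is (approximately) unrelated to Y,
# so the mean of Y among selected units matches the mean among all units.
stratum = np.abs(z) < 0.05
gap = abs(y[stratum & s_y].mean() - y[stratum].mean())
```

Under an MNAR process (e.g., selection depending on Y itself), this gap would not vanish even within narrow strata of Z.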
In their MAR definition, Mohan et al. (2013) refer to variables like X and Y that have missing values as "partially observed" and to variables like Z that have no missing values as "fully observed," without discussing what would happen in repeated samples. In contrast, Mealli and Rubin (2015) use the term "always observed" to clarify that this is not just what happened in the realized data but an assumption regarding the missingness mechanism. They prove that when the units are exchangeable, A-MAR implies what we call R-MAR. However, as Mealli and Rubin (2016) point out in an erratum, this is true only if the selection indicators S_i are mutually independent given U_i, which is not required by A-MAR or R-MAR. For instance, A-MAR with exchangeable units allows for the possibility that selection S_xi of X_i depends on Y_i only when Y_i is observed, S_yi = 1. These kinds of processes seem odd, which is the reason for our term realistic MAR, but we will make use of such a process in Sect. 4 to justify one of our approaches, namely discarding more data.

Ignorability and IL Methods
A missingness process is ignorable if it is valid to base inferences on the AA-data likelihood, ignoring the missingness process, instead of the joint likelihood of the data and the missingness process. The assumptions required for ignorability depend on the kind of inference we wish to make, such as direct likelihood, Bayesian, or frequentist likelihood inference. Seaman et al. (2013) show that A-MAR, together with distinctness of the parameters of the missingness process and the model of interest, makes the missingness ignorable for frequentist likelihood inference. The reason is that the likelihood of the data is proportional to the joint likelihood of the missingness process and the data, and this is true not just for the realized data but also in repeated samples. Hence, the point estimates and observed information matrix based on the likelihood of the data (ignoring the missingness process) are identical to those based on the joint likelihood of the data and missingness process in each repeated sample. Therefore, tests and confidence intervals will have the same frequentist properties for both approaches. Using analogous arguments, Bayesian point estimators and credible intervals have the same repeated sampling properties whether they are based on the likelihood of the data or the joint likelihood of the data and missingness process. Because R-MAR implies (and is stricter than) A-MAR, R-MAR is also ignorable in the same sense.
The likelihood of the data (ignoring the missingness process) is the joint likelihood of the variables, integrated over the missing data. For simplicity, consider three variables, X, Z, and Y, where Z is always observed. The log-likelihood contribution of a unit can then be written as (suppressing the i subscript)

ℓ = s_x s_y ln P(x, z, y) + s_x (1 − s_y) ln P(x, z) + (1 − s_x) s_y ln P(z, y) + (1 − s_x)(1 − s_y) ln P(z).     (2)

Each term corresponds to a missingness pattern, with (s_x, s_y) equal to (1,1) for the first term, (1,0) for the second, (0,1) for the third, and (0,0) for the final term. Correspondingly, P(X, Z, Y) is the joint distribution of all variables, P(X, Z) is the marginal joint distribution of X and Z, integrating out Y because it is missing, and similarly for the remaining terms. For each pattern, we make use of all available data, as described for a multivariate normal distribution by Anderson (1957). As mentioned in the introduction, when this likelihood is used in maximum likelihood or Bayesian estimation (including multiple imputation without an auxiliary model), we will use the umbrella term "ignorable likelihood" (IL) method. Importantly, R-MAR treats all variables in U as response variables, with the implicit assumption that the likelihood is defined as in (2). However, typically the model of interest is a regression model (in a general sense, e.g., linear, logistic, multilevel, quantile, etc.) for Y given X and Z. To make use of the R-MAR assumption, we can embed this model within a multivariate model for U. In the case of linear regression, this is easy to do by specifying a linear structural equation model (SEM) as shown in Fig. 1, where the parameters of interest are the coefficients for the paths X → Y and Z → Y. Maximum likelihood estimation for linear SEMs based on AA data was discussed in detail by Muthén et al. (1987) and Allison (1987) for the case of few missingness patterns (multiple group approach) and by Arbuckle (1996) for the general case. If all variables are categorical, loglinear models can be used in an analogous way.
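The pattern-wise log-likelihood in (2) can be sketched directly for categorical data. The following minimal Python function (our own illustration; the uniform probability table is hypothetical) returns a unit's contribution by marginalizing the joint table over whichever of X and Y is missing:

```python
import numpy as np

# Hypothetical joint probability table p[x, z, y] for binary X, Z, Y.
p = np.full((2, 2, 2), 0.125)  # uniform joint distribution (sums to 1)

def loglik_unit(x, z, y, s_x, s_y, table):
    """One unit's AA-data contribution: marginalize over missing variables."""
    if s_x and s_y:
        return np.log(table[x, z, y])          # ln P(x, z, y)
    if s_x:
        return np.log(table[x, z, :].sum())    # ln P(x, z), Y integrated out
    if s_y:
        return np.log(table[:, z, y].sum())    # ln P(z, y), X integrated out
    return np.log(table[:, z, :].sum())        # ln P(z), X and Y integrated out
```

For instance, with the uniform table, a unit with X and Z observed but Y missing contributes ln P(x, z) = ln 0.25.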
Under R-MAR, all IL methods will have the same frequentist properties as the corresponding approaches based on the joint likelihood of the data and selection indicators.
In addition to the various definitions of MAR discussed in Sect. 1.1 that have been a source of confusion, there are three other sources of confusion, each of which suggests a strategy or guiding principle that allows us to ignore missingness processes that are usually understood to be non-ignorable. Section 2 gives an overview of these strategies, and the following sections provide more details on each strategy.

Strategy 1: Condition on (Functions of) Variables
An important source of confusion is that the MAR assumptions (Rubin's MAR, A-MAR, or R-MAR) are relevant only for inferences regarding the joint distribution of U. However, by far the most common types of analyses are regression models for one response variable given a set of covariates. Conditioning on covariates automatically leads to units with incomplete data being discarded, sometimes called listwise deletion or complete-case analysis. Contrary to common belief, such an approach does not require any of the MAR assumptions but rather an assumption that we will call conditional MAR, or C-MAR, that is more lenient than R-MAR. Unfortunately, it is common practice to apply a univariate MAR condition to each variable, such as incorrectly requiring that missingness of a covariate X cannot depend on X itself, given the other variables. This misconception can lead to adoption of approaches that fail when X directly affects its own missingness.
In latent variable models, conditioning on sufficient statistics for the latent variables (conditional maximum likelihood estimation) means that C-MAR can be relaxed further to allow selection to depend directly on the latent variables.

Strategy 2: Discard More Data
MAR allows selection of one variable to depend on selection of another variable for the same unit. This is again due to the multivariate definition of MAR, where S_i is a vector of all selection indicators for unit i, so these indicators can be dependent. This issue is rarely discussed and, in fact, Mealli and Rubin (2015) neglected this possibility in their theorem. It turns out that we can interfere with the missingness process by discarding data on some variables for those units for which other variables are missing, making the selection indicators more dependent and thereby making the process MAR. We refer to this approach as M-MAR (for make MAR). By imagining that we would discard data in this way in repeated samples, so that it becomes part of the missingness process, the process becomes A-MAR and frequentist likelihood inference becomes valid. We can alternatively think of the data deletion as being part of the estimator. We show that there is a close connection between our M-MAR approach and Mohan et al.'s (2013) ordered (or sequential) factorization theorem.

Strategy 3: Be Protective of (Subsets of) Parameters
Violation of MAR conditions (e.g., A-MAR, R-MAR, C-MAR), i.e., the problem of missing not at random (MNAR), does not imply that all parameters are estimated inconsistently when ignoring the missingness mechanism. Some estimators may be consistent for the parameters of interest. A well-known example is binary logistic regression for case-control data, where cases (with response variable equal to 1) and controls (with response variable equal to 0) have different probabilities of inclusion in the sample, which violates C-MAR. Nevertheless, standard maximum likelihood estimators of the regression coefficients and corresponding odds ratios are consistent, although the estimator of the intercept is not.
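This case-control result is easy to verify numerically. The sketch below (Python; parameter values and retention probabilities are hypothetical, and the Newton-Raphson fitter is our own minimal implementation, not a library routine) retains cases with probability 0.9 and controls with probability 0.1 and fits a logistic regression to the selected sample; the slope remains close to its true value while the intercept is shifted by ln(0.9/0.1):

```python
import numpy as np

# Hypothetical population: logit P(Y=1 | x) = -1 + 1*x.
rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(1.0 - x))).astype(float)

# Case-control sampling: selection depends on the response, violating C-MAR.
keep = rng.random(n) < np.where(y == 1, 0.9, 0.1)
xs, ys = x[keep], y[keep]

def fit_logit(X, y, iters=30):
    """Logistic regression by Newton-Raphson."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ b, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        b = b + np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return b

X = np.column_stack([np.ones(len(xs)), xs])
alpha_hat, beta_hat = fit_logit(X, ys)
# beta_hat is consistent for 1; alpha_hat is shifted to about -1 + ln(9).
```

The size of the intercept shift, ln(0.9/0.1), is the log of the ratio of the case and control sampling fractions, which is typically unknown in practice.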
We can sometimes modify our model or estimation method to protect the parameters of interest from being estimated inconsistently, a strategy we call protective estimation (Skrondal & Rabe-Hesketh, 2014). For example, in binary longitudinal data, different kinds of conditional maximum likelihood estimators can be used to protect the odds ratios of interest. These results also take advantage of conditioning (Strategy 1) and can involve discarding some data (Strategy 2).
The next three sections discuss each of the strategies in more detail.

Strategy 1: Condition on (Functions of) Variables

Complete-Case (CC) Regression Analysis
If we are only interested in the conditional distribution P(Y | X, Z), as in a regression model, it seems cumbersome to specify and estimate a multivariate model for X, Z, and Y and use the joint likelihood in (2) for IL estimation as described in Sect. 1.2. Instead, we may want to use the likelihood conditional on the covariates. The only units that make contributions to the likelihood conditional on X and Z are those units that have complete data. Complete-case (CC) analysis refers to analyzing the subsample of individuals with complete data, sometimes called listwise deletion. If both X and Y can be missing, whereas Z is always observed, as in Sect. 1.2, the log-likelihood contribution from a unit becomes

ℓ_CC = s_x s_y ln P(y | x, z).

Due to the conditioning on covariates, MAR definitions are no longer useful, as also pointed out by White and Carlin (2010). In fact, we can relax the R-MAR assumption and define C-MAR as

C-MAR: C ⊥⊥ Y | X, Z,

where C = S_x S_y is an indicator for being in the CC sample. In a longitudinal setting, Little (1995) calls this assumption covariate-dependent missingness (or dropout).
This condition allows missingness of X to depend on X itself, given the other variables. For instance, if X is income, then whether income is reported can depend on income (and other covariates). Comparing C-MAR with R-MAR shows that there are situations where CC regression is valid and (multivariate) IL methods are not. Specifically, whenever missingness of X or Y depends on X, IL methods will not be valid, but CC regression will be valid as long as missingness of X or Y does not depend on Y (given X). Figure 2 illustrates the scenario where X is likely to be missing when it is less than zero and never missing when it is greater than zero (and here there is no other covariate Z). The ordinary least squares regression line (in black) coincides with the true regression line (thick gray line) because the distribution of Y given X is the same in the selected sample as in the full sample, P(Y | X, S_x = 1) = P(Y | X), and because the selected sample is so large that the least squares estimate is very precise. Selection just thins out the scatterplot to the left of zero but keeps the conditional distribution intact.
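The Fig. 2 scenario can be mimicked in a few lines. In this sketch (Python; the coefficients and the 0.1 retention probability are hypothetical choices of ours), X is usually missing when X < 0, so missingness of the covariate depends on the covariate itself, yet OLS on the complete cases recovers the true slope and intercept:

```python
import numpy as np

# Hypothetical regression: Y = 2 + 1.5*X + error.
rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

# X is missing with high probability when x < 0: C-MAR holds, R-MAR does not.
s_x = np.where(x < 0, rng.random(n) < 0.1, True)
xs, ys = x[s_x], y[s_x]

# Ordinary least squares on the complete cases:
slope = np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)
intercept = ys.mean() - slope * xs.mean()
```

An IL method applied to the joint distribution of (X, Y) in this setting would not be consistent, because selection depends on the sometimes-missing X.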
It is important to note that, while selection is associated with Y here, it is independent of Y given X, and therefore satisfies the C-MAR condition. Mohan et al. (2013) formalize this way of reasoning by representing the missingness process via directed acyclic graphs (DAGs) that they call Missingness Graphs or m-graphs. Conditional independence relations can then be derived by d-separation (e.g., Pearl, 2009). Figure 3 [same as Figure 1(c) in Mohan et al. (2013)] is an m-graph that satisfies C-MAR. There is no Z here, and both X and Y are not always observed, as indicated by hollow circles. The variables S_x and S_y are caused by X, as shown by the paths from X to these variables, and they are fully observed (filled circles). The fully observed "proxy" variable X* equals X when the selection indicator S_x = 1 and equals a symbol for missing, such as "NA" or ".", otherwise. So X* is determined by the combination of X and S_x, as indicated by the two paths X → X* and S_x → X*, and similarly for Y*. The proxy variables and selection indicators are always observed and constitute the data. The question is whether we can estimate a given quantity or estimand (referred to as "query" by Mohan et al., 2013) from the data consistently. With the implicit assumption that all variables are categorical, Mohan et al. (2013) discuss estimation (or "recovery") of the joint distribution P(Y, X) or conditional distribution P(Y | X) from the observed data. It follows from the graph that (S_x, S_y) ⊥⊥ Y | X, so that P(Y | X) = P(Y* | X*, S_x = 1, S_y = 1). Therefore, we can recover the conditional distribution from the observed data by estimating it in the CC sample. However, we cannot recover P(X) to obtain the joint distribution because of the path X → S_x.
It has been pointed out frequently that CC regression is valid if missingness depends on the covariates, as long as it does not depend on the response variable given the covariates (e.g., Dardanoni et al., 2011; Jones, 1996; Little, 1992; Little & Rubin, 2020, p. 49; Seaman et al., 2013; Wooldridge, 2010, p. 796). Nevertheless, MCAR is often said to be necessary for valid CC regression (e.g., King et al., 2001; Molenberghs et al., 2004; Molenberghs & Kenward, 2007, p. 43). One reason for this confusion may be that covariates are sometimes not treated as random variables. For example, Diggle and Kenward (1994) define "completely random dropout" and "random dropout" in longitudinal data only in terms of whether dropout depends on current or previous values of the outcome variable (without conditioning on covariates). Another reason is that missingness that depends on covariates only is sometimes defined as MCAR (e.g., Daniels & Hogan, 2008, p. 92; Laird, 1988).
Believing that MCAR is necessary would erroneously lead to rejecting CC regression analysis based on the path from X to S_y in Fig. 3 (even if there is no path from X to S_x). Then relying on the A-MAR assumption would lead to adoption of IL inference for the multivariate model. However, such an approach will likely be inconsistent because it is not realistic that X affects selection of Y only when X is observed. CC regression, in contrast, would yield valid inferences, even with the additional path from X to S_x. That multiple imputation can be invalid when CC regression is valid does not appear to be widely known, although it has been pointed out repeatedly (e.g., Allison, 2000; Bartlett et al., 2014; Little & Zhang, 2011; White & Carlin, 2010).
Another common belief is that missingness of a covariate X in a regression model cannot depend on X itself given the other variables. This misconception appears to arise from falsely assuming that a univariate version of A-MAR must hold for each variable, that is, that for each variable V_i, missingness of V_i cannot depend on V_i itself, given all other variables. This assumption is clearly violated in the scenarios depicted in Figs. 2 and 3, which satisfy C-MAR and hence produce valid inferences for regression models. Both Enders (2010, pp. 11, 13) and Allison (2002, p. 4) define MAR in this univariate way and, when discussing that MAR is needed for ignorability, do not mention that this is so only for a multivariate model. Readers can find remarks elsewhere in these books that the univariate MAR assumption is not required for covariates in CC regression.

Hybrid CC and AA Analysis: Subsample Ignorable Likelihood
As discussed in Sect. 3.1, CC regression is consistent even if selection of any covariate in the model depends on the covariate itself, in contrast to inferences regarding the joint distribution of U via IL methods. Little and Zhang (2011) therefore suggest a hybrid approach. Denoting the subset of covariates suspected of affecting their own selection as W, they assume that C-MAR holds for these variables, S_w ⊥⊥ Y | W, X, Z, where the variables in Z are assumed to be always observed and the variables in X are partially observed covariates, assumed not to affect their own selection. The subsample of units with complete data for W is then analyzed using IL methods based on the likelihood for P(Y, X | Z, W), under the assumption that MAR or A-MAR holds for selection of X and Y, given W and S_W. Little and Zhang (2011) write the assumption as P(S_x, S_y | Z, W, X, Y, S_W) = P(S_x, S_y | Z, W, X^obs, Y^obs, S_W). Note that this hybrid approach can also be viewed as an example of Strategy 2 to discard more data.

Fixed Instead of Random Effects for Longitudinal or Clustered Data
We now consider longitudinal data where units j = 1, …, N are observed at n_j occasions i = 1, …, n_j. The variables Y_ij and X_ij are time-varying and Z_j is time-invariant. A linear random-intercept model can be written as

Y_ij = α + β X_ij + γ Z_j + ζ_j + ε_ij,     (3)

where ζ_j is a random intercept or latent variable and ε_ij an error term. Typically, it is assumed that ζ_j ∼ N(0, ψ) and ε_ij ∼ N(0, θ). Associated with each variable is a selection indicator S_yij, S_xij, and S_zj. The same model can also be used for cross-sectional clustered data, but we will use longitudinal-data terminology for concreteness.
Let C_ij = S_yij S_xij S_zj be the complete "case" indicator (where a "case" is a unit-occasion combination), taking the value 1 if all variables in the model are observed for unit j at occasion i and zero otherwise. We use vectors for the variables associated with a subject j across all n_j occasions, C_j = (C_1j, …, C_njj)′, W_j = (Z_j, X_1j, …, X_njj)′, and Y_j = (Y_1j, …, Y_njj)′. Then C-MAR becomes

C-MAR: C_j ⊥⊥ Y_j | W_j.

Again, this is covariate-dependent missingness in the sense of Little (1995). Selection cannot depend on ζ_j because this latent variable is always missing, and we are not conditioning on it.
If selection depends on ζ_j, we can adopt a fixed-effects approach. We now treat ζ_j as fixed by using indicator (or dummy) variables I_rj for units j (with I_rj = 1 if r = j and I_rj = 0 otherwise) and omitting the intercept α:

Y_ij = β X_ij + Σ_{r=1}^{N} ζ_r I_rj + ε_ij.

The coefficient γ of the time-invariant covariate Z_j cannot be estimated because Z_j is perfectly collinear with the dummy variables for the units. Selection based on ζ_j now becomes selection based on the covariates I_rj, and β can be estimated consistently (see also Verbeek & Nijman, 1992). The requirement for valid inference now becomes

C-MAR*: C_j ⊥⊥ Y_j | W_j, ζ_j.

Another advantage of the fixed-effects approach is that it controls for all possible known and unknown time-invariant confounders (e.g., Skrondal & Rabe-Hesketh, 2022). Adopting a fixed-effects estimator for β while not obtaining any inferences for γ and ψ can also be viewed as an example of Strategy 3, protective estimation, discussed in Sect. 5. There we describe the standard fixed-effects estimator for random-intercept logistic regression, which is the conditional maximum likelihood estimator. That estimator is valid under C-MAR*. Furthermore, modifying the model and/or discarding more data produces protective estimators under several MNAR mechanisms.
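A small simulation illustrates why the fixed-effects approach works. In the sketch below (Python; all parameter values are hypothetical), the probability that a unit-occasion is selected depends on ζ_j, violating C-MAR, yet the within (fixed-effects) estimator of β, computed by demeaning within units over their selected occasions, remains consistent:

```python
import numpy as np

# Hypothetical random-intercept model: Y_ij = 1 + 0.7*X_ij + zeta_j + eps_ij.
rng = np.random.default_rng(3)
N, T = 2000, 6
beta = 0.7
zeta = rng.normal(size=N)
x = rng.normal(size=(N, T))
y = 1.0 + beta * x + zeta[:, None] + rng.normal(size=(N, T))

# Selection of each unit-occasion depends on zeta_j: violates C-MAR.
keep = rng.random((N, T)) < 1.0 / (1.0 + np.exp(-2.0 * zeta))[:, None]

num = den = 0.0
for j in range(N):
    k = keep[j]
    if k.sum() >= 2:                      # need at least two selected occasions
        xd = x[j, k] - x[j, k].mean()     # demeaning sweeps out zeta_j
        yd = y[j, k] - y[j, k].mean()
        num += xd @ yd
        den += xd @ xd
b_fe = num / den
```

The within transformation is numerically equivalent to including the unit dummies, so selection on ζ_j becomes selection on included covariates.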
Interestingly, when selection S_j of units j (instead of unit-occasion combinations) depends on ζ_j and not on the covariates W_j, i.e., when S_j ⊥⊥ (Y_j, W_j) | ζ_j, the maximum likelihood estimator for the random-intercept model in (3) is consistent for the regression coefficients. The reason is that selection alters only the latent variable distribution, P(ζ_j | S_j = 1) ≠ P(ζ_j), and not the conditional response distribution, P(Y_j | W_j, ζ_j, S_j = 1) = P(Y_j | W_j, ζ_j), and consistency of the regression coefficients does not rely on correct specification of the random-effects distribution in linear mixed models (Verbeke & Lesaffre, 1997). When the model is modified to a common factor model, by replacing ζ_j by λ_i ζ_j (with λ_1 = 1), replacing α by α_i, and removing the covariates, the maximum likelihood estimator of the factor loadings λ_i is also consistent when S_j ⊥⊥ Y_j | ζ_j. This result is closely related to factorial invariance (e.g., Meredith, 1964). As pointed out by Skrondal and Rabe-Hesketh (2004, p. 56), consistency requires that anchoring (setting a factor loading to 1) is used for identification instead of factor standardization (setting the variance of the factor to 1) because the variance of the latent variable is different in the selected sample. Choosing anchoring to obtain consistent estimates of the factor loadings can therefore also be seen as a form of protective estimation.

Strategy 2: Discard More Data
In this section, we return to the scenario with three variables X, Z, and Y, where Z is always observed, whereas X and Y are not always observed, and we are interested in a model for P(X, Z, Y). Even if we are interested only in the parameters governing P(Y | Z, X), we may want to model the joint distribution by IL methods, making use of AA data, because it is more efficient than CC regression (e.g., Little & Schluchter, 1985). Now R-MAR is required for valid frequentist inference. However, we consider two missingness processes that violate R-MAR and show that we can still obtain valid frequentist inference by discarding more data to make the process A-MAR before proceeding with IL inference.

MNAR-X: X Affects Selection of Y
Consider the m-graph in the left panel of Fig. 4 with proxy variables not shown. Here, the DAG for X, Z, and Y is compatible with the SEM in Fig. 1, but could correspond to many other statistical models because DAGs are nonparametric. This graph is not strictly a DAG because there is a double-headed arrow between X and Z, but this arrow could be replaced by a latent variable node with paths to both X and Z. R-MAR is violated because of the path X → S_y. However, C-MAR is satisfied because (S_x, S_y) ⊥⊥ Y | X, Z, so we could perform CC regression. But if we would like to estimate the joint distribution P(X, Z, Y), IL methods will not be valid.

M-MAR
It turns out that IL methods become valid if we discard Y when S_x = 0, with corresponding modified missingness indicator

Ṡ_y = S_x S_y.

This does not mean deleting units when X is missing, but just making Y missing for the units with missing X (but retaining Z for these units). We now show that the process for Ṡ_y satisfies A-MAR by factorizing the joint probability of the selection indicators as

P(S_x, Ṡ_y | U) = P(S_x | U) P(Ṡ_y | S_x, U),

where the first term is P(S_x | U) = P(S_x | Z) and the second term depends on X only through its observed values. We see that P(Ṡ_y | S_x, U) = P(Ṡ_y | S_x, X^obs, Z), so the following condition is satisfied:

M-MAR: P(S_x, Ṡ_y | U) = P(S_x | Z) P(Ṡ_y | S_x, X^obs, Z) = P(S_x, Ṡ_y | U^obs).

The idea is that we allow selection of Y to depend on X if X is selected/observed, but when X is missing, we make selection of Y impossible so that it no longer depends on the unobserved X. The M-MAR (make MAR) condition is satisfied because we made it so by data deletion. We can think of the selection process as a natural process, represented in the left panel of Fig. 4, followed by deletion of Y when X is missing by the data analyst. It does not matter for inference that part of the process is man-made. If we imagine that data analysts will behave this way in repeated samples, we have A-MAR and frequentist IL inference is therefore valid. Figure 1(d) in Mohan et al. (2013) corresponds to the m-graph in the left panel of Fig. 4 with Z removed. Applying their approach (in their Example 3) to our situation, the joint distribution can be factorized as follows:

P(X, Z, Y) = P(Z) P(X | Z) P(Y | X, Z).     (6)

Ordered Factorization
Then the terms are estimated sequentially as follows:

Step 1: Estimate P(Z) by using all units, because Z is never missing.

Step 2: Estimate P(X | Z) by using only those units with S_x = 1 (i.e., deleting units with S_x = 0). This is valid because S_x ⊥⊥ X | Z, so that P(X | Z) = P(X* | Z, S_x = 1).

Step 3: Estimate P(Y | X, Z) by using only units with S_x S_y = 1, i.e., pruning the dataset further by deleting units with S_y = 0. This is valid because (S_x, S_y) ⊥⊥ Y | X, Z, so that P(Y | X, Z) = P(Y* | X*, Z, S_x = 1, S_y = 1). This last step corresponds to CC regression and is justified because C-MAR is satisfied. Mohan et al. (2013) point out that the deletion order matters. Units with missing X are deleted in Step 2, followed by deletion of further units, with missing Y, in Step 3.
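These steps, and the equivalent M-MAR deletion, can be checked in a simulation. The sketch below (Python; all probabilities are hypothetical choices of ours) generates the MNAR-X mechanism, forms Ṡ_y = S_x S_y, and estimates the factors P(Z), P(X | Z), and P(Y | X, Z) from exactly the subsamples described above:

```python
import numpy as np

# Hypothetical MNAR-X mechanism with binary X, Z, Y.
rng = np.random.default_rng(4)
n = 400_000
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.7, 0.3)
y = rng.random(n) < (0.2 + 0.4 * x + 0.2 * z)     # P(Y=1 | X, Z)

s_x = rng.random(n) < np.where(z, 0.9, 0.6)       # depends on Z only
s_y = rng.random(n) < np.where(x, 0.8, 0.4)       # depends on X: violates R-MAR
s_y_dot = s_x & s_y                               # M-MAR: discard Y when X missing

# Sequential / IL estimates from the pruned data recover the true factors:
p_z1 = z.mean()                                   # P(Z=1), true value 0.5
p_x1_z1 = x[s_x & z].mean()                       # P(X=1 | Z=1), true value 0.7
p_y1_x1z1 = y[s_y_dot & x & z].mean()             # P(Y=1 | X=1, Z=1), true value 0.8
```

Without the deletion, units with s_x = 0 and s_y = 1 would also enter the AA-data likelihood, which is exactly the biased contribution discussed next.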

M-MAR Versus Ordered Factorization for MNAR-X
We can consider the contribution of a unit to the AA-data log-likelihood after discarding Y when X is missing. Replacing s_y in (2) by ṡ_y, the third term disappears, (1 − s_x) ṡ_y ln P(z, y) = 0, because (1 − s_x) is nonzero only when s_x = 0, and in this case ṡ_y = 0. We therefore have

ℓ̇ = s_x ṡ_y ln P(x, z, y) + s_x (1 − ṡ_y) ln P(x, z) + (1 − s_x)(1 − ṡ_y) ln P(z).

Using the factorization in (6), we can rewrite this log-likelihood contribution as

ℓ̇ = ln P(z) + s_x ln P(x | z) + s_x ṡ_y ln P(y | x, z).
We can see that information about P(Z) comes from all units, information about P(X | Z) comes only from the subset of units with s_x = 1, and information about P(Y | X, Z) comes only from the subset of units with both s_x = 1 and s_y = 1, exactly as in the sequential estimation proposed by Mohan et al. (2013). Factorization such as shown in (6) also facilitates AA-data maximum likelihood estimation (e.g., Anderson, 1957; Marini et al., 1980), and for this reason (not for achieving consistency) it has been suggested to discard data (Marini et al., 1980, p. 333). It is instructive to consider why it is necessary to discard values of Y when X is missing, or why including the third term from (2), namely (1 − s_x) s_y ln P(z, y), in the log-likelihood would lead to inconsistent estimation. Units with s_x = 0 and s_y = 1 contribute to this term, but P(Z, Y | S_x = 0, S_y = 1) ≠ P(Z, Y) because S_y is a collider in the graph, so conditioning on it creates a new backdoor path between Z and Y through X and therefore corrupts the joint distribution.
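This argument can also be verified numerically. In the sketch below (Python; all probabilities are hypothetical), using the units with s_x = 0 and s_y = 1 to estimate the distribution of (Z, Y) is visibly biased, because conditioning on S_y tilts the distribution of the unobserved X:

```python
import numpy as np

# Hypothetical MNAR-X mechanism with binary X, Z, Y.
rng = np.random.default_rng(5)
n = 400_000
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.7, 0.3)
y = rng.random(n) < (0.2 + 0.4 * x + 0.2 * z)

s_x = rng.random(n) < np.where(z, 0.9, 0.6)       # depends on Z only
s_y = rng.random(n) < np.where(x, 0.8, 0.4)       # depends on X

# The problematic pattern: X missing, Y observed.
pattern = (~s_x) & s_y
naive = y[pattern & z].mean()   # estimates P(Y=1 | Z=1) from the bad pattern
truth = y[z].mean()             # P(Y=1 | Z=1), population value 0.68
bias = naive - truth            # positive: S_y = 1 tilts the sample toward X = 1
```

Because P(S_y = 1 | X = 1) exceeds P(S_y = 1 | X = 0), the pattern over-represents X = 1, and hence high Y, within each level of Z.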
The M-MAR approach is preferable to sequential estimation whenever the goal is to estimate parameters of a parametric model. After deleting Y when X is missing, estimation can be performed straightforwardly using standard software for IL methods, such as AA-data maximum likelihood estimation, and standard error estimates are produced as a byproduct. In contrast, Mohan et al. (2018) use m-graphs to derive sequential estimators for parameters of linear SEMs. Their estimators of regression coefficients are sums of products of estimators of variances and other path coefficients and require complex algorithms to evaluate sequentially. Estimation of standard errors requires further work, such as a delta method or resampling approaches.

MNAR-Y
For MNAR-Y, shown in the right panel of Fig. 4, the problem is that selection of X depends on Y, but Y is not always observed. It is clear that CC regression cannot be used to estimate $P(Y \mid X, Z)$ because $S_x \not\perp\!\!\!\perp Y \mid X, Z$, violating C-MAR.

M-MAR
The M-MAR solution here is to delete X when Y is missing, with corresponding modified missingness indicator $\dot{S}_x = S_x S_y$. We can factorize the joint probability of the selection indicators as
$$P(\dot{S}_x, S_y \mid U) = P(S_y \mid U)\, P(\dot{S}_x \mid S_y, U),$$
where $P(S_y \mid U) = P(S_y \mid Z)$ and $P(\dot{S}_x \mid S_y, U) = P(\dot{S}_x \mid S_y, Y_{\mathrm{obs}}, Z)$, so that M-MAR holds:
$$\text{M-MAR:}\quad P(\dot{S}_x, S_y \mid U) = P(S_y \mid Z)\, P(\dot{S}_x \mid S_y, Y_{\mathrm{obs}}, Z) = P(\dot{S}_x, S_y \mid U_{\mathrm{obs}}).$$
The term omitted from the AA-data log-likelihood is $\dot{s}_x (1 - s_y) \ln P(x, z)$, which vanishes because $\dot{s}_x = 0$ whenever $s_y = 0$. This term is problematic because conditioning on $S_x$ produces an additional path between X and Z through Y.

Ordered Factorization
For MNAR-Y, the ordered factorization approach by Mohan et al. (2013) is based on the factorization $P(Z)P(Y \mid Z)P(X \mid Z, Y)$. Unfortunately, the conditional distribution of interest $P(Y \mid X, Z)$ does not appear directly, but it can be derived from the joint distribution by dividing it by $P(X, Z)$ if $P(X, Z) > 0$. Note that we cannot obtain $P(X, Z)$ directly because $P(X, Z) \neq P(X, Z \mid S_x = 1)$, but we can obtain $P(X, Z)$ by marginalizing the joint distribution. In practice, the marginalization will not be straightforward and the resulting distribution may not be a closed-form function of the model parameters of interest. In contrast, M-MAR remains as easy to implement as for MNAR-X and will directly yield estimates of the parameters of interest if the joint distribution is parameterized in terms of $P(Y \mid X, Z)$ as in Fig. 1. Therefore, M-MAR becomes the method of choice for MNAR-Y.
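As a small numerical illustration of this division step (toy probabilities chosen by us, all variables binary), we can build the joint distribution from the ordered factorization $P(Z)P(Y \mid Z)P(X \mid Z, Y)$, marginalize over Y to obtain $P(X, Z)$, and divide:

```python
import numpy as np

# Toy discrete example (hypothetical numbers) of recovering P(Y|X,Z)
# from the ordered factorization P(Z) P(Y|Z) P(X|Z,Y).
p_z = np.array([0.6, 0.4])                       # P(Z=z), z = 0, 1
p_y_z = np.array([[0.7, 0.3], [0.2, 0.8]])       # P(Y=y|Z=z), rows index z
p_x_zy = np.array([[[0.9, 0.1], [0.5, 0.5]],
                   [[0.6, 0.4], [0.1, 0.9]]])    # P(X=x|Z=z,Y=y), axes [z][y][x]

# Joint P(z, y, x) from the ordered factorization:
joint = p_z[:, None, None] * p_y_z[:, :, None] * p_x_zy  # axes: z, y, x

# Marginalize over Y to get P(X, Z), then divide the joint by it:
p_xz = joint.sum(axis=1)                # axes: z, x
p_y_xz = joint / p_xz[:, None, :]       # P(y | x, z), axes: z, y, x
print(p_y_xz[1, 1, 0])                  # P(Y=1 | X=0, Z=1)
```

In this discrete toy case the marginalization is a one-line sum; with continuous variables or a parametric model it would generally require numerical integration, which is the practical obstacle noted above.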
For a model with three variables, we have considered two different R-MAR violations and shown how we can make the missingness A-MAR. With more variables, a general approach would be to identify, for each variable V, which other variables have direct paths to $S_v$. If any of these variables are missing for a unit i, discard $V_i$. This approach presupposes substantive understanding of the missingness mechanisms and may lead to a considerable loss of data. An alternative approach would be to check whether it is possible to sort the variables so that the missingness pattern is approximately monotone, in the sense that earlier variables are rarely missing for a unit if later variables are not missing for the unit. The next step would be to assess whether it is justifiable to assume that selection of each variable is independent of subsequent variables given the previous variables and their selection indicators. If this does not appear reasonable for a given variable, the variable should be placed later in the sequence as needed. The final step would be to make the missingness monotone. If there are covariates that affect their own selection, we can condition on those variables in the IL method, as described in Sect. 3.2.
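The sorting step can be sketched as follows (a hypothetical helper, not from the paper): order the variables from most to least frequently observed and count how many observed values would have to be discarded because they occur after a unit's first missing value, i.e., how far the pattern is from monotone:

```python
import numpy as np

def monotone_violations(s):
    """Count cells that break monotonicity for a given column order:
    a violation is an observed value occurring after the unit's first
    missing value.  s is an (n_units, n_vars) 0/1 matrix of selection
    indicators with columns in the candidate order."""
    has_miss = (s == 0).any(axis=1)
    first_miss = np.where(has_miss, (s == 0).argmax(axis=1), s.shape[1])
    col = np.arange(s.shape[1])
    return int(((col[None, :] > first_miss[:, None]) & (s == 1)).sum())

# Hypothetical selection indicators for 4 units and 3 variables:
s = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 1],    # observed value after a missing one: 1 violation
              [0, 0, 0]])
order = np.argsort(s.mean(axis=0))[::-1]   # most-observed variables first
print(monotone_violations(s[:, order]))
```

A count of zero means the pattern is already monotone under that ordering; a small count suggests that making the data monotone (the final step above) discards little information.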

Making Longitudinal Data Monotone
Returning to a longitudinal setting with the notation of Sect. 3.3, we consider the scenario where the CC indicator $C_{ij}$, for unit j being a complete "case" at occasion i with $X_{ij}$ and $Y_{ij}$ (as well as $Z_j$) observed, depends on the unit's outcomes at previous occasions. Then A-MAR is violated because those previous outcomes may be missing, unless the missingness patterns are (always) monotone, as shown in Fig. 5. Here rows represent units j, sorted by the occasion at which missing data first occur, and the rectangles for $(X_{ij}, Y_{ij})$, $i = 1, 2, 3, 4$, enclose all units with complete data at occasions 1, 2, 3, and 4.
When missingness is not monotone, we propose making it monotone. This means deleting $Y_{ij}$ if any previous outcome for the unit is missing, that is, replacing $S_{ij}$ by $\dot{S}_{ij} = \prod_{i' \le i} S_{i'j}$. As in previous subsections, we are exploiting the fact that A-MAR allows for dependencies among selection indicators, and that we can manufacture part of the selection process ourselves. For example, consider the case where selection depends on the previous outcome. Then the new selection mechanism depends only on outcomes that are observed and satisfies A-MAR. Interestingly, the fact that A-MAR allows missingness to depend on other responses for the same unit is often mentioned in the longitudinal data literature, but the point that this requires monotone missingness is rarely mentioned, an exception being Schafer and Graham (2002). In longitudinal data, it is possible that only monotone patterns can occur because having a missing value at an occasion means that the unit has dropped out and cannot re-enter the study. When there are no such barriers to re-entering the study, it is difficult to think of a natural selection mechanism where the previous response causes missingness only when it is observed. Therefore, the A-MAR property cannot be assumed to hold even if the realized missingness pattern is monotone, unless the data analyst imagines that she would make the data monotone in repeated samples.
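In code, manufacturing monotone missingness amounts to a running product of the selection indicators over occasions (a minimal sketch with made-up patterns):

```python
import numpy as np

# Rows are units, columns are occasions; 1 = observed, 0 = missing.
s = np.array([[1, 1, 0, 1],    # unit 1: re-enters after dropping out
              [1, 0, 1, 1],    # unit 2: intermittent missingness
              [1, 1, 1, 1]])   # unit 3: complete

# Delete Y_ij whenever any earlier outcome for the unit is missing:
s_dot = np.cumprod(s, axis=1)  # monotone by construction
print(s_dot)
```

After the transformation, each unit's indicator sequence can only step from 1 to 0, never back, so the modified missingness pattern is monotone for every unit.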
As mentioned at the end of Sect. 3.3, regression coefficients (or factor loadings) can be estimated consistently in linear mixed models (or factor models) if selection $S_j$ of units depends on the random effects (or latent variables), as long as $S_j \perp\!\!\!\perp (Y_j, W_j) \mid \zeta_j$. This result does not hold when there is item non-response, where $C_{ij}$ can be 1 for some items (or occasions) i and 0 for other items for the same unit j. In this latter situation, we can convert item non-response to unit non-response by dropping units with $\prod_i C_{ij} = 0$, so that $\dot{S}_j = \prod_i C_{ij}$. Then consistency is achieved under the assumption that $C_j \perp\!\!\!\perp (Y_j, W_j) \mid \zeta_j$.

Logistic Regression
Strategy 3 is best introduced for logistic regression, for simplicity with a single covariate $X_i$,
$$\mathrm{logit}\{P(Y_i = 1 \mid x_i)\} = \alpha + \beta x_i.$$
In a case-control study, controls (with $Y_i = 0$) are undersampled relative to cases (with $Y_i = 1$), also known as outcome-based or retrospective sampling, and selection into the CC sample, $C_i = S_{x_i} S_{y_i}$, therefore depends on $Y_i$:
$$P(C_i = 1 \mid y_i) = \pi(y_i).$$
The model for the CC sample becomes
$$P(Y_i = 1 \mid x_i, C_i = 1) = \frac{\pi(1)\exp(\alpha + \beta x_i)}{\pi(0) + \pi(1)\exp(\alpha + \beta x_i)}.$$
It follows that
$$\mathrm{logit}\{P(Y_i = 1 \mid x_i, C_i = 1)\} = \alpha + \ln\{\pi(1)/\pi(0)\} + \beta x_i,$$
so the log odds ratio, $\beta$, is estimated consistently by maximum likelihood, whereas the estimator of the intercept $\alpha$ converges to $\alpha^* = \alpha + \ln\{\pi(1)/\pi(0)\}$. The intercept can be estimated consistently only if $\pi(1)/\pi(0)$ is either known (e.g., by design) or can be consistently estimated, in which case $\ln\{\hat{\pi}(1)/\hat{\pi}(0)\}$ can be included in the logistic regression model as an offset. This result is well known for case-control designs (e.g., Breslow, 1996).
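A simulation sketch illustrates this result (hypothetical parameter values; the fit uses a bare-bones Newton-Raphson rather than standard software): outcome-dependent selection leaves the slope intact, shifts the intercept by $\ln\{\pi(1)/\pi(0)\}$, and a known offset restores it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta = 400_000, -1.0, 0.8   # assumed true parameter values

# Prospective population: logit P(Y=1|x) = alpha + beta*x
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(alpha + beta * x))))

# Case-control selection: keep cases with probability pi(1), controls with pi(0)
pi0, pi1 = 0.1, 0.9
c = rng.binomial(1, np.where(y == 1, pi1, pi0))
xs, ys = x[c == 1], y[c == 1]

def fit_logit(x, y, offset=0.0, iters=25):
    """Maximum likelihood for logit P(y=1|x) = offset + a + b*x, Newton-Raphson."""
    a = b = 0.0
    X = np.column_stack([np.ones_like(x), x])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-(offset + a + b * x)))  # fitted probabilities
        w = mu * (1 - mu)                             # IRLS weights
        step = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - mu))
        a, b = np.array([a, b]) + step
    return a, b

a_cc, b_cc = fit_logit(xs, ys)                              # naive CC fit
a_off, b_off = fit_logit(xs, ys, offset=np.log(pi1 / pi0))  # known offset included
print(b_cc, a_cc, a_off)  # b_cc near beta; a_cc near alpha + ln(pi1/pi0); a_off near alpha
```

The slope estimate is "protected" without any correction; only the intercept needs the offset $\ln(\pi(1)/\pi(0))$, here assumed known by design.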

Fixed-Effects Logistic Regression for Longitudinal Data
As in Sect. 3.3, we consider a random-intercept model for clustered or longitudinal data, but now with a logit link for a binary outcome variable:
$$\mathrm{logit}\{P(Y_{ij} = 1 \mid X_{ij}, Z_j, \zeta_j)\} = \alpha + \beta X_{ij} + \gamma Z_j + \zeta_j. \tag{8}$$
Again, we could replace $\zeta_j$ by a fixed effect to be able to relax the C-MAR requirement to C-MAR* defined in (4), where $\zeta_j$ can directly affect selection. Because of an incidental parameter problem, the fixed-effects estimator is not obtained by including indicator variables for the units as in Sect. 3.3, but by conditional maximum likelihood estimation. The contribution from unit j to the conditional likelihood, given the sum of the outcomes for the unit, $\tau_j = \sum_i Y_{ij}$, is
$$P(\mathbf{y}_j \mid \tau_j, \mathbf{X}_j, Z_j, \zeta_j) = \frac{\exp\left(\sum_i \beta x_{ij} y_{ij}\right)}{\sum_{\mathbf{d}_j \in B_j} \exp\left(\sum_i \beta x_{ij} d_{ij}\right)}, \tag{9}$$
where $B_j = \{\mathbf{d}_j = (d_{1j}, \ldots, d_{n_j j})' \mid d_{ij} = 0 \text{ or } 1, \text{ and } \sum_i d_{ij} = \tau_j\}$, or in words, $B_j$ is the set of all vectors of length $n_j$ with binary elements that sum to $\tau_j$. This set can be obtained by permuting the elements of $\mathbf{y}_j$. Note that the between-unit component of the model, $\alpha + \gamma Z_j + \zeta_j$, cancels out due to conditioning on the sufficient statistic $\tau_j$. When there are missing data, we let $I_j$ be the set of occasions for unit j when outcomes are observed and redefine $B_j$ as the set of binary vectors with elements $d_{ij}$, $i \in I_j$, that sum to $\tau_j = \sum_{i \in I_j} y_{ij}$. The conditional likelihood contribution from unit j, conditioning also on the vector of selection indicators $\mathbf{C}_j$, is
$$\frac{\exp\left(\sum_{i \in I_j} \beta x_{ij} y_{ij}\right) \int P(\mathbf{C}_j \mid \mathbf{y}_j^{\mathrm{obs}}, \mathbf{y}_j^{\mathrm{mis}}, \mathbf{W}_j, \zeta_j)\, dP(\mathbf{y}_j^{\mathrm{mis}} \mid \mathbf{X}_j, Z_j, \zeta_j)}{\sum_{\mathbf{d}_j \in B_j} \exp\left(\sum_{i \in I_j} \beta x_{ij} d_{ij}\right) \int P(\mathbf{C}_j \mid \mathbf{d}_j, \mathbf{y}_j^{\mathrm{mis}}, \mathbf{W}_j, \zeta_j)\, dP(\mathbf{y}_j^{\mathrm{mis}} \mid \mathbf{X}_j, Z_j, \zeta_j)}. \tag{10}$$
If selection does not depend on observed outcomes, given missing outcomes and random intercepts, $P(\mathbf{C}_j \mid \mathbf{Y}_j^{\mathrm{obs}}, \mathbf{Y}_j^{\mathrm{mis}}, \mathbf{W}_j, \zeta_j) = P(\mathbf{C}_j \mid \mathbf{Y}_j^{\mathrm{mis}}, \mathbf{W}_j, \zeta_j)$, then the integrals in the numerator and denominator are identical and we obtain the standard conditional likelihood in (9).
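The conditional likelihood in (9) can be evaluated directly by enumerating $B_j$. The sketch below (illustrative numbers of our own) also checks the cancellation property: adding any between-unit constant to the linear predictor leaves the contribution unchanged:

```python
import numpy as np
from itertools import combinations

def cond_lik(y, eta):
    """Conditional likelihood contribution for one unit, as in (9):
    the probability of the observed 0/1 outcome vector y given its sum
    tau_j, where eta holds the linear predictors for the occasions and
    B_j is enumerated as all 0/1 vectors with the same length and sum."""
    n, tau = len(y), int(sum(y))
    num = np.exp(np.dot(eta, y))
    den = 0.0
    for idx in combinations(range(n), tau):   # members of B_j
        d = np.zeros(n)
        d[list(idx)] = 1
        den += np.exp(np.dot(eta, d))
    return num / den

# Illustrative values (not from the paper):
beta = 0.8
x = np.array([0.2, -1.0, 1.5, 0.4])
y = np.array([1, 0, 1, 0])

p1 = cond_lik(y, beta * x)          # within-unit part of the linear predictor only
p2 = cond_lik(y, 3.7 + beta * x)    # adding alpha + gamma*Z_j + zeta_j = 3.7, say
print(p1, p2)                       # equal: the between-unit component cancels
```

The constant multiplies numerator and denominator by the same factor $\exp(3.7\,\tau_j)$, which is exactly why conditioning on $\tau_j$ eliminates $\alpha + \gamma Z_j + \zeta_j$.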
If selection depends on the current outcome only,
$$P(\mathbf{C}_j \mid \mathbf{Y}_j^{\mathrm{obs}}, \mathbf{Y}_j^{\mathrm{mis}}, \mathbf{W}_j, \zeta_j) = \prod_{i \in I_j} P(S_{ij} = 1 \mid y_{ij}) \prod_{i \notin I_j} P(S_{ij} = 0 \mid y_{ij}^{\mathrm{mis}}),$$
the integrals in the numerator and denominator of (10) become
$$\int \Bigl[\prod_{i \in I_j} P(S_{ij} = 1 \mid y_{ij})\Bigr] \prod_{i \notin I_j} P(S_{ij} = 0 \mid y_{ij}^{\mathrm{mis}})\, dP(\mathbf{y}_j^{\mathrm{mis}} \mid \mathbf{X}_j, Z_j, \zeta_j)$$
and
$$\int \Bigl[\prod_{i \in I_j} P(S_{ij} = 1 \mid d_{ij})\Bigr] \prod_{i \notin I_j} P(S_{ij} = 0 \mid y_{ij}^{\mathrm{mis}})\, dP(\mathbf{y}_j^{\mathrm{mis}} \mid \mathbf{X}_j, Z_j, \zeta_j),$$
respectively. Taking the first product in square brackets out of each integral, the ratio of these integrals becomes the ratio of the products in square brackets, giving
$$= \frac{\prod_{i \in I_j} \exp\bigl(\{\ln[\pi_i(1)/\pi_i(0)] + \beta x_{ij}\}\, y_{ij}\bigr)}{\sum_{\mathbf{d}_j \in B_j} \prod_{i \in I_j} \exp\bigl(\{\ln[\pi_i(1)/\pi_i(0)] + \beta x_{ij}\}\, d_{ij}\bigr)},$$
where $\pi_i(y) \equiv P(S_{ij} = 1 \mid Y_{ij} = y)$.
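The cancellation argument can be verified numerically (with made-up selection probabilities $\pi_i(1)$ and $\pi_i(0)$): the ratio of selection-weighted terms coincides with the conditional likelihood evaluated with $\ln[\pi_i(1)/\pi_i(0)]$ added to $\beta x_{ij}$:

```python
import numpy as np
from itertools import combinations

beta = 0.5
x = np.array([0.3, -0.7, 1.2])     # covariates at the observed occasions i in I_j
y = np.array([1, 0, 1])            # observed outcomes, so tau_j = 2
pi1 = np.array([0.9, 0.8, 0.7])    # pi_i(1) = P(S_ij = 1 | Y_ij = 1), assumed values
pi0 = np.array([0.4, 0.5, 0.6])    # pi_i(0) = P(S_ij = 1 | Y_ij = 0)

def members(n, tau):
    """All 0/1 vectors of length n summing to tau (the set B_j)."""
    for idx in combinations(range(n), tau):
        d = np.zeros(n)
        d[list(idx)] = 1
        yield d

def weight(d):
    """prod_i pi_i(d_i): selection probability of the observed occasions."""
    return float(np.prod(np.where(d == 1, pi1, pi0)))

tau = int(y.sum())
# Ratio of selection-weighted terms (the common integral over missing outcomes
# has already cancelled between numerator and denominator):
num = weight(y) * np.exp(beta * np.dot(x, y))
den = sum(weight(d) * np.exp(beta * np.dot(x, d)) for d in members(len(y), tau))
direct = num / den

# Equivalent "protective" form: conditional likelihood with offsets ln(pi1/pi0):
off = np.log(pi1 / pi0)
num2 = np.exp(np.dot(off + beta * x, y))
den2 = sum(np.exp(np.dot(off + beta * x, d)) for d in members(len(y), tau))
offset_form = num2 / den2
print(direct, offset_form)
```

The equality holds because $\prod_i \pi_i(d_{ij})$ factors as a constant $\prod_i \pi_i(0)$, common to all terms, times $\exp\{\sum_i \ln[\pi_i(1)/\pi_i(0)]\, d_{ij}\}$.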
Then we can be protective of $\beta$ by including occasion-specific intercepts $\alpha_i$, representing $\alpha + \ln[\pi_i(1)/\pi_i(0)]$, in the original model in (8). Skrondal and Rabe-Hesketh (2014) show that if selection $C_{ij}$ at occasion i depends on the outcome $Y_{i-1,j}$ at the previous occasion, a consistent estimator for $\beta$ is obtained either by analyzing complete units (across time, $\prod_i C_{ij} = 1$) only and including occasion-specific intercepts $\alpha_i$, or by allowing the occasion-specific intercepts to take on different values for different missingness patterns across time (through interactions between indicators for occasions and indicators for the missingness patterns). If selection depends on both the previous and current outcomes, a consistent estimator for $\beta$ is obtained by analyzing complete units (across time) with sum of outcomes equal to $\tau_j = 1$ or $\tau_j = n - 1$ and allowing the occasion-specific intercepts to take different values for $\tau_j = 1$ and $\tau_j = n - 1$.