When parental genotypes are missing at random (MAR) in case-parent trio studies, Clayton [1] and Weinberg [2] suggested a partial-score test and a likelihood ratio test, respectively, to deal with such data. Under the same situation, Sun et al. [3] introduced the 1-TDT, a transmission/disequilibrium test (TDT)-type test based on a set of non-iterative estimates of the genotype relative risk (GRR) [4]. Recently, the expectation maximization-haplotype relative risk (EM-HRR) test proposed by Guo et al. [5] extended the HRR test [6] to accommodate trios with one or no parental genotypes, and it outperforms the 1-TDT in a homogeneous population. However, these tests may be invalid when the MAR assumption is violated, that is, when missingness is non-ignorable because, for example, the missing pattern of parental genotypes is related to the disease under study.

To assure a valid test for association between a marker and a putative disease locus under non-ignorable missingness (NIM), Allen et al. [7] introduced a testing procedure based on the joint likelihood of the genotypes of the proband and the observed parents, conditioning on the proband's phenotype and parental missingness pattern. Still, the validity of their method under population stratification is not guaranteed, because it depends on whether the missingness model is suitably specified or not. Therefore, Chen [8] proposed another TDT-type approach based on the conditional likelihood of the proband's genotype given the number and, if any, genotypes of available parents, as well as the proband's phenotype to assure the validity of testing for association between a candidate gene and a disease.

The cost of accounting for NIM is a loss of power under MAR: as indicated by Allen et al. [7], their procedure is less powerful than the 1-TDT when data are MAR. Their results also suggested that, under NIM, the 1-TDT performs better than the tests proposed by Clayton [1] and Weinberg [2], because the type I error of the 1-TDT is less inflated over the nominal level. In addition, the 1-TDT is a valid test if the NIM is a result of population stratification, while the methods of Clayton [1] and Weinberg [2] are not. Hence, the 1-TDT is preferred among those tests for incomplete trios that require the MAR assumption. Because the relative performance of the 1-TDT and EM-HRR under NIM is unknown, we examined the performance of the two tests using Genetic Analysis Workshop 14 (GAW14) simulated data.



First consider a diallelic marker with alleles B1 and B2. The observed count for each type of trio is indexed by k, i, and j, where k = 0, 1, or 2 is the total number of B1 alleles transmitted to the offspring, and i, j = 0, 1, or 2 are the total numbers of B1 alleles carried by the father and mother, respectively. The superscript * is used when a parental genotype is missing. Curtis and Sham [9] showed that bias in estimating the probability of transmission of certain alleles is introduced if dyad families with a heterozygous affected child and a heterozygous parent are excluded. For simplicity, we use the same notation for these dyad families whether the father or the mother is missing, because we assume no difference according to the sex of the parent. To avoid such bias, Guo et al. [5] applied the EM algorithm to estimate, among these families, the proportion of heterozygous parents transmitting B1 and not B2 and the proportion transmitting B2 and not B1. The details of the EM procedure are available in Guo et al. [5].

The HRR compares parental marker alleles transmitted to an affected child to those not transmitted. One feature of the HRR for dealing with such trio-type family data is that the affected children's genotypes are always known (assuming no genotyping failure) due to ascertainment procedures in which data from an affected individual is collected first and then that of his/her parents. Hence, in the case group, the two transmitted alleles of all affected children are known and can be used in the analysis, even when both parents' genotypes are not available.

Let Ui, Vi, Wi, and Xi represent the total number of transmitted B1 alleles, non-transmitted B1 alleles, transmitted B2 alleles, and non-transmitted B2 alleles, respectively, from type i families, where i = 1 for complete trios, 2 for dyads (trios with one parental genotype available), and 3 for monads (trios without parental genotypes). Only two of these counts require the EM estimates; the rest are uniquely determined without the EM algorithm. Both V3 and X3 are 0, because no parental genotypes are available to infer which alleles are not transmitted.
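The counts that do not depend on the EM step can be tallied directly from the genotype codes. The following Python sketch is illustrative only: it assumes each complete trio is coded as a tuple (k, i, j) of B1 allele counts for the child, father, and mother, and each monad by the child's count k alone. The function names are hypothetical and are not taken from Guo et al. [5].

```python
def complete_trio_counts(trios):
    """Tally (U1, V1, W1, X1) from complete trios coded as (k, i, j).

    k, i, j are the numbers of B1 alleles in the affected child, the
    father, and the mother. Hypothetical helper for illustration.
    """
    u = v = w = x = 0
    for k, i, j in trios:
        non_transmitted_b1 = i + j - k
        # Basic range check only; this is not a full Mendelian-consistency check.
        if not (0 <= k <= 2 and 0 <= non_transmitted_b1 <= 2):
            raise ValueError(f"inconsistent trio coding: {(k, i, j)}")
        u += k                        # B1 alleles transmitted to the child
        v += non_transmitted_b1       # parental B1 alleles not transmitted
        w += 2 - k                    # B2 alleles transmitted to the child
        x += 2 - non_transmitted_b1   # parental B2 alleles not transmitted
    return u, v, w, x


def monad_counts(children):
    """Tally (U3, V3, W3, X3) from monads coded by the child's B1 count k."""
    u = sum(children)
    w = sum(2 - k for k in children)
    # No parental genotypes are available, so V3 and X3 are 0.
    return u, 0, w, 0
```

Each complete trio contributes exactly two transmitted and two non-transmitted alleles, so U1 + W1 and V1 + X1 both equal twice the number of complete trios, which gives a quick sanity check on the tallies.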

The EM-HRR is defined as [(U1 + U2) × (X1 + X2)] / [(V1 + V2) × (W1 + W2)], if type 3 families are excluded. If all families ascertained are used for analysis regardless of missing one or two parents, then the EM-HRR becomes [(U1 + U2 + U3) × (X1 + X2)] / [(V1 + V2) × (W1 + W2 + W3)]. Under the null hypothesis of no linkage or no association, the EM-HRR is expected to be 1 and the test statistic follows a central chi-square distribution with 1 degree of freedom. Note that Var(EM-HRR) is approximated separately for the case in which type 3 families are excluded and the case in which all three types of families are used.
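To make the arithmetic concrete, the sketch below computes the ratio from pooled counts U, V, W, and X (for example, U = U1 + U2, or U = U1 + U2 + U3 when monads are included). Because the variance approximation is not reproduced in this text, the 1-degree-of-freedom test shown is an ordinary Pearson chi-square on the 2 × 2 table of transmitted versus non-transmitted alleles. That is an assumption made for illustration and may differ from the statistic of Guo et al. [5]. The function names are hypothetical.

```python
import math


def em_hrr_ratio(u, v, w, x):
    """Return (U * X) / (V * W) for pooled allele counts."""
    if v * w == 0:
        raise ZeroDivisionError("V and W must both be positive")
    return (u * x) / (v * w)


def pearson_chi2_1df(u, v, w, x):
    """Pearson chi-square and p-value for the 2x2 table [[U, V], [W, X]].

    Stand-in test for illustration; not the variance-based statistic
    of the original method.
    """
    n = u + v + w + x
    denom = (u + v) * (w + x) * (u + w) * (v + x)
    if denom == 0:
        raise ZeroDivisionError("a row or column total is zero")
    stat = n * (u * x - v * w) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

The counts may be non-integer, since the dyad contributions come from EM estimates; both functions accept floats.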


One affected child was randomly selected in each nuclear family in order to maintain the independence among ascertained trios. Both dominant and recessive disease models were examined, since we used traits "b" (dominant) and "l" (recessive) for ascertainment. For trait b, SNPs C01R0052 and C01R0001 were used in power and type I error simulations, respectively. Similarly, SNPs C09R0765 and C09R0850 were used for trait l (several loci were examined with similar results but not shown here). Based on resampling of the 100 replicates provided, 10 replicates in the Danacaa (DA) population were randomly selected for each simulation. A total of 1,000 simulations were conducted for power and type I error comparisons.
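The two sampling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: families are represented as a mapping from a family identifier to a list of affected-child identifiers, and the 10 replicates for one simulation are drawn without replacement from replicate numbers 1 to 100. The text does not say whether replicates were drawn with or without replacement, and the function names are hypothetical.

```python
import random


def pick_one_affected_child(families, rng):
    """Choose one affected child at random from each nuclear family.

    families: dict mapping family id -> list of affected child ids.
    Families with no affected child are skipped.
    """
    return {fid: rng.choice(kids) for fid, kids in families.items() if kids}


def draw_replicates(rng, n_available=100, n_drawn=10):
    """Draw replicate numbers for one simulation (assumed without replacement)."""
    return sorted(rng.sample(range(1, n_available + 1), n_drawn))
```

Passing an explicit `random.Random` instance keeps each simulation reproducible from its seed.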

The TDT and HRR were first applied to the complete trios. To illustrate the impact of NIM, we examined the extreme case by assigning parental genotypes to be missing if the parents were affected. We first determined the missing rate for parents in the NIM simulations (there was no difference in sex-specific missing rates), then generated a MAR dataset with an equal amount of missing data by randomly assigning parental genotypes to be missing according to that rate. The 1-TDT and EM-HRR were both applied to the subset of complete trios and dyads, but only the EM-HRR can accommodate monads under NIM and MAR.
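The two missingness schemes can be sketched in Python as follows. The record layout (a dict with `genotype` and `affected` keys) and the function names are hypothetical. The MAR step assumes each parent is set to missing independently with probability equal to the NIM missing rate, which matches the rate only in expectation; the source does not specify whether the number of missing genotypes was matched exactly.

```python
import random


def apply_nim(parents):
    """Extreme NIM: a parent's genotype is missing if the parent is affected."""
    return [dict(p, genotype=None) if p["affected"] else dict(p) for p in parents]


def missing_rate(parents):
    """Fraction of parents whose genotype is missing (None)."""
    return sum(p["genotype"] is None for p in parents) / len(parents)


def apply_mar(parents, rate, rng):
    """MAR: set each genotype to missing independently with probability `rate`.

    Assumption for illustration: independent Bernoulli draws per parent.
    """
    return [dict(p, genotype=None) if rng.random() < rate else dict(p)
            for p in parents]
```

A typical use would be to compute `rate = missing_rate(apply_nim(parents))` and then call `apply_mar(parents, rate, rng)` on the original, fully genotyped parents.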


The average sample sizes of ascertained trios are 120 and 750 for the recessive trait l and dominant trait b, respectively. The average missing rates for each parental genotype are approximately 10% and 30% for recessive trait l and dominant trait b, respectively. In Tables 1 and 2, the rows marked TDTtrad and HRRtrad present the results for the traditional tests using all unrelated trios. The TDTcomp and HRRcomp tests applied the traditional TDT and HRR to the subset of trios that remained complete after parental genotypes were assigned to be missing. The 1-TDT and EM-HRRdyads tests used both complete trios and dyads. EM-HRRall used all three types of trios.

Table 1 Recessive trait (L): 120 trios on average
Table 2 Dominant trait (B): 750 trios on average

Consistent with the results reported by Ewens et al. [11], the TDT and HRR perform similarly (comparable power in TDTtrad and HRRtrad, or TDTcomp and HRRcomp) in detecting linkage disequilibrium (LD) between a marker and a putative disease locus in a homogeneous population. In all the situations simulated, TDTcomp and HRRcomp have the lowest power, and the difference between TDTtrad and TDTcomp or HRRtrad and HRRcomp is the loss of power due to exclusion of incomplete trios. The increase from TDTcomp to 1-TDT or HRRcomp to EM-HRRdyads represents a gain of power by including dyads. The difference between EM-HRRdyads and EM-HRRall indicates the gain or loss of power from additionally utilizing monads, which is not applicable to the 1-TDT. Because the transmitted alleles are always available to the HRR statistic regardless of whether one or two parental genotypes are missing, EM-HRRdyads and EM-HRRall are more powerful than the 1-TDT under both dominant and recessive disease models, under MAR as well as NIM.

Under NIM, the probability distribution functions of monads changed the most compared to dyads, which resulted in adding more noise to the EM-HRR statistic. As a consequence, we observed a loss of power in the EM-HRR due to the utilization of monads. Therefore, EM-HRRall is more powerful than EM-HRRdyads under MAR, but their performances are reversed when the missing pattern is informative.

When parental genotypes are missing non-randomly due to a recessive disease, only homozygous parents with two copies of the disease allele will be missing, assuming there are no phenocopies. Therefore, the subset of complete trios or dyads has more heterozygous (informative) parents than the corresponding subset under MAR. One can see that, under NIM, the loss of power from TDTtrad to TDTcomp or HRRtrad to HRRcomp is smaller than in the MAR situation. In addition, the EM procedure yields more informative transmissions because of the excess of heterozygous (informative) parents. Therefore, the power of the EM-HRR using both complete trios and dyads is higher than that of the HRR using the complete dataset (Table 1). The results are similar under a dominant disease model, as seen in Table 2.

Allen et al. [7] and Chen [8] showed that type I errors of MAR tests were inflated over the nominal level. Although our simulation results did not match theirs, we see informative changes in the type I errors. When parental genotypes are missing non-randomly due to a dominant disease, parents with two copies of the normal allele are not affected, assuming no phenocopies. Hence, these types of parents will be more likely to be in the subset of complete trios and dyads. It is evident then that the loss of power from TDTtrad to TDTcomp or HRRtrad to HRRcomp is greater than the loss under a recessive disease model. This phenomenon also affects the type I error. Hence the type I errors of HRRcomp and TDTcomp are smaller than those of TDTtrad and HRRtrad (Table 2), but for a recessive disease, the results are reversed (Table 1), because the heterozygous parents are more likely to be in the subset of complete trios and dyads.


The HRR was the first family-based test for LD between a marker and a putative disease locus. Because the TDT performs better than the HRR under extreme admixture, the HRR is not as popular as the TDT. Due to the data structure of the HRR, the transmitted alleles are always present regardless of the absence of one or both parents. Therefore, the EM-HRR is more powerful than the 1-TDT when the population is under Hardy-Weinberg equilibrium or slightly admixed. Because there is no admixture in the DA population, we found that the EM-HRR is the more powerful test when parental genotypes are missing randomly, and the superiority of the HRR remains despite the impact of NIM. Because the use of affected children without parental genotypes does not improve the power of the EM-HRR with NIM, we recommend using the EM-HRR with the subsets of complete trios and one-parent data for testing LD between a marker and a putative disease locus when the missing data pattern is unknown.

Under a different mechanism of NIM and no phenocopies in the simulated Danacaa population, our results do not match those of the 1-TDT with inflated type I error reported by Allen et al. [7] and Chen [8]. Instead, we observed a different performance of MAR tests under NIM. Although it is easier to observe that the 1-TDT and EM-HRRdyads are more powerful than TDTtrad and HRRtrad when their type I errors are inflated over the nominal level, our results suggest that in the GAW14 simulated data, parents with different genotypes are equally likely to be diseased under the null hypothesis, and that differential missing rates occur only under the alternative hypothesis.