A statistical measure for the skewness of X chromosome inactivation based on family trios
- 760 Downloads
Abstract
Background
X chromosome inactivation (XCI) is an important gene regulation mechanism in females to equalize the expression levels of X chromosome between two sexes. Generally, one of two X chromosomes in females is randomly chosen to be inactivated. Nonrandom XCI (XCI skewing) is also observed in females, which has been reported to play an important role in many X-linked diseases. However, there is no statistical measure available for the degree of the XCI skewing based on family data in population genetics.
Results
In this article, we propose a statistical approach to measure the degree of the XCI skewing based on family trios, which is represented by a ratio of two genotypic relative risks in females. The point estimate of the ratio is obtained from the maximum likelihood estimates of two genotypic relative risks. When parental genotypes are missing in some family trios, the expectation-conditional-maximization algorithm is adopted to obtain the corresponding maximum likelihood estimates. Further, the confidence interval of the ratio is derived based on the likelihood ratio test. Simulation results show that the likelihood-based confidence interval has an accurate coverage probability under the situations considered. Also, we apply our proposed method to the rheumatoid arthritis data from USA for its practical use, and find out that a locus, rs2238907, may undergo the XCI skewing against the at-risk allele. But this needs to be further confirmed by molecular genetics.
Conclusions
The proposed statistical measure for the skewness of XCI is applicable to complete family trio data or family trio data with some paternal genotypes missing. The likelihood-based confidence interval has an accurate coverage probability under the situations considered. Therefore, our proposed statistical measure is generally recommended in practice for discovering the potential loci which undergo the XCI skewing.
Keywords
X chromosome inactivation Skewing Family trio Ratio estimateAbbreviations
- CI
Confidence interval
- CP
Coverage probability
- DP
Proportion of the discontinuous CIs
- ECM
Expectation-conditional- maximization
- GRR
Genotypic relative risk
- LRT
Likelihood ratio test
- ML
Left tail error
- MLE
Maximum likelihood estimate
- MP
Missing patterns
- MR
Right tail error
- RA
Rheumatoid arthritis
- SNP
Single nucleotide polymorphism
- XCI
X chromosome inactivation
- XMCPDT
Monte Carlo pedigree disequilibrium test on X chromosome
Background
Many human diseases are associated with genes on X chromosome, such as asthma, autoimmune diseases, cancers, some neurological and psychiatric diseases [1, 2, 3, 4, 5]. Most of these X-linked diseases often exhibit sex-specific patterns of susceptibility due to the difference in the number of copies of X chromosome between two sexes. Females have two copies of X chromosome whereas there is only one copy in males. To equalize the expression levels of X chromosome between sexes, dosage compensation is achieved by an important gene regulation mechanism in mammalian females, X chromosome inactivation (XCI), which results in expression silencing of one of two X chromosomes in females [6]. Up to 75% genes on X chromosome are subject to XCI, while there are about 15% escaping from inactivation and expressed from both X chromosomes, and the remaining 10% show variable inactivation patterns in different human cell lines [7].
During the process of XCI, one of two X chromosomes in females is chosen to be inactivated in a random way. This means that roughly 50% of cells in females have the paternal X chromosome expressed, while the others express the maternal one. Although random XCI occurs commonly, the XCI skewing also takes place in females, which is defined as the phenomenon of nonrandom inactivation that one of X chromosomes is selected to be silenced with a probability deviating from 50% [8]. Generally, the skewness of XCI is caused by a second selection mechanism. When the mutation on X chromosome affects the survival and proliferation of cells, the amount of cells carrying an active mutant X chromosome will become larger or smaller than that of cells with an active wild-type X chromosome, which thus leads to the skewness of XCI [9, 10]. Negative selection, where the mutation gives a growth disadvantage to cells, frequently happens in female carriers with X-linked diseases, such as mental retardation disorders, Wiskott-Aldrich syndrome and X-linked severe combined immunodeficiency [11, 12, 13]. On the other hand, when the mutation provides a growth advantage to cells, positive selection occurs and can result in some diseases, such as adrenoleukodystrophy and breast cancer [14, 15].
In genetic association studies on X chromosome, Clayton [16] first took XCI into consideration. Due to XCI, the genotypic effect of homozygous females can be treated the same as that of hemizygous males. Therefore, the genotypic scores are given to be 0, 1 and 2 corresponding to three genotypes at a diallelic locus on X chromosome in females, and 0 and 2 corresponding to two genotypes in males. However, Wang et al. [17] pointed out that such coding strategy only considers the situation of random XCI, but ignores the skewness of XCI and escape from XCI. To account for all possible situations of XCI, Wang et al. suggested that the genotypic score for the heterozygous females, denoted by γ, can be any possible values between 0 and 2. Under XCI, the value of γ reflects the degree of inactivation skewing, with γ/2 representing the proportion of cells having the mutant allele active. As such, γ = 1 stands for random XCI, while γ between 1 and 2 indicates the XCI skewing toward the mutant allele and γ between 0 and 1 denotes the XCI skewing against the mutant allele. For example, γ = 0.5 means that the skewness of XCI is against the mutant allele with 25% cells expressing the mutant allele and the other 75% cells expressing the normal allele. On the other hand, in molecular genetics, the XCI skewing pattern can be identified by assays taking advantage of differential methylation between the active and inactive X chromosomes or mRNA transcription in cells [18, 19, 20]. However, since the XCI pattern always varies among cell lines [21, 22], these assays, which usually use cells from only a few tissues to investigate the XCI patterns, cannot present the status of the whole body [10]. Further, there is no statistical measure available for detecting the XCI skewing pattern in population genetics as yet.
Therefore, in this article, we give the expression of γ for family trios with both parents and one affected daughter in the presence of association between the disease and genotypes. In fact, γ is a function of two genotypic relative risks (GRRs) in females. In addition, we obtain the point estimate of γ from the maximum likelihood estimates (MLEs) of the GRRs. When parental genotypes are missing in some family trios, the expectation-conditional-maximization (ECM) algorithm [23] is used to obtain the corresponding MLEs. Further, the confidence interval (CI) of γ is derived based on a likelihood ratio test (LRT). Finally, simulation study is conducted to investigate the performance of our proposed method. The simulation results show that the proposed likelihood-based CI has an accurate coverage probability under the situations considered. For practical use, we apply our proposed method to the rheumatoid arthritis (RA) data from USA.
Methods
Notations
Consider an X-linked diallelic locus with mutant allele A and normal allele a. Let p_{m} and q_{m} = 1 − p_{m} denote the allele frequencies of A and a in males, respectively. Suppose that p_{f} is the allele frequency of A and ρ is the inbreeding coefficient in females. Then, the frequencies of genotypes aa, Aa and AA in females are respectively g_{0} = (1 − p_{f})^{2} + ρp_{f}(1 − p_{f}), g_{1} = 2(1 − ρ)p_{f}(1 − p_{f}) and \( {g}_2={p}_f^2+\rho {p}_f\left(1-{p}_f\right) \). Note that Hardy-Weinberg equilibrium holds in the population under study when ρ = 0 and p_{m} = p_{f}. Let f_{0}, f_{1} and f_{2} respectively represent the penetrances in females with genotypes aa, Aa and AA. The GRRs in females are defined as λ_{1} = f_{1}/f_{0} and λ_{2} = f_{2}/f_{0}.
Relationship between penetrances and XCI skewing in females
Let the genotypic scores be 0, γ and 2 corresponding to females with genotypes aa, Aa and AA, respectively, where γ ∈ [0, 2] represents the XCI skewing pattern. We assume that a generalized genetic model holds [24, 25], which is defined as f_{0} ≤ f_{1} ≤ f_{2} (i.e., 1 ≤ λ_{1} ≤ λ_{2}) with at least one inequality being strict, in the presence of association between the disease and genotypes in females. If f_{1} is unknown, then it can be expressed as a function of γ, denoted by f_{1}(γ). To derive the expression of f_{1}(γ), let \( {f}_1^{\prime}\left(\gamma \right) \) be the first order derivative of f_{1}(γ) with respect to γ. As such, f_{1}(γ) can be approximated by a first order Taylor expansion around γ = 1 as follows,
From Eq. (2), we notice that f_{1} is around the midpoint between f_{0} and f_{2} under random XCI (i.e., γ=1). Actually, Eq. (2) means that the penetrance for heterozygous females is approximately linear in the genotypic score γ around γ=1.
Note that the value of γ attains its maximum (γ = 2) when λ_{1} = λ_{2} ≠ 1, and γ = 0 when λ_{1} = 1 and λ_{2} ≠ 1. Assume that \( {\widehat{\lambda}}_1 \) and \( {\widehat{\lambda}}_2 \) are the MLEs of λ_{1} and λ_{2}, respectively. Then, the point estimate of γ, \( \widehat{\gamma} \), can be obtained by \( 2\left({\widehat{\lambda}}_1-1\right)/\left({\widehat{\lambda}}_2-1\right) \).
MLEs of λ _{1} and λ _{2} using family trios without missing genotypes
Eight types of possible family trios and the corresponding probabilities
FMC | P(FM) | P(C| FM) | P(FMC| D) |
---|---|---|---|
000 | q _{ m} g _{0} | 1 | q_{m}g_{0}/R |
010 | q _{ m} g _{1} | 0.5 | 0.5q_{m}g_{1}/R |
011 | q _{ m} g _{1} | 0.5 | 0.5q_{m}g_{1}λ_{1}/R |
021 | q _{ m} g _{2} | 1 | q_{m}g_{2}λ_{1}/R |
101 | p _{ m} g _{0} | 1 | p_{m}g_{0}λ_{1}/R |
111 | p _{ m} g _{1} | 0.5 | 0.5p_{m}g_{1}λ_{1}/R |
112 | p _{ m} g _{1} | 0.5 | 0.5p_{m}g_{1}λ_{2}/R |
122 | p _{ m} g _{2} | 1 | p_{m}g_{2}λ_{2}/R |
Equation (3) holds when the disease status of a daughter is only related to her own genotype, and P(C| FM) is determined by Mendelian transmission, which is equal to 0.5 for heterozygous mother and 1 otherwise. Assume that P(FM) = P(F)P(M), and divide the numerator and denominator of Eq. (3) by f_{0}. Then, P(FMC| D) for each trio type can be written as the last column in Table 1, where R = q_{m}g_{0} + 0.5q_{m}g_{1}(1 + λ_{1}) + q_{m}g_{2}λ_{1} + p_{m}g_{0}λ_{1} + 0.5p_{m}g_{1}(λ_{1} + λ_{2}) + p_{m}g_{2}λ_{2}. The detailed derivation of P(FMC| D) in Table 1 is given in Additional file 1: Appendix A.
where n_{FMC} is the number of the family trios of type FMC. To obtain the MLE of θ, numerical methods, such as Newton-Raphson algorithm (by using “maxLik” package in R software [26]) and the ECM algorithm introduced later, are applied. We choose the initial values of p_{m}, g_{0}, g_{1}, λ_{1} and λ_{2} as follows: \( {\widehat{p}}_m^{(0)}=\#\left(F=1\right)/\#\left(F\in \left\{0,1\right\}\right), \) \( {\widehat{g}}_0^{(0)}=\#\left(M=0\right)/\#\left(M\in \left\{0,1,2\right\}\right) \), \( {\widehat{g}}_1^{(0)}=\#\left(M=1\right)/\#\left(M\in \left\{0,1,2\right\}\right), \) \( {\widehat{\lambda}}_1^{(0)}={n}_{011}/{n}_{010} \) and \( {\widehat{\lambda}}_2^{(0)}=\left({n}_{011}{n}_{112}\right)/\left({n}_{010}{n}_{111}\right) \), where # denotes the counting measure. The details about the choice of these initial values are shown in Additional file 1: Appendix B.
MLEs of λ _{1} and λ _{2} using family trios with parental genotypes missing
It is common that parental genotypes are missing in some family trios. For trios with paternal or maternal genotype missing, we call them “mother-daughter pairs” or “father-daughter pairs”, denoted by MC and FC, respectively. Thus, MC takes possible genotypes from Ω_{MC}={00, 01, 10, 11, 12, 21, 22}, and FC takes possible genotypes from Ω_{FC}={00, 01, 11, 12}. As for trios with both parental genotypes missing, we refer to them as “single daughters”. The probabilities of the mother-daughter pair MC, father-daughter pair FC and single daughter C given the daughter being affected are respectively\( P\left( MC|D\right)={\sum}_{F\in \left\{0,1\right\}}P\left( FMC|D\right) \), \( P\left( FC|D\right)={\sum}_{M\in \left\{0,1,2\right\}}P\left( FMC|D\right) \) and \( P\left(C|D\right)={\sum}_{F\in \left\{0,1\right\}}{\sum}_{M\in \left\{0,1,2\right\}}P\left( FMC|D\right) \), where P(FMC| D) are given in Table 1.
Let N_{2}, N_{1m}, N_{1f} and N_{0} respectively be the numbers of family trios, mother-daughter pairs, father-daughter pairs and single daughters. Then, \( {N}_2={\sum}_{FMC\in \Omega}{n}_{FMC} \), \( {N}_{1m}={\sum}_{MC\in {\Omega}_{MC}}{n}_{1m, MC},{N}_{1f}={\sum}_{FC\in {\Omega}_{FC}}{n}_{1f, FC} \), \( {N}_0={\sum}_{C\in \left\{0,1,2\right\}}{n}_{0,C} \) and the total sample size N = N_{2} + N_{1m} + N_{1f} + N_{0}.
Since it is not so easy to obtain the MLE of θ from the above observed log-likelihood function (4), the ECM algorithm will be employed. Assume that \( {n}_{1m, MC}={\sum}_{F\in \left\{0,1\right\}}{z}_{1m, FMC} \), \( {n}_{1f, FC}={\sum}_{M\in \left\{0,1,2\right\}}{z}_{1f, FMC} \) and n_{0, C}= \( {\sum}_{F\in \left\{0,1\right\}}{\sum}_{M\in \left\{0,1,2\right\}}{z}_{0, FMC} \), where z_{1m, FMC}, z_{1f, FMC} and z_{0, FMC} are the unobserved numbers of trios FMC for mother-daughter pairs MC, father-daughter pairs FC and single daughters C, respectively (see Additional file 1: Tables S1-S3). Then, the log-likelihood function for the complete data (n_{FMC}, z_{1m, FMC}, z_{1f, FMC}, z_{0, FMC}) can be written as
where \( {\widehat{p}}_m^{(0)} \), \( {\widehat{q}}_m^{(0)}=1-{\widehat{p}}_m^{(0)} \), \( {\widehat{g}}_0^{(0)} \), \( {\widehat{g}}_1^{(0)} \) and \( {\widehat{g}}_2^{(0)}=1-{\widehat{g}}_0^{(0)}-{\widehat{g}}_1^{(0)} \)are the initial values of p_{m}, q_{m}, g_{0}, g_{1} and g_{2}, respectively. The details about the choice of these initial values are shown in Additional file 1: Appendix B. Given the initial value of θ, the steps mentioned above continue until the convergence criterion is satisfied. For example, the absolute differences between the estimates of the parameters at two consecutive iterations are all less than 10^{−7}. In addition, note that the ECM algorithm still works when there are no missing genotypes in all the family trios. However, it contains only the CM steps in this situation and can be regarded as a special case of the cyclic coordinate ascent method, which is simple and stable [23].
Confidence interval of γ based on likelihood method
Note that there is no closed-form solution of Eq. (6). Thus, numerical method is applied, such as functions from “rootSolve” package in R software [26]. Let γ_{L} and γ_{U} be two unequal roots of Eq. (6) with γ_{L} < γ_{U}. Generally, the 100(1 − α)% CI of γ would be (γ_{L}, γ_{U}). However, since the true value of γ is bounded in [0, 2], the original estimated CI of γ will be truncated by [0, 2] if necessary. As such, the ultimate CI of γ is (γ_{L}, γ_{U})∩[0, 2], which is easier to be interpreted than the origin CI.
Discontinuity problem of confidence interval of γ
Note that γ is a ratio, so like other ratio estimates [27], we find that the proposed CI may consist of two disjoint intervals, such as [0, 0.03)∪(0.59, 2]. In this article, this kind of CIs is referred to as “discontinuous CI” for convenience. Let’s take a close look at this discontinuity problem by the following example. Consider the situation of (n_{000}, n_{010}, n_{011}, n_{021}, n_{101}, n_{111}, n_{112}, n_{122})=(191, 89, 112, 54, 114, 59, 62, 19). Then, \( \widehat{\gamma} \) is 1.92 and two roots of Eq. (6) are 0.03 and 0.59, respectively. If the CI is set to be (0.03, 0.59) normally, to our surprise, \( \widehat{\gamma} \) is located outside this CI. When testing the null hypothesis H_{0}: γ = γ_{0}, we find that γ_{0} taking values between 0.03 and 0.59 is rejected by the LRT. This means that the interval (0.03, 0.59) is actually a rejection region of the corresponding LRT rather than an acceptance region. Hence, the CI of γ turns to be (−∞, 0.03)∪(0.59,+∞), and [0, 0.03)∪(0.59, 2] after being truncated. The discontinuous CI may occur when the denominator of the ratio is close to zero (i.e., λ_{2} is close to 1 in this article) [28]. In fact, when λ_{2} = 1, we assume that we cannot obtain information on the XCI skewing pattern according to the CI of γ. This is because our proposed method measures XCI skewness in the presence of association between the disease and genotypes (i.e., λ_{2} ≠ 1). On the other hand, although these discontinuous CIs are considered to be uninformative and are difficult to be interpreted, there is no satisfactory “objective” method for dealing with this problem well [27].
Simulation settings
Six simulation settings for different combinations of (N_{2}, N_{1m}, N_{1f}, N_{0}) with total sample size N being fixed at 700
MP | N _{2} | N _{1 m} | N _{1 f} | N _{0} |
---|---|---|---|---|
1 | 700 | 0 | 0 | 0 |
2 | 350 | 100 | 100 | 150 |
3 | 0 | 200 | 200 | 300 |
4 | 0 | 100 | 100 | 500 |
5 | 100 | 400 | 100 | 100 |
6 | 100 | 100 | 400 | 100 |
We assess the statistical properties of the CI by the following indexes. Let the coverage probability (CP) be the proportion that the CI contains the true value of γ among k replicates. Note that under H_{0}: γ = γ_{0}, the estimated type І error rate of the LRT is 1−CP. ML and MR denote the left tail error and the right tail error (missing the true value of γ), respectively, with ML \( =\#\left[\left(\gamma <{\gamma}_L\right)\cap \left({\gamma}_L\le \widehat{\gamma}\le {\gamma}_U\right)\right]/k \) and MR \( =\#\left[\left(\gamma >{\gamma}_U\right)\cap \left({\gamma}_L\le \widehat{\gamma}\le {\gamma}_U\right)\right]/k \). Further, we use ML/(ML + MR) to measure the balance of ML and MR, which will be close to 0.5 when the balance is achieved. Notice that we do not consider those discontinuous CIs when calculating ML and MR, since we cannot distinguish between the left side and the right side of the discontinuous CIs. As such, we use DP\( =1-\#\left({\gamma}_L\le \widehat{\gamma}\le {\gamma}_U\right)/k \) to represent the proportion of the discontinuous CIs among k replicates. In addition, note that the distribution of a ratio is not necessarily symmetric [30, 31], and the median can be always used to estimate the central tendency of a skewed distribution better than the mean [32]. So, we give the median of the point estimates of γ over k replicates under each simulation scenario. Further, for simulating the power of the LRT, we fix γ_{0} at 0, 1 and 2. Finally, we also compare the simulation results under MP1 (consisting of only 700 family trios with both parents) based on the ECM algorithm with those based on the Newton-Raphson algorithm. It is found that the results of the two algorithms are almost consistent with each other (data not shown for brevity). Therefore, we only give the simulation results on the basis of the ECM algorithm in the following section.
Results
Simulation results
Statistical properties of likelihood-based confidence interval of γ against missing pattern (MP) and γ with ρ=0, λ_{2}=1.5, and (p_{m}, p_{f}) being (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25)^{a}
(p_{m}, p_{f}) = (0.30, 0.30) | (p_{m}, p_{f}) = (0.25, 0.30) | (p_{m}, p_{f}) = (0.30, 0.25) | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
MP | γ | CP (%) | ML/(ML + MR) | DP | CP (%) | ML/(ML + MR) | DP | CP (%) | ML/(ML + MR) | DP |
1 | 0 | 94.37 | 1 | 0.098 | 94.46 | 1 | 0.099 | 94.74 | 1 | 0.100 |
0.5 | 94.70 | 0.51 | 0.034 | 94.59 | 0.51 | 0.030 | 94.79 | 0.56 | 0.035 | |
1 | 95.01 | 0.33 | 0.018 | 94.94 | 0.34 | 0.022 | 94.96 | 0.34 | 0.023 | |
1.5 | 95.10 | 0.26 | 0.052 | 95.08 | 0.30 | 0.048 | 95.08 | 0.25 | 0.058 | |
2 | 95.13 | 0 | 0.061 | 94.84 | 0 | 0.060 | 94.97 | 0 | 0.072 | |
2 | 0 | 95.17 | 1 | 0.162 | 94.69 | 1 | 0.161 | 94.78 | 1 | 0.160 |
0.5 | 94.98 | 0.59 | 0.064 | 94.87 | 0.53 | 0.055 | 95.21 | 0.59 | 0.061 | |
1 | 94.96 | 0.21 | 0.036 | 95.05 | 0.20 | 0.038 | 95.21 | 0.17 | 0.037 | |
1.5 | 94.65 | 0.16 | 0.104 | 95.01 | 0.12 | 0.105 | 94.67 | 0.09 | 0.110 | |
2 | 94.58 | 0 | 0.130 | 94.98 | 0 | 0.120 | 94.87 | 0 | 0.140 | |
3 | 0 | 94.79 | 1 | 0.373 | 95.28 | 1 | 0.347 | 94.99 | 1 | 0.372 |
0.5 | 95.57 | 0.75 | 0.140 | 95.58 | 0.62 | 0.119 | 95.58 | 0.75 | 0.138 | |
1 | 95.49 | 0.01 | 0.071 | 95.72 | 0.02 | 0.069 | 95.56 | 0.03 | 0.063 | |
1.5 | 94.84 | 0 | 0.232 | 94.93 | 0 | 0.227 | 94.96 | 0 | 0.210 | |
2 | 94.87 | 0 | 0.413 | 94.75 | 0 | 0.385 | 95.13 | 0 | 0.393 | |
4 | 0 | 94.72 | 1 | 0.500 | 94.59 | 1 | 0.472 | 94.76 | 1 | 0.488 |
0.5 | 95.14 | 0.87 | 0.183 | 95.31 | 0.81 | 0.162 | 95.32 | 0.89 | 0.174 | |
1 | 94.99 | 0.05 | 0.067 | 95.13 | 0.03 | 0.073 | 94.85 | 0.04 | 0.070 | |
1.5 | 94.79 | 0 | 0.229 | 94.86 | 0 | 0.238 | 94.65 | 0 | 0.194 | |
2 | 94.65 | 0 | 0.457 | 94.39 | 0 | 0.460 | 94.81 | 0 | 0.402 | |
5 | 0 | 94.77 | 1 | 0.211 | 95.12 | 1 | 0.204 | 95.08 | 1 | 0.209 |
0.5 | 95.56 | 0.67 | 0.077 | 95.61 | 0.56 | 0.066 | 95.60 | 0.66 | 0.078 | |
1 | 95.04 | 0.12 | 0.050 | 95.06 | 0.13 | 0.052 | 95.49 | 0.11 | 0.047 | |
1.5 | 94.72 | 0.04 | 0.155 | 94.83 | 0.07 | 0.150 | 94.69 | 0.05 | 0.150 | |
2 | 94.74 | 0 | 0.220 | 94.66 | 0 | 0.199 | 94.81 | 0 | 0.231 | |
6 | 0 | 95.08 | 1 | 0.295 | 95.25 | 1 | 0.275 | 95.05 | 1 | 0.291 |
0.5 | 95.28 | 0.72 | 0.108 | 95.43 | 0.62 | 0.094 | 95.68 | 0.77 | 0.106 | |
1 | 95.20 | 0.06 | 0.059 | 95.30 | 0.05 | 0.063 | 95.03 | 0.03 | 0.059 | |
1.5 | 94.57 | 0.01 | 0.190 | 95.09 | 0.03 | 0.183 | 94.64 | 0.01 | 0.186 | |
2 | 94.87 | 0 | 0.294 | 94.69 | 0 | 0.267 | 94.64 | 0 | 0.297 |
Statistical properties of likelihood-based confidence interval of γ against missing pattern (MP) and γ with ρ=0, λ_{2}=2, and (p_{m}, p_{f}) being (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25)^{a}
(p_{m}, p_{f}) = (0.30, 0.30) | (p_{m}, p_{f}) = (0.25, 0.30) | (p_{m}, p_{f}) = (0.30, 0.25) | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
MP | γ | CP (%) | ML/(ML + MR) | DP | CP (%) | ML/(ML + MR) | DP | CP (%) | ML/(ML + MR) | DP |
1 | 0 | 94.67 | 0.88 | 0.025 | 94.73 | 0.90 | 0.030 | 94.80 | 0.93 | 0.029 |
0.5 | 94.71 | 0.40 | 0.005 | 94.95 | 0.42 | 0.006 | 94.65 | 0.41 | 0.008 | |
1 | 94.86 | 0.41 | 0.004 | 94.97 | 0.40 | 0.004 | 95.09 | 0.41 | 0.006 | |
1.5 | 94.96 | 0.41 | 0.006 | 95.25 | 0.43 | 0.005 | 95.11 | 0.41 | 0.008 | |
2 | 95.14 | 0.02 | 0.022 | 94.99 | 0.01 | 0.024 | 95.08 | 0.01 | 0.024 | |
2 | 0 | 95.17 | 1 | 0.032 | 94.71 | 0.99 | 0.040 | 94.64 | 1 | 0.038 |
0.5 | 95.16 | 0.40 | 0.025 | 95.13 | 0.35 | 0.022 | 95.00 | 0.38 | 0.027 | |
1 | 94.60 | 0.36 | 0.019 | 95.24 | 0.38 | 0.019 | 95.01 | 0.33 | 0.020 | |
1.5 | 94.62 | 0.39 | 0.026 | 95.32 | 0.43 | 0.020 | 94.93 | 0.40 | 0.034 | |
2 | 94.71 | 0 | 0.028 | 94.81 | 0 | 0.026 | 95.13 | 0 | 0.029 | |
3 | 0 | 94.91 | 1 | 0.168 | 95.10 | 1 | 0.189 | 95.09 | 1 | 0.172 |
0.5 | 95.24 | 0.29 | 0.159 | 95.61 | 0.27 | 0.135 | 95.56 | 0.31 | 0.164 | |
1 | 95.58 | 0.02 | 0.097 | 95.51 | 0.02 | 0.101 | 95.25 | 0 | 0.091 | |
1.5 | 95.21 | 0.03 | 0.280 | 95.06 | 0.04 | 0.250 | 94.92 | 0 | 0.296 | |
2 | 95.16 | 0 | 0.176 | 95.07 | 0 | 0.142 | 95.10 | 0 | 0.219 | |
4 | 0 | 94.74 | 1 | 0.387 | 94.68 | 1 | 0.406 | 94.57 | 1 | 0.393 |
0.5 | 94.80 | 0.43 | 0.263 | 94.61 | 0.32 | 0.227 | 94.51 | 0.46 | 0.275 | |
1 | 95.19 | 0.01 | 0.127 | 95.37 | 0 | 0.155 | 95.20 | 0.01 | 0.110 | |
1.5 | 94.67 | 0 | 0.449 | 94.54 | 0 | 0.455 | 94.69 | 0 | 0.405 | |
2 | 94.65 | 0 | 0.433 | 94.71 | 0 | 0.399 | 94.61 | 0 | 0.462 | |
5 | 0 | 94.75 | 1 | 0.050 | 94.94 | 1 | 0.063 | 94.93 | 1 | 0.054 |
0.5 | 95.18 | 0.36 | 0.050 | 95.73 | 0.37 | 0.046 | 95.31 | 0.39 | 0.058 | |
1 | 94.98 | 0.25 | 0.037 | 94.85 | 0.26 | 0.042 | 95.21 | 0.22 | 0.037 | |
1.5 | 94.97 | 0.23 | 0.084 | 94.71 | 0.31 | 0.065 | 94.85 | 0.25 | 0.093 | |
2 | 95.24 | 0 | 0.037 | 94.42 | 0 | 0.035 | 94.63 | 0 | 0.051 | |
6 | 0 | 95.34 | 1 | 0.083 | 95.10 | 1 | 0.095 | 95.20 | 1 | 0.101 |
0.5 | 95.72 | 0.36 | 0.092 | 95.24 | 0.36 | 0.081 | 95.48 | 0.33 | 0.094 | |
1 | 94.93 | 0.09 | 0.059 | 94.91 | 0.12 | 0.062 | 94.83 | 0.10 | 0.063 | |
1.5 | 94.96 | 0.14 | 0.133 | 94.57 | 0.18 | 0.119 | 95.07 | 0.13 | 0.146 | |
2 | 94.81 | 0 | 0.055 | 94.71 | 0 | 0.054 | 94.95 | 0 | 0.075 |
Application to RA data
Rheumatoid arthritis (RA) is an autoimmune disease, which has been reported to be associated with the skewness of XCI [33]. To investigate the XCI skewing patterns at the X-linked loci associated with RA, we apply our proposed method to the data from North American Rheumatoid Arthritis Consortium [29], which is made available from Genetic Analysis Workshop 15 [34]. The dataset includes 757 pedigrees and 293 single nucleotide polymorphism (SNP) markers on X chromosome. In this application, one nuclear family with a typed affected daughter is selected randomly from each pedigree. As such, a total of 703 nuclear families are included, which contains 64 case-parents trios, 179 mother-daughter pairs, 37 father-daughter pairs and 423 single daughters.
Application of proposed method to RA dataset with p-values of XMCPDT less than 1% significance level
SNP name | p-value^{a} | (\( {\widehat{p}}_m \), \( {\widehat{p}}_f \)) | \( {\widehat{\lambda}}_2 \) | \( \widehat{\gamma} \) | 95% CI of γ |
---|---|---|---|---|---|
rs2238907 | 0.004 | (0.20, 0.24) | 2.18 | 0.35 | [0, 0.79) |
rs916685 | 0.003 | (0.17, 0.20) | 2.55 | 0.53 | [0, 1.33) |
rs1264064 | 0.001 | (0.42, 0.45) | 1.96 | 0.71 | [0, 2] |
rs1043034 | 0.007 | (0.19, 0.24) | 3.73 | 0.81 | (0.51, 1.69) |
rs2005463 | 0.007 | (0.18, 0.23) | 4.59 | 0.61 | (0.40, 1.06) |
Discussion
In this article, we propose a statistical measure γ of the degree of the XCI skewing for family trio data, which can be represented as a ratio of two GRRs in females in the presence of association between the disease and genotypes. Further, we obtain the point estimate of γ, which is constructed by the MLEs of two GRRs in females. When there are missing parental genotypes in some family trios, the ECM algorithm is used to estimate the two GRRs. The CI of γ is derived from the likelihood method by inverting the LRT. We conduct the simulation study under various simulation settings, including six missing patterns of families, six groups of allele frequencies, two different values of inbreeding coefficient in females, two different values of λ_{2} and five different values of γ. The simulation results show that the proposed likelihood-based CI of γ has an accurate CP under the situations considered, while ML/(ML + MR) and DP of the CI of γ and the median of estimates of γ are influenced by the values of γ, λ_{2} and the missing pattern. Similarly, the simulated power of LRT is affected by the values of ∣γ − γ_{0}∣, λ_{2}, (p_{m}, p_{f}) and the missing pattern. Finally, we apply our proposed method to the RA data from USA and find out a locus, rs2238907, which may undergo the XCI skewing against the at-risk allele.
Many X-linked diseases are always associated with XCI skewing in females. Our proposed statistical measure γ provides information on the potential loci subject to XCI skewing, thus it is helpful to uncover the pathogenesis of X-linked diseases. However, most of the statistical studies on X chromosome today focus mainly on the association tests [17, 24, 25, 36, 37, 38], so there are no other statistical methods available to measure the skewness of XCI. On the other hand, although the XCI skewing pattern can also be detected by differential methylation between the active and inactive X chromosomes or mRNA transcription in cells, our proposed statistical method takes use of population data to measure the skewness of XCI. Thus, it can reflect the average level of the XCI skewing in female population.
There are some issues in our proposed method. First of all, the original CI is truncated by [0, 2] to enhance the interpretability of the CI. However, when the whole original CI lies outside [0, 2], the CI ultimately obtained after truncation would be empty. Although it is hard to interpret this kind of CI containing no values, the simulation results show that when γ takes values on the boundary (i.e., 0 and 2), these empty CIs seldom occur. For example, when γ = 0, two tail errors (ML and MR) are extremely imbalance with ML/(ML + MR) being 1 or close to 1. This means that there are no or very few original CIs whose upper limit is below 0. Likewise, ML/(ML + MR) is 0 or close to 0 when γ = 2, which implies that there are no or very few original CIs whose lower limit is beyond 2. On the other hand, the proposed likelihood method has its own drawback in deriving the CI of a ratio like any other ratio estimation methods. We find that the likelihood-based CI of γ may consist of two disjoint intervals, such as [0, 0.03) ∪ (0.59, 2], and it is also difficult for us to interpret. For example, if \( \widehat{\gamma}=1.92 \) and the CI of γ is [0, 0.03) ∪ (0.59, 2], then the corresponding LRT would reject the null hypothesis of γ_{0} = 0.5, and accept that of γ_{0} = 0.01. It is hard to explain that the LRT rejects a γ_{0} being close to \( \widehat{\gamma} \), while accepts one being far away from \( \widehat{\gamma} \). Although this kind of CI is undesirable, it is also inevitable and can be regarded as a hint of λ_{2} being close to 1 [39]. In addition, it should be noted that the ECM algorithm is not applicable when all the family trios are “single daughters”, since the MLE of θ may not be uniquely specified under this situation (the details see Additional file 1: Appendix C). However, if the other family trios were collected, then the single daughters can make contribution to the MLE of θ in the ECM algorithm together with these trio data (the details see Additional file 1: Appendix D). Finally, we assume that the genotypes’ frequencies in males and females (p_{m}, g_{0} and g_{1}) are unknown and estimate them together with λ_{1} and λ_{2} in our simulation study and real data application. Alternatively, if we can obtain information on the allele frequencies from the online databases, such as the Allele Frequency Net Database [40] and the UCSC Genome Browser Database [41], then it is unnecessary to re-estimate p_{m}, g_{0} and g_{1}, which will reduce the number of parameters so that the ECM algorithm runs faster.
Note that the ECM algorithm can converge to a local maximum of the log-likelihood function instead of a global maximum [23]. To investigate this, we randomly choose 1000 initial values of θ (θ_{0}) from the parameter space and regard the MLE of θ (θ_{0}) with the maximum log-likelihood among 1000 \( \ln L\left(\widehat{\theta}\right) \)'s (\( \ln L\left({\overset{\sim }{\theta}}_0\right) \)'s) as the global MLE of θ (θ_{0}). We conduct a simulation study under the simulation settings with ρ=0, λ_{2}= 1.5 and (p_{m}, p_{f}) = (0.30, 0.30), and the details see Additional file 1: Appendix E. The simulation results (see Additional file 1: Tables S10 and S11) show that the values of \( \widehat{\theta} \) and \( \ln L\left(\widehat{\theta}\right) \) (\( {\overset{\sim }{\theta}}_0 \) and \( \ln L\left({\overset{\sim }{\theta}}_0\right) \)) based on one initial value estimated by the method described in Additional file 1: Appendix B are very close to those based on 1000 initial values under all the simulated situations when N_{2} (the number of complete family trios) is not too small, such as MP1 and MP2, which may indicate that the ECM algorithm based on the estimated initial value converges towards the global maximum. As for MP3-MP6, except that \( {\overset{\sim }{\theta}}_0 \) with (γ_{0}, γ) = (1, 2) under MP5 and MP6, \( {\overset{\sim }{\theta}}_0 \) with (γ_{0}, γ) = (1, 1) and (1, 2) under MP3, and \( {\overset{\sim }{\theta}}_0 \) with (γ_{0}, γ) = (1, 1), (1, 1.5) and (1, 2) under MP4 may converge to a local maximum, all the other \( \widehat{\theta} \) and \( {\tilde{\theta}}_0 \) results converge to the global maximum. Further, for these seven cases, we try and randomly select ten groups of initial values of θ_{0} from the parameter space and regard \( {\tilde{\theta}}_0 \) with the maximum log-likelihood among ten \( \ln L\left({\tilde{\theta}}_0\right) \)'s as the final MLE of θ_{0}. We find that \( {\tilde{\theta}}_0 \)’s based on ten and 1000 initial values are very close to each other under all the seven simulated situations (see Additional file 1: Table S11). So, if N_{2} is zero or too small, we recommend using multiple initial values (such as ten) for obtaining the global MLE of θ_{0}. On the other hand, family trios with both parents are always fortunately collected in the family-based studies in practice.
In future studies, we will extend our proposed method to incorporate covariates by using nuclear families with affected and unaffected offspring. Furthermore, to facilitate the interpretability of the CI of γ, we will utilize the prior information, such as the order of the GRRs in females and the information of the presence of association.
Conclusions
The proposed statistical measure for the skewness of XCI is applicable for complete family trio data or family trio data with some paternal genotypes missing. The likelihood-based CI has an accurate CP under the situations considered. Therefore, our proposed statistical measure is generally recommended in practice for discovering the potential loci which undergo the XCI skewing.
Notes
Acknowledgments
The authors thank the editor and three reviewers for helpful comments that greatly improve the presentation of the article.
Funding
This study was supported by the National Natural Science Foundation of China grants 81,773,544, 81,373,098 and 81,573,207 and the National and Guangzhou University Students’ Innovation and Enterprise Training Project of China grant 201,612,121,017. The authors thank the Genetic Analysis Workshops for providing the RA data, which were supported by the National Institutes of Health grant R01 GM031575. The RA data were gathered with the support of grants from the National Institutes of Health grants N01-AR-2-2263 and R01-AR-44422, and the National Arthritis Foundation.
Availability of data and materials
The dataset supporting the conclusions of this article is from North American Rheumatoid Arthritis Consortium, which is made available from Genetic Analysis Workshop 15 (http://www.gaworkshop.org/).
Authors’ contributions
SQX, YZ, PW, WL, XBW and JYZ all contributed to the study design, analytical preparation and the writing of the manuscript. SQX and YZ performed the simulation studies. PW, WL, XBW and JYZ analyzed the data and revised the manuscript. All authors have read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
References
- 1.Postma DS. Gender differences in asthma development and progression. Gend Med 2007;4 Suppl B: S133–46.CrossRefPubMedGoogle Scholar
- 2.Invernizzi P, Pasini S, Selmi C, Gershwin ME, Podda M. Female predominance and X chromosome defects in autoimmune diseases. J Autoimmun. 2009;33:12–6.CrossRefPubMedGoogle Scholar
- 3.Dorak MT, Karpuzoglu E. Gender differences in cancer susceptibility: an inadequately addressed issue. Front Genet. 2012;3:268.CrossRefPubMedPubMedCentralGoogle Scholar
- 4.Qureshi IA, Mehler MF. Genetic and epigenetic underpinnings of sex differences in the brain and in neurological and psychiatric disease susceptibility. Prog Brain Res. 2010;186:77–95.CrossRefPubMedPubMedCentralGoogle Scholar
- 5.Skuse DH. Imprinting, the X-chromosome, and the male brain: explaining sex differences in the liability to autism. Pediatr Res. 2000;47:9–16.CrossRefPubMedGoogle Scholar
- 6.Chow JC, Yen Z, Ziesche SM, Brown CJ. Silencing of the mammalian X chromosome. Annu Rev Genomics Hum Genet. 2005;6:69–92.CrossRefPubMedGoogle Scholar
- 7.Carrel L, Willard HF. X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature. 2005;434:400–4.CrossRefPubMedGoogle Scholar
- 8.Minks J, Robinson WP, Brown CJ. A skewed view of X chromosome inactivation. J Clin Invest. 2008;118:20–3.CrossRefPubMedGoogle Scholar
- 9.Van den Veyver IB. Skewed X inactivation in X-linked disorders. Semin Reprod Med. 2001;19:183–91.CrossRefPubMedGoogle Scholar
- 10.Brown CJ. Skewed X-chromosome inactivation: cause or consequence? J Natl Cancer Inst. 1999;91:304–5.CrossRefPubMedGoogle Scholar
- 11.Plenge RM, Stevenson RA, Lubs HA, Schwartz CE, Willard HF. Skewed X-chromosome is a common feature of X-linked mental retardation disorders. Am J Hum Genet. 2002;71:168–73.CrossRefPubMedPubMedCentralGoogle Scholar
- 12.Prchal JT, Carroll AJ, Prchal JF, Crist WM, Skalka HW, Gealy WJ, Harley J, Malluh A. Wiskott-Aldrich syndrome: cellular impairments and their implication for carrier detection. Blood. 1980;56:1048–54.PubMedGoogle Scholar
- 13.Conley ME, Lavoie A, Briggs C, Brown P, Guerra C, Puck JM. Nonrandom X chromosome inactivation in B cells from carriers of X chromosome-linked severe combined immunodeficiency. Proc Natl Acad Sci U S A. 1988;85:3090–4.CrossRefPubMedPubMedCentralGoogle Scholar
- 14.Salsano E, Tabano S, Sirchia SM, Colapietro P, Castellotti B, Gellera C, Rimoldi M, Pensato V, Mariotti C, Pareyson D, Miozzo M, Uziel G. Preferential expression of mutant ABCD1 allele is common in adrenoleukodystrophy female carriers but unrelated to clinical symptoms. Orphanet J Rare Dis. 2012;7:10.CrossRefPubMedPubMedCentralGoogle Scholar
- 15.Kristiansen M, Langerød A, Knudsen GP, Weber BL, Børresen-Dale AL, ørstavik KH. High frequency of skewed X inactivation in young breast cancer patients. J Med Genet. 2002;39:30–3.CrossRefPubMedPubMedCentralGoogle Scholar
- 16.Clayton D. Testing for association on the X chromosome. Biostatistics. 2008;9:593–600.CrossRefPubMedPubMedCentralGoogle Scholar
- 17.Wang J, Yu R, Shete S. X-chromosome genetic association test accounting for X-inactivation, skewed X-inactivation, and escape from X-inactivation. Genet Epidemiol. 2014;38:483–93.CrossRefPubMedPubMedCentralGoogle Scholar
- 18.Kubota T. A new assay for the analysis of X-chromosome inactivation in carriers with an X-linked disease. Brain Dev. 2001;23(Suppl 1):S177–81.CrossRefPubMedGoogle Scholar
- 19.Busque L, Zhu J, DeHart D, Griffith B, Willman C, Carroll R, Black PM, Gilliland DG. An expression based clonality assay at the human androgen receptor locus (HUMARA) on chromosome X. Nucleic Acids Res. 1994;22:697–8.CrossRefPubMedPubMedCentralGoogle Scholar
- 20.Szelinger S, Malenica I, Corneveaux JJ, Siniard AL, Kurdoglu AA, Ramsey KM, Schrauwen I, Trent JM, Narayanan V, Huentelman MJ, Craig DW. Characterization of X chromosome inactivation using integrated analysis of whole-exome and mRNA sequencing. PLoS One. 2014;9:e113036.CrossRefPubMedPubMedCentralGoogle Scholar
- 21.Carrel L, Willard HF. Heterogeneous gene expression from the inactive X chromosome: an X-linked gene that escapes X inactivation in some human cell lines but is inactivated in others. Proc Natl Acad Sci U S A. 1999;96:7364–9.CrossRefPubMedPubMedCentralGoogle Scholar
- 22.Cotton AM, Lam L, Affleck JG, Wilson IM, Peñaherrera MS, McFadden DE, Kobor MS, Lam WL, Robinson WP, Brown CJ. Chromosome-wide DNA methylation analysis predicts human tissue-specific X inactivation. Hum Genet. 2011;130:187–201.CrossRefPubMedPubMedCentralGoogle Scholar
- 23.Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika. 1993;80:267–78.CrossRefGoogle Scholar
- 24.Chen Z, Ng HKT, Li J, Liu Q, Huang H. Detecting associated single-nucleotide polymorphisms on the X chromosome in case control genome-wide association studies. Stat Methods Med Res. 2017;26:567–82.CrossRefPubMedGoogle Scholar
- 25.Wang P, Xu SQ, Wang BQ, Fung WK, Zhou JY. A robust and powerful test for case–control genetic association study on X chromosome. Stat Methods Med Res. 2018. https://doi.org/10.1177/0962280218799532.
- 26.Team RC. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. 2013. http://www.r-project.org. 2018.
- 27.Liseo B. Bayesian and conditional frequentist analyses of the Fieller’s problem. A critical review Metron 2003;LXI:133–50.Google Scholar
- 28.Hwang JTG. Fieller’s problems and resampling techniques. Stat Sinica. 1995;5:161–71.Google Scholar
- 29.Amos CI, Chen WV, Remmers E, Siminovitch KA, Seldin MF, Criswell LA, Lee AT, John S, Shephard ND, Worthington J, Cornelis F, Plenge RM, Begovich AB, Dyer TD, Kastner DL, Gregersen PK. Data for genetic analysis workshop (GAW) 15 problem 2, genetic causes of rheumatoid arthritis and associated traits. BMC Proc. 2007;1(Suppl 1):S3.CrossRefPubMedPubMedCentralGoogle Scholar
- 30.Marsaglia G. Ratios of normal variables and ratios of sums of uniform variables. J Am Stat Assoc. 1965;60:193–204.CrossRefGoogle Scholar
- 31.Diaz-Francés E, Rubio FJ. On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables. Stat Pap. 2013;54:309–23.CrossRefGoogle Scholar
- 32.Wilcox RR, Keselman HJ. Modern robust data analysis methods: measures of central tendency. Psychol Methods. 2003;8:254–74.CrossRefPubMedGoogle Scholar
- 33.Chabchoub G, Uz E, Maalej A, Mustafa CA, Rebai A, Mnif M, Bahloul Z, Farid NR, Ozcelik T, Ayadi H. Analysis of skewed X-chromosome inactivation in females with rheumatoid arthritis and autoimmune thyroid diseases. Arthritis Res Ther. 2009;11:R106.CrossRefPubMedPubMedCentralGoogle Scholar
- 34.Genetic Analysis Workshop. 1982. https://www.gaworkshop.org. Accessed 3 Oct 2018.
- 35.Zou QL, You XP, Li JL, Fung WK, Zhou JY. A powerful parent-of-origin effects test for qualitative traits on X chromosome in general pedigrees. BMC Bioinformatics. 2018;19:8.CrossRefPubMedPubMedCentralGoogle Scholar
- 36.Hickey PF, Bahlo M. X chromosome association testing in genome wide association studies. Genet Epidemiol. 2011;35:664–70.CrossRefPubMedGoogle Scholar
- 37.Ma L, Hoffman G, Keinan A. X-inactivation informs variance-based testing for X-linked association of a quantitative trait. BMC Genomics. 2015;16:241.CrossRefPubMedPubMedCentralGoogle Scholar
- 38.Choi S, Lee S, Qiao D, Hardin M, Cho MH, Silverman EK, Park T, Won S. FARVATX: family-based rare variant association test for X-linked genes. Genet Epidemiol. 2016;40:475–85.CrossRefPubMedPubMedCentralGoogle Scholar
- 39.Gleser LJ, Hwang JT. The nonexistence of 100(1-α)% confidence sets of finite expected diameter in errors-in-variables and related models. Ann Stat. 1987;15:1351–62.CrossRefGoogle Scholar
- 40.Allele Frequency Net Database. 2015. http://www.allelefrequencies.net/. Accessed 3 Oct 2018.
- 41.UCSC Genome browser database. 2000. http://genome.ucsc.edu/. Accessed 3 Oct 2018.
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.