Introduction

Advances in next-generation sequencing (NGS) technologies have led to harnessing robust structural and functional knowledge of the human genome. Thus, NGS provides unprecedented opportunities to understand health and disease at the present time1,2,3,4. But the transformation of knowledge from bench to bedside lies in mining the massively available data for genes and variants of high clinical relevance5. To understand the genetic architecture of disease, genome-wide association studies6,7 and several other studies based on single-omic data such as gene expression or DNA methylation have catalogued many disease-associated loci. Nonetheless, we are yet to understand the aetiology of many complex diseases as they occur due to an intricate interplay of various genetic elements8.

Multi-omics data integration serves as a springboard for unravelling such inherent complexity underlying the genetic architecture of disease8,9. Data integration through statistical and/or computational models plays a major role in the prediction of genomic and environmental perturbations underlying disease/complex traits, and transferring preclinical knowledge to clinical trials with increased speed and accuracy10,11. Statistical models12,13,14,15,16 for data integration enhance the predictive power of gene-disease association, by incorporating prior knowledge of regulatory relation among different omics data and analysing them under one statistical framework.

With access to enormous data from several consortiums on various omics data, many data integration techniques have been developed so far. Such methods comprise dimension reduction techniques, gene regulatory networks, feature selection techniques using supervised, unsupervised or semi-supervised learning, graph or kernel-based techniques, and Bayesian and frequentist approaches16,17,18,19,20,21,22,23. More often, however, integration methods combine multiple omics data from large consortiums of different cohorts15,24. Such methods are prone to spurious prioritisation of associated genes owing to substantial cross-cell-type variation25. For these reasons, and to reduce the stratification bias due to population diversity, increasing attempts have recently been made to create large-scale multi-omics datasets by combining multiple assays from the same set of samples26.

In addition, individual research groups have made substantial efforts to generate data on genetic variation, gene expression, methylation, phenotype, etc. simultaneously from the same set of samples to study biomarkers. Realising its great potential, data sharing platforms/repositories store such heterogeneous data for the broader markets demanding immediate study26,27,28. But unlike large consortium data, these data might have a relatively small sample size. In addition, missing data occur across multiple omics. For example, while integrating information from different types of genomic data, genotype information is typically available for all the individuals, but gene expression and/or methylation information is often partially missing29. Yet gene expression and/or methylation assays are rarely repeated to generate the missing data, for reasons such as the huge cost of the assays2, degradation of mRNA30 and/or dearth of tissue samples. So, a major challenge is to integrate multi-omics data in the presence of partially missing individual-level observations26.

Few Bayesian methods31,32 consider missing value imputation in the multi-omics data integration model. But sometimes imputed data overshadow the contribution from the partially observed data for certain percentages of missing data, and deciding whether or not to impute generally involves a huge computational cost31. Moreover, imputing the missing values might be misleading25 as it introduces bias and uncertainty in the data33, especially when the missing percentage is large and/or the reason for missingness is unclear. To deal with the missing values it is important to understand the data source, data structure, missing mechanism, and amount of missing data34 along with its relation with the phenotype33. Some network-based methods35 consider partial multi-omics data integration using a similarity network but assume the same contribution of different omics.

In this paper, we propose a multi-omics genetic association tool, called TiMEG (Tool for integrating Methylation, gene Expression and Genotype), for the identification of disease-associated biomarkers, by integrating single nucleotide polymorphism (SNP), gene expression, and DNA methylation with partially missing omics data under case–control paradigm. Our method elucidates the effect of multiple omics on the qualitative phenotype (case-control status). It jointly dissects the information on various omics data, their inter-relationship, and the information from individuals with completely as well as partially available omics data, without imputing the missing data. Using a likelihood-based approach, TiMEG models the conditional distribution of the response variable using the missing predictor variables36,37.

Asymptotic distribution of our test statistic under the null hypothesis of no genetic association computes p-values much faster compared to other computational methods like permutation-based or resampling techniques etc. Extensive simulation confirms robust performance in terms of prediction accuracy of estimation in tenfold cross-validation, controlled type I error rate, high statistical power, and consistency of the test under different missing data schemes. Application of our method on a real dataset of tuberous sclerosis (also called tuberous sclerosis complex (TSC)) patients and healthy controls (phs001357.v1.p1) identified functionally relevant genes and gene clusters belonging to the pathways that are involved in TSC pathogenesis. Even with small sample size and a substantially high percentage of missing data, TiMEG could be used for the identification of biomarkers without losing any information. This leads to capturing weaker signals that remain unidentified in larger single-omic data analysis.

Results

TiMEG method

In the presence of limited sample size, missing individual-level information on multiple assays causes a great loss of information. Imputation might lead to bias in such a small sample, as the percentage of missing data is large. We introduce TiMEG, a general analytical approach for the identification of biomarkers associated with a disease by integrating multiple omics data with/without missing individual-level omics information under a case–control paradigm. We integrate data from DNA sequencing, gene expression, and DNA methylation assays along with covariates and qualitative phenotype (disease status) from the same set of samples. Figure 1 gives a general structure of omics data availability. Based on Fig. 1, we design our missing data schemes (Table 1).

TiMEG relies on a likelihood-based approach to gather information on the missing data by estimating the parameters in the likelihood function containing incomplete omics data rather than imputing the missing data before the analysis. To obtain the likelihood function, we find the probability distribution of the response variable (disease status) conditional on the available omics information by (1) integrating out the missing variable from the joint likelihood function (see Methods), and (2) exploiting the interdependence among the multiple omics data.

Since an individual’s gene expression level could be regulated by alteration in the DNA sequence, TiMEG considers the effect of genotype and methylation on gene expression. Similarly, it incorporates the effect of genotype on methylation and also the effect of genotype, methylation, gene expression, and covariates on disease status. We assume that after appropriate transformation and normalization procedures, individual-level gene expression and methylation data follow a bivariate normal distribution13.
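As an illustration of the "integrating out" step described above, the observed-data contribution of an individual can be sketched as follows. The notation is assumed here for illustration only (\(D\) for disease status, \(G\), \(M\), \(E\) for genotype, methylation and gene expression, \(X\) for covariates, \(\theta\) for all model parameters); the authors' exact formulation is the one given in Methods.

```latex
% Gene expression missing: integrate E out of the joint model
P(D \mid G, M, X; \theta)
  = \int P(D \mid G, M, E, X; \theta)\, f(E \mid G, M; \theta)\, dE

% Methylation missing: integrate M out, given the observed genotype and expression
P(D \mid G, E, X; \theta)
  = \int P(D \mid G, M, E, X; \theta)\, f(M \mid G, E; \theta)\, dM

% Both missing: integrate over the bivariate normal density of (M, E) given G
P(D \mid G, X; \theta)
  = \iint P(D \mid G, M, E, X; \theta)\, f(M, E \mid G; \theta)\, dM\, dE
```

Under this reading, every individual contributes one of these conditional probabilities (or the complete-data term) to the overall likelihood, so no observation is discarded and no value is imputed.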

Features of TiMEG

TiMEG is a statistical tool for the identification of disease-associated combinations of multiple omics. For example, a significant combination provides a gene along with a cis-methylation and a cis-genotype (collectively called a ‘trio’ throughout the paper). In this article, we illustrate a detailed pipeline for the identification of significant trios, if at least one component of the trio is associated with the disease.

Other advantages of TiMEG include:

  • the theoretical ability to handle an unrestricted number of missing transcriptomic and epigenetic data, as elucidated in the Methods section and confirmed using simulations,

  • the ability to capture weaker signals that remain unidentified in single-omic analysis by incorporating more information from the inter-relation among multiple omics data in the likelihood framework,

  • robust performance in terms of prediction accuracy of estimation in tenfold cross-validation, controlled type I error rate, and high statistical power,

  • efficient incorporation of correlated omics information to decipher significant signals, resulting in a reduced false-negative rate, and

  • prompt calculation of p-values using the asymptotic distribution as opposed to computation-intensive permutation-based resampling methods.

Since alteration in gene expression regulation is expected to alter phenotype more than any change in the DNA sequence, the performance of TiMEG reveals that the effect (in terms of statistical power) of a certain percentage of missing gene expression data is more than the same percentage of missing methylation data (see Simulations). Missing both omics information for a subset of individuals will lead to much more loss of information and therefore statistical power than missing gene expression or methylation data on any subset of equivalent size. Thus, TiMEG agrees with the biologically accepted notion. Regardless of the sample size and/or percentage of missing omics data, wet-lab researchers will be able to promptly identify significant biomarkers from their data by calculating p-values using this tool.

Figure 1

Structure of data availability for genotype (G), gene expression (E), methylation (M) and phenotype (P). Each letter indicates the presence of corresponding data.

Performance of TiMEG

Simulations

We perform extensive simulations to study the performance of TiMEG for varying percentages of missing omics data under different missing data schemes. As gene expression and methylation data are more often missing for a subset of genotyped individuals, we assume that genotype, phenotype, and covariate data are available for the entire sample of size n (say). As shown in Table 1, four different scenarios could arise for these n individuals: (1) none of the other two omics data is missing for a certain subset of size \(n_1\) (say), (2) only gene expression data are missing for another subset of size \(n_2\) (say), (3) only methylation data are missing for the third subset of size \(n_3\) (say), and (4) both gene expression and methylation data are missing for the remaining subset of size \(n_4\) (say). Here, we consider two covariates, age and gender, and simulate them from N(40, 6) and Bin(1, 0.5) respectively. First, we simulate data under scheme 1, i.e., when there is no missing observation.

Table 1 Missing data schemes.

For generating genotype data, we assume an SNP having two alleles A and a, with A as the minor allele. Considering di-allelic loci, we simulate genotype data from a Bernoulli distribution assuming Hardy-Weinberg Equilibrium (HWE) for controls with minor allele frequency (MAF) 0.2 for the associated SNP. We generate genotypes for cases using an additive model for relative risk38 based on disease prevalence \(= 0.1\), genotypes (AA, Aa, and aa), and relative risk \(=1.2\).
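The genotype step can be sketched in Python as below. This is a minimal illustration, not the authors' code: it assumes that "additive" means genotype relative risks of 1, r and 2r - 1 for 0, 1 and 2 copies of the minor allele, and it applies the HWE frequencies as the baseline for both groups. Under those assumptions the disease prevalence cancels when the case probabilities are normalised. The function names are hypothetical.

```python
import random

def hwe_probs(maf):
    """Probabilities of carrying 0, 1, 2 copies of the minor allele under HWE."""
    q = 1.0 - maf
    return [q * q, 2.0 * maf * q, maf * maf]

def case_probs(maf, rr):
    """Genotype probabilities among cases.

    Assumption: additive relative risks 1, rr, 2*rr - 1 relative to the
    non-carrier genotype. P(g | case) is proportional to P(g) * risk(g),
    so the baseline penetrance (and hence the prevalence) cancels out.
    """
    risks = [1.0, rr, 2.0 * rr - 1.0]
    unnormalised = [p * w for p, w in zip(hwe_probs(maf), risks)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def sample_genotypes(probs, n, rng):
    """Draw n genotypes coded as minor-allele counts (0, 1 or 2)."""
    return rng.choices([0, 1, 2], weights=probs, k=n)

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
controls = sample_genotypes(hwe_probs(0.2), 200, rng)
cases = sample_genotypes(case_probs(0.2, 1.2), 200, rng)
```

With MAF 0.2 and relative risk 1.2, the heterozygote frequency rises from 0.32 in controls to about 0.356 in cases under these assumptions.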

Next, we generate methylation and gene expression values using Eqs. 2 and 3 (see Methods) assuming the values of the parameters as \(\alpha _0=1.3, \alpha _g=2.4, \gamma _0=1.9, \gamma _g=0.6, \gamma _m=2.3\). Methylation-gene expression pair follows a bivariate normal distribution with variances \(\sigma _1^2=\sigma _2^2=1\). We assume that the means of the bivariate distributions for cases and controls differ by 0.3. Now, based on covariates, genotype, gene expression, and methylation, we simulate the phenotype of each individual using Bernoulli distribution from Eq. 1 (see Methods) with parameters \(\beta _0=1, \varvec{\beta _x}'= 0.01{\varvec{1}}', \beta _g=0.1, \beta _m=0.2, \beta _e=0.3\). We generate data for n cases and n controls where the sample size n is taken as 100, 150 and 200. For the sake of power comparison, we take equal sample sizes for cases and controls. However, our method works for unequal sample sizes for cases and controls as well.
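A rough sketch of this data-generating chain is given below. Eqs. 1 to 3 are not reproduced in this section, so the linear forms used here (methylation regressed on genotype, expression regressed on genotype and methylation, and a logistic model for disease status) are assumptions inferred from the description of the model, as is reading N(40, 6) as mean 40 and variance 6. The case/control mean shift of 0.3 is omitted for brevity, and `simulate_individual` is a hypothetical name.

```python
import math
import random

# Parameter values quoted in the text; the functional forms below are assumed.
ALPHA_0, ALPHA_G = 1.3, 2.4                 # methylation on genotype
GAMMA_0, GAMMA_G, GAMMA_M = 1.9, 0.6, 2.3   # expression on genotype and methylation
BETA_0, BETA_X = 1.0, 0.01                  # intercept and common covariate effect
BETA_G, BETA_M, BETA_E = 0.1, 0.2, 0.3      # genotype, methylation, expression effects

def simulate_individual(genotype, rng):
    """Simulate covariates, omics values and disease status for one genotype."""
    age = rng.gauss(40.0, math.sqrt(6.0))    # assumes 6 is the variance
    gender = 1 if rng.random() < 0.5 else 0  # Bin(1, 0.5)
    # Unit-variance normal residuals, one reading of sigma_1^2 = sigma_2^2 = 1.
    methylation = ALPHA_0 + ALPHA_G * genotype + rng.gauss(0.0, 1.0)
    expression = (GAMMA_0 + GAMMA_G * genotype
                  + GAMMA_M * methylation + rng.gauss(0.0, 1.0))
    # Assumed logistic link between the linear predictor and disease risk.
    eta = (BETA_0 + BETA_X * (age + gender) + BETA_G * genotype
           + BETA_M * methylation + BETA_E * expression)
    prob = 1.0 / (1.0 + math.exp(-eta))
    status = 1 if rng.random() < prob else 0
    return {"age": age, "gender": gender, "methylation": methylation,
            "expression": expression, "prob": prob, "status": status}

record = simulate_individual(genotype=1, rng=random.Random(7))
```

In a full simulation one would keep drawing individuals until n cases and n controls have been collected, matching the balanced design described above.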

For the other three schemes, we generate complete data as above and remove some omics information to introduce missingness. For the second scheme where only gene expression is missing, we remove varying percentages (\(10\%\), \(20\%\), \(40\%\), \(60\%\), \(80\%\)) of gene expression values. Similarly, we remove varying percentages of methylation values for the third scheme. For both omics missing scheme, we remove gene expression and methylation values for different combinations of missing percentages (Tables 2 and 3).
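The masking step can be illustrated with the short sketch below. It assigns each individual to one of the four schemes of Table 1 (1: complete, 2: only gene expression missing, 3: only methylation missing, 4: both missing) and blanks the corresponding values. The function names and the use of `None` as the missing-value marker are illustrative choices, not taken from the TiMEG implementation.

```python
import random

def assign_schemes(n, frac_e_only, frac_m_only, frac_both, seed=0):
    """Return a shuffled list of scheme labels (1 to 4) for n individuals."""
    n2 = round(n * frac_e_only)   # only gene expression missing
    n3 = round(n * frac_m_only)   # only methylation missing
    n4 = round(n * frac_both)     # both omics missing
    n1 = n - n2 - n3 - n4         # complete data
    if n1 < 0:
        raise ValueError("missing fractions sum to more than 1")
    labels = [1] * n1 + [2] * n2 + [3] * n3 + [4] * n4
    random.Random(seed).shuffle(labels)
    return labels

def apply_missingness(expression, methylation, labels):
    """Replace values by None according to each individual's scheme label."""
    masked_e = [None if s in (2, 4) else v for v, s in zip(expression, labels)]
    masked_m = [None if s in (3, 4) else v for v, s in zip(methylation, labels)]
    return masked_e, masked_m

labels = assign_schemes(200, frac_e_only=0.20, frac_m_only=0.10, frac_both=0.10)
masked_e, masked_m = apply_missingness([1.0] * 200, [2.0] * 200, labels)
```

Genotype, phenotype and covariates are left untouched, consistent with the assumption that they are observed for the entire sample.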

To estimate the parameters in the model, we maximise the likelihood function using a numerical optimisation technique (see Methods). We construct a test statistic using the above estimates to test whether a trio is associated with the phenotype. Here, we use the likelihood ratio test for testing the null hypothesis (\(H_0\)) of no effect of genotype, gene expression, and methylation on affection status. Under \(H_0\), this test statistic asymptotically follows a \(\chi ^2\) distribution with 3 degrees of freedom. Figure 2 illustrates the QQ plot of sample quantiles from the empirical distribution of the test statistic under \(H_0\) against theoretical quantiles of the \(\chi ^2_3\) distribution, for a complete dataset and another dataset with missing data.
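Once the two maximised log-likelihoods are available, the p-value follows directly from the \(\chi ^2_3\) reference distribution. The sketch below uses the closed-form survival function for 3 degrees of freedom, so it needs only the standard library. The function names and the example log-likelihood values are hypothetical.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    if x <= 0.0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

def lrt_pvalue(loglik_null, loglik_full):
    """Likelihood ratio statistic and its asymptotic p-value (3 df)."""
    stat = 2.0 * (loglik_full - loglik_null)
    # Numerical optimisation can give a slightly negative statistic; clip at 0.
    return stat, chi2_sf_3df(max(stat, 0.0))

# Illustrative log-likelihoods, not values from the paper.
stat, pvalue = lrt_pvalue(loglik_null=-120.0, loglik_full=-115.0)
```

Because the p-value is a single function evaluation per trio, this step scales to millions of tests without permutation or resampling.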

Figure 2

QQ-plot with sample size 200 based on the performance of simulated data. (A): QQ-plot with no missing data, (B): QQ-plot with 10% both gene expression and methylation missing, 10% only methylation and 20% only gene expression missing.

Based on 10000 datasets, we find that type I error rate of our test is controlled nearly at \(5\%\) level of significance for each of the different sample sizes, missing data schemes, and percentages of missing omics data (Tables 2 and S1). Hence, our test statistic is conservative in controlling false positives and is useful for p-value computation in a real dataset. To examine the performance of the test, we calculate statistical power under different missing data schemes and for various percentages of missing omics data based on 1000 datasets (Tables 3 and S2). To find whether the power of the test increases with an increase in sample size, we calculate power based on \(5\%\) cut-off points from 10000 datasets generated under \(H_0\), for different percentages of missing omics data corresponding to each missing data scheme. This would keep the type I error rate fixed exactly at \(5\%\) level to make a uniform power comparison. Table 3 demonstrates a substantial increase in power for every combination of missing data with an increase in sample size. Under each missing data scheme, when the percentage of missing data increases, the power decreases. When there is no missing data the power would be maximum. Thus, our test is consistent.
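The empirical cut-off used for the power comparison can be sketched as follows: the critical value is taken from the upper 5% point of the statistics simulated under \(H_0\), and power is the fraction of statistics simulated under the alternative that exceed it. This is a minimal illustration with hypothetical function names and toy inputs, not the authors' code.

```python
def empirical_cutoff(null_stats, alpha=0.05):
    """Value exceeded by a fraction alpha of the null test statistics."""
    ordered = sorted(null_stats)
    n_tail = round(alpha * len(ordered))       # number of statistics in the upper tail
    return ordered[len(ordered) - n_tail - 1]  # largest value outside that tail

def rejection_rate(stats, cutoff):
    """Fraction of test statistics strictly greater than the cut-off."""
    return sum(1 for s in stats if s > cutoff) / len(stats)

# Toy inputs standing in for simulated likelihood ratio statistics.
null_stats = list(range(1, 101))
alt_stats = [90, 96, 97, 100]

cutoff = empirical_cutoff(null_stats)
type_one_error = rejection_rate(null_stats, cutoff)
power = rejection_rate(alt_stats, cutoff)
```

Calibrating the cut-off on the null simulations fixes the empirical type I error at the chosen level, so differences in power across sample sizes and missing data schemes are directly comparable.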

As mentioned earlier, we now observe from Table 3 that TiMEG is more affected (as evident from the drop in statistical power) by (1) a certain percentage of missing gene expression data than the same percentage of missing methylation data and (2) missing both omics information for a subset of individuals than missing gene expression or methylation data on any subset of equivalent size. For instance, let us consider a fixed sample size of 200 (say) and a fixed percentage of missing omics data \(40\%\) (say). From Table 3 we note that the power when only gene expression data are missing (0.866) is less than that when only methylation data are missing (0.936). Clearly, the power when both omics are missing (0.793) is less than the minimum of the above two. Such a difference in power is biologically expected (which is reflected in the complete-case analysis as well) because alteration in gene expression is more informative than any other change in the DNA sequence. Thus, for a subset of individuals, no information on any of the two mentioned omics causes more loss of information compared to the presence of at least one of them. So, at the individual level, if possible, it is better to collect at least one observation from gene expression or methylation. Thus, provided there exists a choice, it is preferable to keep the percentage of missing gene expression lower than that of methylation because gene expression data are more informative.

When miscellaneous percentages of data are missing, we observe (from Table 3) a similar phenomenon as above. For the fixed overall percentage of missing omics data (\(40\%\)) and the fixed sample size (200), we consider three combinations such as (1) \(20\%\) individuals have both omics missing, another \(10\%\) individuals have only gene expression missing, and another \(10\%\) individuals have only methylation missing, (2) \(10\%\) individuals have both omics missing, another \(20\%\) individuals have only gene expression missing, and another \(10\%\) individuals have only methylation missing, and (3) \(10\%\) individuals have both omics missing, another \(10\%\) individuals have only gene expression missing, and another \(20\%\) individuals have only methylation missing. The powers in the above three combinations are 0.858, 0.867, and 0.907. So we observe that even in a miscellaneous missing scenario with such marginal difference in missing percentage of the omics, TiMEG is able to differentiate between statistical powers in accordance with the biological expectation. These results indicate that our method is able to capture all available information corresponding to every individual under study and its performance is robust to the percentage and scheme of missing omics data.

In Table 3, we also include a comparison of the statistical power of TiMEG with a complete-case analysis. Note that the performance of TiMEG is clearly better than the complete-case analysis for moderate missing percentages. For lower percentages both of them give comparable powers. Besides, we compute powers using mean imputation (MI), but they are lower than those of TiMEG and the type I errors are highly inflated (Tables S1 and S2). For k-nearest neighbour (KNN) imputation, powers are not stable (Table S2). We observe that as the percentage of missing data increases, the power first decreases but then increases beyond some point. The powers clearly fluctuate once the missing percentage exceeds \(40\%\). This could typically be due to the uncertainty introduced by the imputation technique when the amount of missing data is moderately large34.

Table 2 Type I error rate under different combinations of sample sizes and varying percentages of missing methylation and/or gene expression values based on 10,000 simulations.
Table 3 Power under different combinations of sample sizes and varying percentages of missing methylation and/or gene expression values based on 1000 simulations.

Run time

We compare the computation time of TiMEG with other methods in Table 4. All programs are run on a Mac laptop (macOS Big Sur, version 11.5.1) with an Apple M1 chip and 8 GB RAM. We find that the maximum expected time per run for TiMEG is less than that of all other methods. Moreover, unlike other methods, the analysis time for TiMEG is consistent.

Performance evaluation

To evaluate the predictive performance of our method, we assess the prediction accuracy of our estimation by tenfold cross-validation (CV). We also compare TiMEG with commonly used imputation-based methods such as KNN39 and MI, and with the actual dataset without missing omics data. For TiMEG, we first generate a dataset that has a pre-assigned missing omics data structure and divide it into two parts, a test set and a training set. Observations with no missing data are split into 10 blocks. One block is selected as a test set and all remaining individuals form a training set. Based on the first training set, we estimate the \(\beta \) coefficients (using Eq. 1 in Methods) and classify the individuals in the corresponding test set as cases or controls. We repeat this procedure for all 10 test sets and calculate average prediction accuracy, specificity, and sensitivity for the dataset. For each pre-assigned missing omics data structure, we generate 100 datasets and perform a tenfold CV on each of them. Using them, we compute the specificity and sensitivity of our method for different thresholds of classification. Next, on the same generated dataset, we apply mean and KNN imputation methods to determine specificity and sensitivity using a tenfold CV. We provide four receiver operating characteristic (ROC) curves, each depicting better performance of TiMEG in comparison to the imputation methods. Also, we compare them with the ROC curve on full data (Fig. 3). Interestingly, we observe that our method classifies an individual into a case or control group as efficiently as with full data.
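The splitting rule described above, where only complete observations are rotated through the test set while every other individual stays in the training set, can be sketched as below. The scheme labels follow Table 1 (label 1 means no missing omics). The function names are hypothetical, and the model-fitting step is deliberately left out.

```python
import random

def cv_splits(labels, k=10, seed=0):
    """Yield (train, test) index lists; test folds contain complete cases only."""
    complete = [i for i, s in enumerate(labels) if s == 1]
    random.Random(seed).shuffle(complete)
    for fold in range(k):
        test = complete[fold::k]   # every k-th complete individual
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, test

def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Sensitivity and specificity of thresholded predicted probabilities."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if y == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

# 20 complete individuals plus 5 with missing gene expression (toy example).
splits = list(cv_splits([1] * 20 + [2] * 5))
```

Sweeping `threshold` over a grid and plotting sensitivity against 1 - specificity gives ROC curves of the kind shown in Fig. 3.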

Figure 3

Plot of ROC graphs depicting situations with (A) only gene expression missing for \(60\%\) individuals, (B) only methylation missing for \(60\%\) individuals, (C) both omics missing for \(60\%\) individuals, and (D) both omics missing for \(20\%\), only gene expression missing for \(20\%\) and only methylation missing for \(20\%\) individuals.

Figure 4

Boxplot of prediction accuracy from tenfold CV based on 100 datasets each with 200 cases and 200 controls under different missing omics data structure. The black horizontal line indicates median prediction accuracy for the datasets with no missing information. Each boxplot (from left to right) signifies one combination each viz. no missing information, only \(10\%\), \(20\%\), \(40\%\), \(60\%\), \(80\%\) gene expression missing respectively, only \(10\%\), \(20\%\), \(40\%\), \(60\%\), \(80\%\) methylation missing respectively, \(10\%\), \(20\%\), \(40\%\), \(60\%\), \(80\%\) of both gene expression and methylation missing respectively, \(5\%\) of both missing along with \(5\%\) of only gene expression missing, \(5\%\) of only gene expression missing along with \(5\%\) of only methylation missing, \(5\%\) of both missing along with \(5\%\) of only methylation missing, \(10\%\) of both missing along with \(10\%\) of only gene expression missing, \(10\%\) of both missing along with \(10\%\) of only methylation missing, \(10\%\) of both missing along with \(5\%\) of only gene expression missing and another \(5\%\) of only methylation missing.

Moreover, we find that the mean prediction accuracy for classifying an individual to case or control group is more than \(80\%\) under all missing omics schemes and for different missing percentages (Supplementary Table S3). We observe that median prediction accuracy remains the same under all scenarios except when the percentage of missing data is very high (Fig. 4). Although the deviation of median prediction accuracy for an extremely high percentage (\(\sim 80\%\)) of missing values compared to that for no missing data is small, the higher dispersions indicate fluctuations of the prediction accuracies. This implies that the number of false-positive and false-negative classifications fluctuates for these extreme missing scenarios. We illustrated this increase in dispersion through the plot of false-positive rate (1 - Specificity) versus misclassification rate ((\(1 - \text {prediction accuracy})/100\)) under different percentages of missing omics data of different missing data schemes (Figs. 5, S1, S2, S3). It is evident that for comparatively smaller percentages of missing data, the dispersions are much less. Only for extreme conditions of missing data, the dispersion is slightly higher. Thus, we find that our method provides a robust estimate of the parameters under different missing omics schemes with a reasonable missing percentage and has high predictive power.

Figure 5

Plot of misclassification rate vs false-positive rate (1 - Specificity) for only gene expression missing. (A) depicts the no missing data scenario while (B–F) respectively depict \(10\%\), \(20\%\), \(40\%\), \(60\%\) and \(80\%\) only gene expression data missing scenarios.

Table 4 Expected average computation time (in seconds) per run based on 100 simulations.

Application to a real dataset

We applied our proposed method to a dataset on Tuberous Sclerosis Complex (TSC) patients and healthy controls (phs001357.v1.p1)29 obtained from the database of Genotypes and Phenotypes (dbGaP). TSC is a rare genetic disorder that causes the growth of non-cancerous (benign) tumours in the brain and other vital organs like kidneys, heart, skin, etc., and in some cases leads to significant health problems40.

After processing raw data (see Methods) from brain tissues, we obtained 8036 gene expression data on 27 cases and 7 controls, methylation data at 481470 CpG sites on 22 cases and 7 controls, and 1298477 whole-genome genotype data on 38 cases and 7 controls. We got data on all three omics for the control individuals. But only 12 case individuals had complete omics information. 9 other case individuals had all omics data except gene expression data, another group of 13 patients had no methylation data, and 4 patients had neither gene expression nor methylation data. Phenotype or disease status, covariates (such as age and gender), and genotype data were available for all cases and controls.

Since there are only a few control samples, we considered those genes that have no missing gene expression value in controls, while in the case samples we allowed up to \(50\%\) missing gene expression values. Next, we found the SNPs that are within 2000 bp upstream and downstream of each gene and considered these as the cis-SNPs of the gene. Moreover, if any methylation site is associated with a gene, information such as the corresponding gene name and chromosome number is known from dbGaP. So, methylation sites in the vicinity of a gene are considered cis-CpG sites corresponding to the gene. After filtering the data, the number of unique genes containing at least one cis-SNP and one cis-CpG site reduces to 1691, and the total number of trios (each comprising one gene expression with one cis-genotype and one cis-CpG site corresponding to the gene) is 1184436. To identify the trios associated with the disease, we performed our test for all the mentioned trios, followed by Benjamini-Hochberg (BH) multiple-testing correction (across all tests).
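Two of the steps above lend themselves to a short sketch: the 2000 bp window that defines cis-SNPs, and the BH adjustment applied across all trio-level p-values. The code below is a generic standard-library illustration with hypothetical function names and toy coordinates. It is not the pipeline used for the TSC data.

```python
def is_cis_snp(snp_pos, gene_start, gene_end, window=2000):
    """True if the SNP lies within the gene or `window` bp up/downstream of it."""
    return gene_start - window <= snp_pos <= gene_end + window

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy values for illustration only.
in_window = is_cis_snp(snp_pos=9000, gene_start=10000, gene_end=12000)
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

A trio would then be declared significant when its adjusted p-value falls below the chosen false discovery rate threshold.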

Interpretation of a TiMEG trio

For each significant trio, one or more of its components are associated with the disease. However, we are more interested to observe whether TiMEG is able to identify loci with moderately low effect sizes that are missed by single-omic analysis. Therefore, we find those combinations where TiMEG shows association but separate analyses do not. We see that TiMEG successfully captures weaker signals that remain unidentified in single-omic analysis. The probable reason might be that single-locus from any omics data is unlikely to account for much of the variability in the phenotype. Moreover, it is often indicated that an increase in the sample size might capture the loci with moderate or low effect on the disease, but in that case, multiple testing burden also increases resulting in missing out true signals. But our method reduces false-negative associations by efficiently incorporating correlated omics information to decipher the significant signals associated with the disease. Particularly in this article, TiMEG tests if there is any effect of at least one of the components of the trio on the phenotype, but it is also able to test the effect of a single omic locus or combination of any two omics (see Methods). Emphatically, testing a single omic locus using TiMEG will provide greater insight than traditional single-omic analysis because of incorporating additional information from other available omics in the integrated model.

Functional annotation of TSC genes

It is well known that mutations in either of the two tumor-suppressor genes viz. TSC141 and TSC242, that code for hamartin and tuberin proteins respectively, are responsible for TSC. The hamartin/tuberin heterodimer, formed by the interaction of TSC1 and TSC2 gene products, functions in complex pathways43. TSC1/2 genes and hence the hamartin/tuberin complex play a fundamental role in the regulation of the phosphoinositide 3-kinase (PI3K) signaling pathway44 that inhibits the mammalian target of rapamycin (mTOR) through activation of the GTPase activity of Rheb45.

Using TiMEG we obtained 170 unique genes (see Supplementary Table S4) from 3283 significant trios (https://github.com/sarmistha123/TiMEG) that are associated with the disease risk. These trios are significant due to the combined effect of all their components, but none of the single-omic analyses could identify any of the corresponding components. In this list, there exists a trio corresponding to gene TSC1 with cis-genotype kgp7096367 and cis-methylation site cg19350728 that shows no association in any of the single-omic analyses, but whose combined effect is significant. The TSC2 gene is excluded from our analysis because of a mismatch between probe IDs and HGNC IDs (see Methods).

We use David software46,47 to identify the functional annotations of the identified genes. Based on David’s group enrichment score, we obtained 5 clusters of our genes. The cluster with the maximum group enrichment score is associated with the serine/threonine kinase pathway. This pathway has a strong functional relation with TSC disease, as it is known that mutations in TSC1/2 genes impair the inhibitory function of the hamartin/tuberin complex, leading to phosphorylation (activation) of ribosomal protein S6 kinase beta-1 (S6K1), a serine/threonine kinase which is a downstream target of mTOR45.

Another cluster is associated with the zinc-finger protein pathway. Recent findings have highlighted the importance of the zinc-finger family and its involvement in tumorigenesis48. Interestingly, this pathway has some special implications in terms of brain tissues. Protein associated with Myc (called Pam) that is abundantly expressed in the brain, is associated with the tuberin/hamartin complex49. The C terminus of Pam containing the RING zinc-finger motif binds to tuberin49. Besides, Pam is a highly conserved nuclear protein that interacts directly with the transcriptional-activating domain of Myc (a protooncogene that plays an important role in the regulation of cellular proliferation, differentiation, and apoptosis and can contribute to tumorigenesis)50 and regulates mTOR signaling51.

Studies have revealed that TSC receives inputs from at least three major signaling pathways (PI3K-Akt-mTOR, ERK1/2-RSK1, LKB1-AMPK) in the form of kinase-mediated phosphorylation events that regulate its function as a GTPase activating protein (GAP)52. But only two genes viz. TSC1/2 are widely known to be responsible for the disease. Therefore, we searched whether any of our significant genes belong to the same pathway as that of the TSC1 gene. Using David software we identified genes ACACA and CREB5 in our list of significant genes, which occur in two pathways to which TSC1 belongs.

Studies show that TSC1 deficiency elevated ACACA expression and fatty acid synthesis, leading to impaired epigenetic imprinting on selective genes; tempering ACACA activity was able to divert cytosolic acetyl-CoA for histone acetylation and restore the gene expression program compromised by TSC1 deficiency53. CREB5 encodes CREB protein that serves as a transcriptional activator of Rheb. Rheb acts as an immediate activator of mTOR and in turn promotes tumorigenesis independently of TSC254. Moreover, we identified other genes such as JAK3, GNG4, FGFR2, EFNA2, LAMC2 that belong to one of the pathways as that of TSC1.

Implication of TiMEG

It is important to note that these genes could not have been identified by single-omic analysis. Data integration of different omics led to these findings even though the sample size is small and individual-level data are not available on all omics. Thus, TiMEG holds the potential to elucidate the genetic architecture of a disease by combining three different omics data, even in the presence of missing omics data and when the sample size is not very large.

Thus, if the sample size is not huge, studying single-omic data to find any new gene that is susceptible to the disease risk is difficult because most of the known diseases have already been studied extensively. Moreover, scientists nowadays are interested in developing drug targets with genetic evidence of disease association as they are much more likely to get approved55. So, identification of new disease-associated genes or biomarkers is immensely important to understand the relation of disease with various genes in the pathways. Downstream/detailed investigation of these biomarkers could provide a better understanding of the disease aetiology and hence discover the best drug targets that might lead to the successful development of novel drugs.

Discussion

Multi-omics data integration elucidates the understanding of the genetic architecture of diseases and complex traits by incorporating additional information from different types of genomic data. But the presence of missing values poses a major challenge. This is more crucial when the sample size is limited and/or the percentage of missing data is large. Sometimes, these missing values occur due to biological reasons such as degradation of RNA or other technical issues. But when resources are limited, more often these assays are not repeated for the missing omics data. Typically, genotype data, being less expensive than gene expression and DNA methylation assays, are available for the entire sample. One option to analyse such data is using a sub-sample for which data are available for all omics. Such a complete case analysis loses a great deal of information. Again, imputation might induce bias arising due to the genetic diversity of reference data15,24. On the other hand, different types of omics data may be correlated and associated with a disease, directly or indirectly. Thus, integrating evidence from the inter-relationship among omics data provides additional information for biomarker identification.

We propose TiMEG, a tool for the identification of biomarkers integrating genotype, gene expression, and DNA methylation in presence of missing data under the case–control paradigm. Based on a likelihood approach, TiMEG is able to capture weaker signals that are often missed by single-omic analysis, by efficiently combining the information on interdependence among multiple omics data. Rather than imputing the missing data before the analysis, TiMEG accumulates information on the missing data by estimating the parameters in the likelihood function containing incomplete omics data. For calculating the likelihood function for incomplete data, we evaluate the conditional distribution of the response variable given the available information. This information not only includes the available omics data but also the inter-relationship among different omics. Moreover, our method has the ability to tackle an unrestricted number of missing transcriptomic and epigenetic data. Asymptotic distribution of our test statistic derived under the null hypothesis of no association will lead to the fast calculation of p-values compared to computation-intensive techniques. Moreover, the normal approximation of the sigmoid function56 in the evaluation of the test statistic reduces the computation time to a great extent. Thus, TiMEG could be promptly applied by the end-users on real datasets.

Simulation results confirm the consistency of the test, robust prediction accuracy in tenfold CV, a controlled type I error rate, and high statistical power. As expected, the power of the test decreases as the percentage of missing values increases. For moderately high percentages of missing data, however, the power and the tenfold prediction accuracy are close to those obtained with no missing data. Simulation results also indicate that the reduction in power of the test is not substantial for extremely large missing percentages. Besides, the median prediction accuracy is nearly the same under all scenarios except when the percentage of missing data is very high (Fig. 4), but the mean prediction accuracy of classifying an individual to the case or control group is nearly the same (Table S3). Only for extreme percentages of missing omics data are the fluctuations in misclassification rate slightly higher (Figs. 5, S1, S2, S3). Thus, one of the major advantages of TiMEG lies in its applicability to moderately large missing percentages and limited sample size for the identification of biomarkers. Another advantage is that it identifies a combination of multiple omics loci (or trio) as biomarkers. So, even when any one or all of the components of a trio have a small effect on the disease (but a significant effect when combined), TiMEG detects it. This is because it is able to integrate multiple omics loci with small effects together, such that their combined effect on the disease is moderately large.

More often, wet-lab researchers encounter missing data in multiple omics assays, when data are collected on both patients and matched controls. One example of such an experiment is by Martin et al.29. We applied our method to their real dataset related to TSC that we obtained from dbGaP. The dbGaP data had genotype for all individuals (both TSC patients and healthy controls) but gene expression and/or methylation data were missing for a number of TSC patients. Although mutations in TSC1 and TSC2 genes are widely known to be responsible for the occurrence of TSC disease, several studies illustrated evidence of factors other than mutations in these genes, to be involved in the aetiology of the disease. Our method could identify a few more TSC associated genes at a much smaller sample size combining different omics data. Some of the identified genes have been previously reported to actively participate in the TSC disease causation54,57.

Although TiMEG tests a trio for possible association with a disease, it could be extended to test multiple SNPs and multiple methylation sites along with gene expression. But, this will increase the number of parameters in the model. One possibility is to replace multiple SNPs and methylation values with some combined value or score, for example, the median of the methylation values under study etc. We plan to extend TiMEG for the accommodation of multiple SNPs and CpG sites as future work. We have not considered any interaction effect among different omics on the phenotype in this model. So, another extension of this work would be considering the interaction effect. Moreover, extending TiMEG to accommodate mutations instead of SNPs is important for experiments related to cancer. Some experiments collect data on quantitative phenotypes. TiMEG could be applied in such cases by dichotomising the quantitative phenotype but it would lose information. Therefore, it is important to develop a method for quantitative phenotypes, which might not be straightforward. However, another strength of TiMEG is that it can test the effect of a single omic locus or combination of any two omics immediately. Such tests of single omic locus would provide greater insight than traditional single-omic analysis due to the additional insights from other omics data.

Moreover, application of TiMEG to the available data from public repositories might enhance the understanding of the disease by identifying different biomarkers. A detailed functional analysis of the significant association signals might facilitate understanding of the intricate genetic architecture of disease and therefore, translate the potential stored in the genomic data to develop targeted therapies and aid in precision medicine research.

Methods

Model

Figure 1 provides a general missing data structure of multiple omics in different studies. The effect of this structure on the identification of biomarkers is much more prominent in studies with a limited sample size compared to large consortium data. The objective of TiMEG is to identify disease-associated biomarkers by integrating individual-level information from genetic (SNP), transcriptomic, and epigenomic data along with their phenotype (disease status), and covariates in such scenarios. So, TiMEG explores the effect of multi-omics data on disease status and the inter-relation among multiple omics for biomarker identification. To illustrate the scenario, we consider n individuals with a known binary qualitative phenotype. For individual i \((i=1,2,\ldots ,n)\), let \(Y_i\), \(G_i\), \(M_i\) and \(E_i\) denote respectively phenotype, genotype, methylation, and gene expression, and \({\varvec{X}}_i\) denote the vector of J covariates like age, gender, and other environmental variables. We denote \(Y_i=-1\), for controls and \(Y_i=1\), for cases. Conventionally, \(G_i\) takes value 0, 1, 2 depending on the number of minor alleles present. Let M and E denote the vectors of continuous values for n individuals and X be a matrix of order \(n\times J\) that includes covariate values for all n individuals. Based on the aforementioned omics data, we have proposed the following model for the \(i^{th} (i=1,\ldots n)\) individual.

$$\begin{aligned} P(Y_i=y_i)&=\sigma (y_i(\beta _0+\beta _e E_i+\beta _m M_i+\beta _g G_i+ {\varvec{\beta }}^{\prime }_{x} {\varvec{X}}_i)) \end{aligned}$$
(1)
$$\begin{aligned} M_{i}&= \alpha _0+\alpha _g G_i +\epsilon _{1i}\end{aligned}$$
(2)
$$\begin{aligned} E_{i}&= \gamma _0+\gamma _g G_i+\gamma _m M_i +\epsilon _{2i} \end{aligned}$$
(3)

where \(\sigma (x)=\frac{1}{1+e^{-x}}\). We assume that \(G_i \sim Bin(2,p)\) where, p is the probability of occurrence of a minor allele and \((\epsilon _{1i},\epsilon _{2i}) \sim N_2(0,0,\sigma ^2_1,\sigma ^2_2,\rho )\) where \(\sigma ^2_1\) and \(\sigma ^2_2\) denote the variances of \(\epsilon _{1i}\) and \(\epsilon _{2i}\) respectively, and \(\rho =Cor(\epsilon _{1i},\epsilon _{2i})\). Here, we have considered a likelihood based approach for estimation of the parameters in Eqs. \(1-3\). Denote the set of all parameters as, \(\theta =(\beta _0,\beta _e,\beta _m,\beta _g, \varvec{\beta }_x,\alpha _0,\alpha _g,\gamma _0,\gamma _g,\gamma _m,\sigma ^2_1,\sigma ^2_2,p)'\). So, our joint likelihood function for the full data (i.e. when there is no missing observation), becomes:

$$\begin{aligned} f_{\theta }({\varvec{y}}, {\varvec{E}}, {\varvec{M}}, {\varvec{G}} \vert X)&= \prod ^n_{i=1} f_{\theta }(y_i, E_i, M_i, G_i \vert {\varvec{X}}_i) \nonumber \\&= \prod ^n_{i=1} \Big \{f_{\theta }(y_i \vert E_i, M_i, G_i, {\varvec{X}}_i) f_{\theta }(E_i \vert M_i, G_i, {\varvec{X}}_i) f_{\theta }(M_i \vert G_i, {\varvec{X}}_i) f_{\theta }(G_i \vert {\varvec{X}}_i) \Big \}&\nonumber \\&= \prod ^n_{i=1}\Big \{ \sigma (y_i(\beta _0+\beta _e E_i+\beta _m M_i+\beta _g G_i+\varvec{\beta }^{\prime }_x {\varvec{X}}_i)) {2\atopwithdelims (){G_i}} p^{G_i} (1-p)^{2-G_i}&\nonumber \\&\quad \times \frac{1}{\sigma _2 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_2} (E_i-\gamma _0-\gamma _g G_i-\gamma _m M_i)^2} \frac{1}{\sigma _1 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_1} (M_i-\alpha _0-\alpha _g G_i)^2} \Big \} \end{aligned}$$
(4)
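As an illustration of Eqs. (1)–(4), the following minimal Python sketch simulates individuals from the model and evaluates the logarithm of the full-data likelihood. It is not the TiMEG implementation: the single covariate, the parameter values in `THETA`, and all function names are assumptions made for this example, `s1` and `s2` stand for the variances \(\sigma ^2_1\) and \(\sigma ^2_2\), and the two error terms are drawn independently, which is what the factorisation in Eq. (4) uses.

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log of the logistic function sigma(z)."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def linear_predictor(th, e, m, g, x):
    """beta_0 + beta_e E + beta_m M + beta_g G + beta_x X (one covariate)."""
    return th["b0"] + th["be"] * e + th["bm"] * m + th["bg"] * g + th["bx"] * x


def simulate(th, n, rng):
    """Draw n records (y, E, M, G, x) following Eqs. (1)-(3)."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                              # hypothetical covariate
        g = sum(rng.random() < th["p"] for _ in range(2))    # G ~ Bin(2, p)
        m = th["a0"] + th["ag"] * g + rng.gauss(0.0, math.sqrt(th["s1"]))
        e = (th["g0"] + th["gg"] * g + th["gm"] * m
             + rng.gauss(0.0, math.sqrt(th["s2"])))
        prob_case = math.exp(log_sigmoid(linear_predictor(th, e, m, g, x)))
        y = 1 if rng.random() < prob_case else -1            # cases +1, controls -1
        data.append((y, e, m, g, x))
    return data


def full_loglik(th, data):
    """Log of Eq. (4): the four factor log-densities summed over individuals."""
    total = 0.0
    for y, e, m, g, x in data:
        total += log_sigmoid(y * linear_predictor(th, e, m, g, x))
        total += (math.log(math.comb(2, g)) + g * math.log(th["p"])
                  + (2 - g) * math.log(1.0 - th["p"]))
        total += normal_logpdf(e, th["g0"] + th["gg"] * g + th["gm"] * m, th["s2"])
        total += normal_logpdf(m, th["a0"] + th["ag"] * g, th["s1"])
    return total


# Hypothetical parameter values, chosen only for illustration.
THETA = {"b0": -0.2, "be": 0.5, "bm": 0.3, "bg": 0.4, "bx": 0.1,
         "a0": 0.0, "ag": 0.6, "g0": 0.0, "gg": 0.5, "gm": 0.7,
         "s1": 1.0, "s2": 1.0, "p": 0.3}

if __name__ == "__main__":
    sample = simulate(THETA, 500, random.Random(1))
    print(full_loglik(THETA, sample))
```

Working on the log scale avoids the numerical underflow that the product in Eq. (4) would cause for realistic sample sizes.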

As discussed earlier, genetic variants are available for a large population but transcriptomic and epigenomic data tend to be missing due to various reasons. So, we assume that genotype, phenotype, and covariate data are available for the whole population while varying percentages of either gene expression or methylation or both the omics are missing (Table 1). In the following section, we have introduced different schemes of missing values across multiple platforms.

Missing values scheme

We suppose that among n individuals, \(n_1\) individuals have complete data on all omics, phenotype, and covariates, for \(n_2\) individuals only gene expression is missing, \(n_3\) has only methylation values missing, and \(n_4\) individuals neither have data on gene expression nor on methylation. Thus, depending on the missing data type(s), we can consider three schemes of missingness (Table 1). In each case, we have written the appropriate likelihood function. For that, we need to consider the following lemma (for proof see Appendix A, Supplementary Material).

Lemma 1

If \(\phi (x)\) is the p.d.f. of a standard normal distribution, i.e. \(\phi (x)=\frac{1}{\sqrt{2\pi }}e^{-\frac{x^2}{2}},\,\, -\infty<x<\infty \), then

$$\begin{aligned} \int \limits _{-\infty }^{\infty }\phi \Big (\frac{\alpha x-\beta }{\sigma _1}\Big ).\phi \Big (\frac{\gamma -\delta x}{\sigma _2}\Big ) dx= \phi \Big (\frac{\alpha \gamma - \beta \delta }{\sqrt{\alpha ^2\sigma _2^2+\delta ^2\sigma _1^2}}\Big ).\frac{1}{\sqrt{\frac{\alpha ^2}{\sigma _1^2}+\frac{\delta ^2}{\sigma _2^2}}} \end{aligned}$$
(5)

where \(\alpha \), \(\beta \), \(\gamma \), \(\delta \), \(\sigma _1^2\) and \(\sigma _2^2\) are constants.
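Lemma 1 can be checked numerically. The sketch below, which is only a sanity check and not part of the proof, compares a trapezoidal approximation of the left-hand side of Eq. (5) with the closed form on the right-hand side. The function names, the integration range, and the constants used are arbitrary choices for this example; `s1` and `s2` denote the standard deviations \(\sigma _1\) and \(\sigma _2\).

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def lemma_lhs(alpha, beta, gamma, delta, s1, s2, lo=-40.0, hi=40.0, steps=80_000):
    """Trapezoidal approximation of the integral on the left of Eq. (5)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * phi((alpha * x - beta) / s1) * phi((gamma - delta * x) / s2)
    return total * h


def lemma_rhs(alpha, beta, gamma, delta, s1, s2):
    """Closed form on the right of Eq. (5)."""
    scale = math.sqrt(alpha ** 2 * s2 ** 2 + delta ** 2 * s1 ** 2)
    norm = math.sqrt(alpha ** 2 / s1 ** 2 + delta ** 2 / s2 ** 2)
    return phi((alpha * gamma - beta * delta) / scale) / norm


if __name__ == "__main__":
    args = (1.5, 0.4, -0.7, 0.8, 1.2, 0.9)   # arbitrary constants
    print(lemma_lhs(*args), lemma_rhs(*args))
```

The two values should agree to many decimal places as long as the integration range covers the region where both densities are non-negligible.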

Moreover, while deriving the likelihood functions, we approximate the logistic function in Eq. (4) by cumulative distribution function of a normal variable56. This approximation relation is given by:

$$\begin{aligned} \sigma (\nu ) \approx \int _{-\infty }^{\nu } \frac{1}{\beta }\phi \left( \frac{z}{\beta }\right) dz \text { where }&\beta = \frac{\pi }{\sqrt{3}} \text { and } \phi (x) = \frac{1}{\sqrt{2\pi }} e^{-\frac{x^2}{2}} \end{aligned}$$
(6)
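To give a sense of how close this approximation is, the sketch below reads Eq. (6) as \(\sigma (\nu )\approx \Phi (\nu /\beta )\), where \(\Phi \) is the standard normal distribution function and \(\beta =\pi /\sqrt{3}\) is the standard deviation of the logistic distribution, and reports the largest absolute gap on a grid. The grid and the function names are assumptions made for this example.

```python
import math

BETA = math.pi / math.sqrt(3.0)   # standard deviation of the logistic distribution


def sigmoid(v):
    """Logistic function, written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    z = math.exp(v)
    return z / (1.0 + z)


def normal_cdf_scaled(v, beta=BETA):
    """Distribution function of N(0, beta^2) at v, i.e. Phi(v / beta)."""
    return 0.5 * (1.0 + math.erf(v / (beta * math.sqrt(2.0))))


def max_abs_gap(lo=-10.0, hi=10.0, steps=4000):
    """Largest |sigma(v) - Phi(v / beta)| over an evenly spaced grid."""
    gap = 0.0
    for k in range(steps + 1):
        v = lo + k * (hi - lo) / steps
        gap = max(gap, abs(sigmoid(v) - normal_cdf_scaled(v)))
    return gap


if __name__ == "__main__":
    print(max_abs_gap())
```

With this choice of \(\beta \) the two curves coincide at \(\nu =0\) and differ by a few hundredths at most, which is the price paid for the closed-form expressions in Results 1–3.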

Scheme 1: Only partial gene expression data are missing

Consider a situation where phenotype, genotype, covariates, and methylation data are available for all n individuals but, gene expression data are missing only for \(n_2\) individuals. This indicates that \(n_3=n_4=0\), and \(n_1=n-n_2\). Based on this missing observation scheme, we need to write the likelihood function using Lemma 1. But before that, we have introduced a few notations for the sake of lucidity.

\({\varvec{Z}}_i=(1, {\varvec{X}}_i, G_i, M_i, E_i)'\), \({\varvec{w}}=(\beta _0,\varvec{\beta }_x', \beta _g, \beta _m, \beta _e)'\), \({\varvec{Z}}_{i,o}=(1, {\varvec{X}}_i, G_i, M_i)'\), \({\varvec{Z}}_{i,m}=E_i\), \({\varvec{w}}_{o}=(\beta _0,\varvec{\beta }_x', \beta _g, \beta _m)'\). Now, to rewrite the likelihood function as in (4), we need a precise expression for \(P(y_i|{\varvec{Z}}_{i,o})\), as given in the following result. Note that Result 1 (for Proof see Appendix B, Supplementary Material) is related to only \(n_2\) individuals for whom gene expression data are not available.

Result 1

Using the model (1–3),

$$\begin{aligned} P(y_i\vert {\varvec{Z}}_{i,o})= \sigma \Big (\frac{y_i\beta (\beta _0+\varvec{\beta }_x'{\varvec{X}}_i+\beta _g G_i+\beta _m M_i + \beta _e \mu _0)}{\sqrt{\beta ^2+y_i^2\beta _e^2\sigma _2^2}}\Big ) \end{aligned}$$
(7)

where \(\mu _0=\gamma _0+\gamma _g G_i+\gamma _m M_i\), for each \(i\in S_{-E}\), the set of \(n_2\) individuals for whom gene expression data are not available.
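The closed form in Result 1 can be compared with a direct numerical integration of the logistic probability over the conditional distribution of the missing expression value, \(E_i \vert G_i, M_i \sim N(\mu _0,\sigma _2^2)\). The sketch below is a rough check under stated assumptions: `a` collects the part of the linear predictor that does not involve \(E_i\), `var2` stands for \(\sigma _2^2\), the closed form uses the denominator \(\sqrt{\beta ^2+\beta _e^2\sigma _2^2}\) (recall that \(y_i^2=1\)), and all numerical values are arbitrary.

```python
import math

BETA = math.pi / math.sqrt(3.0)


def sigmoid(v):
    """Logistic function, written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    z = math.exp(v)
    return z / (1.0 + z)


def result1_closed_form(y, a, beta_e, mu0, var2):
    """Approximate P(y | observed data) when only gene expression is missing."""
    return sigmoid(y * BETA * (a + beta_e * mu0) / math.sqrt(BETA ** 2 + beta_e ** 2 * var2))


def result1_numeric(y, a, beta_e, mu0, var2, steps=20_000, width=10.0):
    """Trapezoidal integral of sigma(y (a + beta_e E)) against N(E; mu0, var2)."""
    sd = math.sqrt(var2)
    lo, hi = mu0 - width * sd, mu0 + width * sd
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        e = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-(e - mu0) ** 2 / (2.0 * var2)) / math.sqrt(2.0 * math.pi * var2)
        total += weight * sigmoid(y * (a + beta_e * e)) * density
    return total * h


if __name__ == "__main__":
    # Arbitrary illustrative values: a = beta_0 + beta_x'X + beta_g G + beta_m M.
    print(result1_closed_form(1, 0.3, 0.8, 0.5, 1.5), result1_numeric(1, 0.3, 0.8, 0.5, 1.5))
```

The two numbers are not expected to match exactly, because the closed form inherits the error of the normal approximation to the logistic function; they should, however, differ only in the second decimal place.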

Now without any loss of generality, we have assumed that for the first \(n_1\) individuals all data are available whereas the last \(n_2\) individuals do not have gene expression data. Hence, using Result 1, the likelihood function (4) can be written under the scheme 1 as:

$$\begin{aligned}&L(\theta |{\varvec{y}}, {\varvec{E}}, {\varvec{M}}, {\varvec{G}}, X) \nonumber \\&\quad = \prod ^{n_1}_{i=1} \Big \{\sigma (y_i(\beta _0+\beta _e E_i+\beta _m M_i+\beta _g G_i+\varvec{\beta }^{\prime }_x {\varvec{X}}_i)) \frac{1}{\sigma _2 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_2} (E_i-\gamma _0-\gamma _g G_i-\gamma _m M_i)^2} \Big \} \nonumber \\&\qquad \times \prod ^n_{i=1}\Big \{{2\atopwithdelims (){G_i}} p^{G_i} (1-p)^{2-G_i}\Big \} \prod ^n_{i=1} \Big \{\frac{1}{\sigma _1 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_1} (M_i-\alpha _0-\alpha _g G_i)^2} \Big \} \nonumber \\&\qquad \times \prod _{i=n_1+1}^n \Big \{\sigma \Big (\frac{y_i \beta (\beta _0+\beta _g G_i+\varvec{\beta }'_x {\varvec{X}}_i+\beta _m M_i+\beta _e\mu _0)}{\sqrt{\beta ^2+\beta ^2_e\sigma _2^2}}\Big )\Big \} \end{aligned}$$
(8)

where \(n_2 = n - n_1\).

Scheme 2: Only partial methylation data are missing

We may have a situation where all types of data are available for \(n_1\) individuals and for another group of \(n_3\) individuals all types of data except methylation data are available. So here we have \(n_2=n_4=0\) and \(n_3 = n - n_1\). So, the terms involving \(M_i\) are not available for \(n_3\) individuals. Again as in the above scheme, we have now introduced a few notations as:

\({\varvec{Z}}_i=(1, {\varvec{X}}_i, G_i, M_i, E_i)'\), \({\varvec{w}}=(\beta _0,\varvec{\beta }_x', \beta _g, \beta _m, \beta _e)'\), \({\varvec{Z}}_{i,o}=(1, {\varvec{X}}_i, G_i, E_i)'\), \({\varvec{Z}}_{i,m}= M_i\), \({\varvec{w}}_{o}=(\beta _0,\varvec{\beta }_x', \beta _g, \beta _e)'\). To write down the likelihood function, we have first evaluated the expression for \(P(y_i \vert {\varvec{Z}}_{i,o})\) as given in Result 2 using Lemma 1. Note that Result 2 (for proof, see Appendix B, Supplementary Material) is related to only \(n_3\) individuals for whom methylation data are not available.

Result 2

Using the model (1-3), for each \(i\in S_{-M}\),

$$\begin{aligned} P(y_i \vert {\varvec{Z}}_{i,o}) = \sigma \left( \frac{y_i\beta (\beta _0+\varvec{\beta }'_x{\varvec{X}}_i +\beta _gG_i+\beta _eE_i+\beta _m\frac{(\alpha _0+\alpha _gG_i)\sigma ^2_2+\gamma _m (E_i-\gamma _0-\gamma _gG_i)\sigma ^2_1}{\sigma ^2_2+\gamma ^2_m\sigma ^2_1})}{\sqrt{\beta ^2+\beta ^2_m\frac{\sigma ^2_1\sigma ^2_2}{\sigma ^2_2+\gamma ^2_m\sigma ^2_1}}}\right) \end{aligned}$$
(9)

where \(S_{-M}\) is the set of \(n_3\) individuals for whom no methylation data are available.
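The fraction multiplying \(\beta _m\) in Eq. (9) and the fraction under the square root are the conditional mean and variance of the missing methylation value given \(E_i\) and \(G_i\) under Eqs. (2) and (3). The sketch below verifies this numerically by integrating the product of the two normal densities over a grid. The parameter values and function names are assumptions made for this example, and `s1`, `s2` denote the variances \(\sigma _1^2\) and \(\sigma _2^2\).

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_m(e, g, th):
    """Closed-form mean and variance of M given E = e and G = g, as used in Eq. (9)."""
    prior_mean = th["a0"] + th["ag"] * g
    denom = th["s2"] + th["gm"] ** 2 * th["s1"]
    mean = (prior_mean * th["s2"]
            + th["gm"] * (e - th["g0"] - th["gg"] * g) * th["s1"]) / denom
    var = th["s1"] * th["s2"] / denom
    return mean, var


def posterior_m_numeric(e, g, th, lo=-30.0, hi=30.0, steps=60_000):
    """Same two moments obtained by brute-force integration over m."""
    h = (hi - lo) / steps
    w0 = w1 = w2 = 0.0
    for k in range(steps + 1):
        m = lo + k * h
        joint = (normal_pdf(e, th["g0"] + th["gg"] * g + th["gm"] * m, th["s2"])
                 * normal_pdf(m, th["a0"] + th["ag"] * g, th["s1"]))
        w0 += joint
        w1 += joint * m
        w2 += joint * m * m
    mean = w1 / w0
    return mean, w2 / w0 - mean ** 2


# Hypothetical parameter values, chosen only for illustration.
TH = {"a0": 0.2, "ag": 0.6, "g0": -0.1, "gg": 0.5, "gm": 0.7, "s1": 1.3, "s2": 0.8}

if __name__ == "__main__":
    print(posterior_m(1.4, 1, TH), posterior_m_numeric(1.4, 1, TH))
```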

Now without any loss of generality, we have assumed that for the first \(n_1\) individuals all data are available whereas the last \(n_3\) individuals do not have methylation data. Hence, using Result 2, the likelihood function (4) can be written under the scheme 2 as:

$$\begin{aligned} L(\theta |{\varvec{y}}, {\varvec{E}}, {\varvec{M}}, {\varvec{G}}, X)&=\prod ^{n_1}_{i=1} \sigma (y_i(\beta _0+\beta _e E_i+\beta _m M_i+\beta _g G_i+\varvec{\beta }^{\prime }_x {\varvec{X}}_i))\nonumber \\&\quad \times \prod _{i=1}^{n_1} \Big \{\frac{1}{\sigma _2 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_2} (E_i-\gamma _0-\gamma _g G_i-\gamma _m M_i)^2} \frac{1}{\sigma _1 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_1} (M_i-\alpha _0-\alpha _g G_i)^2} \Big \}\nonumber \\&\quad \times \prod ^{n}_{i=n_1+1} \sigma \left( \frac{y_i\beta (\beta _0+\beta _e E_i+\beta _g G_i+\varvec{\beta }^{\prime }_x {\varvec{X}}_i+\beta _m \frac{(\alpha _0+\alpha _g G_i)\sigma _2^2+ \gamma _m (E_i-\gamma _0-\gamma _g G_i)\sigma _1^2}{\sigma _2^2+\sigma _1^2 \gamma _m^2})}{\sqrt{\beta ^2+\beta _m^2 \frac{\sigma _1^2\sigma _2^2}{\sigma _2^2+\sigma _1^2\gamma _m^2}}}\right) \nonumber \\&\quad \times \prod _{i=n_1+1}^{n} \frac{1}{\sqrt{2\pi (\sigma _2^2+\sigma _1^2 \gamma _m^2)}} e^{-\frac{1}{2(\sigma _2^2+\sigma ^2_1\gamma _m^2 )} (E_i-\gamma _0-\gamma _g G_i - \gamma _m (\alpha _0+\alpha _g G_i))^2} \nonumber \\&\quad \times \prod ^{n}_{i=1} \Big \{{2\atopwithdelims ()G_i}p^{G_i} (1-p)^{2-G_i}\Big \} \end{aligned}$$
(10)

where \(n_3 = n-n_1\).

Scheme 3: Methylation and gene expression data are partially missing

Lastly, under the most general missing value scheme, we have considered \(n_2\) individuals have only missing gene expression values, \(n_3\) individuals have only missing methylation values, \(n_4\) individuals have both missing gene expression and methylation values, and the rest of the individuals have all types of data. Similarly, as for other schemes, we have now introduced a few notations as:

\({\varvec{Z}}_i=(1, {\varvec{X}}_i, G_i, M_i, E_i)'\), \({\varvec{w}}=(\beta _0,\varvec{\beta }_x', \beta _g, \beta _m, \beta _e)'\), \({\varvec{Z}}_{i,o}=(1, {\varvec{X}}_i, G_i)'\), \({\varvec{Z}}_{i,m}=(E_i, M_i)'\), \({\varvec{w}}_{o}=(\beta _0,\varvec{\beta }_x', \beta _g)'\). Then, we have evaluated \(P(y_i\vert {\varvec{Z}}_{i,o})\) as given in Result 3 (for proof, see Appendix B, Supplementary Material) using Lemma 1 in order to write the joint likelihood equation.

Result 3

Under the model (1-3), for each \(i\in S_{-(E,M)}\),

$$\begin{aligned} P(y_i \vert {\varvec{Z}}_{i,o}) = \sigma \left( \frac{\beta y_i(\beta _0 + \beta _g G_i + \varvec{\beta }'_x{\varvec{X}}_i +\beta _e(\gamma _0 + \gamma _gG_i) +(\alpha _0+\alpha _gG_i)(\beta _e\gamma _m + \beta _m))}{\sqrt{\beta ^2 + y_i^2\beta _e^2 \sigma _2^2 + y_i^2(\beta _e\gamma _m + \beta _m)^2\sigma _1^2}}\right) \end{aligned}$$
(11)

where \(S_{-(E,M)}\) is the set of \(n_4\) individuals for whom both expression and methylation data are missing.

In order to write down the likelihood function, we have assumed without any loss of generality, that first \(n_1\) individuals have all data, next \(n_2\) individuals have all data except gene expression data, next \(n_3\) individuals have all data except methylation data and for the remaining \(n_4\) individuals neither gene expression data nor methylation data are available but phenotype, covariates, and genotype data are available for all n individuals. Clearly \(n_4 = n - n_1 - n_2 - n_3\).

Using Results 1-3, the likelihood function (4) under scheme 3 can be written as:

$$\begin{aligned} L(\theta \vert {\varvec{y}}, {\varvec{E}}, {\varvec{M}}, {\varvec{G}}, X)&= \prod ^{n_1}_{i=1} \sigma (y_i(\beta _0+\beta _e E_i+\beta _m M_i+\beta _g G_i+\varvec{\beta }^{\prime }_x {\varvec{X}}_i))\nonumber \\&\quad \times \prod _{i=1}^{n_1} \Big \{ \frac{1}{\sigma _2 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_2} (E_i-\gamma _0-\gamma _g G_i-\gamma _m M_i)^2} \frac{1}{\sigma _1 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_1} (M_i-\alpha _0-\alpha _g G_i)^2} \Big \} \nonumber \\&\quad \times \prod _{i=n_1+1}^{n_1+n_2} \Big \{\sigma \Big (\frac{y_i \beta (\beta _0+\beta _g G_i+\varvec{\beta }'_x {\varvec{X}}_i+\beta _m M_i+\beta _e\mu _0)}{\sqrt{\beta ^2+\beta ^2_e\sigma _2^2}}\Big )\frac{1}{\sigma _1 \sqrt{2\pi }} e^{-\frac{1}{2\sigma ^2_1} (M_i-\alpha _0-\alpha _g G_i)^2}\Big \}\nonumber \\&\quad \times \prod _{i=n_1+n_2+1}^{n_1+n_2+n_3} \Big \{ \sigma \Big (\frac{y_i\beta (\beta _0+\beta _e E_i+\beta _g G_i+\varvec{\beta }^{\prime }_x {\varvec{X}}_i+\beta _m \frac{(\alpha _0+\alpha _g G_i)\sigma _2^2+ \gamma _m (E_i-\gamma _0-\gamma _g G_i)\sigma _1^2}{\sigma _2^2+\sigma _1^2 \gamma _m^2})}{\sqrt{\beta ^2+\beta _m^2 \frac{\sigma _1^2\sigma _2^2}{\sigma _2^2+\sigma _1^2\gamma _m^2}}}\Big )\nonumber \\&\quad \times \frac{1}{\sqrt{2\pi (\sigma _2^2+\sigma _1^2 \gamma _m^2)}} e^{-\frac{1}{2(\sigma _2^2+\sigma ^2_1\gamma _m^2 )} (E_i-\gamma _0-\gamma _g G_i - \gamma _m (\alpha _0+\alpha _g G_i))^2} \Big \}\nonumber \\&\quad \times \prod _{i=n_{1}+n_{2}+n_{3}+1}^{n}\sigma \Big ({\frac{\beta y_i(\beta _0 + \beta _g G_i + \varvec{\beta }'_x{\varvec{X}}_i +\beta _e(\gamma _0 + \gamma _gG_i) +(\alpha _0+\alpha _gG_i)(\beta _e\gamma _m + \beta _m))}{\sqrt{\beta ^2 + y_i^2\beta _e^2 \sigma _2^2 + y_i^2(\beta _e\gamma _m + \beta _m)^2\sigma _1^2}}}\Big )\nonumber \\&\quad \times \prod ^{n}_{i=1} \Big \{{2\atopwithdelims ()G_i}p^{G_i} (1-p)^{2-G_i}\Big \} \end{aligned}$$
(12)

where \(\mu _0 = \gamma _0 + \gamma _g G_i + \gamma _m M_i\) as in Result 1, and \(n_4=n-n_1-n_2-n_3\).
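The structure of Eq. (12) can be summarised as one log-likelihood term per individual, chosen according to which omics are observed. The sketch below illustrates that case analysis under stated assumptions, and is not the TiMEG code: a missing value is encoded as `None`, there is a single covariate `x`, `s1` and `s2` are the variances \(\sigma _1^2\) and \(\sigma _2^2\), \(\mu _0=\gamma _0+\gamma _g G_i+\gamma _m M_i\), and the parameter values in `TH` are hypothetical.

```python
import math

BETA = math.pi / math.sqrt(3.0)


def log_sigmoid(z):
    """Numerically stable log of the logistic function."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def loglik_contribution(y, g, x, e, m, th):
    """Log-likelihood term of one individual; e and/or m may be None (missing)."""
    base = th["b0"] + th["bg"] * g + th["bx"] * x
    m_mean = th["a0"] + th["ag"] * g
    # Genotype term, present for every individual.
    out = (math.log(math.comb(2, g)) + g * math.log(th["p"])
           + (2 - g) * math.log(1.0 - th["p"]))
    if e is not None and m is not None:          # complete data
        out += log_sigmoid(y * (base + th["be"] * e + th["bm"] * m))
        out += normal_logpdf(e, th["g0"] + th["gg"] * g + th["gm"] * m, th["s2"])
        out += normal_logpdf(m, m_mean, th["s1"])
    elif e is None and m is not None:            # expression missing (Result 1)
        mu0 = th["g0"] + th["gg"] * g + th["gm"] * m
        num = base + th["bm"] * m + th["be"] * mu0
        den = math.sqrt(BETA ** 2 + th["be"] ** 2 * th["s2"])
        out += log_sigmoid(y * BETA * num / den)
        out += normal_logpdf(m, m_mean, th["s1"])
    elif m is None and e is not None:            # methylation missing (Result 2)
        marg_var = th["s2"] + th["gm"] ** 2 * th["s1"]
        post_mean = (m_mean * th["s2"]
                     + th["gm"] * (e - th["g0"] - th["gg"] * g) * th["s1"]) / marg_var
        post_var = th["s1"] * th["s2"] / marg_var
        num = base + th["be"] * e + th["bm"] * post_mean
        den = math.sqrt(BETA ** 2 + th["bm"] ** 2 * post_var)
        out += log_sigmoid(y * BETA * num / den)
        out += normal_logpdf(e, th["g0"] + th["gg"] * g + th["gm"] * m_mean, marg_var)
    else:                                        # both missing (Result 3)
        c = th["be"] * th["gm"] + th["bm"]
        num = base + th["be"] * (th["g0"] + th["gg"] * g) + m_mean * c
        den = math.sqrt(BETA ** 2 + th["be"] ** 2 * th["s2"] + c ** 2 * th["s1"])
        out += log_sigmoid(y * BETA * num / den)
    return out


# Hypothetical parameter values, chosen only for illustration.
TH = {"b0": -0.2, "be": 0.5, "bm": 0.3, "bg": 0.4, "bx": 0.1,
      "a0": 0.0, "ag": 0.6, "g0": 0.0, "gg": 0.5, "gm": 0.7,
      "s1": 1.0, "s2": 1.0, "p": 0.3}

if __name__ == "__main__":
    records = [(1, 1, 0.5, 0.9, 0.4), (-1, 0, -0.2, None, 0.1),
               (1, 2, 0.0, 1.3, None), (-1, 1, 0.7, None, None)]
    print(sum(loglik_contribution(y, g, x, e, m, TH) for y, g, x, e, m in records))
```

Summing these terms over all individuals gives the log of the observed-data likelihood, so Schemes 1 and 2 correspond to datasets in which only some of the four branches are ever taken.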

Next, for estimating the parameters in each of the likelihood functions, we used the L-BFGS-B method (in R package ‘stats’), an iterative algorithm for numerical optimisation, to find the maximum likelihood estimates of the parameters. Thus, theoretically, our method is able to incorporate any amount of missing gene expression and methylation data. In this paper, we focused on the identification of disease-associated trios (that is, a combination of the gene along with its cis-genotype and cis-methylation site). Thus, the components of a significant trio are expected to have a joint effect on affection status. Traditional single-omic analysis of each component is likely to miss these loci unless the sample size is very large and/or the technologies are substantially improved. More information from multiple omics on each individual, coupled with additional insights from the inter-relationship among the omics, supported the identification of significant loci even at smaller sample sizes compared to large single-omic analysis.

Hypothesis of interest

With the objective to identify a trio that may be associated with the disease or phenotype, we formulated the hypotheses of interest as:

$$\begin{aligned} H_0: \beta _e=\beta _m=\beta _g=0 \text { against } H_1: \text {not } H_0 \end{aligned}$$

Rejection of \(H_0\) would indicate association of one or more components of the trio with the disease. To test the null hypothesis, we adopted the likelihood ratio test under a very general likelihood structure covering the various schemes of missing data. The test statistic for testing \(H_0\) is

$$\begin{aligned} \Lambda = -2 \ln \frac{\sup \limits _{\theta \in \Theta _0}L(\theta \vert data)}{\sup \limits _{\theta \in \Theta _0\cup \Theta _1}L(\theta \vert data)} \end{aligned}$$
(13)

where \(\theta \) is the vector of parameters in the likelihood function L, \(\Theta _0\) and \(\Theta _1\) are the parametric spaces under \(H_0\) and \(H_1\) respectively. Using standard asymptotic theory, it can be easily shown that the test statistic \(\Lambda \) follows \(\chi ^2\) distribution with 3 degrees of freedom asymptotically under \(H_0\). Usually, the sample sizes are considerably large so that we can use the asymptotic distribution of \(\Lambda \) under \(H_0\) for a real dataset. This reduces a huge computational burden while calculating the p-value in order to come to a conclusion.
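Once the two maximised log-likelihoods are available, the asymptotic p-value follows directly from the \(\chi ^2_3\) distribution, whose upper-tail probability has the closed form \(P(\chi ^2_3>x)=\mathrm {erfc}(\sqrt{x/2})+\sqrt{2x/\pi }\,e^{-x/2}\). The sketch below illustrates this step only; the function names and the log-likelihood values in the usage line are hypothetical, and the maximisation itself is assumed to have been done elsewhere.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def lrt_pvalue(loglik_null, loglik_full):
    """Likelihood ratio statistic of Eq. (13) and its asymptotic p-value.

    loglik_null: maximised log-likelihood under H0 (beta_e = beta_m = beta_g = 0).
    loglik_full: maximised log-likelihood over the unrestricted parameter space.
    """
    statistic = max(0.0, 2.0 * (loglik_full - loglik_null))
    return statistic, chi2_sf_df3(statistic)


if __name__ == "__main__":
    # Hypothetical maximised log-likelihoods, for illustration only.
    print(lrt_pvalue(-105.0, -100.0))
```

The statistic is truncated at zero because a numerical optimiser can return a slightly smaller value for the unrestricted fit than for the restricted one.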

If interested, one may test the effect of any two components (duos) such as genotype and gene expression, by testing

$$H_0: \beta _e=\beta _g=0 \text { against } H_1: \text {not } H_0$$

to find whether any combination of a gene and a genotype is associated with the disease. Other alternative hypotheses may be framed as per the objective. But here we considered only the identification of significant trios.

dbGaP data on TSC

All the real data on TSC patients and healthy controls have been published previously by Martin et al.29 and deposited on dbGaP. We obtained publicly available real data from dbGaP (phs001357.v1.p1). All the data are available through a request for external collaboration and upon approval of a letter of intent and a research proposal. Details of how to request controlled-access data for external collaboration are available on the dbGaP website https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login. Required ethical consent was obtained from the patients and/or their legal guardians before the data collection by the appropriate authorities. For this work, we analysed raw BAM files for gene expression data, IDAT files for genotype, and methylation data from brain tissues only. Genotypes were generated using Illumina Infinium Omni2.5 SNP arrays, methylation using Illumina Infinium HumanMethylation450 (HM450) BeadArrays, and gene expression using mRNA sequencing (RNAseq) for patient and control samples. For cases and controls, we derived log-normalised counts for gene expression data, normalised beta values for methylation data, and genotype data using the Bioconductor packages ‘DESeq2’, ‘methylumi’, and ‘CRLMM’ respectively in R software. For each probe ID, we found its transcription start and end sites according to the human genome assembly 19 (hg19) from the UCSC genome browser using the Bioconductor package TxDb.Hsapiens.UCSC.hg19.knownGene58. To have the same gene nomenclature across all omics platforms, we converted probe IDs to gene names (or HGNC IDs) (using http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz) and annotated the SNPs using the Bioconductor package ‘humanomni258v1aCrlmm’ in R.