
Controlling the false discovery rate by a Latent Gaussian Copula Knockoff procedure

  • Original paper
  • Published in: Computational Statistics

Abstract

The penalized Lasso Cox proportional hazards model has been widely used to identify prognostic biomarkers in high-dimensional settings. However, this method tends to select many false positives, affecting its interpretability. To improve reproducibility, we develop a knockoff procedure that consists of wrapping the Lasso Cox model with the model-X knockoff, resulting in a powerful tool for variable selection that controls the false discovery rate with finite-sample guarantees. In this paper, we propose a novel approach to sample valid knockoffs for ordinal and continuous variables whose distributions can be skewed or heavy-tailed, which employs a Latent Mixed Gaussian Copula model to account for the dependence structure between the variables, leading to what we call the Latent Gaussian Copula Knockoff (LGCK) procedure. We then combine the LGCK method with the Lasso coefficient difference (LCD) statistic as the importance metric. To our knowledge, our proposal is the first knockoff framework to jointly consider ordinal and continuous data in a non-Gaussian setting and a survival context. We illustrate the proposed methodology’s effectiveness by applying it to a real lung cancer gene expression dataset.


Data availability statement

The real dataset used in this study is available at http://lce.biohpc.swmed.edu/.

Code availability

The code for reproducing the simulation experiments and the real data application is available at https://github.com/AlejandroRomanVasquez/LGCK-LCD.


Acknowledgements

Alejandro Román Vásquez acknowledges a grant from Consejo Nacional de Ciencia y Tecnología (CONACyT) Estancias Posdoctorales por México 2021 at Centro de Investigación en Matemáticas. Additionally, all the authors would like to thank Miguel Bedolla for revising the manuscript.

Author information

Corresponding author

Correspondence to Graciela González Farías.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


Appendices

A Computational implementations of the proposed methods

In the following, we describe specific aspects of the programming implementation of the LGCK procedure. We also clarify the estimation of the Lasso coefficient-difference (LCD) statistic and the computation of the data-dependent threshold.


Step 1: estimation of the latent correlation matrix. The estimation of the latent correlation matrix \({\varvec{\varSigma }}\) can be done with the latentcor R package (Huang et al. 2021), which handles mixed ordinal and continuous data under the Latent Mixed Gaussian copula model. The original computation scheme of Yoon et al. (2021) is selected by setting the argument method = "original". For the ordinal case, the estimation method is limited to binary and ternary data types; thus, an ordinal variable with 4 or more levels must be treated as continuous. The corresponding bridge functions F for the combinations of variables (binary-continuous, ternary-continuous, binary-binary, binary-ternary, and ternary-ternary), as well as the equations for the estimators of the cutoffs D for binary and ternary variables, can be found in the mathematical framework of the latentcor package.
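For continuous margins, the rank-based estimator behind this step reduces to the sine bridge \(\hat{\sigma }_{jk}=\sin (\frac{\pi }{2}\hat{\tau }_{jk})\) applied to pairwise Kendall's tau. The following is a minimal Python sketch of that special case (continuous-continuous pairs only; the binary and ternary bridge functions implemented in latentcor are not reproduced here):

```python
import numpy as np
from scipy.stats import kendalltau

def latent_corr_continuous(X):
    """Rank-based latent correlation under a Gaussian copula using the
    sine bridge sigma_jk = sin(pi/2 * tau_jk). Continuous margins only;
    binary/ternary pairs require the dedicated bridge functions F."""
    n, p = X.shape
    Sigma = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(X[:, j], X[:, k])  # invariant to monotone transforms
            Sigma[j, k] = Sigma[k, j] = np.sin(np.pi / 2 * tau)
    return Sigma
```

Because Kendall's tau is invariant under monotone marginal transformations, the estimate recovers the latent correlation even when the observed margins are skewed or heavy-tailed.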


Step 2: estimation of the precision matrix of the latent correlation matrix. The Python package GGLasso (Schaipp et al. 2021) is a helpful library for solving general graphical Lasso problems. Its main class, glasso_problem, performs the task at hand with a simplified procedure. Creating a glasso_problem object requires an empirical covariance matrix \({\varvec{S}}\) and the sample size n as arguments. The optimal value of the penalization parameter is determined with the method model_selection(), which implements a grid search based on the extended BIC criterion (Foygel and Drton 2010).
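A sketch of this step, using scikit-learn's graphical_lasso as a widely available stand-in for GGLasso, with a fixed penalty rather than the paper's eBIC grid search; the 3-variable matrix S is a hypothetical example standing in for the estimated latent correlation matrix:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

# Hypothetical estimated latent correlation matrix (stands in for Sigma-hat)
S = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

# Sparse precision matrix via the graphical Lasso; alpha fixed here,
# whereas the paper selects it by the extended BIC criterion.
covariance, precision = graphical_lasso(S, alpha=0.05)
```

In the paper's pipeline the same problem is solved by GGLasso's glasso_problem class, whose model_selection() method automates the penalty choice.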


Step 3: nonparametric transformation strategy to obtain marginal normality. The specific quantities involved in this step can be estimated with base R functions. Concretely, the function ecdf() from the stats package can be used to compute \({\hat{F}}_j\).
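The transform maps each variable to the normal scores \(\varPhi ^{-1}(\hat{F}_j(x))\). A Python sketch follows; the \(n/(n+1)\) rescaling is one common convention for keeping the ECDF strictly below 1 so the probit stays finite (the paper's exact truncation rule may differ):

```python
import numpy as np
from scipy.stats import norm, rankdata

def to_marginal_normal(x):
    """Nonparametric normal-score transform Phi^{-1}( n/(n+1) * F_hat(x) ).
    rankdata with method='max' evaluates the ECDF at the observed points,
    including ties; the n/(n+1) factor avoids an infinite probit at the max."""
    n = len(x)
    F_hat = rankdata(x, method='max') / n
    return norm.ppf(n / (n + 1) * F_hat)
```

The transform is monotone, so it preserves the ordering (and hence the rank dependence) of the original observations.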


Step 4: sampling Gaussian knockoffs using the MRC approach. The knockpy Python package can be employed to sample MVR knockoffs. This versatile library makes it easy to apply knockoff-based inference in only a few lines of code. A GaussianSampler object is created from the transformed vector \({\varvec{X}}^{\text {norm} }\) and the estimated latent correlation matrix \({\varvec{{\hat{\varSigma }}}}_\text {Lasso}\), with the argument method set to "mvr", in order to sample Gaussian minimum variance-based reconstructability knockoffs \(\tilde{{\varvec{X}}}^{\text {norm} }\).
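The conditional-Gaussian construction underlying this step can be sketched without knockpy. Below is a minimal numpy implementation using the simpler equicorrelated choice of the slack vector s instead of knockpy's MVR solver (the function name and the 0.99 shrinkage factor, which keeps the conditional covariance positive definite, are choices made for this illustration):

```python
import numpy as np

def gaussian_knockoffs_equi(X, Sigma, rng):
    """Sample model-X Gaussian knockoffs with the equicorrelated construction.
    Rows of X are assumed ~ N(0, Sigma). Given x, the knockoff is drawn from
    N(x - D Sigma^{-1} x, 2D - D Sigma^{-1} D) with D = diag(s)."""
    p = Sigma.shape[0]
    lam_min = np.linalg.eigvalsh(Sigma).min()
    s = np.full(p, 0.99 * min(1.0, 2.0 * lam_min))  # equicorrelated s, slightly shrunk
    D = np.diag(s)
    Sigma_inv = np.linalg.inv(Sigma)
    mean = X - X @ Sigma_inv @ D                 # E[X_knockoff | X], row-wise
    cond_cov = 2.0 * D - D @ Sigma_inv @ D       # Var[X_knockoff | X]
    L = np.linalg.cholesky(cond_cov)
    return mean + rng.standard_normal(X.shape) @ L.T
```

The MVR criterion used in the paper picks s to minimize the reconstructability of each feature from the others and its knockoff, which typically yields higher power than the equicorrelated choice shown here.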


Step 5: reversing the transformation to obtain the non-Gaussian knockoffs. The sample quantiles can be computed with the function quantile() from the stats package in R. The nine quantile types described in Hyndman and Fan (1996) are selected through the argument type, where the recommended median-unbiased estimator corresponds to type = 8. The transformation producing the binary and ternary knockoffs can be implemented with conditional statements.
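A Python sketch of the continuous case: the Gaussian knockoffs are pushed through the standard normal CDF and then through the empirical quantile function of the original variable. numpy's method='median_unbiased' corresponds to type = 8 in R's quantile() (Hyndman and Fan 1996); it requires numpy 1.22 or later:

```python
import numpy as np
from scipy.stats import norm

def reverse_transform(x_orig, z_knockoff):
    """Map Gaussian knockoffs back to the original marginal scale via
    x_knockoff = F_hat^{-1}( Phi(z_knockoff) ), with F_hat^{-1} the
    empirical quantile function (median-unbiased, R's type 8)."""
    u = norm.cdf(z_knockoff)                    # back to the uniform scale
    return np.quantile(x_orig, u, method='median_unbiased')
```

For binary and ternary variables, this quantile step is replaced by thresholding at the estimated latent cutoffs, implemented with conditional statements.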

LCD statistic and data-dependent threshold. The Lasso Cox model may be trained with a penalized Cox regression routine in R (e.g., glmnet). The Lasso coefficient-difference (LCD) statistic is then computed directly from the fitted coefficients. The data-dependent threshold can be calculated with the function data_dependent_threshhold() from the knockpy Python package, which completes the knockoff methodology.
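Both quantities are short computations once the Lasso Cox fit on the augmented design \([{\varvec{X}}, \tilde{{\varvec{X}}}]\) is available. The sketch below implements the LCD statistic \(W_j = |\hat{\beta }_j| - |\hat{\beta }_{j+p}|\) and the knockoff+ threshold; the numeric W values at the end are hypothetical, purely for illustration:

```python
import numpy as np

def lcd_statistic(beta):
    """LCD statistic W_j = |b_j| - |b_{j+p}| from the 2p coefficients of a
    Lasso Cox model fitted on the augmented design [X, X_knockoff]."""
    p = len(beta) // 2
    return np.abs(beta[:p]) - np.abs(beta[p:])

def data_dependent_threshold(W, q=0.1, offset=1):
    """Knockoff+ threshold: the smallest t among |W_j| > 0 such that
    (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_est = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_est <= q:
            return t
    return np.inf  # no threshold achieves the target: select nothing

# Hypothetical W values; large positive entries suggest genuine signals
W = np.array([3.0, 2.5, 2.0, 1.5, -0.5, 0.3, -0.2, 1.0])
tau = data_dependent_threshold(W, q=0.2)
selected = np.where(W >= tau)[0]
```

Variables with \(W_j \ge \tau\) are selected; the offset of 1 gives the knockoff+ variant with its finite-sample FDR control.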

B Additional simulation results: low dimensional case (\(p<n\))

In this section, we complement the simulation results by adding line plots for the empirical power and the FDR in a low-dimensional setting (\(p<n\)). The configurations considered include variations of the correlation coefficient of the autoregressive correlation (Fig. 6), the amplitude (Fig. 7), and the censoring rate (Fig. 8).

Fig. 6

The figures illustrate the empirical power and the FDR as a function of the autocorrelation coefficient \(\rho\) in the low dimensional case (\(p<n\)). The results corresponding to the LGCK-LCD procedure appear in orange, while the results for the Lasso Cox model are in blue. The parameter conditions are the same as in Fig. 2, except for the number of variables \(p=200\). Each point in the graphs represents the average value across 200 repetitions

Fig. 7

The graphs show the empirical power and the FDR as a function of the absolute value |a| of the coefficient’s amplitude in the low dimensional case (\(p<n\)). The results corresponding to the LGCK-LCD procedure appear in orange, while the results for the Lasso Cox model are in blue. The parameter conditions are the same as in Fig. 3, except for the number of variables \(p=200\). Each point in the graphs represents the average value across 200 repetitions

Fig. 8

The plots present the empirical power and the FDR as the censoring rate changes in the low dimensional case (\(p<n\)). The results corresponding to the LGCK-LCD procedure appear in orange, while the results for the Lasso Cox model are in blue. The parameter conditions are the same as in Fig. 4, except for the number of variables \(p=200\). Each point in the graphs represents the average value across 200 repetitions


About this article


Cite this article

Vásquez, A.R., Márquez Urbina, J.U., González Farías, G. et al. Controlling the false discovery rate by a Latent Gaussian Copula Knockoff procedure. Comput Stat 39, 1435–1458 (2024). https://doi.org/10.1007/s00180-023-01346-4

