Statistics in Biosciences, Volume 5, Issue 2, pp 250–260

A Note on Penalized Regression Spline Estimation in the Secondary Analysis of Case-Control Data

  • Suzan Gazioglu
  • Jiawei Wei
  • Elizabeth M. Jennings
  • Raymond J. Carroll


Primary analysis of case-control studies focuses on the relationship between disease (D) and a set of covariates of interest (Y, X). A secondary application of the case-control study, often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated by the case-control sampling, and to avoid the bias that arises from the design, it is typical to use only the control data. In this paper, we develop penalized regression spline methodology that uses all the data and improves the precision of estimation compared with using only the controls. A simulation study and an empirical example illustrate the methodology.


Biased samples; B-splines; Homoscedastic regression; Nonparametric regression; Regression splines; Secondary data; Secondary phenotypes; Two-stage samples



The research of Jennings, Wei and Carroll was supported by a grant from the National Cancer Institute (R37-CA057030). This publication is based in part on work supported by Award Number KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST).



Copyright information

© International Chinese Statistical Association 2013

Authors and Affiliations

  • Suzan Gazioglu (1)
  • Jiawei Wei (2)
  • Elizabeth M. Jennings (3)
  • Raymond J. Carroll (3; corresponding author)

  1. Department of Mathematical Sciences, Montana Tech of the University of Montana, Butte, USA
  2. Beijing Novartis Pharma Co. Ltd., Shanghai, China
  3. Department of Statistics, Texas A&M University, College Station, USA
