Determining the familial risk distribution of colorectal cancer: a data mining approach
This study aimed to characterize the distribution of colorectal cancer risk using family history of cancers by data mining. Family histories for 10,066 colorectal cancer cases recruited to population cancer registries of the Colon Cancer Family Registry were analyzed using a data mining framework. A novel index was developed to quantify familial cancer aggregation. An artificial neural network was used to identify distinct categories of familial risk. Standardized incidence ratios (SIRs) and corresponding 95 % confidence intervals (CIs) of colorectal cancer were calculated for each category. We identified five major and 66 minor categories of familial risk for developing colorectal cancer. The distribution of the major risk categories was: (1) 7 % of families (SIR = 7.11; 95 % CI 6.65–7.59) had a strong family history of colorectal cancer; (2) 13 % of families (SIR = 2.94; 95 % CI 2.78–3.10) had a moderate family history of colorectal cancer; (3) 11 % of families (SIR = 1.23; 95 % CI 1.12–1.36) had a strong family history of breast cancer and a weak family history of colorectal cancer; (4) 9 % of families (SIR = 1.06; 95 % CI 0.96–1.18) had a strong family history of prostate cancer and a weak family history of colorectal cancer; and (5) 60 % of families (SIR = 0.61; 95 % CI 0.57–0.65) had a weak family history of all cancers. Colorectal cancer risk varies widely across categories of family history of cancer, with a strong gradient between the highest and lowest risk categories. The risk of colorectal cancer for people in the highest risk category of family history (7 % of the population) was 12 times that for people in the lowest risk category (60 % of the population). Data mining proved to be an effective approach for gaining insight into the underlying cancer aggregation patterns and for categorizing familial risk of colorectal cancer.
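As an illustration of the quantities reported above, the following is a minimal sketch of how an SIR and an approximate 95 % CI can be computed from observed and expected case counts. It uses Byar's approximation to the Poisson interval, which is one common choice and is not necessarily the method used in this study. The function name `sir_with_ci` and the counts passed to it are hypothetical; only the two SIR values in the final ratio come from the abstract.

```python
import math


def sir_with_ci(observed, expected, z=1.96):
    """Return (SIR, lower, upper) for observed vs. expected case counts.

    The interval uses Byar's approximation to the exact Poisson limits.
    This is an illustrative choice, not necessarily the study's method.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must be positive")
    sir = observed / expected
    # Lower limit is based on the observed count.
    lower = observed * (
        1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))
    ) ** 3 / expected
    # Upper limit is based on the observed count plus one.
    o1 = observed + 1
    upper = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3 / expected
    return sir, lower, upper


# Hypothetical counts, chosen only to demonstrate the calculation.
sir, lo, hi = sir_with_ci(observed=120, expected=40.0)
print(f"SIR = {sir:.2f}; 95% CI {lo:.2f}-{hi:.2f}")

# Ratio of the highest to the lowest major-category SIR from the abstract.
risk_ratio = 7.11 / 0.61
print(f"Highest vs. lowest category: {risk_ratio:.1f}-fold")
```

The final ratio, 7.11 / 0.61, is approximately 11.7, which matches the roughly 12-fold gradient stated in the abstract.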
Keywords: Data mining; Colorectal cancer; Familial risk; Familial aggregation
The authors thank all study participants of the Colon Cancer Family Registry and staff for their contributions to this project.
This work was supported by Grant UM1 CA167551 from the National Cancer Institute, National Institutes of Health (NIH) and through cooperative agreements with the following Colon Cancer Family Registry (CCFR) centers: Australasian Colorectal Cancer Family Registry (U01/U24 CA097735), Mayo Clinic Cooperative Family Registry for Colon Cancer Studies (U01/U24 CA074800), Ontario Familial Colorectal Cancer Registry (U01/U24 CA074783), Seattle Colorectal Cancer Family Registry (U01/U24 CA074794), and USC Consortium Colorectal Cancer Family Registry (U01/U24 CA074799). Seattle CCFR research was also supported by the Cancer Surveillance System of the Fred Hutchinson Cancer Research Center, which was funded by Contract Nos. N01-CN-67009 (1996–2003) and N01-PC-35142 (2003–2010) and Contract No. HHSN2612013000121 (2010–2017) from the Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute, with additional support from the Fred Hutchinson Cancer Research Center. The collection of cancer incidence data used in this study was supported by the California Department of Public Health as part of the statewide cancer reporting program mandated by California Health and Safety Code Section 103885; the National Cancer Institute’s Surveillance, Epidemiology and End Results Program under contract HHSN261201000035C awarded to the University of Southern California, and contract HHSN261201000034C awarded to the Public Health Institute; and the Centers for Disease Control and Prevention’s National Program of Cancer Registries, under agreement U58DP003862-01 awarded to the California Department of Public Health. The ideas and opinions expressed herein are those of the author(s), and endorsement by the State of California, Department of Public Health, the National Cancer Institute, and the Centers for Disease Control and Prevention or their Contractors and Subcontractors is not intended nor should be inferred.
This work was also supported by Centre for Research Excellence grant APP1042021 and Program grant APP1074383 from the National Health and Medical Research Council (NHMRC), Australia. AKW is an NHMRC Early Career Fellow. MAJ is an NHMRC Senior Research Fellow. JLH is an NHMRC Senior Principal Research Fellow. DDB is a University of Melbourne Research at Melbourne Accelerator Program (R@MAP) Senior Research Fellow.
Compliance with ethical standards
Conflict of interest
The authors have no conflict of interest to declare with respect to this manuscript.