Abstract
Prostate cancer is one of the most prominent cancers affecting men worldwide. Detecting the disease at an early stage is crucial for effective treatment. Machine learning algorithms learn patterns from a given dataset and make predictions without being explicitly programmed. Applied to gene expression data, machine learning can discriminate cancer stage rather than relying on tissue histology and the various other diagnostic methods used in prostate cancer detection. In this study, the authors developed supervised classifiers for distinguishing early- and late-stage prostate cancer using RNA sequencing-based gene expression data collected from The Cancer Genome Atlas. Five supervised learning algorithms (Naive Bayes, stochastic gradient descent, J48, Random Forest, and Multilayer Perceptron) were trained on the 276 most informative features, a subset extracted from the gene expression data. The resulting models were evaluated by tenfold cross-validation. Among the trained classifiers, the stochastic gradient descent-based classifier performed best, with an accuracy of 86.91%, a sensitivity of 86.9%, and an area under the receiver operating characteristic curve of 0.656. Gene Ontology and KEGG pathway enrichment analyses of these 276 gene features were also performed to categorize the genes functionally.
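The evaluation terms used in the abstract (tenfold cross-validation, accuracy, sensitivity, and area under the ROC curve) can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the function names are hypothetical, the labels and scores are toy values rather than TCGA data, and coding late stage as the positive class (1) is an assumption made only for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred, positive=1):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)


def roc_auc(y_true, scores, positive=1):
    """AUC as the chance a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 1 = late stage, 0 = early stage.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.6]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]

print("accuracy:", accuracy(toy_true, toy_pred))
print("sensitivity:", sensitivity(toy_true, toy_pred))
print("AUC:", roc_auc(toy_true, toy_scores))
```

In a tenfold run, each fold from `stratified_folds` would serve once as the held-out test set while the other nine are used for training, and the metrics would be computed on the pooled held-out predictions. Accuracy and sensitivity depend on the chosen decision threshold, whereas the AUC depends only on how the scores rank positives against negatives, so the three numbers need not move together.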


Acknowledgements
The results published here are based upon data generated by the TCGA Research Network (http://cancergenome.nih.gov/). The authors are thankful to the Department of Biotechnology, New Delhi, India, for providing financial assistance through Bioinformatics National Certification (BINC) (File No. PU/BINC/2016/E-04). They thank Mr. Purshotam Das for providing technical support while carrying out this research work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Significance Statement
In this work, the authors used TCGA gene expression data and machine learning techniques to classify prostate cancer as early or late stage. From the TCGA gene expression data, the authors identified the most informative subset of gene features and used the expression of these features to classify prostate cancer stage. The authors show that machine learning-based prediction methods can be a substitute for histology-based cancer-stage determination.
About this article
Cite this article
Kumar, R., Bhanti, P., Marwal, A. et al. Gene Expression-Based Supervised Classification Models for Discriminating Early- and Late-Stage Prostate Cancer. Proc. Natl. Acad. Sci., India, Sect. B Biol. Sci. 90, 541–565 (2020). https://doi.org/10.1007/s40011-019-01127-4
DOI: https://doi.org/10.1007/s40011-019-01127-4