Abstract
Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables by sequential backward elimination (SBE) followed by sequential forward selection (SFS). Implementing CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables passed to the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. These critical parameters vary with each dataset, and they have always been set by trial and error, which introduces subjectivity into the results. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, we developed empirical equations for estimating the start and stop numbers for a dataset. All of the parameters in the empirical equations are obtained from comparisons of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets, and the optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS on two new datasets demonstrated the validity of this approach. With start and stop numbers calculated in this way, the prediction accuracies of partial least squares discriminant analysis (PLS-DA) models increased from 90% and 96% to 100% for the two datasets. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.
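As an illustration of the overlapping coefficient mentioned above, the following is a minimal sketch that estimates the overlap of two empirical distributions as the integral of the pointwise minimum of their densities, using histograms on a shared grid. It is not the authors' empirical equations for the start and stop numbers, which are not given in the abstract; the function name `overlapping_coefficient`, the histogram estimator, and the `bins` parameter are assumptions made for this example.

```python
import numpy as np


def overlapping_coefficient(sample_a, sample_b, bins=50):
    """Estimate OVL = integral of min(f, g) from two 1-D samples.

    Hypothetical helper for illustration: both densities are approximated
    by histograms that share the same bin edges.
    """
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    if hi == lo:
        # All values are identical, so the two distributions coincide.
        return 1.0
    edges = np.linspace(lo, hi, bins + 1)
    # density=True scales each histogram so that it integrates to 1.
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    widths = np.diff(edges)
    # Sum the area under the pointwise minimum of the two densities.
    return float(np.sum(np.minimum(f, g) * widths))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for two distributions to be compared.
    first = rng.normal(loc=0.0, scale=1.0, size=5000)
    second = rng.normal(loc=1.0, scale=1.0, size=5000)
    print(overlapping_coefficient(first, second))
```

The estimate lies between 0 (no overlap) and 1 (identical distributions). It depends on the number of bins, so a kernel density estimate or a parametric fit may be preferable for small samples.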
Acknowledgements
The authors wish to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), Genome Canada, and Genome Alberta, as well as the Cystic Fibrosis Foundation Postdoctoral Fellowship (Bean12F0) and the CF Isolate Core at Seattle Children’s Research Institute (NIH P30 DK089507), for funding this research. They also wish to thank Dr. Aiko Barsch (Bruker Daltonics) for the coffee data used in this study.
Ethics declarations
Prior to any research being carried out involving human participants, all research protocols were approved by the relevant Human Research Ethics Board at the University of Alberta, including obtaining the informed consent of all participants in the wear trial that generated the fabric samples.
Conflict of interest
The authors declare that they have no conflicts of interest.
About this article
Cite this article
Adutwum, L.A., de la Mata, A.P., Bean, H.D. et al. Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios. Anal Bioanal Chem 409, 6699–6708 (2017). https://doi.org/10.1007/s00216-017-0628-8