Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios

  • Research Paper
Analytical and Bioanalytical Chemistry

Abstract

Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) followed by sequential forward selection (SFS). Implementing CR-FS requires two main inputs: a start number and a stop number. The start number is the number of highly ranked variables used for the SBE, while the stop number is the point at which the search for additional features during the SFS stage is halted. Setting these critical parameters has always relied on trial and error, which introduces subjectivity into the results obtained, and the appropriate start and stop numbers are known to vary with each dataset. Drawing inspiration from overlapping coefficients, a method for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from the comparison of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets, and the optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. Partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for both datasets when the start and stop numbers were calculated with this approach. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.

Here, we describe how to determine the start and stop numbers for an automated feature selection routine, ensuring that you get the best model you can for your data with minimal effort.
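
The two ingredients named in the abstract, Fisher ratios compared against a null distribution and the overlapping coefficient of two probability density functions, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the label-permutation null, the histogram-based overlap estimate, and every function name below are assumptions made for illustration. The empirical equations and the constant d are not reproduced here because they do not appear on this page.

```python
import numpy as np


def fisher_ratios(X, y):
    """Per-variable Fisher ratio: between-class over within-class variance."""
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return (between / (n_classes - 1)) / (within / (n_samples - n_classes))


def null_fisher_ratios(X, y, n_perm=50, seed=0):
    """Assumed null model: Fisher ratios recomputed with shuffled class labels."""
    rng = np.random.default_rng(seed)
    return np.concatenate(
        [fisher_ratios(X, rng.permutation(y)) for _ in range(n_perm)]
    )


def overlap_coefficient(a, b, bins=50):
    """Histogram estimate of the overlap, i.e. the integral of min(f, g)."""
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    return float(np.sum(np.minimum(f, g) * np.diff(edges)))


if __name__ == "__main__":
    # Synthetic two-class data (hypothetical): 10 informative variables out of 200.
    rng = np.random.default_rng(1)
    y = np.repeat([0, 1], 15)
    X = rng.normal(size=(30, 200))
    X[y == 1, :10] += 3.0
    observed = fisher_ratios(X, y)
    null = null_fisher_ratios(X, y)
    print("overlap of observed vs. null Fisher ratios:",
          overlap_coefficient(observed, null))
```

An overlap near 1 suggests the observed Fisher ratios are hard to distinguish from those obtained with shuffled labels, while a smaller overlap suggests that more variables carry class information. How such a comparison maps onto actual start and stop numbers is defined by the paper's empirical equations, which this sketch does not attempt to reproduce.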

Acknowledgements

The authors wish to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), Genome Canada and Genome Alberta, as well as Cystic Fibrosis Foundation Postdoctoral Fellowship (Bean12F0) and CF Isolate Core at Seattle Children’s Research Institute (NIH P30 DK089507) for funding this research. They also wish to thank Dr. Aiko Barsch (Bruker Daltonics) for the coffee data used in this study.

Author information

Corresponding author

Correspondence to James J. Harynuk.

Ethics declarations

Prior to any research being carried out involving human participants, all research protocols were approved by the relevant Human Research Ethics Board at the University of Alberta, including obtaining the informed consent of all participants in the wear trial that generated the fabric samples.

Conflict of interest

The authors declare that they have no conflicts of interest.

Electronic supplementary material

ESM 1

(PDF 365 kb).

Cite this article

Adutwum, L.A., de la Mata, A.P., Bean, H.D. et al. Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios. Anal Bioanal Chem 409, 6699–6708 (2017). https://doi.org/10.1007/s00216-017-0628-8

  • DOI: https://doi.org/10.1007/s00216-017-0628-8