
A Comparative Study of Feature Selection Methods for Stress Hotspot Classification in Materials

  • Ankita Mangal
  • Elizabeth A. Holm
Technical Article

Abstract

The first step in constructing a machine learning model is defining the features of the dataset that can be used for optimal learning. In this work, we discuss feature selection methods, which can be used to build better models as well as to achieve model interpretability. We apply these methods to the stress hotspot classification problem, to determine which microstructural characteristics cause stress to build up in certain grains during uniaxial tensile deformation. The results show how some feature selection techniques are biased, and demonstrate a preferred technique for obtaining feature rankings suitable for physical interpretation.
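
To make the comparison concrete, below is a minimal sketch (assuming Python with scikit-learn installed; the synthetic data and all variable names are illustrative, not taken from the paper) contrasting a random forest's impurity-based feature importances, which are known to be biased toward correlated and high-cardinality features, with permutation importance computed on held-out data, a less biased basis for ranking features for physical interpretation.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for grain-wise microstructural descriptors (X)
    # and binary hotspot labels (y)
    X, y = make_classification(n_samples=1000, n_features=10,
                               n_informative=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)

    # Impurity-based (Gini) importances: fast, but biased toward correlated
    # and high-cardinality features
    print("Impurity ranking:   ", np.argsort(rf.feature_importances_)[::-1])

    # Permutation importance on held-out data: slower, but a sounder basis
    # for physical interpretation of feature rankings
    perm = permutation_importance(rf, X_test, y_test, n_repeats=20,
                                  random_state=0)
    print("Permutation ranking:", np.argsort(perm.importances_mean)[::-1])

The two rankings often agree on strongly informative features but can diverge when features are correlated, which is the kind of bias the comparison in this work is designed to expose.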

Keywords

Stress hotspots · Machine learning · Random forests · Crystal plasticity · Titanium alloys · Feature selection

Notes

Acknowledgements

This work was performed at Carnegie Mellon University. The authors are grateful to the authors of the skfeature and sklearn Python libraries, who made their source code freely available. We would also like to thank the reviewers for their thorough work. Ricardo Lebensohn of Los Alamos National Laboratory is acknowledged for the use of the MASSIF code.

Funding Information

This work has been supported by the United States National Science Foundation under award numbers DMR-1307138 and DMR-1507830.


Copyright information

© The Minerals, Metals & Materials Society 2018
Corrected publication: August 2018

Authors and Affiliations

  1. Department of Materials Science and Engineering, Carnegie Mellon University, Pittsburgh, USA
