Feature range analysis

Abstract

We propose a feature range analysis algorithm that aims to derive features that explain the response variable better than the original features do. For binary classification problems, and for regression problems where positive and negative samples can be defined (e.g., by thresholding the numeric response variable), a further aim is to derive features that explain, characterize, and isolate the positive samples, or subsets of positive samples that share the same root cause. Each derived feature represents a single- or multi-dimensional subspace of the feature space, where each dimension is specified as a feature-range pair for a numeric feature and as a feature-level pair for a categorical feature. We call these derived features range features. Unlike most rule learning and subgroup discovery algorithms, our algorithm accepts a numeric response variable and does not require discretizing it. The algorithm has been applied successfully to real-life root-causing tasks in chip design, manufacturing, and validation at Intel. Furthermore, we propose and experimentally evaluate a number of heuristics for using range features in building predictive models, demonstrating that prediction accuracy can be improved for the majority of the real-life proprietary and open-source datasets used in the evaluation.
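To make the notion of a range feature concrete, the following is a minimal Python sketch of how one might be represented and turned into a binary indicator column. It is an illustration only, not the authors' implementation: the `RangeFeature` class, its `covers` and `encode` methods, the closed-interval convention, and the sample data (`voltage`, `temp`, `fab`) are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Mapping, Tuple


@dataclass(frozen=True)
class RangeFeature:
    """Hypothetical container for one range feature (illustrative only)."""
    # Numeric dimensions: feature name -> (low, high); closed interval is an assumption.
    numeric: Dict[str, Tuple[float, float]] = field(default_factory=dict)
    # Categorical dimensions: feature name -> required level.
    categorical: Dict[str, Any] = field(default_factory=dict)

    def covers(self, sample: Mapping[str, Any]) -> bool:
        """True if the sample lies inside the subspace on every dimension."""
        in_ranges = all(lo <= sample[name] <= hi
                        for name, (lo, hi) in self.numeric.items())
        at_levels = all(sample[name] == level
                        for name, level in self.categorical.items())
        return in_ranges and at_levels

    def encode(self, samples: List[Mapping[str, Any]]) -> List[int]:
        """Binary indicator column: 1 where the sample is covered, else 0."""
        return [int(self.covers(s)) for s in samples]


# Made-up samples; 'fail' plays the role of a binary response.
samples = [
    {"voltage": 0.95, "temp": 85, "fab": "A", "fail": 1},
    {"voltage": 0.97, "temp": 90, "fab": "A", "fail": 1},
    {"voltage": 1.10, "temp": 88, "fab": "A", "fail": 0},
    {"voltage": 0.96, "temp": 40, "fab": "B", "fail": 0},
]

# A three-dimensional range feature: two feature-range pairs, one feature-level pair.
rf = RangeFeature(
    numeric={"voltage": (0.90, 1.00), "temp": (80, 100)},
    categorical={"fab": "A"},
)

indicator = rf.encode(samples)
print(indicator)
```

In this toy data the indicator column coincides with the `fail` column, which is the sense in which a range feature can isolate the positive samples. How candidate ranges are searched for and ranked is the subject of the paper and is not sketched here.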

Notes

  1.

    The following origins of the data are cited here following UCI citation request: (1) U.S. Department of Commerce, Bureau of the Census, Census Of Population And Housing 1990 United States: Summary Tape File 1a & 3a (Computer Files), (2) U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan (1992), (3) U.S. Department of Justice, Bureau of Justice Statistics, Law Enforcement Management And Administrative Statistics (Computer File) U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan (1992); and (4) U.S. Department of Justice, Federal Bureau of Investigation, Crime in the United States (Computer File) (1995).

Author information

Corresponding author

Correspondence to Zurab Khasidashvili.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Cite this article

Khasidashvili, Z., Norman, A.J. Feature range analysis. Int J Data Sci Anal 11, 195–219 (2021). https://doi.org/10.1007/s41060-021-00251-7

Keywords

  • Range analysis
  • Feature selection
  • Feature synthesis
  • Rule learning
  • Rule induction
  • Subgroup discovery
  • Predictive modeling