Feature range analysis

Khasidashvili, Zurab; Norman, Adam J.

doi:10.1007/s41060-021-00251-7

Abstract

We propose a feature range analysis algorithm whose aim is to derive features that explain the response variable better than the original features. Moreover, for binary classification problems, and for regression problems where positive and negative samples can be defined (e.g., using a threshold value of the numeric response variable), our aim is to derive features that explain, characterize, and isolate the positive samples or subsets of positive samples that have the same root cause. Each derived feature represents a one- or multi-dimensional subspace of the feature space, where each dimension is specified as a feature-range pair for numeric features and as a feature-level pair for categorical features. We call these derived features range features. Unlike most rule learning and subgroup discovery algorithms, our algorithm accepts a numeric response variable and does not require discretizing it. The algorithm has been applied successfully to real-life root-causing tasks in chip design, manufacturing, and validation at Intel. Furthermore, we propose and experimentally evaluate several heuristics for using range features in building predictive models, demonstrating that prediction accuracy can be improved for the majority of the real-life proprietary and open-source datasets used in the evaluation.
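To make the notion of a range feature concrete, the following is a minimal sketch of how one such feature could be represented and evaluated once its ranges are known. It is not the authors' algorithm, which searches for the ranges. The class name RangeFeature, the feature names voltage and lot, the response values, and the threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple, Union

# One condition per dimension of the subspace:
#   numeric feature     -> closed interval (low, high)   ("feature-range pair")
#   categorical feature -> set of admissible levels      ("feature-level pair")
Condition = Union[Tuple[float, float], frozenset]


@dataclass
class RangeFeature:
    """Hypothetical container for a range feature (illustration only)."""
    conditions: Dict[str, Condition]

    def covers(self, sample: Dict[str, Any]) -> bool:
        """Return True if the sample lies inside the subspace."""
        for name, cond in self.conditions.items():
            value = sample[name]
            if isinstance(cond, frozenset):
                if value not in cond:
                    return False
            else:
                low, high = cond
                if not (low <= value <= high):
                    return False
        return True

    def indicator(self, samples: List[Dict[str, Any]]) -> List[int]:
        """Binary column that could be appended to the original features."""
        return [int(self.covers(s)) for s in samples]


# Made-up toy data with a numeric response.
samples = [
    {"voltage": 0.95, "lot": "A", "response": 7.2},
    {"voltage": 1.02, "lot": "A", "response": 6.8},
    {"voltage": 1.10, "lot": "A", "response": 3.1},
    {"voltage": 0.98, "lot": "B", "response": 2.9},
    {"voltage": 1.20, "lot": "B", "response": 3.5},
]

# Assumed threshold, used here only to label positives for evaluation.
THRESHOLD = 5.0
labels = [int(s["response"] > THRESHOLD) for s in samples]

# A two-dimensional range feature: one numeric range and one categorical level.
rf = RangeFeature({"voltage": (0.90, 1.05), "lot": frozenset({"A"})})
column = rf.indicator(samples)

# How well does the subspace isolate the positive samples?
covered = sum(column)
covered_pos = sum(c and y for c, y in zip(column, labels))
precision = covered_pos / covered if covered else 0.0
recall = covered_pos / sum(labels) if sum(labels) else 0.0
print(column, precision, recall)
```

In this toy setting the indicator column marks exactly the samples whose response exceeds the assumed threshold, which is the sense in which a range feature is meant to isolate positive samples. The same column could also be added as an extra input to a predictive model.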

Notes

The following origins of the data are cited here, following the UCI citation request: (1) U.S. Department of Commerce, Bureau of the Census, Census Of Population And Housing 1990 United States: Summary Tape File 1a & 3a (Computer Files), (2) U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan (1992), (3) U.S. Department of Justice, Bureau of Justice Statistics, Law Enforcement Management And Administrative Statistics (Computer File) U.S. Department Of Commerce, Bureau Of The Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research Ann Arbor, Michigan (1992); and (4) U.S. Department of Justice, Federal Bureau of Investigation, Crime in the United States (Computer File) (1995).

Author information

Authors and Affiliations

Intel Corporation, Haifa, Israel
Zurab Khasidashvili
Intel Corporation, Portland, OR, USA
Adam J. Norman

Corresponding author

Correspondence to Zurab Khasidashvili.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khasidashvili, Z., Norman, A.J. Feature range analysis. Int J Data Sci Anal 11, 195–219 (2021). https://doi.org/10.1007/s41060-021-00251-7

Received: 24 May 2020
Accepted: 15 February 2021
Published: 24 March 2021
Issue Date: April 2021
DOI: https://doi.org/10.1007/s41060-021-00251-7
