Continuous Variable Binning Algorithm to Maximize Information Value Using Genetic Algorithm
Binning (also called bucketing or discretization) is a commonly used data pre-processing technique for continuous predictor variables in machine learning. Some guidelines for good binning can be treated as constraints, while certain statistics should be optimized. We therefore view binning as a constrained optimization problem. This paper presents GAbin, a novel supervised binning algorithm for binary classification problems that is based on a genetic algorithm, and demonstrates its use on a well-known dataset. GAbin is inspired by the way a human analyst bins continuous variables. To bin a variable, the analyst first chooses an output shape (e.g., monotonic, or best bins in the middle), then defines constraints (e.g., a minimum number of samples in each bin), and finally tries to maximize key statistics that assess the quality of the output bins. The algorithm automates these steps, and its results follow the user-desired shape and satisfy the constraints. The experimental results reveal that GAbin provides competitive results when compared to other binning algorithms. Moreover, GAbin maximizes information value and can satisfy user-desired constraints such as monotonicity or output shape controls.
Keywords: Binning, Genetic algorithm, Data pre-processing, Information value, Constrained optimization
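To make the three steps in the abstract concrete, the following is a minimal sketch of how a candidate set of cut points could be scored. It is not the GAbin implementation. It uses the standard definition of information value, IV = Σ (g_i/G − b_i/B) · ln((g_i/G) / (b_i/B)), where g_i and b_i are the counts of non-events and events in bin i, and G and B are their totals. The function names, the `min_samples` parameter, the restriction to a monotonic shape, and the fixed penalty of −1 for infeasible candidates are all assumptions made for illustration; the abstract does not say how the paper handles constraints.

```python
import math
from bisect import bisect_right


def bin_counts(values, labels, cuts):
    """Count [non-events, events] per bin; bin i covers cuts[i-1] <= x < cuts[i]."""
    counts = [[0, 0] for _ in range(len(cuts) + 1)]
    for x, y in zip(values, labels):
        counts[bisect_right(cuts, x)][y] += 1
    return counts


def information_value(counts):
    """Standard IV; assumes every bin holds at least one sample of each class."""
    total_good = sum(g for g, _ in counts)
    total_bad = sum(b for _, b in counts)
    iv = 0.0
    for g, b in counts:
        p_good, p_bad = g / total_good, b / total_bad
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv


def is_monotonic(rates):
    """True if the per-bin event rates never decrease or never increase."""
    pairs = list(zip(rates, rates[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def fitness(values, labels, cuts, min_samples=2):
    """Score a candidate: IV if feasible, else a fixed penalty (an assumption)."""
    counts = bin_counts(values, labels, sorted(cuts))
    if any(g == 0 or b == 0 or g + b < min_samples for g, b in counts):
        return -1.0  # empty class or too few samples in some bin
    rates = [b / (g + b) for g, b in counts]
    if not is_monotonic(rates):
        return -1.0  # violates the assumed monotonic output shape
    return information_value(counts)


# Toy data (hypothetical): eight samples with binary labels, 1 = event.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(round(fitness(values, labels, [4.5]), 4))  # 1.0986, i.e. ln(3)
print(fitness(values, labels, [2.5]))            # -1.0: first bin has no events
```

A genetic algorithm would evolve the `cuts` vectors and use a function like `fitness` to rank them. Because information value is never negative, any negative penalty ranks infeasible candidates below every feasible one.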
This research was partially supported by Taskworld Inc.