The article considers various methods for classifying a set of objects into two classes when all the attributes are categorical (nominal or factor attributes), i.e., describe the membership of an object in a category. Some methods are simple generalizations of classical methods (Bayesian algorithms, singular value decomposition methods); others are fundamentally novel. An efficient technique is proposed for encoding categorical attributes by real numbers, which makes it possible to apply classical machine-learning methods (e.g., the random forest). A generalization of the k nearest neighbors (kNN) algorithm and Zhuravlev's estimate calculation algorithm (AEC) achieve the best performance on real-life data. All methods have been tested on an applied problem: the construction of a recommender system for a security service.
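The abstract does not specify the proposed encoding; one common way to map categorical attributes to real numbers so that methods such as the random forest become applicable is smoothed target encoding, where each category is replaced by an estimate of the positive-class rate among objects in that category. The sketch below illustrates this idea only; the function `target_encode` and its `smoothing` parameter are assumptions for illustration, not the article's own construction.

```python
from collections import defaultdict

def target_encode(column, labels, smoothing=10.0):
    """Replace each category with a smoothed estimate of the
    positive-class rate among objects in that category.

    Smoothing pulls rare categories toward the global positive
    rate, so a category seen once does not get an extreme value.
    """
    counts = defaultdict(int)     # objects per category
    positives = defaultdict(int)  # positive objects per category
    for value, label in zip(column, labels):
        counts[value] += 1
        positives[value] += label
    prior = sum(labels) / len(labels)  # global positive rate
    return [
        (positives[v] + smoothing * prior) / (counts[v] + smoothing)
        for v in column
    ]

# Example: one categorical attribute and binary class labels.
col = ["a", "a", "b", "b", "b", "c"]
y = [1, 1, 0, 1, 0, 0]
encoded = target_encode(col, y)
```

The resulting real-valued column can be fed directly to a classical learner; in practice the encoding should be fit on training folds only to avoid target leakage.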
Translated from Prikladnaya Matematika i Informatika, No. 46, 2014, pp. 103–127.
D’yakonov, A.G. Solution Methods for Classification Problems with Categorical Attributes. Comput Math Model 26, 408–428 (2015). https://doi.org/10.1007/s10598-015-9281-2