Multi-view heterogeneous fusion and embedding for categorical attributes on mixed data
Categorical attributes are ubiquitous in real-world data. However, such attributes lack a well-defined distance metric and cannot be manipulated directly with algebraic operations, so many data mining algorithms cannot work on them directly. Learning an appropriate metric or an effective numerical embedding is vital yet challenging for categorical attributes with multi-view heterogeneous characteristics. This paper proposes a novel multi-view heterogeneous fusion model (MVHF). MVHF first captures basic coupling information for each view and then fuses this heterogeneous information across views by multi-kernel metric learning to measure the intrinsic distances for such categorical attributes. Based on the measured distances, a manifold learning method is then used to learn a high-quality numerical embedding for each categorical value. Experiments on 33 mixed data sets demonstrate that MVHF-enabled classification performs significantly better than classification based on state-of-the-art distance metrics or embedding competitors.
Keywords: Categorical attributes, Coupling learning, Heterogeneous fusion, Metric learning, Embedding learning
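The idea of measuring distances between categorical values per view and then fusing the views can be illustrated with a deliberately simplified sketch. It is not the MVHF algorithm: the two views (value frequency and co-occurrence with a second attribute), all function names, the toy data, and the fixed fusion weights are assumptions made for illustration, and a plain weighted average stands in for the multi-kernel metric learning step. The manifold-learning embedding step is omitted.

```python
from collections import Counter, defaultdict


def frequency_view(column):
    """Hypothetical intra-attribute view: distance between two values is
    the absolute difference of their relative frequencies."""
    n = len(column)
    freq = {v: c / n for v, c in Counter(column).items()}
    values = sorted(freq)
    return {(a, b): abs(freq[a] - freq[b]) for a in values for b in values}


def co_occurrence_view(column, other):
    """Hypothetical inter-attribute view: total variation distance between
    the conditional distributions of `other` given each value of `column`."""
    counts = defaultdict(Counter)
    for v, o in zip(column, other):
        counts[v][o] += 1
    other_values = sorted(set(other))
    cond = {
        v: [counts[v][o] / sum(counts[v].values()) for o in other_values]
        for v in counts
    }
    values = sorted(cond)
    return {
        (a, b): 0.5 * sum(abs(p - q) for p, q in zip(cond[a], cond[b]))
        for a in values
        for b in values
    }


def fuse_views(views, weights):
    """Weighted average of per-view distances. The fixed weights are an
    assumption; the paper learns the fusion by multi-kernel metric learning."""
    total = sum(weights)
    return {
        pair: sum(w * view[pair] for w, view in zip(weights, views)) / total
        for pair in views[0]
    }


if __name__ == "__main__":
    # Toy data, invented for this example only.
    color = ["red", "red", "blue", "blue", "green", "green"]
    size = ["S", "S", "S", "L", "L", "L"]
    views = [frequency_view(color), co_occurrence_view(color, size)]
    fused = fuse_views(views, weights=[1.0, 1.0])
    for pair in sorted(fused):
        print(pair, round(fused[pair], 3))
```

In this toy data every color is equally frequent, so the frequency view cannot separate the values and the fused distance is driven entirely by the co-occurrence view. That is the kind of complementary information a multi-view fusion is meant to exploit.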
We thank the anonymous reviewers for their valuable comments and suggestions. The work was supported by the Key Research Program of Chongqing Science & Technology Commission (Grant Nos. CSTC2017jcyjBX0025 and CSTC2019jscx-zdztzx0043), the Science and Technology Major Special Project of Guangxi (Grant No. GKAA17129002), the National Natural Science Foundation of China (Grant No. 61771077), the National Key R&D Program of China (Grant No. 2018YFF0214706), and the Graduate Scientific Research and Innovation Foundation of Chongqing (Grant Nos. CYB19072 and CYS19028).
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.