An attribute-weighted isometric embedding method for categorical encoding on mixed data

Applied Intelligence

Abstract

Mixed data containing both categorical and numerical attributes are common in real-world applications. Before such data can be analysed, they typically must be transformed (embedded or represented) into high-quality numerical data. The conditional probability transformation (CPT) method performs acceptably in most cases, but it is unsatisfactory for datasets with strong attribute associations. Inspired by the one-dependence value difference metric, the idea of relaxing the attribute conditional independence assumption has been applied to CPT, but this approach has the drawback of dramatically expanding the attribute dimensionality. We employ an isometric embedding method to tackle this dimensionality expansion. In addition, we design an attribute weighting method based on must-link and cannot-link constraints to optimize the quality of the data transformation. Combining these methods, we propose an attribute-weighted isometric embedding (AWIE) method for categorical encoding on mixed data. Extensive experiments on 16 datasets demonstrate that, compared with 28 competitors, AWIE significantly improves classification performance: it increases the F1 score by 2.54%, achieves the best result in 6/16 cases, and reaches an average rank of 1.94/8.
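As a rough illustration of two ideas named in the abstract, the Python sketch below encodes one categorical attribute by class-conditional probabilities and then applies a distance-preserving embedding. It is a minimal sketch under stated assumptions, not the AWIE algorithm: the excerpt does not give the authors' formulas, so CPT is assumed here to map each category to the vector of P(class | value), classical multidimensional scaling stands in for the isometric embedding step, and the attribute weighting step is omitted. The function names cpt_encode and classical_mds and the toy data are hypothetical.

```python
import numpy as np


def cpt_encode(values, labels):
    """Assumed CPT: map each category to its vector of P(class | value)."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Keys are stored as plain strings, so categories are treated as strings.
    return {
        str(v): np.array([np.mean(labels[values == v] == c) for c in classes])
        for v in np.unique(values)
    }


def classical_mds(points, dim):
    """Embed points in `dim` dimensions, preserving pairwise Euclidean
    distances as well as a `dim`-dimensional space allows (classical MDS)."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dist @ centering  # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)        # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:dim]          # indices of the largest ones
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


# Toy data (hypothetical): one categorical attribute and a binary class label.
colours = ["red", "red", "blue", "blue", "green", "green"]
labels = [1, 1, 0, 1, 0, 0]

table = cpt_encode(colours, labels)
categories = sorted(table)
points = np.stack([table[c] for c in categories])  # one row per category
embedding = classical_mds(points, dim=1)           # compress to one dimension
```

In this toy case the three encoded categories happen to lie on a straight line, so a one-dimensional embedding keeps their pairwise distances unchanged; with real data the embedding dimension would have to be chosen so that distances are preserved well enough.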

Data Availability

The experimental data used in this paper are drawn from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

Notes

  1. http://contrib.scikit-learn.org/category_encoders/

Acknowledgements

The work was funded by the National Natural Science Foundation of China (grant no. 62166009), the Guizhou Provincial Natural Science Foundation of China (grant nos. ZK[2021]333 and ZK[2022]350), the Science and Technology Foundation of the Guizhou Provincial Health Commission (grant no. gzwkj2023-258), and the Ph.D. Research Startup Foundation of Guizhou Medical University (grant nos. 2020-051 and 2023-009).

Author information

Corresponding author

Correspondence to Qiude Li.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Tables 9, 10 and 11

Table 9 F1 score comparisons between our method and the first category of encoding methods
Table 10 F1 score comparisons between our method and the second category of encoding methods
Table 11 F1 score comparisons between our method and the third category of encoding methods

Figures 7, 8, 9, 10, 11 and 12

Fig. 7 Comparison between AWIE and the first category of encoding methods in terms of their F1 scores
Fig. 8 Comparison between AWIE and the second category of encoding methods in terms of their F1 scores
Fig. 9 Comparison between AWIE and the third category of encoding methods in terms of their F1 scores
Fig. 10 F1 score comparison between our method and the first category of methods according to the Nemenyi test
Fig. 11 F1 score comparison between our method and the second category of methods according to the Nemenyi test
Fig. 12 F1 score comparison between our method and the third category of methods according to the Nemenyi test
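For readers unfamiliar with the Nemenyi comparisons named in these captions, the following Python sketch shows how average ranks and a Nemenyi critical difference are commonly computed, following the usual description in Demšar (2006). It is an illustration under assumptions, not the authors' evaluation code: the function names and toy scores are hypothetical, the group size of 8 methods is inferred from the "1.94/8" figure in the abstract, and the critical value 3.031 (eight methods, significance level 0.05) is quoted from Demšar's table and should be checked against it before use.

```python
import numpy as np


def average_ranks(scores):
    """scores has shape (n_datasets, n_methods); higher is better.
    Returns the mean rank of each method (1 = best), averaging tied ranks."""
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty_like(scores)
    for i, row in enumerate(scores):
        # For each method: how many others score strictly higher, and how many tie.
        better = (row[None, :] > row[:, None]).sum(axis=1)
        ties = (row[None, :] == row[:, None]).sum(axis=1) - 1
        ranks[i] = 1 + better + ties / 2
    return ranks.mean(axis=0)


def nemenyi_cd(n_methods, n_datasets, q_alpha):
    """Critical difference: two methods differ significantly if their
    average ranks differ by at least this amount."""
    return q_alpha * np.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))


# Toy F1 scores (hypothetical): 2 datasets x 3 methods.
toy_scores = [[0.9, 0.8, 0.7],
              [0.6, 0.6, 0.5]]
toy_ranks = average_ranks(toy_scores)

# Assumed setting: 8 methods per comparison group, 16 datasets, alpha = 0.05.
cd = nemenyi_cd(n_methods=8, n_datasets=16, q_alpha=3.031)
```

Under this convention, a method's average rank is only meaningful relative to the critical difference: gaps smaller than `cd` are not considered statistically significant.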

Tables 12, 13 and 14

Table 12 Time cost comparisons between our method and the first category of encoding methods
Table 13 Time cost comparisons between our method and the second category of encoding methods
Table 14 Time cost comparisons between our method and the third category of encoding methods

Figures 13, 14 and 15

Fig. 13 Time cost comparison between our method and the first category of methods according to the Nemenyi test
Fig. 14 Time cost comparison between our method and the second category of methods according to the Nemenyi test
Fig. 15 Time cost comparison between our method and the third category of methods according to the Nemenyi test

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liang, Z., Ji, S., Li, Q. et al. An attribute-weighted isometric embedding method for categorical encoding on mixed data. Appl Intell 53, 26472–26496 (2023). https://doi.org/10.1007/s10489-023-04899-5

  • DOI: https://doi.org/10.1007/s10489-023-04899-5
