Kernel-based data transformation model for nonlinear classification of symbolic data


Abstract

Symbolic data typically consist of categorical variables that represent discrete entities in many real-world applications. Mining symbolic data is more difficult than mining numerical data because symbolic data lack inherent geometric properties. In this paper, we use two kinds of kernel learning methods to build a kernel estimation model and a nonlinear classification algorithm for symbolic data. Using the kernel smoothing method, we first construct a squared-error consistent probability estimator for symbolic data and then propose a new data transformation model that embeds symbolic data into Euclidean space. Based on this model, the inner product and the distance measure between symbolic data objects are reformulated, allowing a new support vector machine (SVM), called SVM-S, to be defined for nonlinear classification of symbolic data using the Mercer kernel learning method. Experimental results show that, with the proposed model and measures, SVM becomes much more effective for symbolic data classification.
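For intuition, the following is a minimal sketch (in Python, with illustrative names; not the authors' implementation) of the smoothed category-probability estimator analyzed in Appendix A: the empirical frequency of each category is shrunk toward the uniform value \(1/|O_{d}|\) by a per-attribute bandwidth \(\lambda _{d}\).

```python
from collections import Counter

def smoothed_probabilities(values, domain, lam):
    """Kernel-smoothed category probabilities for one symbolic attribute.

    Shrinks each empirical frequency f(o) toward the uniform value 1/|O_d|:
        p_hat(o) = lam / |O_d| + (1 - lam) * f(o)
    (the estimator whose bias and variance are derived in Appendix A).
    """
    counts = Counter(values)
    n = len(values)
    return {o: lam / len(domain) + (1 - lam) * counts.get(o, 0) / n
            for o in domain}

# Example: a ternary attribute observed on 8 objects, bandwidth 0.2
print(smoothed_probabilities(list("AABABCAA"), domain=["A", "B", "C"], lam=0.2))
```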


Acknowledgements

X. Yan, L. Chen and G. Guo's work was supported by the National Natural Science Foundation of China under Grant Nos. U1805263 and 61976053. X. Yan's work was also supported by the National Natural Science Foundation of China under Grant No. 61772004 and by the Guiding Foundation of Fujian Province of China under Grant No. 2020H0011.

Author information

Corresponding author

Correspondence to Lifei Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of Theorem 1

Since \([I\left( \cdot \right) ]^{2} = I\left( \cdot \right) \) and \(\sum _{o \in O_{d}} p(o) = 1\), the expectation of \(\hat{p}\left( o_{dl} \mid \lambda _{d} \right) \) can be obtained from Eq. (4):

$$\begin{aligned} E\left( \hat{p}\left( o_{dl} \mid \lambda _{d} \right) \right)&= E\left[ \ell \left( X_{d},o_{dl},\lambda _{d} \right) \right] \\&= \sum _{o \in O_{d}}\left[ \frac{\lambda _{d}}{|O_{d}|} + \left( 1 - \lambda _{d} \right) I\left( o = o_{dl} \right) \right] p(o)\\&= \frac{\lambda _{d}}{|O_{d}|} + \left( 1 - \lambda _{d} \right) p\left( o_{dl} \right) \text { .} \end{aligned}$$

So, \(\text {Bias}\left( \hat{p}\left( o_{dl} \mid \lambda _{d} \right) \right) \) and \(\text {Var}\left( \hat{p}\left( o_{dl} \mid \lambda _{d} \right) \right) \) can be computed as:

$$\begin{aligned} \left[ \text {Bias}\left( \hat{p}\left( o_{dl} \mid \lambda _{d} \right) \right) \right] ^{2}&= \left[ \frac{\lambda _{d}}{|O_{d}|} - \lambda _{d}p\left( o_{dl} \right) \right] ^{2}\\&= \lambda _{d}^{2}\left[ |O_{d}|^{-1} - p\left( o_{dl} \right) \right] ^{2}, \end{aligned}$$

and

$$\begin{aligned} \text {Var}\left( \hat{p}\left( o_{dl} \mid \lambda _{d} \right) \right)&= \frac{1}{N}\text {Var}\left[ \ell \left( X_{d},o_{dl},\lambda _{d} \right) \right] \\&= \frac{1}{N}\left[ E\left( \ell ^{2}\left( X_{d},o_{dl},\lambda _{d} \right) \right) - \left( E\left( \ell \left( X_{d},o_{dl},\lambda _{d} \right) \right) \right) ^{2} \right] \\&= \frac{1}{N}\left\{ \sum _{o \in O_{d}}\left[ \frac{\lambda _{d}}{|O_{d}|} + \left( 1 - \lambda _{d} \right) I\left( o = o_{dl} \right) \right] ^{2} p(o) - \left[ \frac{\lambda _{d}}{|O_{d}|} + \left( 1 - \lambda _{d} \right) p\left( o_{dl} \right) \right] ^{2} \right\} \\&= \frac{1}{N}\left[ \left( 1 - \lambda _{d} \right) ^{2}p\left( o_{dl} \right) - \left( 1 - \lambda _{d} \right) ^{2}p^{2}\left( o_{dl} \right) \right] \\&= \frac{\left( 1 - \lambda _{d} \right) ^{2}}{N}\left[ p\left( o_{dl} \right) - p^{2}\left( o_{dl} \right) \right] . \end{aligned}$$

By combining the above two equalities, the theorem is proved.
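The bias and variance expressions above can also be checked empirically. The sketch below (with arbitrary illustrative values for \(p\), \(N\) and \(\lambda _{d}\); not taken from the paper) simulates \(\hat{p}\left( o_{dl} \mid \lambda _{d} \right) \) as the sample mean of \(\ell \left( X_{d},o_{dl},\lambda _{d} \right) = \lambda _{d}/|O_{d}| + \left( 1 - \lambda _{d} \right) I\left( X_{d} = o_{dl} \right) \) over many datasets of size \(N\) and compares the Monte Carlo squared bias and variance with the closed forms of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # true category probabilities (arbitrary example)
K, N, lam, trials = len(p), 50, 0.3, 200_000
target = 0                       # index of the category o_dl under study

# p_hat = sample mean of ell(X, o_dl, lam) = lam/K + (1 - lam) * I(X == o_dl)
samples = rng.choice(K, size=(trials, N), p=p)
p_hat = lam / K + (1 - lam) * (samples == target).mean(axis=1)

bias2_mc = (p_hat.mean() - p[target]) ** 2
var_mc = p_hat.var()

bias2_theory = lam**2 * (1 / K - p[target]) ** 2
var_theory = (1 - lam) ** 2 / N * (p[target] - p[target] ** 2)

print(bias2_mc, bias2_theory)   # should agree closely
print(var_mc, var_theory)
```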

B Proof of Theorem 2

For each \(o_{dl}\) in Eq. (6), we have that

$$\begin{aligned}&E\left[ \left( \left( 1 - \lambda _{d} \right) f\left( o_{dl} \right) + \frac{\lambda _{d}}{|O_{d}|} - p\left( o_{dl} \right) \right) ^{2} \right] \\&\quad = \left( 1 - \lambda _{d} \right) ^{2}E\left[ f^{2}\left( o_{dl} \right) \right] + 2\left[ \frac{\lambda _{d} - \lambda _{d}^{2}}{|O_{d}|} + \left( \lambda _{d} - 1\right) p\left( o_{dl}\right) \right] E\left[ f\left( o_{dl} \right) \right] \\&\qquad + \left[ p\left( o_{dl}\right) \right] ^{2} - \frac{2\lambda _{d}}{|O_{d}|}p\left( o_{dl}\right) + \frac{\lambda _{d}^{2}}{|O_{d}|^{2}}. \end{aligned}$$

Based on the facts that \(E\left[ f(o_{dl}) \right] = p\left( o_{dl}\right) \) and \([I(\cdot )]^{2} = I(\cdot )\), the above equality can be simplified to

$$\begin{aligned}&\left( 1 - \lambda _{d} \right) ^{2}\left( E\left[ f^{2}\left( o_{dl} \right) \right] - \left( E\left[ f\left( o_{dl} \right) \right] \right) ^{2} \right) + \left( 1 - \lambda _{d} \right) ^{2}p^{2}\left( o_{dl}\right) \\&\qquad + 2\left[ \frac{\lambda _{d} - \lambda _{d}^{2}}{|O_{d}|} + \left( \lambda _{d} - 1\right) p\left( o_{dl}\right) \right] p\left( o_{dl}\right) + p^{2}\left( o_{dl}\right) - \frac{2\lambda _{d}}{|O_{d}|}p\left( o_{dl}\right) + \frac{\lambda _{d}^{2}}{|O_{d}|^{2}}\\&\quad = \left( 1 - \lambda _{d} \right) ^{2}\frac{p\left( o_{dl}\right) \left( 1 - p\left( o_{dl}\right) \right) }{N} + \lambda _{d}^{2}\left[ p\left( o_{dl}\right) \right] ^{2} - \frac{2\lambda _{d}^{2}}{|O_{d}|}p\left( o_{dl}\right) + \frac{\lambda _{d}^{2}}{|O_{d}|^{2}}\\&\quad = \left[ \lambda _{d}^{2} - \frac{\left( 1 - \lambda _{d} \right) ^{2}}{N} \right] p^{2}\left( o_{dl}\right) + \left[ \frac{\left( 1 - \lambda _{d} \right) ^{2}}{N} - \frac{2\lambda _{d}^{2}}{|O_{d}|} \right] p\left( o_{dl}\right) + \frac{\lambda _{d}^{2}}{|O_{d}|^{2}}. \end{aligned}$$

Therefore, \(\mathcal {L}\left( \lambda _{d} \right) \) can be computed as

$$\begin{aligned} \mathcal {L}\left( \lambda _{d} \right)&= \left[ \lambda _{d}^{2} - \frac{\left( 1 - \lambda _{d} \right) ^{2}}{N} \right] \sum _{o_{dl} \in O_{d}}\left[ p\left( o_{dl}\right) \right] ^{2} + \frac{\left( 1 - \lambda _{d} \right) ^{2}}{N} - \frac{\lambda _{d}^{2}}{|O_{d}|}\\&= \left( 1 - \frac{1}{|O_{d}|} \right) \lambda _{d}^{2} + \left[ \frac{\left( 1 - \lambda _{d} \right) ^{2}}{N} - \lambda _{d}^{2} \right] \sigma _{d}^{2}\text { .} \end{aligned}$$

Setting \(\frac{\partial \mathcal {L}\left( \lambda _{d} \right) }{\partial \lambda _{d}} = 0\) yields the optimal estimate of \(\lambda _{d}\) given in Eq. (7).
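Eq. (7) itself is not reproduced in this excerpt, so as an illustrative check the sketch below (arbitrary example values) minimizes the displayed \(\mathcal {L}\left( \lambda _{d} \right) \) numerically on a grid and compares the result with the closed form obtained by setting the derivative to zero, \(\lambda _{d}^{*} = (1 - S)\,/\,[S(N-1) + 1 - N/|O_{d}|]\), where \(S = \sum _{o_{dl} \in O_{d}} p^{2}\left( o_{dl}\right) \) (so that \(\sigma _{d}^{2} = 1 - S\), which makes the two displayed forms of \(\mathcal {L}\) agree); this closed form should coincide with Eq. (7).

```python
import numpy as np

p = np.array([0.6, 0.25, 0.15])   # arbitrary example category probabilities
N, K = 40, len(p)
S = np.sum(p ** 2)                 # sum over o_dl of p(o_dl)^2, i.e. 1 - sigma_d^2

def loss(lam):
    # L(lambda_d) as displayed above (first form, written in terms of S)
    return (lam**2 - (1 - lam)**2 / N) * S + (1 - lam)**2 / N - lam**2 / K

# Closed form obtained by setting dL/d(lambda_d) = 0
lam_closed = (1 - S) / (S * (N - 1) + 1 - N / K)

# Numerical check on a fine grid over [0, 1]
grid = np.linspace(0.0, 1.0, 100_001)
lam_grid = grid[np.argmin(loss(grid))]

print(lam_closed, lam_grid)   # should agree up to the grid resolution (~1e-5)
```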

Cite this article

Yan, X., Chen, L. & Guo, G. Kernel-based data transformation model for nonlinear classification of symbolic data. Soft Comput 26, 1249–1259 (2022). https://doi.org/10.1007/s00500-021-06600-9
