Classification of Heterogeneous Data Based on Data Type Impact on Similarity

Ali, Najat; Neagu, Daniel; Trundle, Paul

doi:10.1007/978-3-319-97982-3_21

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 840)

Included in the following conference series:

UK Workshop on Computational Intelligence

Abstract

Real-world datasets are increasingly heterogeneous, containing a mixture of numerical, categorical and other feature types. The main challenge in mining such datasets is how to deal with the heterogeneity present in the data records. Although some existing classifiers (such as decision trees) can handle heterogeneous data in specific circumstances, their performance may still be improved, because heterogeneity calls for specific adjustments to similarity measurements and calculations. Moreover, heterogeneous data is still treated inconsistently and in an ad hoc manner. In this paper, we study the problem of heterogeneous data classification: our purpose is to turn heterogeneity into a positive feature of the classification effort by consistently using the similarity between data objects. We address the heterogeneity issue by studying the impact of mixing data types on the calculation of data objects’ similarity. To reach our goal, we propose an algorithm that divides the initial data records, based on pairwise similarity, into classification subtasks, with the aim of increasing the quality of the data subsets, and then applies specialized classifier models to them. The performance of the proposed approach is evaluated on 10 publicly available heterogeneous datasets. The results show that the models achieve better performance on heterogeneous datasets when using the proposed similarity process.
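
The abstract does not specify the similarity measure or the partitioning algorithm. As a minimal sketch of what a pairwise similarity over mixed numerical and categorical features can look like, the example below uses a Gower-style coefficient, a common choice for mixed data. This is an assumption made for illustration and not necessarily the measure used in the paper; the names `gower_similarity` and `numeric_ranges` are hypothetical.

```python
from typing import Optional, Sequence, Union

Value = Union[int, float, str]


def gower_similarity(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Sequence[Optional[float]],
) -> float:
    """Gower-style similarity in [0, 1] between two mixed-type records.

    numeric_ranges[i] holds the range (max - min) of feature i over the
    dataset when the feature is numeric, or None when it is categorical.
    (Illustrative assumption: the paper's own measure may differ.)
    """
    if not (len(a) == len(b) == len(numeric_ranges)):
        raise ValueError("records and numeric_ranges must have equal length")
    if len(a) == 0:
        raise ValueError("records must have at least one feature")

    total = 0.0
    for x, y, value_range in zip(a, b, numeric_ranges):
        if value_range is None:
            # Categorical feature: simple matching (1 if equal, else 0).
            total += 1.0 if x == y else 0.0
        elif value_range == 0:
            # Constant numeric feature: all records agree on it.
            total += 1.0
        else:
            # Numeric feature: range-normalised absolute difference.
            total += 1.0 - abs(x - y) / value_range
    # Unweighted mean of the per-feature scores.
    return total / len(a)


# Two records with features (age, colour, dose); ranges are assumed known.
record_a = (30, "red", 1.0)
record_b = (40, "red", 3.0)
ranges = (20, None, 4.0)
print(round(gower_similarity(record_a, record_b, ranges), 3))  # prints 0.667
```

Because each feature contributes a score on the same [0, 1] scale, numerical and categorical features can be mixed in one coefficient, which makes the effect of each data type on the overall similarity easy to inspect.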

Author information

Authors and Affiliations

Artificial Intelligence Research (AIRe) Group, Faculty of Engineering and Informatics, University of Bradford, Bradford, UK
Najat Ali, Daniel Neagu & Paul Trundle

Corresponding author

Correspondence to Najat Ali.

Editor information

Editors and Affiliations

School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
Ahmad Lotfi
Faculty of Science and Technology, Bournemouth University, Poole, Dorset, United Kingdom
Hamid Bouchachia
School of Computing, University of Portsmouth, Portsmouth, Hampshire, United Kingdom
Alexander Gegov
School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
Caroline Langensiepen
College of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
Martin McGinnity

About this paper

Cite this paper

Ali, N., Neagu, D., Trundle, P. (2019). Classification of Heterogeneous Data Based on Data Type Impact on Similarity. In: Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C., McGinnity, M. (eds) Advances in Computational Intelligence Systems. UKCI 2018. Advances in Intelligent Systems and Computing, vol 840. Springer, Cham. https://doi.org/10.1007/978-3-319-97982-3_21

DOI: https://doi.org/10.1007/978-3-319-97982-3_21
Published: 11 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97981-6
Online ISBN: 978-3-319-97982-3
eBook Packages: Intelligent Technologies and Robotics (R0)