Coupled K-Nearest Centroid Classification for Non-iid Data

  • Mu Li (Email author)
  • Jinjiu Li
  • Yuming Ou
  • Ya Zhang
  • Dan Luo
  • Maninder Bahtia
  • Longbing Cao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8670)


Most traditional classification methods assume that objects, attributes and values are independent and identically distributed (iid). However, real-world data, such as multi-agent data and behavioral data, usually contains strong couplings among values, attributes and objects, which greatly challenges existing methods and tools. This work targets coupling similarities from these three perspectives and designs a novel classification method that applies a weighted K-Nearest Centroid to obtain the coupled similarity for non-iid data. From the value and attribute perspectives, the coupled similarity serves as a metric for nominal objects that considers not only the intra-coupled similarity within an attribute but also the inter-coupled similarity between attributes. From the object perspective, we propose a more effective method that measures the centroid object by connecting all related objects. Extensive experiments on UCI and student data sets show that the proposed method outperforms classical methods, achieving higher accuracy especially on imbalanced data.
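The abstract's core idea, a similarity for nominal data that reflects how values co-occur, combined with a nearest-centroid rule, can be illustrated with a small sketch. The code below is not the authors' algorithm: it uses a common frequency-based form of intra-coupled value similarity from the coupled-similarity literature, omits the inter-coupled (between-attribute) term and the weighting scheme, and all class and function names (`intra_coupled_sim`, `CoupledNearestCentroid`) are illustrative assumptions.

```python
from collections import Counter

def intra_coupled_sim(freq, x, y):
    # Intra-coupled similarity between two values of one nominal attribute,
    # driven by their occurrence frequencies; a common form in the
    # coupled-similarity literature (the paper's exact definition may differ).
    fx, fy = freq[x], freq[y]
    return (fx * fy) / (fx + fy + fx * fy)

class CoupledNearestCentroid:
    """Toy nearest-centroid classifier for nominal data: each class centroid
    is the per-attribute modal value, and objects are compared to centroids
    with the frequency-based similarity above."""

    def fit(self, X, y):
        # Value frequencies per attribute, over the whole training set.
        self.freqs_ = [Counter(col) for col in zip(*X)]
        self.centroids_ = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            # Centroid = modal value of each attribute within the class.
            self.centroids_[label] = [Counter(col).most_common(1)[0][0]
                                      for col in zip(*rows)]
        return self

    def predict_one(self, x):
        # Similarity to a centroid: sum of per-attribute value similarities.
        def sim(centroid):
            return sum(intra_coupled_sim(f, a, b)
                       for f, a, b in zip(self.freqs_, x, centroid))
        return max(self.centroids_, key=lambda lab: sim(self.centroids_[lab]))

# Usage on a tiny nominal data set:
X = [("red", "s"), ("red", "s"), ("blue", "l"), ("blue", "l"), ("blue", "s")]
y = ["A", "A", "B", "B", "B"]
clf = CoupledNearestCentroid().fit(X, y)
print(clf.predict_one(("red", "s")))  # -> A
```

Note that, unlike simple overlap matching, the self-similarity of a value here depends on its frequency, so rare and frequent values contribute differently, which is one reason coupled measures can help on imbalanced nominal data.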


Keywords: Support Vector Machine · Classification Task · Educational Data Mining · Coupled Similarity · Coupled Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work is sponsored by the Australian Research Council Grants (DP1096218, DP0988016, LP0989721) and the Australian Research Council Linkage Grant (LP100200774).



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Mu Li (1, Email author)
  • Jinjiu Li (1)
  • Yuming Ou (1)
  • Ya Zhang (2)
  • Dan Luo (1)
  • Maninder Bahtia (3)
  • Longbing Cao (1)
  1. Advanced Analytics Institute, University of Technology Sydney, Ultimo, Australia
  2. Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, China
  3. Australian Tax Office, Sydney, Australia
