A Knowledge Based Approach for Tackling Mislabeled Multi-class Big Social Data

  • Minyi Guo
  • Yi Liu
  • Jie Li
  • Huakang Li
  • Bei Xu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8465)


The performance of classification models relies heavily on the quality of training data. However, label imperfection is an inherent flaw of training data, and it cannot be handled manually in a big data environment. Various methods have been proposed to remove label noise in order to improve classification quality, with the side effect of reducing the amount of data. In this paper, we propose a knowledge-based approach for tackling mislabeled multi-class big data, in which a knowledge graph technique is combined with other data correction methods to detect and correct erroneous labels in big data. The knowledge graph is built from medical concepts extracted from online health consulting and medical guidance. Experimental results show that our knowledge-graph-based approach can effectively improve data quality and classification accuracy. Furthermore, this approach can be applied to other data mining tasks that require deep understanding.
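The abstract does not spell out how the knowledge graph is used to correct labels, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes each record has already been reduced to a set of extracted concepts, that class labels appear as nodes in the graph, and that closeness is measured by hop count. The toy graph, the function names (`build_graph`, `hop_distance`, `class_score`, `correct_label`), the inverse-distance score, and the `margin` threshold are all hypothetical.

```python
from collections import deque

# Hypothetical toy knowledge graph: undirected edges between medical concepts
# and class nodes. All names are illustrative and do not come from the paper.
EDGES = [
    ("cough", "bronchitis"),
    ("fever", "bronchitis"),
    ("bronchitis", "respiratory"),
    ("rash", "eczema"),
    ("itching", "eczema"),
    ("eczema", "dermatology"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hop_distance(graph, source, target):
    """Breadth-first search hop count; returns None if no path exists."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def class_score(graph, concepts, label):
    """Mean inverse distance from a record's concepts to a class node."""
    if not concepts:
        return 0.0
    total = 0.0
    for concept in concepts:
        dist = hop_distance(graph, concept, label)
        if dist is not None:
            total += 1.0 / (1 + dist)
    return total / len(concepts)


def correct_label(graph, concepts, given_label, classes, margin=0.1):
    """Keep the given label unless another class scores higher by `margin`."""
    scores = {c: class_score(graph, concepts, c) for c in classes}
    best = max(scores, key=scores.get)
    if best != given_label and scores[best] - scores.get(given_label, 0.0) > margin:
        return best
    return given_label


GRAPH = build_graph(EDGES)
CLASSES = ["respiratory", "dermatology"]
```

For example, `correct_label(GRAPH, ["cough", "fever"], "dermatology", CLASSES)` relabels the record, because its concepts connect only to the `respiratory` node. A record whose concepts are missing from the graph keeps its given label, which reflects the conservative choice of correcting only when the graph offers evidence.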


#eswc2014Guo, label imperfection, knowledge graph, label correction, classification





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Minyi Guo (1)
  • Yi Liu (1)
  • Jie Li (1)
  • Huakang Li (2)
  • Bei Xu (2)

  1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
  2. School of Computer Science & School of Software, Nanjing University of Posts and Telecommunications, China
