Towards Increasing Density of Relations in Category Graphs

  • Karol Draszawka
  • Julian Szymański
  • Henryk Krawczyk
Part of the Studies in Computational Intelligence book series (SCI, volume 541)


In this chapter we propose methods for identifying new associations between Wikipedia categories. The first method is based on the Bag-of-Words (BOW) representation of Wikipedia articles: the similarity between articles belonging to different categories is used to compute the similarity between the categories themselves. The second method is based on the average scores assigned to categories when documents are categorized by our dedicated score-based classifier. Applying these methods yields weighted category graphs that allow the original relations between Wikipedia categories to be extended. We also propose a method for selecting the weight value used to cut off less important relations. A preliminary examination of the quality of the new relations supports our procedure.
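The BOW-based idea can be illustrated with a minimal sketch. It is not the chapter's implementation: the abstract does not state how article similarities are aggregated, so the sketch assumes that each category is represented by the centroid of its articles' term-count vectors, that categories are compared with cosine similarity, and that the cut-off weight is a fixed, hand-picked value. The function names (`bow`, `centroid`, `cosine`, `category_graph`) and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bow(text):
    """Bag-of-words vector: lowercase whitespace token -> count."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a list of sparse BOW vectors into one category vector."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def category_graph(articles_by_category, cutoff):
    """Return weighted edges {(cat_a, cat_b): weight} with weight >= cutoff."""
    # Assumption: a category is summarized by the centroid of its articles.
    centroids = {
        cat: centroid([bow(text) for text in texts])
        for cat, texts in articles_by_category.items()
    }
    edges = {}
    for a, b in combinations(sorted(centroids), 2):
        weight = cosine(centroids[a], centroids[b])
        if weight >= cutoff:  # drop less important relations
            edges[(a, b)] = weight
    return edges


# Toy data, invented for illustration only.
toy = {
    "Graph theory": ["graph nodes edges paths", "graph edges weights"],
    "Network science": ["network nodes edges links", "graph network paths"],
    "Cooking": ["recipe flour oven bread"],
}
edges = category_graph(toy, cutoff=0.3)
```

With this toy input, only the pair of categories that share vocabulary survives the cut-off; raising or lowering `cutoff` trades the density of the resulting graph against the strength of the retained relations.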


Keywords: Associations mining, Information retrieval, Text documents categorization



This work has been supported by the National Centre for Research and Development (NCBiR) under research Grant No. SP/I/1/77065/1 SYNAT: “Establishment of the universal, open, hosting and communication, repository platform for network resources of knowledge to be used by science, education and open knowledge society”.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Karol Draszawka (1)
  • Julian Szymański (1)
  • Henryk Krawczyk (1)
  1. Department of Computer Systems Architecture, Gdańsk University of Technology, Gdańsk, Poland
