Journal of Intelligent Information Systems

Volume 50, Issue 1, pp 29–61

HINMINE: heterogeneous information network mining with information retrieval heuristics

  • Jan Kralj
  • Marko Robnik-Šikonja
  • Nada Lavrač
Article

Abstract

The paper presents an approach to mining heterogeneous information networks by decomposing them into homogeneous networks. The proposed HINMINE methodology is based on previous work that classifies nodes in a heterogeneous network in two steps. In the first step the heterogeneous network is decomposed into one or more homogeneous networks using different connecting nodes. We improve this step by using new methods inspired by weighting of bag-of-words vectors mostly used in information retrieval. The methods assign larger weights to nodes which are more informative and characteristic for a specific class of nodes. In the second step, the resulting homogeneous networks are used to classify data either by network propositionalization or label propagation. We propose an adaptation of the label propagation algorithm to handle imbalanced data and test several classification algorithms in propositionalization. The new methodology is tested on three data sets with different properties. For each data set, we perform a series of experiments and compare different heuristics used in the first step of the methodology. We also use different classifiers which can be used in the second step of the methodology when performing network propositionalization. Our results show that HINMINE, using different network decomposition methods, can significantly improve the performance of the resulting classifiers, and also that using a modified label propagation algorithm is beneficial when the data set is imbalanced.
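
The two-step pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation, and the abstract does not give the exact formulas. The function names `decompose` and `propagate`, the idf-like connector weight log(N / degree), the damping value `alpha=0.8`, and the rescaling of initial labels by class size are all assumptions made for the example. The propagation step uses the standard iteration F ← αSF + (1 − α)Y with S = D^(-1/2) W D^(-1/2).

```python
import math
from collections import defaultdict
from itertools import combinations


def decompose(links, weighting="idf"):
    """Project base nodes onto a homogeneous weighted network.

    links maps each intermediate node (e.g. an author) to the set of base
    nodes (e.g. papers) it connects. Returns {(u, v): weight} with u < v.
    """
    n_base = len(set().union(*links.values()))
    edges = defaultdict(float)
    for neighbours in links.values():
        # Assumed idf-like heuristic: rarer connectors are more informative.
        # weighting="count" gives every connector weight 1 (plain co-occurrence).
        w = math.log(n_base / len(neighbours)) if weighting == "idf" else 1.0
        for u, v in combinations(sorted(neighbours), 2):
            edges[(u, v)] += w
    return dict(edges)


def propagate(edges, nodes, labels, classes, alpha=0.8, iters=100, balance=True):
    """Label propagation on the decomposed network.

    labels maps some nodes to a class; the rest are predicted.
    balance=True divides each initial label by its class size (an assumed
    form of the imbalance correction), so large classes do not dominate.
    """
    idx = {v: i for i, v in enumerate(nodes)}
    n, k = len(nodes), len(classes)
    W = [[0.0] * n for _ in range(n)]
    for (u, v), w in edges.items():
        W[idx[u]][idx[v]] = W[idx[v]][idx[u]] = w
    deg = [sum(row) for row in W]
    # Symmetric normalisation S = D^(-1/2) W D^(-1/2); zero entries stay zero.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if W[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    counts = {c: sum(1 for lab in labels.values() if lab == c) for c in classes}
    Y = [[0.0] * k for _ in range(n)]
    for v, c in labels.items():
        Y[idx[v]][classes.index(c)] = 1.0 / counts[c] if balance else 1.0
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [[alpha * sum(S[i][m] * F[m][c] for m in range(n))
              + (1 - alpha) * Y[i][c] for c in range(k)] for i in range(n)]
    # Each node takes the class with the highest propagated score.
    return {v: classes[max(range(k), key=lambda c: F[idx[v]][c])] for v in nodes}


# Toy data (hypothetical): four papers p1..p4 linked through four authors.
links = {
    "a1": {"p1", "p2"},
    "a2": {"p2", "p3"},
    "a3": {"p1", "p2", "p3", "p4"},  # connects everything, so idf weight is 0
    "a4": {"p3", "p4"},
}
edges = decompose(links)
predictions = propagate(edges, ["p1", "p2", "p3", "p4"],
                        {"p1": "A", "p4": "B"}, ["A", "B"])
print(predictions)
```

In this toy example the author `a3` is linked to every paper, so under the assumed idf-like weighting it contributes nothing to any edge, while plain counting (`weighting="count"`) would let it connect all pairs of papers equally.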

Keywords

Network analysis; Heterogeneous information networks; Network decomposition; Personalized PageRank; Information retrieval; Text mining heuristics; Centroid classifier; SVM; Label propagation; Imbalanced data

Notes

Acknowledgments

This research was supported by the European Commission through the Human Brain Project (Grant number 604102) and three National Research Agency grants: the research programmes Knowledge Technologies (P2-0103) and Artificial intelligence and intelligent systems (P2-0209), and the project Development and applications of new semantic data mining methods in life sciences (J2-5478). Our thanks go to Miha Grčar for previous work on this topic, which inspired the research described in this paper.

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Jožef Stefan Institute, Ljubljana, Slovenia
  2. Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
  3. Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
