A Methodology for Mining Document-Enriched Heterogeneous Information Networks

Grčar, Miha; Lavrač, Nada

doi:10.1007/978-3-642-24477-3_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6926)

Included in the following conference series:

International Conference on Discovery Science

Abstract

The paper presents a new methodology for mining heterogeneous information networks, motivated by the fact that, in many real-life scenarios, documents are available in heterogeneous information networks, such as interlinked multimedia objects containing titles, descriptions, and subtitles. The methodology consists of transforming documents into bag-of-words vectors, decomposing the corresponding heterogeneous network into separate graphs and computing structural-context feature vectors with PageRank, and finally constructing a common feature vector space in which knowledge discovery is performed. We exploit this feature vector construction process to devise an efficient classification algorithm. We demonstrate the approach by applying it to the task of categorizing video lectures. We show that our approach exhibits low time and space complexity without compromising classification accuracy.
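The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lectures, the shared-author graph, the fixed block weights, the helper names, and the nearest-centroid classifier are all assumptions made for illustration, since the abstract does not specify them. Each lecture gets a bag-of-words vector and a personalized-PageRank vector computed on one graph, and the two are concatenated into a joint space.

```python
import numpy as np

def normalise_rows(X):
    """Scale every row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def bow_vectors(docs):
    """Term-frequency bag-of-words vectors, one normalised row per document."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for w in d.lower().split():
            X[row, index[w]] += 1.0
    return normalise_rows(X)

def personalised_pagerank(adj, source, damping=0.85, iters=100):
    """Power iteration for PageRank with all restarts going to `source`."""
    n = adj.shape[0]
    restart = np.zeros(n)
    restart[source] = 1.0
    out = adj.sum(axis=1, keepdims=True)
    # Row-stochastic transitions; nodes without out-links jump to the source.
    P = np.where(out > 0, adj / np.where(out == 0, 1.0, out), restart)
    rank = restart.copy()
    for _ in range(iters):
        rank = (1 - damping) * restart + damping * (rank @ P)
    return rank

def structural_vectors(adj):
    """One structural-context vector per node: its personalised PageRank."""
    n = adj.shape[0]
    return normalise_rows(np.array([personalised_pagerank(adj, s) for s in range(n)]))

def combined_space(blocks, weights):
    """Weighted concatenation; sqrt(w) makes dot products add up as w * <.,.>."""
    return np.hstack([np.sqrt(w) * b for w, b in zip(weights, blocks)])

def centroid_classify(X_train, y_train, X_test):
    """Assign each test row to the class whose normalised centroid is closest."""
    y = np.array(y_train)
    labels = sorted(set(y_train))
    centroids = normalise_rows(np.array([X_train[y == l].mean(axis=0) for l in labels]))
    return [labels[i] for i in (X_test @ centroids.T).argmax(axis=1)]

# Hypothetical toy data: six lectures, linked when they share an author.
docs = [
    "kernel methods for text classification",
    "support vector machines and kernel learning",
    "graph kernel learning",
    "protein folding and gene expression",
    "gene regulation networks in yeast",
    "protein interaction in yeast",
]
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    adj[a, b] = adj[b, a] = 1.0

B = bow_vectors(docs)               # textual features
S = structural_vectors(adj)         # structural-context features
weights = (0.5, 0.5)                # assumed fixed; could be tuned instead
X = combined_space([B, S], weights)

train, test = [0, 1, 3, 4], [2, 5]
y_train = ["ml", "ml", "bio", "bio"]
predictions = centroid_classify(X[train], y_train, X[test])
print(predictions)
```

With the square-root scaling, a dot product in the joint space equals the weighted sum of the per-block dot products, so the weights directly control how much text and structure each contribute to similarity.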


Author information

Authors and Affiliations

Dept. of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000, Ljubljana, Slovenia
Miha Grčar & Nada Lavrač

Editor information

Editors and Affiliations

Department of Software Systems, Tampere University of Technology, P. O. Box 553, 33101, Tampere, Finland
Tapio Elomaa
Department of Information and Computer Science, Aalto University School of Science, P.O. Box 15400, 00076, Aalto, Finland
Jaakko Hollmén
Helsinki Institute for Information Technology (HIIT), Finland
Heikki Mannila

About this paper

Cite this paper

Grčar, M., Lavrač, N. (2011). A Methodology for Mining Document-Enriched Heterogeneous Information Networks. In: Elomaa, T., Hollmén, J., Mannila, H. (eds) Discovery Science. DS 2011. Lecture Notes in Computer Science (LNAI), vol 6926. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24477-3_11

DOI: https://doi.org/10.1007/978-3-642-24477-3_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24476-6
Online ISBN: 978-3-642-24477-3
eBook Packages: Computer Science, Computer Science (R0)
