A Methodology for Mining Document-Enriched Heterogeneous Information Networks

  • Conference paper
Discovery Science (DS 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6926)

Abstract

The paper presents a new methodology for mining heterogeneous information networks, motivated by the fact that, in many real-life scenarios, documents are available in heterogeneous information networks, such as interlinked multimedia objects containing titles, descriptions, and subtitles. The methodology consists of transforming documents into bag-of-words vectors, decomposing the corresponding heterogeneous network into separate graphs and computing structural-context feature vectors with PageRank, and finally constructing a common feature vector space in which knowledge discovery is performed. We exploit this feature vector construction process to devise an efficient classification algorithm. We demonstrate the approach by applying it to the task of categorizing video lectures. We show that our approach exhibits low time and space complexity without compromising classification accuracy.
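
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lectures, the two graphs (`same_author`, `same_event`), the source weights, the use of personalised PageRank vectors as structural features, and the nearest-centroid classifier are all assumptions made here for illustration.

```python
import math
from collections import Counter

def normalise(v):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def bag_of_words(texts):
    """Map each text to a normalised term-frequency vector over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vectors.append(normalise([counts[w] for w in vocab]))
    return vectors

def personalised_pagerank(adj, source, damping=0.85, iters=50):
    """Power iteration for PageRank with restarts at `source`.
    `adj` is a list of neighbour lists for one homogeneous graph."""
    n = len(adj)
    rank = [0.0] * n
    rank[source] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for u, nbrs in enumerate(adj):
            if nbrs:
                share = damping * rank[u] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
            else:
                nxt[source] += damping * rank[u]  # dangling node: return mass to source
        nxt[source] += 1.0 - damping  # restart probability
        rank = nxt
    return rank

def structural_features(adj):
    """One structural-context vector per node: its normalised PageRank distribution."""
    return [normalise(personalised_pagerank(adj, i)) for i in range(len(adj))]

def combine(feature_sets, weights):
    """Build the common feature space by weighted concatenation of each source."""
    combined = []
    for parts in zip(*feature_sets):
        row = []
        for w, part in zip(weights, parts):
            row.extend(math.sqrt(w) * x for x in part)
        combined.append(row)
    return combined

def centroid_classify(train, labels, query):
    """Assign `query` to the class whose centroid has the highest cosine similarity."""
    sums = {}
    for vec, lab in zip(train, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
    q = normalise(query)
    def cosine(centroid):
        return sum(a * b for a, b in zip(normalise(centroid), q))
    return max(sums, key=lambda lab: cosine(sums[lab]))

# Hypothetical toy data: five lectures, the last one unlabeled.
texts = ["kernel methods svm", "svm margin kernel",
         "protein folding biology", "biology cell protein",
         "cell biology lecture"]
labels = ["ml", "ml", "bio", "bio"]
# Two homogeneous graphs obtained by decomposing the heterogeneous network.
same_author = [[1], [0], [3, 4], [2, 4], [2, 3]]
same_event = [[1], [0], [3], [2], []]

features = combine(
    [bag_of_words(texts), structural_features(same_author), structural_features(same_event)],
    weights=[0.5, 0.25, 0.25],  # illustrative weights, not tuned
)
predicted = centroid_classify(features[:4], labels, features[4])
print(predicted)
```

In this sketch the unlabeled lecture shares words and graph neighbours only with the two "bio" lectures, so its similarity to the "ml" centroid is zero. The source weights are fixed by hand here; how they should be chosen is left open.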

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Grčar, M., Lavrač, N. (2011). A Methodology for Mining Document-Enriched Heterogeneous Information Networks. In: Elomaa, T., Hollmén, J., Mannila, H. (eds) Discovery Science. DS 2011. Lecture Notes in Computer Science, vol 6926. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24477-3_11

  • DOI: https://doi.org/10.1007/978-3-642-24477-3_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24476-6

  • Online ISBN: 978-3-642-24477-3

  • eBook Packages: Computer Science, Computer Science (R0)
