Clustering Scientific Literature Using Sparse Citation Graph Analysis

  • Levent Bolelli
  • Seyda Ertekin
  • C. Lee Giles
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)


It is well known that connectivity analysis of linked documents provides significant information about the structure of the document space for unsupervised learning tasks. However, the ability to identify distinct clusters of documents based on link graph analysis is proportional to the density of the graph and depends on the availability of the linking and/or linked documents in the collection. In this paper, we present an information-theoretic approach to measuring the significance of individual words based on the underlying link structure of the document collection. This enables us to generate a non-uniform weight distribution over the feature space, which is used to augment the original corpus-based document similarities. Experimental results on a collection of scientific literature show that our method achieves better separation of distinct groups of documents, yielding improved clustering solutions.
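The abstract does not give the weighting formula, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that a word's significance can be approximated by the mutual information between two events over document pairs: "both documents contain the word" and "the pair is linked by a citation". The function names `word_link_weights` and `weighted_cosine`, the binary bag-of-words representation, and the `1 + weight` scaling are all assumptions made for this example.

```python
import math
from itertools import combinations


def word_link_weights(docs, links):
    """Score each word by the mutual information (in bits) between
    'both documents of a pair contain the word' and 'the pair is linked'.

    docs:  dict mapping document id -> set of words (hypothetical input format)
    links: set of frozenset({id_a, id_b}) for each cited/citing pair
    """
    pairs = list(combinations(sorted(docs), 2))
    n = len(pairs)
    vocab = set().union(*docs.values())
    weights = {}
    for w in vocab:
        # Joint counts over (shares word, is linked), each in {0, 1}.
        counts = {(s, l): 0 for s in (0, 1) for l in (0, 1)}
        for a, b in pairs:
            s = int(w in docs[a] and w in docs[b])
            l = int(frozenset((a, b)) in links)
            counts[(s, l)] += 1
        mi = 0.0
        for (s, l), c in counts.items():
            if c == 0:
                continue  # zero-probability cells contribute nothing
            p_sl = c / n
            p_s = (counts[(s, 0)] + counts[(s, 1)]) / n
            p_l = (counts[(0, l)] + counts[(1, l)]) / n
            mi += p_sl * math.log2(p_sl / (p_s * p_l))
        weights[w] = mi
    return weights


def weighted_cosine(a, b, weights):
    """Cosine similarity of two word sets where each word present is
    scaled by 1 + weight (an assumed way to augment plain similarity)."""
    def scale(w):
        return 1.0 + weights.get(w, 0.0)

    dot = sum(scale(w) ** 2 for w in a & b)
    norm_a = math.sqrt(sum(scale(w) ** 2 for w in a))
    norm_b = math.sqrt(sum(scale(w) ** 2 for w in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In this sketch, a word that every pair shares carries no information about links and receives a weight of zero, while a word shared mainly by linked pairs receives a positive weight. Documents that overlap only on uninformative words therefore end up less similar than under uniform weighting, which is the kind of sharper separation the abstract describes.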


Keywords (added by machine, not by the authors): Digital Library, Cluster Solution, Link Prediction, Textual Content, Link Structure



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Levent Bolelli (1)
  • Seyda Ertekin (1)
  • C. Lee Giles (1, 2)
  1. Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
  2. College of Information Sciences and Technology, The Pennsylvania State University, University Park, USA
