Using Link-Based Content Analysis to Measure Document Similarity Effectively

  • Pei Li
  • Zhixu Li
  • Hongyan Liu
  • Jun He
  • Xiaoyong Du
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5446)


With massive amounts of information being placed online, it is challenging to exploit both the internal and external information of documents when assessing the similarity between them. A variety of approaches have been proposed to model document similarity on different foundations, but they usually cannot combine internal and external information. In this paper, we introduce a link-based method, based on random walks on graphs, into content analysis. By defining similarity as the meeting probability of two random surfers, we propose a computational model for content analysis that can also be integrated with the external information of documents. An empirical study shows that our method achieves good accuracy, acceptable performance, and a fast convergence rate in multi-relational document similarity measurement.
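The idea of similarity as a meeting probability can be illustrated with a minimal sketch. The code below is not the model proposed in the paper, whose details are not given in this abstract; it is a generic SimRank-style iteration on a toy document-term graph. The function name `meeting_similarity`, the `decay` factor, the iteration count, and the toy edges are all assumptions made for illustration.

```python
from itertools import product


def meeting_similarity(neighbors, decay=0.8, iterations=10):
    """SimRank-style scores on an undirected graph (illustrative sketch).

    neighbors: dict mapping each node to a list of adjacent nodes.
    The score of (u, v) approximates how likely two random surfers
    starting at u and v are to meet, discounted by `decay` per step.
    """
    nodes = list(neighbors)
    # Start with each node similar only to itself.
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, nodes):
            if u == v:
                new[(u, v)] = 1.0
                continue
            nu, nv = neighbors[u], neighbors[v]
            if not nu or not nv:
                new[(u, v)] = 0.0
                continue
            # Average similarity over all pairs of neighbors.
            total = sum(sim[(i, j)] for i in nu for j in nv)
            new[(u, v)] = decay * total / (len(nu) * len(nv))
        sim = new
    return sim


# Hypothetical toy data: documents linked to the terms they contain.
edges = [
    ("d1", "graph"), ("d1", "walk"),
    ("d2", "graph"), ("d2", "walk"),
    ("d3", "cluster"),
]
neighbors = {}
for doc, term in edges:
    neighbors.setdefault(doc, []).append(term)
    neighbors.setdefault(term, []).append(doc)

scores = meeting_similarity(neighbors)
print(scores[("d1", "d2")], scores[("d1", "d3")])
```

In this toy graph, `d1` and `d2` share both of their terms and therefore receive a positive score, while `d3` shares no term with either and stays at zero. External information, such as citation or authorship links, could in principle be added as further edges of the same graph, although how the paper combines the two is not specified here.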


link graph, content analysis, document similarity





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Pei Li (1, 2)
  • Zhixu Li (1, 2)
  • Hongyan Liu (3)
  • Jun He (1, 2)
  • Xiaoyong Du (1, 2)

  1. Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China
  2. School of Information, Renmin University of China, Beijing, China
  3. Department of Management Science and Engineering, Tsinghua University, Beijing, China
