When Are Links Useful? Experiments in Text Classification

  • Conference paper
Advances in Information Retrieval (ECIR 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2633)

Abstract

Link analysis methods have become popular for information access tasks, especially information retrieval, where the link information in a document collection is used to complement the traditionally used content information. However, there has been little firm evidence to confirm the utility of link information. We show that link information can be useful when the document collection has a sufficiently high link density and links are of sufficiently high quality. We report experiments on text classification of the Cora and WebKB data sets using Probabilistic Latent Semantic Analysis and Probabilistic Hypertext Induced Topic Selection. Comparison with manually assigned classes shows that link information enhances classification in data with sufficiently high link density, but is detrimental to performance at low link densities or if the quality of the links is degraded. We introduce a new frequency-based method for selecting the most useful citations from a document collection for use in the model.
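
As a rough illustration of the two quantities the abstract turns on, the sketch below computes a link density and applies a frequency threshold to citations. Both definitions are assumptions made for illustration only: the abstract does not define link density (here taken as the mean number of links per document) and does not describe the frequency-based selection method (here approximated by keeping only targets cited at least `min_count` times). The function names and the toy document IDs are hypothetical.

```python
from collections import Counter


def link_density(links, n_docs):
    """Mean number of links per document (an assumed definition)."""
    if n_docs == 0:
        return 0.0
    return len(links) / n_docs


def frequent_citations(links, min_count=2):
    """Return the cited targets that are cited at least min_count times.

    This threshold rule is an assumption, not the paper's method.
    """
    counts = Counter(target for _, target in links)
    return {target for target, count in counts.items() if count >= min_count}


def filter_links(links, keep):
    """Drop every link whose target is not in the keep set."""
    return [(source, target) for source, target in links if target in keep]


# Toy collection of (citing_doc, cited_doc) pairs; the IDs are made up.
links = [("a", "x"), ("b", "x"), ("c", "x"),
         ("a", "y"), ("b", "z"), ("c", "z")]

keep = frequent_citations(links, min_count=2)
pruned = filter_links(links, keep)
density = link_density(links, n_docs=3)
```

Under these assumptions, raising `min_count` trades link density for link quality: fewer links survive, but each surviving target is cited by several documents.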

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fisher, M., Everson, R. (2003). When Are Links Useful? Experiments in Text Classification. In: Sebastiani, F. (ed.) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_4

  • DOI: https://doi.org/10.1007/3-540-36618-0_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-01274-0

  • Online ISBN: 978-3-540-36618-8

  • eBook Packages: Springer Book Archive
