When Are Links Useful? Experiments in Text Classification

Fisher, Michelle; Everson, Richard

doi:10.1007/3-540-36618-0_4


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2633)

Included in the following conference series:

European Conference on Information Retrieval


Abstract

Link analysis methods have become popular for information access tasks, especially information retrieval, where the link information in a document collection is used to complement the traditionally used content information. However, there has been little firm evidence to confirm the utility of link information. We show that link information can be useful when the document collection has a sufficiently high link density and links are of sufficiently high quality. We report experiments on text classification of the Cora and WebKB data sets using Probabilistic Latent Semantic Analysis and Probabilistic Hypertext Induced Topic Selection. Comparison with manually assigned classes shows that link information enhances classification in data with sufficiently high link density, but is detrimental to performance at low link densities or if the quality of the links is degraded. We introduce a new frequency-based method for selecting the most useful citations from a document collection for use in the model.



Author information

Authors and Affiliations

Department of Computer Science, Exeter University, UK
Michelle Fisher & Richard Everson


Editor information

Editors and Affiliations

Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, 1, 56124, Pisa, Italy
Fabrizio Sebastiani


About this paper

Cite this paper

Fisher, M., Everson, R. (2003). When Are Links Useful? Experiments in Text Classification. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_4


DOI: https://doi.org/10.1007/3-540-36618-0_4
Published: 15 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive
