Abstract
Link analysis methods have become popular for information access tasks, especially information retrieval, where the link information in a document collection is used to complement the traditionally used content information. However, there has been little firm evidence to confirm the utility of link information. We show that link information can be useful when the document collection has a sufficiently high link density and links are of sufficiently high quality. We report experiments on text classification of the Cora and WebKB data sets using Probabilistic Latent Semantic Analysis and Probabilistic Hypertext Induced Topic Selection. Comparison with manually assigned classes shows that link information enhances classification in data with sufficiently high link density, but is detrimental to performance at low link densities or if the quality of the links is degraded. We introduce a new frequency-based method for selecting the most useful citations from a document collection for use in the model.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fisher, M., Everson, R. (2003). When Are Links Useful? Experiments in Text Classification. In: Sebastiani, F. (ed.) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_4
DOI: https://doi.org/10.1007/3-540-36618-0_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive