Abstract
Link information has been shown to be useful for finding good Web pages with a search engine [5, 10]. In general, however, a search result covers several topics, and clustering the Web pages makes them easier for a user to browse. There are several works on clustering Web pages [7, 10–12]. In [9], we visualized Web graphs using a spring model. Clustering alone, however, is not enough to understand the topics of the clusters, so extracting meta-data that explains the communities is an important problem. Chakrabarti et al. [6] used the terms in a small neighborhood around a document. Our approach is to combine clustering and keyword extraction to interpret the communities.
To find communities, we solve the eigensystem of a matrix built from the link structure of Web pages. To obtain characteristic keywords for the communities found, we use the algorithm developed in [1, 2, 8]. The input to the algorithm is two sets of documents: positive documents and negative documents. The algorithm outputs a pattern that classifies them well. It is robust against errors and noise, which makes it suitable for Web pages. The novelty of the keyword extraction algorithm is that the keywords not only characterize one community but also distinguish it from others. Thus, even for a fixed community, the characteristic keywords differ depending on the counterpart community.
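The community-finding step can be illustrated with a small sketch. The abstract does not say which matrix is used, so this example assumes the co-citation matrix A^T A from Kleinberg's work [10], where A is the adjacency matrix of the link graph, and reads one community off each of the leading eigenvectors. The function names, the toy link graph, and the 0.3 weight threshold are all hypothetical and chosen only for illustration.

```python
import math


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cocitation_matrix(links, n):
    """Build M = A^T A, where A[i][j] = 1 if page i links to page j.

    M[j][k] counts the pages that link to both j and k.
    """
    m = [[0.0] * n for _ in range(n)]
    for targets in links.values():
        for j in targets:
            for k in targets:
                m[j][k] += 1.0
    return m


def top_eigenvectors(m, count, iterations=200):
    """Power iteration with deflation against eigenvectors already found.

    Assumption: the uniform start vector is not orthogonal to the wanted
    eigenvectors; this holds for the toy graph below, while a random start
    is the usual choice in general.
    """
    found = []
    for _ in range(count):
        v = normalize([1.0] * len(m))
        for _ in range(iterations):
            v = mat_vec(m, v)
            for u in found:
                dot = sum(a * b for a, b in zip(u, v))
                v = [a - dot * b for a, b in zip(v, u)]
            v = normalize(v)
        found.append(v)
    return found


# Hypothetical link graph: pages 0-2 link to pages 6 and 7,
# pages 3-4 link to pages 8 and 9, page 5 has no links.
links = {0: {6, 7}, 1: {6, 7}, 2: {6, 7}, 3: {8, 9}, 4: {8, 9}}
matrix = cocitation_matrix(links, 10)

# One community per eigenvector: the pages carrying large weight in it.
communities = [
    [page for page, weight in enumerate(vec) if abs(weight) > 0.3]
    for vec in top_eigenvectors(matrix, 2)
]
```

In this toy graph the two groups of co-cited pages fall into separate eigenvectors, so `communities` holds one list of page indices per community. Real Web graphs are far less cleanly separated, and a production implementation would use a sparse eigensolver rather than dense power iteration.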
We found good characteristic keywords for two communities without inspecting the Web pages in them. We also show an experimental result in which different keywords are extracted depending on the counterpart.
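The dependence of keywords on the counterpart can be seen in a minimal sketch. It is not the algorithm of [1, 2, 8], which searches richer patterns; as a simplifying assumption it considers only single words and scores each word by its agreement with the classification, that is, the number of positive documents containing the word plus the number of negative documents lacking it. The function name `best_keyword` and the toy documents are hypothetical.

```python
def best_keyword(positive, negative):
    """Return the single word that best separates the two document sets.

    Simplifying assumption: patterns are single lowercase words, and the
    score is the number of correctly classified documents.
    """
    pos_sets = [set(doc.lower().split()) for doc in positive]
    neg_sets = [set(doc.lower().split()) for doc in negative]
    candidates = set().union(*pos_sets)

    def agreement(word):
        hits = sum(word in words for words in pos_sets)
        rejections = sum(word not in words for words in neg_sets)
        return hits + rejections

    # Sorting first makes tie-breaking deterministic (alphabetical order).
    return max(sorted(candidates), key=agreement)


# Toy documents standing in for the pages of three communities.
java_pages = [
    "java programming language tutorial",
    "java language compiler download",
]
coffee_pages = [
    "java coffee beans roast",
    "coffee roast espresso",
]
python_pages = [
    "python programming language tutorial",
    "python language compiler",
]

# The same positive community, compared against two different counterparts.
keyword_vs_coffee = best_keyword(java_pages, coffee_pages)
keyword_vs_python = best_keyword(java_pages, python_pages)
```

Against the coffee pages, the word "java" is a poor discriminator because one coffee page also contains it, so a different word is selected than when the counterpart is the Python pages, where "java" separates the two sets perfectly.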
References
[1] H. Arimura and S. Shimozono, Maximizing Agreement with a Classification by Bounded or Unbounded Number of Associated Words, Proc. the 9th International Symposium on Algorithms and Computation (1998).
[2] H. Arimura, A. Wataki, R. Fujino, and S. Arikawa, A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases, Proc. the 8th International Workshop on Algorithmic Learning Theory, Otzenhausen, Germany, Lecture Notes in Artificial Intelligence 1501, Springer-Verlag, pp. 247–261, 1998.
[3] M. W. Berry, Z. Drmac, and E. R. Jessup, Matrices, Vector Spaces, and Information Retrieval, SIAM Review, 41 (1999) pp. 335–362.
[4] M. Berry, S. T. Dumais, and G. W. O’Brien, Using Linear Algebra for Intelligent Information Retrieval, SIAM Review, 37 (1995) pp. 573–595.
[5] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proc. WWW7, 1998.
[6] S. Chakrabarti, B. Dom, and P. Indyk, Enhanced Hypertext Categorization Using Hyperlinks, Proc. ACM SIGMOD (1998) pp. 307–318.
[7] J. Dean and M. R. Henzinger, Finding Related Pages in the World Wide Web, Proc. WWW8, 1999.
[8] D. Ikeda, Characteristic Sets of Strings Common to Semi-Structured Documents, Proc. the 2nd International Conference on Discovery Science, Lecture Notes in Artificial Intelligence 1721, Springer-Verlag, pp. 139–147, 1999.
[9] D. Ikeda, T. Taguchi, and S. Hirokawa, Developing a Knowledge Network of URLs, Proc. the 2nd International Conference on Discovery Science, Lecture Notes in Artificial Intelligence 1721, Springer-Verlag, pp. 328–329, 1999.
[10] J. M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 668–677, 1998.
[11] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins, The Web as a Graph: Measurements, Models, and Methods, Proc. 5th Annual International Conference on Computing and Combinatorics, Lecture Notes in Computer Science 1627, Springer-Verlag, pp. 1–17, 1999.
[12] O. Zamir and O. Etzioni, Web Document Clustering: a Feasibility Demonstration, Proc. ACM SIGIR’98 (1998) pp. 46–54.
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ikeda, D., Hirokawa, S. (2000). Extracting Positive and Negative Keywords for Web Communities. In: Arikawa, S., Morishita, S. (eds) Discovery Science. DS 2000. Lecture Notes in Computer Science, vol 1967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44418-1_34
DOI: https://doi.org/10.1007/3-540-44418-1_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41352-3
Online ISBN: 978-3-540-44418-3
eBook Packages: Springer Book Archive