Extracting Positive and Negative Keywords for Web Communities

Ikeda, Daisuke; Hirokawa, Sachio

doi:10.1007/3-540-44418-1_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1967)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Linkage information has been shown to be useful for finding good Web pages with a search engine [5, 10]. In general, however, a search result covers several topics, and clustering the Web pages makes them easier for a user to browse. There are several works on clustering Web pages [7, 10-12]. In [9], we visualized Web graphs using a spring model. But clustering alone is not enough to understand the topics of the clusters, so extracting metadata that explains communities is an important subject. Chakrabarti et al. [6] used the terms in a small neighborhood around a document. Our approach is to combine clustering and keyword extraction to interpret the communities.

To find communities, we solve the eigensystem of a matrix constructed from the link structure of Web pages. To obtain characteristic keywords from the communities found, we use the algorithm developed in [1, 2, 8]. The input to the algorithm is two sets of documents: positive documents and negative documents. The algorithm outputs a pattern that classifies them well. It is robust to errors and noise, which makes it suitable for Web pages. The novelty of the keyword extraction algorithm is that the keywords not only characterize one community but also distinguish it from others. Thus, even for a fixed community, the characteristic keywords differ depending on the counterpart community.
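As a rough illustration of the eigensystem step, the sketch below assumes a Kleinberg-style authority matrix A^T A in the spirit of [10]; the abstract does not say which matrix is actually used, so this choice, the toy link list, and the function names are all hypothetical. Each of the leading eigenvectors, found by power iteration with deflation, is read as one community: the pages carrying large weight in it.

```python
import math

def authority_matrix(n, links):
    """Return M = A^T A, where A[i][j] = 1 if page i links to page j.
    M[j][k] counts the pages that link to both j and k."""
    m = [[0.0] * n for _ in range(n)]
    outgoing = {}
    for src, dst in links:
        outgoing.setdefault(src, set()).add(dst)
    for targets in outgoing.values():
        for j in targets:
            for k in targets:
                m[j][k] += 1.0
    return m

def power_iteration(m, iters=200):
    """Approximate the dominant eigenpair of a symmetric PSD matrix."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # nothing left to extract
            break
        v = [x / norm for x in w]
        lam = norm
    return lam, v

def top_communities(n, links, k=2, threshold=0.1):
    """Read one community off each of the k leading eigenvectors."""
    m = authority_matrix(n, links)
    communities = []
    for _ in range(k):
        lam, v = power_iteration(m)
        communities.append(sorted(i for i, x in enumerate(v) if abs(x) > threshold))
        # Deflation: remove the eigenpair just found so the next one dominates.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return communities

# Toy Web graph (an assumption for illustration): pages 0 and 1 link to
# pages 2 and 3; pages 4, 5 and 6 link to pages 7 and 8.
links = [(0, 2), (0, 3), (1, 2), (1, 3),
         (4, 7), (4, 8), (5, 7), (5, 8), (6, 7), (6, 8)]
communities = top_communities(9, links)
print(communities)
```

On this toy graph the two groups of co-cited pages come out as separate communities, the more heavily cited group first. A real Web graph would need a sparse eigensolver and a less naive rule for reading communities off the eigenvectors.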

We found good characteristic keywords for two communities without inspecting the Web pages in them. We also show an experimental result in which different keywords are extracted depending on the counterpart.
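The dependence on the counterpart can be seen in a much simplified sketch. It is not the algorithm of [1, 2, 8], which searches richer patterns; here a pattern is assumed to be a single word, scored by agreement: the number of positive documents containing it plus the number of negative documents lacking it. The documents and the name `best_keywords` are made up for illustration.

```python
def best_keywords(positive, negative, top=1):
    """Rank single words by agreement with the positive/negative split."""
    pos_sets = [set(doc.lower().split()) for doc in positive]
    neg_sets = [set(doc.lower().split()) for doc in negative]
    vocab = set().union(*pos_sets, *neg_sets)

    def agreement(word):
        hits = sum(word in d for d in pos_sets)         # positives matched
        rejects = sum(word not in d for d in neg_sets)  # negatives avoided
        return hits + rejects

    # Highest agreement first; ties broken alphabetically for determinism.
    ranked = sorted(vocab, key=lambda w: (-agreement(w), w))
    return ranked[:top]

# Toy "communities" (hypothetical documents, not data from the paper).
jaguar_cars = ["jaguar car engine", "jaguar car dealer", "jaguar sedan car"]
jaguar_cats = ["jaguar cat jungle", "jaguar big cat", "jaguar cat habitat"]
toyota_cars = ["toyota car engine", "toyota car dealer", "toyota sedan car"]

# The same positive community, contrasted with two different counterparts.
print(best_keywords(jaguar_cars, jaguar_cats))  # ['car']
print(best_keywords(jaguar_cars, toyota_cars))  # ['jaguar']
```

Against the animal pages the word "car" separates the community best, while against another car maker it is "jaguar". Swapping the two arguments gives keywords for the counterpart instead.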

References

[1] H. Arimura and S. Shimozono, Maximizing Agreement with a Classification by Bounded or Unbounded Number of Associated Words. Proc. the 9th International Symposium on Algorithms and Computation (1998).
[2] H. Arimura, A. Wataki, R. Fujino, and S. Arikawa, A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases. Proc. the 8th International Workshop on Algorithmic Learning Theory, Otzenhausen, Germany, Lecture Notes in Artificial Intelligence 1501, Springer-Verlag, pp. 247–261, 1998.
[3] M. W. Berry, Z. Drmac, and E. R. Jessup, Matrices, Vector Spaces, and Information Retrieval, SIAM Review, 41 (1999) pp. 335–362.
[4] M. Berry, S. T. Dumais, G. W. O’Brien, Using Linear Algebra for Intelligent Information Retrieval, SIAM Rev., 37 (1995), pp. 573–595.
[5] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proc. WWW7, 1998.
[6] S. Chakrabarti, B. Dom, and P. Indyk, Enhanced Hypertext Categorization Using Hyperlinks, Proc. ACM SIGMOD (1998) pp. 307–318.
[7] J. Dean and M. R. Henzinger, Finding Related Pages in the World Wide Web, Proc. WWW8, 1999.
[8] D. Ikeda, Characteristic Sets of Strings Common to Semi-Structured Documents, Proc. the 2nd International Conference on Discovery Science, Lecture Notes in Artificial Intelligence 1721, Springer-Verlag, pp. 139–147, 1999.
[9] D. Ikeda, T. Taguchi, and S. Hirokawa, Developing a Knowledge Network of URLs, Proc. the 2nd International Conference on Discovery Science, Lecture Notes in Artificial Intelligence 1721, Springer-Verlag, pp. 328–329, 1999.
[10] J. M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 668–677, 1998.
[11] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins, The Web as a Graph: Measurements, Models, and Methods, Proc. 5th Annual International Conference on Computing and Combinatorics, Lecture Notes in Computer Science 1627, Springer-Verlag, pp. 1–17, 1999.
[12] O. Zamir and O. Etzioni, Web Document Clustering: a Feasibility Demonstration, Proc. ACM SIGIR’98 (1998) pp. 46–54.

Author information

Authors and Affiliations

Computing and Communications Center, Kyushu University, 812-8581, Fukuoka, Japan
Daisuke Ikeda & Sachio Hirokawa

Editor information

Editors and Affiliations

Faculty of Information Science and Electrical Engineering, Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, 812-8581, Fukuoka, Japan
Setsuo Arikawa
Faculty of Science Department of Information Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-0033, Tokyo, Japan
Shinichi Morishita

About this paper

Cite this paper

Ikeda, D., Hirokawa, S. (2000). Extracting Positive and Negative Keywords for Web Communities. In: Arikawa, S., Morishita, S. (eds) Discovery Science. DS 2000. Lecture Notes in Computer Science, vol 1967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44418-1_34

DOI: https://doi.org/10.1007/3-540-44418-1_34
Published: 19 October 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41352-3
Online ISBN: 978-3-540-44418-3
eBook Packages: Springer Book Archive
