Extracting Topic Maps from Web Pages

Mase, Motohiro; Yamada, Seiji; Nitta, Katsumi

doi:10.1007/978-3-642-00399-8_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5433)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

We propose a framework for extracting topic maps from a set of Web pages. The framework clusters the Web pages and derives topic map prototypes from the resulting clusters. We extend an existing clustering method in two ways. First, only linked Web pages are merged, so that the underlying relationships between topics are extracted. Second, we introduce a weighting based on the content similarity of the Web pages and on the relevance between the topics of the pages. The relevance is based on the type of each link with respect to the directory structure of the Web site and on the distance between the directories in which the pages are located. We generate topic map prototypes from the clustering results. Finally, users complete a prototype by labeling the topics and associations and removing unnecessary items. In this paper, as a first step, we implemented the proposed clustering method and used it to extract prototypes.
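
The abstract gives no formulas, so the following Python sketch is only one possible reading of the two ideas, not the authors' method. It assumes bag-of-words cosine similarity for content, a relevance of 1 / (1 + directory distance), the product of the two as the link weight, and greedy single-linkage merging over links above a threshold. The names edge_weight, dir_distance, cluster_linked_pages and the threshold value are hypothetical, and the link-type component of the relevance is omitted because the abstract does not define it.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def directory(url):
    """Directory components of a URL path, without a trailing file name."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts and "." in parts[-1]:  # assumption: a dot marks a file name
        parts = parts[:-1]
    return parts


def dir_distance(u, v):
    """Steps up and down the directory tree needed to get from u to v."""
    a, b = directory(u), directory(v)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def edge_weight(u, v, texts):
    """Assumed weight: content similarity times directory-based relevance."""
    sim = cosine(Counter(texts[u].split()), Counter(texts[v].split()))
    relevance = 1.0 / (1.0 + dir_distance(u, v))
    return sim * relevance


def cluster_linked_pages(texts, links, threshold=0.1):
    """Merge pages only along links whose weight reaches the threshold."""
    parent = {u: u for u in texts}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    weighted = sorted(
        ((edge_weight(u, v, texts), u, v) for u, v in links), reverse=True
    )
    for w, u, v in weighted:
        if w < threshold:
            break  # remaining links are weaker still
        parent[find(u)] = find(v)

    groups = {}
    for u in texts:
        groups.setdefault(find(u), []).append(u)
    return sorted(sorted(g) for g in groups.values())


# Toy data (invented for illustration only).
texts = {
    "http://example.org/a/index.html": "topic maps clustering web",
    "http://example.org/a/page.html": "topic maps clustering pages",
    "http://example.org/b/other.html": "cooking recipes pasta",
    "http://example.org/c/x.html": "topic maps clustering web",
}
links = [
    ("http://example.org/a/index.html", "http://example.org/a/page.html"),
    ("http://example.org/a/page.html", "http://example.org/b/other.html"),
]
clusters = cluster_linked_pages(texts, links)
```

In this toy run the two pages under /a/ end up in one cluster, while the page under /c/ stays alone even though its text matches, because no link connects it to the others. Each resulting cluster would correspond to a candidate topic in a prototype, which a user would then label or discard.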

Author information

Authors and Affiliations

Tokyo Institute of Technology, Japan
Motohiro Mase & Katsumi Nitta
National Institute of Informatics, Japan
Seiji Yamada

Editor information

Editors and Affiliations

School of Information Technologies, University of Sydney, NSW, Australia
Sanjay Chawla
The Institute of Scientific and Industrial Research, Osaka University, 8-1, Mihogaoka, Ibaraki, 567, Osaka, Japan
Takashi Washio
Division of Computing Science, Hokkaido University, 060-0814, Sapporo, Japan
Shin-ichi Minato
School of Medicine, Department of Medical Informatics, Shimane University, 89-1 Enya-cho, Izumo, 693-8501, Shimane, Japan
Shusaku Tsumoto
Central Research Institute of Electric Power Industry, 2-11-1 Iwado-kita, Komae-shi, 201-8511, Tokyo, Japan
Takashi Onoda
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, 101-8430, Tokyo, Japan
Seiji Yamada
The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, 567, Osaka, Japan
Akihiro Inokuchi

Cite this paper

Mase, M., Yamada, S., Nitta, K. (2009). Extracting Topic Maps from Web Pages. In: Chawla, S., et al. New Frontiers in Applied Data Mining. PAKDD 2008. Lecture Notes in Computer Science, vol 5433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00399-8_15

DOI: https://doi.org/10.1007/978-3-642-00399-8_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00398-1
Online ISBN: 978-3-642-00399-8
eBook Packages: Computer Science, Computer Science (R0)