Using WWW-Distribution of Words in Detecting Peculiar Web Pages

Hirose, Masayuki; Suzuki, Einoshin

doi:10.1007/978-3-540-30214-8_31

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3245)

Abstract

In this paper, we propose TFIGF, a method that, given a set of keywords, detects peculiar web pages by using the distribution of words across the WWW. TFIGF selects a set of index words that represent a WWW page by estimating each word's importance within the page and its rareness on the WWW. Experiments on both English and Japanese WWW pages clearly show the superiority of our approach over a traditional method that uses only a limited number of WWW pages for the estimation.
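
The abstract does not state the TFIGF formula, so the following is only a minimal sketch of the general idea it describes: score each word by its importance in a page multiplied by its rareness on the web. The logarithmic weighting (borrowed from TF-IDF), the function names, and the `web_hit_counts` and `web_size` inputs (for example, counts reported by a search engine) are all assumptions made for illustration, not the authors' definitions.

```python
import math
from collections import Counter


def tf_rareness_scores(page_tokens, web_hit_counts, web_size):
    """Score words as (count in page) * log(web_size / web-wide hit count).

    Hypothetical weighting in the style of TF-IDF; the paper's actual
    TFIGF definition may differ.
    """
    term_freq = Counter(page_tokens)
    scores = {}
    for word, count in term_freq.items():
        # Assumption: a word with no known web count is treated as one hit.
        hits = max(1, web_hit_counts.get(word, 1))
        scores[word] = count * math.log(web_size / hits)
    return scores


def top_index_words(page_tokens, web_hit_counts, web_size, k=3):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = tf_rareness_scores(page_tokens, web_hit_counts, web_size)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy data, invented for this example.
tokens = ["the", "the", "the", "quokka", "quokka", "page"]
hits = {"the": 900_000, "quokka": 10, "page": 500_000}
print(top_index_words(tokens, hits, web_size=1_000_000, k=2))
# prints ['quokka', 'page']
```

In this toy run, "quokka" ranks first even though "the" occurs more often in the page, because its web-wide count is far smaller. Estimating rareness from web-scale counts rather than from a small local collection is the contrast the abstract draws with the traditional method.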

Author information

Authors and Affiliations

Electrical and Computer Engineering, Yokohama National University, Japan
Masayuki Hirose & Einoshin Suzuki

Editor information

Editors and Affiliations

Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Nishi, 819-0395, Fukuoka, Japan
Einoshin Suzuki
Kyushu University, 6–10–1 Hakozaki Higashi-ku, 812–8581, Fukuoka, Japan
Setsuo Arikawa

About this paper

Cite this paper

Hirose, M., Suzuki, E. (2004). Using WWW-Distribution of Words in Detecting Peculiar Web Pages. In: Suzuki, E., Arikawa, S. (eds) Discovery Science. DS 2004. Lecture Notes in Computer Science, vol 3245. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30214-8_31

DOI: https://doi.org/10.1007/978-3-540-30214-8_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23357-2
Online ISBN: 978-3-540-30214-8
eBook Packages: Springer Book Archive
