Using WWW-Distribution of Words in Detecting Peculiar Web Pages

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3245)

Abstract

In this paper, we propose TFIGF, a method that, given a set of keywords, detects peculiar web pages using the distribution of words on the WWW. TFIGF selects a set of index words that represent a WWW page by estimating each word's importance in the page and its rareness on the WWW. Experiments on both English and Japanese WWW pages clearly show the superiority of our approach over a traditional method that uses only a limited number of WWW pages in the estimation.
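The abstract does not give the TFIGF formula, so the following is only a rough sketch of the general idea it describes: weighting a word by its importance within one page and by its rareness on the web as a whole. The scoring form (relative term frequency times the log of an inverse web-wide hit ratio, by analogy with TF-IDF) is an assumption and not the authors' definition. The function name `score_index_words`, the `web_hit_counts` mapping, and the `web_size` estimate are all hypothetical; in practice the hit counts might come from a search engine.

```python
import math
from collections import Counter


def score_index_words(page_tokens, web_hit_counts, web_size, top_k=3):
    """Rank the words of one page by in-page frequency times web-wide rareness.

    ASSUMPTION: this TF-IDF-like form is illustrative only; the paper's
    actual TFIGF definition is not stated in the abstract.

    page_tokens    -- list of word tokens from a single page
    web_hit_counts -- hypothetical mapping: word -> number of web pages containing it
    web_size       -- hypothetical estimate of the total number of web pages
    """
    counts = Counter(page_tokens)
    total = len(page_tokens)
    scores = {}
    for word, count in counts.items():
        importance = count / total                 # relative frequency in the page
        hits = max(web_hit_counts.get(word, 1), 1)  # unknown words are treated as 1 hit
        rareness = math.log(web_size / hits)       # larger for words rare on the web
        scores[word] = importance * rareness
    # Highest-scoring words first; these would serve as candidate index words.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy data (invented for illustration, not from the paper).
tokens = ["the", "quokka", "the", "habitat", "quokka", "the"]
hits = {"the": 900_000, "quokka": 50, "habitat": 10_000}
ranked = score_index_words(tokens, hits, web_size=1_000_000)
print([word for word, _ in ranked])
```

In this toy example a word that is frequent in the page but common on the web ("the") ranks below a word that is both frequent in the page and rare on the web ("quokka"), which is the behavior the abstract attributes to estimating rareness from the WWW rather than from a limited page collection.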

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hirose, M., Suzuki, E. (2004). Using WWW-Distribution of Words in Detecting Peculiar Web Pages. In: Suzuki, E., Arikawa, S. (eds) Discovery Science. DS 2004. Lecture Notes in Computer Science, vol 3245. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30214-8_31

  • DOI: https://doi.org/10.1007/978-3-540-30214-8_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23357-2

  • Online ISBN: 978-3-540-30214-8

  • eBook Packages: Springer Book Archive
