Automatic Document Organization in a P2P Environment

  • Stefan Siersdorfer
  • Sergej Sizov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)

Abstract

This paper describes an efficient method for constructing reliable machine learning applications in peer-to-peer (P2P) networks by building ensemble-based meta methods. We consider this problem in the context of distributed Web exploration applications such as focused crawling. Typical applications are user-specific classification of retrieved Web content into personalized topic hierarchies and automatic refinement of such taxonomies using unsupervised machine learning methods (e.g., clustering). Our approach combines models from multiple peers into an advanced decision model that takes the generalization performance of the individual ‘local’ peer models into account. In addition, meta algorithms can be applied in a restrictive manner, i.e., by leaving out some ‘uncertain’ documents. The results of our systematic evaluation show the viability of the proposed approach.
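
The idea of combining peer models while leaving out ‘uncertain’ documents can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes binary decisions (+1 or -1) from each peer model, uses a hypothetical per-peer weight standing in for an estimate of that model's generalization performance, and abstains when the weighted vote falls inside a hypothetical threshold band. The function name and all parameters are illustrative.

```python
from typing import Optional, Sequence


def restrictive_meta_vote(
    votes: Sequence[int],
    weights: Sequence[float],
    threshold: float,
) -> Optional[int]:
    """Combine binary peer decisions by weighted voting, with abstention.

    votes     -- one decision per peer model, each +1 or -1 (assumed encoding)
    weights   -- one non-negative weight per peer, e.g. an accuracy estimate
                 (assumption: the source does not specify the weighting)
    threshold -- half-width of the 'uncertain' band around zero, in [0, 1)

    Returns +1 or -1, or None when the document is left out as uncertain.
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalized weighted vote in [-1, 1]; values near 0 mean disagreement.
    score = sum(w * v for v, w in zip(votes, weights)) / total

    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return None  # restrictive behaviour: no decision for this document


if __name__ == "__main__":
    peer_weights = [0.9, 0.8, 0.6]
    print(restrictive_meta_vote([1, 1, -1], peer_weights, threshold=0.3))
    print(restrictive_meta_vote([1, -1, -1], peer_weights, threshold=0.3))
```

Raising the threshold makes the combined model decide on fewer documents; setting it to 0 recovers plain weighted voting, which abstains only on an exact tie.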

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Stefan Siersdorfer (1)
  • Sergej Sizov (1)
  1. Max-Planck Institute for Computer Science, Germany
