An adaptive personalized news dissemination system

Abstract

With the explosive growth of the World Wide Web, information overload has become a crucial concern. In a data-rich, information-poor environment like the Web, separating useful or desirable information from large volumes of mostly worthless data has become a tedious task. The role of machine learning in tackling this problem is thoroughly discussed in the literature, but few systems are available for public use. In this work, we bridge theory and practice by implementing a web-based news reader enhanced with a specifically designed machine learning framework for dynamic content personalization. This gives us the chance to examine applicability and implementation issues and to discuss the effectiveness of machine learning methods for the classification of real-world text streams. The main features of our system, named PersoNews, are: (a) the aggregation of many different news sources that offer an RSS version of their content, (b) incremental filtering, offering dynamic personalization of the content not only per user but also per feed a user is subscribed to, and (c) the ability for every user to follow a more abstract topic of interest by filtering through a taxonomy of topics. PersoNews is freely available for public use on the WWW (http://news.csd.auth.gr).
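To make the idea of incremental filtering over a growing vocabulary more concrete, the following is a minimal sketch, not the PersoNews implementation. It assumes a two-label setting ("interesting" / "uninteresting"), whitespace tokenization, chi-square feature ranking recomputed at prediction time, and a simplified Naive Bayes score that only uses selected words present in the item. The class name `IncrementalNaiveBayes`, the `top_n` parameter, and the sample headlines are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over a growing vocabulary, with feature selection at prediction time."""

    def __init__(self, top_n=50):
        self.top_n = top_n                     # hypothetical cap on the features used per prediction
        self.doc_counts = Counter()            # label -> number of items seen
        self.word_docs = defaultdict(Counter)  # word -> label -> number of items containing the word

    def update(self, text, label):
        """Fold one labelled item into the counts; unseen words extend the feature space."""
        self.doc_counts[label] += 1
        for word in set(text.lower().split()):
            self.word_docs[word][label] += 1

    def _chi2(self, word, label):
        """Chi-square statistic of the word/label 2x2 contingency table."""
        n = sum(self.doc_counts.values())
        n_label = self.doc_counts[label]
        a = self.word_docs[word][label]               # in label, contains word
        b = sum(self.word_docs[word].values()) - a    # other labels, contains word
        c = n_label - a                               # in label, lacks word
        d = (n - n_label) - b                         # other labels, lacks word
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        return n * (a * d - b * c) ** 2 / denom if denom else 0.0

    def predict(self, text):
        """Return the most probable label, or None if nothing has been learned yet."""
        if not self.doc_counts:
            return None
        labels = list(self.doc_counts)
        # Re-rank the whole vocabulary with the current counts and keep the top_n words.
        ranked = sorted(
            self.word_docs,
            key=lambda w: max(self._chi2(w, c) for c in labels),
            reverse=True,
        )
        selected = set(ranked[: self.top_n])
        words = set(text.lower().split()) & selected
        n = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in labels:
            score = math.log(self.doc_counts[c] / n)  # log prior
            for w in words:
                # Laplace-smoothed probability that an item of label c contains w.
                score += math.log((self.word_docs[w][c] + 1) / (self.doc_counts[c] + 2))
            if score > best_score:
                best_label, best_score = c, score
        return best_label


model = IncrementalNaiveBayes(top_n=50)
model.update("stocks market rally", "interesting")
model.update("market shares rise", "interesting")
model.update("celebrity gossip party", "uninteresting")
model.update("celebrity party photos", "uninteresting")
print(model.predict("market rally today"))
```

Because only counts are stored, each user rating can be folded in with a single `update` call and no retraining pass; keeping one such model per user and per feed would be one way to get the per-feed personalization the abstract mentions.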

Notes

  1. The Apache SpamAssassin Project: http://spamassassin.apache.org/

  2. SpamBayes: Bayesian Anti-Spam Classifier: http://spambayes.sourceforge.net/

  3. Mozilla Thunderbird: http://www.mozilla.com/thunderbird/

  4. Google Reader: http://reader.google.com

  5. Bloglines: http://www.bloglines.com

  6. SharpReader: http://sharpreader.net

  7. Digg: http://digg.com

  8. NewsCloud: http://newscloud.com

  9. Findory: http://www.findory.com

  10. Spotback: http://www.spotback.com

  11. Reddit: http://www.reddit.com

  12. Google News: http://news.google.com

  13. PNS: http://pns.iit.demokritos.gr/

  14. MyFeedz: http://www.myfeedz.com

  15. http://spambayes.sourceforge.net/, http://popfile.sourceforge.net/

  16. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

  17. Both datasets are available at http://mlkd.csd.auth.gr/datasets.html

  18. Note that all IFS-enhanced methods can be applied with no initial training set. Unfortunately, the three baseline methods described in the section need a set of training documents in order to construct the feature space that they use.

  19. The respective figures for the spam corpus are similar.

  20. http://www.acm.org/class/

  21. http://www.w3.org/MarkUp/

  22. http://www.w3.org/Style/CSS/

  23. http://www.mozilla.org/js/

  24. http://www.dhtmlcentral.com/

  25. http://www.debian.org

  26. http://www.apache.org

  27. http://www.mysql.com

  28. http://www.php.net

  29. http://pear.php.net

  30. http://magpierss.sourceforge.net/

  31. http://smarty.php.net

  32. http://news.csd.auth.gr

  33. We treat the characterization of a message as uninteresting as the positive class.

Acknowledgements

This work was partially supported by a PENED program (EPAN M.8.3.1, No. 03EΔ73), jointly funded by the European Union and the Greek Government (General Secretariat of Research and Technology/GSRT).

Author information

Correspondence to Ioannis Katakis.

About this article

Cite this article

Katakis, I., Tsoumakas, G., Banos, E. et al. An adaptive personalized news dissemination system. J Intell Inf Syst 32, 191–212 (2009). https://doi.org/10.1007/s10844-008-0053-8

Keywords

  • Personalization
  • Text classification
  • Concept drift
  • Ontology
  • News filtering
  • Dynamic feature space