With the explosive growth of the World Wide Web, information overload has become a crucial concern. In a data-rich, information-poor environment like the Web, separating useful or desirable information from masses of mostly worthless data has become a tedious task. The role of Machine Learning in tackling this problem is thoroughly discussed in the literature, but few systems are available for public use. In this work, we bridge theory and practice by implementing a web-based news reader enhanced with a specifically designed machine learning framework for dynamic content personalization. This gives us the chance to examine applicability and implementation issues and to discuss the effectiveness of machine learning methods for the classification of real-world text streams. The main features of our system, named PersoNews, are: (a) the aggregation of many different news sources that offer an RSS version of their content, (b) incremental filtering, offering dynamic personalization of the content not only per user but also per feed a user is subscribed to, and (c) the ability for every user to watch a more abstract topic of interest by filtering through a taxonomy of topics. PersoNews is freely available for public use on the WWW (http://news.csd.auth.gr).
The Apache SpamAssassin Project: http://spamassassin.apache.org/
SpamBayes: Bayesian Anti-Spam Classifier: http://spambayes.sourceforge.net/
Mozilla Thunderbird: http://www.mozilla.com/thunderbird/
Findory: http://www.findory.com
Spotback: http://www.spotback.com
Reddit: http://www.reddit.com
Google News: http://news.google.com
MyFeedz: http://www.myfeedz.com
Both datasets are available at http://mlkd.csd.auth.gr/datasets.html
Note that all IFS-enhanced methods can be applied with no initial training set. Unfortunately, the three baseline methods described in the section need a set of training documents to construct the feature space that they use.
The respective figures for the spam corpus are similar.
We treat the characterization of a message as uninteresting as the positive class.
This work was partially supported by a PENED program (EPAN M.8.3.1, No. 03EΔ73), jointly funded by the European Union and the Greek Government (General Secretariat of Research and Technology/GSRT).
Katakis, I., Tsoumakas, G., Banos, E. et al. An adaptive personalized news dissemination system. J Intell Inf Syst 32, 191–212 (2009). https://doi.org/10.1007/s10844-008-0053-8
- Text classification
- Concept drift
- News filtering
- Dynamic feature space