Abstract
Despite the benefits of the Internet and social networks within the eSociety concept, the large data collections available to Internet users for viewing and analysis can contain unwanted or malicious information. This paper considers the problem of protecting users of the “electronic society” infrastructure against such information. It discusses the nature of the problem and possible approaches to solving it. As a solution, we propose a modular approach to constructing automated systems for protection against information, based on Data Mining methods. We describe an implementation of a system for protection against unwanted and harmful content, built on a classifier with a three-level hierarchical architecture. We also discuss its experimental evaluation, which confirmed that the system operates with high efficiency for most of the analyzed categories of web sites.
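To make the idea of a three-level hierarchical classifier concrete, the following is a minimal Python sketch. The abstract does not specify what each level does, so the split used here (per-source base scoring, weighted combination, thresholded decision) is an assumption made for illustration and should not be read as the authors' architecture. The keyword sets, category names, weights, and threshold are all hypothetical stand-ins for models that would in practice be trained with Data Mining methods.

```python
import re
from collections import Counter

# Hypothetical keyword sets standing in for trained per-category models.
KEYWORDS = {
    "gambling": {"casino", "poker", "bet"},
    "news": {"news", "report", "election"},
}
# Hypothetical set of categories treated as unwanted.
BLOCKED = {"gambling"}


def tokenize(s):
    """Split a string into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", s.lower())


def level1_scores(tokens):
    """Level 1 (assumed): base classifier for one feature source.

    Returns, per category, the fraction of tokens that match its keywords.
    """
    counts = Counter(tokens)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {
        cat: sum(counts[w] for w in words) / total
        for cat, words in KEYWORDS.items()
    }


def level2_combine(per_source, weights):
    """Level 2 (assumed): weighted average of the per-source scores."""
    wsum = sum(weights[src] for src in per_source)
    return {
        cat: sum(weights[src] * sc[cat] for src, sc in per_source.items()) / wsum
        for cat in KEYWORDS
    }


def level3_decide(scores, threshold=0.1):
    """Level 3 (assumed): pick the top category, or 'unknown' below threshold."""
    cat, best = max(scores.items(), key=lambda kv: kv[1])
    return cat if best >= threshold else "unknown"


def classify(url, text, weights=None):
    """Run all three levels; return (category, should_block)."""
    weights = weights or {"url": 1.0, "text": 2.0}  # hypothetical weights
    per_source = {
        "url": level1_scores(tokenize(url)),
        "text": level1_scores(tokenize(text)),
    }
    category = level3_decide(level2_combine(per_source, weights))
    return category, category in BLOCKED
```

In this sketch, a call such as `classify(url, text)` returns the predicted category together with a flag saying whether that category is in the blocked set. Keeping the levels as separate functions mirrors the modular approach the abstract mentions: any one level could be swapped for a trained model without changing the others.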
Acknowledgements
This research is supported by RSF grant #15-11-30029 at SPIIRAS.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kotenko, I., Saenko, I., Chechulin, A. (2017). Protection Against Information in eSociety: Using Data Mining Methods to Counteract Unwanted and Malicious Data. In: Alexandrov, D., Boukhanovsky, A., Chugunov, A., Kabanov, Y., Koltsova, O. (eds) Digital Transformation and Global Society. DTGS 2017. Communications in Computer and Information Science, vol 745. Springer, Cham. https://doi.org/10.1007/978-3-319-69784-0_15
DOI: https://doi.org/10.1007/978-3-319-69784-0_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69783-3
Online ISBN: 978-3-319-69784-0
eBook Packages: Computer Science (R0)