Protection Against Information in eSociety: Using Data Mining Methods to Counteract Unwanted and Malicious Data

Kotenko, Igor; Saenko, Igor; Chechulin, Andrey

doi:10.1007/978-3-319-69784-0_15


Part of the book series: Communications in Computer and Information Science (CCIS, volume 745)

Included in the following conference series:

International Conference on Digital Transformation and Global Society


Abstract

Despite the benefits of the Internet and social networks within the concept of eSociety, the huge data collections available to Internet users for viewing and analysis can contain unwanted or malicious information. The paper considers the problem of protecting users of the “electronic society” infrastructure against such information. It discusses the nature of the problem and possible approaches to its solution. To solve the problem, we propose a modular approach to constructing automated systems for protection against information, based on Data Mining methods. We consider the implementation of a system for protection against unwanted and harmful content, based on a classifier with a three-level hierarchical architecture. We also discuss its experimental evaluation, which confirmed the high efficiency of the system for most of the analyzed categories of web sites.



Acknowledgements

This research is supported by RSF grant #15-11-30029 at SPIIRAS.

Author information

Authors and Affiliations

St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, 39, 14 Liniya, Saint Petersburg, Russia
Igor Kotenko, Igor Saenko & Andrey Chechulin
St. Petersburg National Research University of Information Technologies, Mechanics and Optics, 49, Kronverkskiy prospekt, Saint-Petersburg, Russia
Igor Kotenko, Igor Saenko & Andrey Chechulin


Corresponding author

Correspondence to Igor Kotenko.

Editor information

Editors and Affiliations

National Research University Higher School of Economics, St. Petersburg, Russia
Daniel A. Alexandrov
ITMO University, St. Petersburg, Russia
Alexander V. Boukhanovsky
ITMO University, St. Petersburg, Russia
Andrei V. Chugunov
National Research University Higher School of Economics, St. Petersburg, Russia
Yury Kabanov
National Research University Higher School of Economics, St. Petersburg, Russia
Olessia Koltsova


About this paper

Cite this paper

Kotenko, I., Saenko, I., Chechulin, A. (2017). Protection Against Information in eSociety: Using Data Mining Methods to Counteract Unwanted and Malicious Data. In: Alexandrov, D., Boukhanovsky, A., Chugunov, A., Kabanov, Y., Koltsova, O. (eds) Digital Transformation and Global Society. DTGS 2017. Communications in Computer and Information Science, vol 745. Springer, Cham. https://doi.org/10.1007/978-3-319-69784-0_15


DOI: https://doi.org/10.1007/978-3-319-69784-0_15
Published: 09 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69783-3
Online ISBN: 978-3-319-69784-0
eBook Packages: Computer Science, Computer Science (R0)
