Protection Against Information in eSociety: Using Data Mining Methods to Counteract Unwanted and Malicious Data

  • Conference paper
Digital Transformation and Global Society (DTGS 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 745)

Abstract

Despite the benefits of the Internet and social networks within the concept of eSociety, the huge data collections available to Internet users for viewing and analysis can contain unwanted or malicious information. The paper considers the problem of protecting users of the “electronic society” infrastructure against such information, and discusses the nature of the problem and possible approaches to its solution. To solve the problem, we propose a modular approach to constructing automated systems of protection against information, based on Data Mining methods. We consider the implementation of a system of protection against unwanted and harmful content, based on a classifier with a three-level hierarchical architecture. Its experimental evaluation, which confirmed the high efficiency of the system for most of the analyzed categories of web sites, is also discussed.
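
The abstract does not say what the three levels of the classifier are, so the following is only an illustrative sketch of how a three-level hierarchical classifier for web content might be organized. It assumes that level 1 scores separate aspects of a page (text, URL, title), level 2 combines those scores per category, and level 3 makes the final allow/block decision. The keyword lists, weights, threshold, blocking policy, and all function names are hypothetical and stand in for models that a real system would learn from labeled data.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists; a real system would learn features from data.
KEYWORDS = {
    "gambling": {"casino", "poker", "bet", "jackpot"},
    "news": {"report", "election", "minister", "weather"},
}
BLOCKED = {"gambling"}  # hypothetical blocking policy
WEIGHTS = {"text": 0.5, "url": 0.25, "title": 0.25}  # hypothetical weights


@dataclass
class Page:
    url: str
    title: str
    text: str


def tokens(s):
    """Split a string into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", s.lower())


def keyword_scores(words):
    """Fraction of tokens that match each category's keyword list."""
    total = max(len(words), 1)
    return {c: sum(w in kw for w in words) / total for c, kw in KEYWORDS.items()}


def level1(page):
    """Level 1 (assumed): one independent scorer per page aspect."""
    return {
        "text": keyword_scores(tokens(page.text)),
        "url": keyword_scores(tokens(page.url)),
        "title": keyword_scores(tokens(page.title)),
    }


def level2(aspect_scores):
    """Level 2 (assumed): weighted combination of aspect scores per category."""
    return {
        c: sum(WEIGHTS[a] * scores[c] for a, scores in aspect_scores.items())
        for c in KEYWORDS
    }


def level3(combined, threshold=0.05):
    """Level 3 (assumed): pick the best category and apply the policy."""
    category, score = max(combined.items(), key=lambda kv: kv[1])
    if score < threshold:
        return "unknown", "allow"
    return category, ("block" if category in BLOCKED else "allow")


def classify(page):
    """Run the page through all three levels."""
    return level3(level2(level1(page)))


if __name__ == "__main__":
    page = Page(
        url="http://best-casino-poker.example/bet",
        title="Casino jackpot",
        text="Play poker and bet at our casino to win the jackpot today",
    )
    print(classify(page))
```

Keeping the levels as separate functions mirrors the modular idea in the abstract: an aspect scorer or the combination rule can be replaced without touching the final decision step.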

Notes

  1. http://urlblacklist.com

  2. http://www.shallalist.de

  3. https://www.dmoz.org

  4. https://rapidminer.com

Acknowledgements

This research is supported by RSF grant #15-11-30029 at SPIIRAS.

Author information

Corresponding author

Correspondence to Igor Kotenko.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Kotenko, I., Saenko, I., Chechulin, A. (2017). Protection Against Information in eSociety: Using Data Mining Methods to Counteract Unwanted and Malicious Data. In: Alexandrov, D., Boukhanovsky, A., Chugunov, A., Kabanov, Y., Koltsova, O. (eds) Digital Transformation and Global Society. DTGS 2017. Communications in Computer and Information Science, vol 745. Springer, Cham. https://doi.org/10.1007/978-3-319-69784-0_15

  • DOI: https://doi.org/10.1007/978-3-319-69784-0_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69783-3

  • Online ISBN: 978-3-319-69784-0

  • eBook Packages: Computer Science, Computer Science (R0)
