Data Cleaning Technique for Security Logs Based on Fellegi-Sunter Theory
Information security is one of the most important aspects an organization should consider. Due to this matter and the variety of existing vulnerabilities, there are specialized groups known as Computer Security Incident Response Team (CSIRT), that are responsible for event monitoring and for providing proactive and reactive support related to incidents. Using as a case study a CSIRT of a university with 10,000 users, and considering the high volume of events to be analyzed on a daily basis, it is proposed to implement a Big Data ecosystem. One of the most important activities for the information processing is the data cleaning phase, it will remove useless data and help to overcome storage limitations, since CSIRT is actually limited to a small time-frame, usually a few days and cannot analyze historical security events. Focusing on this cleaning phase, this article analyzes an intuitive technique and proposes a comparative technique based on the Fellegi-Sunter theory. The main conclusion of our research is that some data could be safely ignored helping to reduce storage size requirements. Moreover, increasing the data retention will enable to detect some events from historical data.
KeywordsData Cleaning Big Data Security Fellegi-Sunter
We thank to the National Polytechnic School CSIRT for their collaboration and facilities needed to test this data cleaning technique.
- 1.Qaiyum, S., Aziz, I.A., Jaafar, J.B.: Analysis of Big Data and quality-of-experience in high-density wireless network. In: 2016 3rd International Conference on Computer and Information Sciences (ICCOINS), pp. 287–292 (2016). doi: 10.1109/ICCOINS.2016.7783229
- 4.Martínez-Mosquera, D., Luján-Mora, S.: Data cleaning technique for security Big Data ecosystem. In: Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security, vol. 1, pp. 380–385 (2017). doi: 10.5220/0006360603800385
- 6.Luján Mora, S., Palomar Sanz, M.: Reducing inconsistency in integrating data from different sources. In: Proceedings 2001 International Database Engineering and Applications Symposium (IDEAS 2001), pp. 209–218 (2001). doi: 10.1109/IDEAS.2001.938087
- 7.Luján Mora, S., Palomar Sanz, M.: Comparing string similarity measures for reducing inconsistency in integrating data from different sources. In: Proceedings of the Second International Conference in Advances in Web-Age Information Management (WAIM 2001), pp. 191–202 (2001). doi: 10.1007/3-540-47714-4_18
- 8.Aye, T.T.: Web log cleaning for mining of web usage patterns. In: 2011 3rd International Conference Computer Research and Development (ICCRD), vol. 2, pp. 490–494 (2011). doi: 10.1109/ICCRD.2011.5764181
- 9.Maletic, J.I., Marcus, A.: Data cleansing: a prelude to knowledge discovery. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 19–32. Springer, USA (2009). doi: 10.1007/978-0-387-09823-4_2
- 10.Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Yin, S.: Bigdansing: a system for Big Data cleansing. In: ACM SIGMOD International Conference on Management of Data, pp. 1215–1230 (2015). doi: 10.1145/2723372.2747646
- 11.Krishnan, S., Haas, D., Franklin, M., Wu, E.: Towards reliable interactive data cleaning: a user survey and recommendations. In: ACM SIGMOD/PODS Conference Workshop on Human. In the Loop Data Analytics (2016), p. 9. doi: 10.1145/2939502.2939511
- 12.Winkler, W.E.: Using the EM algorithm for weight computation in the fellegi-sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association, vol. 667, p. 671 (1988)Google Scholar
- 13.Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)Google Scholar