Data Cleaning Technique for Security Logs Based on Fellegi-Sunter Theory

  • Diana Martinez-Mosquera
  • Sergio Luján-Mora
  • Gabriel López
  • Lauro Santos
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 300)


Information security is one of the most important aspects an organization should consider. For this reason, and given the variety of existing vulnerabilities, specialized groups known as Computer Security Incident Response Teams (CSIRTs) are responsible for monitoring events and for providing proactive and reactive support related to incidents. Using as a case study the CSIRT of a university with 10,000 users, and considering the high volume of events to be analyzed daily, we propose implementing a Big Data ecosystem. One of the most important activities in information processing is the data cleaning phase, which removes useless data and helps to overcome storage limitations: the CSIRT is currently limited to a small time frame, usually a few days, and cannot analyze historical security events. Focusing on this cleaning phase, this article analyzes an intuitive technique and proposes a comparative technique based on the Fellegi-Sunter theory. The main conclusion of our research is that some data can be safely ignored, which reduces storage requirements. Moreover, increasing data retention will make it possible to detect some events from historical data.
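
For readers unfamiliar with the Fellegi-Sunter theory named above, the following is a minimal sketch of its core decision rule: per-field agreement weights are summed and compared against two thresholds. It is not the technique proposed in this paper. The field names, the m and u probabilities, and the thresholds are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical per-field probabilities (assumptions, not taken from the paper):
#   m = P(field agrees | the two records refer to the same event)
#   u = P(field agrees | the two records refer to different events)
FIELD_PROBS = {
    "src_ip": (0.95, 0.05),
    "dst_port": (0.90, 0.20),
    "signature": (0.98, 0.10),
}


def match_weight(rec_a, rec_b, probs=FIELD_PROBS):
    """Sum the log2 agreement or disagreement weight of every field."""
    total = 0.0
    for field, (m, u) in probs.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, lower=0.0, upper=5.0):
    """Apply the two-threshold rule; both thresholds are illustrative."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


# Hypothetical log records for illustration only.
event_a = {"src_ip": "10.0.0.7", "dst_port": 22, "signature": "ssh-bruteforce"}
event_b = {"src_ip": "10.0.0.7", "dst_port": 2222, "signature": "ssh-bruteforce"}
event_c = {"src_ip": "10.0.0.9", "dst_port": 80, "signature": "http-scan"}

for other in (event_a, event_b, event_c):
    print(classify(match_weight(event_a, other)))
```

In practice the m and u probabilities are estimated from data rather than fixed by hand, and the thresholds are chosen to bound the acceptable error rates.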


Data Cleaning, Big Data, Security, Fellegi-Sunter



We thank the National Polytechnic School CSIRT for its collaboration and for the facilities needed to test this data cleaning technique.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Departamento de Ciencias de la Ingeniería, Universidad Israel, Quito, Ecuador
  2. Department of Software and Computing Systems, University of Alicante, Alicante, Spain
  3. Departamento de Electrónica, Telecomunicaciones y Redes de Información, Escuela Politécnica Nacional, Quito, Ecuador
  4. Performance Testing and Continuous Integration, Nokia Solutions and Networks, Amadora, Portugal
