A Big Data Architecture for Log Data Storage and Analysis

  • Swapneel Mehta
  • Prasanth Kothuri
  • Daniel Lanza Garcia
Part of the Studies in Computational Intelligence book series (SCI, volume 771)


We propose an architecture for analysing database connection logs across multiple database instances within an intranet comprising over 10,000 users and associated devices. Our system uses Flume agents to send notifications to the Hadoop Distributed File System (HDFS) for long-term storage and to Elasticsearch and Kibana for short-term visualisation, effectively creating a data lake for the extraction of log data. We apply an ensemble of machine learning models to filter and process the indicators within the data, and aim to predict anomalies or outliers using feature vectors built from this log data.
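The idea of building feature vectors from connection logs and flagging outliers can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout (`user`, `client_host`, `hour_of_day`), the three features, and the scoring rule (a median-based modified z-score with a conventional cutoff of 3.5) are all assumptions chosen for illustration, and a single simple detector stands in for the ensemble of models mentioned above.

```python
from collections import defaultdict
from statistics import median

# Hypothetical connection-log records: (user, client_host, hour_of_day).
# Field names and values are illustrative, not taken from the chapter.
LOGS = (
    [("alice", "pc-01", h) for h in (9, 10, 14)]
    + [("bob", "pc-02", h) for h in (11, 15, 16)]
    + [("bob", "pc-03", 17)]
    + [("carol", "pc-04", h) for h in (9, 11, 13, 15, 17)]
    + [("dave", "pc-05", h) for h in (10, 12, 14, 16)]
    + [("mallory", f"host-{i}", 3) for i in range(40)]
)


def build_features(records):
    """Return {user: (connection_count, distinct_hosts, off_hours_fraction)}."""
    per_user = defaultdict(list)
    for user, host, hour in records:
        per_user[user].append((host, hour))
    features = {}
    for user, rows in per_user.items():
        count = len(rows)
        hosts = len({host for host, _ in rows})
        # Assumed definition of "off hours": before 07:00 or from 19:00 onwards.
        off_hours = sum(1 for _, hour in rows if hour < 7 or hour >= 19) / count
        features[user] = (count, hosts, off_hours)
    return features


def robust_scores(values):
    """Modified z-scores using the median and median absolute deviation (MAD)."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad > 0:
        return [0.6745 * (v - med) / mad for v in values]
    # When more than half the values are identical the MAD is zero, so fall
    # back to a scale based on the mean absolute deviation.
    mean_ad = sum(abs_dev) / len(abs_dev)
    if mean_ad == 0:
        return [0.0 for _ in values]
    return [(v - med) / (1.253314 * mean_ad) for v in values]


def find_anomalies(records, threshold=3.5):
    """Flag users whose score exceeds the threshold on any single feature."""
    features = build_features(records)
    users = sorted(features)
    flagged = set()
    for column in range(3):
        scores = robust_scores([features[u][column] for u in users])
        flagged.update(u for u, s in zip(users, scores) if abs(s) > threshold)
    return sorted(flagged)


if __name__ == "__main__":
    print(find_anomalies(LOGS))
```

With this toy data only the hypothetical user `mallory`, who connects 40 times at night from 40 different hosts, is flagged. In a deployment along the lines described above, the records would presumably be read from the long-term store rather than from an in-memory list.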


Keywords: Big data analysis, Log data storage, System architecture, Anomaly detection, Unsupervised learning



The authors would like to acknowledge the contributions of Mr. Eric Grancher, Mr. Luca Canali, Mr. Michael Davis, Dr. Jean-Roch Vlimant, Mr. Adrian Alan Pol, and other members of the CERN IT-DB Group. They are grateful to the staff and management of the CERN Openlab Team, including Mr. Alberto Di Meglio, for their support in undertaking this project.



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Swapneel Mehta (1), email author
  • Prasanth Kothuri (2)
  • Daniel Lanza Garcia (2)
  1. Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
  2. European Organisation for Nuclear Research, Geneva, Switzerland
