A Big Data Architecture for Log Data Storage and Analysis
We propose an architecture for analysing database connection logs across multiple database instances within an intranet comprising over 10,000 users and associated devices. Our system uses Flume agents to send notifications to a Hadoop Distributed File System (HDFS) for long-term storage and to Elasticsearch and Kibana for short-term visualisation, effectively creating a data lake from which log data can be extracted. We apply an ensemble of machine learning models to filter and process the indicators within the data, and aim to predict anomalies or outliers using feature vectors built from this log data.
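As an illustration of building feature vectors from connection logs and flagging outliers, the following is a minimal sketch only. The record layout `(user, client_host, login_succeeded)`, the three features, the helper names, and the sample data are all assumptions made for this example. The robust z-score (median and median absolute deviation) is a simple stand-in for an unsupervised detector, not the ensemble of models the architecture adopts.

```python
from collections import defaultdict
from statistics import median


def make_sample_logs():
    """Return hypothetical connection records: (user, client_host, login_succeeded)."""
    records = []
    for user, n in [("alice", 4), ("bob", 5), ("carol", 3), ("dave", 4), ("erin", 5)]:
        records += [(user, f"pc-{user}", True)] * n
    # One account connecting from many hosts, with half of its logins failing.
    records += [("mallory", f"host-{i % 10}", i % 2 == 0) for i in range(40)]
    return records


def build_feature_vectors(records):
    """Aggregate records into one vector per user:
    [connection count, distinct client hosts, failed-login fraction]."""
    counts = defaultdict(int)
    hosts = defaultdict(set)
    failures = defaultdict(int)
    for user, host, ok in records:
        counts[user] += 1
        hosts[user].add(host)
        if not ok:
            failures[user] += 1
    return {u: [counts[u], len(hosts[u]), failures[u] / counts[u]] for u in counts}


def robust_z_scores(values):
    """Score each value by its distance from the median, in units of MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: when most values are identical, any deviation counts as extreme.
        return [0.0 if v == med else float("inf") for v in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(vectors, threshold=3.5):
    """Return users whose score exceeds the threshold on at least one feature."""
    if not vectors:
        return []
    users = sorted(vectors)
    n_features = len(vectors[users[0]])
    flagged = set()
    for j in range(n_features):
        scores = robust_z_scores([vectors[u][j] for u in users])
        flagged.update(u for u, z in zip(users, scores) if abs(z) > threshold)
    return sorted(flagged)


if __name__ == "__main__":
    vectors = build_feature_vectors(make_sample_logs())
    print(flag_outliers(vectors))
```

The threshold of 3.5 is a conventional default for robust z-scores and would need tuning against real connection data. In a deployment along the lines described above, the records would presumably be read from the long-term store rather than generated in memory.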
Keywords: Big data analysis, Log data storage, System architecture, Anomaly detection, Unsupervised learning
The authors would like to acknowledge the contributions of Mr. Eric Grancher, Mr. Luca Canali, Mr. Michael Davis, Dr. Jean-Roch Vlimant, Mr. Adrian Alan Pol, and other members of the CERN IT-DB Group. They are grateful to the staff and management of the CERN Openlab Team, including Mr. Alberto Di Meglio, for their support in undertaking this project.