Abstract
We propose an architecture for analysing database connection logs across multiple database instances within an intranet of over 10,000 users and associated devices. Our system uses Flume agents to send notifications to the Hadoop Distributed File System (HDFS) for long-term storage and to ElasticSearch and Kibana for short-term visualisation, effectively creating a data lake from which log data can be extracted. We apply an ensemble of machine learning models to filter and process the indicators within the data, and aim to predict anomalies or outliers using feature vectors built from this log data.
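The idea of building feature vectors from connection logs and flagging outliers with an ensemble can be illustrated with a minimal sketch. This is not the authors' implementation: the record fields (`user`, `host`, `ok`), the three features, and the two detectors (a median-absolute-deviation score and an interquartile-range fence) are assumptions chosen for illustration, standing in for the models the chapter does not name here.

```python
from collections import defaultdict
from statistics import median, quantiles


def build_features(records):
    """Aggregate raw connection records into one feature vector per user.

    Each record is a dict with hypothetical keys: "user", "host", "ok".
    The vector is (connection count, distinct hosts, failure ratio).
    """
    counts = defaultdict(int)
    failures = defaultdict(int)
    hosts = defaultdict(set)
    for rec in records:
        user = rec["user"]
        counts[user] += 1
        hosts[user].add(rec["host"])
        if not rec["ok"]:
            failures[user] += 1
    return {
        user: (counts[user], len(hosts[user]), failures[user] / counts[user])
        for user in counts
    }


def mad_flags(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Over half the values are identical; this detector abstains.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def iqr_flags(values, k=1.5):
    """Flag values outside the fences [Q1 - k*IQR, Q3 + k*IQR].

    statistics.quantiles needs at least two data points.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v < q1 - k * spread or v > q3 + k * spread for v in values]


def flag_anomalies(features, min_votes=2):
    """Return users flagged by at least min_votes detectors.

    A detector votes for a user if it flags any one of that user's features.
    """
    users = sorted(features)
    votes = {user: 0 for user in users}
    for detector in (mad_flags, iqr_flags):
        flagged = set()
        # zip(*...) transposes the vectors so each column is one feature.
        for column in zip(*(features[u] for u in users)):
            for user, is_outlier in zip(users, detector(list(column))):
                if is_outlier:
                    flagged.add(user)
        for user in flagged:
            votes[user] += 1
    return [user for user in users if votes[user] >= min_votes]
```

Requiring agreement between detectors (`min_votes=2`) trades recall for fewer false alarms; lowering it to 1 would flag any user that either detector considers unusual.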
Acknowledgments
The authors would like to acknowledge the contributions of Mr. Eric Grancher, Mr. Luca Canali, Mr. Michael Davis, Dr. Jean-Roch Vlimant, Mr. Adrian Alan Pol, and other members of the CERN IT-DB Group. They are grateful to the staff and management of the CERN Openlab Team, including Mr. Alberto Di Meglio, for their support in undertaking this project.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Mehta, S., Kothuri, P., Garcia, D.L. (2019). A Big Data Architecture for Log Data Storage and Analysis. In: Krishna, A., Srikantaiah, K., Naveena, C. (eds) Integrated Intelligent Computing, Communication and Security. Studies in Computational Intelligence, vol 771. Springer, Singapore. https://doi.org/10.1007/978-981-10-8797-4_22
DOI: https://doi.org/10.1007/978-981-10-8797-4_22
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8796-7
Online ISBN: 978-981-10-8797-4
eBook Packages: Intelligent Technologies and Robotics (R0)