Abstract
As a leading cybersecurity vendor, Sangfor must collect log streams from thousands of endpoint detection devices, such as NTA, STA, and EDR, and identify security threats in real time every day. Discovering and handling network security incidents is highly time-sensitive: responses are needed within seconds or even milliseconds to prevent possible cyber attacks and data leaks. To extract more valuable information, the log streams are analyzed in memory using stream processing with pattern matching such as CEP (Complex Event Processing), and then stored in persistent storage systems, such as a data warehouse or a search engine, so that data scientists and network security engineers can perform OLAP (Online Analytical Processing). Sangfor therefore needs a low-latency big data platform to meet the challenges posed by massive log volumes.
A growing number of open source systems have been proposed, each addressing a particular aspect of data processing. Designing a real-time big data infrastructure requires many decisions that trade off their respective benefits. Moreover, how to combine these systems into a unified, one-stack big data platform has been a key obstacle for big data analytics. In this paper, we present the overall architecture of our low-latency big data infrastructure and identify four important design decisions, i.e., message queue, stream processing, OLAP, and data lake. We analyze the advantages and disadvantages of existing open source systems and clarify the reasons behind our choices. We also describe the improvements and optimizations that make the open source stacks fit Sangfor’s environments, including designing a real-time development platform based on Flink and re-architecting Apache Kylin, ClickHouse, and Presto as a HOLAP system. Finally, we highlight two important use cases to validate the rationality of our infrastructure.
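To make the idea of in-memory pattern matching over a log stream concrete, the following is a minimal sketch in plain Python. It is not the implementation described in the paper, which builds on Flink. The event fields (`ts`, `source`, `kind`), the function name `detect_bruteforce`, and the rule itself (a successful login after several failures from the same source within a time window) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class LogEvent:
    ts: float     # event time in seconds (hypothetical field)
    source: str   # originating host or IP (hypothetical field)
    kind: str     # "login_failed" or "login_ok" (hypothetical values)


def detect_bruteforce(events: Iterable[LogEvent],
                      threshold: int = 3,
                      window_s: float = 10.0) -> Iterator[Tuple[str, float]]:
    """Yield (source, ts) when a login_ok follows at least `threshold`
    login_failed events from the same source within `window_s` seconds."""
    # Per-source partial matches are kept in memory only.
    failures: Dict[str, Deque[float]] = defaultdict(deque)
    for ev in events:  # assumption: events arrive in timestamp order
        recent = failures[ev.source]
        # Evict failures that have fallen out of the time window.
        while recent and ev.ts - recent[0] > window_s:
            recent.popleft()
        if ev.kind == "login_failed":
            recent.append(ev.ts)
        elif ev.kind == "login_ok":
            if len(recent) >= threshold:
                yield ev.source, ev.ts
            recent.clear()  # a successful login resets the partial match


stream = [
    LogEvent(0.0, "10.0.0.5", "login_failed"),
    LogEvent(1.0, "10.0.0.5", "login_failed"),
    LogEvent(1.5, "10.0.0.9", "login_failed"),
    LogEvent(2.0, "10.0.0.5", "login_failed"),
    LogEvent(3.0, "10.0.0.5", "login_ok"),
    LogEvent(4.0, "10.0.0.9", "login_ok"),
]
alerts = list(detect_bruteforce(stream))
print(alerts)  # [('10.0.0.5', 3.0)]
```

A production CEP engine additionally has to handle out-of-order events, state persistence, and distribution across workers, none of which this sketch attempts.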
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, F., Yan, Z., Gu, L. (2022). Towards Low-Latency Big Data Infrastructure at Sangfor. In: Chen, J., He, D., Lu, R. (eds) Emerging Information Security and Applications. EISA 2022. Communications in Computer and Information Science, vol 1641. Springer, Cham. https://doi.org/10.1007/978-3-031-23098-1_3
DOI: https://doi.org/10.1007/978-3-031-23098-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23097-4
Online ISBN: 978-3-031-23098-1
eBook Packages: Computer Science, Computer Science (R0)