Towards Low-Latency Big Data Infrastructure at Sangfor

Chen, Fei; Yan, Zhengzheng; Gu, Liang

doi:10.1007/978-3-031-23098-1_3

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1641)

Included in the following conference series:

International Symposium on Emerging Information Security and Applications

Abstract

As a top cybersecurity vendor, Sangfor needs to collect log streams from thousands of endpoint detection devices such as NTA, STA, and EDR, and to identify security threats in real time every day. The discovery and handling of network security incidents are highly time-sensitive, requiring response times of seconds or even milliseconds to prevent possible cyber attacks and data leaks. To extract more valuable information, the log streams are analyzed in memory using stream processing with pattern matching such as CEP (Complex Event Processing), and then stored in persistent storage systems such as a data warehouse or a search engine, where data scientists and network security engineers perform OLAP (Online Analytical Processing). Sangfor therefore needs to build a low-latency big data platform to meet the challenges posed by massive volumes of logs.

A growing number of open source systems have been proposed, each addressing a particular aspect of data processing. Designing a real-time big data infrastructure requires many decisions that balance the benefits of these systems. Moreover, how to architect these systems and construct a unified one-stack big data platform has been a key obstacle for big data analytics. In this paper, we present the overall architecture of our low-latency big data infrastructure and identify four important design decisions, i.e., message queue, stream processing, OLAP, and data lake. We analyze the advantages and disadvantages of existing open source systems and clarify the reasons behind our choices. We also describe the improvements and optimizations that make the open source stacks fit Sangfor’s environments, including designing a real-time development platform based on Flink and re-architecting Apache Kylin, ClickHouse, and Presto as a HOLAP system. Finally, we highlight two important use cases to verify the rationality of our infrastructure.
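
To make the in-memory pattern matching step concrete, the following is a minimal sketch of a CEP-style rule over a log stream. It is not the system described in the paper, which builds on Flink. The event fields, the rule (several failed logins followed by a success within a time window), and all names such as LogEvent and detect are assumptions chosen purely for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    # Hypothetical log record; the field names are illustrative assumptions.
    device: str
    kind: str          # e.g. "login_failed" or "login_ok"
    timestamp: float   # event time in seconds


class FailedThenSuccessPattern:
    """Matches `threshold` or more failed logins followed by a success,
    all on the same device and within `window` seconds of the success."""

    def __init__(self, threshold=3, window=10.0):
        self.threshold = threshold
        self.window = window
        self._failures = {}  # device -> deque of failure timestamps

    def on_event(self, event):
        failures = self._failures.setdefault(event.device, deque())
        # Evict failures that fell out of the time window.
        while failures and event.timestamp - failures[0] > self.window:
            failures.popleft()
        if event.kind == "login_failed":
            failures.append(event.timestamp)
            return False
        if event.kind == "login_ok":
            matched = len(failures) >= self.threshold
            failures.clear()  # a success resets the partial match
            return matched
        return False


def detect(events, threshold=3, window=10.0):
    """Returns the events that complete the pattern, in arrival order."""
    pattern = FailedThenSuccessPattern(threshold, window)
    return [e for e in events if pattern.on_event(e)]
```

The sketch keeps only a small per-device buffer in memory and assumes events arrive in timestamp order. A production stream processor would additionally handle out-of-order events, state checkpointing, and parallelism.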

Author information

Authors and Affiliations

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Fei Chen & Zhengzheng Yan
Sangfor Inc., Shenzhen, China
Fei Chen & Liang Gu

Corresponding author

Correspondence to Fei Chen.

Editor information

Editors and Affiliations

Central China Normal University, Wuhan, China
Jiageng Chen
Wuhan University, Wuhan, China
Debiao He
University of New Brunswick, Fredericton, NB, Canada
Rongxing Lu

About this paper

Cite this paper

Chen, F., Yan, Z., Gu, L. (2022). Towards Low-Latency Big Data Infrastructure at Sangfor. In: Chen, J., He, D., Lu, R. (eds) Emerging Information Security and Applications. EISA 2022. Communications in Computer and Information Science, vol 1641. Springer, Cham. https://doi.org/10.1007/978-3-031-23098-1_3

DOI: https://doi.org/10.1007/978-3-031-23098-1_3
Published: 04 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23097-4
Online ISBN: 978-3-031-23098-1
eBook Packages: Computer Science, Computer Science (R0)
