Towards Low-Latency Big Data Infrastructure at Sangfor

  • Conference paper
Emerging Information Security and Applications (EISA 2022)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1641))

Abstract

As a top cybersecurity vendor, Sangfor needs to collect log streams from thousands of endpoint detection devices such as NTA, STA, and EDR and to identify security threats in real time every day. The discovery and disposal of network security incidents are highly time-sensitive, requiring response times of seconds or even milliseconds to prevent possible cyber attacks and data leaks. To extract more valuable information, the log streams are analyzed in memory using stream processing with pattern matching such as CEP (Complex Event Processing), and then stored in a persistent storage system, such as a data warehouse or a search engine, so that data scientists and network security engineers can perform OLAP (Online Analytical Processing). Sangfor therefore needs to build a low-latency big data platform to meet the challenges of massive logs.

A growing number of open source systems have been proposed, each addressing one aspect of the data processing problem. Designing a real-time big data infrastructure requires many decisions that trade off their respective benefits. Moreover, how to architect these systems and construct a one-stack unified big data platform has been a key obstacle for big data analytics. In this paper, we present the overall architecture of our low-latency big data infrastructure and identify four important design decisions, i.e., message queue, stream processing, OLAP, and data lake. We analyze the advantages and disadvantages of existing open source systems and clarify the reasons behind our choices. We also describe the improvements and optimizations that make the open source stacks fit Sangfor’s environments, including designing a real-time development platform based on Flink and re-architecting Apache Kylin, ClickHouse, and Presto as a HOLAP system. Finally, we highlight two important use cases to verify the rationality of our infrastructure.
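
To make the idea of in-memory pattern matching over a log stream concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation and does not use Flink or any CEP library. The event fields, the rule (several failed logins followed by a success from the same host within a time window), and all names such as LogEvent and detect_bruteforce are assumptions chosen for illustration only.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical log record; the field names are illustrative assumptions."""
    ts: float   # event time in seconds
    host: str   # device or endpoint that produced the log
    kind: str   # "login_failed" or "login_ok"


def detect_bruteforce(events, min_failures=3, window=10.0):
    """Return (host, ts) alerts for an assumed CEP-style rule.

    Rule: at least `min_failures` failed logins from one host, all within
    `window` seconds before a successful login from the same host.
    Events are assumed to arrive in non-decreasing timestamp order.
    """
    recent_failures = defaultdict(deque)  # per-host state kept in memory
    alerts = []
    for ev in events:
        fails = recent_failures[ev.host]
        # Expire failures that fell out of the time window.
        while fails and ev.ts - fails[0] > window:
            fails.popleft()
        if ev.kind == "login_failed":
            fails.append(ev.ts)
        elif ev.kind == "login_ok":
            if len(fails) >= min_failures:
                alerts.append((ev.host, ev.ts))
            fails.clear()  # the pattern instance is complete either way
    return alerts


if __name__ == "__main__":
    stream = [
        LogEvent(0.0, "a", "login_failed"),
        LogEvent(1.0, "a", "login_failed"),
        LogEvent(2.0, "a", "login_failed"),
        LogEvent(3.0, "a", "login_ok"),
    ]
    print(detect_bruteforce(stream))
```

A production stream processor would add what this sketch omits, such as out-of-order event handling, fault-tolerant state, and a sink that writes matches to persistent storage for later OLAP queries.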

Corresponding author

Correspondence to Fei Chen.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, F., Yan, Z., Gu, L. (2022). Towards Low-Latency Big Data Infrastructure at Sangfor. In: Chen, J., He, D., Lu, R. (eds) Emerging Information Security and Applications. EISA 2022. Communications in Computer and Information Science, vol 1641. Springer, Cham. https://doi.org/10.1007/978-3-031-23098-1_3

  • DOI: https://doi.org/10.1007/978-3-031-23098-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23097-4

  • Online ISBN: 978-3-031-23098-1

  • eBook Packages: Computer Science, Computer Science (R0)