
Towards Real-Time Analysis of ID-Associated Data

  • Guodong Jin
  • Yixuan Wang
  • Xiongpai Qin
  • Yueguo Chen
  • Xiaoyong Du
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11158)

Abstract

ID-associated data are sequences of entries in which each entry is semantically associated with a unique ID. Examples include user behaviour logs of mobile applications, keyed by user ID, and sensor records of self-driving cars, keyed by device ID. Many big data applications now generate such data at high speed, and most queries over them are ID-centric: they target specific IDs and time ranges. To produce valuable insights in a timely manner, a system must ingest high volumes of such data with low latency and support efficient real-time analysis over them. In this paper, we introduce a system prototype designed for this goal. The system uses a parallel ingestion pipeline and a lightweight indexing scheme for fast ingestion and efficient analysis. In addition, a fiber partitioning method is used to achieve dynamic scalability. For better integration with the Hadoop ecosystem, the prototype is built on open source projects, including Kafka and Presto.
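To make the notion of an ID-centric query concrete, the following is a minimal in-memory sketch, not the prototype's implementation. It assumes that a "fiber" can be modelled as a hash bucket of IDs and that entries carry a numeric timestamp and a string payload; the class `FiberStore` and its methods are hypothetical names introduced only for illustration.

```python
import bisect
import zlib
from collections import defaultdict


class FiberStore:
    """Toy store for ID-associated entries.

    Assumption: a 'fiber' is modelled as a hash bucket of IDs. The paper's
    actual partitioning and indexing schemes may differ.
    """

    def __init__(self, num_fibers=4):
        self.num_fibers = num_fibers
        # One dict per fiber: ID -> list of (timestamp, payload), sorted by timestamp.
        self.fibers = [defaultdict(list) for _ in range(num_fibers)]

    def fiber_of(self, entry_id):
        # A stable hash keeps every entry of one ID in the same fiber.
        return zlib.crc32(entry_id.encode("utf-8")) % self.num_fibers

    def ingest(self, entry_id, timestamp, payload):
        # Insert in timestamp order so range queries can use binary search.
        entries = self.fibers[self.fiber_of(entry_id)][entry_id]
        bisect.insort(entries, (timestamp, payload))

    def query(self, entry_id, start, end):
        """Return payloads for entry_id with start <= timestamp < end."""
        entries = self.fibers[self.fiber_of(entry_id)].get(entry_id, [])
        lo = bisect.bisect_left(entries, (start,))
        hi = bisect.bisect_left(entries, (end,))
        return [payload for _, payload in entries[lo:hi]]


store = FiberStore()
store.ingest("user-1", 30, "c")
store.ingest("user-1", 10, "a")
store.ingest("user-1", 20, "b")
store.ingest("user-2", 15, "x")
result = store.query("user-1", 10, 30)  # entries of user-1 in [10, 30)
```

Because a query names an ID and a time range, only one fiber is touched and the matching entries are located by binary search, which is the access pattern the abstract describes as ID-centric.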

Keywords

Real-time analytics; ID-associated data; Real-time ingestion


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Guodong Jin (1, 2)
  • Yixuan Wang (1, 2)
  • Xiongpai Qin (1, 2)
  • Yueguo Chen (1, 2)
  • Xiaoyong Du (1, 2)

  1. School of Information, Renmin University of China, Beijing, China
  2. DEKE Key Laboratory, Renmin University of China, MOE, Beijing, China
