Asia-Pacific Web Conference

Web Technologies and Applications pp 841-852

A Fast Data Ingestion and Indexing Scheme for Real-Time Log Analytics

  • Haoqiong Bian
  • Yueguo Chen
  • Xiongpai Qin
  • Xiaoyong Du
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9313)

Abstract

Structured log data is append-only time-series data that grows rapidly as new entries are continuously generated and captured. It has become very popular in application domains such as the Internet, sensor networks, and telecommunications. In recent years, many systems have been developed to support batch analysis of such structured log data, but they often fail to meet the high-throughput requirements of real-time log data ingestion and analytics. An efficient index is essential for accelerating log data analytics while still supporting high-throughput data loading. This paper focuses on designing a specialized indexing scheme for real-time log data analytics. The solution adopts a dynamic global hash index to partition the tuples into hash buckets. The tuples in the hash buckets are then sorted and buffered in a sort buffer queue. When the amount of data in the queue reaches a threshold, the data is packed into segments and spilled to disk. Moreover, an intra-segment index is maintained by a meta database. With this indexing scheme, the database system achieves high-throughput, real-time data loading and query performance. In the experiments, the data loading throughput reaches 5 million tuples per second per node, the data loading delay does not exceed 10 seconds, and sub-second query performance is achieved for the given queries.
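The pipeline outlined in the abstract (hash partitioning, sorted buffering, threshold-triggered packing into segments, and a metadata-backed index) can be illustrated with a minimal single-process sketch. This is not the authors' implementation. The class name `IngestSketch`, the fixed bucket count (the paper's hash index is dynamic), the CRC32 hash, sorting by timestamp, the in-memory lists standing in for disk segments and the meta database, and the per-segment min/max timestamp record are all assumptions made only for illustration.

```python
import zlib
from collections import defaultdict


class IngestSketch:
    """Toy model of hash-partition -> sort -> pack-into-segments ingestion."""

    def __init__(self, num_buckets=4, spill_threshold=6):
        self.num_buckets = num_buckets          # assumed fixed; the paper's index is dynamic
        self.spill_threshold = spill_threshold  # tuples buffered before a spill
        self.buckets = defaultdict(list)        # bucket id -> buffered tuples
        self.buffered = 0
        self.segments = []                      # stand-in for segments on disk
        self.meta = []                          # stand-in for the meta database

    def _bucket_of(self, key):
        # CRC32 is deterministic across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(key.encode("utf-8")) % self.num_buckets

    def ingest(self, key, ts, payload):
        self.buckets[self._bucket_of(key)].append((ts, key, payload))
        self.buffered += 1
        if self.buffered >= self.spill_threshold:
            self.flush()

    def flush(self):
        # Sort each bucket by timestamp, pack it into one segment, and
        # record the segment's bucket and timestamp range as metadata.
        for bucket_id, rows in self.buckets.items():
            if not rows:
                continue
            rows.sort(key=lambda r: r[0])
            self.segments.append(rows)
            self.meta.append({
                "segment": len(self.segments) - 1,
                "bucket": bucket_id,
                "min_ts": rows[0][0],
                "max_ts": rows[-1][0],
            })
        self.buckets = defaultdict(list)
        self.buffered = 0

    def query(self, key, ts_from, ts_to):
        # Only spilled segments are searched; tuples still buffered are not visible.
        bucket_id = self._bucket_of(key)
        hits = []
        for m in self.meta:
            if m["bucket"] != bucket_id or m["max_ts"] < ts_from or m["min_ts"] > ts_to:
                continue  # metadata lets us skip segments that cannot match
            hits.extend(r for r in self.segments[m["segment"]]
                        if r[1] == key and ts_from <= r[0] <= ts_to)
        return hits


store = IngestSketch(num_buckets=4, spill_threshold=6)
for ts, key in [(5, "a"), (1, "b"), (3, "a"), (2, "c"), (4, "b"), (0, "a")]:
    store.ingest(key, ts, payload=f"event-{ts}")
result = store.query("a", 0, 4)
print(result)  # [(0, 'a', 'event-0'), (3, 'a', 'event-3')]
```

In this sketch the sixth tuple triggers the spill, so the query reads only the segment belonging to the bucket of key `"a"`, and only if the segment's recorded timestamp range overlaps the requested range. How the real system sizes buckets, orders tuples, and lays out the intra-segment index is not specified in the abstract.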

Keywords

log data analytics, index, real-time, high throughput

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Haoqiong Bian (1)
  • Yueguo Chen (1)
  • Xiongpai Qin (1)
  • Xiaoyong Du (1)

  1. Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China, Beijing, China
