The VLDB Journal, Volume 21, Issue 3, pp 287–307

Real-time creation of bitmap indexes on streaming network data

  • Francesco Fusco
  • Michail Vlachos
  • Marc Ph. Stoecklin
Regular Paper

Abstract

High-speed archival and indexing solutions for streaming traffic are growing in importance for applications such as monitoring, forensic analysis, and auditing. Many large institutions require fast solutions to support expedient analysis of historical network data, particularly in the case of security breaches. However, “turning back the clock” is not a trivial task. The first major challenge is that such technology must support data archiving under extremely high insertion rates. Moreover, the archives created have to be stored in a compressed format that is still amenable to indexing and search. These requirements make general-purpose databases unsuitable for this task, so dedicated solutions are required. This work describes a solution for high-speed archival storage, indexing, and data querying on network flow information. We make the following two contributions: (a) we propose a novel compressed bitmap index approach that significantly reduces both CPU load and disk consumption, and (b) we introduce an online stream reordering mechanism that further reduces space requirements and improves the time for data retrieval. The reordering methodology is based on the principles of locality-sensitive hashing (LSH) and is also of interest for other bitmap creation techniques. Because of the synergy of these two components, our solution can sustain data insertion rates that reach 500,000–1 million records per second. To put these numbers into perspective, typical commercial network flow solutions can currently process 20,000–60,000 flows per second. In addition, our system offers interactive query response times that enable administrators to perform complex analysis tasks on the fly. Our technique is directly amenable to parallel execution, allowing its application in domains that are challenged by large volumes of historical measurement data, such as network auditing, traffic behavior analysis, and large-scale data visualization in service provider networks.
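
To make the interplay between the two contributions more concrete, the following is a minimal Python sketch, not the implementation described in the paper. It assumes a plain equality-encoded bitmap index with simple run-length encoding, and a simplified random-projection hash used as a stand-in for an LSH function. All names (`build_bitmap_index`, `run_length_encode`, `total_runs`, `lsh_signature`, `reorder_block`) and the toy flow records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import groupby


def build_bitmap_index(values):
    """Equality-encoded bitmap index: one bit list per distinct value."""
    index = defaultdict(lambda: [0] * len(values))
    for row, value in enumerate(values):
        index[value][row] = 1
    return dict(index)


def run_length_encode(bits):
    """Collapse a bit list into (bit, run_length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]


def total_runs(values):
    """Number of runs over all bitmaps of a column (fewer runs compress better)."""
    index = build_bitmap_index(values)
    return sum(len(run_length_encode(bits)) for bits in index.values())


def lsh_signature(record, projections, bucket_width):
    """Simplified projection hash: records that are close tend to share a signature."""
    return tuple(
        int(sum(a * x for a, x in zip(projection, record)) // bucket_width)
        for projection in projections
    )


def reorder_block(records, projections, bucket_width):
    """Reorder one buffered block of the stream so similar records become adjacent."""
    return sorted(records, key=lambda r: lsh_signature(r, projections, bucket_width))


# Hypothetical (source id, destination port) flow records arriving interleaved.
flows = [(1, 80), (2, 443), (1, 80), (2, 443)]
# Hand-picked projection so the demo is reproducible; a real LSH scheme
# would draw the projection vectors at random.
axis_aligned = [(1.0, 0.0)]

reordered = reorder_block(flows, axis_aligned, bucket_width=1.0)
before = total_runs([src for src, _ in flows])      # 8 runs
after = total_runs([src for src, _ in reordered])   # 4 runs
```

In this toy block, grouping similar records halves the number of runs in the source-column bitmaps, which is the effect that makes run-length-style bitmap compression more effective. Reordering only within a bounded block is what keeps such a scheme usable on a stream, since records never have to wait for a global sort.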

Keywords

  • Bitmap index
  • Locality sensitive hashing
  • Data stream
  • Data archive

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Francesco Fusco (1)
  • Michail Vlachos (1)
  • Marc Ph. Stoecklin (2)
  1. IBM Research - Zurich, Rüschlikon, Switzerland
  2. IBM Research - T. J. Watson Research Center, Hawthorne, USA
