Optimized Data Placement for Column-Oriented Data Store in the Distributed Environment
Column-oriented data storage has become popular for its efficient access to massive data sets, its high compression ratio on individual columns, and similar advantages. However, these advantages do not hold automatically. Compared with the improvements in other computer components over the past four decades, the seek time and bandwidth of hard disk drives (HDDs) have increasingly become the bottleneck for massive data processing. In this paper, we propose a novel data placement strategy for massive data analysis (i.e., read-optimized workloads) based on Gray code, which greatly increases the ratio of sequential access for diverse query types (e.g., range queries, partial match range queries, and aggregation queries). A centralized/distributed structured index is employed in widely deployed distributed file systems (e.g., GFS), providing convenient management, efficient access, and high extendibility. We provide a detailed theoretical analysis of index extendibility, sequential access improvement, and storage capacity usage for the proposed data placement strategies, together with the specific algorithms. Our extensive experimental studies confirm the efficiency and effectiveness of the proposed data placement methods.
Keywords: Range Query, Gray Code, Data Placement, Index Code, Aggregation Query
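The abstract does not spell out the placement algorithm, so the following is only a minimal sketch of the standard reflected binary Gray code that the strategy builds on, not the paper's method. The helper names (`gray_encode`, `gray_decode`, `gray_order`, `count_runs`) and the 3-bit index codes are assumptions made for illustration. The sketch shows the property that matters for sequential access: consecutive codewords differ in one bit, so the blocks matching a partial match query tend to form fewer contiguous runs than under plain binary order.

```python
def gray_encode(n: int) -> int:
    """Return the reflected binary Gray codeword at rank n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n ^ (n >> 1)


def gray_decode(g: int) -> int:
    """Invert gray_encode: recover the rank of codeword g."""
    if g < 0:
        raise ValueError("g must be non-negative")
    n = 0
    while g:
        n ^= g      # fold in each successively shifted copy of g
        g >>= 1
    return n


def gray_order(bits: int) -> list:
    """All codewords of the given width, listed in Gray-code order."""
    return [gray_encode(rank) for rank in range(1 << bits)]


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def count_runs(order, matches) -> int:
    """Count contiguous runs of matching codes in a placement order.

    Each run stands for one sequential read; each extra run implies a seek.
    """
    runs = 0
    previous = False
    for code in order:
        current = bool(matches(code))
        if current and not previous:
            runs += 1
        previous = current
    return runs


if __name__ == "__main__":
    # Hypothetical 3-bit index codes, one code per stored block.
    binary_layout = list(range(8))
    gray_layout = gray_order(3)

    # Hypothetical partial match query: the middle bit must be 1.
    def middle_bit_set(code):
        return code & 0b010

    print("gray layout:", gray_layout)
    print("runs, binary layout:", count_runs(binary_layout, middle_bit_set))
    print("runs, gray layout:", count_runs(gray_layout, middle_bit_set))
```

In this toy setting the query touches two separate runs under binary order but a single run under Gray order. How index codes are assigned to column values and mapped onto file system blocks is left to the paper itself.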