Abstract
The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of easily parallelized primitives has made batch-oriented, highly parallelizable cluster frameworks increasingly popular for supporting data analysis services. These frameworks, however, are known to have shortcomings in certain application domains. For example, in many data analysis applications, the utility of a given data element to a particular analysis task depends on how the data is collected (e.g., its precision) or interpreted. Because existing batch-oriented data processing frameworks do not consider variations in data utility, they cannot focus on the best results. Even if the user is interested in only a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, including low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how RanKloud adaptively samples data from “upstream” operators to help allocate resources in a manner that balances work and avoids wasted work during top-k join processing. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable RanKloud to return top-k results with high confidence and low overhead (up to ~9× faster than alternative schemes on 10 servers).
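The idea of using samples to avoid wasted work in a top-k join can be illustrated with a small single-machine sketch. This is not the paper's algorithm: the function names (`hash_join`, `estimate_cutoff`, `topk_join`), the product scoring function, the Bernoulli sampling rate, and the `slack` factor are all assumptions chosen for illustration. Each tuple is a `(join_key, utility)` pair with a non-negative utility. A sample join yields a conservative score cutoff, tuples that cannot reach the cutoff are pruned before the join, and a full join is used as a fallback when pruning was too aggressive.

```python
import heapq
import math
import random
from collections import defaultdict


def hash_join(left, right):
    """Equi-join on the key; yields (score, key, left_utility, right_utility).

    The combined score is the product of the two utilities (an assumed,
    monotone scoring function).
    """
    index = defaultdict(list)
    for key, util in right:
        index[key].append(util)
    for key, lu in left:
        for ru in index.get(key, ()):
            yield (lu * ru, key, lu, ru)


def estimate_cutoff(left, right, k, rate, rng, slack=0.5):
    """Estimate a score below which results are unlikely to be in the top k."""
    left_sample = [t for t in left if rng.random() < rate]
    right_sample = [t for t in right if rng.random() < rate]
    scores = sorted((r[0] for r in hash_join(left_sample, right_sample)),
                    reverse=True)
    if not scores:
        return 0.0  # nothing sampled: do not prune at all
    # A pair survives sampling with probability rate * rate, so the k-th best
    # full result corresponds roughly to rank k * rate^2 in the sample join.
    rank = min(len(scores), max(1, math.ceil(k * rate * rate)))
    return slack * scores[rank - 1]  # slack < 1 keeps the cutoff conservative


def topk_join(left, right, k, rate=0.2, seed=0):
    """Return the k highest-scoring join results, pruning low-utility input."""
    rng = random.Random(seed)
    cutoff = estimate_cutoff(left, right, k, rate, rng)
    max_l = max((u for _, u in left), default=0.0)
    max_r = max((u for _, u in right), default=0.0)
    # A tuple is dropped only if even its best possible partner cannot
    # lift the pair's score to the cutoff.
    left_kept = [(key, u) for key, u in left if u * max_r >= cutoff]
    right_kept = [(key, u) for key, u in right if u * max_l >= cutoff]
    results = [r for r in hash_join(left_kept, right_kept) if r[0] >= cutoff]
    if len(results) < k:
        # The cutoff was too high; fall back to the full join.
        results = list(hash_join(left, right))
    return heapq.nlargest(k, results)


if __name__ == "__main__":
    data_rng = random.Random(42)
    left = [(data_rng.randrange(10), data_rng.random()) for _ in range(100)]
    right = [(data_rng.randrange(10), data_rng.random()) for _ in range(100)]
    for result in topk_join(left, right, k=3):
        print(result)
```

In this sketch the answer stays exact because every pair scoring at or above the cutoff survives pruning, and the fallback covers the case where fewer than k such pairs exist. The sample only affects how much work is saved. A system that returns results "with high confidence", as the abstract describes, would instead trade some of that exactness for lower overhead.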
This work is partially funded by an HP Labs Innovation Research Program Grant, “Data-Quality Aware Middleware for Scalable Data Analysis”.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yu, R., Nagendra, M., Nagarkar, P., Candan, K.S., Kim, J.W. (2011). Data-Utility Sensitive Query Processing on Server Clusters to Support Scalable Data Analysis Services. In: Agrawal, D., Candan, K.S., Li, WS. (eds) New Frontiers in Information and Software as Services. Lecture Notes in Business Information Processing, vol 74. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19294-4_7
DOI: https://doi.org/10.1007/978-3-642-19294-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19293-7
Online ISBN: 978-3-642-19294-4
eBook Packages: Computer Science, Computer Science (R0)