Data-Utility Sensitive Query Processing on Server Clusters to Support Scalable Data Analysis Services

Yu, Renwei; Nagendra, Mithila; Nagarkar, Parth; Candan, K. Selçuk; Kim, Jong Wook

doi:10.1007/978-3-642-19294-4_7

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 74)

Abstract

The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of primitives that are easy to parallelize has resulted in increasing popularity of batch-oriented, highly parallelizable cluster frameworks to support data analysis services. These frameworks, however, are known to have shortcomings for certain application domains. For example, in many data analysis applications, the utility of a given data element to the particular analysis task depends on the way the data is collected (e.g., its precision) or interpreted. However, since existing batch-oriented data processing frameworks do not consider variations in data utility, they are not able to focus on the best results. Even if the user is interested in obtaining only a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, including low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how RanKloud adaptively samples data from “upstream” operators to help allocate resources in a work-balanced and wasted-work-avoiding manner for top-k join processing. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable RanKloud to return top-k results with high confidence and low overhead (up to ~9× faster than alternative schemes on 10 servers).
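The idea of using a sample to avoid wasted work in a top-k join can be illustrated with a minimal Python sketch. This is not the chapter's algorithm, and it runs on one machine rather than a cluster. It assumes two in-memory lists of (key, utility) pairs, utilities in [0, 1], and a product scoring function; the names `join_scores` and `top_k_join` and the `sample_rate` parameter are hypothetical. Unlike a confidence-based scheme, this variant only prunes tuples that provably cannot appear in the top k.

```python
import heapq
import random
from collections import defaultdict


def join_scores(left, right):
    """Hash-join two lists of (key, utility) pairs; score = product of utilities."""
    by_key = defaultdict(list)
    for key, util in right:
        by_key[key].append(util)
    return [(lu * ru, key) for key, lu in left for ru in by_key.get(key, ())]


def top_k_join(left, right, k, sample_rate=0.2, seed=0):
    """Return the k best (score, key) join results, pruning with a sample-based bound."""
    if k <= 0:
        return []
    rng = random.Random(seed)
    sample_l = [t for t in left if rng.random() < sample_rate]
    sample_r = [t for t in right if rng.random() < sample_rate]
    sample_top = heapq.nlargest(k, join_scores(sample_l, sample_r))

    # Sample results are real join results, so if the sample yields k of them,
    # its k-th best score is a safe lower bound on the true k-th best score.
    threshold = sample_top[-1][0] if len(sample_top) == k else 0.0

    # Utilities lie in [0, 1], so a tuple whose utility is below the threshold
    # cannot take part in any result scoring at or above the threshold.
    kept_l = [t for t in left if t[1] >= threshold]
    kept_r = [t for t in right if t[1] >= threshold]
    return heapq.nlargest(k, join_scores(kept_l, kept_r))
```

Calling `top_k_join(left, right, 10)` returns the same list as sorting the full join and keeping the ten best entries, but the final join only touches tuples that survive the threshold. When the sample produces fewer than k results, the threshold falls back to 0 and nothing is pruned.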

This work is partially funded by an HP Labs Innovation Research Program Grant “Data-Quality Aware Middleware for Scalable Data Analysis”.

Author information

Authors and Affiliations

CIDSE, Arizona State University, Tempe, AZ, 85287, USA
Renwei Yu, Mithila Nagendra, Parth Nagarkar, K. Selçuk Candan & Jong Wook Kim

Editor information

Editors and Affiliations

Department of Computer Science, University of California at Santa Barbara, 93106, Santa Barbara, CA, USA
Divyakant Agrawal
Computer Science and Engineering Department, Arizona State University, 85287-8809, Tempe, AZ, USA
K. Selçuk Candan
SAP China, 201203, Shanghai, China
Wen-Syan Li

About this paper

Cite this paper

Yu, R., Nagendra, M., Nagarkar, P., Candan, K.S., Kim, J.W. (2011). Data-Utility Sensitive Query Processing on Server Clusters to Support Scalable Data Analysis Services. In: Agrawal, D., Candan, K.S., Li, WS. (eds) New Frontiers in Information and Software as Services. Lecture Notes in Business Information Processing, vol 74. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19294-4_7

DOI: https://doi.org/10.1007/978-3-642-19294-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19293-7
Online ISBN: 978-3-642-19294-4
eBook Packages: Computer Science, Computer Science (R0)
