Data-Utility Sensitive Query Processing on Server Clusters to Support Scalable Data Analysis Services

  • Conference paper
New Frontiers in Information and Software as Services

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 74)

Abstract

A significant class of data processing and analysis applications can be expressed in terms of a small set of primitives that are easy to parallelize. This observation has made batch-oriented, highly parallelizable cluster frameworks increasingly popular for supporting data analysis services. These frameworks, however, are known to have shortcomings in certain application domains. For example, in many data analysis applications, the utility of a given data element to the particular analysis task depends on the way the data is collected (e.g., its precision) or interpreted. Because existing batch-oriented data processing frameworks do not consider variations in data utility, they cannot focus on the best results. Even if the user is interested in only a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, including low-utility results. RanKloud is an efficient and scalable utility-aware parallel processing system for ranked query processing over large data sets. In this paper, we focus on the data partitioning and work-allocation strategies of RanKloud for processing top-k join queries to support data analysis services. In particular, we describe how RanKloud adaptively samples data from “upstream” operators to help allocate resources for top-k join processing in a way that balances work and avoids wasted work. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable RanKloud to return top-k results with high confidence and low overhead (up to ~9× faster than alternative schemes on 10 servers).
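To make the idea of a utility-aware top-k join concrete, the following is a minimal single-process sketch. It is not the system described in the paper, and it does not reproduce the paper's partitioning or work allocation across servers. It assumes that each tuple is a `(key, score)` pair with a score in [0, 1] and that the utility of a joined pair is the product of the two scores; these assumptions, and all function names, are hypothetical and chosen only for illustration. A sample of the inputs is used to estimate a utility cutoff, tuples below the cutoff are pruned before joining, and the join is redone without pruning if the estimate turns out to be too aggressive.

```python
import heapq
import random


def topk_join(left, right, k):
    """Exact top-k equi-join: enumerate every joined pair, keep the k best."""
    by_key = {}
    for key, score in right:
        by_key.setdefault(key, []).append(score)
    # Assumed utility of a joined pair: product of the two input scores.
    results = ((ls * rs, key) for key, ls in left for rs in by_key.get(key, ()))
    return heapq.nlargest(k, results)


def sampled_cutoff(left, right, k, rate, rng):
    """Estimate the utility of the k-th best result from a sample of both inputs."""
    sample_left = [t for t in left if rng.random() < rate]
    sample_right = [t for t in right if rng.random() < rate]
    # A joined pair survives sampling with probability rate**2, so the k-th
    # best full result corresponds roughly to rank k * rate**2 in the sample.
    rank = max(1, int(k * rate * rate))
    top = topk_join(sample_left, sample_right, rank)
    # Too few sampled results: return 0.0, which disables pruning.
    return top[-1][0] if len(top) == rank else 0.0


def pruned_topk_join(left, right, k, cutoff):
    """Drop tuples that cannot reach the cutoff, then join what remains."""
    # With scores in [0, 1], a product is never larger than either factor, so
    # a tuple scoring below the cutoff cannot produce a result at or above it.
    kept_left = [t for t in left if t[1] >= cutoff]
    kept_right = [t for t in right if t[1] >= cutoff]
    return topk_join(kept_left, kept_right, k)


def topk_with_sampling(left, right, k, rate=0.2, seed=0):
    """Top-k join that prunes with a sampled cutoff and falls back if needed."""
    rng = random.Random(seed)
    cutoff = sampled_cutoff(left, right, k, rate, rng)
    result = pruned_topk_join(left, right, k, cutoff)
    if len(result) < k or result[-1][0] < cutoff:
        # The cutoff was too high to certify k results; join without pruning.
        result = topk_join(left, right, k)
    return result
```

In this sketch the fallback keeps the answer exact: whenever the pruned join yields k results at or above the cutoff, no pruned tuple could have produced a better one. A distributed system would presumably use such estimates to decide how to partition and allocate the join work rather than only to filter the inputs.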

This work is partially funded by an HP Labs Innovation Research Program Grant “Data-Quality Aware Middleware for Scalable Data Analysis”.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yu, R., Nagendra, M., Nagarkar, P., Candan, K.S., Kim, J.W. (2011). Data-Utility Sensitive Query Processing on Server Clusters to Support Scalable Data Analysis Services. In: Agrawal, D., Candan, K.S., Li, WS. (eds) New Frontiers in Information and Software as Services. Lecture Notes in Business Information Processing, vol 74. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19294-4_7

  • DOI: https://doi.org/10.1007/978-3-642-19294-4_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19293-7

  • Online ISBN: 978-3-642-19294-4

  • eBook Packages: Computer Science, Computer Science (R0)