Distributed and Parallel Databases, Volume 29, Issue 3, pp 185–216

ASTERIX: towards a scalable, semistructured data platform for evolving-world models

  • Alexander Behm
  • Vinayak R. Borkar
  • Michael J. Carey
  • Raman Grover
  • Chen Li
  • Nicola Onose
  • Rares Vernica
  • Alin Deutsch
  • Yannis Papakonstantinou
  • Vassilis J. Tsotras

Abstract

ASTERIX is a new data-intensive storage and computing platform project spanning UC Irvine, UC Riverside, and UC San Diego. In this paper we provide an overview of the ASTERIX project, starting with its main goal—the storage and analysis of data pertaining to evolving-world models. We describe the requirements and associated challenges, and explain how the project is addressing them. We provide a technical overview of ASTERIX, covering its architecture, its user model for data and queries, and its approach to scalable query processing and data management. ASTERIX utilizes a new scalable runtime computational platform called Hyracks that is also discussed at an overview level; we have recently made Hyracks available in open source for use by other interested parties. We also relate our work on ASTERIX to the current state of the art and describe the research challenges that we are currently tackling as well as those that lie ahead.

Keywords

Data-intensive computing; Cloud computing; Semistructured data; ASTERIX; Hyracks

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Alexander Behm (1)
  • Vinayak R. Borkar (1)
  • Michael J. Carey (1)
  • Raman Grover (1)
  • Chen Li (1)
  • Nicola Onose (1)
  • Rares Vernica (1)
  • Alin Deutsch (2)
  • Yannis Papakonstantinou (2)
  • Vassilis J. Tsotras (3)

  1. University of California, Irvine, USA
  2. University of California, San Diego, USA
  3. University of California, Riverside, USA
