ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Distributed and Parallel Databases

Abstract

ASTERIX is a new data-intensive storage and computing platform project spanning UC Irvine, UC Riverside, and UC San Diego. In this paper we provide an overview of the ASTERIX project, starting with its main goal—the storage and analysis of data pertaining to evolving-world models. We describe the requirements and associated challenges, and explain how the project is addressing them. We provide a technical overview of ASTERIX, covering its architecture, its user model for data and queries, and its approach to scalable query processing and data management. ASTERIX utilizes a new scalable runtime computational platform called Hyracks that is also discussed at an overview level; we have recently made Hyracks available in open source for use by other interested parties. We also relate our work on ASTERIX to the current state of the art and describe the research challenges that we are currently tackling as well as those that lie ahead.

Author information

Corresponding author

Correspondence to Michael J. Carey.

Additional information

Communicated by: Brian Frank Cooper.

About this article

Cite this article

Behm, A., Borkar, V.R., Carey, M.J. et al. ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distrib Parallel Databases 29, 185–216 (2011). https://doi.org/10.1007/s10619-011-7082-y

  • DOI: https://doi.org/10.1007/s10619-011-7082-y
