ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Behm, Alexander; Borkar, Vinayak R.; Carey, Michael J.; Grover, Raman; Li, Chen; Onose, Nicola; Vernica, Rares; Deutsch, Alin; Papakonstantinou, Yannis; Tsotras, Vassilis J.

doi:10.1007/s10619-011-7082-y

Published: 31 March 2011

Volume 29, pages 185–216, (2011)

Distributed and Parallel Databases

Abstract

ASTERIX is a new data-intensive storage and computing platform project spanning UC Irvine, UC Riverside, and UC San Diego. In this paper we provide an overview of the ASTERIX project, starting with its main goal—the storage and analysis of data pertaining to evolving-world models. We describe the requirements and associated challenges, and explain how the project is addressing them. We provide a technical overview of ASTERIX, covering its architecture, its user model for data and queries, and its approach to scalable query processing and data management. ASTERIX utilizes a new scalable runtime computational platform called Hyracks that is also discussed at an overview level; we have recently made Hyracks available in open source for use by other interested parties. We also relate our work on ASTERIX to the current state of the art and describe the research challenges that we are currently tackling as well as those that lie ahead.

Author information

Authors and Affiliations

University of California, Irvine, USA
Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Nicola Onose & Rares Vernica
University of California, San Diego, USA
Alin Deutsch & Yannis Papakonstantinou
University of California, Riverside, USA
Vassilis J. Tsotras

Corresponding author

Correspondence to Michael J. Carey.

Additional information

Communicated by: Brian Frank Cooper.

About this article

Cite this article

Behm, A., Borkar, V.R., Carey, M.J. et al. ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distrib Parallel Databases 29, 185–216 (2011). https://doi.org/10.1007/s10619-011-7082-y

Issue Date: June 2011