Decorating the cloud: enabling annotation management in MapReduce
Abstract
Data curation and annotation are indispensable mechanisms for capturing various types of metadata in a wide range of applications. This metadata increases the data's credibility and merit, allows end users and applications to make more informed decisions, and enables advanced processing over the data that would not otherwise be feasible. For these reasons, annotation management has been extensively studied in the context of scientific repositories, web documents, and relational database systems. In this paper, we make the case that cloud-based applications that rely on the emerging Hadoop infrastructure also need data curation and annotation, and that the presence of such mechanisms in Hadoop would bring value-added capabilities to these applications. We propose the “CloudNotes” system, a full-fledged MapReduce-based annotation management engine. CloudNotes addresses several new challenges to annotation management, including: (1) scalable and distributed processing of annotations over large clusters, (2) propagation of annotations under MapReduce's black-box execution model, and (3) annotation-driven optimizations such as proactive prefetching and colocation of annotations, annotation-aware task scheduling, novel shared execution strategies among annotation jobs, and concurrency control mechanisms for annotation management. These challenges have not been addressed or explored before by state-of-the-art technologies. CloudNotes is built on top of the open-source Hadoop/HDFS infrastructure and is experimentally evaluated to demonstrate the practicality and scalability of its features and the effectiveness of its optimizations under large workloads.
Keywords
Distributed annotation management, MapReduce, Cloud-based annotations