Abstract
Data curation and annotation are indispensable mechanisms in a wide range of applications for capturing various types of metadata. This metadata not only increases the data’s credibility and merit and allows end users and applications to make more informed decisions, but also enables advanced processing over the data that would otherwise be infeasible. That is why annotation management has been extensively studied in the context of scientific repositories, web documents, and relational database systems. In this paper, we make the case that cloud-based applications that rely on the emerging Hadoop infrastructure are also in need of data curation and annotation, and that the presence of such mechanisms in Hadoop would bring value-added capabilities to these applications. We propose the “CloudNotes” system, a full-fledged MapReduce-based annotation management engine. CloudNotes addresses several new challenges to annotation management, including: (1) scalable and distributed processing of annotations over large clusters, (2) propagation of annotations under MapReduce’s black-box execution model, and (3) annotation-driven optimizations, including proactive prefetching and colocation of annotations, annotation-aware task scheduling, novel shared execution strategies among annotation jobs, and concurrency control mechanisms for annotation management. These challenges have not been addressed or explored before by state-of-the-art technologies. CloudNotes is built on top of the open-source Hadoop/HDFS infrastructure and is experimentally evaluated to demonstrate the practicality and scalability of its features and the effectiveness of its optimizations under large workloads.
Notes
The assumption of a single InputFormat for a given dataset can be easily relaxed in CloudNotes by extending the OID object to maintain both the start and end offsets of the formed records. Annotations are then attached to “byte segments” within the file and can propagate along with these byte segments independently of the InputFormat in use.
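A minimal sketch of the byte-segment idea, written in Python for brevity rather than in the Java that Hadoop uses. The names `ByteSegmentOID` and `AnnotationStore`, the half-open range convention, and the overlap rule for matching annotations to records are all assumptions made for illustration; the note above only states that the OID would carry start and end offsets.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ByteSegmentOID:
    """Hypothetical OID: a file path plus a half-open byte range [start, end)."""
    path: str
    start: int
    end: int

    def overlaps(self, other: "ByteSegmentOID") -> bool:
        # Two segments overlap if they are in the same file and their
        # half-open ranges share at least one byte.
        return (self.path == other.path
                and self.start < other.end
                and other.start < self.end)


class AnnotationStore:
    """Hypothetical in-memory store that keys annotations by byte segment."""

    def __init__(self) -> None:
        self._by_segment: Dict[ByteSegmentOID, List[str]] = {}

    def annotate(self, oid: ByteSegmentOID, text: str) -> None:
        self._by_segment.setdefault(oid, []).append(text)

    def annotations_for(self, record: ByteSegmentOID) -> List[str]:
        # Assumed rule: a record inherits every annotation whose segment
        # overlaps the record's bytes, however the record was formed.
        found: List[str] = []
        for segment, notes in self._by_segment.items():
            if segment.overlaps(record):
                found.extend(notes)
        return found


store = AnnotationStore()
# Annotation attached to bytes 100..179 of the file.
store.annotate(ByteSegmentOID("/data/log.txt", 100, 180), "verified")

# A record formed by a different (hypothetical) input format, spanning 150..399.
wide_record = ByteSegmentOID("/data/log.txt", 150, 400)
# A record that starts exactly where the annotated segment ends.
next_record = ByteSegmentOID("/data/log.txt", 180, 200)

print(store.annotations_for(wide_record))
print(store.annotations_for(next_record))
```

Because lookup depends only on byte ranges, the same annotation reaches a record whether the file is split by lines, fixed-size blocks, or any other scheme. A linear scan is used here for clarity; a real store would presumably index the segments.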
The OIDs passed to a reduce task are implemented using the same Iterator mechanism currently used for passing the values. Therefore, if the inputs are too large to fit in memory, they are streamed from disk as needed in the same standard way.
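The streaming behavior described in this note can be illustrated with a lazy iterator. The following is only a Python analogy, not the Hadoop code path: the spill-file layout (one OID per line), the OID strings, and the names `stream_oids`, `reduce_sum`, and `demo` are hypothetical.

```python
import os
import tempfile
from typing import Iterator, Tuple


def stream_oids(spill_path: str) -> Iterator[str]:
    """Yield OIDs one at a time from disk; the full list is never held in memory."""
    with open(spill_path, "r", encoding="utf-8") as spill:
        for line in spill:
            yield line.rstrip("\n")


def reduce_sum(key: str, values: Iterator[int], oids: Iterator[str]) -> Tuple[str, int, int]:
    """Toy reduce function: sums the values and counts the OIDs it was handed."""
    total = 0
    for value in values:
        total += value
    oid_count = 0
    for _ in oids:  # each OID is read from disk only when requested
        oid_count += 1
    return key, total, oid_count


def demo() -> Tuple[str, int, int]:
    with tempfile.TemporaryDirectory() as tmp:
        spill_path = os.path.join(tmp, "oids.spill")
        # Made-up OID strings standing in for spilled reduce-side input.
        with open(spill_path, "w", encoding="utf-8") as spill:
            spill.write("file1:0\nfile1:64\nfile2:128\n")
        return reduce_sum("k", iter([1, 2, 3]), stream_oids(spill_path))


print(demo())
```

The reduce function sees the values and the OIDs through the same iterator interface, so it does not need to know whether either sequence is backed by memory or by a file on disk.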
Cite this article
Lu, Y., Li, Y. & Eltabakh, M.Y. Decorating the cloud: enabling annotation management in MapReduce. The VLDB Journal 25, 399–424 (2016). https://doi.org/10.1007/s00778-016-0422-9
DOI: https://doi.org/10.1007/s00778-016-0422-9