Decorating the cloud: enabling annotation management in MapReduce

Lu, Yue; Li, Yuguan; Eltabakh, Mohamed Y.

doi:10.1007/s00778-016-0422-9

Regular Paper
Published: 30 January 2016

Volume 25, pages 399–424, (2016)

The VLDB Journal

Abstract

Data curation and annotation are indispensable mechanisms in a wide range of applications for capturing various types of metadata. This metadata not only increases the data’s credibility and merit and allows end users and applications to make more informed decisions, but also enables advanced processing over the data that is not otherwise feasible. That is why annotation management has been extensively studied in the context of scientific repositories, web documents, and relational database systems. In this paper, we make the case that cloud-based applications that rely on the emerging Hadoop infrastructure are also in need of data curation and annotation, and that the presence of such mechanisms in Hadoop would bring value-added capabilities to these applications. We propose the “CloudNotes” system, a full-fledged MapReduce-based annotation management engine. CloudNotes addresses several new challenges to annotation management, including: (1) scalable and distributed processing of annotations over large clusters, (2) propagation of annotations under MapReduce’s black-box execution model, and (3) annotation-driven optimizations, including proactive prefetching and colocation of annotations, annotation-aware task scheduling, novel shared execution strategies among the annotation jobs, and concurrency control mechanisms for annotation management. These challenges have not been addressed or explored before by state-of-the-art technologies. CloudNotes is built on top of the open-source Hadoop/HDFS infrastructure and is experimentally evaluated to demonstrate the practicality and scalability of its features and the effectiveness of its optimizations under large workloads.

Notes

The assumption of a single InputFormat for a given dataset can be easily relaxed in CloudNotes by extending the OID object to maintain both the start and end offsets of the formed records. Annotations are then attached to “byte segments” within the file and can propagate along with these byte segments independently of the InputFormat in use.
The OIDs passed to a reduce task are implemented using the same Iterator mechanism currently used for passing the values. Therefore, if the inputs are too large to fit in memory, they are streamed from disk as needed in the same standard way.
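
The byte-segment idea in the first note can be illustrated with a small sketch. It is not the CloudNotes implementation: the names SegmentOID, AnnotationStore, line_records and fixed_width_records are hypothetical, the store is a toy in-memory list, and plain Python stands in for Hadoop. The sketch assumes that an annotation applies to a record whenever their byte ranges overlap, which is one possible reading of attaching annotations to byte segments. The record readers are generators, loosely echoing the iterator-style streaming described in the second note.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class SegmentOID:
    """Hypothetical object identifier covering bytes [start, end) of a file."""
    path: str
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset

    def overlaps(self, other: "SegmentOID") -> bool:
        # Two segments overlap if they are in the same file and share a byte.
        return (self.path == other.path
                and self.start < other.end
                and other.start < self.end)


class AnnotationStore:
    """Toy in-memory store mapping byte segments to annotation strings."""

    def __init__(self) -> None:
        self._entries: List[Tuple[SegmentOID, str]] = []

    def annotate(self, oid: SegmentOID, text: str) -> None:
        self._entries.append((oid, text))

    def lookup(self, record_oid: SegmentOID) -> List[str]:
        # Assumption: an annotation propagates to any overlapping record.
        return [text for oid, text in self._entries
                if oid.overlaps(record_oid)]


def line_records(path: str, data: bytes) -> Iterator[Tuple[SegmentOID, bytes]]:
    """Line-oriented reader: lazily yields (OID, record) pairs."""
    start = 0
    for line in data.splitlines(keepends=True):
        end = start + len(line)
        yield SegmentOID(path, start, end), line.rstrip(b"\n")
        start = end


def fixed_width_records(path: str, data: bytes,
                        width: int) -> Iterator[Tuple[SegmentOID, bytes]]:
    """Fixed-width reader over the same bytes, with different boundaries."""
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        yield SegmentOID(path, start, start + len(chunk)), chunk


if __name__ == "__main__":
    data = b"alpha\nbeta\ngamma\n"
    store = AnnotationStore()
    # Annotate the bytes holding "beta\n", independent of any record format.
    store.annotate(SegmentOID("f", 6, 11), "verified")
    for oid, record in line_records("f", data):
        print(record, store.lookup(oid))
    for oid, record in fixed_width_records("f", data, 8):
        print(record, store.lookup(oid))
```

Because the annotation is keyed by byte offsets rather than by a record number, both readers find it without agreeing on where records begin and end. A real system would need an index over segments instead of the linear scan used here.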

Author information

Authors and Affiliations

Computer Science Department, Worcester Polytechnic Institute (WPI), 100 Institute Rd., Worcester, MA, 01609, USA
Yue Lu, Yuguan Li & Mohamed Y. Eltabakh

Corresponding author

Correspondence to Mohamed Y. Eltabakh.

About this article

Cite this article

Lu, Y., Li, Y. & Eltabakh, M.Y. Decorating the cloud: enabling annotation management in MapReduce. The VLDB Journal 25, 399–424 (2016). https://doi.org/10.1007/s00778-016-0422-9

Received: 28 January 2015
Revised: 15 December 2015
Accepted: 12 January 2016
Published: 30 January 2016
Issue Date: June 2016
DOI: https://doi.org/10.1007/s00778-016-0422-9
