Decorating the cloud: enabling annotation management in MapReduce

Abstract

Data curation and annotation are indispensable mechanisms in a wide range of applications for capturing various types of metadata. This metadata not only increases the data’s credibility and merit and allows end users and applications to make more informed decisions, but also enables advanced processing over the data that is not feasible otherwise. That is why annotation management has been extensively studied in the context of scientific repositories, web documents, and relational database systems. In this paper, we make the case that cloud-based applications that rely on the emerging Hadoop infrastructure are also in need of data curation and annotation, and that the presence of such mechanisms in Hadoop would bring value-added capabilities to these applications. We propose the “CloudNotes” system, a full-fledged MapReduce-based annotation management engine. CloudNotes addresses several new challenges to annotation management, including: (1) scalable and distributed processing of annotations over large clusters, (2) propagation of annotations under MapReduce’s black-box execution model, and (3) annotation-driven optimizations, including proactive prefetching and colocation of annotations, annotation-aware task scheduling, novel shared execution strategies among the annotation jobs, and concurrency control mechanisms for annotation management. These challenges have not been addressed or explored before by state-of-the-art technologies. CloudNotes is built on top of the open-source Hadoop/HDFS infrastructure and is experimentally evaluated to demonstrate the practicality and scalability of its features and the effectiveness of its optimizations under large workloads.

Notes

  1. The assumption of a single InputFormat for a given dataset can be easily relaxed in CloudNotes by extending the OID object to maintain both the start and end offsets of the formed records. As such, annotations will be attached to “byte segments” within the file. Annotations can then propagate along with these byte segments independently of the InputFormat used.

  2. The OIDs passed to a reduce task are implemented using the same Iterator mechanism currently used for passing the values. Therefore, if the inputs are too large to fit in memory, they are streamed from disk as needed in the same standard way.
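
The byte-segment idea in the first note can be illustrated with a minimal sketch. It is not the CloudNotes implementation: the names `ByteSegment`, `Annotation`, and `annotations_for` are hypothetical, and the rule that a record inherits every annotation whose segment overlaps its bytes is an assumption made here for illustration (the notes do not say how partially covered segments are treated). The lookup is written as a generator so that annotations are produced lazily, loosely mirroring the iterator-style streaming mentioned in the second note.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ByteSegment:
    """Half-open byte range [start, end) within one file (hypothetical type)."""
    start: int
    end: int

    def overlaps(self, other: "ByteSegment") -> bool:
        # Two half-open ranges overlap when each starts before the other ends.
        return self.start < other.end and other.start < self.end


@dataclass(frozen=True)
class Annotation:
    """An annotation attached to a byte segment rather than to a record."""
    segment: ByteSegment
    text: str


def annotations_for(record: ByteSegment,
                    store: Iterable[Annotation]) -> Iterator[Annotation]:
    """Lazily yield annotations whose segment overlaps the record's bytes.

    Assumption: any overlap is enough for an annotation to propagate.
    """
    for ann in store:
        if ann.segment.overlaps(record):
            yield ann


# Annotations stored against byte ranges of a single file.
store = [
    Annotation(ByteSegment(0, 40), "verified by curator"),
    Annotation(ByteSegment(100, 120), "suspicious value"),
]

# One reader might form a record covering bytes [0, 50) ...
line_record = ByteSegment(0, 50)
# ... while a different reader might form a record covering bytes [30, 110).
wide_record = ByteSegment(30, 110)

line_hits = [a.text for a in annotations_for(line_record, store)]
wide_hits = [a.text for a in annotations_for(wide_record, store)]
```

Because the annotations are keyed by byte offsets, the same store serves both record layouts without knowing how either reader splits the file.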

Author information

Corresponding author

Correspondence to Mohamed Y. Eltabakh.

About this article

Cite this article

Lu, Y., Li, Y. & Eltabakh, M.Y. Decorating the cloud: enabling annotation management in MapReduce. The VLDB Journal 25, 399–424 (2016). https://doi.org/10.1007/s00778-016-0422-9

Keywords

  • Distributed annotation management
  • MapReduce
  • Cloud-based annotations