The VLDB Journal, Volume 25, Issue 3, pp 399–424

Decorating the cloud: enabling annotation management in MapReduce

Regular Paper


Data curation and annotation are indispensable mechanisms in a wide range of applications for capturing various types of metadata. This metadata not only increases the data’s credibility and merit and allows end users and applications to make more informed decisions, but also enables advanced processing over the data that is not otherwise feasible. That is why annotation management has been extensively studied in the context of scientific repositories, web documents, and relational database systems. In this paper, we make the case that cloud-based applications that rely on the emerging Hadoop infrastructure are also in need of data curation and annotation, and that the presence of such mechanisms in Hadoop would bring value-added capabilities to these applications. We propose the “CloudNotes” system, a full-fledged MapReduce-based annotation management engine. CloudNotes addresses several new challenges to annotation management, including: (1) scalable and distributed processing of annotations over large clusters, (2) propagation of annotations under MapReduce’s black-box execution model, and (3) annotation-driven optimizations, including proactive prefetching and colocation of annotations, annotation-aware task scheduling, novel shared execution strategies among the annotation jobs, and concurrency control mechanisms for annotation management. These challenges have not been addressed or explored before by state-of-the-art technologies. CloudNotes is built on top of the open-source Hadoop/HDFS infrastructure and is experimentally evaluated to demonstrate the practicality and scalability of its features and the effectiveness of its optimizations under large workloads.


Keywords: Distributed annotation management, MapReduce, Cloud-based annotations



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Computer Science Department, Worcester Polytechnic Institute (WPI), Worcester, USA
