Dataspaces: Where Structure and Schema Meet

  • Maurizio Atzori
  • Nicoletta Dessì
Part of the Studies in Computational Intelligence book series (SCI, volume 375)

Abstract

In this chapter we investigate the problem that underlies the concept of dataspaces: the need for human intervention in organizing unstructured data, that is, in deriving its structure. We survey the existing techniques that dataspaces use to overcome that need, exploring the structure of a dataspace along three dimensions: dataspace profiling, querying and searching, and application domain. We then review existing projects that focus on dataspaces, on inducing data structure from documents, and on data models in which data schema and document structure overlap, such as Apache Hadoop, Cassandra on Amazon Dynamo, the Google BigTable model and other DHT-based flexible data structures, Google Fusion Tables, iMeMex, U-DID, WebTables and Yahoo! SearchMonkey.

Keywords

Data Integration; Distributed Hash Table; Conjunctive Query; Query Model; Unstructured Data
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Maurizio Atzori (1)
  • Nicoletta Dessì (1)
  1. University of Cagliari, Italy
