Indexing dataspaces with partitions

Abstract

Dataspaces have recently been proposed to manage heterogeneous data that is partially unstructured, high-dimensional and extremely sparse. The inverted index has previously been extended to support retrieval over dataspaces. To achieve more efficient access to dataspaces, in this paper we first present our survey of data features in real dataspaces. Based on the features observed in this study, we propose several partitioning-based index approaches to accelerate query processing in dataspaces. Specifically, the vertical partitioning index uses partitions of tokens to merge and compress data, which both reduces the number of I/O reads and avoids aggregating data inside a compressed list. The horizontal partitioning index supports pruning partitions of tuples in top-k queries, which reduces the computation spent on candidate tuples that are irrelevant to the query. Finally, we propose a hybrid index that combines vertical and horizontal partitioning. Extensive experiments on real data sets demonstrate that our approaches outperform previous techniques and scale well with large data sizes.
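To make the idea of pruning partitions of tuples in a top-k query concrete, the following is a minimal sketch under stated assumptions, not the index structure proposed in the paper. It assumes that each tuple is a sparse mapping from tokens to non-negative weights, that the ranking score is a weighted sum over the query tokens, and that tuples are split into fixed-size partitions. The names `build_partitions` and `top_k` are hypothetical and chosen only for illustration.

```python
import heapq
from collections import defaultdict


def build_partitions(tuples, size):
    """Split tuples (dicts: token -> non-negative weight) into fixed-size
    horizontal partitions. For each partition, keep the maximum weight seen
    for every token; this summary later yields a score upper bound."""
    partitions = []
    for start in range(0, len(tuples), size):
        rows = list(enumerate(tuples[start:start + size], start))
        max_w = defaultdict(float)
        for _, row in rows:
            for tok, w in row.items():
                max_w[tok] = max(max_w[tok], w)
        partitions.append((dict(max_w), rows))
    return partitions


def top_k(partitions, query, k):
    """Return the k best (score, tuple_id) pairs and the number of
    partitions that were actually scanned. Assumes non-negative weights
    in both the query and the tuples, so the bound below is valid."""
    heap = []  # min-heap holding the current top-k (score, tuple_id)
    scanned = 0
    for max_w, rows in partitions:
        # No tuple in this partition can score higher than this bound.
        bound = sum(qw * max_w.get(tok, 0.0) for tok, qw in query.items())
        if len(heap) == k and bound <= heap[0][0]:
            continue  # prune: the partition cannot improve the top-k
        scanned += 1
        for tid, row in rows:
            score = sum(qw * row.get(tok, 0.0) for tok, qw in query.items())
            if len(heap) < k:
                heapq.heappush(heap, (score, tid))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, tid))
    return sorted(heap, reverse=True), scanned
```

In this sketch a partition is skipped whenever its upper bound cannot exceed the current k-th best score, so the tuples it contains are never scored. How tuples are grouped into partitions determines how tight the bounds are; the fixed-size split used here is only a placeholder for that choice.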

Author information

Correspondence to Shaoxu Song.

About this article

Cite this article

Song, S., Chen, L. Indexing dataspaces with partitions. World Wide Web 16, 141–170 (2013). https://doi.org/10.1007/s11280-012-0163-7

Keywords

  • partitioning index
  • dataspaces