Get a Sample for a Discount

Sampling-Based XML Data Pricing
  • Ruiming Tang
  • Antoine Amarilli
  • Pierre Senellart
  • Stéphane Bressan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8644)


While price and data quality should define the major trade-off for consumers in data markets, prices are usually prescribed by vendors and data quality is not negotiable. In this paper we study a model where data quality can be traded for a discount. We focus on the case of XML documents and consider completeness as the quality dimension. In our setting, the data provider offers an XML document, and sets both the price of the document and a weight to each node of the document, depending on its potential worth. The data consumer proposes a price. If the proposed price is lower than that of the entire document, then the data consumer receives a sample, i.e., a random rooted subtree of the document whose selection depends on the discounted price and the weight of nodes. By requesting several samples, the data consumer can iteratively explore the data in the document. We show that the uniform random sampling of a rooted subtree with prescribed weight is unfortunately intractable. However, we are able to identify several practical cases that are tractable. The first case is uniform random sampling of a rooted subtree with prescribed size; the second case restricts to binary weights. For both these practical cases we present polynomial-time algorithms and explain how they can be integrated into an iterative exploratory sampling approach.


Binary Tree Data Market Sampling Problem Discount Price Data Provider 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Cohen, S., Kimelfeld, B., Sagiv, Y.: Running tree automata on probabilistic XML. In: PODS, pp. 227–236 (2009)Google Scholar
  2. 2.
    Henzinger, M.R., Heydon, A., Mitzenmacher, M., Najork, M.: On near-uniform URL sampling. Computer Networks 33(1-6) (2000)Google Scholar
  3. 3.
    Hübler, C., Kriegel, H.-P., Borgwardt, K., Ghahramani, Z.: Metropolis algorithms for representative subgraph sampling. In: ICDM (2008)Google Scholar
  4. 4.
    Koutris, P., Upadhyaya, P., Balazinska, M., Howe, B., Suciu, D.: Query-based data pricing. In: PODS (2012)Google Scholar
  5. 5.
    Koutris, P., Upadhyaya, P., Balazinska, M., Howe, B., Suciu, D.: QueryMarket demonstration: Pricing for online data markets. PVLDB 5(12) (2012)Google Scholar
  6. 6.
    Koutris, P., Upadhyaya, P., Balazinska, M., Howe, B., Suciu, D.: Toward practical query pricing with QueryMarket. In: SIGMOD (2013)Google Scholar
  7. 7.
    Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: SIGKDD (2006)Google Scholar
  8. 8.
    Li, C., Li, D.Y., Miklau, G., Suciu, D.: A theory of pricing private data. In: ICDT (2013)Google Scholar
  9. 9.
    Li, C., Miklau, G.: Pricing aggregate queries in a data marketplace. In: WebDB (2012)Google Scholar
  10. 10.
    Lin, B.-R., Kifer, D.: On arbitrage-free pricing for general data queries. PVLDB 7(9), 757–768 (2014)Google Scholar
  11. 11.
    Lu, X., Bressan, S.: Sampling connected induced subgraphs uniformly at random. In: Ailamaki, A., Bowers, S. (eds.) SSDBM 2012. LNCS, vol. 7338, pp. 195–212. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  12. 12.
    Luo, C., Jiang, Z., Hou, W.-C., Yu, F., Zhu, Q.: A sampling approach for XML query selectivity estimation. In: EDBT (2009)Google Scholar
  13. 13.
    Maiya, A.S., Berger-Wolf, T.Y.: Sampling community structure. In: WWW (2010)Google Scholar
  14. 14.
    Muschalle, A., Stahl, F., Löser, A., Vossen, G.: Pricing approaches for data markets. In: Castellanos, M., Dayal, U., Rundensteiner, E.A. (eds.) BIRTE 2012. LNBIP, vol. 154, pp. 129–144. Springer, Heidelberg (2013)CrossRefGoogle Scholar
  15. 15.
    Pipino, L., Lee, Y.W., Wang, R.Y.: Data quality assessment. Commun. ACM 45(4) (2002)Google Scholar
  16. 16.
    Ribeiro, B.F., Towsley, D.F.: Estimating and sampling graphs with multidimensional random walks. In: Internet Measurement Conference (2010)Google Scholar
  17. 17.
    Tang, R., Wu, H., Bao, Z., Bressan, S., Valduriez, P.: The price is right. In: Decker, H., Lhotská, L., Link, S., Basl, J., Tjoa, A.M. (eds.) DEXA 2013, Part II. LNCS, vol. 8056, pp. 380–394. Springer, Heidelberg (2013)CrossRefGoogle Scholar
  18. 18.
    Wang, R.Y., Strong, D.M.: Beyond accuracy: What data quality means to data consumers. J. of Management Information Systems 12(4) (1996)Google Scholar
  19. 19.
    Wang, W., Jiang, H., Lu, H., Yu, J.X.: Containment join size estimation: Models and methods. In: SIGMOD (2003)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Ruiming Tang
    • 1
  • Antoine Amarilli
    • 2
  • Pierre Senellart
    • 2
  • Stéphane Bressan
    • 1
  1. 1.National University of SingaporeSingapore
  2. 2.Institut Mines–Télécom; Télécom ParisTech; CNRS LTCI. ParisFrance

Personalised recommendations