Dirty Data Management in Cloud Database

  • Hongzhi Wang
  • Jianzhong Li
  • Jinbao Wang
  • Hong Gao


Data quality problems are caused by dirty data, and massive data sets are more likely to contain dirty data. Because cloud databases are an important platform for massive data management, they need to manage dirty data. Since traditional data-cleaning-based methods cannot clean dirty data entirely and are costly for massive data sets, this chapter presents a method for managing massive dirty data that returns query results with quality assurance. To achieve this goal, a dirty-data storage structure for cloud databases and a multi-level index structure for query processing are presented. Using this index, a query on dirty data selects candidate nodes in the cloud, which then process the query efficiently. This chapter discusses the index structure and index-based query processing techniques. Experimental results show the efficiency and effectiveness of the presented techniques.


Keywords: Data Index; Query Processing; Query Result; Master Node; Slave Node



This research is partially supported by National Science Foundation of China (No. 61003046), the NSFC-RGC of China (No. 60831160525), National Grant of High Technology 863 Program of China (No. 2009AA01Z149), Key Program of the National Natural Science Foundation of China (No. 60933001), National Postdoctoral Foundation of China (No. 20090450126, No. 201003447), Doctoral Fund of Ministry of Education of China (No. 20102302120054), Postdoctoral Foundation of Heilongjiang Province (No. LBH-Z09109), and Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Hongzhi Wang (1)
  • Jianzhong Li (1)
  • Jinbao Wang (1)
  • Hong Gao (1)
  1. Harbin Institute of Technology, Harbin, China
