Measuring and Modelling Data Quality for Quality-Awareness in Data Mining

  • Laure Berti-Équille
Part of the Studies in Computational Intelligence book series (SCI, volume 43)

Keywords

Data Quality; Association Rule; Data Warehouse; Record Linkage; Data Mining Process
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Laure Berti-Équille (1)
  1. IRISA, University of Rennes I, France