Abstract

Data quality problems arise in different application contexts and require appropriate handling so that information becomes reliable. Examples of data anomalies are missing values, duplicates, misspellings, data inconsistencies, and wrong data formats. Current technologies handle data quality problems through (i) software programs written in a programming language (e.g., C or Java) or an RDBMS programming language; (ii) the integrity constraint mechanisms offered by relational database management systems; or (iii) a commercial data quality tool. None of these approaches is appropriate for non-conventional data applications that deal with large amounts of information. In fact, existing technology cannot support the design of a data flow graph that effectively and efficiently produces clean data.

AJAX is a data cleaning and transformation tool that overcomes these limitations. In this paper, we present an overview of the entire set of functionalities supported by the AJAX system. First, we explain the logical and physical levels of the AJAX framework and the advantages they bring to the specification and optimization of data cleaning programs. Second, the set of logical data cleaning and transformation operators is described and exemplified using the proposed declarative language. Third, we illustrate the purpose of the debugging facility and how it is supported by the exception mechanism offered by the logical operators. Finally, the architecture of the AJAX system is presented and the experimental validation of the prototype is briefly reported.
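
As a rough illustration of the anomaly types listed in the abstract, the following Python sketch flags missing values, wrong date formats, exact duplicates, and likely misspellings in a handful of records. It is not AJAX code and does not use the AJAX declarative language. The sample records, the function names, the ISO date convention, and the 0.9 similarity threshold are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample records, invented for illustration only.
RECORDS = [
    {"id": 1, "name": "Helena Galhardas", "city": "Lisbon", "date": "2006-03-01"},
    {"id": 2, "name": "Helena Galhardas", "city": "Lisbon", "date": "2006-03-01"},
    {"id": 3, "name": "Helena Galhadras", "city": "Lisbon", "date": "2006-03-01"},
    {"id": 4, "name": "Dennis Shasha", "city": "", "date": "01/03/2006"},
]

# Assumed target format for dates: YYYY-MM-DD.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def missing_values(records):
    """Return (id, field) pairs whose value is empty or None."""
    return [(r["id"], k) for r in records for k, v in r.items() if v in ("", None)]


def wrong_formats(records):
    """Return ids of records whose date does not match the assumed format."""
    return [r["id"] for r in records if not ISO_DATE.fullmatch(r["date"])]


def exact_duplicates(records):
    """Return (first_id, duplicate_id) pairs that agree on every non-id field."""
    seen = {}
    dups = []
    for r in records:
        key = tuple(v for k, v in r.items() if k != "id")
        if key in seen:
            dups.append((seen[key], r["id"]))
        else:
            seen[key] = r["id"]
    return dups


def likely_misspellings(records, threshold=0.9):
    """Return id pairs whose names differ but are very similar (assumed threshold)."""
    pairs = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if a["name"] != b["name"]:
                ratio = SequenceMatcher(None, a["name"], b["name"]).ratio()
                if ratio >= threshold:
                    pairs.append((a["id"], b["id"]))
    return pairs


if __name__ == "__main__":
    print("missing:", missing_values(RECORDS))
    print("wrong format:", wrong_formats(RECORDS))
    print("duplicates:", exact_duplicates(RECORDS))
    print("misspellings:", likely_misspellings(RECORDS))
```

The pairwise comparison in `likely_misspellings` is quadratic in the number of records, which is one reason hand-written checks of this kind scale poorly to large data sets.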

Keywords

Logical Operator; Data Transformation; Data Cleaning; Physical Level; Match Operator
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Helena Galhardas (1)
  1. INESC-ID and Instituto Superior Técnico, Porto Salvo, Portugal