Data profiling refers to the activity of creating small but informative summaries of a database. These summaries range from simple statistics, such as the number of records in a table and the number of distinct values of a field, to more complex statistics, such as the distribution of n-grams in the field text, to structural properties such as keys and functional dependencies. Database profiles are useful for database exploration, detection of data quality problems, and schema matching in data integration. Database exploration helps a user identify important database properties, whether they concern data of interest or data quality problems. Schema matching addresses the critical question, “do two fields or sets of fields or tables represent the same information?” Answers to this question are very useful for designing data integration scripts.
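As a rough illustration of the kinds of summaries listed above, the following minimal sketch computes a record count, a distinct-value count, a character n-gram distribution, a key test, and a functional dependency test over an in-memory table. The function names (`profile_field`, `holds_fd`, `resemblance`), the sample tables, and the use of Jaccard set resemblance as a stand-in for a schema-matching signal are all assumptions made for illustration; they are not taken from any particular profiling tool.

```python
from collections import Counter


def ngrams(text, n=3):
    """Return all character n-grams of a string (empty if the string is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile_field(rows, field, n=3):
    """Summarize one field: record count, distinct count, key test, n-gram distribution."""
    values = [row[field] for row in rows]
    distinct = set(values)
    return {
        "records": len(values),
        "distinct": len(distinct),
        # A single field is a key if no value repeats.
        "is_key": len(distinct) == len(values),
        "ngrams": Counter(g for v in values for g in ngrams(str(v), n)),
    }


def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in the given rows."""
    seen = {}
    for row in rows:
        # The first rhs value seen for each lhs value must match all later ones.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def resemblance(rows_a, field_a, rows_b, field_b):
    """Jaccard resemblance of two fields' value sets (assumed matching signal)."""
    a = {row[field_a] for row in rows_a}
    b = {row[field_b] for row in rows_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sample data, invented for this sketch.
customers = [
    {"id": 1, "zip": "07932", "city": "Florham Park"},
    {"id": 2, "zip": "07932", "city": "Florham Park"},
    {"id": 3, "zip": "10001", "city": "New York"},
]
orders = [{"cust": 1}, {"cust": 3}, {"cust": 4}]

zip_profile = profile_field(customers, "zip")
id_profile = profile_field(customers, "id")
```

Here `holds_fd(customers, "zip", "city")` tests whether each zip code maps to a single city, and `resemblance(customers, "id", orders, "cust")` gives a crude hint as to whether the two fields might hold the same information. On a real database these summaries would typically be computed with sampling or sketching rather than by materializing every value in memory.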
Databases which support a complex organization tend to be quite complex...