Encyclopedia of Big Data Technologies

2019 Edition
Editors: Sherif Sakr, Albert Y. Zomaya

Data Profiling

  • Ziawasch Abedjan
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_8

Synonyms

Definitions

According to Naumann (2013) and Abedjan et al. (2015), data profiling is the set of activities and processes for determining the metadata of a given dataset. Metadata range from simple statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values, to complex inter-value and inter-column dependencies. Metadata that involve multiple columns, such as inclusion dependencies or functional dependencies, are more difficult to compute. Collectively, the results of these tasks are called the data profile or database profile. In the Encyclopedia of Database Systems, Johnson describes data profiling as the activity of creating small but informative summaries of a database (Johnson 2009). In contrast to data mining, data profiling typically focuses on the structure of a dataset. Thus data profiling generates knowledge about columns of a dataset,...
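To make the kinds of metadata named in the definition concrete, the following is a minimal sketch in plain Python. It is not taken from the entry or from any profiling tool. It computes single-column statistics (null count, distinct count, observed types, most frequent value patterns) and naively checks whether a functional dependency holds. The function names `value_pattern`, `profile_column`, and `holds_fd`, as well as the sample rows, are hypothetical and chosen only for illustration.

```python
from collections import Counter
import re


def value_pattern(value):
    """Generalize a value to its shape: every digit becomes 9, every letter A."""
    shape = re.sub(r"\d", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", shape)


def profile_column(values):
    """Single-column metadata; None is treated as the null value (an assumption)."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(v) for v in non_null)
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
        "top_patterns": patterns.most_common(2),
    }


def holds_fd(rows, lhs, rhs):
    """Naive check of the functional dependency lhs -> rhs over a list of dicts."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        # The dependency is violated if one lhs value maps to two rhs values.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample data, for illustration only.
rows = [
    {"zip": "10115", "city": "Berlin", "phone": "030-1234"},
    {"zip": "10115", "city": "Berlin", "phone": None},
    {"zip": "80331", "city": "Munich", "phone": "089-5678"},
    {"zip": "10117", "city": "Berlin", "phone": "030-9999"},
]

phone_profile = profile_column([r["phone"] for r in rows])
zip_determines_city = holds_fd(rows, ["zip"], "city")
city_determines_zip = holds_fd(rows, ["city"], "zip")
```

On this sample, zip -> city holds while city -> zip does not, because Berlin appears with two different zip codes. The check validates one given dependency in a single pass over the rows. Discovering all dependencies of a dataset is a much larger search problem, which is the subject of the discovery algorithms listed in the references.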


References

  1. Abedjan Z, Grütze T, Jentzsch A, Naumann F (2014a) Profiling and mining RDF data with ProLOD++. In: Proceedings of the international conference on data engineering (ICDE), pp 1198–1201, Demo
  2. Abedjan Z, Schulze P, Naumann F (2014b) DFD: efficient functional dependency discovery. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 949–958
  3. Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J 24(4):557–581
  4. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, Jagadish HV, Labrinidis A, Madden S, Papakonstantinou Y, Patel JM, Ramakrishnan R, Ross K, Shahabi C, Suciu D, Vaithyanathan S, Widom J (2012) Challenges and opportunities with Big Data. Technical report, Computing Community Consortium. http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
  5. Benford F (1938) The law of anomalous numbers. Proc Am Philos Soc 78(4):551–572
  6. Berti-Equille L, Dasu T, Srivastava D (2011) Discovery of complex glitch patterns: a novel approach to quantitative data cleaning. In: Proceedings of the international conference on data engineering (ICDE), pp 733–744
  7. Caruccio L, Deufemia V, Polese G (2016) Relaxed functional dependencies – a survey of approaches. IEEE Trans Knowl Data Eng (TKDE) 28(1):147–165. https://doi.org/10.1109/TKDE.2015.2472010
  8. Dasu T, Loh JM (2012) Statistical distortion: consequences of data cleaning. Proc VLDB Endow (PVLDB) 5(11):1674–1683
  9. Dasu T, Johnson T, Marathe A (2006) Database exploration using database dynamics. IEEE Data Eng Bull 29(2):43–59
  10. Dasu T, Loh JM, Srivastava D (2014) Empirical glitch explanations. In: Proceedings of the international conference on knowledge discovery and data mining (SIGKDD), pp 572–581. ISBN: 978-1-4503-2956-9
  11. Deshpande A, Garofalakis M, Rastogi R (2001) Independence is good: dependency-based histogram synopses for high-dimensional data. In: Proceedings of the international conference on management of data (SIGMOD), pp 199–210. ISBN: 1-58113-332-4
  12. Euzenat J, Shvaiko P (2013) Ontology matching, 2nd edn. Springer, Berlin/Heidelberg/New York
  13. Garofalakis M, Keren D, Samoladas V (2013) Sketch-based geometric monitoring of distributed stream queries. Proc VLDB Endow (PVLDB) 6(10):937–948
  14. Hainaut J-L, Henrard J, Englebert V, Roland D, Hick J-M (2009) Database reverse engineering. In: Encyclopedia of database systems. Springer, Heidelberg, pp 723–728
  15. Heise A, Quiané-Ruiz J-A, Abedjan Z, Jentzsch A, Naumann F (2013) Scalable discovery of unique column combinations. Proc VLDB Endow (PVLDB) 7(4):301–312
  16. Hipp J, Güntzer U, Nakhaeizadeh G (2000) Algorithms for association rule mining – a general survey and comparison. SIGKDD Explor 2(1):58–64. ISSN: 1931-0145
  17. Johnson T (2009) Data profiling. In: Encyclopedia of database systems. Springer, Heidelberg, pp 604–608
  18. Kache H, Han W-S, Markl V, Raman V, Ewen S (2006) POP/FED: progressive query optimization for federated queries in DB2. In: Proceedings of the international conference on very large databases (VLDB), pp 1175–1178
  19. Kandel S, Parikh R, Paepcke A, Hellerstein J, Heer J (2012) Profiler: integrated statistical analysis and visualization for data quality assessment. In: Proceedings of advanced visual interfaces (AVI), pp 547–554
  20. Kang J, Naughton JF (2003) On schema matching with opaque column names and data values. In: Proceedings of the international conference on management of data (SIGMOD), pp 205–216
  21. Madhavan J, Bernstein PA, Rahm E (2001) Generic schema matching with Cupid. In: Proceedings of the international conference on very large databases (VLDB), pp 49–58
  22. Mannino MV, Chu P, Sager T (1988) Statistical profile estimation in database systems. ACM Comput Surv 20(3):191–221
  23. Markowitz VM, Makowsky JA (1990) Identifying extended entity-relationship object structures in relational schemas. IEEE Trans Softw Eng 16(8):777–790
  24. Naumann F (2013) Data profiling revisited. SIGMOD Rec 42(4):40–49
  25. Naumann F, Ho C-T, Tian X, Haas L, Megiddo N (2002) Attribute classification using feature analysis. In: Proceedings of the international conference on data engineering (ICDE), p 271
  26. Ntarmos N, Triantafillou P, Weikum G (2009) Distributed hash sketches: scalable, efficient, and accurate cardinality estimation for distributed multisets. ACM Trans Comput Syst (TOCS) 27(1):1–53. ISSN: 0734-2071
  27. Papenbrock T, Naumann F (2016) A hybrid approach to functional dependency discovery. In: Proceedings of the international conference on management of data (SIGMOD), pp 821–833. https://doi.org/10.1145/2882903.2915203, http://doi.acm.org/10.1145/2882903.2915203
  28. Papenbrock T, Bergmann T, Finke M, Zwiener J, Naumann F (2015a) Data profiling with metanome. Proc VLDB Endow (PVLDB) 8:1860–1863
  29. Papenbrock T, Kruse S, Quiané-Ruiz J-A, Naumann F (2015b) Divide & conquer-based inclusion dependency discovery. Proc VLDB Endow (PVLDB) 8(7):774–785
  30. Petit J-M, Kouloumdjian J, Boulicaut J-F, Toumani F (1994) Using queries to improve database reverse engineering. In: Proceedings of the international conference on conceptual modeling (ER), pp 369–386
  31. Pipino L, Lee Y, Wang R (2002) Data quality assessment. Commun ACM 4:211–218
  32. Poosala V, Ioannidis YE (1997) Selectivity estimation without the attribute value independence assumption. In: Proceedings of the international conference on very large databases (VLDB), pp 486–495
  33. Poosala V, Haas PJ, Ioannidis YE, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates. In: Proceedings of the international conference on management of data (SIGMOD), pp 294–305
  34. Pyle D (1999) Data preparation for data mining. Morgan Kaufmann, San Francisco
  35. Raman V, Hellerstein JM (2001) Potters wheel: an interactive data cleaning system. In: Proceedings of the international conference on very large databases (VLDB), pp 381–390

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Technische Universität Berlin, Berlin, Germany