Data Profiling

Abedjan, Ziawasch

doi:10.1007/978-3-319-77525-8_8

Reference work entry
First Online: 01 January 2019

Synonyms

Metadata extraction; Metadata generation

Definitions

According to Naumann (2013) and Abedjan et al. (2015), data profiling is the set of activities and processes to determine the metadata about a given dataset. Metadata can range from simple statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values, to complex inter-value and inter-column dependencies. Metadata that are more difficult to compute involve multiple columns, such as inclusion dependencies or functional dependencies. Collectively, the set of results of these tasks is called the data profile or database profile. In the Encyclopedia of Database Systems, Johnson considers data profiling to be the activity of creating small but informative summaries of a database (Johnson 2009). In contrast to data mining, the focus of data profiling is typically the structure of a dataset. Thus data profiling generates knowledge about columns of a dataset,...
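
To make the kinds of metadata named in the definition concrete, the following is a minimal sketch in Python. It is not taken from the entry: the helper names `profile_column` and `holds_fd`, the digit/letter pattern encoding, and the sample rows are all illustrative assumptions. It computes three single-column statistics (null count, distinct count, most frequent value patterns) and naively checks one multi-column dependency (a functional dependency).

```python
from collections import Counter
import re


def profile_column(values):
    """Single-column metadata: null count, distinct count, top value patterns."""
    non_null = [v for v in values if v is not None]
    # Assumed pattern encoding: every digit becomes 9, every ASCII letter becomes A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_patterns": patterns.most_common(2),
    }


def holds_fd(rows, lhs, rhs):
    """Naive check of whether the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = row[rhs]
        # The dependency is violated if one lhs value maps to two rhs values.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical sample data, invented for illustration only.
rows = [
    {"zip": "10115", "city": "Berlin", "phone": None},
    {"zip": "10115", "city": "Berlin", "phone": "030-1234"},
    {"zip": "14467", "city": "Potsdam", "phone": "0331-567"},
    {"zip": "10117", "city": "Berlin", "phone": None},
]

zip_profile = profile_column([r["zip"] for r in rows])
phone_profile = profile_column([r["phone"] for r in rows])
zip_determines_city = holds_fd(rows, ["zip"], "city")
city_determines_zip = holds_fd(rows, ["city"], "zip")
```

On this sample, `zip -> city` holds while `city -> zip` does not, because Berlin appears with two different zip codes. Checking a single candidate dependency like this is cheap; the harder problem that the cited discovery algorithms address is searching the space of all column combinations.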

References

Abedjan Z, Grütze T, Jentzsch A, Naumann F (2014a) Profiling and mining RDF data with ProLOD++. In: Proceedings of the international conference on data engineering (ICDE), pp 1198–1201, Demo
Abedjan Z, Schulze P, Naumann F (2014b) DFD: efficient functional dependency discovery. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 949–958
Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J 24(4):557–581
Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, Jagadish HV, Labrinidis A, Madden S, Papakonstantinou Y, Patel JM, Ramakrishnan R, Ross K, Shahabi C, Suciu D, Vaithyanathan S, Widom J (2012) Challenges and opportunities with Big Data. Technical report, Computing Community Consortium. http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
Benford F (1938) The law of anomalous numbers. Proc Am Philos Soc 78(4):551–572
Berti-Equille L, Dasu T, Srivastava D (2011) Discovery of complex glitch patterns: a novel approach to quantitative data cleaning. In: Proceedings of the international conference on data engineering (ICDE), pp 733–744
Caruccio L, Deufemia V, Polese G (2016) Relaxed functional dependencies – a survey of approaches. IEEE Trans Knowl Data Eng (TKDE) 28(1):147–165. https://doi.org/10.1109/TKDE.2015.2472010
Dasu T, Loh JM (2012) Statistical distortion: consequences of data cleaning. Proc VLDB Endow (PVLDB) 5(11):1674–1683
Dasu T, Johnson T, Marathe A (2006) Database exploration using database dynamics. IEEE Data Eng Bull 29(2):43–59
Dasu T, Loh JM, Srivastava D (2014) Empirical glitch explanations. In: Proceedings of the international conference on knowledge discovery and data mining (SIGKDD), pp 572–581. ISBN: 978-1-4503-2956-9
Deshpande A, Garofalakis M, Rastogi R (2001) Independence is good: dependency-based histogram synopses for high-dimensional data. In: Proceedings of the international conference on management of data (SIGMOD), pp 199–210. ISBN: 1-58113-332-4
Euzenat J, Shvaiko P (2013) Ontology matching, 2nd edn. Springer, Berlin/Heidelberg/New York
Garofalakis M, Keren D, Samoladas V (2013) Sketch-based geometric monitoring of distributed stream queries. Proc VLDB Endow (PVLDB) 6(10):937–948
Hainaut J-L, Henrard J, Englebert V, Roland D, Hick J-M (2009) Database reverse engineering. In: Encyclopedia of database systems. Springer, Heidelberg, pp 723–728
Heise A, Quiané-Ruiz J-A, Abedjan Z, Jentzsch A, Naumann F (2013) Scalable discovery of unique column combinations. Proc VLDB Endow (PVLDB) 7(4):301–312
Hipp J, Güntzer U, Nakhaeizadeh G (2000) Algorithms for association rule mining – a general survey and comparison. SIGKDD Explor 2(1):58–64. ISSN: 1931-0145
Johnson T (2009) Data profiling. In: Encyclopedia of database systems. Springer, Heidelberg, pp 604–608
Kache H, Han W-S, Markl V, Raman V, Ewen S (2006) POP/FED: progressive query optimization for federated queries in DB2. In: Proceedings of the international conference on very large databases (VLDB), pp 1175–1178
Kandel S, Parikh R, Paepcke A, Hellerstein J, Heer J (2012) Profiler: integrated statistical analysis and visualization for data quality assessment. In: Proceedings of advanced visual interfaces (AVI), pp 547–554
Kang J, Naughton JF (2003) On schema matching with opaque column names and data values. In: Proceedings of the international conference on management of data (SIGMOD), pp 205–216
Madhavan J, Bernstein PA, Rahm E (2001) Generic schema matching with Cupid. In: Proceedings of the international conference on very large databases (VLDB), pp 49–58
Mannino MV, Chu P, Sager T (1988) Statistical profile estimation in database systems. ACM Comput Surv 20(3):191–221
Markowitz VM, Makowsky JA (1990) Identifying extended entity-relationship object structures in relational schemas. IEEE Trans Softw Eng 16(8):777–790
Naumann F (2013) Data profiling revisited. SIGMOD Rec 42(4):40–49
Naumann F, Ho C-T, Tian X, Haas L, Megiddo N (2002) Attribute classification using feature analysis. In: Proceedings of the international conference on data engineering (ICDE), p 271
Ntarmos N, Triantafillou P, Weikum G (2009) Distributed hash sketches: scalable, efficient, and accurate cardinality estimation for distributed multisets. ACM Trans Comput Syst (TOCS) 27(1):1–53. ISSN: 0734-2071
Papenbrock T, Naumann F (2016) A hybrid approach to functional dependency discovery. In: Proceedings of the international conference on management of data (SIGMOD), pp 821–833. https://doi.org/10.1145/2882903.2915203
Papenbrock T, Bergmann T, Finke M, Zwiener J, Naumann F (2015a) Data profiling with Metanome. Proc VLDB Endow (PVLDB) 8:1860–1863
Papenbrock T, Kruse S, Quiané-Ruiz J-A, Naumann F (2015b) Divide & conquer-based inclusion dependency discovery. Proc VLDB Endow (PVLDB) 8(7):774–785
Petit J-M, Kouloumdjian J, Boulicaut J-F, Toumani F (1994) Using queries to improve database reverse engineering. In: Proceedings of the international conference on conceptual modeling (ER), pp 369–386
Pipino L, Lee Y, Wang R (2002) Data quality assessment. Commun ACM 45(4):211–218
Poosala V, Ioannidis YE (1997) Selectivity estimation without the attribute value independence assumption. In: Proceedings of the international conference on very large databases (VLDB), pp 486–495
Poosala V, Haas PJ, Ioannidis YE, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates. In: Proceedings of the international conference on management of data (SIGMOD), pp 294–305
Pyle D (1999) Data preparation for data mining. Morgan Kaufmann, San Francisco
Raman V, Hellerstein JM (2001) Potter's Wheel: an interactive data cleaning system. In: Proceedings of the international conference on very large databases (VLDB), pp 381–390

Author information

Authors and Affiliations

Technische Universität Berlin, Berlin, Germany
Ziawasch Abedjan

Corresponding author

Correspondence to Ziawasch Abedjan.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
School of Information Technologies, Sydney University, Sydney, Australia
Albert Y. Zomaya

About this entry

Cite this entry

Abedjan, Z. (2019). Data Profiling. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_8

DOI: https://doi.org/10.1007/978-3-319-77525-8_8
Published: 20 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
