Synonyms
Definitions
According to Naumann (2013) and Abedjan et al. (2015), data profiling is the set of activities and processes used to determine the metadata of a given dataset. Metadata range from simple statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values, to complex inter-value and inter-column dependencies. Metadata that involve multiple columns, such as inclusion dependencies or functional dependencies, are more difficult to compute. Collectively, the results of these tasks are called the data profile or database profile. In the Encyclopedia of Database Systems, Johnson describes data profiling as the activity of creating small but informative summaries of a database (Johnson 2009). In contrast to data mining, data profiling typically focuses on the structure of a dataset. Thus data profiling generates knowledge about columns of a dataset,...
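As a rough illustration of the metadata described above, the following Python sketch computes a few single-column statistics (null count, distinct count, most frequent value patterns) and validates one candidate functional dependency. The function names `value_pattern`, `profile_column`, and `holds_fd`, the pattern encoding (digits as `9`, letters as `A`), and the sample rows are assumptions made for illustration and are not taken from the cited works. The dependency check only tests a single given candidate naively; it is not one of the discovery algorithms listed in the references.

```python
import re
from collections import Counter


def value_pattern(value):
    """Generalize a value into a pattern: digits become 9, letters become A."""
    digits_masked = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", digits_masked)


def profile_column(values, top_k=3):
    """Return simple single-column metadata; None is treated as a null value."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(v) for v in non_null)
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "top_patterns": patterns.most_common(top_k),
    }


def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts, lhs a list of column names, rhs one column name.
    The dependency is violated if two rows agree on lhs but differ on rhs.
    """
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample data.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": None, "city": "Potsdam"},
]

print(profile_column([r["zip"] for r in rows]))
print(holds_fd(rows, ["zip"], "city"))
print(holds_fd(rows, ["city"], "zip"))
```

The single-column pass reads each value once. The dependency check must compare values across rows, which hints at why multi-column metadata are the more expensive part of a profile: the number of candidate column combinations grows quickly with the number of columns.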
References
Abedjan Z, Grütze T, Jentzsch A, Naumann F (2014a) Profiling and mining RDF data with ProLOD++. In: Proceedings of the international conference on data engineering (ICDE), pp 1198–1201, Demo
Abedjan Z, Schulze P, Naumann F (2014b) DFD: efficient functional dependency discovery. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 949–958
Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J 24(4):557–581
Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, Jagadish HV, Labrinidis A, Madden S, Papakonstantinou Y, Patel JM, Ramakrishnan R, Ross K, Shahabi C, Suciu D, Vaithyanathan S, Widom J (2012) Challenges and opportunities with Big Data. Technical report, Computing Community Consortium. http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
Benford F (1938) The law of anomalous numbers. Proc Am Philos Soc 78(4):551–572
Berti-Equille L, Dasu T, Srivastava D (2011) Discovery of complex glitch patterns: a novel approach to quantitative data cleaning. In: Proceedings of the international conference on data engineering (ICDE), pp 733–744
Caruccio L, Deufemia V, Polese G (2016) Relaxed functional dependencies – a survey of approaches. IEEE Trans Knowl Data Eng (TKDE) 28(1):147–165. https://doi.org/10.1109/TKDE.2015.2472010
Dasu T, Loh JM (2012) Statistical distortion: consequences of data cleaning. Proc VLDB Endow (PVLDB) 5(11):1674–1683
Dasu T, Johnson T, Marathe A (2006) Database exploration using database dynamics. IEEE Data Eng Bull 29(2):43–59
Dasu T, Loh JM, Srivastava D (2014) Empirical glitch explanations. In: Proceedings of the international conference on knowledge discovery and data mining (SIGKDD), pp 572–581. ISBN: 978-1-4503-2956-9
Deshpande A, Garofalakis M, Rastogi R (2001) Independence is good: dependency-based histogram synopses for high-dimensional data. In: Proceedings of the international conference on management of data (SIGMOD), pp 199–210. ISBN: 1-58113-332-4
Euzenat J, Shvaiko P (2013) Ontology matching, 2nd edn. Springer, Berlin/Heidelberg/New York
Garofalakis M, Keren D, Samoladas V (2013) Sketch-based geometric monitoring of distributed stream queries. Proc VLDB Endow (PVLDB) 6(10):937–948
Hainaut J-L, Henrard J, Englebert V, Roland D, Hick J-M (2009) Database reverse engineering. In: Encyclopedia of database systems. Springer, Heidelberg, pp 723–728
Heise A, Quiané-Ruiz J-A, Abedjan Z, Jentzsch A, Naumann F (2013) Scalable discovery of unique column combinations. Proc VLDB Endow (PVLDB) 7(4):301–312
Hipp J, Güntzer U, Nakhaeizadeh G (2000) Algorithms for association rule mining – a general survey and comparison. SIGKDD Explor 2(1):58–64. ISSN: 1931-0145
Johnson T (2009) Data profiling. In: Encyclopedia of database systems. Springer, Heidelberg, pp 604–608
Kache H, Han W-S, Markl V, Raman V, Ewen S (2006) POP/FED: progressive query optimization for federated queries in DB2. In: Proceedings of the international conference on very large databases (VLDB), pp 1175–1178
Kandel S, Parikh R, Paepcke A, Hellerstein J, Heer J (2012) Profiler: integrated statistical analysis and visualization for data quality assessment. In: Proceedings of advanced visual interfaces (AVI), pp 547–554
Kang J, Naughton JF (2003) On schema matching with opaque column names and data values. In: Proceedings of the international conference on management of data (SIGMOD), pp 205–216
Madhavan J, Bernstein PA, Rahm E (2001) Generic schema matching with Cupid. In: Proceedings of the international conference on very large databases (VLDB), pp 49–58
Mannino MV, Chu P, Sager T (1988) Statistical profile estimation in database systems. ACM Comput Surv 20(3):191–221
Markowitz VM, Makowsky JA (1990) Identifying extended entity-relationship object structures in relational schemas. IEEE Trans Softw Eng 16(8):777–790
Naumann F (2013) Data profiling revisited. SIGMOD Rec 42(4):40–49
Naumann F, Ho C-T, Tian X, Haas L, Megiddo N (2002) Attribute classification using feature analysis. In: Proceedings of the international conference on data engineering (ICDE), p 271
Ntarmos N, Triantafillou P, Weikum G (2009) Distributed hash sketches: scalable, efficient, and accurate cardinality estimation for distributed multisets. ACM Trans Comput Syst (TOCS) 27(1):1–53. ISSN: 0734-2071
Papenbrock T, Naumann F (2016) A hybrid approach to functional dependency discovery. In: Proceedings of the international conference on management of data (SIGMOD), pp 821–833. https://doi.org/10.1145/2882903.2915203
Papenbrock T, Bergmann T, Finke M, Zwiener J, Naumann F (2015a) Data profiling with Metanome. Proc VLDB Endow (PVLDB) 8:1860–1863
Papenbrock T, Kruse S, Quiané-Ruiz J-A, Naumann F (2015b) Divide & conquer-based inclusion dependency discovery. Proc VLDB Endow (PVLDB) 8(7):774–785
Petit J-M, Kouloumdjian J, Boulicaut J-F, Toumani F (1994) Using queries to improve database reverse engineering. In: Proceedings of the international conference on conceptual modeling (ER), pp 369–386
Pipino L, Lee Y, Wang R (2002) Data quality assessment. Commun ACM 45(4):211–218
Poosala V, Ioannidis YE (1997) Selectivity estimation without the attribute value independence assumption. In: Proceedings of the international conference on very large databases (VLDB), pp 486–495
Poosala V, Haas PJ, Ioannidis YE, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates. In: Proceedings of the international conference on management of data (SIGMOD), pp 294–305
Pyle D (1999) Data preparation for data mining. Morgan Kaufmann, San Francisco
Raman V, Hellerstein JM (2001) Potter's wheel: an interactive data cleaning system. In: Proceedings of the international conference on very large databases (VLDB), pp 381–390
© 2019 Springer Nature Switzerland AG
About this entry
Cite this entry
Abedjan, Z. (2019). Data Profiling. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_8
DOI: https://doi.org/10.1007/978-3-319-77525-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering