Data Profiling

Synonyms

Metadata extraction; Metadata generation

Definitions

According to Naumann (2013) and Abedjan et al. (2015), data profiling is the set of activities and processes to determine the metadata about a given dataset. Metadata can range from simple statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values, to complex inter-value and inter-column dependencies. Metadata that are more difficult to compute involve multiple columns, such as inclusion dependencies or functional dependencies. Collectively, the set of results of these tasks is called the data profile or database profile. In the Encyclopedia of Database Systems, Johnson considers data profiling to be the activity of creating small but informative summaries of a database (Johnson 2009). In contrast to data mining, the focus of data profiling is typically the structure of a dataset. Thus, data profiling generates knowledge about columns of a dataset,...
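The kinds of metadata named above can be illustrated with a minimal Python sketch. It is not taken from the entry or from the cited discovery algorithms: the function names `profile_column` and `holds_fd`, the pattern alphabet (digits become `9`, letters become `A`), and the assumption that values arrive as strings with `None` or `""` marking nulls are all illustrative choices. The dependency check is a naive validation of one candidate, not a discovery procedure.

```python
from collections import Counter
import re


def infer_type(values):
    """Guess a column type by trying to parse every non-null value (assumed to be strings)."""
    if not values:
        return "unknown"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in values:
                caster(str(v))
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values, top_k=3):
    """Single-column metadata: null count, distinct count, inferred type, frequent patterns."""
    non_null = [v for v in values if v is not None and v != ""]

    def pattern(v):
        # Generalize a value: every digit becomes 9, every ASCII letter becomes A.
        s = re.sub(r"[0-9]", "9", str(v))
        return re.sub(r"[A-Za-z]", "A", s)

    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "type": infer_type(non_null),
        "patterns": Counter(pattern(v) for v in non_null).most_common(top_k),
    }


def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # The dependency is violated if the same lhs values map to two different rhs values.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample data, invented for this illustration.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": None, "city": "Potsdam"},
]

zip_profile = profile_column([r["zip"] for r in rows])
zip_determines_city = holds_fd(rows, ["zip"], "city")
city_determines_zip = holds_fd(rows, ["city"], "zip")
```

On this sample, the single-column statistics need one pass over one column, whereas the dependency check must compare value combinations across columns. Discovering all such dependencies means testing many candidate column combinations, which is why the multi-column metadata is described as more difficult to compute.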

References

  • Abedjan Z, Grütze T, Jentzsch A, Naumann F (2014a) Profiling and mining RDF data with ProLOD++. In: Proceedings of the international conference on data engineering (ICDE), pp 1198–1201, Demo

  • Abedjan Z, Schulze P, Naumann F (2014b) DFD: efficient functional dependency discovery. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 949–958

  • Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J 24(4):557–581

  • Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, Jagadish HV, Labrinidis A, Madden S, Papakonstantinou Y, Patel JM, Ramakrishnan R, Ross K, Shahabi C, Suciu D, Vaithyanathan S, Widom J (2012) Challenges and opportunities with Big Data. Technical report, Computing Community Consortium. http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf

  • Benford F (1938) The law of anomalous numbers. Proc Am Philos Soc 78(4):551–572

  • Berti-Equille L, Dasu T, Srivastava D (2011) Discovery of complex glitch patterns: a novel approach to quantitative data cleaning. In: Proceedings of the international conference on data engineering (ICDE), pp 733–744

  • Caruccio L, Deufemia V, Polese G (2016) Relaxed functional dependencies – a survey of approaches. IEEE Trans Knowl Data Eng (TKDE) 28(1):147–165. https://doi.org/10.1109/TKDE.2015.2472010

  • Dasu T, Loh JM (2012) Statistical distortion: consequences of data cleaning. Proc VLDB Endow (PVLDB) 5(11):1674–1683

  • Dasu T, Johnson T, Marathe A (2006) Database exploration using database dynamics. IEEE Data Eng Bull 29(2):43–59

  • Dasu T, Loh JM, Srivastava D (2014) Empirical glitch explanations. In: Proceedings of the international conference on knowledge discovery and data mining (SIGKDD), pp 572–581. ISBN: 978-1-4503-2956-9

  • Deshpande A, Garofalakis M, Rastogi R (2001) Independence is good: dependency-based histogram synopses for high-dimensional data. In: Proceedings of the international conference on management of data (SIGMOD), pp 199–210. ISBN: 1-58113-332-4

  • Euzenat J, Shvaiko P (2013) Ontology matching, 2nd edn. Springer, Berlin/Heidelberg/New York

  • Garofalakis M, Keren D, Samoladas V (2013) Sketch-based geometric monitoring of distributed stream queries. Proc VLDB Endow (PVLDB) 6(10):937–948

  • Hainaut J-L, Henrard J, Englebert V, Roland D, Hick J-M (2009) Database reverse engineering. In: Encyclopedia of database systems. Springer, Heidelberg, pp 723–728

  • Heise A, Quiané-Ruiz J-A, Abedjan Z, Jentzsch A, Naumann F (2013) Scalable discovery of unique column combinations. Proc VLDB Endow (PVLDB) 7(4):301–312

  • Hipp J, Güntzer U, Nakhaeizadeh G (2000) Algorithms for association rule mining – a general survey and comparison. SIGKDD Explor 2(1):58–64. ISSN: 1931-0145

  • Johnson T (2009) Data profiling. In: Encyclopedia of database systems. Springer, Heidelberg, pp 604–608

  • Kache H, Han W-S, Markl V, Raman V, Ewen S (2006) POP/FED: progressive query optimization for federated queries in DB2. In: Proceedings of the international conference on very large databases (VLDB), pp 1175–1178

  • Kandel S, Parikh R, Paepcke A, Hellerstein J, Heer J (2012) Profiler: integrated statistical analysis and visualization for data quality assessment. In: Proceedings of advanced visual interfaces (AVI), pp 547–554

  • Kang J, Naughton JF (2003) On schema matching with opaque column names and data values. In: Proceedings of the international conference on management of data (SIGMOD), pp 205–216

  • Madhavan J, Bernstein PA, Rahm E (2001) Generic schema matching with Cupid. In: Proceedings of the international conference on very large databases (VLDB), pp 49–58

  • Mannino MV, Chu P, Sager T (1988) Statistical profile estimation in database systems. ACM Comput Surv 20(3):191–221

  • Markowitz VM, Makowsky JA (1990) Identifying extended entity-relationship object structures in relational schemas. IEEE Trans Softw Eng 16(8):777–790

  • Naumann F (2013) Data profiling revisited. SIGMOD Rec 42(4):40–49

  • Naumann F, Ho C-T, Tian X, Haas L, Megiddo N (2002) Attribute classification using feature analysis. In: Proceedings of the international conference on data engineering (ICDE), p 271

  • Ntarmos N, Triantafillou P, Weikum G (2009) Distributed hash sketches: scalable, efficient, and accurate cardinality estimation for distributed multisets. ACM Trans Comput Syst (TOCS) 27(1):1–53. ISSN: 0734-2071

  • Papenbrock T, Naumann F (2016) A hybrid approach to functional dependency discovery. In: Proceedings of the international conference on management of data (SIGMOD), pp 821–833. https://doi.org/10.1145/2882903.2915203

  • Papenbrock T, Bergmann T, Finke M, Zwiener J, Naumann F (2015a) Data profiling with Metanome. Proc VLDB Endow (PVLDB) 8:1860–1863

  • Papenbrock T, Kruse S, Quiané-Ruiz J-A, Naumann F (2015b) Divide & conquer-based inclusion dependency discovery. Proc VLDB Endow (PVLDB) 8(7):774–785

  • Petit J-M, Kouloumdjian J, Boulicaut J-F, Toumani F (1994) Using queries to improve database reverse engineering. In: Proceedings of the international conference on conceptual modeling (ER), pp 369–386

  • Pipino L, Lee Y, Wang R (2002) Data quality assessment. Commun ACM 45(4):211–218

  • Poosala V, Ioannidis YE (1997) Selectivity estimation without the attribute value independence assumption. In: Proceedings of the international conference on very large databases (VLDB), pp 486–495

  • Poosala V, Haas PJ, Ioannidis YE, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates. In: Proceedings of the international conference on management of data (SIGMOD), pp 294–305

  • Pyle D (1999) Data preparation for data mining. Morgan Kaufmann, San Francisco

  • Raman V, Hellerstein JM (2001) Potter's wheel: an interactive data cleaning system. In: Proceedings of the international conference on very large databases (VLDB), pp 381–390

Author information

Correspondence to Ziawasch Abedjan.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this entry

Abedjan, Z. (2019). Data Profiling. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_8