Data Profiling Technology of Data Governance Regarding Big Data: Review and Rethinking

  • Conference paper
  • In: Information Technology: New Generations

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 448)

Abstract

Data profiling is valuable for data governance and data quality control because it lets practitioners verify and review the quality of structured, semi-structured, and unstructured data. In this paper, we first review related work and discuss existing definitions of data profiling. Second, we offer a new definition and propose a new classification of data profiling tasks. Third, we present several free and commercial profiling tools. Fourth, we propose new data quality metrics and a data quality score calculation. Finally, we discuss a framework for a data profiling tool for big data.
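
To make the idea of profiling metrics and a quality score concrete, the following is a minimal Python sketch of single-column profiling. It is not the metric set or the scoring formula proposed in the paper, which this abstract does not spell out. The function names `profile_column` and `quality_score`, the two ratio metrics (completeness and uniqueness), and the equal weights are all illustrative assumptions.

```python
from collections import Counter


def profile_column(values):
    """Compute basic profiling statistics for one column (illustrative only)."""
    total = len(values)
    # Treat None and the empty string as missing values (an assumption).
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "count": total,
        # Share of values that are present.
        "completeness": len(non_null) / total if total else 0.0,
        # Share of present values that are distinct.
        "uniqueness": len(counts) / len(non_null) if non_null else 0.0,
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def quality_score(profile, weights=None):
    """Weighted average of ratio metrics; the default weights are hypothetical."""
    weights = weights or {"completeness": 0.5, "uniqueness": 0.5}
    total_weight = sum(weights.values())
    return sum(profile[name] * w for name, w in weights.items()) / total_weight


# Hypothetical sample column with one missing value, one empty string,
# and one duplicate.
emails = ["a@x.org", "b@x.org", None, "a@x.org", ""]
stats = profile_column(emails)
score = quality_score(stats)
print(stats, round(score, 3))
```

A weighted average is only one possible way to combine metrics into a score. In practice the choice of dimensions and weights depends on the governance rules that apply to each data set.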

Corresponding author

Correspondence to Wei Dai.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Dai, W., Wardlaw, I., Cui, Y., Mehdi, K., Li, Y., Long, J. (2016). Data Profiling Technology of Data Governance Regarding Big Data: Review and Rethinking. In: Latifi, S. (eds) Information Technology: New Generations. Advances in Intelligent Systems and Computing, vol 448. Springer, Cham. https://doi.org/10.1007/978-3-319-32467-8_39

  • DOI: https://doi.org/10.1007/978-3-319-32467-8_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32466-1

  • Online ISBN: 978-3-319-32467-8

  • eBook Packages: Engineering, Engineering (R0)
