Data Profiling Technology of Data Governance Regarding Big Data: Review and Rethinking

  • Wei Dai
  • Isaac Wardlaw
  • Yu Cui
  • Kashif Mehdi
  • Yanyan Li
  • Jun Long
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 448)

Abstract

Data profiling technology is valuable for data governance and data quality control because it is needed to verify and review the quality of structured, semi-structured, and unstructured data. In this paper, we first review relevant works and discuss their definitions of data profiling. Second, we offer a new definition and propose new classifications for data profiling tasks. Third, we present several free and commercial profiling tools. Fourth, we offer new data quality metrics and a data quality score calculation. Finally, we discuss a data profiling tool framework for big data.
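To make the idea of a profiling task and a quality score concrete, the following is a minimal sketch in Python. It is not the metric set or score calculation proposed in the paper, whose details are not given in this abstract. The function names `profile_column` and `quality_score`, the choice of completeness and uniqueness as metrics, and the equal weights are all assumptions made for illustration.

```python
from collections import Counter


def profile_column(values):
    """Single-column profile: row count, nulls, distinct values, and two ratios.

    Assumption: None and the empty string both count as missing.
    """
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    total = len(values)
    return {
        "rows": total,
        "nulls": total - len(non_null),
        "distinct": len(counts),
        # Share of rows that carry a value.
        "completeness": len(non_null) / total if total else 0.0,
        # Share of non-missing values that are distinct.
        "uniqueness": len(counts) / len(non_null) if non_null else 0.0,
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def quality_score(profile, weights=None):
    """Weighted average of ratio metrics in [0, 1].

    Assumption: equal weights by default; this is a hypothetical scoring
    rule, not the one defined by the authors.
    """
    weights = weights or {"completeness": 0.5, "uniqueness": 0.5}
    total_weight = sum(weights.values())
    return sum(profile[name] * w for name, w in weights.items()) / total_weight


# Hypothetical sample column with one None and one empty string.
emails = ["a@x.org", "b@x.org", None, "a@x.org", ""]
profile = profile_column(emails)
score = quality_score(profile)
print(profile)
print(round(score, 3))
```

In this sketch each metric is a ratio between 0 and 1, so the weighted average is also bounded by 0 and 1 and can be compared across columns. A real profiling tool would add further tasks, such as data type inference, value pattern detection, and cross-column dependency checks.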

Keywords

Data profiling tools; Data governance; Big data; Data quality control; Data management

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Wei Dai (1)
  • Isaac Wardlaw (2)
  • Yu Cui (3)
  • Kashif Mehdi (4)
  • Yanyan Li (2)
  • Jun Long (5)
  1. Information Science, University of Arkansas at Little Rock, Little Rock, USA
  2. Computer Science, University of Arkansas at Little Rock, Little Rock, USA
  3. College of Information Engineering, Guangdong Mechanical and Electrical Polytechnic, Guangzhou, China
  4. Software Development Group, Collibra Inc., New York, USA
  5. Information Science and Engineering, Central South University, Changsha, China
