Improving Data Quality Through Deep Learning and Statistical Models

  • Conference paper
Information Technology - New Generations

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 558)

Abstract

Traditional data quality control methods rely on users’ experience or previously established business rules. This reliance limits performance, makes the process very time consuming, and yields lower than desirable accuracy. Utilizing deep learning, we can leverage computing resources and advanced techniques to overcome these challenges and provide greater value to users.

In this paper, we first review relevant work and discuss machine learning techniques, tools, and statistical quality models. Second, we propose a novel data quality framework that combines deep learning with a statistical model algorithm for assessing data quality. Third, we use salary data from an open dataset published by the state of Arkansas to demonstrate how to identify outlier data and how to improve data quality via deep learning. Finally, we discuss future work.
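The abstract does not spell out the statistical model, so the following is only a minimal sketch of one common statistical quality control idea: control limits set at the mean plus or minus three standard deviations of a trusted baseline sample. The function names, the factor `k=3.0`, and the salary figures are all hypothetical illustrations; none of them come from the paper or from the Arkansas dataset, and the deep learning component of the framework is not shown.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits at mean +/- k sample standard deviations.

    `baseline` is assumed to be a sample of values already believed to be clean.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread


def flag_outliers(values, limits):
    """Return the values that fall outside the (lower, upper) limits."""
    lower, upper = limits
    return [v for v in values if v < lower or v > upper]


# Hypothetical annual salaries in USD (illustrative only).
baseline = [41000, 43500, 39800, 45200, 42100, 40700, 44300, 43000]
# New records to check; two of them look like keying errors.
incoming = [42500, 4250, 41900, 425000]

limits = control_limits(baseline)
suspects = flag_outliers(incoming, limits)
print(suspects)  # [4250, 425000]
```

Estimating the limits from a separate baseline, rather than from the records being checked, keeps a single extreme value from inflating the standard deviation and hiding itself.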

Author information

Correspondence to Wei Dai.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Dai, W., Yoshigoe, K., Parsley, W. (2018). Improving Data Quality Through Deep Learning and Statistical Models. In: Latifi, S. (eds) Information Technology - New Generations. Advances in Intelligent Systems and Computing, vol 558. Springer, Cham. https://doi.org/10.1007/978-3-319-54978-1_66

  • DOI: https://doi.org/10.1007/978-3-319-54978-1_66

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54977-4

  • Online ISBN: 978-3-319-54978-1

  • eBook Packages: Engineering, Engineering (R0)