Enhancing Textual Data Quality in Data Mining: Case Study and Experiences

  • Yi Feng
  • Chunhua Ju
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7867)

Abstract

Dirty data is widely recognized as a top challenge for data mining. Textual data is one type of data whose quality deserves closer study, so that the knowledge discovered from it is reliable. In this paper, we focus on textual data quality (TDQ) in data mining. Drawing on several years of data mining experience, we highlight three typical TDQ dimensions and their related problems: representation granularity, representation consistency, and completeness. To provide a real-world example of how to enhance TDQ in data mining, we then present a detailed case study set in the context of data mining in traditional Chinese medicine, covering three typical TDQ problems and their corresponding solutions. The case study is expected to help data analysts and miners attach more importance to TDQ issues and enhance TDQ for more reliable data mining.

Keywords

Data Mining, Data Quality, Textual Data Quality, Traditional Chinese Medicine

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Yi Feng (1)
  • Chunhua Ju (1, 2)
  1. School of Computer Science & Information Engineering, Zhejiang Gongshang University, Hangzhou, P.R. China
  2. Contemporary Business and Trade Research Center, Zhejiang Gongshang University, Hangzhou, P.R. China
