Abstract
As mentioned in Chap. 1, supervised statistical learning is currently the mainstream method for building practical systems, and large-scale annotated data are the foundation and prerequisite of this method. In the era of Internet big data, massive amounts of text, images, and video can be obtained easily. However, data taken directly from the Internet, or raw data from other sources such as medical records written by doctors, maintenance logbooks and job cards for airplanes, and chat records on WeChat or Twitter, often contain noise and many instances of ill-formed language. These problems hinder downstream modeling tasks, so the data must be preprocessed before use.
This chapter briefly introduces the basic methods of data acquisition and preprocessing.
Copyright information
© 2021 Tsinghua University Press
About this chapter
Cite this chapter
Zong, C., Xia, R., Zhang, J. (2021). Data Annotation and Preprocessing. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_2
DOI: https://doi.org/10.1007/978-981-16-0100-2_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0099-9
Online ISBN: 978-981-16-0100-2
eBook Packages: Computer Science, Computer Science (R0)