Data Annotation and Preprocessing

Text Data Mining

Abstract

As mentioned in Chap. 1, supervised statistical learning is currently the mainstream method for building practical systems, and large-scale annotated data are the foundation and prerequisite of this method. In the era of Internet big data, massive amounts of text, images, and video can be obtained easily. However, data taken directly from the Internet, as well as raw data from other sources such as medical records written by doctors, maintenance logbooks and job cards for airplanes, and chat records on WeChat or Twitter, often contain noise and many instances of ill-formed language that hinder models from performing their tasks, so these data must be preprocessed.

This chapter briefly introduces the basic methods of data acquisition and preprocessing.
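
To make the idea of noise removal concrete, the following is a minimal sketch of one possible cleaning step for text scraped from web pages. It is not the chapter's own procedure: the function name clean_text, the regular expressions, and the choice of NFKC normalization are assumptions made for illustration, and only the Python standard library is used.

```python
import html
import re
import unicodedata

# Hypothetical helper patterns (not from the chapter).
TAG_RE = re.compile(r"<[^>]+>")   # crude match for markup tags
WS_RE = re.compile(r"\s+")        # any run of whitespace


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and normalize characters and spacing."""
    text = TAG_RE.sub(" ", raw)                  # remove tags before decoding entities
    text = html.unescape(text)                   # e.g. "&amp;" becomes "&"
    text = unicodedata.normalize("NFKC", text)   # fold full-width forms to ASCII
    text = WS_RE.sub(" ", text)                  # collapse repeated whitespace
    return text.strip()


if __name__ == "__main__":
    print(clean_text("<p>Tom &amp; Jerry&nbsp;&nbsp;rocks!</p>"))
```

Tags are removed before entities are decoded so that escaped text such as "&lt;b&gt;" is kept as literal characters rather than being mistaken for markup. A regex-based tag stripper is only a rough approximation; real pages with scripts or malformed markup would need a proper HTML parser.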

Notes

  1. https://www.imdb.com/
  2. https://www.imdb.com/title/tt4912910/
  3. https://www.imdb.com/robots.txt
  4. https://www.w3school.com.cn/html/html_entities.asp
  5. https://opencc.byvoid.com/
  6. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
  7. https://www.nltk.org/api/nltk.tokenize.html
  8. https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Copyright information

© 2021 Tsinghua University Press

About this chapter

Cite this chapter

Zong, C., Xia, R., Zhang, J. (2021). Data Annotation and Preprocessing. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_2

  • DOI: https://doi.org/10.1007/978-981-16-0100-2_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-0099-9

  • Online ISBN: 978-981-16-0100-2

  • eBook Packages: Computer Science, Computer Science (R0)
