
Processing Written Language


Abstract

In this chapter, we will learn more about how text is processed automatically. Natural language processing (NLP) refers to the automated processing (including generation) of speech and text. We will look at some common NLP applications and explain frequently used methods from the field. We will deepen our machine learning knowledge and examine the advantages and disadvantages of deep learning and neural networks. Finally, we will understand how words from human language can be represented as mathematical vectors and why this is beneficial for machine learning.


Notes

  1. https://spacy.io
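
    As an aside, here is a minimal sketch of what using spaCy can look like in Python. The sentence and the model name en_core_web_sm are illustrative assumptions; the model must be downloaded before use.

        import spacy

        # Load spaCy's small English pipeline (installed beforehand with:
        #   python -m spacy download en_core_web_sm).
        nlp = spacy.load("en_core_web_sm")

        # Processing a text yields a Doc object with linguistic annotations.
        doc = nlp("Natural language processing turns text into structured data.")
        for token in doc:
            print(token.text, token.pos_, token.lemma_)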

  2. We will look at some more technical concepts, but within the scope of this book we will not go into all the details. If you want to know it all and are willing to dive deep into the technical parts, I can recommend the textbooks by Lane et al. (2019) and Hagiwara (2021) as follow-up reading to this book; they provide an applied, step-by-step description of different NLP methods.

  3. Choosing the top 3 words keeps the example simple. In a real case, we would want to choose a higher number of words, as in the sketch below.
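
    To illustrate, a minimal sketch of such a selection in Python, using the standard library's Counter; the sample text is made up:

        from collections import Counter

        text = "the cat sat on the mat and the dog sat on the rug"
        counts = Counter(text.split())

        # Keep only the n most frequent words; n=3 to match the example.
        top_words = [word for word, _ in counts.most_common(3)]
        print(top_words)  # ['the', 'sat', 'on']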

  4. In computer science, we often start counting with 0 and not 1 in such situations. But it takes some time to get used to this, so let's start counting with 1 here.

  5. In the example above, the numbers inside the vector were shown one above the other, and here they are shown on the same line. This is just to improve readability and has no specific meaning.

  6. There are also tools and libraries (existing software components) that can support the data engineer in automating some of these steps.

  7. If you are interested in a more mathematical introduction, I can refer you to Rashid (2017), which inspired some of the examples in this section.

  8. A matrix is a table of numbers. For example, a matrix can be formed by aggregating multiple vectors; each column or row of the table is then a vector.
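
    To make this concrete, a small sketch using NumPy; the vector values are invented for illustration:

        import numpy as np

        # Three word vectors with 4 dimensions each (made-up values).
        v1 = np.array([0.1, 0.3, 0.0, 0.2])
        v2 = np.array([0.5, 0.1, 0.4, 0.0])
        v3 = np.array([0.2, 0.2, 0.1, 0.3])

        # Stacking the vectors yields a matrix: each row is one vector.
        matrix = np.vstack([v1, v2, v3])
        print(matrix.shape)  # (3, 4): 3 rows (vectors), 4 columns (dimensions)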

  9. We could also have document embeddings or sentence embeddings, but let's stick for now to the fact that 1 word = 1 vector.

  10. This is similar to the one-hot vectors we have seen previously. However, one advantage of word embeddings as described here is the lower number of dimensions.
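
    A small sketch contrasting the two representations; the vocabulary and embedding values are invented for illustration:

        import numpy as np

        vocabulary = ["king", "queen", "man", "woman"]

        # One-hot: as many dimensions as there are words in the vocabulary.
        one_hot_king = np.zeros(len(vocabulary))
        one_hot_king[vocabulary.index("king")] = 1.0
        print(one_hot_king)  # [1. 0. 0. 0.]

        # Dense word embedding: far fewer dimensions than the vocabulary size
        # (3 here; real embeddings typically have 100 to 500, see note 13).
        embedding_king = np.array([0.8, 0.1, 0.4])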

  11. Note that the exact values of Vector(«Queen») computed in the example might not exist in the dictionary; therefore, the closest vector to the computed result will most probably be the best solution to the puzzle.
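
    A minimal sketch of this closest-vector lookup; the 3-dimensional vectors are invented for illustration, whereas real embeddings would come from a trained model:

        import numpy as np

        # Toy embedding dictionary (values invented for illustration).
        embeddings = {
            "king":  np.array([0.9, 0.8, 0.1]),
            "man":   np.array([0.5, 0.2, 0.1]),
            "woman": np.array([0.5, 0.2, 0.9]),
            "queen": np.array([0.9, 0.8, 0.9]),
        }

        def cosine(a, b):
            # Cosine similarity: 1.0 means the vectors point the same way.
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

        # king - man + woman: the result need not match any stored vector
        # exactly, so we return the closest one by cosine similarity.
        target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
        best = max(embeddings, key=lambda w: cosine(embeddings[w], target))
        print(best)  # 'queen' in this toy example

    Real implementations usually also exclude the query words (here king, man, and woman) from the candidate set before picking the closest vector.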

  12. Also, we will most likely be dealing with floating-point numbers such as 1.2 rather than integers such as 1 or 2.

  13. Typical word embeddings have 100 to 500 dimensions, depending on the corpus (text samples) the embeddings were trained on (Lane et al. 2019).

  14. Therefore, this kind of method is also referred to as self-supervised learning.

  15. If you are interested in more details, see Lane et al. (2019, p. 191).

  16. For a more detailed explanation of this, refer to Lane et al. (2019, p. 191).

References

  • Bolukbasi T, Chang KW, Zou JY, Saligrama V, Kalai AT (2016) Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in neural information processing systems, 29.

  • Firth JR (1962) A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis, Oxford.


  • Grave É, Bojanowski P, Gupta P, Joulin A, Mikolov T (2018) Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).


  • Hagiwara M (2021) Real-world natural language processing. Manning Publications.

  • Hancox P (1996) A brief history of natural language processing. Available at https://www.cs.bham.ac.uk/~pjh/sem1a5/pt1/pt1_history.html, last accessed 27.05.2023.

  • Jurafsky D, Martin JH (2023) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft 3rd Edition.


  • Lane H, Howard C, Hapke HM (2019) Natural language processing in action: understanding, analyzing and generating text with Python. Manning Publications.

  • Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.

  • Mikolov T, Grave É, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in Pre-Training Distributed Word Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

  • Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).

  • Plecháč P (2021) Relative contributions of Shakespeare and Fletcher in Henry VIII: An analysis based on most frequent words and most frequent rhythmic patterns. Digital Scholarship in the Humanities, 36(2), 430-438.


  • Rashid T (2017) Neuronale Netze selbst programmieren: ein verständlicher Einstieg mit Python. O'Reilly. Originally published in English as: Rashid T (2016) Make your own neural network. CreateSpace Independent Publishing Platform.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Kurpicz-Briki, M. (2023). Processing Written Language. In: More than a Chatbot. Springer, Cham. https://doi.org/10.1007/978-3-031-37690-0_3


  • DOI: https://doi.org/10.1007/978-3-031-37690-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-37689-4

  • Online ISBN: 978-3-031-37690-0

  • eBook Packages: Computer Science, Computer Science (R0)
