Abstract
In this chapter, we will learn more about how text is processed automatically. Natural language processing refers to the automated processing (including generation) of speech and text. We will look at some common natural language processing applications and explain common methods from the field. We will deepen our machine learning knowledge and examine the advantages and disadvantages of deep learning and neural networks. Finally, we will understand how words from human language can be represented as mathematical vectors and why this is beneficial for machine learning.
Notes
- 1.
- 2.
We will look at some more technical concepts but, within the scope of this book, will not go into all the details. If you want to know it all and are willing to dive deep into the technical parts, I can recommend the textbooks by Lane et al. (2019) and Hagiwara (2021) as follow-up reading to this book; they provide an applied, step-by-step description of different NLP methods.
- 3.
The top 3 words are chosen to keep the example simple. In a real case, we would choose a higher number of words.
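As a small illustration, here is a minimal sketch (using Python's standard library; the sample sentence is made up) of how the most frequent words of a text could be extracted:

    from collections import Counter

    text = "the cat sat on the mat and the dog sat on the floor"
    words = text.split()  # naive tokenization by whitespace

    # Count how often each word occurs and keep the 3 most frequent ones
    top_words = Counter(words).most_common(3)
    print(top_words)  # [('the', 4), ('sat', 2), ('on', 2)]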
- 4.
In computer science, we often start counting at 0 rather than 1 in such situations. But this takes some getting used to, so let's start counting at 1 here.
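For example, in Python, as in most programming languages, positions in a list are counted from 0:

    words = ["natural", "language", "processing"]
    print(words[0])  # the first element has index 0: "natural"
    print(words[2])  # the third element has index 2: "processing"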
- 5.
In the example above, the numbers inside the vector were written one above the other; here, they are written on the same line. This is just to improve readability and has no specific meaning.
- 6.
There are also tools and libraries (existing software components) that can support the data engineer in automating some of these steps.
- 7.
If you are interested in a more mathematical introduction, I can refer you to Rashid (2017), which inspired some of the examples in this section.
- 8.
A matrix is a table of numbers. For example, a matrix can be multiple vectors aggregated together. In such a matrix, each column or row of the table would then be a vector.
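As a minimal sketch with the numpy library (the vector values are made up), three word vectors can be stacked into a matrix in which each row is one vector:

    import numpy as np

    # Three made-up 4-dimensional word vectors
    v_cat = np.array([0.1, 0.8, 0.2, 0.5])
    v_dog = np.array([0.2, 0.7, 0.3, 0.4])
    v_car = np.array([0.9, 0.1, 0.6, 0.2])

    # Stacking them row-wise gives a 3x4 matrix:
    # each row of the table is one word vector
    matrix = np.vstack([v_cat, v_dog, v_car])
    print(matrix.shape)  # (3, 4)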
- 9.
We could also have document embeddings or sentence embeddings, but let's stick for now to the fact that 1 word = 1 vector.
- 10.
This is similar to the one-hot vectors we have seen previously. However, one advantage of word embeddings as described here is the lower number of dimensions.
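To illustrate the difference in dimensionality (all numbers in this sketch are made up): a one-hot vector needs one dimension per word in the vocabulary, whereas a dense word embedding gets by with far fewer dimensions:

    import numpy as np

    vocabulary_size = 50_000  # one dimension per word in the vocabulary

    # One-hot vector: 50,000 dimensions, all 0 except a single 1
    one_hot_cat = np.zeros(vocabulary_size)
    one_hot_cat[137] = 1  # hypothetical position of "cat" in the vocabulary

    # Dense word embedding: e.g., 300 dimensions with real-valued entries
    # (random placeholder values standing in for trained ones)
    embedding_cat = np.random.rand(300)

    print(one_hot_cat.shape, embedding_cat.shape)  # (50000,) (300,)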
- 11.
Note that the exact vector computed for Vector(«Queen») in the example might not exist in the dictionary; therefore, the vector closest to the computed result will most probably be the best solution to the puzzle.
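A minimal sketch of this lookup with made-up 3-dimensional vectors (real embeddings would have hundreds of dimensions and come from a trained model):

    import numpy as np

    # Made-up toy embeddings; real values would come from a trained model
    vectors = {
        "king":  np.array([0.8, 0.9, 0.1]),
        "man":   np.array([0.7, 0.2, 0.1]),
        "woman": np.array([0.6, 0.3, 0.8]),
        "queen": np.array([0.7, 0.9, 0.8]),
        "apple": np.array([0.1, 0.1, 0.3]),
    }

    # Vector("King") - Vector("Man") + Vector("Woman"):
    # the result does not match any dictionary entry exactly
    result = vectors["king"] - vectors["man"] + vectors["woman"]

    # Pick the dictionary entry closest to the computed result
    # (Euclidean distance here; cosine similarity is also common)
    closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - result))
    print(closest)  # "queen"

In practice, the input words themselves are usually excluded from the candidate list before picking the closest vector.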
- 12.
Also, we will likely be dealing with floating-point numbers such as 1.2 rather than integers such as 1 or 2.
- 13.
Word embeddings typically have 100 to 500 dimensions, depending on the corpus (text samples) they were trained on (Lane et al. 2019).
- 14.
Therefore, this kind of method is also referred to as self-supervised learning.
- 15.
If you are interested in more details, see Lane et al. (2019, p. 191).
- 16.
For a more detailed explanation of this, refer to Lane et al. (2019, p. 191).
References
Bolukbasi T, Chang KW, Zou JY, Saligrama V, Kalai AT (2016) Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29.
Firth JR (1962) A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis, Oxford.
Grave É, Bojanowski P, Gupta P, Joulin A, Mikolov T (2018) Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Hagiwara M (2021) Real-world natural language processing. Manning Publications.
Hancox P (1996) A brief history of natural language processing. Available at https://www.cs.bham.ac.uk/~pjh/sem1a5/pt1/pt1_history.html, last accessed 27.05.2023.
Jurafsky D, Martin JH (2023) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft 3rd Edition.
Lane H, Howard C, Hapke HM (2019) Natural language processing in action. Understanding, analyzing and generating text with Python. Manning Publications.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.
Mikolov T, Grave É, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in Pre-Training Distributed Word Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
Plecháč P (2021) Relative contributions of Shakespeare and Fletcher in Henry VIII: An analysis based on most frequent words and most frequent rhythmic patterns. Digital Scholarship in the Humanities, 36(2), 430-438.
Rashid T (2017) Neuronale Netze selbst programmieren: ein verständlicher Einstieg mit Python. O'Reilly. Originally in English: Rashid T (2016) Make your own neural network. CreateSpace Independent Publishing Platform.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kurpicz-Briki, M. (2023). Processing Written Language. In: More than a Chatbot. Springer, Cham. https://doi.org/10.1007/978-3-031-37690-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37689-4
Online ISBN: 978-3-031-37690-0