Abstract
In this chapter, we will learn more about how text is processed automatically. Natural language processing refers to the automated processing (including generation) of speech and text. We will look at some common natural language processing applications and explain common methods from the field. We will deepen our machine learning knowledge and examine the advantages and disadvantages of deep learning and neural networks. Finally, we will understand how words from human language can be represented as mathematical vectors and why this is beneficial for machine learning.
Notes
- 1.
- 2.
We will look at some more technical concepts but, within the scope of this book, will not go into all the details. If you want to know it all and are willing to dive deep into the technical parts, I can recommend the textbooks by Lane et al. (2019) and Hagiwara (2021) as follow-up reading to this book; they provide an applied, step-by-step description of different NLP methods.
- 3.
The top 3 words are chosen to keep the example simple. In a real case, we would choose a higher number of words.
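As a small illustration, here is a minimal sketch (using Python's standard library; the sample sentence is made up) of how the most frequent words of a text could be extracted:

    from collections import Counter

    text = "the cat sat on the mat and the dog sat on the floor"
    words = text.split()  # naive tokenization by whitespace

    # Count how often each word occurs and keep the 3 most frequent ones
    top_words = Counter(words).most_common(3)
    print(top_words)  # [('the', 4), ('sat', 2), ('on', 2)]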
- 4.
In computer science, we often start counting at 0 rather than 1 in such situations. But this takes some getting used to, so let's start counting at 1 here.
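For example, in Python, as in most programming languages, positions in a list are counted from 0:

    words = ["natural", "language", "processing"]
    print(words[0])  # the first element has index 0: "natural"
    print(words[2])  # the third element has index 2: "processing"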
- 5.
In the example above, the numbers inside the vector were written one above the other; here, they are written on the same line. This is just to improve readability and has no specific meaning.
- 6.
There are also tools and libraries (existing software components) that can support the data engineer in automating some of these steps.
- 7.
If you are interested in a more mathematical introduction, I can refer you to Rashid (2017), which inspired some of the examples in this section.
- 8.
A matrix is a table of numbers. For example, a matrix can be multiple vectors aggregated together. In such a matrix, each column or row of the table would then be a vector.
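As a minimal sketch with the numpy library (the vector values are made up), three word vectors can be stacked into a matrix in which each row is one vector:

    import numpy as np

    # Three made-up 4-dimensional word vectors
    v_cat = np.array([0.1, 0.8, 0.2, 0.5])
    v_dog = np.array([0.2, 0.7, 0.3, 0.4])
    v_car = np.array([0.9, 0.1, 0.6, 0.2])

    # Stacking them row-wise gives a 3x4 matrix:
    # each row of the table is one word vector
    matrix = np.vstack([v_cat, v_dog, v_car])
    print(matrix.shape)  # (3, 4)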
- 9.
We could also have document embeddings or sentence embeddings, but let's stick for now to the fact that 1 word = 1 vector.
- 10.
This is similar to the one-hot vectors we have seen previously. However, one advantage of word embeddings as described here is the lower number of dimensions.
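To illustrate the difference in dimensionality (all numbers in this sketch are made up): a one-hot vector needs one dimension per word in the vocabulary, whereas a dense word embedding gets by with far fewer dimensions:

    import numpy as np

    vocabulary_size = 50_000  # one dimension per word in the vocabulary

    # One-hot vector: 50,000 dimensions, all 0 except a single 1
    one_hot_cat = np.zeros(vocabulary_size)
    one_hot_cat[137] = 1  # hypothetical position of "cat" in the vocabulary

    # Dense word embedding: e.g., 300 dimensions with real-valued entries
    # (random placeholder values standing in for trained ones)
    embedding_cat = np.random.rand(300)

    print(one_hot_cat.shape, embedding_cat.shape)  # (50000,) (300,)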
- 11.
Note that the exact vector computed for Vector(«Queen») in the example might not exist in the dictionary; therefore, the vector closest to the computed result will most probably be the best solution to the puzzle.
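A minimal sketch of this lookup with made-up 3-dimensional vectors (real embeddings would have hundreds of dimensions and come from a trained model):

    import numpy as np

    # Made-up toy embeddings; real values would come from a trained model
    vectors = {
        "king":  np.array([0.8, 0.9, 0.1]),
        "man":   np.array([0.7, 0.2, 0.1]),
        "woman": np.array([0.6, 0.3, 0.8]),
        "queen": np.array([0.7, 0.9, 0.8]),
        "apple": np.array([0.1, 0.1, 0.3]),
    }

    # Vector("King") - Vector("Man") + Vector("Woman"):
    # the result does not match any dictionary entry exactly
    result = vectors["king"] - vectors["man"] + vectors["woman"]

    # Pick the dictionary entry closest to the computed result
    # (Euclidean distance here; cosine similarity is also common)
    closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - result))
    print(closest)  # "queen"

In practice, the input words themselves are usually excluded from the candidate list before picking the closest vector.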
- 12.
Also, we will likely be dealing with floating-point numbers such as 1.2 rather than integers such as 1 or 2.
- 13.
Word embeddings typically have 100 to 500 dimensions, depending on the corpus (text samples) they were trained on (Lane et al. 2019).
- 14.
Therefore, this kind of method is also referred to as self-supervised learning.
- 15.
If you are interested in more details, see Lane et al. (2019, p. 191).
- 16.
For a more detailed explanation of this, refer to Lane et al. (2019, p. 191).
References
Bolukbasi T, Chang KW, Zou JY, Saligrama V, Kalai AT (2016) Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29.
Firth JR (1962) A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis, Oxford.
Grave É, Bojanowski P, Gupta P, Joulin A, Mikolov T (2018) Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Hagiwara M (2021) Real-world natural language processing. Manning Publications.
Hancox P (1996) A brief history of natural language processing. Available at https://www.cs.bham.ac.uk/~pjh/sem1a5/pt1/pt1_history.html, last accessed 27.05.2023.
Jurafsky D, Martin JH (2023) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft 3rd Edition.
Lane H, Howard C, Hapke HM (2019) Natural language processing in action. Understanding, analyzing and generating text with Python. Manning Publications.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.
Mikolov T, Grave É, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in Pre-Training Distributed Word Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
Plecháč P (2021) Relative contributions of Shakespeare and Fletcher in Henry VIII: An analysis based on most frequent words and most frequent rhythmic patterns. Digital Scholarship in the Humanities, 36(2), 430-438.
Rashid T (2017) Neuronale Netze selbst programmieren: ein verständlicher Einstieg mit Python. O'Reilly. Originally in English: Rashid T (2016) Make your own neural network. CreateSpace Independent Publishing Platform.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kurpicz-Briki, M. (2023). Processing Written Language. In: More than a Chatbot. Springer, Cham. https://doi.org/10.1007/978-3-031-37690-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37689-4
Online ISBN: 978-3-031-37690-0