Interperforming in AI: question of ‘natural’ in machine learning and recurrent neural networks
This article offers a critical inquiry into contemporary neural network models as an instance of machine learning, from the interdisciplinary perspective of AI studies and performativity. It shows the limits placed on the architecture of these network systems by the misemployment of ‘natural’ performance, and it offers ‘context’, approached performatively, as a variable rather than a constant. The article begins with a brief review of machine learning-based natural language processing systems and then concentrates on the relevant model of recurrent neural networks, which is applied in most commercial research, such as that of Facebook AI Research. It demonstrates that the logic of performativity, an integral part of human performance and languaging, is not taken into account in recurrent nets, and it argues that recurrent network models, in particular, fail to grasp human performativity. This logic works similarly to the theory of performativity articulated by Jacques Derrida in his critique of John L. Austin’s concept of the performative. Applying Derrida’s work on performativity, and on linguistic traces as spatially organized entities that allow for this notion of performance, the article argues that recurrent nets fall into the trap of taking ‘context’ as a constant, of treating human performance as a ‘natural’ fix to be encoded rather than as performative. Lastly, the article applies its proposal more concretely to the case of Facebook AI Research’s Alice and Bob.
Keywords: Performativity · Machine learning · Natural language processing · Recurrent neural networks · Derrida · Facebook
A never-ending goal of research on human–machine interaction has been to achieve a state where humans and computers can have natural conversations. In recent decades, there have been overwhelming advances in the competency of computers to recognize, parse and understand language and images, and to generate responses in the context of conversations, especially with the help of new practical deep neural network models developed in relatively independent laboratories such as those of MIT and DFKI. In particular, when it comes to the use of these state-of-the-art models in daily life, it is imperative to take into account the developments in the laboratories of major commercial technology firms, such as Facebook and Google. In these laboratories, there are quite a few models that understand natural language in constrained domains and goal-oriented dialog contexts, no matter how stilted the digitized voices of these models might sound. Yet, even with the current advanced models developed in these laboratories, we still see artificial agents that generate speech in a limited number of situations for a limited number of purposes, such as reserving a table at a restaurant in the case of Google’s Duplex, or winning the quiz show Jeopardy! in the case of IBM’s Watson (Danaher 2018). In particular, machine learning-based natural language processing systems still struggle to model simple human-like communication, such as conversation. They do not go beyond constant contextual constraints to engage in a flowing conversation, one that would force the artificial agent (bot or AI assistant) to adjust to the human agent’s natural linguistic system, rather than the natural language system adapting to the artificial agent.
This paper unpacks the most commonly and commercially used NLP model, recurrent nets or recurrent neural networks (RNNs), to question the treatment of the “natural” in the process of designing and developing artificial agents in commercial laboratories such as Facebook AI Research. Recurrent nets are designed for pattern recognition in data such as text, numerical data or images. The algebraic functions of these nets are refined by the repeated insertion of similar data inputs, so as to approximate the way human memory operates when using language in the context of conversation. In other words, RNNs take the input as something they have already recognized. A present-day example of an input would be someone asking Amazon’s Alexa to play a particular song on Spotify (“Alexa, can you play Billie Eilish’s ‘Bad Guy’?”); Alexa completes this task by parsing the words, recognizing the traces of each word and the sentence structure as a whole by means of its previously recorded audio dataset, as well as grammar structures. In this sense, these models have two sorts of inputs at work: one from the past (encoded data) and one in the present (input data), which together determine how recurrent nets respond to new inputs. This procedure is accepted as equivalent to what humans do in real life and is termed ‘natural language processing’. This paper questions the treatment of the “natural” in RNN models in particular, and in machine learning-based NLP models in general. As we will see, the “recurrent” or repetitive dimension of language learning requires an examination of the performative aspects of language traces (as opposed to the formal abstraction of models that appeal to the “natural”).
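The two inputs at work in a recurrent net can be sketched in a few lines. The following is a minimal, illustrative example, not FAIR’s or any commercial architecture: the dimensions, random weights and dummy word vectors are all toy assumptions, but the core recurrence is the standard one, in which the “past” (the hidden state) is combined with the “present” (the current input) at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, input_size = 4, 3
W_xh = rng.normal(size=(hidden_size, input_size))   # present input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # encoded past -> hidden

def rnn_step(x_t, h_prev):
    """Combine the present input x_t with the encoded past h_prev."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

# Process a short sequence: the hidden state carries the 'past' forward,
# so the response to each new input depends on everything seen before.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # five dummy word vectors
    h = rnn_step(x_t, h)

print(h.shape)  # (4,)
```

Everything the network “remembers” of the sequence is compressed into that single hidden vector, which is precisely why, as argued below, the model’s notion of context is fixed by its encoding rather than performatively open.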
The article begins with a description of NLP architecture deployed in the research on machine learning, with an overview of the particular neural networks that have been found applicable for developing algorithmic agents toward commercial purposes. The next section outlines the processes through which language gets recorded and indexed into big datasets, and employed by recurrent neural networks based on machine learning. Then, in the third section, the article takes a theoretical detour through Jacques Derrida’s reworking of performativity based on his critique of John L. Austin’s work on the constative and performative utterances to underline the trouble with the natural, and to propose the performative as a better concept to grasp the functions of recurrent nets. The last section then applies this theoretical argument more concretely to the case of Alice and Bob, two prototypical AI bots (known as “Turkers”) developed by Facebook AI Research, and underlines some of the key problems with RNNs in particular.
2 The question concerning the “natural”: human likeness
Human-like performance is a common denominator in broad definitions of AI. One of the most cited books in the field, Russell and Norvig’s Artificial Intelligence: A Modern Approach, surveys a number of leading definitions and finds that ‘acting like a human’ and ‘human rationality’ are common characteristics (Russell and Norvig 2016). This is no surprise, since these definitions follow in Alan Turing’s footsteps, and Turing’s ideal computer was based on imitating human-like speech (Turing 1950). Various versions of Turing’s famous test illustrate why certain schools approach AI as an embodiment of human-like traits, and thus treat these projected embodiments as natural fixes. Contemporary research on the development of AI relies on a similar understanding of the human mind and sign systems, which, in most current cases, shows a synthesis of language and articulation, imitation and data processing.
Natural language processing is a comprehensive practice of predominantly statistical and broadly computational techniques used to analyze, parse and represent language (Allen 2006). NLP research began in earnest in the 1950s thanks to Turing’s work, which provided the field with a basic criterion for successful intelligence models, the Turing Test (Chowdhury 2003). NLP techniques have evolved from early models that processed single sentences in minutes in the 1950s to search engines that process large texts in seconds in the 2000s (Cambria 2014). NLP techniques are an integral part of digital tools from computers to smartphones, enabling tasks at multiple levels such as parsing, sentence breaking, part-of-speech tagging (POS), named-entity recognition (NER), optical character recognition (OCR) and machine translation.
At first glance, this interactive approach to research brings forward not only the importance of machine learning, but also the significance of conducting human-like interactions (Bennett et al. 2003). However, ‘interaction’ is a form/trace-mediated exchange that is separated from action, from how humans feel, and from how humans use multimodal activity. A closer look at the infrastructure of NLP models thus opens up the question of the ‘natural’ as it exists in the human world. Once the question is asked, it will be argued, misassociations with the ‘natural’ can be seen to place unnecessary limits on the architecture of NLP systems. Furthermore, in mistreating the human-like as a fixed norm, akin to the unchanging laws of physics, NLP systems also incorporate the ‘context’ component into machinic models as a constant rather than a variable.
There are multiple machine learning methods that process various forms of data through dimensions composed of representations. Among the recent methods and models employed to master the “natural,” those shaping the design of commercial AI products stand out. The model of recurrent neural networks (RNNs) has been put into operation in the designs of Facebook’s and Google’s AI teams. Facebook’s Mechanical Turker Descent researchers Yang et al. and Google Duplex engineers propose the interactive process as a better method for producing natural language, since humans learn a language in a natural environment (Yang et al. 2017; Leviathan and Matias 2018). It is therefore imperative to scrutinize the architecture of recurrent nets, a fundamental feature of major machine learning research on language processing. We will unpack the relations between recurrent nets and the “natural” to clarify the trouble with the ‘context’ component of these nets.
2.1 Recurrent neural networks
Whether recorded data are distributed similarly to the ‘human world’ or not is a prominent issue in NLP systems. If the data are not significantly different from those found in the human world, they are hypothesized to be reliable. The distributional hypothesis is related to the context component of any NLP system. Words are represented by vectors, and the hypothesis makes the strong assumption that words with similar meanings will be uttered only in similar contexts. Distributional vectors serve to capture the neighbors of a particular word within a bounded span of surrounding words known as a window. This procedure, called “word embedding,” computes similarities between vector representations through measures such as cosine similarity. Usually, these embeddings are obtained through initial training on large datasets, using auxiliary objectives such as predicting a word used in a particular context (Mikolov et al. 2010). Context is pictured as a psychological force that ‘works’ to extract semantic information from the data. Because windowing reduces the number of dimensions, word embeddings are useful for deciding on the context in NLP tasks. Machine learning-based NLP models in general, and RNNs in particular, represent not only words but also sentences using these distributed representations (word embeddings).
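The cosine comparison mentioned above can be made concrete with a small sketch. The embeddings here are hypothetical four-dimensional vectors chosen for illustration; real systems learn hundreds of dimensions from large corpora, and the words and values are assumptions, not data from any cited model.

```python
import numpy as np

# Toy word embeddings (hypothetical values for illustration only).
embeddings = {
    "computer": np.array([0.9, 0.1, 0.3, 0.7]),
    "laptop":   np.array([0.8, 0.2, 0.4, 0.6]),
    "banana":   np.array([0.1, 0.9, 0.8, 0.0]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine_similarity(embeddings["computer"], embeddings["laptop"])
sim_unrelated = cosine_similarity(embeddings["computer"], embeddings["banana"])
print(sim_related > sim_unrelated)  # True: 'computer' sits closer to 'laptop'
```

The point for the argument that follows is that similarity is computed entirely within the fixed geometry of the trained vectors: the “context” that produced them is frozen into the dataset at training time.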
Because RNNs process linguistic information sequentially, their algorithms can recognize the sequentiality embedded in language. RNNs parse language into the units of a sequence, and these units represent characters, words or sentences. In essence, RNNs take a lexical semantic approach to such units. Depending on use, each new unit may shift in semantic meaning according to the unit that precedes it (as in compound words). RNNs can capture these varying meanings through their re-adjustable, sequential function, which allows them to model texts of any length, from single words to long documents (Tang et al. 2015).
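That dependence of a unit’s meaning on the preceding unit can be shown directly: in the sketch below (toy vocabulary, random weights, all assumptions rather than any published model), the same token “bank” yields different hidden states depending on whether “river” or “money” came before it.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"the": 0, "bank": 1, "river": 2, "money": 3}
emb = rng.normal(size=(len(vocab), 3))   # toy token embeddings
W_xh = rng.normal(size=(4, 3))           # input-to-hidden weights
W_hh = rng.normal(size=(4, 4))           # hidden-to-hidden weights

def encode(tokens):
    """Run a minimal recurrence over a token sequence."""
    h = np.zeros(4)
    for tok in tokens:
        h = np.tanh(W_xh @ emb[vocab[tok]] + W_hh @ h)
    return h

# The final token is identical, but the preceding unit differs,
# so the resulting representations of 'bank' diverge.
h_river = encode(["the", "river", "bank"])
h_money = encode(["the", "money", "bank"])
print(np.allclose(h_river, h_money))  # False
```

This is the sense in which the network is “re-adjustable”: the representation of each unit is conditioned on the sequence so far, though, as argued below, only within the contexts the training data has already fixed.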
2.2 Mastering the “natural”
Various versions of the Turing Test have shown why research on AI strives to compose models with natural human characteristics such as language. In the representative version of the test, there are three participants: an interrogator, a human and a machine. If the interrogator cannot tell which is the human and which the machine, or if the machine can trick the interrogator, the machine passes the test as “intelligent.” Turing proposed his digital computer model as a self-learning machine using an audio dataset recorded to tape. The earlier models of imitative AI relied on the mediation of data between the machine and the human, or the mediation between “intelligences” on the basis of lingual signs (see Shannon and Weaver 1949).
Turing’s proposal for intelligence did not explicitly concern an AI’s understanding of natural language in a dialog but, rather, how language could be learned and imitated in order to pass the test. In this vein, the Turing design concentrates on mere imitation without recurrence. RNNs, however, are designed not only to imitate the natural but also to approximate and fix it through repetition. Their design incorporates learning the architecture of language. Performing beyond imitation requires the repetition of similar patterns in different contexts, and thus demands an embedded ability to learn structures in language. Repeating learned patterns, assembling small pieces into a meaningful whole, using that larger structure in a further step, and then reiterating the process helps RNN algorithms build exhaustive structures of words, sentences and dialogs from a limited dataset. The infrastructure of this particular NLP model highlights these themes of linguistic performance and contextualization, as well as an orientation toward an ultimate goal (for any NLP framework, this process of objectifying a task with a goal is known as incentivization; in other words, agents are intentionalized). The whole is associated with differing treatments of gathering, processing and generating data.
RNN architecture treats language as “natural”. Despite these innovative and powerful techniques, we argue that the concept of the “natural” needs to be problematized by taking a performative approach to language. Jacques Derrida’s articulation of performativity offers an analytical tool for this goal. Linguistic performativity provides us with context as a variable. Rather than fixing the meaning of “conversation as context”, one can then see the differing meanings of each word, sentence and text not only within a given dataset, but also in their interactively generated equivalents. As such, we now turn to a survey of the relevant aspects of Derrida’s view of the performative. The result is a problem immanent to the RNN architecture in particular and, it is suggested, to NLP systems in general.
3 Performativity and contextual possibilities
Every sign, linguistic or nonlinguistic, spoken or written… in a small or large unit, can be cited, put between quotation marks; in so doing it can break with every given context, engendering an infinity of new contexts in a manner which is absolutely illimitable. This does not imply that the mark is valid outside of a context, but on the contrary that there are only contexts without any center or absolute anchoring [ancrage]. This citationality, this duplication or duplicity, this iterability of the mark is neither an accident nor an anomaly, it is that (normal/abnormal) without which a mark could not even have a function called “normal.” What would a mark be that could not be cited? Or one whose origins would not get lost along the way? (Derrida 1988, 12)
A central concept of the theory of language as imitated and reiterated performance is performativity, highlighted by Jacques Derrida in ‘Signature Event Context’ (1988). Derrida critiques John L. Austin’s concept of performative utterances in How To Do Things With Words, where Austin put forward that performativity in languaging is based on action that includes ‘context’ and ‘consciousness’ and the human use of ‘illocutionary’ force. Austin identifies two forms of interrelated utterances: constative utterances, which describe something, and performative utterances, which refer to the act of performing what is said. In his view, statements such as “This computer is old” or “I bought a new computer” are descriptive or reported, rather than performative. Austin’s notion of ‘performative utterances’ relies on indulging in an act, such as a marriage confirmed by uttering the words “taking to be one’s wife”: “To name the ship is to say (in the appropriate circumstances) the words ‘I name &c.’ When I say, before the registrar or altar, etc., ‘I do’, I am not reporting on a marriage, I am indulging in it” (Austin 1962, 6).
Derrida concentrates not on action, but on signs and speech traces. In this way, he turns to the issue of “the conscious presence of the intention of the speaking subject in the totality of his speech act,” which, he claims, is a highly significant ingredient of performativity (Derrida 1988, 14). He uses the term ‘author’ for the origin and the norm of the language (which he tends to associate with writing) and, further, illustrates how the author’s intention can be de-contextualized in different uses of language by the speaking subject’s conscious intention and memory traces. All elements, Derrida stresses, are replete with ‘reiteration’ and ‘recitation.’ For there to be successful or unsuccessful speech, language must be cited and reiterated. The repetition of constatives, in Derrida’s view, slightly alters the meaning of speech acts in ways that are context sensitive. This is Derrida’s perspective on performativity.
Unlike Derrida, Austin denies that ‘the mediation of thinking’ is central to language. His aim was to trace language to the performative. Accordingly, he addresses not speech traces, but actual people who build and launch ships, get married, make avowals, do philosophy and engage in other similar ‘circumstances’. He subordinated the ‘semiotic’ to ‘speech acts’ whose illocutionary and perlocutionary force depends on circumstances. Eventually, he probably came to see that when one starts with speech acts, the performative/constative distinction is untenable. The constative cannot be opposed to the performative; in other words, constative acts are also performative. Austin’s concern was with human activity in which wordings play a part. He focused on languaging as a multimodal connector of people and things. As an activity, Austin’s view of languaging does not reduce to language processing.
Could a performative utterance succeed if its formulation did not repeat a “coded” or iterable utterance, or in other words, if the formula I pronounce in order to open a meeting, launch a ship or a marriage was not identifiable as conforming with an iterable model, if it were not then identifiable in some way as a “citation”? (Derrida 1988, 18)
For Derrida, the reiterability, the re-citability, of any semiotic sign/trace is an inherent characteristic of language. The origin and intention become insignificant as this reiteration proceeds, allowing the sign to be used for new possibilities. Iterability does not simply mean that any performative utterance is the repetition of a norm. Different contexts yield different results, and a reiteration cannot be pure. This general iterability assumes a residue that resists the return of the same in the process of variation between each re-uttered item. Each re-uttered item will vary in the process of constating. This differential residue makes an absolute re-presentation or return of the original impossible, displacing the spectral author, or the linguistic sign’s past, which becomes insignificant.
Our inquiry thus concerns, building on Derrida, how speech traces can be used performatively. Derrida’s view of the performative concerns ‘written’ marks that exist independently of performance. In his light, whereas traces cannot perform, living systems can. Derrida’s concern with traces overlaps with our concern with machine-derived and machine-manipulated traces. Language processing-based computational models deal with traces, not with Austin’s concern with what language does to living people. Accordingly, the question of how computational means can be used to classify and manipulate traces and elements is significant. We argue that traces can be treated as performative, with elements and variables tagged so as to pick out ‘speech act types’ and types of contexts.
We conclude that as long as context is not included in NLP models as a variable, there is no possibility of the new. Austin’s theory applies to the limited performance of AI agents within given norms and goals; Derrida’s critical approach, however, opens up a contextual, even creative, approach to language. The only invention would be the re-invention of the speech act, followed by a supervised narrowing that is sensitive to context. In the case of Facebook’s and Google’s AI research, the proposal is one of simple contexts run on enormous data, because the researchers conceive of unusual events as usual within big data (Yang et al. 2017; Michaely et al. 2017; Oord et al. 2017). FAIR’s categorization remains flexible thanks to the supervised learning process through repetition. The data structure of RNNs is inevitably encoded at the level of the algorithm, the grammar, the subject–object relation and the editorial code, and thus the possibility of new contexts is always limited by these operations. Hence, RNNs function through a fixed contextualization, and this restrains their ability to enter the critical domain in which contextual variations parasitize the original data and produce the new. What this means for us is that even those intentions and contexts that are not included in the master algorithm of RNNs could be included through the artificial agent’s entrance into the game of linguistic performativity, which constitutes the agent’s ability to do what it is not encoded to do.
4 The case of Alice and Bob
Catch only what you’ve thrown yourself, all is mere skill and little gain; but when you’re suddenly the catcher of a ball thrown by an eternal partner with accurate and measured swing toward you, to your center, in an arch from the great bridge-building of God: why catching then becomes a power - not yours, a world’s.
Rainer Maria Rilke
In the case of Mechanical Turker Descent, Facebook researchers understand natural language as a large dataset of interactive speech traces that can all be structured into speech-to-text or text-to-speech utterances. Although these double structures of speech are obviously more flexible than the convolutional neural networks that Yann LeCun (Conneau et al. 2016) was analyzing at the time of FAIR’s launch, they still come down to basic concepts that can be analyzed by specifying a relation between iterability and intention. This means that whatever is included in the given datasets can be successfully repeated by an intentional (goal-oriented) AI agent, the Turker. Yet this success is limited to a single repetition of the patterns in the fixed dataset. Interactive learning, in augmenting the patterns in the dataset, necessitates the recursion of different contextual uses of, say, the same element. This variability of contextual uses of inputs, however, is not included as a key component of RNNs. Context is taken merely as a basic conversation.
The performativity of a natural language is a threshold through which new contexts, differential acts and presences can take place. Although apparent recursivity is the aim of NLP programming, this process has overlooked linguistic performativity. Linguistic performativity lies at the center of machine learning-based NLP, and attending to it clarifies the constraints on such attempts to project the natural, that is, the performative, into the artificial. What is evident in recurrent nets is the scarcity of the relevant components and the underrepresentation of the contextual dimensions at work in linguistic performativity. We will now clarify this problem by introducing linguistic performativity into the analysis, providing a basic articulation of the missing component of context between the recursive and the performative through the case of FAIR’s Alice and Bob.
4.1 The trouble with Mastering the Dungeon
The objective of Facebook’s project is to see whether it is possible for AI agents “to engage in start-to-finish negotiations with other bots or people while arriving at common decisions or outcomes” (Lewis et al. 2017).1 The agents’ goal is to negotiate over a number of objects such as hats, books and balls. AI agents are supposed to learn how to negotiate with humans without letting them know that they are talking to a machine, and without exposing them to a mechanical performance of language. For this, as mentioned above, the constraint to be expanded at the heart of the neural network model is the conversational flow within the contextual constraints of the dungeon.
The conundrum in this particular case, and arguably in RNN models in general, is the way context is treated. From the perspective of linguistic performativity, context, which is fixed as dialog in neural networks, is itself a variable. The word and sentence embeddings of the network depend on the varying context. In each recurrence, the context moves along with the network, as each repeated word is added to the sentences. Each time a word is added as a new component to the conversation with the incentive/intention of getting closer to a particular goal, any success is deferred. This contextual deferring, however, naturally produces a form of linguistic performance that is relevant only to the task in itself. As the interaction takes place within the deferring context, it endows the word embeddings with deferred meanings in order to arrive at the balls or the books.
What Facebook’s researchers consider the new to be is any given entity in a certain language system made up of a text-to-speech function, and this performative operation restricts the possibility of enabling a deferring context that would otherwise be unconstrained. Some aspects of this problem could conceivably be corrected if the objective of this research were something other than designing entities for practical tasks of human–machine interaction.
In casting off the possibility of interactive performance, FAIR’s research underrepresents the contextual spirals that are needed for the master game plan. The problem with NLP is that it treats the world as a set of constants, as the “natural” is used in physics. This paper has argued that humans do not live in such a world; we live, perform, reiterate and recite traces, and language is no exception. The way we inhabit the world allows what Alan Turing calls imitation, and what Jacques Derrida calls repetition with differences. Each time we repeat a linguistic trace in the present, we do it differently from the way we did it in the past. The context, or space, revolves around words and sentences in relation to their previous uses. NLP’s language systems fail to grasp these contextual spirals, an integral part of human performance and languaging, since speech traces, as a consequence of actions/activity, have to be performative.
Recurrent nets, a leading protocol in NLP research, may generate certain abilities and performativities that were not available before, but these statistical network models are problematic with respect to the context component, as demonstrated by the interperformative consequences in the case of Alice and Bob. Suggesting an alternative machine learning model is beyond the scope of this article, since the problem is not internal to RNN models themselves but lies in the treatment of the ‘natural’ in recognition models in general. Given the constitutive limitations of these models, one would therefore expect context to be included as a variable, and linguistic performativity as a method for testing the success of artificial agents. Otherwise, as the case of Alice and Bob shows, these artificial agents will quite rapidly evolve into incomprehensible aliens.
FAIR published their early MTD research findings on the Facebook Code Blog—available at https://code.fb.com/ml-applications/deal-or-no-deal-training-ai-bots-to-negotiate/.
I would like to express my gratitude to my beloved one, who both visibly and invisibly interperformed with me in numerous spaces before, during and after the development of this manuscript. I should also thank my professor Denise Albanese (George Mason University) for her invaluable help in the process of initial revisions of the manuscript.
This research did not receive any specific grant from funding agencies in the public, commercial or non-profit sectors.
- Allen JF (2006) Natural language processing. In: Encyclopedia of cognitive science
- Austin JL (1962) How to do things with words. The William James lectures delivered at Harvard University in 1955. Clarendon Press, Oxford
- Bennett IM, Babu BR, Morkhandikar K, Gururaj P (2003) US Patent no. 6,665,640. US Patent and Trademark Office, Washington, DC
- Conneau A, Schwenk H, Barrault L, LeCun Y (2016) Very deep convolutional networks for natural language processing. arXiv preprint
- Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence representations from natural language inference data. arXiv preprint. arXiv:1705.02364
- Derrida J (1988) Signature event context. In: Limited Inc. Northwestern University Press, Evanston
- IBM (2018) The new AI innovation equation. IBM Blog. https://ibm.com/watson/advantage-reports/future-of-artificial-intelligence/ai-innovation-equation.html
- Kelly K, IBM (2018) What’s next for AI? Q&A with the co-founder of Wired Kevin Kelly. IBM Blog. https://ibm.com/watson/advantage-reports/future-of-artificial-intelligence/kevin-kelly.html
- Leviathan Y, Matias Y (2018) Google Duplex: an AI system for accomplishing real-world tasks over the phone. Google AI Blog. https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html
- Lewis M, Yarats D, Dauphin YN, Parikh D, Batra D (2017) Deal or no deal? Training AI bots to negotiate. Facebook Code. https://code.fb.com/ml-applications/deal-or-no-deal-training-ai-bots-to-negotiate/
- Michaely AH, Zhang X, Simko G, Parada C, Aleksic P (2017) Keyword spotting for Google Assistant using contextual speech recognition. In: Automatic speech recognition and understanding workshop (ASRU), 2017 IEEE. IEEE, pp 272–278
- Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S (2010) Recurrent neural network based language model. In: Proceedings of Interspeech, vol 2, p 3
- Oord AVD, Li Y, Babuschkin I, Simonyan K, Vinyals O, Kavukcuoglu K, Casagrande N (2017) Parallel WaveNet: fast high-fidelity speech synthesis. arXiv preprint. arXiv:1711.10433
- Shannon CE, Weaver W (1949) The mathematical theory of communication. University of Illinois Press, Urbana
- Tang D, Qin B, Liu T (2015) Document modeling with gated recurrent neural network for sentiment classification. In: Proceedings of the conference on empirical methods in natural language processing, pp 1422–1432
- Yang Z, Zhang S, Urbanek J, Feng W, Miller AH, Szlam A, Weston J (2017) Mastering the Dungeon: grounded language learning by Mechanical Turker Descent. arXiv preprint. arXiv:1711.07950
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.