Chatbot Interaction with Artificial Intelligence: human data augmentation with T5 and language transformer ensemble for text classification

In this work we present the Chatbot Interaction with Artificial Intelligence (CI-AI) framework as an approach to the training of a transformer based chatbot-like architecture for task classification with a focus on natural human interaction with a machine as opposed to interfaces, code, or formal commands. The intelligent system augments human-sourced data via artificial paraphrasing in order to generate a large set of training data for further classical, attention, and language transformation-based learning approaches for Natural Language Processing (NLP). Human beings are asked to paraphrase commands and questions for task identification for further execution of algorithms as skills. The commands and questions are split into training and validation sets. A total of 483 responses were recorded. Secondly, the training set is paraphrased by the T5 model in order to augment it with further data. Seven state-of-the-art transformer-based text classification algorithms (BERT, DistilBERT, RoBERTa, DistilRoBERTa, XLM, XLM-RoBERTa, and XLNet) are benchmarked for both sets after fine-tuning on the training data for two epochs. We find that all models are improved when training data is augmented by the T5 model, with an average increase of classification accuracy by 4.01%. The best result was the RoBERTa model trained on T5 augmented data which achieved 98.96% classification accuracy. Finally, we found that an ensemble of the five best-performing transformer models via Logistic Regression of output label predictions led to an accuracy of 99.59% on the dataset of human responses. A highly-performing model allows the intelligent system to interpret human commands at the social-interaction level through a chatbot-like interface (e.g. “Robot, can we have a conversation?”) and allows for better accessibility to AI by non-technical users.


Introduction
Attention-based and transformer language models are a rapidly growing field of study within machine learning and artificial intelligence, with applications beyond. The field of Natural Language Processing has especially been advanced through transformers due to their approach to reading being more akin to human behaviour than classical sequential techniques. With many industries turning to Artificially Intelligent solutions by the day, models have a growing requirement for robustness, explainability, and accessibility, since AI solutions are becoming more and more popular for those without specific technical backgrounds in the field. Another field that is similarly being seen more often is that of Data Augmentation; that is, creating data from a set that in itself increases the quality of that set of data. The alternative to data augmentation, which is unfortunately the case with many modern NLP systems, is to gather more data, which can raise privacy concerns. Data scientists may instead find ways to augment the data they already hold as a friendlier alternative.
In this study, we bring together all of these aforementioned concepts and fields of study to form a system that we call Chatbot Interaction with Artificial Intelligence (CI-AI). A general overview of the approach can be observed in Figure 1. As an alternative to writing code and managing data, complex machine learning tasks such as conversational AI, sentiment analysis, scene recognition, brainwave classification and sign language recognition among others are given accessibility through an interface of natural, social interaction via both verbal and non-verbal communication. That is, for example, a spoken command of "can we have a conversation?" or a sign language command of "can-we-talk" would command the system to launch a conversational AI program. For such a system to be possible, it needs to be robust, since an interactive system that makes one mistake for many successes would be considered a broken system. The system needs to be accessible to a great number of people with differing backgrounds, and thus must have the ability to generalise by being exposed to a large amount of training data. Last, but by no means least, the system needs to be explainable; as given in a later example, if a human were to utter the phrase, "Feeling sad today. Can you cheer me up with a joke?", which features within that phrase lead to a correct classification and command to the chatbot to tell a joke? Where does the model focus within the given text in order to correctly predict and fulfil the human's request? Thus, to achieve these goals, the scientific contributions of this work are as follows:
1. The collection of a 7-class command-to-task dataset from multiple human beings from around the world, giving a total of 483 data objects.
2. Augmentation of the human data with a transformer-based paraphrasing model which results in a final training dataset of 13,090 labelled data objects.
3. Benchmarking of 7 State-of-the-Art transformer-based classification approaches for text-to-task commands. Each model is trained on the real training data and evaluated on the validation data, and is then trained on the real training data plus the paraphrased augmented data and evaluated on the same validation data. We find that all 7 models are improved significantly when exposed to augmented data.
4. A deep exploration of the best model. Firstly, the small number of errors (1.04%) and their causes are examined by inspecting the largest errors in terms of loss and the class probability distributions. Secondly, the chatbot is given commands that were not present during training or validation, and top features (words) are observed; interestingly, given their technical nature, the models focus keenly on varying parts of the sentence, similar to a human reading.
The rest of this article is structured as follows. Initially, the background and related studies are explored in Section 2. The method of the experiments is described in Section 3, and the results from the experiments are then presented in Section 5. With the best-performing model in mind, Section 6 then explores the model in terms of the small number of errors made, and how the model interprets new and unseen data (i.e., should the model be in deployment). Finally, conclusions are drawn and future work is suggested in Section 7.
Figure 2: An eye-tracking study of natural reading from [19]. The reader's gaze naturally follows a left-to-right reading pattern with a fluctuation back to the main area of interest, where the main reading time is greater than that of the rest of the sentence.

Background and Related Works
The Transformer is a new concept in the field of deep learning [1]. Transformers currently have a primary focus on NLP, but state-of-the-art image processing using similar networks has recently been explored [2]. With the idea of paying attention in mind, the theory behind the exploration of Transformers in NLP is their more natural approach to sentences; rather than focusing on one token at a time in the order that they appear and suffering from the vanishing gradient problem [3], Transformer-based models instead pay attention to tokens in a learned order and as such enable more parallelisation while improving upon many NLP problems, through which many benchmarks have been broken [1,4]. For these reasons, such approaches are rapidly forming State-of-the-Art scores for many NLP problems [5]. For text data in particular these include generation [6,7], question answering [8,9], sentiment analysis [10,11], translation [12][13][14], paraphrasing [15,16], and classification [17,18].
According to [1], Transformers are based on the calculation of scaled dot-product attention units. These weights are calculated for each word within the input vector of words (document or sentence). The output of the attention unit is a set of embeddings for a combination of relevant tokens within the input sequence. This is shown later on in Section 6, where both correctly and incorrectly classified input sequences are highlighted with top features that lead to such a prediction. Weights for the query $W_q$, key $W_k$, and value $W_v$ are calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimensionality of the keys. The query is an object within the sequence, the keys are vector representations of said input sequence, and the values are produced given the query against keys. Unsupervised models receive Q, K and V from the same source and thus pay self-attention. For tasks such as classification and translation, K and V are derived from the source and Q is derived from the target. For example, Q could be a class for the text to belong to, i.e., for sentiment analysis "positive" and "neutral", and thus the prediction of the classification model. Secondly, for translation, values K and V could be derived from the English sentence "Hello, how are you?" and Q the sequence "¿Hola, como estas?" for supervised English-Spanish machine translation. All of the State-of-the-Art models benchmarked in these experiments follow the concept of Multi-headed Attention. This is simply a concatenation of multiple attention heads $h_i$ to form a larger network of interconnected attention units:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_n)\,W_o, \qquad h_i = \mathrm{Attention}\left(QW_q^{(i)}, KW_k^{(i)}, VW_v^{(i)}\right),$$

where $W_q^{(i)}$, $W_k^{(i)}$ and $W_v^{(i)}$ are the weights of the $i$-th head and $W_o$ is the output projection of [1].

It is important to note that human beings also do not read in a token-sequential manner, as is the case with classical models such as the Long Short Term Memory (LSTM) network [20]. Figure 2, from a 2019 study on reading comprehension [19], shows human behaviour while reading. It can be observed from this example and other related studies [21][22][23] that, rather than simply reading left-to-right (or right-to-left [23,24]), attention is instead paid to areas of interest within the document. Of course, a human being does not follow the equations previously described, but it can be noted that attention-based models are more similar to human reading comprehension than sequential models such as the LSTM. Later, in Section 6, during the exploration of top features within correct classifications, it can be observed that RoBERTa also focuses upon select areas of interest within a text for prediction.
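The scaled dot-product and multi-head attention units described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation inside any of the benchmarked models: the weights are random, the dimensions are arbitrary, and the names `scaled_dot_product_attention`, `multi_head_attention`, `heads` and `W_o` are chosen here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

def multi_head_attention(X, heads, W_o):
    # heads: one (W_q, W_k, W_v) projection triple per attention head.
    outputs = []
    for W_q, W_k, W_v in heads:
        out, _ = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
        outputs.append(out)
    # Concatenate the head outputs and apply the output projection.
    return np.concatenate(outputs, axis=-1) @ W_o

# Toy self-attention example: 5 tokens, model width 8, 2 heads (all assumed values).
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))

context = multi_head_attention(X, heads, W_o)
_, attention_weights = scaled_dot_product_attention(X, X, X)
```

Each row of `attention_weights` is a probability distribution over the input tokens, which is the sense in which the model "pays attention" to some tokens more than others.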
The Text-to-Text Transfer Transformer (T5) model is a unified approach to text transformers from Google AI [25]. T5 aims to unify NLP tasks by restricting output to text, which is then interpreted to score the learning task; for example, it is natural to have a text output for a translation task (as per the previous example on English-Spanish translation), but for classification tasks, on the other hand, a sparse vector for each prediction is often expected, whereas T5 would instead output a textual representation of the class(es). This feature allows T5 to be extended to many NLP tasks outside of those suggested and benchmarked in the original work. To give an example specific to this study, an English-English translation of the example "what time is it right now?" to "could you tell me the time, please?" provides a paraphrasing activity; that is, expressing the same meaning of a text written in a different way.
Chatbots are a method of human-machine interaction that have transcended novelty to become a useful technology of the modern world. A biological signal study from 2019 (muscular activity, respiration, heart rate, and electrical behaviours of the skin) found that textual chatbots provide a more comfortable platform of interaction than more human-like animated avatars, which caused participants to grow uncomfortable within the uncanny valley [26]. Many chatbots exist as entertainment and as forms of art, such as in 2018 [27], when natural interaction was enabled via state-of-the-art methods for character generation from text [28]. This allowed 10,000 visitors to converse with 19th century characters from Machado de Assis' "Dom Casmurro". It has been strongly suggested through multiple experiments that natural interaction with chatbots will provide a useful educational tool in the future for students of varying ages [29,30,31]. The main open issue in the field of conversational agents is data scarcity, which in turn can lead to unrealistic and unnatural interaction; overcoming these is a requirement for the Loebner Prize, which is based on the Turing test [32]. Solutions have been offered such as data selection of input [33], input simplification and generalisation [34], and more recently paraphrasing of data [35]. These recent advances in data augmentation by paraphrasing in particular have shown promise in improving conversational systems by increasing understanding of naturally spoken language [36,37].

Proposed Approach
In this section, the proposed approach followed by the experiments is described, from data collection to modes of learning and classification.
The main aim of this work is to enable accessibility to previous studies, and in particular the machine learning models derived throughout them. Accessibility is presented in the form of social interaction, where a user requests to use a particular system via natural language and the task is derived and performed. The seven commands are:
• Scene Recognition [38] - The participant requests a scene recognition algorithm to be instantiated; a camera and microphone are activated for multi-modality classification.
• EEG Classification - The participant requests an EEG classification algorithm to be instantiated and begins streaming data from a MUSE EEG headband. There are two algorithms:
  - EEG Mental State Classification [39] - Classification of whether the participant is concentrating, relaxed, or neutral.
  - EEG Emotional State Classification [40] - Classification of emotional valence: positive, negative, or neutral.
• Sentiment Analysis of Text [41] - The participant requests the instantiation of a sentiment analysis classification algorithm for a given text.
• Sign Language Recognition [42] - The participant requests to converse via sign language; a camera and Leap Motion are activated for multi-modality classification. Sign language is now accepted as input to the task-classification layer of the chatbot.
• Conversational AI [34] - The participant requests to have a conversation; a chatbot program is executed.
• Joke Generator [43,44] - The participant requests to hear a joke; a joke-generator algorithm is executed and output is printed.
Each of the given commands is requested in the form of natural social interaction (either by keyboard input, speech converted to text, or sign language converted to text), and through accurate recognition, the correct algorithm is executed based on classification of the human input. Tasks such as sentiment analysis of text and emotional recognition of EEG brainwaves, and mental state recognition compared to emotional state recognition, are requested in similar ways and as such constitute a difficult classification problem. For these problems, minute lingual details must be recognised in order to overcome ambiguity within informal communication.
Figure 3 shows the overall view of the system. Keyboard input text, or speech and sign language converted to text, provide an input of natural social interaction. The chatbot, trained on the tasks, classifies which task has been requested and executes said task for the human participant. Sign language, due to its need for an active camera and hand-tracking, is requested and activated via keyboard input or speech and itself constitutes a task. In order to derive the 'Chatbot' component shown in bold, Figure 4 shows the training processes followed. Human data is gathered via questionnaires, which gives a relatively small dataset (even though many responses were gathered, the nature of NLP tends to require a large amount of mined data), split into training and testing instances. The first experiment is built upon this data, and State-of-the-Art transformer classification models are benchmarked. In the second set of more complex experiments, the T5 paraphrasing model augments the training data and generates a large dataset, on which the same models are then benchmarked with the same validation data in order to provide a direct comparison of the effects of augmentation.

Figure 3: Overall view of the Chatbot Interaction with Artificial Intelligence (CI-AI) system as a looped process guided by human input, through natural social interaction due to the language transformer approach. The chatbot itself is trained via the process in Figure 4.
A questionnaire was published online for users to provide human data in the form of examples of commands that would lead to a given task classification. Five examples were given for each, and Table 1 shows some examples that were presented. The questionnaire instructions were introduced with "For each of these questions, please write how you would state the text differently to how the example is given. That is, paraphrase it. Please give only one answer for each. You can be as creative as you want!". Two examples were given that were not part of any gathered classes: "If the question was: 'How are you getting to the cinema?' You could answer: 'Are we driving to the cinema or are we getting the bus?'" and "If the question was: 'What time is it?' You could answer: 'Oh no, I slept in too late... Is it the morning or afternoon? What's the time?'". These examples were designed to show the users that creativity and diversion from the given example was not just acceptable but also encouraged, so long as the general meaning and instruction of and within the message was retained (the instructions ended with "The example you give must still make sense, leading to the same outcome."). Extra instructions were given as and when requested. No participant submitted the example phrases themselves, and no duplicate responses were submitted. A total of 483 individual responses were recorded. The answers were split 70/30 on a per-class basis to provide two class-balanced datasets, firstly for training (and augmentation), and secondly for validation. That is, regardless of augmentation, every model is tested on this same validation set, and the models are thus directly comparable in terms of their learning abilities.
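The per-class 70/30 split described above can be sketched as follows. This is an illustrative sketch only: the function name `stratified_split`, the fixed seed, and the placeholder responses are assumptions made here, and the text does not state which tool was actually used to perform the split.

```python
import random
from collections import defaultdict

def stratified_split(examples, train_fraction=0.7, seed=0):
    """Split (text, label) pairs into training and validation sets per class."""
    by_class = defaultdict(list)
    for text, label in examples:
        by_class[label].append(text)
    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_class):
        texts = by_class[label][:]
        rng.shuffle(texts)
        # The same fraction is taken from every class, keeping both sets balanced.
        cut = round(len(texts) * train_fraction)
        train += [(t, label) for t in texts[:cut]]
        validation += [(t, label) for t in texts[cut:]]
    return train, validation

# Placeholder data: 10 made-up responses for each of two classes.
toy = [(f"joke request {i}", "JOKE") for i in range(10)]
toy += [(f"chat request {i}", "CHAT") for i in range(10)]
train, validation = stratified_split(toy)
```

Splitting before augmentation matters here: paraphrases are generated only from the training portion, so no paraphrase of a validation response can leak into training.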
The T5 paraphrasing model, which was trained on the Quora question pairs dataset [45], is executed a maximum of 50 times for each statement within the training set; the model stops generating paraphrases once either no further paraphrases are possible or 50 have been produced. Once each statement had been paraphrased, a random subsample of the paraphrased data was taken on a per-class basis, with the sample size set to the number of data objects within the least common class (sign language). Concatenated with the real training data, this formed a dataset of 13,090 examples (1,870 per class). This dataset thus constitutes the second training set for the second experiment, in order to compare the effects of data augmentation for the problem presented.
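The augmentation and balancing procedure above can be sketched as below. The T5 model is replaced by a hypothetical stand-in, `toy_paraphrase`, so that the example runs without any model download; `augment_and_balance` and its parameters are likewise names assumed here for illustration, and the exact subsampling details of the original experiments may differ.

```python
import random
from collections import defaultdict

def augment_and_balance(train, paraphrase, max_paraphrases=50, seed=0):
    """Paraphrase each (text, label) pair, then balance the synthetic data per class."""
    synthetic = defaultdict(set)
    for text, label in train:
        # Keep distinct paraphrases only, excluding the source text itself.
        unique = list(dict.fromkeys(p for p in paraphrase(text) if p != text))
        synthetic[label].update(unique[:max_paraphrases])
    # Subsample every class down to the size of the least common class.
    target = min(len(items) for items in synthetic.values())
    rng = random.Random(seed)
    augmented = list(train)
    for label in sorted(synthetic):
        chosen = rng.sample(sorted(synthetic[label]), target)
        augmented += [(p, label) for p in chosen]
    return augmented

def toy_paraphrase(text):
    # Hypothetical stand-in for the T5 paraphraser: templated rewrites only.
    n = 3 if "joke" in text else 2
    return [f"{text} (variant {i})" for i in range(n)]

train = [("tell me a joke", "JOKE"), ("make me laugh with a joke", "JOKE"),
         ("can we talk", "CHAT"), ("let us chat", "CHAT")]
augmented = augment_and_balance(train, toy_paraphrase)
```

In this toy run the 'JOKE' class yields six paraphrases and 'CHAT' yields four, so four are sampled from each and both classes end with six examples once the real data is added back.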
Table 2 shows the models that are trained and benchmarked on the two training sets (Human, Human+T5), and validated on the same validation dataset. It can be observed that the models are complex, and training requires a relatively high amount of computational resources. Due to this, the pre-trained weights for each model are fine-tuned for two epochs on each of the training datasets.

Statistical Ensemble of Transformer Classifiers
Following the results detailed later in Section 5, two main findings were made: 1) all models were improved by T5 augmentation, and 2) XLM and XLNet were weak solutions to the problem and scored relatively low classification scores. Following these findings, an extension to the study through an ensemble method is devised which combines the five strong models when trained on paraphrased data, which can be observed in Figure 5. The training and test datasets are firstly distilled into a numerical vector of five predictions made by the five transformer models. Following this, statistical machine learning models are trained on the training set and validated by the test set in order to discern whether combining the models together ultimately improves the ability of the model. The reasoning behind a statistical ensemble is that it enables possible improvements to a decision system's robustness and accuracy [53]. Given that nuanced differences between the transformers may lead to 'personal' improvements in some situations and negative impacts in others, for example when certain phrases appear within commands, a more democratic approach may allow the pros of some models to outweigh the cons of others. Employing a statistical model to learn these patterns by classifying the class based on the outputs of the previous models would thus allow said ML model to learn these nuanced differences between the transformers.
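The ensemble idea above, in which a statistical model is trained on the label predictions of five transformers, can be sketched with scikit-learn. The transformer outputs are simulated here with random noise (`simulate_predictions` is a hypothetical stand-in, and the 5% error rate is an arbitrary assumption), and one-hot encoding the nominal predictions is one reasonable choice made for this sketch rather than a detail confirmed by the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_classes, n_models = 7, 5

def simulate_predictions(y, error_rate=0.05):
    # Toy stand-in for the label outputs of five fine-tuned transformers:
    # each column copies the true label and is replaced at random positions.
    preds = np.tile(y[:, None], (1, n_models))
    wrong = rng.random(preds.shape) < error_rate
    preds[wrong] = rng.integers(0, n_classes, size=int(wrong.sum()))
    return preds

y_train = rng.integers(0, n_classes, size=700)
y_test = rng.integers(0, n_classes, size=140)
X_train = simulate_predictions(y_train)
X_test = simulate_predictions(y_test)

# Predicted labels are nominal, so they are one-hot encoded for the meta-classifier.
ensemble = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
ensemble.fit(X_train, y_train)
ensemble_predictions = ensemble.predict(X_test)
accuracy = float((ensemble_predictions == y_test).mean())
```

Because each row holds only five nominal values, the meta-classifier is very cheap to train compared with fine-tuning the transformers themselves.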

Experimental Hardware and Software
The experiments were executed on an NVidia Tesla K80 GPU, which has 4992 CUDA cores and 24 GB of GDDR5 memory, via the Google Colab platform. The Transformers were implemented via the KTrain library [54], which is a wrapper for TensorFlow [55] Keras [56]. The pretrained weights for the Transformers prior to fine-tuning were from the HuggingFace NLP Library [49].
The statistical models for the ensemble results were implemented in Python via the Scikit-learn toolkit [59] and executed on an Intel Core i7 Processor (3.7GHz).

Results
Table 3 shows the overall results for all of the experiments. Every single model, even XLNet, the weakest for this particular problem, was improved when training on the human data alongside the augmented data, as can be seen from the increases in metrics in Table 4. This required a longer training time due to the more computationally intense nature of training on a larger dataset. T5 paraphrasing for data augmentation led to an average accuracy increase of 4.01 points, and the precision, recall, and F1 scores were also improved by an average of 0.05, 0.05, and 0.07, respectively.
The best performing model was RoBERTa when training on the human training set as well as the augmented data. This model achieved 98.96% accuracy with 0.99 precision, recall and F1 score. The alternative of training only on the human data achieved 97.93% accuracy with stable precision, recall and F1 scores of 0.98. The second best performing models were both the distilled version of RoBERTa and BERT, which achieved 98.55% and likewise 0.98 for the other three metrics. Interestingly, some models saw a drastic increase in classification ability when data augmentation was implemented; the BERT model rose from 90.25% classification accuracy with 0.93 precision, 0.9 recall and 0.9 F1 score by +8.3 points, and then to more stable metrics of 0.99 each. In the remainder of this section, the 98.96% performing RoBERTa model when trained upon human and T5 data is explored further. This includes exploration of errors made overall and per specific examples, as well as an exploration of top features within successful predictions made.
Figure 6 shows a comparison between the model performance and number of trainable parameters. Note that the most complex model scored the least in terms of classification ability. The best performing model was the third most complex model of all. The least complex model, DistilBERT, achieved a relatively high accuracy of 98.34%.

Exploration of the best transformer model
In this section, we explore the best model. The best model, as previously discussed, was the RoBERTa model when training on both the collected training data and the paraphrased data generated by the T5 model. Table 5 shows the classification metrics for each individual class by the RoBERTa model. The error matrix for the validation data can be seen in Figure 7. The tasks of EEG mental state classification, scene recognition, and sentiment analysis were classified perfectly. Of the imperfect classes, the task of conversational AI ('CHAT') was sometimes misclassified as a request for a joke, which is likely due to the social nature of the two activities. EEG emotional state classification was rarely misclassified as the mental state recognition and sentiment analysis tasks; the former is likely due to the two EEG tasks being closely related, and the latter since the data often involved terms synonymous with valence or emotion. Similarly, the joke class was also rarely misclassified as sentiment analysis, for example, "tell me something funny" and "can you read this email and tell me if they are being funny with me?" ('funny' in the second context being a British slang term for sarcasm). The final class with misclassified instances was sentiment analysis, as emotional state recognition, for the same reason previously described when the error occurred vice-versa.
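A row-normalised confusion (error) matrix of the kind discussed above can be computed as follows. The labels in the example are made-up toy values, not results from the experiments, and `normalised_confusion_matrix` is a name chosen here for illustration.

```python
import numpy as np

def normalised_confusion_matrix(y_true, y_pred, n_classes):
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for true_label, predicted_label in zip(y_true, y_pred):
        counts[true_label, predicted_label] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Classes with no examples are left as zero rows instead of dividing by zero.
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts), where=row_totals > 0)

# Toy labels for three classes (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 0]
matrix = normalised_confusion_matrix(y_true, y_pred, n_classes=3)
```

Entry `matrix[i, j]` is the fraction of class `i` examples predicted as class `j`, so a perfectly classified class has 1.0 on the diagonal of its row.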

Mistakes and probabilities
In this section, we explore the biggest errors made when classifying the validation set by considering their losses.
Table 6 shows the most confusing data objects within the validation set and Figure 8 explores which parts of the phrase the model focused on to derive these erroneous classifications. Overall, only five misclassified sentences had a loss above 1; the worst losses were in the range of 1.05 to 6.24. The first phrase, "what is your favourite one liner?", may likely have caused confusion due to the term "one liner", which was not present within the training set. Likewise, the term "valence" in "What is the valence of my brainwaves?" was also not present within the training set, and the term "brainwaves" was most common when referring to mental state recognition rather than emotional state recognition.
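Ranking validation examples by loss, as done above, can be sketched as follows, assuming the loss is the usual categorical cross-entropy (the negative log of the probability assigned to the correct class). The first example uses the two probabilities reported in this section for "Run emotion classification"; the 'OTHER' entry lumping the remaining classes together and the second example are assumptions made purely for illustration.

```python
import math

def cross_entropy_loss(probabilities, true_class):
    # Loss for one example: negative log-probability of the correct class.
    return -math.log(probabilities[true_class])

def most_confusing(examples, threshold=1.0):
    """Return (loss, text) pairs with loss above the threshold, worst first."""
    scored = [(cross_entropy_loss(probs, true_class), text)
              for text, probs, true_class in examples]
    return sorted((item for item in scored if item[0] > threshold), reverse=True)

examples = [
    ("Run emotion classification",
     {"EEG-EMOTIONS": 0.672, "SENTIMENT-ANALYSIS": 0.32, "OTHER": 0.008},
     "SENTIMENT-ANALYSIS"),
    ("tell me a joke",  # made-up, confidently correct example
     {"JOKE": 0.9, "CHAT": 0.1},
     "JOKE"),
]
worst = most_confusing(examples)
```

Under this assumption a correct-class probability of 0.32 gives a loss of about 1.14, which is why such an example would appear among those with a loss above 1.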
An interesting error occurred from the command "Run emotion classification", where the classification was incorrectly given as EEG emotional state recognition rather than Sentiment Analysis. The command collected from a human subject was ambiguous, and as such the two most likely classes were the incorrect EEG Emotions at a probability of 0.672 and the correct Sentiment Analysis at a probability of 0.32. This raises an issue to be explored in future works: given the nature of natural social interaction, it is likely that ambiguity will be present during conversation. Within this erroneous classification, two classes were far more likely than all other classes present, and thus a choice between the two in the form of a question, akin to human deduction of ambiguous language, would likely solve such problems and increase accuracy. Additionally, this would rarely incur the requirement of further effort from the user.
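The clarifying-question idea suggested above as future work could look something like the sketch below. Everything here is hypothetical: the function `resolve`, the margin of 0.5, and the returned tuples are illustrative choices, not a mechanism implemented in this study.

```python
def resolve(probabilities, margin=0.5):
    """Execute the top class, or ask the user when the top two are too close."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (second, p_second) = ranked[0], ranked[1]
    if p_best - p_second < margin:
        # Ambiguous: offer the two most likely tasks as a question.
        return ("ask", best, second)
    return ("execute", best)

# Probabilities reported for "Run emotion classification"; 'OTHER' is an
# assumed placeholder for the remaining classes.
ambiguous = resolve({"EEG-EMOTIONS": 0.672, "SENTIMENT-ANALYSIS": 0.32, "OTHER": 0.008})
# A made-up confident prediction for comparison.
confident = resolve({"JOKE": 0.99, "CHAT": 0.01})
```

With this margin the ambiguous command triggers a question between the EEG emotion and sentiment analysis tasks, while confident predictions are executed directly, so most users would never see the extra prompt.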

Top features within unseen data
Following the training of the model, this section explores features within data when an unseen phrase or command is uttered.That is, the examples given in this section were not data within the training or validation datasets, and thus are more accurate simulations of the model within a real-world scenario given new data to process based on the rules learnt during training.
In this regard, Figure 9 shows, for each class, an example of a correct prediction on unseen data. Interestingly, the model shows behaviour reminiscent of human reading [60,61] due to transformers not being limited to considering a temporal sequence in chronological order of appearance.
In the first example the most useful features were 'time to speak' followed by 'got', 'to' and 'me'.The least useful features were 'right now', which alone would be classified as 'SCENE-CLASSIFICATION' with a probability of 0.781 due to many provided training examples for such class containing questions such as 'where are you right now? Can you run scene recognition and tell me?'.The second example also had a strong negative impact from the word 'read' which alone would be classified as 'SENTIMENT-ANALYSIS' with a probability of 0.991 due to the existence of phrases such as 'please read this message and tell me if they are angry with me' being popular within the gathered human responses and as such the augmented data.This example found correct classification due to the terms 'emotions' and 'mind' primarily, followed by 'feeling'.Following these two first examples, the remaining five examples were strongly classified.In the mental state recognition task, even though the term 'mental state' was specifically uttered, the term 'concentrating' was the strongest feature within the statement given the goal of the algorithm to classify concentrating and relaxed states of mind.As could be expected, the 'JOKE' task was best classified by the term 'joke' itself being present, but, interestingly, the confidence of classification was increased with the phrases 'Feeling sad today.' and 'cheer me up'.The scene classification task was confidently predicted with a probability of 1 mainly due to the terms 'look around' and 'where you are'.The red highlight for the word 'if' alone would be classified as 'SENTIMENT-ANALYSIS' with a probability of 0.518 given the popularity of phrases along the lines of 'if they are emotion or emotion'.
The sentiment analysis task was then, again, confidently classified correctly with a probability of 1. This was due to the terms 'received this email', 'if', and 'sarcastic' being present. Finally, the sign language task was also classified with a probability of 1, mostly due to the features 'voice' and 'sign'. The red features highlighted, 'speaking with please', would alone be classified as 'CHAT' with a probability of 0.956, since they are strongly reminiscent of commands such as 'can we speak about something please?'.
An interesting behaviour to note from these examples is the previously described nature of reading. Transformer models are advancing the field of NLP in part thanks to their lack of temporal restriction, that is, the limitation existent within models such as Recurrent or Long Short Term Memory Neural Networks. This allows for behaviours more similar to a human being, such as when someone may focus on certain key words first before glancing backwards for more context. Such behaviours are not possible with sequence-based text classification techniques.
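To make the notion of positive and negative top features more concrete, the sketch below scores each word by how much the target-class probability drops when that word is removed. This is a generic leave-one-word-out illustration and is not claimed to be the explanation method behind Figure 9; `toy_predict_proba` is a hypothetical keyword-based stand-in for the fine-tuned classifier.

```python
def word_importance(text, predict_proba, target):
    """Score each word by the drop in target probability when it is removed."""
    words = text.split()
    base = predict_proba(text)[target]
    scores = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        # Positive score: the word supports the target class.
        scores.append((word, base - predict_proba(reduced)[target]))
    return scores

def toy_predict_proba(text):
    # Hypothetical stand-in classifier: only the word 'joke' matters.
    p_joke = 0.9 if "joke" in text.lower().split() else 0.2
    return {"JOKE": p_joke, "CHAT": 1.0 - p_joke}

scores = word_importance("can you tell me a joke", toy_predict_proba, "JOKE")
top_word = max(scores, key=lambda item: item[1])[0]
```

A word with a negative score would correspond to the red highlights discussed above, that is, a feature that pulls the prediction towards a different class.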

Transformer Ensemble Results
Following the previous findings, the five strongest models, which were BERT (98.55%), DistilBERT (98.34%), RoBERTa (98.96%), DistilRoBERTa (98.55%), and XLM-RoBERTa (98.76%), are combined into a preliminary ensemble strategy as previously described. XLM (14.81%) and XLNet (35.68%) are omitted due to their low classification abilities. As noted previously, the best score by a single model was achieved by RoBERTa, which scored 98.96% classification accuracy, and thus the main goal of the statistical ensemble classifier is to learn patterns that could account for some of the 1.04% of errors and correct for them. Initially, Table 7 shows the information gain rankings of each predictor by 10-fold cross validation on the training set alone; interestingly, BERT is ranked the highest with an information gain of 2.717 (± 0.002). Following this, Table 8 shows the results for multiple statistical methods of ensembling the predictions of the five Transformer models; all of the models, with the exception of Gaussian Naïve Bayes, could outperform the best single Transformer model by an accuracy increase of at least +0.42 points. The two best models, which achieved the same score, were Logistic Regression and Random Forests, which, when ensembling the predictions of the five transformers, could increase the accuracy by +0.63 points over RoBERTa and achieve an accuracy of 99.59%.
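The information gain of a nominal predictor, as used for the rankings above, is the reduction in class entropy once the predictor's value is known. The sketch below assumes entropy is measured in bits (base-2 logarithm), which the text does not state explicitly; the toy labels are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(predictor, target):
    """H(target) - H(target | predictor) for two equal-length label lists."""
    total = len(target)
    conditional = 0.0
    for value in set(predictor):
        subset = [t for p, t in zip(predictor, target) if p == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(target) - conditional

# Seven balanced toy classes, ten examples each.
target = [c for c in range(7) for _ in range(10)]
perfect_gain = information_gain(target, target)          # predictor always correct
constant_gain = information_gain([0] * len(target), target)  # predictor uninformative
```

Under the base-2 assumption, seven balanced classes have an entropy of log2(7), roughly 2.807 bits, which is the most any single predictor could gain; a value of 2.717 would therefore indicate a predictor that is close to, but not perfectly, informative.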
Finally, Figure 10 shows the confusion matrix for both the Logistic Regression and Random Forest methods of ensembling Transformer predictions, since the errors made by both models were identical. Many of the errors have been mitigated through ensembling the transformer models, with minor confusion occurring between the 'CHAT' and 'JOKE' classes and the 'SENTIMENT ANALYSIS' and 'EEG-EMOTIONS' classes.

Conclusion and Future Work
This work has shown primarily that data augmentation through transformer-based paraphrasing via the T5 model has a positive effect on many state-of-the-art language transformer-based classification models. BERT, DistilBERT, RoBERTa, DistilRoBERTa, XLM, XLM-RoBERTa, and XLNet all showed increases in learning performance when trained on the training set with augmented data, compared to training only on the original, pre-augmentation data. The best single model found was RoBERTa, which classified human commands to an artificially intelligent system with 98.96% accuracy, where errors were often due to ambiguity within human language. A statistical ensemble of the five best transformer models then increased accuracy to 99.59% when using either Logistic Regression or a Random Forest to process the output predictions of each transformer, utilising small differences between the models when trained on the dataset.

Although XLM did not perform well, the promising performance of XLM-RoBERTa showed that models trained on one task do not necessarily underperform on a different task, given a general ability of lingual understanding. With this in mind, and given that the models are too complex to train simultaneously, it may be useful in future to consider the predictions of all trained models and form an ensemble through meta-classifiers based on statistical, deep learning, or further transformer approaches. A small vector input of predictions would allow for deeper decision making given the singular outputs of each transformer. Alternatively, a vector of inputs in addition to the original text may allow for a deeper understanding of why errors are made and allow for learned exceptions to overcome them.

A preliminary ensemble of the five models that did not have weak scores showed that classification accuracy could be further increased by treating the outputs of each transformer model as attributes in themselves, for rules to be learnt from. The experiment was limited in that attribute selection was based solely on removing the two underperforming models; in future, attribute selection could be explored to fine-tune the number of models used as input. Additionally, only predicted labels in the form of nominal attributes were used as input, whereas additional attributes such as the probabilities of each output class could be utilised to provide more information to the statistical ensemble classifier.
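As an illustration of the attribute selection suggested above, the following is a minimal sketch of ranking predictor models by the information gain of their predicted labels with respect to the true class. It is not the implementation used in this work: the model names, the toy labels, and the helper functions `entropy`, `information_gain`, and `rank_models` are hypothetical, and the cross-validation used for the reported ranking is omitted for brevity.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(predictions, truth):
    """Entropy reduction in `truth` after splitting on one model's predicted label."""
    total = len(truth)
    groups = {}
    for pred, true in zip(predictions, truth):
        groups.setdefault(pred, []).append(true)
    # Weighted entropy of the true labels within each predicted-label group.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(truth) - remainder


def rank_models(model_predictions, truth):
    """Return (model name, information gain) pairs, most informative first."""
    scores = {
        name: information_gain(preds, truth)
        for name, preds in model_predictions.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: true classes and the nominal predictions of three models.
truth = ["CHAT", "JOKE", "CHAT", "JOKE", "CHAT", "JOKE"]
model_predictions = {
    "model_a": ["CHAT", "JOKE", "CHAT", "JOKE", "CHAT", "JOKE"],  # always correct
    "model_b": ["CHAT", "JOKE", "CHAT", "CHAT", "CHAT", "JOKE"],  # one error
    "model_c": ["CHAT", "CHAT", "CHAT", "CHAT", "CHAT", "CHAT"],  # uninformative
}

ranking = rank_models(model_predictions, truth)
for name, gain in ranking:
    print(f"{name}: {gain:.3f} bits")
```

A ranking of this kind could then be thresholded, or used to select the top k models, in order to choose which transformer outputs are passed to the ensemble classifier.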

Figure 1: A general overview of the proposed approach.

Figure 4: Data collection and model training process. In this example, the T5 paraphrasing model is used to augment and enhance the training dataset. Models trained with and without augmentation are compared on the same validation set in order to discern what effect augmentation has.

Figure 5: An ensemble strategy where statistical machine learning models trained on the predictions of the transformers then classify the text based on the test data predictions of the transformer classification models.

Figure 6: Comparison of each model's classification ability and its number of trainable parameters (in millions).

Figure 7: Normalised confusion matrix for the best command classification model, which was RoBERTa when trained on human data and augmented T5 paraphrased data.

Figure 10: Normalised confusion matrix for the best ensemble methods of Logistic Regression and Random Forest (errors made by the two were identical).

Table 1: A selection of example statements presented to the users for paraphrasing. One example is given for each for readability purposes, but a total of five examples were presented to the participants.

Table 2: An overview of the models benchmarked and their topologies.

Table 3: Classification results of each model on the same validation set, both with and without augmented paraphrased data within the training dataset. Bold shows the best model per run; underline shows the best model overall.

Table 4: Observed increases in training metrics for each model due to data augmentation via paraphrasing the training dataset.

Table 5: Per-class precision, recall, and F1 score metrics for the best model.

Table 6: The most confusing sentences according to the model (all of those with a loss > 1) and the probabilities as to which class they were predicted to belong to. Key: C1 = CHAT, C2 = EEG-EMOTIONS, C3 = EEG-MENTAL-STATE, C4 = JOKE, C5 = SCENE-RECOGNITION, C6 = SENTIMENT-ANALYSIS, C7 = SIGN-LANGUAGE.

Table 7: Information Gain ranking of each predictor model by 10-fold cross-validation on the training set.

Table 8: Results for the ensemble learning of Transformer predictions compared to the best single model (RoBERTa).