Sentiment analysis in learning resources

In this work, we aim to analyze the sentiments of learning resources from their textual contents. This work proposes a method for automatic assignment of emotional states to learning resources, based on their feature similarity with previously labeled learning resources. Then, various feature extraction strategies, which describe the relevant information in the texts, are compared for the task of sentiment analysis, considering the two main dimensions of emotions: arousal and valence. The results are very promising, showing very high values in performance metrics such as the $R^2$ score.


Introduction
Emotions affect the way we think, act and behave. In the psychological literature, there exists empirical evidence of the influence of the affective state on cognitive processes, including memorizing and decision-making (Leony et al., 2012; Cordero et al., 2020; Perozo et al., 2012). The emotional state influences the way humans learn, as claimed by Santos et al. (2014): "There is an agreement in the literature that affect influences learning". Since the emotional state is an influential factor in learning, it must be considered for enhancing the learning process. Shen et al. (2009) and Aguilar et al. (2015) found evidence of the superiority of emotion-aware systems over non-emotion-aware systems, which suggests that emotional data can improve the performance of e-learning systems. Several other works evidencing the advantages of emotion-aware systems in education can be found in the literature (Fatahi, 2019; Pekrun, 1992; Faria et al., 2017; Le et al., 2018; Cordero et al., 2019). Some of the main benefits in the education domain are the improvement of the teaching-learning process and students' performance, as well as the reduction in attrition (Mite-Baidal et al., 2018).
One important aspect that can be extracted from textual contents is the emotion they carry. One way to obtain this emotion is to deduce it from the reviews, comments, and opinions that other persons write about the text (Garcia & Schweitzer, 2011; Ullah et al., 2016; Mohammad, 2016; Sánchez et al., 2019). The other way is to carry out a sentiment analysis on the text itself to identify and extract its affective states. In this work, we focus on the utilization of Machine Learning (ML) techniques for this task.
This work is interested in the emotion evoked by the text in the learning resources (sentiment analysis) because it affects the learning process of students, as has been determined in previous works (Cordero et al., 2019;Lajoie et al., 2020). Thus, this paper exploits the textual information in the digital learning resources, but also, analyzes their titles and descriptions contained in their metadata (they describe each learning resource), among other things. This information can be useful for modifying and adapting the learning process, especially in virtual learning environments (Imani & Montazer, 2019;Yadegaridehkordi et al., 2019;Alatrash et al., 2021;Sharma et al., 2021;Aguilar, 2001;Pekrun et al., 2002;Chamba-Eras & Aguilar, 2017;Aguilar, 1998).
Specifically, our aim is to predict the general or "average" perceived emotion from various readers, focusing on the common information. We do this with the objective of resembling the emotion induced by the text, assuming that the average emotion perceived by the audience is quite similar to the evoked emotion by the text. We rely on this assumption given that one way to calculate the emotion induced by content is by recognizing the emotion that it generates in several people. Then, we can obtain the similarities between them in search of a "general" emotional state that the content provokes in people. In the case of teachers, a tool based on this work is very important because, as emotions influence the learning process, they could use it as a service to determine when and in what learning activities they could rely on learning resources according to their emotional charge (Leony et al., 2012;Shen et al., 2009;Sánchez et al., 2016). For example, a learning resource that provokes negative feelings could be used by the teacher in discussion activities, with a lot of interaction, to minimize its negative impact. On the other hand, highly positively charged learning resources could be used by the teacher in individual activities, particularly when the student has negative feelings and requires motivation.
In the literature, two groups of models have been commonly used to represent emotions: categorical models and continuous models (Ekman, 1992; Cambria et al., 2015, 2018; Colombetti, 2009; Plutchik, 1980). Categorical models try to represent emotions with discrete values, such as happiness, surprise, sadness, and disgust. Categorical models start from the principle that humans have an innate set of recognizable and distinguishable basic emotions. In general, six basic, universally accepted emotions have been proposed (Ekman, 1992; Colombetti, 2009; Plutchik, 1980): happiness, anger, sadness, surprise, disgust and fear. Continuous models propose dimensions that constitute a continuous space, in which a specific emotional state is represented by a point in said space (Cambria et al., 2018; Plutchik, 1980). Thus, continuous models of emotions attempt to conceptualize human emotions in one or several dimensions. Each dimension corresponds to a variable that describes an emotion, such as valence and/or arousal or intensity. In these models, depending on the value of these variables, one or another emotion appears. Several dimensional models of emotion have been developed; some of them are the circumplex model, the vector model, and Senticnet. In this work, we use a continuous model for representing emotions: the arousal-valence dimensional model. Each of the dimensions has two opposite poles: aroused-calmed for arousal, and pleasure-disgust for valence.
The main objective of this work is to develop a method for recognizing the emotions in learning resources, which can be later used in other applications like educative recommendation systems. The main contributions of this work are: (1) to propose a new method for automatic sentiment analysis in textual contents, in terms of arousal and valence; and (2) to compare various feature extraction strategies for text, and ML techniques, in the task of sentiment analysis in learning resources.
The rest of the work is structured as follows: Sect. 2 explores the related works; the proposed approach for the recognition of emotions from textual contents in learning resources is detailed in Sect. 3; the next section describes the experiments. Then, Sect. 5 presents the analysis of the results, and finally, Sect. 6 presents the conclusions of the work.

Related works
In the literature, there exist many works focusing on extracting emotions from texts, and not only the extraction of emotions has been investigated, but other affective information like cognitive appraisals and sentiments have been also widely studied (Garcia & Schweitzer, 2011;Ullah et al., 2016;Mohammad, 2016). However, the sentiment analysis from formal texts has received less attention from the experts in computational intelligence. The area of sentiment analysis has focused mainly on informal textual contents like email, chat, SMS, and online user reviews.
In this section, we analyze recent works related to sentiment analysis in multimedia content and to strategies for the identification and/or recognition of emotions, emphasizing the works dedicated to such analysis in formal texts, technology-intensive learning environments, and/or multimedia learning material.
Perceived and induced (evoked) emotions from contents have been mainly investigated for multimedia content: videos, images, presentations, music, etc. Authors of Chen and Picard (2016) propose using 3D convolutional neural networks for extracting spatio-temporal features from animated GIFs, in order to predict crowd-sourced intensity scores of 17 emotions. In Koelsch (2015), various principles underlying the evocation of emotion with music are described, including their relevance for music therapy. The investigation presented in Tian et al. (2017) explores the relationship between perceived and induced emotions from affective movie content. The main difference between those works and our work is that they recognize emotions from multimedia contents (GIFs, music and movies), and we predict emotions from the textual content, specifically, from the textual contents of learning resources.
In the context of emotion recognition, Acheampong et al. (2020) review the area of emotion detection from texts and present the main approaches used for the design of text-based emotion detection systems. Also, emotion-labeled text data sources are presented. Batbaatar et al. (2019) propose a neural network architecture, called SENN (Semantic-Emotion Neural Network), that utilizes both semantic/syntactic and emotional information for emotion detection and recognition from text. The SENN model uses two sub-networks: a first sub-network based on the bidirectional Long-Short Term Memory (BiLSTM) technique to capture contextual information and semantic relationships, and a second sub-network based on the convolutional neural network (CNN) to extract emotional features and emotional relationships between words from the text. Paltoglou et al. (2013) propose predictive models of the emotional responses for informal texts in social media using a real-valued scale on the two affective dimensions of valence and arousal. The results show that the prediction of emotional states is possible. Rodriguez et al. (2012) describe different approaches for automatic extraction of emotions from texts, which are essays written by students. Also, they describe its utilization in a context-based adaptive e-learning system for the dynamic recommendation of activities according to the student's emotions. Another work studies the semantic representation of educational digital resources. Its authors propose a mechanism for extracting features/characteristics from the digital resources using Best Matching 25, Latent Semantic Analysis, Doc2Vec, and Latent Dirichlet Allocation techniques. For this analysis, they test two types of educational digital resources: scientific publications and learning objects.
Perikos and Hatzilygeroudis (2016) describe a sentiment analysis system for automatic recognition of emotions in text, using an ensemble of classifiers. The ensemble schema is based on three classifiers: one is a knowledge-based tool performing deep analysis of the natural language sentences, and the other two are statistical (a Naïve Bayes and a Maximum Entropy learner). Also, Chatterjee et al. (2019) define a Deep Learning-based approach to detect emotions in textual dialogues. They combine both semantic and sentiment-based representations for more accurate emotion detection. On the other hand, Su et al. (2018) propose a long-short term memory (LSTM)-based approach to text emotion recognition based on the semantic word vector and emotional word vector of the input text. The works in the previous two paragraphs either do not address learning resources or do not take an emotional feature extraction perspective. We combine both aspects in this work.
In the context of formal texts, Mensink (2021) analyzes seductive details in scientific texts, which can potentially evoke a strong emotional response. The experiments analyse the effects on participants who read related scientific texts with and without embedded seductive details. The conclusion is that the presence of seductive details that produce negative emotions reduced recall performance. On the other hand, Tan and Zhang (2008) carry out a sentiment categorization on Chinese documents using different feature selection methods and machine learning methods on 1021 documents. The experimental results indicate that the quality of the classifiers is severely dependent on domains or topics. In Ahmad et al. (2020), an emotional state classification system for poetry text is proposed using Deep Learning. For this purpose, they use the C-BiLSTM model on the poetry corpus.
There is a recent special issue where the contributors address the importance of understanding and measuring emotions in technology-intensive learning environments (Lajoie et al., 2020). These works consider that emotions play an important role in fostering learning, and they analyze different domains, including mathematics, medicine, and history, among others. Finally, the work closest to our proposal is that of Moreno-Marcos et al. (2018), who compared several ML techniques for sentiment analysis to determine the learners' emotions when they use a MOOC (Massive Open Online Course). For that, they considered the forum messages in MOOCs as the source of information, which were studied using a sentiment analysis approach to detect complex emotions, such as frustration or boredom. The works of the two previous paragraphs show the feasibility of doing sentiment analysis in formal documents, and also the relevance of the sentiment analysis of learning resources for virtual educational media.
Finally, several works start from the premise that the emotional design of multimedia learning material can evoke positive emotions in learners, which in turn facilitate the learning process. For example, Heidig et al. (2015) explore the potential of an emotional design of multimedia learning resources. They define an approach to deduce emotionally relevant design features, adopting concepts from web design. They consider nine conditions, classified according to two design factors (classical vs. expressive aesthetics) and a usability factor (high vs. low usability). They conclude that the perceived aesthetics and usability positively affected the emotional states of the learners, including the motivation to continue working with the resource. Newman and Joyner (2018) used a sentiment analysis tool, called VADER (Valence Aware Dictionary and sEntiment Reasoner), to analyze the positivity/negativity of students' comments about a course using three sources: a forum with comments on the course, an unofficial student "reviews" site, and the official evaluations. Also, Plass and Kaplan (2016) propose a model integrating cognitive and affective aspects of learning with digital media for designing multimedia learning. Their model combines elements of the affective computing paradigm and the cognitive-affective theory, among other elements. On the other hand, Peng et al. (2021) investigate how the visual aesthetics of interface design influenced learners' cognitive processes and emotional valences. In addition, Rodrigues and Silva (2022) review the articles about emotional design in multimedia learning, considering several aspects such as the emotional design features most frequently found and the effects of emotional design on the learning process, among others. Finally, Stark et al. (2018) propose a method based on eye-tracking for the emotional design of the textual parts of multimedia learning resources. In their experiments, the positive emotional text design led to better learning outcomes. Further, learners in the group with the negative emotional text design showed a worse emotional state after learning.
As seen above, there is a significant number of works dedicated to the analysis of emotions in texts, particularly in informal texts. Additionally, various studies have shown the affective impact of multimedia learning. In particular, multimedia didactic material can evoke positive emotions in the learners, which improves their learning processes. Despite the investigations on multimedia contents and on the analysis of sentiment in the educational context, studies for extracting emotions in learning resources were not found. Thus, no studies have been dedicated exclusively to the analysis of the emotions evoked by the textual content in learning resources. Only the work of Moreno-Marcos et al. (2018) is close to our proposal, but they do not propose a methodological scheme, or a strategy for the extraction of characteristics/variables, to determine the emotions in the learning resources. Our research is innovative because it points to a goal that has not been explored in the literature: the extraction of emotions in learning resources using a feature engineering approach (Pacheco et al., 2014). This is the main contribution of our work, which could later be used by recommender systems based on emotions (Monsalve-Pulido et al., 2020; Salazar et al., 2021). With this work, we aim to open the door to the investigation of this topic, showing evidence that the extraction of this kind of affective information can be performed with high accuracy.

Our approach
Our approach has two main steps: in the first step, the learning resources are automatically labeled, and in the second, the sentiment analysis models for learning resources are defined. These steps have sub-steps, as can be observed in Figure 1, where the first step is in blue and the second in yellow.

First step: automatic annotation of learning resources
Since the objective is to use supervised ML techniques, annotated data must be collected. The objective data are learning resources annotated with a specific emotional state perceived by readers (students) in the arousal-valence dimensional model. As mentioned in the related works section, there are no previous works for extracting emotions from learning resources, and to the best of our knowledge, there is no dataset of learning resources with associated emotional states.
For that reason, an automatic method for the assignment of perceived emotional states to learning resources was developed, following the idea presented in Zad and Finlayson (2020). Thus, based on Zad and Finlayson (2020), our approach defines a dataset of textual data annotated with the perceived emotional state of readers. The dataset defined in Zad and Finlayson (2020) is used for the automatic annotation of the learning resources (emotional state). The idea is to build an automatic assigner general enough for assigning annotations to any textual content. The proposed technique for the automatic assignment has five steps: (1) Obtain a dataset of texts annotated with emotional states in the arousal-valence dimensional model, which have been defined by the readers. (2) Preprocess the texts in the annotated dataset and in the learning resources dataset. (3) Extract features from the preprocessed texts and the learning resources. (4) Calculate the similarity of the features extracted from each learning resource with those of each annotated text in the first dataset. (5) Assign to the learning resource the emotional state of the most similar text, according to the similarity metric.

Data collection
In this step, two datasets are collected: textual data with emotional states assigned using the arousal-valence dimensional model, and the textual contents of learning resources. The first dataset is required for the automatic calculation (annotation) of emotional states in the learning resources dataset. The second dataset is used for extracting features and calculating annotations, which are then used to train the sentiment analysis models of learning resources.

Preprocessing
Textual data contains noise that must be cleaned before executing analytic tasks. Preprocessing cleans the data and prepares it for the analytic tasks that will be performed. The preprocessing is based on natural language processing (NLP) techniques, such as transforming to lower case, stop word removal, part-of-speech (POS) tagging and lemmatization.
Fig. 1 Proposed approach
Converting text to lower case allows treating words like "Dog" and "dog" as the same token. Stopwords are words that are highly frequent in a corpus (the collection of texts), such as articles. These words do not provide much information and increase the dimensionality of the data, which raises the computational resources needed and makes the analytic task more difficult; for that reason, they must be removed from the texts. POS tagging consists of tagging each word with the part of speech it belongs to. Parts of speech can be verbs, nouns, adjectives, among others, and derivatives of these. Lemmatization converts words to their lemma; a lemma is the root of the word (the form in which the word is normally found in dictionaries). This last technique transforms words like "jumped" and "jumping" to the root "jump".
The preprocessing consists of 6 steps: (1) transform the text to lower case, (2) remove punctuation marks, (3) tokenize the text, which is very similar to splitting it into words, (4) tag each token (POS), (5) remove stopwords and tokens shorter than 2 characters and (6) lemmatize tokens.
Each text is passed through this preprocessing procedure before extracting features (except for the BERT features set, as explained in the following section).
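The six steps can be illustrated with a minimal sketch that uses only the Python standard library. It is not the implementation used in this work: the function name `preprocess`, the tiny stopword list, and the optional `tagger` and `lemmatizer` callables are hypothetical placeholders, and whitespace splitting stands in for a real tokenizer.

```python
import string

# Tiny illustrative stopword list (assumption, not the list used in the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "over"}


def preprocess(text, stopwords=STOPWORDS, tagger=None, lemmatizer=None):
    """Apply the six preprocessing steps to one text and return its tokens."""
    # (1) transform the text to lower case
    text = text.lower()
    # (2) remove punctuation marks
    text = text.translate(str.maketrans("", "", string.punctuation))
    # (3) tokenize; whitespace splitting is a stand-in for a real tokenizer
    tokens = text.split()
    # (4) POS-tag each token; without a tagger every tag is None
    tagged = tagger(tokens) if tagger else [(tok, None) for tok in tokens]
    # (5) remove stopwords and tokens shorter than 2 characters
    tagged = [(tok, tag) for tok, tag in tagged
              if tok not in stopwords and len(tok) >= 2]
    # (6) lemmatize; without a lemmatizer the tokens are kept unchanged
    if lemmatizer:
        return [lemmatizer(tok, tag) for tok, tag in tagged]
    return [tok for tok, _ in tagged]


tokens = preprocess("The Dog is jumping over a fence!")
print(tokens)
```

Passing the tagger and lemmatizer as arguments keeps the sketch independent of any particular NLP library; a real pipeline would supply library-backed callables in their place.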

Feature extraction
From the texts, relevant features are extracted and converted to numerical values because the ML techniques for the construction of the recognition models generally do not accept textual data as input, but numerical data. Several feature extraction strategies are compared according to their performance in the recognition of emotional states. In total, 4 feature sets were selected for the experiments: a Vector Space Model (VSM), a feature set extracted from Senticnet (Cambria et al., 2018), a feature set extracted from Senticnet + AffectiveSpace embedding, and BERT (Devlin et al., 2018) embedding. These feature sets were selected because they belong to different approaches for text feature extraction: word frequency-based features (VSM), knowledge-base-driven feature extraction (Senticnet and Senticnet + AffectiveSpace) and deep learning-based feature extraction (BERT).
• VSM: Vector space model features consist of representing each document (text) by a vector in which each position is associated with a word or token in the vocabulary of the corpus (collection of documents). Suppose that a corpus has N documents and M different tokens over those N documents; then each document is represented by an M-dimensional vector in which each position holds frequency-based information about the corresponding token in the document. This frequency-based information can be the presence or absence of the token in the document, the term frequency of the token, or the TF-IDF score of the token in the document, among others. As a result, an N × M matrix is built from the documents. This matrix is very big and sparse; for that reason, this method is usually followed by a dimensionality reduction technique, such as Principal Components Analysis (PCA) or Singular Value Decomposition (SVD).
• Senticnet: Senticnet is a knowledge base that provides affective information for 100,000 concepts. The authors of this resource used recurrent neural networks to infer primitives by lexical substitution for building the knowledge base. Senticnet is publicly available for research (available at https://sentic.net/, accessed 13/10/2020).
• Senticnet + AffectiveSpace: This is an extension of the Senticnet features. The resource AffectiveSpace was developed by the same authors as Senticnet, and it is also publicly available for research on the same page as Senticnet. The resource was built using a special Vector Space Model for modelling affective information of terms. It provides 100-dimensional vectors (embedding) for 100,000 concepts; these are not exactly the same concepts as in Senticnet, but the majority of them are present in both knowledge bases. The features of AffectiveSpace are concatenated with the features of Senticnet for building more robust features. The idea of expanding Senticnet was taken from Poria et al. (2015), where the authors used EmoSenticSpace as the resource for expanding the Senticnet features; as we do not have access to this resource, we replaced it with AffectiveSpace, which also provides 100-dimensional embedding of terms.
• BERT: BERT is a technique that uses bidirectional Transformer encoders for calculating the embedding of tokens. It was originally developed by Google and then publicly released. One of the main advantages of using BERT embedding is that it considers the context of the term for calculating the embedding; thus, the same token in different sentences could have a different embedding depending on the meaning of the token in that context. This particularity is especially relevant for tokens that have different meanings depending on the context, such as "book", which has two very different meanings: one as a verb and one as a noun.
Three of them (VSM, Senticnet and Senticnet + AffectiveSpace) use the preprocessing step. BERT does not require it because it considers all tokens to understand the context and calculate the embedding.
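To make the VSM representation concrete, the following minimal sketch builds an N × M TF-IDF matrix with the standard library only. It is an illustration under assumptions: the function name `tfidf_matrix` is hypothetical, the IDF is the plain unsmoothed `log(N / df)` variant (libraries often use smoothed versions), and the three toy documents are invented.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, N x M TF-IDF matrix) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Plain unsmoothed IDF (assumption; other variants exist).
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        matrix.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, matrix


# Toy corpus of already preprocessed documents (invented for illustration).
docs = [["dog", "jump", "fence"], ["dog", "sleep"], ["cat", "sleep", "sofa"]]
vocab, X = tfidf_matrix(docs)
zeros = sum(value == 0.0 for row in X for value in row)
print(vocab)
print(f"{zeros} of {len(X) * len(vocab)} entries are zero")
```

Even in this toy corpus most entries are zero, which is the sparsity that motivates a dimensionality reduction step such as PCA or SVD.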

Similarity
A similarity score over the extracted features is used for calculating the likeness of annotated texts and learning resources. The similarity score selected is the cosine similarity score, as proposed in Zad and Finlayson (2020). The cosine similarity is defined as follows:
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
where $A$ and $B$ are the two vectors (of features) and $\theta$ is the angle formed between the two vectors. This similarity score is based on the cosine function, and it is bounded between −1 and 1, with 1 being perfect similarity and −1 the lowest similarity. As it only uses the cosine function, it only considers the angle between the two vectors and does not consider their magnitudes.

Emotional state annotations
For assigning the emotional state of a learning resource, the similarity function is applied between its feature vector and the feature vectors of each annotated text in the other dataset. The maximum similarity is determined, and the emotional state of the text with the maximum similarity is assigned to the learning resource.
This step is based on Zad and Finlayson (2020), who propose a method for automatic categorical labelling. We extended it to continuous values (the arousal-valence dimensional model). Our method is scalable to large datasets because the cosine similarity is a lightweight calculation: it can be expressed in matrix form to make it much more efficient, and it can also be parallelized and computed in a distributed way.
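The similarity and annotation steps can be sketched as a nearest-neighbour lookup. This is a minimal illustration, not the code used in this work: the function names, the convention of returning 0 for all-zero vectors, and the toy vectors and (arousal, valence) labels are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors (assumption)
    return dot / (norm_a * norm_b)


def annotate(resource_vecs, annotated_vecs, annotated_labels):
    """Give each resource the label of its most similar annotated text."""
    labels = []
    for vec in resource_vecs:
        sims = [cosine_similarity(vec, ref) for ref in annotated_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(annotated_labels[best])
    return labels


# Toy data: two annotated texts with hypothetical (arousal, valence) labels.
annotated_vecs = [[1.0, 0.0], [0.0, 1.0]]
annotated_labels = [(0.6, 0.8), (-0.2, -0.5)]
resources = [[2.0, 0.1], [0.1, 3.0]]
assigned = annotate(resources, annotated_vecs, annotated_labels)
print(assigned)
```

The loop is written for readability; with row-normalized feature matrices, the same computation reduces to a single matrix product followed by a row-wise argmax.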

Second step: sentiment analysis models in learning resources
Once the automatic annotation of learning resources is performed, a recognition model can be fitted with those annotations and the features extracted from learning resources, in order to carry out the sentiment analysis in learning resources. This step consists of two substeps: training the model and evaluating the model.

Training
In this process, relevant features and emotional states assigned to each learning resource are used. They are fed to an ML technique to create a model that learns to recognize the emotional states. Two classical ML techniques are compared, specifically: random forest (RF) and partial least squares (PLS) because they have shown good results for multi-output regression problems, as is our case.
RF consists of an ensemble of decision trees that "vote" for the final prediction. At the end of the process, the prediction with the highest "vote score" is taken as the final prediction; in regression tasks such as ours, the outputs of the trees are averaged. Individual decision trees are prone to overfitting, but RF mitigates this weakness with the bootstrapping technique: each decision tree is trained on a different random sample (with replacement) of the dataset and works with a certain subset of the original features. As every decision tree has different information, different points of view are evaluated, which reduces the overfitting problem (for more information about this technique see Breiman (2001)). The main reason for selecting this technique is that it has obtained very competitive results across many analytics tasks, not only in emotion recognition but in multiple fields (Fernández-Delgado et al., 2014).
PLS regression is one of the classical methods for multi-output regression. It provides the advantage of maximising the covariance between the input and output, addressing the disadvantages of multiple linear regression (MLR, a.k.a. Ordinary Least Squares) and Principal Components Regression (PCR). The method requires the addition of weights to maintain orthogonal scores, and the factors are calculated sequentially by projecting the output through the input (Wise, 2004) (for more information on this technique see Wegelin (2000)). The main advantage of using this technique in this task is that it does not model each output independently: it models all outputs together through the covariance between outputs and inputs, considering the relations between the different outputs.

Evaluation
To compare the ML techniques and the feature sets, three metrics are calculated from the predicted emotional states. These metrics indicate how similar the predictions are to the real values, and they are used for the comparisons in the results section. The three metrics are the r-squared score ($R^2$), the root mean squared error (RMSE) and the standard deviation relative error (SRE). The first metric is bounded between negative infinity and 1; the higher the better, with 1 being the perfect score. This metric represents the proportion of the total variance of the data that the model can explain. Its equation is:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $\hat{y}$ is the vector of predictions, $\bar{y}$ is the mean of $y$, and $y$ is the vector of true values. In this task, there is a bias towards the center of the arousal-valence space due to the nature of the contents. This metric is very useful for this task because it compares the performance of the trained model against a model that always predicts the mean value, giving low scores to models that are biased to the center and promoting models that are not influenced by this bias.
The second metric is a measure of the average prediction error. This measure is in the same units as the original data; that is, an RMSE of 0.5 means that, on average, the model's predictions are off by 0.5 original units. Since the arousal-valence dimensional model is bounded between −1 and 1 in both dimensions, the maximum expected error would be $\sqrt{8}$ in Euclidean norm. The range for RMSE in this task is therefore from 0 to $\sqrt{8}$. As with any other error, the lower the better, and the RMSE of a perfect prediction is 0. The RMSE formula is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
where $\hat{y}$ is the vector of predictions, $y$ is the vector of true values, and $n$ is the number of samples. The metric is relevant for the problem because it gives an idea of the error that the predictor incurs when making predictions. This metric may be influenced by the bias to the center of the space, but this weakness is compensated by the other two metrics. One advantage of this metric is that the score is in the same units as the arousal-valence dimensions, so calculations can be performed in the original units.
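The first two metrics can be checked numerically with a short sketch. It is written under assumptions: the function names are hypothetical, $R^2$ is computed for one output dimension at a time, and the RMSE treats each sample's error as the Euclidean distance between the true and predicted (arousal, valence) points, which is the reading that gives the $\sqrt{8}$ upper bound mentioned above.

```python
import math


def r2_score(y_true, y_pred):
    """R^2 for a single output dimension (e.g. arousal or valence)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(points_true, points_pred):
    """RMSE over (arousal, valence) points, using squared Euclidean errors."""
    squared = [sum((t - p) ** 2 for t, p in zip(pt, pp))
               for pt, pp in zip(points_true, points_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy arousal labels (invented for illustration).
arousal = [0.0, 0.5, -0.5, 1.0]
mean_arousal = sum(arousal) / len(arousal)

# A model that always predicts the mean scores R^2 = 0.
r2_mean_model = r2_score(arousal, [mean_arousal] * len(arousal))
# Worst case in the arousal-valence square: opposite corners.
worst_rmse = rmse([(-1.0, -1.0)], [(1.0, 1.0)])
print(r2_mean_model, worst_rmse)
```

The two printed values illustrate the reference points discussed in the text: a constant mean predictor obtains an $R^2$ of 0, and predictions at the opposite corner of the space reach the maximum RMSE.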
The third metric is used to discourage models that predict only central values, which can still give good results in terms of $R^2$ and RMSE when labels are biased to the center, as is the case with this dataset. This metric compares the standard deviation of the predictions with that of the true values. As with any other error, the lower the better, with 0 being the perfect score. In its formula, SRE is the relative error of standard deviation, $\sigma_{aro}$ is the standard deviation of the arousal labels, $\sigma_{val}$ is the standard deviation of the valence labels, $\hat{\sigma}_{aro}$ is the standard deviation of the arousal predictions, and $\hat{\sigma}_{val}$ is the standard deviation of the valence predictions. Given the previously mentioned center bias, this metric is useful for selecting models that are not influenced by the bias. It helps promote models with a variance similar to that of the annotations, and penalizes models that only predict central values.

Experiments
This section provides details of the experiments performed. It is divided into the two steps presented in the previous section.

Data collection
For this investigation, the annotated text dataset used was EmoBank (Buechel & Hahn, 2017, 2017b). This dataset consists of 9620 sentences tagged with the emotional state perceived by readers in terms of the arousal-valence dimensions. Each sentence was annotated by several readers; the average annotation is taken as the gold standard. This dataset was selected because it was the only one found containing text annotated with perceived emotional states in terms of arousal and valence. The non-annotated dataset of learning resources consists of 30694 learning resources extracted from the public layer of MERLOT (University, 2020). The data extracted from Merlot are the title and description of each learning resource, which are contained in its metadata. The metadata uses the IEEE LOM (Learning Object Metadata) standard (https://ieee-sa.imeetcentral.com/ltsc/) to describe learning resources. Merlot was selected as a source of learning resources because it is open and free for research.

Preprocessing
The preprocessing consists of 6 steps (see the previous section). First, the texts are transformed to lower case. Second, punctuation marks are removed; the marks to remove are taken from a predefined list available in NLTK (Loper & Bird, 2002), which contains the most frequent punctuation marks. Third, the text is split into tokens using the word tokenizer that NLTK provides. The process is similar to splitting the text into words but is slightly more sophisticated, because the tokenizer uses regular expressions to tokenize the text, as in the Penn Treebank (Marcus et al., 1994). Fourth, the tokens are tagged with their part of speech (POS). This step uses the POS tagger that NLTK provides, which relies on a pre-trained model for the English language to predict the tags of the tokens. Fifth, the stopwords are removed using the predefined list provided by NLTK, which contains the most common stopwords in the English language. Sixth, the tokens are converted to their lemma using the WordNet (Fellbaum, 2012) lemmatizer, which is available in the NLTK library. WordNet is a lexical database for the English language that provides lemmatization capabilities using the structured semantic relationships between words.
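The flow of the pipeline can be sketched without external dependencies. The following simplified example covers only steps 1, 2, 3 and 5; POS tagging and lemmatization are omitted because they require NLTK's pre-trained resources. The punctuation list (Python's `string.punctuation`), the whitespace tokenizer and the short stopword list are stand-ins chosen for illustration, not the NLTK components used in this work.

```python
import string

# Tiny illustrative stopword list (assumption); NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

# Hypothetical learning resource title.
tokens = preprocess("An Introduction to the Basics of Linear Algebra, for beginners!")
```

With the real pipeline, the remaining tokens would then be tagged and lemmatized, so that, for instance, plural forms are reduced to their singular lemma.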

Feature extraction
Several feature sets can be extracted from texts and, depending on the objective, some are more relevant and adequate than others. For this investigation, we compared the performance of four feature sets for the task of extracting the emotional state perceived by readers of learning resources. The first feature set is the VSM using the TF-IDF scores of words. This set represents each document as a collection of the TF-IDF scores of each word in the vocabulary of the corpus. The second feature set consists of seven affective features extracted using the SenticNet 5 (Cambria et al., 2018) knowledge base. Specifically, the features extracted are: pleasantness (continuous), attention (continuous), sensitivity (continuous), aptitude (continuous), primary mood (categorical), secondary mood (categorical) and polarity (continuous). SenticNet provides these features for each word in the text. To summarize the word-level features for each text, the percentiles 0, 25, 50, 75 and 100 are used for the continuous variables, and the count of each category is used for the categorical features. The two categorical features (primary and secondary mood) have 8 categories: joy, sadness, interest, surprise, anger, fear, admiration and disgust (the 8 emotions proposed by Plutchik (1980)). In total, 41 features are extracted from each text (5 percentiles * 5 continuous features = 25; 8 categories * 2 categorical features = 16; 25 + 16 = 41).
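The summarization strategy can be sketched as follows. The dictionary keys and the per-word values are hypothetical and do not reflect the actual SenticNet interface; the percentile function uses linear interpolation, which is one common convention and an assumption here.

```python
from collections import Counter

PERCENTILES = (0, 25, 50, 75, 100)
CONTINUOUS = ("pleasantness", "attention", "sensitivity", "aptitude", "polarity")
CATEGORICAL = ("primary_mood", "secondary_mood")
MOODS = ("joy", "sadness", "interest", "surprise",
         "anger", "fear", "admiration", "disgust")

def percentile(values, q):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)

def summarize(words):
    """Collapse per-word affective features into one fixed-length vector per text."""
    features = []
    for name in CONTINUOUS:
        values = [w[name] for w in words]
        features.extend(percentile(values, q) for q in PERCENTILES)
    for name in CATEGORICAL:
        counts = Counter(w[name] for w in words)
        features.extend(counts[mood] for mood in MOODS)
    return features

# Two hypothetical words with made-up affective values.
words = [
    {"pleasantness": 0.8, "attention": 0.1, "sensitivity": -0.2, "aptitude": 0.5,
     "polarity": 0.7, "primary_mood": "joy", "secondary_mood": "interest"},
    {"pleasantness": -0.4, "attention": 0.3, "sensitivity": 0.0, "aptitude": -0.1,
     "polarity": -0.3, "primary_mood": "fear", "secondary_mood": "sadness"},
]
vector = summarize(words)
```

The resulting vector has the same length for every text, independently of the number of words, which is what allows texts of different sizes to be compared.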
The third feature set is an extension of the second one, adding the embedding provided by AffectiveSpace (Cambria et al., 2015). This resource gives a 100-dimensional vector (word embedding) for each word. The same strategy is used for summarizing the words across the text: the percentiles 0, 25, 50, 75 and 100 are computed, since the 100 dimensions of the vector are continuous and treated as independent variables. This results in 541 features per text (5 percentiles * 105 continuous features = 525; 8 categories * 2 categorical features = 16; 525 + 16 = 541). The fourth feature set is composed of the context-aware embedding extracted using BERT (Devlin et al., 2018). The uncased base pretrained model was used for generating the embedding; this pretrained model provides a 768-dimensional vector for each word in a sentence. The embeddings of words are summarized over sentences, and those of sentences over texts, by averaging the 768-dimensional vectors. Thus, 768 features are extracted for each text.
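The two-level averaging used for the BERT feature set can be illustrated with a minimal sketch. The token vectors below are constant placeholders rather than real BERT outputs; only the dimensionality (768) comes from the text.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

DIM = 768  # size of the BERT base embedding stated in the text

# Hypothetical token embeddings: two sentences, with 3 and 2 tokens.
sentence_a = [[1.0] * DIM, [2.0] * DIM, [3.0] * DIM]
sentence_b = [[4.0] * DIM, [6.0] * DIM]

# Tokens are averaged into sentence vectors, then sentences into one text vector.
sentence_vectors = [mean_pool(sentence_a), mean_pool(sentence_b)]
text_vector = mean_pool(sentence_vectors)
```

Note that, with this scheme, each sentence contributes equally to the text vector regardless of its length.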
In addition, experiments concatenating pairs of feature sets were performed. The concatenation of feature sets can provide more information that an ML model can exploit to generate better predictions, but it can also introduce weaknesses when the features present problems when considered together. An example of these problems is multicollinearity, which is present when a feature can be explained to a high degree as a linear combination of other features in the feature set.

Dimensional reduction
For every feature set, a dimensionality reduction technique is applied to reduce computational costs and keep only the most relevant information. Specifically, Principal Component Analysis (PCA) was used to keep only the components that represent 80% of the variance of the data. After applying dimensionality reduction, the feature sets consist of the following numbers of final features: VSM: 1412, SenticNet: 7, SenticNet + AffectiveSpace: 66 and BERT: 60.
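The selection of components by explained variance can be sketched as follows. The example only shows the thresholding step and assumes the eigenvalues of the covariance matrix (the variance captured by each principal component) are already available; the eigenvalues are hypothetical.

```python
def components_for_variance(eigenvalues, threshold=0.80):
    """Smallest number of leading components whose cumulative variance ratio reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues of a feature covariance matrix.
eigenvalues = [5.0, 3.5, 1.0, 0.3, 0.2]
k = components_for_variance(eigenvalues)
```

In this example the first two components account for 85% of the variance, so two components are kept. Libraries such as scikit-learn offer the same behavior by passing a fraction as the number of components to their PCA implementation.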

Training
With the annotated dataset of learning resources available, traditional ML techniques for multi-output regression were used to build sentiment analysis models in terms of the arousal and valence dimensions. Two ML techniques were compared: PLS and RF. Two experiments were implemented to test the generalization capability of the annotation process and the predictive capability of the recognition models. Various parameter configurations were tested for both techniques, for the different feature sets and experiments. The two experiments are described below.
First experiment: This experiment consists of fitting a model on the original annotated text dataset and testing it on the learning resources dataset with its automatically assigned emotional states. As the model is trained and tested with data from different domains, this experiment measures how well the annotation process generalizes to different textual domains.
Since the annotations are derived from the similarity of the extracted features, this is a valid way to test generalization across the two textual domains (original annotations = sentences, generated annotations = learning resources): by construction of the proposed automatic annotation process, the model is trained on features that are supposed to be similar to those of the learning resources.
Second experiment: This experiment uses data from the same domain for training and testing. Specifically, it takes the automatically annotated learning resources dataset and splits it into training and testing sets; a model is then fitted with the training data and tested with the testing data, as in a classic supervised machine learning problem.

Evaluation
For the evaluation of the first experiment, the original annotated dataset was taken as training data, and the learning resources dataset with its generated annotations was taken as testing data. In the second experiment, only the learning resources dataset was used. As the number of available samples is considerably high (30964), the dataset was randomly split into 70% for training and 30% for testing using a cross-validation approach.
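A random 70/30 partition of this kind can be sketched as follows. The function name, the seed and the identifiers are hypothetical; the sketch shows a single split and does not reproduce the exact procedure followed in the experiments.

```python
import random

def train_test_split(items, test_fraction=0.30, seed=0):
    """Shuffle a copy of the items and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical identifiers of learning resources.
resource_ids = list(range(1000))
train_ids, test_ids = train_test_split(resource_ids)
```

Fixing the seed makes the partition reproducible, so that different feature sets and techniques are compared on the same training and testing samples.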

Result analysis

First step: automatic annotation of learning resources
We present statistics of the distributions of the annotations to verify that they come from a distribution similar to that of the original data. Table 1 presents the distributions of the emotional states automatically assigned to the learning resources dataset, which are compared to the original feature set. The "original" feature set is the distribution of the originally labelled textual dataset (EmoBank). The Euclidean norm dimension is the Euclidean norm of arousal and valence. From the table, it is observed that the statistics of every feature set are very similar to those of the original distribution. This evidences the similarity between the distributions of the original annotations and of the automatically generated annotations for learning resources, which is a good indicator of the validity of the approach. VSM features present the highest deviation in every dimension; for two of the dimensions (arousal and Euclidean norm), the deviation is even higher than that of the original data. Since the statistics are similar, the generated annotations and the original annotations have related properties, including biases in the data. A bias to the center of the arousal-valence space is evidenced in the statistics presented in the table.
In our experiments, we also calculated the cosine similarity between the automatically assigned annotations and the original annotations. For that purpose, we divided the annotated text dataset (EmoBank) into two sets: a base set that provides the reference annotations, and a second set for which automatic annotations are generated from the base set. Then, the cosine similarity between the generated annotations and the original ones is calculated on the second set. The results by feature set are shown in Table 2.
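For reference, the cosine similarity between two annotation vectors can be computed as in the sketch below. Whether the similarity was computed per sample or over the whole set of annotations is not specified here, so the example simply compares two vectors; the values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical original and generated (valence, arousal) annotations.
original = (0.40, 0.20)
generated = (0.38, 0.22)
similarity = cosine_similarity(original, generated)
```

A value close to 1 indicates that the two vectors point in almost the same direction. The measure ignores differences in magnitude and is undefined for a zero vector, which is worth keeping in mind when annotations lie near the center of the space.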
According to the measured cosine similarity, the automatically assigned annotations and the original annotations are almost identical for every feature set, which supports the validity of the approach for the automatic assignment of emotional states, at least for textual data from the same domain.

Second step: sentiment analysis models in learning resources
The results are presented for each experiment, each feature set, and each ML technique. For our problem, a high R² is a very good indicator: because of the bias to the center of the space present in the data, a model that always predicts the mean values (which are very close to the center of the space, as shown in Table 1) would obtain low errors, yet its R² would be close to 0.
In this task, obtaining low values of RMSE is not difficult, since the data is biased to the center of the space. For that reason, models are tempted to predict very central values, obtaining good results in terms of RMSE. Although RMSE scores are expected to be low even when the models do not perform very well, the metric is used because it gives an idea of the average error the model incurs, which is very useful information for applying the model in real cases. The last metric, SRE, helps select models that are not strongly affected by the bias to the center present in the data. The lower this score, the better. Since the metric compares the standard deviation of the true annotations with that of the predicted annotations, it reveals models whose variance is much higher or much lower than the true variance. As mentioned earlier, because of the bias, models are tempted to predict very central values with low variance, so this metric is of special importance in those cases. A high SRE means that the variance of the predictions differs greatly from the true variance, an indicator of low precision in prediction.

First experiment
The objective of this experiment is to empirically test the generalization power of the automatic annotation process. The results of this experiment can be observed in Table 3. Best results were observed using BERT embedding and PLS regression.
The R² score is very low for every configuration, and in some cases it is even negative. This indicates that it was not possible to fit good models that generalize between the two textual domains, and that a model that always predicts the average value of each dimension obtains results very similar to those of the models fitted using RF and PLS. The error metrics remain generally low only because the annotations are biased to the center of the space; if the annotations were well distributed over the space, the metrics would be worse. SRE scores are high (above 0.5) in every case, demonstrating that the variance of the predictions is very dissimilar to that of the actual annotations. This is another indicator of very poor results and shows that the distribution of the predictions does not follow the distribution of the true annotations. RF using SenticNet features obtained a higher R² score than PLS with BERT, although PLS with BERT obtained better RMSE and SRE. The results of mixing feature sets are not very different from those of single feature sets in terms of R² score, but the RMSE and SRE values got worse, especially with the PLS technique. PLS regression has more extrapolation power than RF, so it can predict values outside the arousal-valence space. That is the reason for the extremely poor metrics: the trained model predicts values outside the space, which produces very large errors.
The results of this experiment are not good. The maximum R² score observed is 0.0617, which is very low. These results indicate the lack of generalization of the models when they are trained and tested with datasets from different textual domains.

Second experiment
The objective of this experiment is to test the performance of the proposed sentiment analysis method, training and testing with datasets from the same domain (learning resources). The results of the second experiment are presented in Table 4. The best results were found making use of VSM features and RF as the ML technique.
R² increased greatly in the majority of cases in comparison with the first experiment. The R² results are good using the RF technique; however, PLS shows low results in every case, which are even comparable with the results of the first experiment. The RMSE values decreased, in general, for this experiment. The SRE metric improved considerably, with values lower than 0.5. The models trained in this experiment obtained an acceptable variance, similar to that of the true annotations, showing that the distribution of the predictions is similar, at least in variance, to that of the true values.
The fusion of feature sets did not improve the performance of ML techniques, and in some cases, the results were worse. RMSE metric increased notably for the models that used fused feature sets. The results for fused feature sets indicate that the different feature sets present problems when considered together. A feature selection process can be performed for validating this hypothesis and selecting only the relevant variables. Our intention was for the dimensionality reduction process to replace the feature selection process, but it did not obtain the expected results, as evidenced by the performance of the fused feature sets.

General results
The results of the second experiment are better than those of the first one. This can be explained by the difference of domains that the first experiment deals with. The results for the second experiment are very good and promising. It is curious to note that the best feature set also gets the worst results with the other ML technique. Using VSM features with PLS regression, the evaluation metrics are below an acceptable threshold, but with RF the results are excellent and very competitive, indicating that sentiment analysis from textual data in learning resources can be done with high performance.
The lack of generalization of the models to textual domains other than the one they were trained on is evident in the poor evaluation metrics of the first experiment. On the other hand, good results can be obtained using the same textual domain for training and testing, as evidenced by the results of the second experiment.
Theoretically, one may think that the BERT embedding would be the best feature set because it is context-aware, while the others are "naive" in that respect. However, the results have shown that there is no best feature set, because the results are also highly dependent on the technique used, as is the case with VSM in the second experiment: PLS with this feature set obtained no significant results, but RF obtained the best results.
Finally, we take two learning objects from the set collected from the Merlot repository in section 4.1.1, to study the behavior of our learning resource sentiment analysis approach. The two examples of learning resources from Merlot are the textbook "Web 2.0 Tools in Education: A Quick Guide" (Embi, 2020) and the video "Creation of electronic books for distance education - Case study" (McIntyre, 2021). Our system takes the title and description of each learning resource as input to perform the sentiment analysis of each of them. For the first case, the sentiment analysis model concludes that the document evokes happiness, as it has high valence and high arousal, while the second learning resource is identified as calm, as it has high valence and low arousal.
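The interpretation of a prediction in terms of its position in the arousal-valence space can be sketched as follows. The threshold at 0 and the predicted values are assumptions made for illustration; only the two labels mentioned in the text (happiness and calm) are mapped.

```python
def quadrant(valence, arousal, threshold=0.0):
    """Describe a point of the arousal-valence space by its quadrant."""
    v = "high" if valence > threshold else "low"
    a = "high" if arousal > threshold else "low"
    return f"{v} valence, {a} arousal"

# Only these two labels are given in the text; other quadrants stay unlabeled here.
LABELS = {
    "high valence, high arousal": "happiness",
    "high valence, low arousal": "calm",
}

# Hypothetical model outputs for the two example resources.
first_label = LABELS.get(quadrant(valence=0.6, arousal=0.5))
second_label = LABELS.get(quadrant(valence=0.5, arousal=-0.4))
```

A mapping of this kind turns the continuous output of the regression models into a discrete emotion that is easier to report to a teacher or to a recommender system.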

Conclusions
This work presents a method for automatic sentiment analysis in learning resources. Additionally, it compares various feature extraction strategies and ML techniques in the task of sentiment analysis from textual data in learning resources.
The automatic annotation of emotional states presents distributions similar to those of the original annotations, demonstrating good consistency and excellent similarity to the true annotations. Thus, the automatic annotation of emotional states has the same nature as the original dataset distribution, which is biased to the center of the arousal-valence space. With higher quality data, a more adequate predictor can be built. Also, the proposal for sentiment analysis in learning resources is promising: the results are good, reflecting the viability of the approach. They demonstrate that it is possible to extract the emotions from the textual information in learning resources with very good performance.
The proposed process for extracting the emotions using these ML techniques is very domain-sensitive. When the learned models are applied to other domains, the metrics drop sharply, reflecting the strong attachment of these models to the domain of the data with which they were trained and their low generalization towards other domains. This is a common problem with ML techniques; many of them suffer from low generalization when tested with data of a different nature. Despite that, it was found that the RF technique outperformed PLS for this task. Also, there was no dominant feature set among the four analyzed; the results showed great dependence on the ML technique used for training.
A tool of this type is very useful in a smart classroom, where a teacher manages a learning process, since it allows improving the management of learning resources by being aware of emotions (Shen et al., 2009;Pekrun, 1992;Lajoie et al., 2020). For example, a learning object recommendation system can adapt to the emotional context (Leony et al., 2012;Santos et al., 2014). We have considered the different types of learning resources contained in the Merlot repository (videos, textbooks, videogames, etc.), but we have not considered scientific works, patents, etc., that may be useful in learning processes (Heidig et al., 2015). Also, we have considered the title and description of the learning resources established in the metadata of the IEEE-LOM standard, but it should be analyzed whether other metadata may be useful to improve our approach (Heidig et al., 2015). Finally, two machine learning techniques were tested as sentiment analysis methods.
Among the most relevant limitations of this work is that the emotional annotation of learning resources is required, a very arduous task due to the large number of resources of this type that exist. Another limitation is that a specific continuous emotional model was used; the behavior of the approach must be analyzed for other models, even discrete ones, to see if they improve its performance. Also, the extraction of meta-characteristics from the learning resources in Merlot involved developing a specific script, which would have to be done for each learning resource repository to be used. Finally, a case study in a real learning environment using a tool such as the one proposed in this work is required, to analyze its effect on the emotional behavior of students and their learning processes.
For future work, more complex ML models, such as deep neural networks, can be explored to obtain adequate generalization between textual domains. Another line of future work is the use of other feature sets and the definition of a feature selection process for solving the problem of fusing the feature sets, which showed worse results than considering the feature sets individually.