1 Introduction

Recommender systems (RS) are currently a well-established solution to the problem of efficient information access in today's digital world [4, 32, 36, 84]. Specifically, collaborative filtering (CF) has been highlighted as a particularly suitable paradigm for implementing these systems because it is able to produce accurate recommendations while requiring a minimal amount of information [24].

User-user collaborative filtering, the pioneering recommendation approach, was initially introduced by Resnick et al. [65] as a way to predict a user's preferences by using information from similar peers within a community of people. Later, Sarwar et al. [71] extended the idea behind user similarity and defined item-item CF algorithms, which are based on item-based similarities and outperform user-user methods in terms of accuracy and scalability [22]. Koren et al. [40] popularized a new paradigm for CF, proposing dimensionality reduction methods that notably improve the performance of traditional methods. In addition, a substantial body of research has explored user ratings more deeply to enhance the recommendation process [11, 20, 74, 77, 79, 81, 88, 90].

Thus, although research on recommender systems began in the 1990s [66], it has since evolved into a very active and prolific research field with several current hot topics, such as item recommendation from implicit feedback [64], the use of deep learning [92], context-aware recommendation [3], group recommendation [51], cross-domain recommendation [18], and explainability [72], among other popular research efforts.

On the other hand, the preprocessing of inconsistent user preferences in RS has also become an emerging field of study, mainly focused on movie recommendations [47, 59, 86]. Several authors have stated that user ratings in recommender systems are intrinsically inconsistent because of imperfect and even unintentional user behaviors when expressing preferences, which limits system performance through a so-called magic barrier [69].

Extant research has provided different examples of the presence of natural noise in recommender systems:

  • Amatriain et al. [8] have suggested that preference values should not be regarded as ground truth because rating gathering is a noisy process.

  • Pham and Jung [59] pointed out two probable causes for the presence of natural noise in recommender system datasets: (1) user preferences change over time, and (2) users are imprecise when providing rating values.

  • Said et al. [69] and Kluver et al. [35] have indicated that users’ imprecision can be caused by personal conditions, social influences, emotional states, or certain rating scales.

  • Yera et al. [86] have presented an illustrative example of natural noise, where a low rating is considered noisy if the corresponding user usually evaluates most items positively and the associated item has been rated highly by the majority of users.

  • Yera et al. [85] also present an illustrative example of natural noise, where the noise degree of a rating is characterized by the number and weights of the identified user behaviors/regularities that the rating contradicts or does not verify. In this way, a rating is identified as noisy if it causes the user not to follow a pattern/regularity with high support.

To overcome these issues, several works have been proposed in the last few years, focusing on detecting, removing, or correcting naturally noisy ratings by using the rating information itself or information obtained from external sources [13, 47, 56, 59, 86, 94]. In addition, there have been studies centered on detecting whole noisy-but-non-malicious user profiles [44].

Previous research has focused on traditional evaluation setups that use ratings to create training and test sets and employ them to evaluate the accuracy of the corresponding method; in this scenario, these methods yield an improvement in recommendation performance [27]. However, the data associated with a real-time recommender system do not match these settings. In a real-world scenario, preferences are entered incrementally, and therefore rating order begins to play a relevant role [5, 16, 73]. Furthermore, the system must simultaneously capture this temporal information and provide user rating predictions. The addition of natural noise management to this new scenario therefore requires addressing several open questions connected with the application of the noise correction approach in RS datasets: (1) whether applying natural noise management to a segment of recent data, instead of the whole dataset, is effective in improving accuracy; (2) the magnitude of the associated improvement; (3) how this accuracy varies across different lengths of the rating sequences; and (4) how other important criteria in natural noise management, such as the intrusion level and the running time, trade off against accuracy, so as to suggest guidelines for using the new approaches in real scenarios. This paper addresses these issues by exploring approaches for performing natural noise management in such time-related recommendation scenarios.

Specifically, the main novel contributions of this paper in relation to previous proposals and existing similar approaches are:

  • The screening of the natural noise management process, tailored to an incremental, time-aware recommendation scenario.

  • The development of a comparison protocol between the time-aware natural noise management and the traditional natural noise management approach without the time dimension.

  • An extensive evaluation of time-aware natural noise management performance, using up to ten different state-of-the-art recommendation approaches as rating predictors in the natural noise management context. These include (1) a clustering-based method [25], (2) a basic neighborhood-based method [65], (3) a neighborhood-based method including average deviations [65], (4) a neighborhood-based method that includes bias-based baseline modeling [65], (5) a method based on non-negative matrix factorization [48], (6) Koren's basic SVD approach [40], (7) Koren's SVD++ approach, which extends SVD with implicit feedback [38], (8) the slope-one approach [43], (9) a baseline-only approach predicting the bias-based baseline estimate for a given user and item [38], and (10) a normal predictor based on the distribution of the training set, highlighted by Hug [31].

  • Overall evidence that a natural noise management approach incorporating time-related information and time windows reduces the method's intrusiveness, decreases the execution time, and leads to similar or improved accuracy.

The paper is organized as follows. Section 2 presents previous studies related to recommender systems and natural noise management in collaborative recommendations and justifies the selection of a specific approach to be used as the base for the following sections. Section 3 describes new approaches for performing the natural noise management task in an incremental, time-aware movie recommendation scenario. Section 4 plans and develops an experimental framework to evaluate the new NNM framework tailored to the time-related context and discusses the obtained results. Section 5 concludes the paper.

2 Preliminaries

In this section, we present the background necessary to follow and understand our proposal. It includes antecedents of movie recommender systems, previous work on the management of natural noise (the unintentional inconsistencies that users can introduce in RS), and a detailed description of a pioneering work on natural noise management that will be used as the basis for the current proposal.

2.1 Antecedents on basic recommender systems

The movie recommendation domain initially boosted the development of modern recommender systems in the mid-1990s [4, 65]. Two main recommendation paradigms have been developed over the last 30 years for building recommendation tools:

  • Content-based recommendation The basis of content-based recommendation is the use of movie attributes to compose the user and item profiles, considering the relationship between item attributes and the rating values provided by the users [4]. Several scoring approaches are then employed to recommend the most appropriate movie profiles to each individual user. Common movie attributes include genre, director, actors, country, year, and each movie's associated tags [19, 58, 83]. In the last few years, several sophisticated approaches have been developed for building the aforementioned user and item profiles, including advanced machine learning algorithms and semantic tools such as ontologies [46, 74].

  • Collaborative filtering-based recommendation In the case of collaborative filtering, the working principle relies on crowd preferences to suggest movies to the active user. Specifically, it is based on discovering, explicitly or implicitly, users whose rating patterns are similar to those of the current user, and on using their associated information for recommendation generation [55]. Two large families of collaborative filtering approaches can be identified: (1) memory-based [55], focused on directly finding appropriate user neighborhoods for the current user and using the preferences of such nearest neighbors for recommendation generation, and (2) model-based [40], focused on building intermediate models that summarize the preferences of the user crowd and can facilitate recommendation generation. Collaborative filtering approaches have been very popular in movie recommendation because they can provide accurate recommendations using only ratings, without any additional information, in contrast to content-based approaches, which depend on item attributes for appropriate performance [60].

Beyond these traditional approaches, movie recommendation has remained a relevant research topic in recent years. Deldjoo et al. [21] model a new concept, called the Movie Genome, as a way of alleviating the new-item cold-start problem in movie recommendation and thereby improving recommendation accuracy. Kumar et al. [41] introduce a movie recommender system using sentiment analysis on microblogging data, thus leveraging the content-based recommendation paradigm. Widiyaningtyas et al. [80] explore advanced correlation-based similarities between user profiles to introduce new movie recommendation algorithms aimed at outperforming previous proposals. In a different direction, Chen et al. [17] exploit users' positive and negative profiles and rely on preferences over movies to compose a novel movie recommendation method.

Overall, while most of the available approaches in movie recommendation focus on improving recommendations through more sophisticated recommendation algorithms [26], the current paper follows an alternative research path. Specifically, it focuses on improving recommendation accuracy through the management of the natural noise [50, 85, 86] associated with user preferences and the use of time-related information.

The next subsection focuses briefly on the incorporation of time-related information into recommender systems.

2.2 Antecedents of time-related information in recommender systems

The value of time-related information in recommender systems was pointed out early by Ding and Li [23], who presented several weighting approaches for the basic memory-based collaborative filtering scenario, using the time at which the user's opinion was provided as input for the weight calculation.

More recently, Campos et al. [12] presented a large-scale survey on time-aware recommender systems, illustrating a taxonomy for classifying the works that use time as a contextual dimension in this scenario. They cover three main categories: (1) continuous time-aware heuristic approaches, (2) categorical time-aware heuristic approaches, and (3) time-adaptive models.

Vinagre et al. [75] have also developed a survey in a similar direction, identifying several groups of research trends related to different challenges in the field, such as the following:

  • Time-aware algorithms focused on modeling time as context. These include time-aware factorization and time-aware neighborhood models. In the context-aware framework, time features (day, day of the week, working/nonworking hours) were used in the prefiltering, postfiltering, and modeling stages. The main limitation of this approach is related to the way in which the time variable is considered: it is only taken into account as one more variable of the problem, so traditional models are still applied. This modeling makes it especially complex to notice time-series behavior of interest, such as concept drift [62], which can affect attributes such as user preferences, product popularity, or product characteristics, among others.

  • Time-dependent algorithms focused on using time as a sequence. These include models that attempt to capture the phenomena related to sequential temporal dynamics in recommender systems, dealing with issues such as changes and fluctuations in user preferences and item popularity. Here, Vinagre et al. [75] also considered time-dependent neighborhood models (usually implemented through time-decay functions or sliding-window algorithms) and time-dependent factorization models. One potential pitfall of this approach is the increased complexity of the proposed models compared to traditional ones [34]; such complexity leads to a significant increase in running time, related to the temporal/sequential nature of the processing.

Finally, Vinagre et al. [75] also point out the separate modeling of short-term and long-term preferences [67], as well as the adaptation to this context of algorithms formerly focused on processing high-speed data streams [45].

More recently, Rabiu et al. [63] presented an updated survey of temporal (time-related) models in recommender systems, built on the same framework as Vinagre et al. [75]. The authors suggest the need to incorporate change-point detection methods over user preferences to better exploit the temporal dimension, to add time-related deep learning-based methods in this context, and to work toward an evaluation strategy tailored to this scenario.

In the last few years, the research line around time-related recommendation has been linked to the topic of sequential recommendation. Quadrana et al. [61] formalize the input of the sequence-aware recommendation problem as an ordered and often timestamped list of past user actions. Furthermore, several authors have recently introduced machine learning-related approaches into this research framework, such as contrastive learning [15], self-attentive neural architectures [93], and knowledge graphs [30].

In summary, the brief literature review presented in this section shows that the use of time-related information has been a long-standing goal of the research community. However, most of the developed approaches are centered on algorithms directly focused on improving recommendation performance. Notably, there is a lack of work systematically focused on the use of time-related information in RS tasks such as data preprocessing [7] or natural noise management [50], according to the recent reviews in this field already referenced [61, 75]. This study aims to fill this gap by proposing a time-related natural noise management framework for a movie recommendation scenario.

The next section provides an overview of natural noise management approaches in RS, which is necessary for the introduction of the methods presented in this paper.

2.3 Related works on natural noise management in RS

The preprocessing of inconsistent user preferences, so-called natural noise, is a relatively new research field in collaborative filtering recommender systems (CFRS). Here, we refer to the inconsistencies unintentionally introduced by users due to factors such as changes of taste over time, personal conditions, inconsistent rating strategies, or social influences, which cause the appearance of a “magic barrier” that affects performance [69]; we exclude those inconsistencies deliberately inserted by some users to bias the behavior of the system [28, 53]. This second kind of inconsistency, also known as malicious noise, is out of the scope of this paper.

The related literature has identified several examples of the consequences of natural noise on the user experience and the system's performance. Since rating gathering is a noisy process [8], a user could, through lack of attention, give a 5-star noisy rating to an item that does not deserve it, implying subsequent erroneous recommendations of items linked to the rated one. Furthermore, the current preference for some items, e.g., movies, could be conditioned by issues such as the release date or the advertising associated with the actors, the directors, or the movie itself [59]. Therefore, items preferred by some users in the past may not be preferred at present; conversely, some items disliked in the past could be loved now. Additionally, information from social networks could temporarily condition the rating values provided by a user [69]. For example, a 5-star item could be rated with two stars if the user reads negative comments about it. In a different direction, the variation across diverse rating scales in preference gathering systems, such as [0, 5] or [0, 10], creates confusion among users and thus leads to the introduction of natural noise.

The presence of noisy ratings that contradict users' common behaviors or regularities [85] thus implies a negative impact on recommendation accuracy, given that most recommendation approaches are built on the identification of common user behaviors.

O’Mahony et al. [56] introduced the first study that uses the term “natural noise.” The authors focus on identifying whether a rating is noise-free or contains natural noise; for this purpose, they determine the consistency between the original rating value and a new value predicted by a recommendation algorithm for the same user-item pair. Amatriain et al. [8] also consider the characterization of natural noise a key element in the RS research field. They first analyze the response of traditional recommendation methods under natural noise conditions using data obtained at three different moments: the second 24 h after the first, and the third at least 15 days after the second. This analysis shows that the prediction error varies considerably in each case. The core of the work then proposes a user-dependent procedure to remove these inconsistencies by assuming that several ratings from the same user on the same item are available (one rating and several re-ratings). Pham and Jung [59] have proposed a preference-based approach for rating correction in RS. This proposal focuses on the use of item attributes to represent user preferences and on the detection and correction of ratings that do not match the corresponding user preference models. Finally, Li et al. [44] also presented a method for handling inconsistencies in CF datasets. In this case, their method works at the user level, detecting noisy-but-non-malicious users whose preferences can affect recommendation accuracy. Specifically, the proposal assumes that ratings provided by the same user on closely correlated items should have similar scores; it then captures and accumulates the user's contradictions and uses them to remove the top noisy profiles. This removal improves recommendation accuracy.

More recently, with the same goal in mind, the degree of user coherence in RS datasets has been measured using item attributes (e.g., directors, actors), showing that recommendation accuracy improves when users with lower coherence are discarded [10]. This work is continued by Yu et al. [91], who propose a correction approach for the preferences associated with such low-coherence users. In parallel, Saia et al. [68] presented an approach that uses semantic information to remove incoherent items from user profiles in recommendation scenarios. On the other hand, Yera et al. [86] and Castro et al. [13] proposed a natural noise management method for collaborative RS based on a correction paradigm. In contrast to previous studies, it does not depend on additional information beyond the rating matrix, such as item attributes or user feedback. The method uses a prior classification approach to characterize user and item behavior and detects anomalous ratings based on this classification. Finally, for these flagged ratings, a correction process is performed by calculating a new rating value for the same user-item pair using the remaining ratings and a traditional CF technique. In particular, corrections are made if the difference between the old and new ratings is higher than a threshold. In this case, the predictor used to calculate such ratings was Resnick's user-based CF method with Pearson's similarity (UserKNNPearson) [65].

Over the last few years, several authors have extended the pioneering work developed by Yera et al. [86], enriching it with further computational intelligence techniques and extending the initial ideas. Zhu et al. [94] take advantage of the correlations between the entropy of the rating data and the prediction uncertainty in terms of evaluation metrics, and develop a new denoising algorithm based on fuzzy clustering. The authors assume that recommendation accuracy is sensitive to natural noise and that the entropy of an individual rating dataset indicates the uncertainty derived from noisy data; the fuzzy C-means algorithm is used for noisy rating verification. Recently, Luo et al. [47] presented a new approach for natural noise management in recommender systems that detects natural noise according to the inconsistency between rating behaviors and users' and items' categories, in a similar way to Yera et al. [86]. Furthermore, the authors consider the probability that each user belongs to each subcategory and correct the natural noise with threshold values weighted by these probabilities.

In parallel, Wang et al. [76] follow the same scheme and propose an approach that employs fuzzy theory to handle natural noise in RS by classifying ratings into three fuzzy categories characterized by variable boundaries. Fuzzy profiles of users and items are then constructed to effectively identify natural noise within the ratings. Upon detecting noisy ratings, the authors employ the maximum membership principle to replace them with rating threshold values. Also, Bag et al. [9] reclassify the users and items of a system into three classes, namely strong, average, and weak, to identify and correct noisy ratings. This study then integrates the Bhattacharyya coefficient, a well-performing similarity measure for sparse datasets, with the proposed reclassification method to predict unrated items from the obtained noise-free sparse dataset and recommend preferred products to consumers. In addition, deep learning-based architectures have also been used for natural noise management in RS. Recently, Park et al. [57] proposed an autoencoder-based recommender system exploiting the abilities of both anomaly detection and CF. The proposed system detects natural noise in the rating data based on reconstruction errors after training; by removing the detected natural noise, the collaborative filtering approach can predict the missing ratings using noise-free data.

Table 1 summarizes the described methods in terms of four main features:

  • Avoid loss of information: It refers to whether the natural noise management approach avoids removing user preferences.

  • Does not use additional information: It refers to performing the noise management without depending on information beyond the user preference values. Examples of such additional information are item attributes or tags.

  • Considers a time-related context: It refers to the use of rating timestamps or similar time-related variables in the developed models.

  • Tailored to a group scenario: It refers to natural noise management models specifically conceived or evaluated in group recommendation scenarios.

Works like [56] and [44], even though they do not depend on additional information beyond the rating values, remove important information from the dataset. Other research, like that developed by Pham and Jung [59], Amatriain et al. [8], Bellogín et al. [10], Yu et al. [91], and Saia et al. [68], although focused on rating correction and therefore not implying information loss, depends on additional information beyond the rating matrix and could therefore be difficult to apply in some scenarios. Finally, Yera et al. [85] introduce a regularity-based correction approach that does not depend on additional information but requires the discovery of intermediate knowledge in terms of association rules, which could be difficult to generalize in some scenarios.

In contrast to the abovementioned works, the prior classification-based approach developed by Yera et al. [86] and Castro et al. [13], also featured recently by Bag et al. [9] and Luo et al. [47], corrects ratings, does not remove important information from the dataset, and does not depend on additional information such as item attributes. Furthermore, while most of the considered approaches are centered on individual recommendation, Castro et al. [13] have introduced natural noise management in group recommender systems.

Table 1 Comparative analysis of existing approaches

Therefore, considering the advantages of the prior classification-based approach for natural noise management [86], as well as its increasing popularity according to recent works that have continued this line [9, 47] (see Table 1), the rest of the paper takes the pioneering classification-based approach [86] as the base for the current proposal. As presented in Table 1, the current proposal provides a novel feature in handling a time-related context, in contrast to previous approaches that do not consider it. We leave the tailoring to a group recommendation scenario for future work.

2.4 The classification-based approach for natural noise management in RS

The classification-based approach for natural noise management in RS (Fig. 1) was proposed as a way to perform this task without using additional information beyond the user ratings [86].

This approach comprises two main stages: (1) the detection of possible noisy ratings, and (2) the correction of noisy ratings.

The first stage performs a classification of users and items based on a direct inspection of their ratings, to identify tendencies toward low, medium, or high preferences. Ratings that do not match these well-identified tendencies are considered possibly noisy. This is the underlying principle behind this stage.

Specifically, each user, item, and rating is classified into one of several possible classes, which are presented in Table 2. Users are classified as benevolent, average, critical, or variable, and items as strongly preferred, averagely preferred, weakly preferred, or variably preferred. The variable and variably preferred classes are used, respectively, for users and items that cannot be assigned to a specific class. Ratings are classified as weak, average, or strong, depending on two thresholds. Algorithm 3 (included in Appendix A) shows the pseudocode of this process, which is also included in our new proposals. Moreover, the proposal considers three groups that establish a matching among user, item, and rating classes. The method assumes that, for a given rating, if its user and item classes belong to the same group (other than the variable classes), then the rating should belong to the corresponding rating class of that group. Otherwise, the rating is classified as a possible inconsistency.

Fig. 1 Global scheme of the previous classification-based approach for natural noise management

Table 2 Group of homologous classes
Table 3 Classes definition

Table 3 presents the criteria for classifying users and items using this rating classification. In the case of users, it assumes that for each user u, \(W_u\), \(A_u\), and \(S_u\) are the respective sets of weak, average, and strong ratings. Depending on the proportion of ratings in each class, the user is classified as critical, benevolent, or average; users with a similar proportion of the three kinds of ratings are classified as variable. In the case of items, a very similar approach is followed, considering all the ratings associated with each item (see also Table 3). Here, \(W_i\), \(A_i\), and \(S_i\) are the respective sets of weakly preferred, averagely preferred, and strongly preferred ratings for item i.
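To make the detection stage concrete, the following minimal Python sketch classifies ratings and user/item profiles and flags possible inconsistencies. The threshold values and the majority-based dominance criterion are illustrative assumptions; the exact parameterization is given in Algorithm 3 and in Yera et al. [86].

```python
# Minimal sketch of the detection stage (Sect. 2.4). The thresholds
# (k=2.5, v=4.0 on a [1, 5] scale) and the dominance criterion are
# illustrative assumptions, not the exact values of Algorithm 3 [86].

WEAK, AVERAGE, STRONG, VARIABLE = "weak", "average", "strong", "variable"

def classify_rating(r, k=2.5, v=4.0):
    """Classify a single rating as weak/average/strong via two thresholds."""
    return WEAK if r < k else (AVERAGE if r < v else STRONG)

def classify_profile(ratings, k=2.5, v=4.0):
    """Classify a user or item profile from its rating-class proportions.

    A dominant class maps to critical/average/benevolent users (or
    weakly/averagely/strongly preferred items); otherwise 'variable'.
    """
    counts = {WEAK: 0, AVERAGE: 0, STRONG: 0}
    for r in ratings:
        counts[classify_rating(r, k, v)] += 1
    cls, n = max(counts.items(), key=lambda kv: kv[1])
    # Assumed dominance rule: a strict majority of all the ratings.
    return cls if n > len(ratings) - n else VARIABLE

def is_possible_inconsistency(r, user_cls, item_cls, k=2.5, v=4.0):
    """Flag r when the user and item classes agree on a (non-variable)
    group but the rating's own class contradicts it (Tables 2 and 3)."""
    if VARIABLE in (user_cls, item_cls) or user_cls != item_cls:
        return False
    return classify_rating(r, k, v) != user_cls
```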

The second stage of the proposal focuses on correcting the ratings identified as possible inconsistencies in the previous stage. Specifically, a new rating value is predicted for each user-item pair associated with a possibly noisy rating. This stage uses an underlying rating prediction algorithm, the well-known Resnick's user-based method with Pearson's similarity (UserKNNPearson) [65], as the original collaborative filtering approach. In each case, if the original rating is sufficiently different from the predicted value, the old rating is replaced with the new one. In the proposal, the difference threshold was set to \(\delta =1\), as this value tends to be the minimum step between two ratings in recommendation scenarios. Algorithm 4 (included in Appendix A) presents this procedure, which is also included in our new proposals.
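A minimal sketch of this correction stage follows, using Surprise's user-based KNN with Pearson similarity as the UserKNNPearson predictor [65]; the data-frame column names and the single-rating-per-pair assumption are ours, not part of the original Algorithm 4.

```python
# Sketch of the correction stage (Algorithm 4): re-predict each flagged
# rating and replace it only when it deviates from the prediction by
# more than delta (= 1 in the original proposal [86]).

import pandas as pd
from surprise import Dataset, KNNBasic, Reader

def correct_flagged_ratings(df, flagged, delta=1.0):
    """df: columns (user, item, rating); flagged: (user, item) pairs
    detected as possible inconsistencies. Assumes one rating per pair."""
    reader = Reader(rating_scale=(1, 5))
    trainset = Dataset.load_from_df(
        df[["user", "item", "rating"]], reader).build_full_trainset()
    # Resnick's user-based CF with Pearson similarity (UserKNNPearson).
    algo = KNNBasic(sim_options={"name": "pearson", "user_based": True})
    algo.fit(trainset)

    out = df.set_index(["user", "item"])
    for user, item in flagged:
        old = out.loc[(user, item), "rating"]
        new = algo.predict(user, item).est
        if abs(old - new) > delta:      # replace only clear deviations
            out.loc[(user, item), "rating"] = new
    return out.reset_index()
```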

As presented in this section, several authors have pointed out that user preferences evolve over time and that taking this into account leads to performance improvements in RS models [15, 30, 75]. It is therefore necessary to explore how the use of time-related information affects the behavior of the natural noise management model just described. To this end, two new proposals for performing natural noise management in an incremental, time-related recommendation scenario are presented in the next section.

3 Correcting noisy ratings in a time-aware recommendation scenario

Recommendation tasks are intrinsically incremental, given that the ratings stored behind a CF recommender system are provided by users who simultaneously request suggestions from the system itself. However, as presented in the Introduction, applying natural noise management approaches to this incremental scenario raises new issues that have not yet been considered, and to the best of our knowledge, no previous studies have focused on solving this task. Typical natural noise management methods receive as input a set of ratings, optionally with additional information about them, and return the corrected set as output. Under these circumstances, their deployment in an incremental, time-aware scenario faces challenges such as selecting the set of ratings to be corrected over time and selecting the data that must be considered for the correction process. Taking into account the relevance of the time dimension and the sequential recommendation context as research trends, it is necessary to tailor formerly developed natural noise management models to these new requirements and scenarios. Therefore, the goal of the current study is to screen new models for the natural noise management process contextualized to an incremental, time-related recommendation scenario.

These models use as underlying algorithms the approach for identifying possibly noisy ratings and the approach for correcting noisy ratings. Both algorithms, formerly proposed by Yera et al. [86], have been discussed in Sect. 2.4 and are detailed in Algorithms 3 and 4.

Furthermore, this work develops a comprehensive experimental procedure over several recommendation approaches, broader and more general than in the previously cited works on natural noise management. Specifically, the current research screens two frameworks for natural noise management in recommender systems, assuming a sequential gathering of the rating data, which is the real context of a deployed recommender system. The next sections describe these approaches.

3.1 Sequential natural noise management in collaborative filtering

As a first step, we propose a framework, named SeqNNM, that considers the continuous gathering of sequential rating data by the RS. Figure 2 illustrates this framework. Here, it is assumed that a set of rating sequences \(s_1, s_2,..., s_k,...,s_n\) is continuously gathered by the system. Each newly gathered sequence \(s_k\) is first added to the main RS dataset R. Then, the \(R+s_k\) dataset is corrected through the aforementioned natural noise management approach. From the identification of noisy ratings, following Algorithm 3, and the subsequent prediction of corrected ratings, following the guidelines in Algorithm 4, a processed dataset is obtained with the noise corrected based on the data available up to that moment. The sequential processing of data in specific time steps is the main innovation of this proposal. After that, the data produced as output by the NNM approach starts to be used as the main data of the recommender system, both for the main recommendation generation process and for the subsequent runs of the NNM process. This procedure processes all the available data multiple times, so it is able to correct a large amount of noise at the cost of further intrusion into the original data. Algorithm 1 presents an overview of this framework.

Fig. 2 Overview of the approach for natural noise correction in a sequential scenario

Algorithm 1 Pseudocode for the incremental time-aware natural noise management proposal (seq method)

3.2 Sequential natural noise management in collaborative filtering covering the last p rating sequences

The framework for sequential natural noise management presented in the previous subsection has the shortcoming that a large volume of data is used for natural noise management during the processing of each new sequence, which could affect the time performance of the proposal. To alleviate this drawback, we propose an alternative approach, named SeqNNM-p, in which, instead of correcting all data every time a new rating sequence is processed, only the last p sequences of the most recent ratings in the dataset are corrected. This approach significantly limits the data to be processed in each iteration, considerably reducing the final running time and the intrusiveness of the original proposal, since the number of instances identified as noise is reduced with a shorter time horizon.

Figure 3 illustrates this approach. Here, once a new rating sequence \(s_k\) is gathered, a temporal dataset T is built containing that sequence as well as the previous ones. The natural noise management approach used as the starting point for these models (Sect. 2.4) is applied over this temporal dataset T, and in the last stage, the values of the modified ratings in T are updated in the original dataset R used for recommendation generation. Algorithm 2 screens this approach.

Fig. 3 Overview of an improved approach for natural noise correction in a sequential scenario, considering the correction of the last k sequences

Algorithm 2 Pseudocode for the incremental time-aware natural noise management proposal, considering the correction of the last k sequences (seqk method)

Overall, the computational cost of both approaches presented in this section depends on two main factors: (1) the cost of the classification-based approach for natural noise management, which is used in the initial step of both approaches, and (2) the cost of the inner approaches for rating prediction. In the first case, considering that a full inspection of the rating matrix is necessary, the theoretical cost would be \(O(|U| \cdot |I|)\), where U and I are the sets of users and items. However, due to the sparsity of RS datasets, this matrix can be inspected quickly. In the second case, the complexity of the different rating prediction methods varies from constant-time methods to methods with higher complexity. Moreover, the experimental section shows that, in practice, the approach is able to correct several ratings in a short period, and that the considered sequence length can control this execution time while maintaining positive accuracy values for almost all the evaluated settings.

4 Experiments and results

This section carries out an evaluation process to measure the impact of the proposed alternatives for natural noise management in an incremental, time-aware recommendation scenario. We consider two main criteria for the performance evaluation: the recommendation accuracy after executing the correction method on the data, and the number of ratings modified by the correction process. With this aim, we first discuss the experimental setup and then present and analyze the experimental findings.

4.1 Evaluation protocol

In this study, we evaluated how our natural noise preprocessing approach increases data quality and thereby affects recommendation accuracy. We compare the results provided by our sequential approach against two different cases: not applying any natural noise method, and applying the natural noise method identified as the baseline [86] without considering the sequential nature of the data. After applying the selected natural noise method, the recommendation results were evaluated using a fivefold cross-validation approach.

It is important to highlight that the same rating prediction model is used for both the natural noise preprocessing step and the final recommendation. The next subsection details the prediction models used to evaluate the natural noise management schemes proposed here. For the sequential proposals, it is necessary to simulate a real-world environment. Specifically, the initial training dataset comprises the data for the first ten weeks of the time frame of the dataset used in the experiments; the natural noise process is then performed over the accumulated dataset, adding each subsequent week's information in a sequential manner.
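This simulated setting can be reproduced by bucketing ratings into weeks from their timestamps, as in the following sketch; the column names follow the MovieLens convention, and the exact segmentation boundaries are an assumption.

```python
# Sketch of the simulated incremental protocol: the first ten weeks of
# ratings form the initial training data; every later week becomes one
# rating sequence s_k. Timestamps are Unix seconds, as in MovieLens.

import pandas as pd

def weekly_split(df, n_initial_weeks=10):
    week = (df["timestamp"] - df["timestamp"].min()) // (7 * 24 * 3600)
    initial = df[week < n_initial_weeks]
    sequences = [g for _, g in df[week >= n_initial_weeks].groupby(week)]
    return initial, sequences
```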

4.2 Models

In order to obtain robust results and to provide a comparative basis with the state of the art, all the prediction models included in the Python Surprise package [31] have been included in the experimentation. The selected models are listed in Table 4. Note that each model uses the default configuration set in Surprise.
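For reproducibility, the ten predictors can be instantiated with their Surprise defaults as follows; the mapping to the descriptions in the Introduction is shown in the comments.

```python
# The ten Surprise predictors used in the experiments, with default
# configurations (class names as in the Surprise package [31]).

from surprise import (BaselineOnly, CoClustering, KNNBaseline, KNNBasic,
                      KNNWithMeans, NMF, NormalPredictor, SlopeOne, SVD,
                      SVDpp)

MODELS = {
    "CoClustering": CoClustering(),        # clustering-based [25]
    "KNNBasic": KNNBasic(),                # basic neighborhood-based [65]
    "KNNWithMeans": KNNWithMeans(),        # neighborhood + average deviations
    "KNNBaseline": KNNBaseline(),          # neighborhood + bias baselines
    "NMF": NMF(),                          # non-negative matrix factorization [48]
    "SVD": SVD(),                          # Koren's basic SVD [40]
    "SVDpp": SVDpp(),                      # SVD++ with implicit feedback [38]
    "SlopeOne": SlopeOne(),                # slope-one [43]
    "BaselineOnly": BaselineOnly(),        # bias-based baseline estimate [38]
    "NormalPredictor": NormalPredictor(),  # samples from training distribution
}
```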

The approaches proposed in this work, SeqNNM and SeqNNM-p, are identified for simplicity by seq and seqk, respectively, in the experimentation carried out.

Table 4 Recommendation algorithms used

4.3 Datasets and evaluation metrics

Our evaluation protocol selects two different versions of the MovieLens dataset [29], which is popular in the RS field and additionally provides a timestamp for each rating. First, MovieLens100k contains 100,000 movie ratings provided by 943 users on 1682 items, where each rating belongs to the range [1, 5]. Second, we use the last 1 million instances of the MovieLens25M dataset, containing 1,000,000 movie ratings provided by 8715 users on 5667 items, again in the range [1, 5]. These datasets are considered standard benchmarks in recommender systems and are currently used by several research works [1, 2, 42, 54].

To evaluate the performance of the proposals, we perform fivefold cross-validation, where 80% of the samples compose the training set and the remaining 20% the test set, and measure the recommendation accuracy through widely used metrics: the mean absolute error (MAE), the root-mean-square error (RMSE), the normalized discounted cumulative gain (NDCG) [78], precision, recall, and F1-score (F1). The NDCG metric [33] relies on the discounted cumulative gain (DCG) and is grounded in the assumption that highly relevant items appearing toward the end of a recommendation list should be penalized, because the graded relevance value diminishes logarithmically with the position of the result. The formalization of DCG is as follows:

$$\begin{aligned} {DCG}_u={\sum _{k=1}^{N}{\frac{r_{u,{recom}_{u,k}}}{\log _2{(k+1)}}}} \end{aligned}$$
(1)

where recom\(_{u,k} \in I\) is the item recommended to user u at position k.

To calculate NDCG, the DCG value is normalized by dividing it by the maximum achievable DCG value, known as \(DCG_{perfect}\) [33], which represents an ideal recommendation list where the most preferred items are ranked at the top. The NDCG value for each user is computed as follows:

$$\begin{aligned} {NDCG}=\frac{DCG}{{DCG}_{perfect}} \end{aligned}$$
(2)

As a final step, the NDCG values associated with individual users are averaged to derive the final reported NDCG value. Additionally, we record the number of values modified by the natural noise correction process and the running time of the complete experimentation. The definitions of the best-known performance measures used are included in Table 5.
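Equations (1) and (2) translate directly into code; in the sketch below, the input list is assumed to hold the true ratings \(r_{u,recom_{u,k}}\) in the order in which the items were recommended.

```python
# NDCG for one user following Eqs. (1)-(2): DCG over the recommended
# order, normalized by the DCG of the ideal, relevance-sorted order.

import math

def dcg(relevances):
    # position k is 1-based in Eq. (1), hence log2(k + 1) = log2(i + 2)
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg(recommended_relevances):
    perfect = dcg(sorted(recommended_relevances, reverse=True))
    return dcg(recommended_relevances) / perfect if perfect > 0 else 0.0

# The reported value is the mean of ndcg(...) over all users.
```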

Table 5 Performance measures used for evaluating the recommendation accuracy

The intrusiveness of the studied models is another important parameter to be analyzed; it is evaluated through the number of values modified by the applied natural noise techniques. In this scenario, a greater number of modified values indicates greater intrusiveness of the method.

Finally, the running time of each proposal is recorded to evaluate its scalability.

4.4 Experimental results

In this section, we present the experimental findings for the specified protocol. To facilitate the reading of this work, we include a graphical analysis of the most relevant performance metrics. Furthermore, the tables with numerical results report all the metrics described in Sect. 4.3 for the models presented in Sect. 4.2.

4.4.1 MovieLens 100k dataset

The results included in Table 6 show a robust improvement in recommendation performance when traditional or sequential natural noise correction is applied. Likewise, for each recommendation method, the proposed sequential natural noise correction process (seq) obtains the best results in most cases. The BaselineOnly method combined with the proposed sequential natural noise process obtained the best results in the RMSE, MAE, and precision metrics. The KNNBasic method obtains the best results in the recall and F1-score metrics, although these results are not very different from those obtained by BaselineOnly. On the other hand, SVD++ obtained the best results for NDCG; this metric is particularly relevant to this type of problem and is gaining importance over time. SVD++ also offers very competitive results in the other metrics, especially RMSE, MAE, and precision. BaselineOnly obtains the best results at the cost of being the second most intrusive method after NormalPredictor. It is important to note that methods that are significantly less intrusive than BaselineOnly, such as SVD++ or KNNBasic, are able to provide similar performance.

Table 6 Results obtained for MovieLens 100k dataset: no natural noise method (no), baseline natural noise proposal (nn), the sequential proposal (seq), and the sequential method considering the last k rating sequences (seqk)

To facilitate the comparison between the nn model and the seq model, Table 7 shows the percentage improvement obtained in each metric. A considerable improvement can be seen in practically all performance metrics and cases. Regarding running time and the number of modified values, seq is more time-consuming and intrusive; both metrics show the cost of obtaining better results.

Table 7 Percentage of improvement (%) of the seq proposal vs. the state-of-the-art nn proposal for MovieLens 100k dataset

To analyze the results more clearly, some graphical comparisons have been included.

Fig. 4 RMSE results for the multiple models and natural noise approaches selected in the MovieLens 100k dataset

Figure 4 includes the RMSE results for every model tested and all the natural noise approaches included in this work. The comparison shows that applying any natural noise correction technique improves the final results obtained by all the methods. Our first proposal, sequential and cumulative natural noise correction (seq), provides the best results for all models, with significant improvements over the traditional approach (nn). Our second proposal (seqk) offers results that come progressively closer to those obtained by the traditional method (nn) as the value of k increases. This behavior is relevant in massive data or Big Data environments, considering that seqk works with a reduced subset of data while nn needs all the available data. Moreover, in the case of SVD, seqk11 is able to provide better results than nn, so in these environments the seqk approach becomes a desirable alternative.

The NDCG metric (Table 7) shows behavior similar to that observed for RMSE. The seq approach obtains the best results, nn comes second, followed by the different seqk approaches: the higher the k, the better the performance. This metric shows reduced differences between seqk and nn, and it is possible to improve on the traditional approach with slight increases in the k parameter. It is important to note the reduction in resources associated with the seqk approach, both in time and in memory, since it processes a subset of the original data at each step.

Fig. 5 Running time (s) results, on a logarithmic scale, for the multiple models and natural noise approaches selected in the MovieLens 100k dataset

Figure 5 shows the running time obtained by each approach. The difference between the seq approach and the rest is quite clear: this approach requires the longest running time. In second place we find the traditional approach (nn), while our second proposal (seqk) offers a considerable reduction in running time with respect to nn. Since the cost in result performance is small, the seqk approach provides a robust alternative in environments where time is a constraint to be considered.

Fig. 6 Number of modified ratings produced for each model and natural noise approach combination evaluated in the MovieLens 100k dataset

Finally, Fig. 6 shows the number of modified values for each model and approach. The seq approach is the most intrusive among the models, except for the SlopeOne and KNN-based models. The seqk approach becomes more intrusive as the value of k increases. This behavior shows that higher data availability leads to higher natural noise detection and, therefore, higher intrusiveness. The re-evaluation of the data when new data are sequentially added also leads to greater intrusiveness, as in the seq case.

4.4.2 MovieLens last 1 M of 25 M rating dataset

In this section, experimentation close to a real use case in a data-intensive environment is performed, allowing us to evaluate the performance and scalability of the proposals.

The results shown in Table 8 reveal a clear dominance of the SVD++ model with the proposed seq approach, which obtains the best results in terms of the RMSE, MAE, and NDCG metrics. BaselineOnly, also with the seq approach, obtains the best precision and F1-score results. Finally, the KNNBasic model with the seq approach obtains the best results in the recall metric. Although the seq approach is more data-intrusive than traditional approaches, the differences in performance are very significant, as can be seen in Fig. 7.

Table 8 Results obtained for MovieLens 1 M dataset: no natural noise method (no), baseline natural noise proposal (nn), the sequential proposal (seq), and the sequential method considering the last k rating sequences (seqk)
Fig. 7 RMSE results for the multiple models and natural noise approaches selected in the MovieLens last 1 M of 25 M ratings dataset

Additionally, Table 8 shows an increase in the running time for the seq approach, while the seqk approach shows a significant reduction in time cost. Because each time window is seven days, the seq approach can be applied in a real-world application without problems. In the case of time constraints that prevent the use of the seq or nn approaches, the seqk approaches offer competitive results (Fig. 7) with respect to the traditional approach (nn). Moreover, this approach allows us to improve performance by adapting the time horizon k to the time constraints of each problem.

As in the previous case, the comparison between the nn model and the seq model, in terms of percentage improvement, is included in Table 9. The results show a significant enhancement across nearly all performance metrics. When examining factors such as running time and the number of modified values, it becomes clear that the seq approach is more time-consuming and invasive; both metrics highlight the trade-off involved in achieving improved results.

Table 9 Percentage of improvement (%) of the seq proposal versus the state-of-the-art nn proposal for MovieLens 1 M dataset

Analyzing the results obtained by all models and approaches in both datasets, we can see that, except for the KNNBaseline model, all models obtain, in most cases, an improvement in performance. Focusing on the seqk approaches, the performance differences for different values of k are more significant for MovieLens100k than for MovieLens1M (Figs. 4 and 7, respectively). These results show the importance of the temporal component in both scenarios, although it is more significant in small datasets. They also show the importance of the parameter k, which should be adapted to the type of problem addressed.

The results obtained by the seq approach, which are the best in the vast majority of cases, require a large amount of running time (Fig. 8). Because, in a real case, accumulating the amount of data we are working with takes weeks or even months, the running time does not limit the application of our proposal in real scenarios. In the extreme case of working with large amounts of data and very tight model running time windows, there is always the option of using the seqk approach, which allows performance and running time to be adjusted through the parameter k; this is especially useful for this type of problem.

Fig. 8 Running time (s) results, on a logarithmic scale, for the multiple models and natural noise approaches selected in the MovieLens last 1 M of 25 M ratings dataset

Finally, Fig. 9 shows the intrusiveness levels for the MovieLens1M dataset. Comparing them with those obtained for the MovieLens100k dataset (Fig. 6), we can see a significant reduction in the relative intrusiveness in each dataset. This behavior can be observed numerically by comparing the results included in Tables 6 and 8. This is especially relevant in the case of the KNN-based models, where the seq approach is the second least intrusive, marking a significant difference from the trend shown by the rest of the cases in both datasets.

Fig. 9 Number of modified ratings produced for each model and natural noise approach combination evaluated in the MovieLens last 1 M of 25 M ratings dataset

4.5 Discussion

The results obtained in the previous section show a considerable improvement in the data quality after the application of the two natural noise correction techniques proposed in this study.

The first proposed approach, seq, obtains a considerable improvement in results by adding and accumulating information sequentially (Tables 6 and 8). Although its intrusiveness is not high according to Figs. 6 and 9, the computation required in high-dimensional problems may limit its use.

The second proposed approach, seqk, focused on using the data from the last k weeks, is able to provide competitive results in a short time, at the cost of higher intrusiveness relative to the traditional natural noise approach (Figs. 6 and 9) and of a correct setting of the parameter k. This proposal uses a smaller amount of data, which enables its use in real, large-scale problems with running time limitations (Figs. 5 and 8).

Based on the obtained results, the application of natural noise correction techniques has shown a robust increase in the quality of the processed data; for this reason, its use is highly recommended in any RS problem. Furthermore, this work has shown that applying natural noise management approaches over small segments of rating data in RS is feasible, a step forward with respect to former works in natural noise management [13, 86], which have always used the whole dataset as input. In our view, this is one of the main contributions of this work with respect to previous contributions.

In addition, a well-defined balance was observed between the volume of data used for the natural noise management model and the degree of accuracy improvement linked to it. A larger number of rating segments used in the natural noise management leads to a larger accuracy improvement, whereas a lower number of rating sequences in the correction implies a more modest one. Nevertheless, fewer sequences also imply a lower running time, which may need to be controlled in practical application scenarios of the methods discussed.

The proposals presented in this paper improve the quality of the data processed in recommender systems by incorporating temporal information of interest into the natural noise-cleaning process, in a way that is transparent to the end user: both proposals can be applied to any recommender system with temporal information, since they can be included as an additional step just before introducing the data into the final model.

An important shortcoming of the approaches presented in the current contribution is the lack of uncertainty management associated with the rating data. The management of uncertainty has previously been proven to be a useful component of natural noise management in recommender systems [85, 87]. Future work will focus on this direction. In addition, future work will explore different exponential functions for characterizing the importance of each rating, according to its associated timestamp, when building user profiles within the natural noise management task.

Furthermore, another important shortcoming of the current work is that it is specifically focused on individual recommendations. However, previous studies on natural noise management, such as [13], showed that this task has a very positive effect on group recommender systems. These previous results therefore highlight the necessity of exploring time-related natural noise management, as screened in the current work, in group recommender system scenarios.

5 Conclusions

In recent years, several studies have shown that user preferences tend to be inconsistent, which affects the accuracy of recommender systems (RS). To address this issue, several preprocessing approaches have been developed that process these anomalous behaviors and have a positive impact on recommendation accuracy.

In this study, we focus on the application of these preprocessing proposals in real-time RS. To this end, we propose two incremental strategies to correct noisy ratings in this scenario. Considering a simulated time-aware RS, we have shown that these strategies are appropriate in terms of recommendation accuracy, running time, and the level of intrusion into the data. Specifically, it is important to highlight that the achieved recommendation accuracy outperforms that obtained by previous works on natural noise management, such as those discussed in Sect. 2, and that the identified intrusion degree is lower than that of other data preprocessing tasks in related data mining scenarios [6].

Beyond the theoretical and experimental results obtained in this paper, the practical implication of these results is that they show that the time-related, sequence-driven management of natural noise in recommender systems is feasible. While previous works in this area have performed this task over a large batch of data, the current work shows that the noise correction of small data segments also improves prediction accuracy and can provide additional benefits such as lower intrusiveness and a shorter running time. This natural noise management over small data segments could be the key to generalizing these types of approaches in currently deployed recommendation applications, whose huge amounts of data make previous methods focused on the entire dataset inappropriate. This work holds potential applications in domains characterized by the continuous generation of content that often experiences brief trending periods, which require proposals capable of effectively integrating temporal information and enhancing the data quality for the final model. Examples of such domains include streaming platforms and social networks, which frequently produce trending content within specific timeframes.

In future work, the current proposals will be extended to the group recommendation scenario [13]. Furthermore, we will focus on reformulating the current proposals using fuzzy tools [14, 85]. As a major goal, we aim to minimize the number of corrections required on past ratings and eventually work toward a framework in which correction is performed at the moment a rating is inserted. For this purpose, we intend to exploit sequential pattern mining theory [52] to model, at least partially, the inconsistencies that appear.

Additionally, we intend to validate the current proposal through its use for recommendation improvement in practical cases such as e-learning scenarios [89]. Finally, explainable recommendations should also be considered in this environment [82].