In this section, we explain how entity-centric learning with ensembles was realized in [6], as well as our methods for reducing the primary memory requirements. Section 3.1 is largely adopted from [6], since the entity-centric learning part of our approach stays the same (text that is identical is shown in blue for review purposes).
Entity-centric learning
As in conventional opinion stream classification, our learning task is to predict the label of each arriving review. To exploit the fact that some reviews refer to the same entity, we partition the stream into one sub-stream per entity, as explained in Section 3.1.1. In Section 3.1.2, we describe how model learning, adaptation, and forgetting are performed by the entity-centric learning algorithm and by the entity-ignorant one, which cooperate in an ensemble. We then present our weighting schemes for the ensemble members in Section 3.1.5.
Entity-centric modeling of the stream
We model the data set DS as a stream of incoming reviews, where each review belongs to a specific product. From here on, we use the more general term “entity” instead of “product”, and “observation” or “instance” instead of “review”. We denote as t1,t2,…,tm,… the time points at which the observations arrive, so that om stands for the observation which has arrived at time point tm.
We denote the set of all entities as E, and the j th observation belonging to entity e ∈ E as obse,j ∈ DS. This implies that all observations in DS belonging to e ∈ E constitute a sub-stream Te. Since the first observation for an entity may arrive at any time point tm, and since the popularity of the entities varies, the sub-streams have different speeds, and the j th observation for entity e may arrive much later than the j th observation for entity e′.
Each observation consists of a text field (the review content) and a sentiment label from a set of labels \({\mathscr{L}}\), for example the number of stars assigned to a review or the set {pos, neg, neutral}.
Given the infinite stream DS of observations, the learning task is to build a model which, at each time point tm, receives an observation om belonging to entity e ∈ E and predicts the label of this observation, given all observations seen thus far for e and for all other entities.
An ensemble with two voting members
Our proposed ensemble has two voting members: a conventional stream classifier that treats all observations as independent, and the set of single-entity classifiers (SECs), one per entity. We explain the SECs first and describe the orchestration of the ensemble thereafter.
The entity-centric ensemble member
For each entity e ∈ E, we train and gradually adapt a single-entity classifier SECe. This classifier sees only the sub-stream of observations Te = {obse,1,obse,2,…}. Since the set of entities E over the stream DS is not known in advance, we perform a single initialization step for the whole DS and then, whenever a new entity e shows up, launch a new SECe. A SECe is invoked for classification and adaptation only when an observation on e arrives. In the initialization step, we build a single feature space F of size N over DS by selecting the top-N words (for a very large N). When the first observation of an entity e, obse,1, appears, a new SECe is created and trained. In [6], we kept all SECs in primary memory (see Fig. 1), whereas in this work we use two different memory management strategies, which are explained in Section 3.2.
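To make the per-entity bookkeeping concrete, the following sketch illustrates how one SEC per entity could be managed in primary memory; the class and interface names (EntityCentricMembers, predict, partial_fit, classifier_factory) are illustrative assumptions and not the implementation of [6].

```python
# Illustrative sketch: one single-entity classifier (SEC_e) per entity over the
# shared feature space F. The classifier interface (predict / partial_fit) is
# an assumption for the sake of the example.
class EntityCentricMembers:
    def __init__(self, classifier_factory):
        self.classifier_factory = classifier_factory  # builds a fresh SEC_e
        self.secs = {}                                 # entity id -> SEC_e

    def process(self, entity_id, features, label):
        # Launch a new SEC_e the first time entity e shows up in the stream.
        if entity_id not in self.secs:
            self.secs[entity_id] = self.classifier_factory()
        sec = self.secs[entity_id]
        prediction = sec.predict(features)  # may be unreliable at first (cold start, cf. ECCE below)
        sec.partial_fit(features, label)    # adapt SEC_e once the true label is revealed
        return prediction
```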
As learning core for each SEC, we use a Multinomial Naive Bayes with “gradual fading” (MNBF), proposed in [19]: this algorithm decays the word counts per class, depending on how long ago a word was last encountered for a given class. For SECe, we perform gradual fading on the sub-stream Te, where the count of a word per class is decayed proportionally to the last time point at which the word appeared in an observation of e for this class. Since for each entity e, SECe is trained only on the sub-stream Te, the conditional word counts per class, faded to different extents, differ among entities.
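The following minimal sketch conveys the gradual-fading idea for the word counts of a Naive Bayes core; it assumes an exponential decay with a factor fading_factor, whereas the exact fading scheme of the MNBF in [19] may differ in its details.

```python
from collections import defaultdict

# Sketch of faded word counts per (label, word) pair, assuming exponential decay.
class FadingCounts:
    def __init__(self, fading_factor=0.95):
        self.fading_factor = fading_factor
        self.counts = defaultdict(float)   # (label, word) -> faded count
        self.last_seen = defaultdict(int)  # (label, word) -> time point of last update

    def update(self, t, label, word, increment=1.0):
        key = (label, word)
        elapsed = t - self.last_seen[key]
        # Decay the stored count according to how long ago the word was last
        # seen for this label, then add the new occurrence.
        self.counts[key] = self.counts[key] * (self.fading_factor ** elapsed) + increment
        self.last_seen[key] = t
```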
The entity-ignorant ensemble member
The second member of our ensemble is a conventional stream classifier that ignores the entity to which each observation belongs. We denote this classifier as “Entity IGnorant Classifier” (EIGC).
The EIGC uses the same feature space F as the SECs. Since it sees all observations of the stream DS, it is initialized as soon as the first observation arrives and can be used for learning and classification thereafter. As learning core of the EIGC, we again use the gradual fading MNB of [19], where the word counts are modified for each arriving observation and the fading of a word refers to the whole stream DS, as opposed to the sub-stream used by each SEC.
Ensemble variants based on weighting
We consider three weighting schemes for the ensemble members, each of them corresponding to an ensemble variant.
Variant 1: the entity-centric-classifier-ensemble
ECCE builds upon the fact that a minimum number x of training observations is necessary before a classifier can deliver reliable predictions. Hence, when the stream starts, ECCE initializes EIGC and one SEC for the entity e of each arriving observation. As soon as the minimum number of observations x has been reached for the SEC of an entity e, this classifier SECe can be used for predictions. Obviously, EIGC is the first learner to start, since it is trained on all observations. Thus, ECCE uses EIGC to deal with the cold-start problem for new entities and for rarely referenced ones. As soon as x observations have been seen for entity e, ECCE switches from EIGC to the dedicated SECe for observations on e. The EIGC is still trained in parallel, so that it continues to benefit from these observations as well.
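The switching rule of ECCE can be summarized by the following sketch; the function and variable names (ecce_predict, sec_counts, min_obs_x) are illustrative and only mirror the description above.

```python
# Sketch of the ECCE rule: use SEC_e once it has seen at least x observations,
# and fall back to the entity-ignorant classifier (EIGC) otherwise.
def ecce_predict(entity_id, features, secs, sec_counts, eigc, min_obs_x):
    if sec_counts.get(entity_id, 0) >= min_obs_x and entity_id in secs:
        return secs[entity_id].predict(features)
    return eigc.predict(features)  # cold start: new or rarely referenced entity
```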
Variant 2: the entity-centric-weighted-ensemble
ECWE uses EIGC for the observations of some entities, even after the cold-start is over. In particular, ECWE assigns a weight w to the SECs of the ensemble and 1 − w to EIGC. These two weights are applied to the votes of the ensemble members for the label of each arriving observation.
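A possible realization of this fixed-weight vote is sketched below, assuming that both members expose per-label scores (e.g., class probabilities); the function name ecwe_vote and the score dictionaries are illustrative.

```python
# Sketch of the ECWE combination: weight w for the entity-centric member (SEC_e)
# and 1 - w for the entity-ignorant member (EIGC).
def ecwe_vote(sec_scores, eigc_scores, w, labels):
    combined = {label: w * sec_scores.get(label, 0.0)
                       + (1.0 - w) * eigc_scores.get(label, 0.0)
                for label in labels}
    return max(combined, key=combined.get)  # predict the label with the highest combined score
```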
Variant 3: the entity-RMSE-weighted-ensemble
ERWE replaces the fixed weights used by ECWE with weights derived from the classification error of each ensemble member, thus assigning a higher voting weight to the member with the lower error. In particular, for each arriving observation o, let e be the entity to which o belongs. We define the weight assigned to SECe as
$$ wSEC(e)=\frac{RMSE(EIGC_{e})}{RMSE(EIGC_{e})+RMSE(SEC_{e})} $$
The weight assigned to EIGC for that entity e is
$$ wEIGC(e)=\frac{RMSE(SEC_{e})}{RMSE(EIGC_{e})+RMSE(SEC_{e})} $$
where we use the root mean square error (RMSE) as misclassification error, assuming ordinal labels. If the labels have no internal order, as would be the case for positive and negative labels only, the misclassification error can be used instead of RMSE.
Note that the weight of the EIGC vote for an observation depends on the entity to which this observation belongs. This allows the ensemble variant ERWE to assign higher weights to EIGC for entities on which the SEC does not (yet) perform well, while giving preference to the SECs as soon as they show superior performance. Furthermore, this method is parameter-free, in contrast to ECWE, where the weights have to be chosen in advance.
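The two formulas above translate directly into the following sketch; the bookkeeping of a running RMSE per entity for each member is assumed to happen elsewhere.

```python
# Sketch of the per-entity ERWE weights; the member with the lower error
# receives the higher voting weight.
def erwe_weights(rmse_eigc_e, rmse_sec_e):
    denom = rmse_eigc_e + rmse_sec_e
    if denom == 0.0:
        return 0.5, 0.5            # both members error-free so far: weight them equally
    w_sec = rmse_eigc_e / denom    # wSEC(e)
    w_eigc = rmse_sec_e / denom    # wEIGC(e)
    return w_sec, w_eigc
```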
Memory reduction
In our work, we investigate two methods for reducing the memory requirements of entity-centric learning. The first method uses the lossy counting algorithm [14] to determine whether an entity-centric model should be kept in primary memory or stored in secondary memory for future retrieval. The second approach replaces complex entity-centric models with much simpler models that rely only on the label, not on the text, of an observation. These simple models have a much smaller memory footprint than the MNBFs [19] that we used in [6].
Entity management with lossy counting
In our case, we do not want to track frequent itemsets but the fraction of observations in the data stream that refer to a specific entity; the elements e are therefore entity identifiers, i.e., the IDs of the Amazon products to which the incoming reviews refer. At the end of a bucket, we collect the IDs of all entities that would be deleted from D, save their models in secondary memory, and delete them from primary memory. We keep in primary memory the models of all entities that remain in D. Models that have been saved to disk can later be retrieved and put back into primary memory if future observations belonging to the respective entity arrive in the data stream. To realize this, we need a second data structure L, which simply tracks whether we have seen an entity before. If an entity is in L but not in D, we know that we have to retrieve its model from disk. We can ignore the user parameter s, as we are not interested in reporting frequent itemsets or frequent entities.
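The following sketch shows how the lossy counting bookkeeping could be adapted to this entity/model management; it assumes the bucket formulation of [14] with error parameter epsilon, and the callbacks save_model and load_model (moving a SEC to and from secondary memory) are placeholders rather than part of the original algorithm.

```python
import math

# Sketch of lossy-counting-based model management: D holds the (frequency, delta)
# entries of entities whose models stay in primary memory, L remembers every
# entity ever seen so that evicted models can be retrieved from disk.
class LossyEntityManager:
    def __init__(self, epsilon, save_model, load_model):
        self.bucket_width = math.ceil(1.0 / epsilon)
        self.current_bucket = 1
        self.D = {}                   # entity id -> (frequency, delta)
        self.L = set()                # all entity ids seen so far
        self.n = 0                    # number of observations processed
        self.save_model = save_model  # callback: move a model to secondary memory
        self.load_model = load_model  # callback: bring a model back to primary memory

    def observe(self, entity_id):
        self.n += 1
        if entity_id in self.D:
            freq, delta = self.D[entity_id]
            self.D[entity_id] = (freq + 1, delta)
        else:
            if entity_id in self.L:
                self.load_model(entity_id)        # entity seen before: fetch its model from disk
            self.L.add(entity_id)
            self.D[entity_id] = (1, self.current_bucket - 1)
        if self.n % self.bucket_width == 0:       # end of the current bucket
            self._prune()
            self.current_bucket += 1

    def _prune(self):
        for entity_id, (freq, delta) in list(self.D.items()):
            if freq + delta <= self.current_bucket:
                self.save_model(entity_id)        # evict this entity's model to secondary memory
                del self.D[entity_id]
```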
Replacing entity-centric models with text-ignorant models
Our second approach for reducing the memory footprint was inspired by our earlier work on entity-centric models [5], which rely only on the label of an observation and ignore all other features, in this case the review texts. We showed that using only the labels of an entity can yield better predictions for some entities than the entity-ignorant MNBF, but the entity-ignorant model is still the best overall. In our follow-up work [6], we showed that we can improve the entity-ignorant MNBF by combining it with entity-centric MNBFs. In this work, we combine these two findings and enrich the entity-ignorant MNBF with entity-centric models that only use labels, as these have a much smaller memory footprint than an MNBF. The label-only model that was most successful in [5] predicts the most frequent label of an entity seen so far; we therefore use this model in this study as well and call it the majority label of an entity.
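A sketch of this majority-label model is given below; it maintains only a label counter per entity and therefore needs no feature space at all.

```python
from collections import Counter

# Sketch of the text-ignorant majority-label model of an entity: it predicts the
# most frequent label seen so far and ignores the review text entirely.
class MajorityLabelModel:
    def __init__(self):
        self.label_counts = Counter()

    def update(self, label):
        self.label_counts[label] += 1

    def predict(self):
        if not self.label_counts:
            return None  # cold start: no label observed yet for this entity
        return self.label_counts.most_common(1)[0][0]
```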