Unsupervised Ensemble of Ranking Models for News Comments Using Pseudo Answers

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12036)


Ranking comments on an online news service is a practically important task, and thus there have been many studies on it. Although ensemble techniques are widely known to improve the performance of models, there is little research on ensembles of neural ranking models. In this paper, we investigate how to improve the performance on the comment-ranking task by using unsupervised ensemble methods. We propose a new hybrid method composed of an output selection method and a typical averaging method. Our method uses a pseudo answer represented by the average of multiple model outputs. The pseudo answer is used to evaluate multiple model outputs via ranking evaluation metrics, and the results are used to select and weight the models. Experimental results on the comment-ranking task show that our proposed method outperforms several ensemble baselines, including a supervised one.

1 Introduction

User comments on online news services can be regarded as useful content, since users can read other users’ opinions related to each news article. Many online news service sites rank comments by the amount of positive user feedback each comment receives, such as “Like”-button clicks, and preferentially display popular comments to readers. However, this type of user feedback is not suitable for assessing comment quality, because it is biased by where a comment appears [7]: earlier comments tend to receive more feedback since they are displayed at the top of the page. In an attempt to solve this problem, several studies have introduced specific aspects of comment quality to focus on, e.g., constructiveness [7, 13] or persuasiveness [22]. In particular, Fujita et al. [7] proposed a new dataset to rank comments directly according to comment quality. This is a difficult task because there are many different grounds for judging whether a comment is good. For example, comments can indicate rare user experiences, provide new ideas, or cause discussions. Ranking models often fail to capture such information.

According to recent studies [2, 12, 15], ensemble techniques are widely known to improve the accuracy of machine learning models. These ensemble techniques can be roughly divided into two types: averaging and selecting. Averaging methods such as Naftaly et al. [17] simply average multiple model outputs. Selecting methods such as majority vote [15] select the most frequent label from the predicted labels of multiple classifiers in post-processing. These methods allow models to compensate for one another’s mistakes and thereby improve the results. Recently, Kobayashi [12] proposed an unsupervised ensemble method, post-ensemble, based on kernel density estimation, which is an extension of the majority vote to text generation models. He showed that this method outperformed averaging methods in a text summarization task.

In this paper, we propose a new unsupervised ensemble method, HPA, which is a hybrid of an output selection method and a typical averaging method. In typical averaging methods, a lower-accuracy model may merely add noise. A simple denoising method is to statically remove such lower-accuracy models [19]. However, there is basically no model that fails for every input, particularly among neural models with the same architecture. In general, each model has its own strengths and weaknesses. Therefore, our method adopts dynamic denoising of outputs via a provisional averaging result, which we use as a pseudo answer. Each predicted ranking is compared to the pseudo answer via a similarity function, and the similarity scores are used for selecting and weighting models. We adopt evaluation metrics as the similarity function to specialize in the ranking task. In experiments on a task of ranking constructive news comments, our proposed method HPA outperformed both previous unsupervised ensemble methods and a simple supervised ensemble method. Furthermore, we found that one of the evaluation metrics is useful as a similarity measure for the ensemble process.

2 Proposed Method

2.1 Problem Statement

Comment Ranking Task: Let an article be associated with comments \(C = (c_{1}, ..., c_{n})\). The comments have manually annotated scores \(S = (s_{1}, ..., s_{n})\), such as the degree of comment quality. A ranking model m learns a scoring function \(\tilde{s}_i = m(c_i)\). We regard a predicted score sequence \(r = (\tilde{s}_1, ..., \tilde{s}_n)\) as a ranking of the comments, because sorting the comments by these scores yields a ranked comment sequence.

Ensemble Problem: We prepare N rankings \(R = (r_1, ..., r_N)\) from ranking models \(M = (m_1 ,..., m_N)\). The goal of the ensemble is to combine the ranking models to produce a better ranking than any of the individual ranking models. A simple averaging method calculates the average of the comment scores as \(r^*=\sum _{r \in R} \frac{r}{|R|}\).
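
As a minimal sketch of the simple averaging ensemble above (not the authors' implementation), each ranking can be held as a list of per-comment scores and averaged element-wise. The function names `score_avg` and `ranked_indices` and the toy scores are hypothetical.

```python
def score_avg(rankings):
    """Average N score sequences element-wise: r* = sum_r r / |R|."""
    n_models = len(rankings)
    return [sum(scores) / n_models for scores in zip(*rankings)]


def ranked_indices(scores):
    """Comment indices ordered from the highest to the lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy example: two models scoring the same three comments.
R = [[0.2, 0.9, 0.4],
     [0.6, 0.5, 0.1]]
r_star = score_avg(R)
print(ranked_indices(r_star))
```

Sorting by the averaged scores gives the ensemble's ranked comment sequence; the individual score scales are not adjusted here, which is the motivation for the normalized variants discussed later.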

2.2 Post-ensemble

We introduce PostNDCG, which applies the post-ensemble method [12] to the ranking task. Post-ensemble is an unsupervised ensemble method based on kernel density estimation for sequence generation. This method compares the model outputs with one another and selects the majority-like output, i.e., the one that is most similar to the other outputs. This selection is equivalent to selecting the output whose estimated density is the highest among the outputs. PostNDCG calculates the scoring function \( f(r) = \frac{1}{|R|}\sum _{r^{\prime }\in R}sim(r, r^{\prime }), \) where \(sim(r, r')\) represents the similarity between r and \(r'\). The final ranking of PostNDCG is defined as \(r^* = \text {argmax}_{r \in R}\,f(r)\). We used the normalized discounted cumulative gain (NDCG@\(k\)) [1] as the similarity function \(sim(\cdot , \cdot )\) to compare rankings.
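
The selection rule above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the text does not say which argument of \(sim(r, r')\) supplies the gains, so the sketch assumes that \(r'\) acts as the reference (its scores are the gains) and that r supplies the ordering, and it assumes non-negative scores. The names `ndcg_sim` and `post_ndcg` are hypothetical.

```python
import math


def ndcg_sim(reference, candidate, k):
    """NDCG@k of candidate's ordering, with reference scores used as gains.

    Assumption: scores are non-negative so that the gains are meaningful.
    """
    def order_by(scores):
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    def dcg(order):
        return sum(reference[j] / math.log2(i + 1)
                   for i, j in enumerate(order[:k], start=1))

    ideal = dcg(order_by(reference))
    return dcg(order_by(candidate)) / ideal if ideal > 0 else 0.0


def post_ndcg(rankings, k):
    """Index of the ranking with the highest mean similarity f(r)."""
    def f(r):
        return sum(ndcg_sim(other, r, k) for other in rankings) / len(rankings)

    scores = [f(r) for r in rankings]
    return max(range(len(rankings)), key=lambda i: scores[i])


# Toy example: two models roughly agree, the third reverses the order.
R = [[3, 2, 1, 0], [3, 2, 0, 1], [0, 1, 2, 3]]
print(post_ndcg(R, k=2))
```

Because the sum runs over all of R, each ranking is also compared with itself, which follows the formula for f(r) as written.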
Fig. 1. Example of HPA.

2.3 HPA Ensemble

We propose a Hybrid method using the Pseudo Answer (HPA). Figure 1 illustrates an example of HPA. Here, HPA selects the top three rankings \(\{r_2, r_3, r_5\}\) that are nearest to the pseudo answer. After that, it weights each selected ranking via a scoring function based on the pseudo answer. The concept of HPA is to denoise outputs via a pseudo answer \(\bar{r}\), which is represented by the average of the model outputs after L2 normalization: \( \bar{r} = \frac{1}{|R|}\sum _{r\in R}\frac{r}{||r||}. \) The scoring function g is calculated as the similarity between the pseudo answer and the predicted ranking: \( g(r) = sim(\bar{r}, r). \) Then, HPA selects the top k models with the highest scores. The final ranking \(r^*\) is computed as \( r^* = \sum _{r\in \bar{R}}g(r) \cdot r, \) where \(\bar{R}\) is the set of selected models (rankings).
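
The three steps of HPA (pseudo answer, scoring, selection and weighting) can be sketched as below. This is a hedged illustration, not the authors' implementation. It assumes NDCG@\(k\) as the similarity with the pseudo answer supplying the gains, assumes non-negative scores, and separates the two roles of k into the hypothetical parameters `top_models` (number of selected models) and `ndcg_k` (cutoff of the metric). The helper names are also hypothetical.

```python
import math


def l2_normalize(r):
    """Scale a score sequence to unit L2 norm: r / ||r||."""
    norm = math.sqrt(sum(x * x for x in r))
    return [x / norm for x in r] if norm > 0 else list(r)


def ndcg_sim(reference, candidate, k):
    """NDCG@k of candidate's ordering, with reference scores used as gains."""
    def order_by(scores):
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    def dcg(order):
        return sum(reference[j] / math.log2(i + 1)
                   for i, j in enumerate(order[:k], start=1))

    ideal = dcg(order_by(reference))
    return dcg(order_by(candidate)) / ideal if ideal > 0 else 0.0


def hpa(rankings, top_models, ndcg_k):
    """Select the models nearest to the pseudo answer, then weight them by g(r)."""
    n = len(rankings)
    # Pseudo answer: mean of the L2-normalized score sequences.
    pseudo = [sum(col) / n for col in zip(*(l2_normalize(r) for r in rankings))]
    # g(r) = sim(pseudo answer, r) for every model output.
    g = [ndcg_sim(pseudo, r, ndcg_k) for r in rankings]
    # Keep the top_models outputs with the highest g(r).
    selected = sorted(range(n), key=lambda i: g[i], reverse=True)[:top_models]
    # Final ranking: r* = sum over the selected rankings of g(r) * r.
    final = [sum(g[i] * rankings[i][j] for i in selected)
             for j in range(len(pseudo))]
    return final, selected


R = [[3, 2, 1, 0], [3, 2, 0, 1], [0, 1, 2, 3]]
final, selected = hpa(R, top_models=2, ndcg_k=2)
print(selected, final)
```

In this toy input the reversed ranking is the farthest from the pseudo answer and is dropped, so the final scores are built only from the two rankings that agree on the top comments.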

3 Experiments

3.1 Experimental Settings

Dataset: We used a dataset for ranking constructive comments on Japanese articles in Yahoo! News, which was prepared by Fujita et al. [7]. The dataset consists of triplets of an article title, a comment, and a constructiveness score. The constructiveness score (C-score) is defined as the number of crowdsourced workers, out of 40, who judged a comment to be constructive. Therefore, the C-score is an integer ranging from 0 to 40. In this research, 130,000 comments from 1,300 articles were used as training data, 11,300 comments from 113 articles were used as validation data, and 42,436 comments from 200 articles were used as test data. In the training and validation data, 100 comments were randomly extracted from each article, whereas in the test data, all the comments were extracted, assuming an actual service environment.

Preprocessing: We used the morphological analyzer MeCab [14] with a neologism dictionary, NEologd [20], for splitting Japanese texts into words. We replaced numbers with a special token and standardized the letter types by converting halfwidth characters to fullwidth. We did not remove stop-words because function words affect the performance in our task. We cut off low-frequency words that appeared three times or fewer in each dataset.

Model and Training: We used RankNet [1], a well-known pairwise ranking algorithm based on neural networks. Given a pair of comments \(c_1\) and \(c_2\) on an article q, RankNet solves a binary classification problem of whether or not \(c_1\) has a higher score than \(c_2\). The score indicates whether the comment is of high quality. We adopted the encoder-scorer structure for RankNet. The encoder consisted of two long short-term memory (LSTM) instances with 300 units to separately encode a comment and its title. The scorer predicted the ranking score of the comment via a fully-connected layer after concatenating the two encoded (comment and title) vectors. We used pre-trained word representations as the encoder input. They were obtained from a skip-gram model [16] trained with 1.5 million unlabeled news comments. We used the Adam optimizer (\(\alpha =0.0001\), \(\beta _1=0.9\), \(\beta _2=0.999\), \(\epsilon =1 \times 10^{-8}\)) to train these models. The hidden states of both the title encoder and the comment encoder had 300 dimensions. In the experiments, we trained 100 different models with different random initializations for the ensemble methods.
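
To make the pairwise objective concrete, the following sketch computes the standard RankNet cross-entropy loss for a single comment pair from two model scores, following the formulation in [1] with the sigmoid scale fixed to 1. It is an illustration only: the encoder and scorer are omitted, and the function name `ranknet_pair_loss` is hypothetical.

```python
import math


def ranknet_pair_loss(s1, s2, c1_is_better):
    """Cross-entropy loss for one pair, given the scores s1 = m(c1), s2 = m(c2).

    The modeled probability that c1 ranks above c2 is sigmoid(s1 - s2).
    """
    # Signed score gap in the direction of the correct preference.
    z = (s1 - s2) if c1_is_better else (s2 - s1)
    # -log(sigmoid(z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# The loss is small when the scores agree with the label, large otherwise.
print(ranknet_pair_loss(2.0, 0.0, True))
print(ranknet_pair_loss(0.0, 2.0, True))
```

Training pairs would be formed from comments of the same article whose C-scores differ; how pairs were sampled is not stated in the text, so that step is left out.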

Evaluation: We used the normalized discounted cumulative gain (NDCG@\(k\)) [1]. NDCG@\(k\) is calculated over the top-k comments ranked by the ranking model and is defined as \(\text {NDCG}@k = Z_{k}\sum ^{k}_{i=1}\frac{\text {score}_{i}}{\log _2{(i+1)}}\), where \(\text {score}_i\) represents the true ranking score of the i-th comment ranked by the model, and \(Z_k\) is the normalization constant that scales the value between 0 and 1. In addition to NDCG@\(k\), we used Precision@\(k\) as a second evaluation metric. Precision@\(k\) is defined as the ratio of the comments in the inferred top-k that are also included in the true top-k. In the experiment, we evaluated the cases of \(k\in \{1, 5, 10 \}\). Note that a well-known paper [10] in the information retrieval field determined NDCG to be more appropriate than Precision@\(k\) for graded-score settings like ours.
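
The two metrics can be sketched as follows, as a minimal illustration of the definitions above. \(Z_k\) is taken to be the reciprocal of the DCG obtained when the comments are sorted by their true scores, which is the usual choice; the handling of tied scores (stable ordering) is an assumption, and the function names are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Sum of gain_i / log2(i + 1) over the first k positions (i starts at 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def top_k(scores, k):
    """Indices of the k highest-scoring comments, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def ndcg_at_k(true_scores, pred_scores, k):
    """NDCG@k of the ordering induced by pred_scores, judged by true_scores."""
    dcg = dcg_at_k([true_scores[i] for i in top_k(pred_scores, k)], k)
    ideal = dcg_at_k(sorted(true_scores, reverse=True), k)  # 1 / Z_k
    return dcg / ideal if ideal > 0 else 0.0


def precision_at_k(true_scores, pred_scores, k):
    """Fraction of the predicted top-k comments that are in the true top-k."""
    return len(set(top_k(true_scores, k)) & set(top_k(pred_scores, k))) / k


c_scores = [30, 10, 20, 0]        # toy C-scores for four comments
predicted = [0.9, 0.8, 0.1, 0.2]  # toy model scores
print(ndcg_at_k(c_scores, predicted, k=2))
print(precision_at_k(c_scores, predicted, k=2))
```

In this toy case the model places the best comment first but swaps the second, so Precision@2 counts one hit out of two while NDCG@2 still gives partial credit for the graded score of the misplaced comment.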

3.2 Compared Methods

Ensemble Baselines: We prepared the following methods as baselines. RankSVM and RankNet are single-model baselines. ScoreAvg, RankAvg, TopkAvg, and NormAvg are commonly used ensemble methods that combine multiple models in post-processing without training. SupWeight is a popular supervised ensemble method based on weighting. PostNDCG is the post-ensemble method applied to ranking (Sect. 2.2).

  • RankSVM: The best single RankSVM model proposed in Fujita et al. [7].

  • RankNet: The best single RankNet model in 100 models for ensemble.

  • ScoreAvg: Average output scores of the models for each comment.

  • RankAvg: Average rank orders of each comment.

  • TopkAvg: Select comments with higher scores than a threshold from each ranking and average their scores [5].

  • NormAvg: Average normalized output scores of the model outputs, as typified by [2]. We applied L2 normalization to each ranking as \(r^{\prime } = r /||r||\).

  • SupWeight: Average weighted scores of the model outputs [19]. Scores are weighted on the basis of NDCG@\(k\) on the validation dataset. Note that their weights are constant values per model.

  • PostNDCG: Select the best single model output per article, as introduced in Sect. 2.2.
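
Two of the training-free baselines in the list above, RankAvg and NormAvg, can be sketched as follows. This is a hedged illustration with hypothetical function names. For RankAvg, the averaged rank positions are negated so that a higher value still means a better comment; that sign convention is an assumption of the sketch.

```python
import math


def rank_positions(scores):
    """Rank position of each comment (1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positions = [0] * len(scores)
    for pos, idx in enumerate(order, start=1):
        positions[idx] = pos
    return positions


def rank_avg(rankings):
    """Average the rank positions; negated so that higher still means better."""
    n = len(rankings)
    return [-sum(col) / n for col in zip(*(rank_positions(r) for r in rankings))]


def norm_avg(rankings):
    """Average the L2-normalized score sequences, r' = r / ||r||."""
    def normalize(r):
        norm = math.sqrt(sum(x * x for x in r))
        return [x / norm for x in r] if norm > 0 else list(r)

    n = len(rankings)
    return [sum(col) / n for col in zip(*(normalize(r) for r in rankings))]


# Two models with very different score scales.
R = [[10, 5, 1], [0.1, 0.3, 0.2]]
print(rank_avg(R))
print(norm_avg(R))
```

With these toy scores a plain score average would be dominated by the first model, whereas both variants here let the second model's preference for the middle comment count, which illustrates why scale handling matters when averaging.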

Our Methods: Our proposed methods are as follows:

  • HPA: Hybrid of the output selection method and a typical averaging method, as proposed in Sect. 2.3. We set \(k=50\), which obtained the highest accuracy among \(k \in \{5n \mid n = 1,...,20\}\) on the validation dataset.

  • SPA: Select models using the pseudo answer and average them (equal to HPA without the weighting). We set \(k=50\), which is the same setting as HPA.

  • WPA: Average weighted model outputs using the pseudo answer (equal to HPA without the selecting).

Table 1. NDCG@\(k\) and Precision@\(k\) scores (%) on the comment-ranking task (\(k \in \{1, 5, 10\}\)).

3.3 Experimental Results

Our experimental results are shown in Table 1. All ensemble methods performed better than a single model. In particular, the proposed method HPA achieved the highest NDCG@\(k\). PostNDCG achieved higher accuracy than RankNet, which implies that calculating the similarity between models using evaluation metrics for each article is effective. However, PostNDCG was less accurate than common averaging ensemble methods such as NormAvg. Since the models were originally trained by a relative comparison of rankings, preserving the diversity of models appears to be more effective for improving performance than selecting a single high-confidence model with PostNDCG. The unsupervised method HPA outperformed the supervised method SupWeight. This suggests that, in this setting, it is better to determine the important models from the similarity between the predicted rankings than to learn their importance in advance from labeled data.
Table 2. Comparison of similarity functions for HPA.

Furthermore, we verified the effectiveness of NDCG@\(k\) as the similarity function in HPA by comparing it with other similarity functions: Precision@\(k\), cosine similarity (cos), the Kendall rank correlation coefficient [11], and the Spearman rank correlation coefficient [21]. Table 2 shows the results of HPA when the similarity function is changed. The NDCG@\(k\) functions outperformed the other similarity functions. Furthermore, Precision@\(k\) performed better than cos. Note that Precision@\(k\) equals top-k cosine similarity. This indicates that top-k focused measures, i.e., evaluation metrics, are useful for the ensemble.
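
For reference, the whole-sequence similarity functions compared above can be sketched as follows. These are textbook definitions written for illustration, not the code used in the experiments: the Kendall coefficient is the tau-a variant (ties count as neither concordant nor discordant), and the Spearman coefficient uses the closed form that assumes no tied scores. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two score sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(a)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            total += (prod > 0) - (prod < 0)
    return total / (n * (n - 1) / 2)


def spearman_rho(a, b):
    """Spearman coefficient via 1 - 6 * sum(d^2) / (n (n^2 - 1)); no ties assumed."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        out = [0] * len(scores)
        for pos, idx in enumerate(order, start=1):
            out[idx] = pos
        return out

    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


a, b = [3, 2, 1, 0], [0, 1, 2, 3]
print(cosine(a, b), kendall_tau(a, b), spearman_rho(a, b))
```

Unlike NDCG@\(k\) and Precision@\(k\), all three functions weigh every position of the ranking equally, which is the contrast the comparison in Table 2 draws.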

4 Related Work

Analyzing comments on online forums, including news comments, has been widely studied in recent years. This line of research has included many studies on ranking comments according to user feedback [6, 9, 22]. On the other hand, there has also been much research on analyzing news comments in terms of “constructiveness” [7, 13, 18]. The most closely related research is that of Fujita et al. [7]. They ranked comments by using the C-score to evaluate the quality, instead of relying on user feedback. They created a news comment ranking dataset and improved the model performance from the viewpoint of the dataset structure. In our research, we further improve the performance from the viewpoint of the model structure.

Among ensemble methods for ranking tasks, there are methods that average model outputs [2, 5], as mentioned in Sect. 3.2. Our method extends those methods by denoising through the relationships between predicted rankings. There is also research on learning query-dependent weights with semi-supervised ensemble learning in an information retrieval task [8]. This method focused on selecting documents that are highly relevant to a query (article). It is effective for information retrieval tasks but not for the task of ranking news comments, because almost all such comments are associated with a news article.

There are also approaches that improve the ranking model with respect to evaluation metrics such as NDCG@\(k\), namely LambdaRank [3] and LambdaMART [4]. These methods handle model training by calculating NDCG@\(k\) between a gold ranking and a predicted one, which means that NDCG@\(k\) is not used in inference. This fundamentally differs from our method, which calculates NDCG@\(k\) between predicted rankings during inference.

5 Conclusion and Future Work

We proposed an unsupervised ensemble method that is a hybrid of an output selection method and a typical averaging method. Our experiments showed that comparing predicted rankings using evaluation metrics is effective for selecting and weighting models. For future work, we would like to compare the proposed method with supervised ensemble methods in terms of performance and speed. We also plan to combine various types of networks instead of using the same network structure.



References

  1. Burges, C., et al.: Learning to rank using gradient descent. In: Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pp. 89–96. ACM (2005).
  2. Burges, C., Svore, K., Bennett, P., Pastusiak, A., Wu, Q.: Learning to rank using an ensemble of lambda-gradient models. In: Proceedings of the Learning to Rank Challenge, pp. 25–35. PMLR (2011).
  3. Burges, C.J., Ragno, R., Le, Q.V.: Learning to rank with nonsmooth cost functions. In: Advances in Neural Information Processing Systems 19 (NIPS 2007), pp. 193–200 (2007).
  4. Burges, C.J.: From RankNet to LambdaRank to LambdaMART: an overview. Learning 11(23–581), 81 (2010).
  5. Cormack, G.V., Clarke, C.L., Buettcher, S.: Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pp. 758–759. ACM (2009).
  6. Das Sarma, A., Das Sarma, A., Gollapudi, S., Panigrahy, R.: Ranking mechanisms in Twitter-like forums. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pp. 21–30. ACM (2010).
  7. Fujita, S., Kobayashi, H., Okumura, M.: Dataset creation for ranking constructive news comments. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pp. 2619–2626. Association for Computational Linguistics (2019).
  8. Hoi, S.C., Jin, R.: Semi-supervised ensemble ranking. In: Proceedings of the 23rd National Conference on Artificial Intelligence-Volume 2 (AAAI 2008), pp. 634–639. AAAI Press (2008).
  9. Hsu, C.F., Khabiri, E., Caverlee, J.: Ranking comments on the social web. In: Proceedings of the 2009 International Conference on Computational Science and Engineering (CSE 2009), vol. 4, pp. 90–97. IEEE (2009).
  10. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inform. Syst. (TOIS) 20(4), 422–446 (2002).
  11. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93 (1938).
  12. Kobayashi, H.: Frustratingly easy model ensemble for abstractive summarization. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pp. 4165–4176. Association for Computational Linguistics (2018).
  13. Kolhatkar, V., Taboada, M.: Constructive language in news comments. In: Proceedings of the First Workshop on Abusive Language Online, pp. 11–17. Association for Computational Linguistics (2017).
  14. Kudo, T., Yamamoto, K., Matsumoto, Y.: Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 230–237. Association for Computational Linguistics (2004).
  15. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Inform. Comput. 108(2), 212–261 (1994).
  16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26 (NIPS 2013), pp. 3111–3119 (2013).
  17. Naftaly, U., Intrator, N., Horn, D.: Optimal ensemble averaging of neural networks. Netw.: Comput. Neural Syst. 8(3), 283–296 (1997).
  18. Napoles, C., Pappu, A., Tetreault, J.R.: Automatically identifying good conversations online (yes, they do exist!). In: Proceedings of the Eleventh International AAAI Conference on Web and Social Media (ICWSM 2017), pp. 628–631. AAAI Press (2017).
  19. Opitz, D.W., Shavlik, J.W.: Actively searching for an effective neural network ensemble. Conn. Sci. 8(3–4), 337–354 (1996).
  20. Sato, T., Hashimoto, T., Okumura, M.: Implementation of a word segmentation dictionary called mecab-ipadic-NEologd and study on how to use it effectively for information retrieval (in Japanese). In: Proceedings of the Twenty-three Annual Meeting of the Association for Natural Language Processing, pp. NLP2017-B6-1. The Association for Natural Language Processing (2017).
  21. Spearman, C.: The proof and measurement of association between two things. Am. J. Psychol. 15(1), 72–101 (1904).
  22. Wei, Z., Liu, Y., Li, Y.: Is this post persuasive? Ranking argumentative comments in online forum. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), vol. 2, pp. 195–200. Association for Computational Linguistics (2016).

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Tokyo Institute of Technology, Kanagawa, Japan
  2. Yahoo Japan Corporation/RIKEN AIP, Tokyo, Japan
