Automated Assessment of the Quality of Peer Reviews using Natural Language Processing Techniques

  • Lakshmi Ramachandran
  • Edward F. Gehringer
  • Ravi K. Yadav


A review is textual feedback provided by a reviewer to the author of a submitted work. Peer reviews are used in academic publishing and in education to assess student work. While reviews are also important to e-commerce sites such as Amazon and eBay, which use them to assess the quality of products and services, our work focuses on academic reviewing. We seek to help reviewers improve the quality of their reviews. One way to measure review quality is through a metareview, that is, a review of reviews. We have developed automated metareview software that provides rapid feedback to reviewers on their assessments of authors' submissions. To measure review quality, we employ metrics such as review content type, relevance, coverage of the submission, tone, volume, and plagiarism (from the submission or from other reviews). We use natural language processing and machine-learning techniques to calculate these metrics. We summarize results from experiments evaluating our review-quality metrics (content, relevance, and coverage) and a study analyzing user perceptions of the importance and usefulness of these metrics. Our approaches were evaluated on data from Expertiza and the Scaffolded Writing and Rewriting in the Discipline (SWoRD) project, two collaborative web-based learning applications.
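The paper's actual metrics are computed with NLP and machine-learning techniques; as a rough illustration only (not the authors' method), two of the simpler metrics named above, review volume and review relevance, might be approximated with a token count and a term-frequency cosine similarity between the review and the submission:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def volume(review):
    """Review volume: a simple proxy is the review's token count."""
    return len(tokenize(review))

def tf_cosine(review, submission):
    """Naive relevance proxy: cosine similarity between raw
    term-frequency vectors of the review and the submission."""
    a, b = Counter(tokenize(review)), Counter(tokenize(submission))
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

This bag-of-words sketch ignores word order and meaning; the paper instead uses graph-based representations of review and submission text precisely because such surface overlap is a weak signal of true relevance.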


Keywords: Intelligent tutoring systems · Collaborative learning · Peer reviews



We would like to thank Da Young Lee for helping us review an early draft of the paper.



Copyright information

© International Artificial Intelligence in Education Society 2017

Authors and Affiliations

  • Lakshmi Ramachandran, Pearson, Boulder, USA
  • Edward F. Gehringer, North Carolina State University, Raleigh, USA
  • Ravi K. Yadav, North Carolina State University, Raleigh, USA
