Automated Assessment of the Quality of Peer Reviews using Natural Language Processing Techniques
A review is textual feedback provided by a reviewer to the author of a submission. Peer reviews are used in academic publishing and in education to assess student work. While reviews are also important to e-commerce sites like Amazon and eBay, which use them to assess the quality of products and services, our work focuses on academic reviewing. We seek to help reviewers improve the quality of their reviews. One way to measure review quality is through a metareview, or review of reviews. We have developed automated metareview software that provides rapid feedback to reviewers on their assessment of authors’ submissions. To measure review quality, we employ metrics such as review content type, review relevance, a review’s coverage of a submission, review tone, review volume, and review plagiarism (from the submission or from other reviews). We use natural language processing and machine-learning techniques to calculate these metrics. We summarize results from experiments evaluating our review-quality metrics (review content, relevance, and coverage), and from a study of user perceptions of the importance and usefulness of these metrics. Our approaches were evaluated on data from Expertiza and the Scaffolded Writing and Rewriting in the Discipline (SWoRD) project, two collaborative web-based learning applications.
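To make the metrics concrete, here is a minimal sketch of how three of them (volume, tone, and relevance) could be approximated. This is an illustration only: the word lists are hypothetical, and the paper's actual methods use richer NLP models (e.g., graph-based relevance matching and machine-learned tone classification) rather than these simple proxies.

```python
import math
import re
from collections import Counter

# Hypothetical tone lexicons for illustration; the paper's tone metric
# is not based on a fixed word list.
POSITIVE = {"clear", "good", "strong", "helpful", "thorough"}
NEGATIVE = {"unclear", "missing", "weak", "confusing", "wrong"}

def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def volume(review):
    """Review volume: number of word tokens in the review."""
    return len(tokens(review))

def tone(review):
    """Crude tone score in [-1, 1]: (pos - neg) / (pos + neg)."""
    toks = tokens(review)
    pos = sum(t in POSITIVE for t in toks)
    neg = sum(t in NEGATIVE for t in toks)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def relevance(review, submission):
    """Bag-of-words cosine similarity between review and submission."""
    a, b = Counter(tokens(review)), Counter(tokens(submission))
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

review = "The argument is clear but the related-work section is missing."
submission = "We present an argument with a related-work section."
print(volume(review))
print(tone(review))
print(round(relevance(review, submission), 2))
```

A real metareview system would replace the cosine similarity with a semantic-relatedness measure and the lexicon lookup with a trained classifier, but the interface (one score per metric, per review) is the same.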
Keywords: Intelligent tutoring systems · Collaborative learning · Peer reviews
We would like to thank Da Young Lee for helping us review an early draft of the paper.
- Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., et al. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. In The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. NAACL (pp. 19–27).
- Avis, D., & Imamura, T. (2007). A list heuristic for vertex cover (Vol. 35). Elsevier.
- Bache, K., Newman, D., & Smyth, P. (2013). Text-based measures of document diversity. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD (pp. 23–31). doi: 10.1145/2487575.2487672.
- Barzilay, R., & Elhadad, M. (1997). Using lexical chains for text summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization (pp. 10–17).
- Bohnet, B. (2010). Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics. COLING (pp. 89–97).
- Boonthum, C. (2004). iSTART: paraphrase recognition. In Proceedings of the ACL 2004 Workshop on Student Research. ACLstudent.
- Charikar, M., & Panigrahy, R. (2001). Clustering to minimize the sum of cluster diameters. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing (pp. 1–10).
- Cho, K., & Schunn, C. D. (2007). Scaffolded writing and rewriting in the discipline: A web-based reciprocal peer review system. Computers and Education, 48, 409–426. doi: 10.1016/j.compedu.2005.02.004.
- Cho, K. (2008). Machine classification of peer comments in physics. In Educational Data Mining (pp. 192–196).
- Coursey, K., & Mihalcea, R. (2009). Topic identification using Wikipedia graph centrality. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09 (pp. 117–120).
- Dalvi, N., Kumar, R., Pang, B., & Tomkins, A. (2009). Matching reviews to objects using a language model. EMNLP ’09.
- Echeverría, V., Gomez, J. C., & Moens, M. F. (2013). Automatic labeling of forums using Bloom’s taxonomy. In Advanced Data Mining and Applications (pp. 517–528). Springer.
- Erkan, G., & Radev, D. R. (2004). LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22, 457–479.
- Fellbaum, C. (1998). WordNet: An electronic lexical database. MIT Press.
- Ganesan, K., Zhai, C., & Han, J. (2010). Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics. COLING ’10 (pp. 340–348).
- Gehringer, E. F. (2010). Expertiza: Managing feedback in collaborative learning. In Monitoring and Assessment in Online Collaborative Environments: Emergent Computational Technologies for E-Learning Support (pp. 75–96).
- Haghighi, A. D., Ng, A. Y., & Manning, C. D. (2005). Robust textual inference via graph matching. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. HLT (pp. 387–394).
- Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Springer.
- Kauchak, D., & Barzilay, R. (2006). Paraphrasing for automatic evaluation. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. HLT-NAACL ’06 (pp. 455–462).
- Kuhne, C., Bohm, K., & Yue, J. Z. (2010). Reviewing the reviewers: a study of author perception on peer reviews in computer science. In CollaborateCom (pp. 1–8).
- Lappas, T., Crovella, M., & Terzi, E. (2012). Selecting a characteristic set of reviews. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD (pp. 832–840). doi: 10.1145/2339530.2339663.
- Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation. SIGDOC (pp. 24–26).
- Lim, E. P., Nguyen, V. A., Jindal, N., Liu, B., & Lauw, H. W. (2010). Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. CIKM ’10 (pp. 939–948).
- Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web (pp. 342–351).
- Liu, B. Q., Xu, S., & Wang, B. X. (2009). A combination of rule and supervised learning approach to recognize paraphrases. In Proceedings of the International Conference on Machine Learning and Cybernetics (Vol. 1, pp. 110–115).
- Lu, B., Ott, M., Cardie, C., & Tsou, B. K. (2011). Multi-aspect sentiment analysis with topic models. In IEEE 11th International Conference on Data Mining Workshops (ICDMW), 2011 (pp. 81–88). IEEE.
- Manshadi, M., Gildea, D., & Allen, J. (2013). Plurality, negation, and quantification: Towards comprehensive quantifier scope disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-13).
- Meng, X., Wei, F., Liu, X., Zhou, M., Li, S., et al. (2012). Entity-centric topic-oriented opinion summarization in Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD (pp. 379–387).
- Mihalcea, R. (2004). Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions. ACLdemo.
- Moghaddam, S., Jamali, M., & Ester, M. (2011). Review recommendation: personalized prediction of the quality of online reviews. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. CIKM ’11 (pp. 2249–2252).
- Nelson, M. M., & Schunn, C. D. (2009). The nature of feedback: How different types of peer feedback affect writing performance. Instructional Science, 37, 375–401.
- Nguyen, H. V., & Litman, D. J. (2013). Identifying localization in peer reviews of argument diagrams (pp. 91–100). Berlin, Heidelberg: Springer.
- Nguyen, H. V., & Litman, D. J. (2014). Improving peer feedback prediction: The sentence level is right. In ACL 2014 (p. 99).
- R Development Core Team (2008). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
- Rada, R., Michailidis, A., & Wang, W. (1994). Collaborative hypermedia in a classroom setting. Journal of Educational Multimedia and Hypermedia, 3, 21–36.
- Ramachandran, L., & Gehringer, E. F. (2010). Automating metareviewing. Poster presented at the Workshop on Computer-Supported Peer Review in Education, associated with Intelligent Tutoring Systems.
- Ramachandran, L., & Gehringer, E. F. (2012). A word-order based graph representation for relevance identification [poster]. In CIKM 2012, 21st ACM Conference on Information and Knowledge Management.
- Ramachandran, L., & Gehringer, E. F. (2013a). Graph-structures matching for review relevance identification. In Proceedings of TextGraphs-8: Graph-based Methods for Natural Language Processing (pp. 53–60).
- Ramachandran, L., & Gehringer, E. F. (2013b). A user study on the automated assessment of reviews. In Proceedings of the Workshops at the 16th International Conference on Artificial Intelligence in Education, AIED 2013, Memphis, USA, July 9–13, 2013.
- Ramachandran, L., & Gehringer, E. F. (2013c). An ordered relatedness metric for relevance identification. In 2013 IEEE Seventh International Conference on Semantic Computing (ICSC) (pp. 86–89). doi: 10.1109/ICSC.2013.23.
- Ramachandran, L., & Gehringer, E. F. (2011). Automated assessment of review quality using latent semantic analysis. In ICALT 2011, 11th IEEE International Conference on Advanced Learning Technologies (pp. 136–138).
- Ramachandran, L., & Gehringer, E. F. (2015). Identifying content patterns in peer reviews using graph-based cohesion. In Proceedings of the Florida Artificial Intelligence Research Society Conference.
- van Rooyen, S., Black, N., & Godlee, F. (1999). Development of the review quality instrument (RQI) for assessing peer reviews of manuscripts. Journal of Clinical Epidemiology, 52, 625–629. doi: 10.1016/S0895-4356(99)00047-5.
- Steinbach, M., Karypis, G., Kumar, V., et al. (2000). A comparison of document clustering techniques. In KDD Workshop on Text Mining (Vol. 400, pp. 525–526).
- Titov, I., & McDonald, R. (2008). Modeling online reviews with multi-grain topic models. In Proceedings of the 17th International Conference on World Wide Web. WWW ’08 (pp. 111–120).
- Tognini-Bonelli, E. (2002). Corpus linguistics at work. Computational Linguistics, 28, 583–583.
- Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL 2003 (pp. 252–259).
- Tsatsaronis, G., Varlamis, I., & Nørvåg, K. (2010). SemanticRank: ranking keywords and sentences using semantic graphs. In Proceedings of the 23rd International Conference on Computational Linguistics. COLING (pp. 1074–1082).
- Tubau, S. (2008). Negative concord in English and Romance: Syntax-morphology interface conditions on the expression of negation. Netherlands Graduate School of Linguistics.
- Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.
- UXMatters (2005). User experience definition.
- Wang, C., Yu, X., Li, Y., Zhai, C., & Han, J. (2013). Content coverage maximization on word networks for hierarchical topic summarization. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (pp. 249–258). ACM.
- Wessa, P., & De Rycker, A. (2010). Reviewing peer reviews: a rule-based approach. In International Conference on E-Learning (ICEL) (pp. 408–418).
- Xiong, W., Litman, D. J., & Schunn, C. D. (2010). Assessing reviewer’s performance based on mining problem localization in peer-review data. In EDM (pp. 211–220).
- Yadav, R. K. (2016). Web services for automated assessment of reviews. Master’s thesis, NC State University.
- Yang, X., Ghoting, A., Ruan, Y., & Parthasarathy, S. (2012). A framework for summarizing and analyzing Twitter feeds. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD (pp. 370–378). doi: 10.1145/2339530.2339591.
- Zhai, Z., Liu, B., Xu, H., & Jia, P. (2011). Clustering product features for opinion mining. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (pp. 347–354).
- Zhang, R., & Tran, T. (2010). Review recommendation with graphical model and EM algorithm. In Proceedings of the 19th International Conference on World Wide Web. WWW ’10 (pp. 1219–1220).
- Zhang, J., & Yang, Y. (2003). Robustness of regularized linear classification methods in text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 190–197).