Feature Analysis for Duplicate Detection in Programming QA Communities

  • Wei Emma Zhang
  • Quan Z. Sheng
  • Yanjun Shu
  • Vanh Khuyen Nguyen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10604)


In community question answering (CQA), duplicate questions are questions that have already been asked and answered but are posted again. These questions add noise to CQA websites and make it harder for users to find answers efficiently. Programming CQA (PCQA), a branch of CQA devoted to programming questions, also suffers from this problem. Existing work on duplicate detection in PCQA websites frames the task as supervised learning over question pairs and relies on a number of features extracted from those pairs. However, these approaches extract only textual features and do not consider the source code in the questions, which is linguistically very different from natural language. Our work focuses on developing novel features for PCQA duplicate detection. We leverage continuous word vectors from the deep learning literature, probabilistic models from information retrieval, and association pairs mined from duplicate questions using machine translation. We provide an extensive empirical analysis of the performance of these features and their various combinations using a range of learning models. Our work could be helpful for both research and practical applications that require extracting features from text that is not entirely natural language.


Feature analysis, Question answering, Duplicate detection



Michael Sheng’s work is partially supported by the Australian Research Council (ARC) Future Fellowship FT140101247 and Discovery Project Grant DP140100104. Yanjun Shu’s work is partially supported by China NSF (No. 61202091), the Fundamental Research Funds for the Central Universities (No. NSRIF.2016050) and the State Scholarship Fund of the China Scholarship Council (No. 201606125073).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Wei Emma Zhang (1)
  • Quan Z. Sheng (1)
  • Yanjun Shu (2)
  • Vanh Khuyen Nguyen (1)

  1. Department of Computing, Macquarie University, Sydney, Australia
  2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
