Classifying Forum Questions Using PCA and Machine Learning for Improving Online CQA

  • Simon Fong
  • Yan Zhuang
  • Kexing Liu
  • Shu Zhou
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 545)


As one of the most popular e-Business models, community question answering (CQA) services increasingly gather large amounts of knowledge through the voluntary contributions of online communities across the globe. While most questions in CQA receive an answer from peer users, the number of unanswered or ignored questions has risen sharply in the past few years. Understanding the factors that lead to questions being answered, as well as those that lead to questions being ignored, can help forum users improve the quality of their questions and increase their chances of getting answers. In this study, Principal Component Analysis (PCA) was used as a feature selection method to extract the factors, or components, of the features. Data mining techniques were then used to identify the relevant features that help predict the quality of questions.
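The two-stage approach described above (PCA followed by a learned classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the classifier or the features, so logistic regression, the six feature columns, the number of components, and the synthetic data are all assumptions chosen to make the sketch runnable with scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_question_classifier(n_components=3):
    """Standardize features, project onto principal components, then classify.

    The classifier (logistic regression) is an assumption; the abstract only
    says that "data mining techniques" were applied after PCA.
    """
    return make_pipeline(
        StandardScaler(),                 # PCA is sensitive to feature scale
        PCA(n_components=n_components),   # extract components of the features
        LogisticRegression(max_iter=1000),
    )


# Synthetic stand-in data (hypothetical): 400 questions, 6 correlated features
# generated from 2 hidden factors. Real inputs might be things like title
# length or tag count, but those are not specified in the abstract.
rng = np.random.default_rng(0)
n_questions = 400
latent = rng.normal(size=(n_questions, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(n_questions, 6))

# Hypothetical label: 1 = question was answered, 0 = question was ignored.
y = (latent[:, 0] + 0.5 * latent[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = build_question_classifier(n_components=3)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
explained = model.named_steps["pca"].explained_variance_ratio_

print(f"held-out accuracy: {accuracy:.2f}")
print("explained variance ratio per component:", np.round(explained, 3))
```

Placing the scaler and PCA inside the pipeline means both are fitted on the training split only, which avoids leaking information from the held-out questions into the components.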


Community Question Answering · Principal Component Analysis · Machine Learning · Business Intelligence





Copyright information

© Springer Science+Business Media Singapore 2015

Authors and Affiliations

  1. Department of Computer Information Science, University of Macau, Macau SAR, China
  2. Department of Product Marketing, MOZAT Pte Ltd, Singapore, Singapore
