A Machine Learning Approach to Cluster the Users of Stack Overflow Forum

  • J. Anusha
  • V. Smrithi Rekha
  • P. Bagavathi Sivakumar
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 325)


Online question and answer (Q&A) forums are emerging as excellent learning platforms for learners with varied interests. In this paper, we present our results on clustering Stack Overflow users into four clusters: naive users, surpassing users, experts, and outshiners. The clustering is based on various metrics available on the forum. We use the X-means and expectation maximization clustering algorithms and compare their results, which have been validated using internal, external, and relative validation techniques. The objective of this clustering is to trace and predict the activity of a user on the forum. According to our results, the majority of users (71% of the 40,000 users under consideration) fall in the ‘experts’ category. This suggests that Stack Overflow users are of high quality, which makes the forum an excellent platform for learning about computer programming.
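As a rough illustration of how a BIC-style criterion can choose the number of clusters, which is the idea X-means builds on, the following minimal sketch runs plain k-means for several values of k and keeps the k with the lowest score. It is not the X-means procedure used in the paper (X-means splits centroids recursively). The spherical-Gaussian scoring, the function names `kmeans`, `bic`, and `choose_k`, and the synthetic two-feature points are all assumptions made for illustration; none of the data comes from Stack Overflow.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with a deterministic start (evenly spaced picks)."""
    step = len(points) / k
    centers = [points[int(i * step)] for i in range(k)]
    for _ in range(iters):
        # Assign every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

def bic(points, centers, groups):
    """BIC = n_params * ln(R) - 2 * log-likelihood; lower is better.

    Assumes hard assignments and one shared spherical Gaussian variance.
    """
    r, m, k = len(points), len(points[0]), len(centers)
    sse = sum(math.dist(p, c) ** 2 for c, g in zip(centers, groups) for p in g)
    var = max(sse / (m * (r - k)), 1e-12)  # pooled per-dimension variance
    loglik = sum(len(g) * math.log(len(g) / r) for g in groups if g)
    loglik -= 0.5 * r * m * math.log(2 * math.pi * var)
    loglik -= sse / (2 * var)
    n_params = (k - 1) + k * m + 1  # mixing weights + center coordinates + variance
    return n_params * math.log(r) - 2 * loglik

def choose_k(points, k_max):
    """Score k = 1..k_max and return the k with the lowest BIC, plus all scores."""
    scores = {}
    for k in range(1, k_max + 1):
        centers, groups = kmeans(points, k)
        scores[k] = bic(points, centers, groups)
    return min(scores, key=scores.get), scores

# Synthetic data (hypothetical): four tight groups of five 2-D points each.
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
anchors = [(0, 0), (20, 0), (0, 20), (20, 20)]
points = [(ax + dx, ay + dy) for ax, ay in anchors for dx, dy in offsets]

best_k, scores = choose_k(points, 6)
print(best_k)
```

On this toy data the score is lowest at k = 4, matching the four synthetic groups. Real forum metrics would likely need feature scaling and several random restarts before such a comparison is meaningful.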


Keywords: Clustering; Naive users; Surpassing users; Bayesian information criterion



Copyright information

© Springer India 2015

Authors and Affiliations

  • J. Anusha (1)
  • V. Smrithi Rekha (2)
  • P. Bagavathi Sivakumar (1)
  1. Department of Computer Science and Engineering, Coimbatore Campus, Amrita Vishwa Vidyapeetham, Coimbatore, India
  2. Center for Research in Advanced Technologies for Education, Amritapuri Campus, Kochi, India
