Improving Near-Duplicate Detection in Multi-Layered Collaborative Requirements Engineering Discussions Through Discussion Clustering

  • Christian Sillaber
  • Ruth Breu
Conference paper
Part of the Springer Proceedings in Complexity book series (SPCOM)


Existing methods for finding near-duplicate content often fail when applied to informal user discussions that span multiple messages, such as those found in collaborative requirements discussions. As a result, stakeholders often recreate requirements discussions that already exist on the underlying knowledge-sharing platform instead of contributing to them. In this paper, we therefore identify common reasons for near-duplicate content and develop a new algorithm for detecting near-duplicate content in multilevel requirements discussions. The algorithm is implemented in a large case study of real-world collaborative requirements engineering platforms serving hundreds of thousands of stakeholders. Our preliminary results show that we outperform existing search algorithms and can identify near-duplicates in multilevel requirements discussions with high precision.


Keywords: Discussion Thread, Previous Message, Message Body, Requirement Discussion, Empty Message
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



The research herein is partially conducted within the competence network Softnet Austria and funded by the Austrian Federal Ministry of Economics (bm:wa), the province of Styria, the Steirische Wirtschaftsförderungsgesellschaft mbH (SFG), and the city of Vienna through the Center for Innovation and Technology (ZIT). This work was supported by the project “QE LaB–Living Models for Open Systems (FFG 822740)” and partially funded by the European Commission under the FP7 project “PoSecCo” (IST 257129).



Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. Institute of Computer Science, University of Innsbruck, Innsbruck, Austria
