Improving Near-Duplicate Detection in Multi-Layered Collaborative Requirements Engineering Discussions Through Discussion Clustering

  • Conference paper
The 8th International Conference on Knowledge Management in Organizations

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Abstract

Existing methods for finding near-duplicate content often fail when applied to informal user discussions that span multiple messages, such as those found in collaborative requirements discussions. As a result, stakeholders often start new requirements discussions that duplicate existing ones instead of contributing to them, even though the underlying knowledge sharing platform already contains corresponding entries. In this paper we therefore identify common reasons leading to near-duplicate content and develop a new algorithm for detecting near-duplicate content in multilevel requirements discussions. The algorithm is implemented in a large case study of real-world collaborative requirements engineering platforms serving hundreds of thousands of stakeholders. Our preliminary results show that we outperform existing search algorithms and that we are able to identify near-duplicates in multilevel requirements discussions with high precision.
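
To make the notion of near-duplicate discussions concrete, the following is a minimal sketch of a generic baseline: word shingling combined with Jaccard similarity, applied to a discussion treated as the union of its messages. It is not the algorithm developed in the paper, whose details are not given in this abstract. The function names, the shingle size `k`, and the sample messages are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a lower-cased message."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        # Short messages contribute a single shingle of all their words.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def discussion_similarity(d1: List[str], d2: List[str], k: int = 3) -> float:
    """Compare two discussions, each given as a list of messages.

    Assumption: a discussion is represented by the union of the
    shingle sets of all its messages, ignoring reply structure.
    """
    s1: Set[Shingle] = set().union(*(shingles(m, k) for m in d1))
    s2: Set[Shingle] = set().union(*(shingles(m, k) for m in d2))
    return jaccard(s1, s2)


if __name__ == "__main__":
    # Hypothetical discussions, invented for this example.
    existing = [
        "the export button should create a pdf file",
        "agreed, pdf export is needed",
    ]
    recreated = ["the export button should create a pdf file"]
    print(discussion_similarity(existing, recreated))
```

A pair of discussions would be flagged as near-duplicates when the score exceeds a chosen threshold. Flattening a discussion into one shingle set discards its multi-message structure, which is the kind of limitation the abstract attributes to existing methods.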

Notes

  1. http://www.golang.com

  2. GNU Aspell: International Ispell Version 3.1.20 (Aspell 0.60.6.1).

  3. http://public.wsu.edu/~brians/errors/errors.html (Accessed: 2012-09-11).

Acknowledgments

The research herein is partially conducted within the competence network Softnet Austria (www.soft-net.at) and funded by the Austrian Federal Ministry of Economics (bm:wa), the province of Styria, the Steirische Wirtschaftsförderungsgesellschaft mbH (SFG), and the city of Vienna in terms of the center for innovation and technology (ZIT). This work was supported by the project “QE LaB–Living Models for Open Systems (FFG 822740)” and partially funded by the European Commission under the FP7 project “PoSecCo” (IST 257129).

Author information

Corresponding author

Correspondence to Christian Sillaber.

Copyright information

© 2014 Springer Science+Business Media Dordrecht

About this paper

Cite this paper

Sillaber, C., Breu, R. (2014). Improving Near-Duplicate Detection in Multi-Layered Collaborative Requirements Engineering Discussions Through Discussion Clustering. In: Uden, L., Wang, L., Corchado Rodríguez, J., Yang, HC., Ting, IH. (eds) The 8th International Conference on Knowledge Management in Organizations. Springer Proceedings in Complexity. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7287-8_20

  • DOI: https://doi.org/10.1007/978-94-007-7287-8_20

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-007-7286-1

  • Online ISBN: 978-94-007-7287-8

  • eBook Packages: Computer Science, Computer Science (R0)
