Improving Near-Duplicate Detection in Multi-Layered Collaborative Requirements Engineering Discussions Through Discussion Clustering

Sillaber, Christian; Breu, Ruth

doi:10.1007/978-94-007-7287-8_20

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Abstract

Existing methods for finding near-duplicate content often fail when applied to informal user discussions that span multiple messages, as found in collaborative requirements discussions. As a result, stakeholders often recreate requirements discussions that already exist on the underlying knowledge-sharing platform instead of contributing to the existing ones. In this paper we therefore identify common reasons that lead to near-duplicate content and develop a new algorithm for detecting near-duplicate content in multilevel requirements discussions. The algorithm is implemented in a large case study of real-world collaborative requirements engineering platforms serving hundreds of thousands of stakeholders. Our preliminary results show that we outperform existing search algorithms and that we are able to identify near-duplicates in multilevel requirements discussions with high precision.

Notes

1. http://www.golang.com
2. GNU Aspell: International Ispell Version 3.1.20 (Aspell 0.60.6.1).
3. http://public.wsu.edu/~brians/errors/errors.html (Accessed: 2012-09-11).

Acknowledgments

The research herein was partially conducted within the competence network Softnet Austria (www.soft-net.at) and funded by the Austrian Federal Ministry of Economics (bm:wa), the province of Styria, the Steirische Wirtschaftsförderungsgesellschaft mbH (SFG), and the city of Vienna through the Center for Innovation and Technology (ZIT). This work was supported by the project “QE LaB–Living Models for Open Systems (FFG 822740)” and partially funded by the European Commission under the FP7 project “PoSecCo” (IST 257129).

Author information

Authors and Affiliations

Institute of Computer Science, University of Innsbruck, Technikerstrasse 21a A, 6020, Innsbruck, Austria
Christian Sillaber & Ruth Breu

Corresponding author

Correspondence to Christian Sillaber.

Editor information

Editors and Affiliations

School of Computing, Staffordshire University, Stafford, United Kingdom
Lorna Uden
College of Management, National University of Kaohsiung, Kaohsiung, Taiwan
Leon S.L. Wang
Department of Computing Science and Control, Faculty of Science, Universidad Salamanca, Salamanca, Spain
Juan Manuel Corchado Rodríguez
National University of Kaohsiung, Kaohsiung, Taiwan
Hsin-Chang Yang
Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan
I-Hsien Ting

Cite this paper

Sillaber, C., Breu, R. (2014). Improving Near-Duplicate Detection in Multi-Layered Collaborative Requirements Engineering Discussions Through Discussion Clustering. In: Uden, L., Wang, L., Corchado Rodríguez, J., Yang, HC., Ting, IH. (eds) The 8th International Conference on Knowledge Management in Organizations. Springer Proceedings in Complexity. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7287-8_20

DOI: https://doi.org/10.1007/978-94-007-7287-8_20
Published: 06 September 2013
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-7286-1
Online ISBN: 978-94-007-7287-8
eBook Packages: Computer Science, Computer Science (R0)
