Abstract
Existing methods for finding near-duplicate content often fail when applied to informal user discussions that span multiple messages, such as those found in collaborative requirements discussions. As a result, although the underlying knowledge-sharing platform already contains the relevant entries, stakeholders often recreate existing requirements discussions instead of contributing to them. In this paper we therefore identify common causes of near-duplicate content and develop a new algorithm for detecting near-duplicate content in multilevel requirements discussions. The algorithm is implemented in a large case study of real-world collaborative requirements engineering platforms serving hundreds of thousands of stakeholders. Our preliminary results show that we outperform existing search algorithms and can identify near-duplicates in multilevel requirements discussions with high precision.
Notes
- 1.
- 2. GNU Aspell: International Ispell Version 3.1.20 (Aspell 0.60.6.1).
- 3. http://public.wsu.edu/~brians/errors/errors.html (Accessed: 2012-09-11).
Acknowledgments
The research herein is partially conducted within the competence network Softnet Austria (www.soft-net.at) and funded by the Austrian Federal Ministry of Economics (bm:wa), the province of Styria, the Steirische Wirtschaftsförderungsgesellschaft mbH (SFG), and the city of Vienna in terms of the center for innovation and technology (ZIT). This work was supported by the project “QE LaB–Living Models for Open Systems (FFG 822740)” and partially funded by the European Commission under the FP7 project “PoSecCo” (IST 257129).
Copyright information
© 2014 Springer Science+Business Media Dordrecht
About this paper
Cite this paper
Sillaber, C., Breu, R. (2014). Improving Near-Duplicate Detection in Multi-Layered Collaborative Requirements Engineering Discussions Through Discussion Clustering. In: Uden, L., Wang, L., Corchado Rodríguez, J., Yang, HC., Ting, IH. (eds) The 8th International Conference on Knowledge Management in Organizations. Springer Proceedings in Complexity. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7287-8_20
DOI: https://doi.org/10.1007/978-94-007-7287-8_20
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-7286-1
Online ISBN: 978-94-007-7287-8
eBook Packages: Computer Science, Computer Science (R0)