Abstract
Structure-oriented approaches in clone detection have become popular in both code-based and model-based clone detection. However, existing methods for capturing structural information in software artifacts are either too computationally expensive to be efficient or too light-weight to be accurate in clone detection. In this paper, we present Exas, an accurate and efficient structural characteristic feature extraction approach that better approximates and captures the structure within the fragments of artifacts. Exas structural features are the sequences of labels and numbers built from nodes, edges, and paths of various lengths of a graph-based representation. A fragment is characterized by a structural characteristic vector of the occurrence counts of those features. We have applied Exas in building two clone detection tools for source code and models. Our analytic study and empirical evaluation on open-source software show that Exas and its algorithm for computing the characteristic vectors are highly accurate and efficient in clone detection.
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Kontogiannis, K.A., Demori, R., Merlo, E., Galler, M., Bernstein, M.: Pattern matching for clone and concept detection. Reverse Engineering, 77–108 (1996)
Roy, C., Cordy, J.: Towards a mutation-based automatic framework for evaluating code clone detection tools. In: C3S2E 2008, pp. 137–140. ACM, New York (2008)
Read, R., Corneil, D.: The graph isomorph disease. Journal of Graph Theory 1, 339–363 (1977)
Baxter, I.D., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: ICSM 1998, p. 368. IEEE CS, Los Alamitos (1998)
Li, Z., Lu, S., Myagmar, S.: CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Trans. Softw. Eng. 32(3), 176–192 (2006)
Kamiya, T., Kusumoto, S., Inoue, K.: CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng. 28(7), 654–670 (2002)
Evans, W.S., Fraser, C.W., Ma, F.: Clone detection via structural abstraction. In: WCRE 2007: Working Conference on Reverse Engineering, pp. 150–159. IEEE CS, Los Alamitos (2007)
Jiang, L., Misherghi, G., Su, Z., Glondu, S.: Deckard: scalable and accurate tree-based detection of code clones. In: ICSE 2007, pp. 96–105. IEEE CS, Los Alamitos (2007)
Nguyen, T.T., Nguyen, H.A., Pham, N.H., Al-Kofahi, J.M., Nguyen, T.N.: Cleman: Comprehensive clone group evolution management. In: ASE 2008. IEEE CS, Los Alamitos (2008)
The MathWorks Inc. SIMULINK Model-Based and System-Based Design (2002)
Deissenboeck, F., Hummel, B., Jürgens, E., Schätz, B., Wagner, S., Girard, J.F., Teuchert, S.: Clone detection in automotive model-based development. In: ICSE 2008, pp. 603–612. ACM, New York (2008)
Pham, N.H., Nguyen, H.A., Nguyen, T.T., Al-Kofahi, J.M., Nguyen, T.N.: Complete and Accurate Clone Detection in Graph-based Models. In: ICSE 2009, International Conference on Software Engineering. IEEE CS, Los Alamitos (2009)
Ukkonen, E.: Approximate string matching with q-grams and maximal matches. Albert-Ludwigs University at Freiburg. Technical report (1991)
Yang, R., Kalnis, P., Tung, A.K.H.: Similarity evaluation on tree-structured data. In: SIGMOD 2005: International conference on Management of data (2005)
Kuramochi, M., Karypis, G.: Finding frequent patterns in a large sparse graph*. Data Mining and Knowledge Discovery 11(3), 243–271 (2005)
Liu, H., Ma, Z., Zhang, L., Shao, W.: Detecting duplications in sequence diagrams based on suffix trees. In: APSEC 2006, pp. 269–276. IEEE CS, Los Alamitos (2006)
Ohst, D., Welle, M., Kelter, U.: Differences between versions of UML diagrams. SIGSOFT Softw. Eng. Notes 28(5), 227–236 (2003)
Xing, Z., Stroulia, E.: UMLDiff: an algorithm for object-oriented design differencing. In: ASE 2005, pp. 54–65. ACM, New York (2005)
Mehra, A., Grundy, J., Hosking, J.: A generic approach to supporting diagram differencing and merging for collaborative design. In: ASE 2005, pp. 204–213. ACM, New York (2005)
Nejati, S., Sabetzadeh, M., Chechik, M., Easterbrook, S., Zave, P.: Matching and merging of statecharts specifications. In: ICSE 2007, pp. 54–64. IEEE CS, Los Alamitos (2007)
Xiong, Y., Liu, D., Hu, Z., Zhao, H., Takeichi, M., Mei, H.: Towards automatic model synchronization from model transformations. In: ASE 2007, pp. 164–173. ACM, New York (2007)
Bergmann, G., Ökrös, A., Ráth, I., Varró, D., Varró, G.: Incremental pattern matching in the viatra model transformation system. In: GRaMoT 2008: Proc. of international workshop on graph and model transformations, pp. 25–32. ACM, New York (2008)
Fluri, B., Wuersch, M., PInzger, M., Gall, H.: Change distilling: Tree differencing for fine-grained source code change extraction. IEEE Trans. Softw. Eng. 33(11), 725–743 (2007)
Basit, H., Jarzabek, S.: Efficient token based clone detection with flexible tokenization. In: FSE 2007, pp. 513–516. ACM, New York (2007)
Bellon, S., Koschke, R., Antoniol, G., Krinke, J., Merlo, E.: Comparison and evaluation of clone detection tools. IEEE Trans. Softw. Eng. 33(9), 577–591 (2007)
Koschke, R., Falke, R., Frenzel, P.: Clone detection using abstract syntax suffix trees. In: WCRE 2006, pp. 253–262. IEEE CS, Los Alamitos (2006)
Baker, B.S.: Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences 26(1), 28–42 (1996)
Johnson, J.H.: Identifying redundancy in source code using fingerprints. In: CASCON 1993, pp. 171–183. IBM Press (1993)
Baker, B.S.: Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput. 26(5), 1343–1362 (1997)
Baxter, I.D., Pidgeon, C., Mehlich, M.: DMS®: Program transformations for practical scalable software evolution. In: ICSE 2004, pp. 625–634. IEEE CS, Los Alamitos (2004)
Wahler, V., Seipel, D., Gudenberg, J.W., Fischer, G.: Clone detection in source code by frequent itemset techniques. In: SCAM 2004, pp. 128–135. IEEE CS, Los Alamitos (2004)
Komondoor, R., Horwitz, S.: Using slicing to identify duplication in source code. In: Cousot, P. (ed.) SAS 2001. LNCS, vol. 2126, pp. 40–56. Springer, Heidelberg (2001)
Mayrand, J., Leblanc, C., Merlo, E.: Experiment on the automatic detection of function clones in a software system using metrics. In: ICSM 1996, p. 244. IEEE CS, Los Alamitos (1996)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nguyen, H.A., Nguyen, T.T., Pham, N.H., Al-Kofahi, J.M., Nguyen, T.N. (2009). Accurate and Efficient Structural Characteristic Feature Extraction for Clone Detection. In: Chechik, M., Wirsing, M. (eds) Fundamental Approaches to Software Engineering. FASE 2009. Lecture Notes in Computer Science, vol 5503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00593-0_31
Download citation
DOI: https://doi.org/10.1007/978-3-642-00593-0_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00592-3
Online ISBN: 978-3-642-00593-0
eBook Packages: Computer ScienceComputer Science (R0)