
Pattern matching for clone and concept detection

Automated Software Engineering

Abstract

A legacy system is an operational, large-scale software system that is maintained beyond its first generation of programmers. It typically represents a massive economic investment and is critical to the mission of the organization it serves. As such systems age, they become increasingly complex and brittle, and hence harder to maintain. They also become even more critical to the survival of their organization because the business rules encoded within the system are seldom documented elsewhere.

Our research is concerned with developing a suite of tools to aid the maintainers of legacy systems in recovering the knowledge embodied within the system. These activities, known collectively as “program understanding”, are essential preludes to several key processes, including maintenance and design recovery for reengineering.

In this paper we present three pattern-matching techniques: source code metrics, a dynamic programming algorithm for finding the best alignment between two code fragments, and a statistical matching algorithm between abstract code descriptions represented in an abstract language and actual source code. The methods are applied to detect instances of code cloning in several moderately sized production systems, including tcsh, bash, and CLIPS.
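To give a feel for the dynamic programming alignment mentioned above, the following is a minimal sketch, not the algorithm from the paper. It assumes each code fragment has already been reduced to a sequence of statement labels, and it uses unit insertion, deletion, and substitution costs; the function name `alignment_cost`, the labels, and the cost parameters are all hypothetical choices made for illustration.

```python
def alignment_cost(frag_a, frag_b, sub_cost=1, gap_cost=1):
    """Minimum cost of aligning two statement sequences (edit-distance style DP).

    Assumption: fragments are lists of statement labels; costs are illustrative.
    """
    rows, cols = len(frag_a) + 1, len(frag_b) + 1
    # dist[i][j] = cheapest alignment of frag_a[:i] with frag_b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * gap_cost  # delete every statement of frag_a[:i]
    for j in range(1, cols):
        dist[0][j] = j * gap_cost  # insert every statement of frag_b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = frag_a[i - 1] == frag_b[j - 1]
            dist[i][j] = min(
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitute
                dist[i - 1][j] + gap_cost,                       # delete from frag_a
                dist[i][j - 1] + gap_cost,                       # insert from frag_b
            )
    return dist[-1][-1]


# Hypothetical fragments: the second differs by one inserted statement.
fragment_1 = ["decl", "assign", "loop", "call", "return"]
fragment_2 = ["decl", "assign", "loop", "assign", "call", "return"]
print(alignment_cost(fragment_1, fragment_2))
```

A low cost relative to fragment length suggests a candidate clone pair. A real detector would need a richer statement representation and a cost model tuned to the language being analyzed.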

The programmer's skill and experience are essential elements of our approach. Selection of particular tools and analysis methods depends on the needs of the particular task to be accomplished. Integration of the tools provides opportunities for synergy, allowing the programmer to select the most appropriate tool for a given task.




Additional information

This work is supported in part by IBM Canada Ltd.; the Institute for Robotics and Intelligent Systems, a Canadian Network of Centers of Excellence; and the Natural Sciences and Engineering Research Council of Canada. Based on “Pattern Matching for Design Concept Localization” by K.A. Kontogiannis, R. DeMori, M. Bernstein, M. Galler, and E. Merlo, which first appeared in Proceedings of the Second Working Conference on Reverse Engineering, pp. 96–103, July 1995, © IEEE, 1995.


Cite this article

Kontogiannis, K.A., Demori, R., Merlo, E. et al. Pattern matching for clone and concept detection. Automated Software Engineering 3, 77–108 (1996). https://doi.org/10.1007/BF00126960


  • DOI: https://doi.org/10.1007/BF00126960
