On Identifying Similarities in Git Commit Trends—A Comparison Between Clustering and SimSAX

Ochodek, Miroslaw; Staron, Miroslaw; Meding, Wilhelm

doi:10.1007/978-3-030-35510-4_7

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 371)

Included in the following conference series:

International Conference on Software Quality

Abstract

Software products evolve increasingly fast as markets continuously demand new features and responsiveness to customers' needs. This product evolution in turn changes software development practices. Compared to classical methods, where products were developed in projects, contemporary methods for continuous integration, delivery, and deployment develop products as part of continuous programs. In this context, software architects, designers, and quality engineers need to understand how the processes evolve over time, since projects no longer have a natural start and stop. For example, they need to know how similar two iterations of the same program are, or how similar two development programs are. In this paper, we compare three methods for calculating the degree of similarity between projects by comparing their Git commit series: the DNA-motifs-inspired SimSAX measure and two methods for clustering subsequences (k-Means and hierarchical clustering). Our results show that the clustering algorithms are much more sensitive to parameters and often find similarities that are not correct. SimSAX, on the other hand, can be calibrated to find fewer similarities between the projects, and the similarities it finds are more consistent than those found by clustering. We conclude that it is better to use DNA-inspired motifs, as they provide more accurate results.
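To make the idea of motif-based comparison more concrete, the following is a minimal sketch of a SAX-style comparison of two commit series. It is not the authors' SimSAX implementation (see the tool linked in the Notes). The function names `sax_word`, `sax_words`, and `coverage` are hypothetical, and reading the parameters as window length `n`, word length `w`, and alphabet size `a` is an assumption made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def sax_word(window, w, a):
    """Convert one window to a SAX word of length w over an alphabet of size a.

    Assumes len(window) is divisible by w (an illustrative simplification).
    """
    mu, sigma = mean(window), pstdev(window)
    # z-normalise; a flat window maps to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in window]
    # Piecewise aggregate approximation: mean of w equal-sized segments
    seg = len(z) // w
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(w)]
    # Breakpoints cut the standard normal into a equiprobable regions
    cuts = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    # Each PAA value becomes the letter of the region it falls into
    return "".join(chr(ord("a") + sum(v > c for c in cuts)) for v in paa)


def sax_words(series, n, w, a):
    """SAX words for every sliding window of length n."""
    return [sax_word(series[i:i + n], w, a) for i in range(len(series) - n + 1)]


def coverage(series_a, series_b, n, w, a):
    """Fraction of points in series_a covered by windows whose word also occurs in series_b."""
    words_b = set(sax_words(series_b, n, w, a))
    covered = set()
    for i, word in enumerate(sax_words(series_a, n, w, a)):
        if word in words_b:
            covered.update(range(i, i + n))
    return len(covered) / len(series_a)


# Hypothetical weekly commit counts for two projects with the same trend shape
project_a = [1, 2, 3, 4, 1, 2, 3, 4]
project_b = [10, 20, 30, 40, 10, 20, 30, 40]
print(coverage(project_a, project_b, n=4, w=2, a=3))
```

Because each window is z-normalised before it is discretised, the two series above share all of their words even though their absolute commit counts differ by a factor of ten. A clustering-based alternative would instead group the raw or normalised windows and compare cluster memberships, which adds parameters such as the number of clusters.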

Notes

1. The SimSAX tool: https://github.com/mochodek/simsax.
2. More information about calibrating \(\mathrm{SimSAX}_{n,w,a}(A, B)\) can be found in [14].

References

1. van der Aalst, W.M.P., de Medeiros, A.K.A., Weijters, A.J.M.M.: Process equivalence: comparing two process models based on observed behavior. In: Dustdar, S., Fiadeiro, J.L., Sheth, A.P. (eds.) BPM 2006. LNCS, vol. 4102, pp. 129–144. Springer, Heidelberg (2006). https://doi.org/10.1007/11841760_10
2. Aghabozorgi, S., Shirkhorshidi, A.S., Wah, T.Y.: Time-series clustering - a decade review. Inf. Syst. 53, 16–38 (2015)
3. Bardsiri, V.K., Jawawi, D.N.A., Hashim, S.Z.M., Khatibi, E.: Increasing the accuracy of software development effort estimation using projects clustering. IET Softw. 6(6), 461–473 (2012)
4. Bosch, J.: Continuous Software Engineering. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-11283-1
5. Bosch, J.: Speed, data, and ecosystems: the future of software engineering. IEEE Softw. 33(1), 82–88 (2016)
6. Hindle, A., German, D.M., Holt, R.: What do large commits tell us?: a taxonomical study of large commits. In: Proceedings of the 2008 International Working Conference on Mining Software Repositories, pp. 99–108. ACM (2008)
7. Jones, E., Oliphant, T., Peterson, P., et al.: SciPy: Open source scientific tools for Python (2001). http://www.scipy.org/. Accessed 12 Mar 2018
8. Keogh, E., Ratanamahatana, C.A.: Exact indexing of dynamic time warping. Knowl. Inf. Syst. 7(3), 358–386 (2004)
9. Keogh, E.J., Pazzani, M.J.: A simple dimensionality reduction technique for fast similarity search in large time series databases. In: Terano, T., Liu, H., Chen, A.L.P. (eds.) PAKDD 2000. LNCS (LNAI), vol. 1805, pp. 122–133. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45571-X_14
10. Liao, T.W.: Clustering of time series data - a survey. Pattern Recogn. 38(11), 1857–1874 (2005)
11. Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 2–11. ACM (2003)
12. Lokan, C., Wright, T., Hill, P., Stringer, M.: Organizational benchmarking using the ISBSG data repository. IEEE Softw. 18(5), 26–32 (2001)
13. Nayebi, M., Kuznetsov, K., Chen, P., Zeller, A., Ruhe, G.: Anatomy of functionality deletion. In: Proceedings of the Conference on Mining Software Repositories (MSR18), Gothenburg, Sweden (2018)
14. Ochodek, M., Staron, M., Meding, W.: SimSAX: a measure of project similarity based on symbolic approximation method and software defect inflow. Inf. Softw. Technol. (2019). http://www.sciencedirect.com/science/article/pii/S0950584919301363
15. Rana, R., Staron, M., Berger, C., Hansson, J., Nilsson, M., Törner, F., Meding, W., Höglund, C.: Selecting software reliability growth models and improving their predictive accuracy using historical projects data. J. Syst. Softw. 98, 59–78 (2014)
16. Shepperd, M., Schofield, C.: Estimating software project effort using analogies. IEEE Trans. Softw. Eng. 23(11), 736–743 (1997)
17. Silhavy, R., Silhavy, P., Prokopová, Z.: Evaluating subset selection methods for use case points estimation. Inf. Softw. Technol. 97, 1–9 (2018)
18. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Boston (2000)

Author information

Authors and Affiliations

Poznan University of Technology, Poznań, Poland
Miroslaw Ochodek
Chalmers | University of Gothenburg, Gothenburg, Sweden
Miroslaw Staron
Ericsson AB, Gothenburg, Sweden
Wilhelm Meding

Corresponding author

Correspondence to Miroslaw Ochodek.

Editor information

Editors and Affiliations

Vienna University of Technology, Vienna, Austria
Dietmar Winkler
Vienna University of Technology, Vienna, Austria
Stefan Biffl
fortiss GmbH, Germany, and Blekinge Institute of Technology, Karlskrona, Sweden
Daniel Mendez
Software Quality Lab GmbH, Linz, Austria
Johannes Bergsmann

Cite this paper

Ochodek, M., Staron, M., Meding, W. (2020). On Identifying Similarities in Git Commit Trends—A Comparison Between Clustering and SimSAX. In: Winkler, D., Biffl, S., Mendez, D., Bergsmann, J. (eds) Software Quality: Quality Intelligence in Software and Systems Engineering. SWQD 2020. Lecture Notes in Business Information Processing, vol 371. Springer, Cham. https://doi.org/10.1007/978-3-030-35510-4_7

DOI: https://doi.org/10.1007/978-3-030-35510-4_7
Published: 09 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35509-8
Online ISBN: 978-3-030-35510-4
eBook Packages: Computer Science (R0)
