ASAP: A Source Code Authorship Program

Abstract

Source code authorship attribution is the task of determining who wrote a computer program from its source code, usually when the author is unknown or disputed. Applications include software forensics, software copyright infringement cases, and plagiarism detection. Numerous methods of source code authorship attribution have been proposed and studied. However, no easily accessible, user-friendly program is known to perform this task. Instead, researchers typically develop software ad hoc for their own studies and rarely make it publicly available. In this paper, we present a software tool called A Source Code Authorship Program (ASAP), which is suitable for use by either the layperson or the expert. Individual documents can be attributed to an author one at a time, or complex authorship attribution experiments can be performed on large datasets. The interface and implementation of ASAP are presented, and the tool is validated by using it to replicate previously published authorship attribution experiments.
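To make the attribution task concrete, the following is a minimal sketch of one well-known family of approaches, in which each candidate author is represented by a profile of frequent character n-grams and an unknown document is assigned to the author whose profile overlaps most with its own. It is an illustration only and is not ASAP's implementation; the function names `ngram_profile` and `attribute`, the n-gram size, and the profile length are all assumptions chosen for the example.

```python
from __future__ import annotations

from collections import Counter


def ngram_profile(text: str, n: int = 3, profile_length: int = 500) -> set[str]:
    """Return the set of the most frequent character n-grams in `text`.

    `n` and `profile_length` are illustrative defaults, not tuned values.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram for gram, _ in counts.most_common(profile_length)}


def attribute(
    unknown: str,
    known: dict[str, list[str]],
    n: int = 3,
    profile_length: int = 500,
) -> str:
    """Return the candidate author whose profile shares the most n-grams
    with the profile of the unknown document.

    `known` maps each candidate author to a list of source files
    (as strings) that the author is known to have written.
    """
    target = ngram_profile(unknown, n, profile_length)
    scores = {
        # All of an author's samples are joined into one training text.
        author: len(target & ngram_profile("\n".join(samples), n, profile_length))
        for author, samples in known.items()
    }
    return max(scores, key=scores.get)
```

Calling `attribute(unknown_source, {"alice": [...], "bob": [...]})` returns the name of the best-matching candidate. A sketch like this always returns one of the supplied candidates, so it cannot by itself indicate that the true author is absent from the candidate set.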

Acknowledgements

I would like to extend a sincere word of thanks to the following current and former students for their software development contributions: Ethan Hill, Jacob Siegers, Justin Sassine, Conor Aberle, Joseph Sorgea, Anirudh Kambatla, Brian Rickard, and Michael Decker.

Author information

Corresponding author

Correspondence to Matthew F. Tennyson.

About this article

Cite this article

Tennyson, M.F. ASAP: A Source Code Authorship Program. Int J Softw Tools Technol Transfer 21, 471–484 (2019). https://doi.org/10.1007/s10009-019-00517-3

Keywords

  • Authorship attribution
  • Source code
  • Software forensics
  • Plagiarism detection
  • Software copyright infringement
  • Similarity search
  • Information retrieval
  • Machine learning