Using LSTMs to Model the Java Programming Language

  • Brendon Boldt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10614)


Recurrent neural networks (RNNs), specifically long short-term memory (LSTM) networks, can model natural language effectively. This research investigates the ability of these same LSTMs to perform next-"word" prediction on the Java programming language. Java source code from four different repositories undergoes a transformation that preserves the logical structure of the code while removing its specificities, such as variable names and literal values. These datasets, together with an additional English-language corpus, are used to train and test standard LSTMs' ability to predict the next element in a sequence. The results suggest that LSTMs can model Java code effectively, achieving perplexities under 22 and accuracies above 0.47, which is an improvement over the LSTMs' performance on English, where they achieved a perplexity of 85 and an accuracy of 0.27. This research has applications in other areas such as syntactic template suggestion and automated bug patching.
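The two ideas in the abstract can be illustrated with a short sketch. The paper's exact tokenizer is not reproduced here, so the function `abstract_tokens` and its placeholder tokens `<id>` and `<lit>` are illustrative assumptions about what "removing specificities" might look like; the `perplexity` helper simply applies the standard definition of perplexity as the exponential of the mean per-token cross-entropy, which is how the reported figures (22 for Java, 85 for English) are typically computed.

```python
import math
import re

# A minimal, assumed set of Java keywords to keep verbatim; the paper's
# actual transformation is not specified at this level of detail.
JAVA_KEYWORDS = {"int", "return", "if", "for", "while", "public", "void", "new"}

def abstract_tokens(java_line):
    """Replace identifiers and numeric literals with placeholder tokens,
    preserving keywords and punctuation (illustrative only)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", java_line)
    out = []
    for t in tokens:
        if t in JAVA_KEYWORDS:
            out.append(t)                 # structural keyword: keep as-is
        elif re.fullmatch(r"\d+", t):
            out.append("<lit>")           # literal value: abstract away
        elif re.fullmatch(r"[A-Za-z_]\w*", t):
            out.append("<id>")            # variable/method name: abstract away
        else:
            out.append(t)                 # punctuation carries structure
    return out

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-probability per token)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

For example, `abstract_tokens("int x = 42;")` yields `["int", "<id>", "=", "<lit>", ";"]`, and a model that assigned every token probability 1/22 would have perplexity exactly 22, matching the scale of the Java result reported above.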


Long short-term memory (LSTM) · Java programming language · Java source code · LSTM neural network · Java corpus



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Marist College, Poughkeepsie, USA
