Neural ParsCit: a deep learning-based reference string parser


We present a deep learning approach for the core digital libraries task of parsing bibliographic reference strings. We deploy the state-of-the-art long short-term memory (LSTM) neural network architecture, a variant of the recurrent neural network, to capture long-range dependencies in reference strings. We explore word embeddings and character-based word embeddings as alternatives to handcrafted features. We incrementally experiment with features, architectural configurations, and the diversity of the dataset. Our final model is an LSTM-based architecture that layers a linear-chain conditional random field (CRF) over the LSTM output. In extensive experiments on both English in-domain (computer science) and out-of-domain (humanities) test cases, as well as on multilingual data, our results show a significant gain (\(p<0.01\)) over the reported state-of-the-art CRF-only parser.
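To illustrate what layering a linear-chain CRF over per-token scores adds, the following is a minimal sketch of Viterbi decoding, not the authors' implementation. The label set, the token sequence, and all emission and transition scores are hypothetical values chosen for illustration; in an LSTM-CRF model the emission scores would come from the LSTM output and the transition scores would be learned.

```python
from typing import Dict, List

Scores = Dict[str, float]


def viterbi_decode(emissions: List[Scores],
                   transitions: Dict[str, Scores],
                   labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence under a linear-chain CRF."""
    if not emissions:
        return []
    # best[i][y]: score of the best label path that ends in label y at token i
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[i-1][y]: previous label on that best path
    for scores in emissions[1:]:
        prev = best[-1]
        cur, ptr = {}, {}
        for y in labels:
            # pick the previous label that maximises the score of moving into y
            p = max(labels, key=lambda q: prev[q] + transitions[q][y])
            cur[y] = prev[p] + transitions[p][y] + scores[y]
            ptr[y] = p
        best.append(cur)
        back.append(ptr)
    # follow the back-pointers from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical label set and scores (assumptions, not taken from the paper).
LABELS = ["author", "title", "date"]
TOKENS = ["Prasad", "Neural", "Turing", "Machines", "2018"]
EMISSIONS = [
    {"author": 2.0, "title": 0.0, "date": 0.0},
    {"author": 0.0, "title": 2.0, "date": 0.0},
    {"author": 1.0, "title": 0.5, "date": 0.0},  # a name inside a title
    {"author": 0.0, "title": 2.0, "date": 0.0},
    {"author": 0.0, "title": 0.0, "date": 3.0},
]
TRANSITIONS = {
    "author": {"author": 1.0, "title": 0.0, "date": -1.0},
    "title": {"author": -1.0, "title": 1.0, "date": 0.0},
    "date": {"author": -1.0, "title": -1.0, "date": 1.0},
}

# Per-token argmax ignores label context; Viterbi scores the whole sequence.
greedy = [max(LABELS, key=lambda y: e[y]) for e in EMISSIONS]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, LABELS)
print(list(zip(TOKENS, greedy)))
print(list(zip(TOKENS, decoded)))
```

In this toy setup the per-token argmax labels "Turing" as an author because its local score is highest, while the sequence-level decoding keeps it inside the title, since the assumed transition scores penalise switching from title back to author.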





    Code and data available at





We would like to acknowledge the support of the NExT research grant funds, supported by the National Research Foundation, Prime Minister’s Office, Singapore, under its IRC @ SG Funding Initiative. We would also like to gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX Titan X GPU used for this research. We also acknowledge Muthu Kumar Chandrasekaran for insightful feedback on data handling and editing, along with Kishaloy Halder and Wenqiang Lei for their help in the ongoing integration with the current ParsCit pipeline.

Corresponding author

Correspondence to Animesh Prasad.


Cite this article

Prasad, A., Kaur, M. & Kan, M. Neural ParsCit: a deep learning-based reference string parser. Int J Digit Libr 19, 323–337 (2018).


  • Reference string parsing
  • Sequence labeling
  • CRF
  • LSTM