Abstract
We present a deep learning approach for the core digital libraries task of parsing bibliographic reference strings. We deploy the state-of-the-art long short-term memory (LSTM) neural network architecture, a variant of a recurrent neural network to capture long-range dependencies in reference strings. We explore word embeddings and character-based word embeddings as an alternative to handcrafted features. We incrementally experiment with features, architectural configurations, and the diversity of the dataset. Our final model is an LSTM-based architecture, which layers a linear chain conditional random field (CRF) over the LSTM output. In extensive experiments in both English in-domain (computer science) and out-of-domain (humanities) test cases, as well as multilingual data, our results show a significant gain (\(p<0.01\)) over the reported state-of-the-art CRF-only-based parser.
Similar content being viewed by others
Notes
References
Bengio, Y.: Learning deep architectures for AI. Found. trends\({\textregistered }\) Mach. Learn.2(1), 1–127 (2009)
Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: A CPU and GPU math compiler in python. In: Proceedings of 9th Python in Science Conference, pp. 1–7 (2010)
Chen, C.C., Yang, K.H., Chen, C.L., Ho, J.M.: Bibpro: a citation parser based on sequence alignment. IEEE Trans. Knowl. Data Eng. 24(2), 236–250 (2012)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
Councill, I.G., Giles, C.L., Kan, M.Y.: Parscit: an open-source CRF reference string parsing package. LREC 8, 661–667 (2008)
Cuong, N.V., Chandrasekaran, M.K., Kan, M.Y., Lee, W.S.: Scholarly document information extraction using extensible features for efficient higher order semi-CRFs. In: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM, pp. 61–64 (2015)
Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)
Giles, C.L., Bollacker, K.D., Lawrence, S.: Citeseer: an automatic citation indexing system. In: Proceedings of the Third ACM Conference on Digital Libraries, ACM, pp. 89–98 (1998)
Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069 (2015)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Kern, R., Klampfl, S.: Extraction of references using layout and formatting information from scientific articles. D-Lib Mag. 19(9/10), (2013)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML, vol. 1, pp. 282–289 (2001)
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp. 3111–3119 (2013)
Mohamed, A.R., Dahl, G.E., Hinton, G.: Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20(1), 14–22 (2012)
Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. ICML 3(28), 1310–1318 (2013)
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. EMNLP 14, 1532–43 (2014)
Rabiner, L., Juang, B.: An introduction to hidden Markov models. IEEE ASSP Mag. 3(1), 4–16 (1986)
Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 833–840 (2011)
Romanello, M., Boschetti, F., Crane, G.: Citations in the digital library of classics: extracting canonical references by using conditional random fields. In: Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, Association for Computational Linguistics, pp. 80–87 (2009)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical report, DTIC Document (1985)
Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P.J., Bolikowski, Ł.: Cermine: automatic extraction of structured metadata from scientific literature. IJDAR 18(4), 317–335 (2015)
Acknowledgements
We would like to acknowledge the support of the NExT research grant funds, supported by the National Research Foundation, Prime Minister’s Office, Singapore, under its IRC @ SG Funding Initiative. We would also like to gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX Titan X GPU used for this research. We also acknowledge Muthu Kumar Chandrasekaran for insightful feedback on data handling and editing, along with Kishaloy Halder and Wenqiang Lei for their help in the ongoing integration with the current ParsCit pipeline.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Prasad, A., Kaur, M. & Kan, MY. Neural ParsCit: a deep learning-based reference string parser. Int J Digit Libr 19, 323–337 (2018). https://doi.org/10.1007/s00799-018-0242-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00799-018-0242-1