Abstract
The CL-SciSumm 2016 shared task introduced an interesting problem: given a document D and a piece of text that cites D, how do we identify the text spans of D being referenced by the piece of text? The shared task provided the first annotated dataset for studying this problem. We present an analysis of our continued work in improving our system’s performance on this task. We demonstrate how topic models and word embeddings can be used to surpass the previously best performing system.
Acknowledgements
We would like to thank Avisha Das, Arthur Dunbar, and Mahsa Shafaei for providing the annotations. The authors acknowledge the use of the Maxwell Cluster and the advanced support from the Center of Advanced Computing and Data Systems at the University of Houston to carry out the research presented here.
Additional information
Research supported in part by NSF Grants CNS 1319212, DGE 1433817, and DUE 1241772.
Cite this article
Moraes, L., Baki, S., Verma, R. et al. Identifying reference spans: topic modeling and word embeddings help IR. Int J Digit Libr 19, 191–202 (2018). https://doi.org/10.1007/s00799-017-0220-z
DOI: https://doi.org/10.1007/s00799-017-0220-z