Identifying reference spans: topic modeling and word embeddings help IR

International Journal on Digital Libraries

Abstract

The CL-SciSumm 2016 shared task introduced an interesting problem: given a document D and a piece of text that cites D, how do we identify the text spans of D that the citing text references? The shared task provided the first annotated dataset for studying this problem. We present an analysis of our continued work on improving our system’s performance on this task. We demonstrate how topic models and word embeddings can be used to surpass the previously best-performing system.

Notes

  1. https://aclweb.org/anthology/.

Acknowledgements

We would like to thank Avisha Das, Arthur Dunbar, and Mahsa Shafaei for providing the annotations. The authors acknowledge the use of the Maxwell Cluster and the advanced support from the Center of Advanced Computing and Data Systems at the University of Houston to carry out the research presented here.

Author information

Corresponding author

Correspondence to Luis Moraes.

Additional information

Research supported in part by NSF Grants CNS 1319212, DGE 1433817, and DUE 1241772.

About this article

Cite this article

Moraes, L., Baki, S., Verma, R. et al. Identifying reference spans: topic modeling and word embeddings help IR. Int J Digit Libr 19, 191–202 (2018). https://doi.org/10.1007/s00799-017-0220-z

  • DOI: https://doi.org/10.1007/s00799-017-0220-z
