Identifying reference spans: topic modeling and word embeddings help IR

Moraes, Luis; Baki, Shahryar; Verma, Rakesh; Lee, Daniel

doi:10.1007/s00799-017-0220-z

Published: 20 June 2017

Volume 19, pages 191–202 (2018)

International Journal on Digital Libraries

Luis Moraes¹ (ORCID: orcid.org/0000-0002-2643-8647), Shahryar Baki¹, Rakesh Verma¹ & Daniel Lee¹

Abstract

The CL-SciSumm 2016 shared task introduced an interesting problem: given a document D and a piece of text that cites D, how do we identify the text spans of D that the citing text references? The shared task provided the first annotated dataset for studying this problem. We present an analysis of our continued work on improving our system’s performance on this task. We demonstrate how topic models and word embeddings can be used to surpass the previously best-performing system.
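
To make the problem statement concrete, the following is a minimal sketch of a generic IR-style baseline for it: each sentence of D is scored against the citing text by TF-IDF cosine similarity, and the sentences are returned in ranked order. This is an illustration only and is not the system analysed in the article; the function names (tokenize, rank_sentences), the smoothed IDF formula, and the choice of single sentences as candidate spans are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_sentences(citing_text, sentences):
    """Return indices of `sentences`, best match to `citing_text` first.

    Hypothetical helper: scores are TF-IDF cosine similarities, with IDF
    estimated from the sentences of the cited document only.
    """
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for doc in docs for term in doc)

    def idf(term):
        # Smoothed IDF (an assumption), defined even for unseen terms.
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vectorize(counts):
        return {t: c * idf(t) for t, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query = vectorize(Counter(tokenize(citing_text)))
    scores = [cosine(query, vectorize(doc)) for doc in docs]
    # sorted() is stable, so ties keep their original document order.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

A purely lexical ranker like this can only match terms that appear verbatim in both texts. The topic models and word embeddings named in the abstract are the kind of signal that could stand in for, or be combined with, the cosine score here; how the article actually combines them is not shown on this page.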

Notes

https://aclweb.org/anthology/.

Acknowledgements

We would like to thank Avisha Das, Arthur Dunbar, and Mahsa Shafaei for providing the annotations. The authors acknowledge the use of the Maxwell Cluster and the advanced support from the Center of Advanced Computing and Data Systems at the University of Houston to carry out the research presented here.

Author information

Authors and Affiliations

Department of Computer Science, University of Houston, Houston, TX, 77204, USA
Luis Moraes, Shahryar Baki, Rakesh Verma & Daniel Lee

Corresponding author

Correspondence to Luis Moraes.

Additional information

Research supported in part by NSF Grants CNS 1319212, DGE 1433817, and DUE 1241772.

About this article

Cite this article

Moraes, L., Baki, S., Verma, R. et al. Identifying reference spans: topic modeling and word embeddings help IR. Int J Digit Libr 19, 191–202 (2018). https://doi.org/10.1007/s00799-017-0220-z

Received: 22 October 2016
Revised: 01 June 2017
Accepted: 06 June 2017
Published: 20 June 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s00799-017-0220-z
