Citance-based retrieval and summarization using IR and machine learning

Karimi, Samaneh; Moraes, Luis; Das, Avisha; Shakery, Azadeh; Verma, Rakesh

doi:10.1007/s11192-018-2785-8

Published: 04 July 2018

Scientometrics, volume 116, pages 1331–1366 (2018)

Samaneh Karimi (ORCID: orcid.org/0000-0003-3483-0590)^1,2,
Luis Moraes^2,
Avisha Das^2,
Azadeh Shakery^1,3 &
Rakesh Verma^2

Abstract

We consider the three interesting problems posed by the CL-SciSumm series of shared tasks. Given a reference document D and a set \(C_D\) of citances for D: (1) find the span of reference text that corresponds to each citance \(c \in C_D\), (2) identify the facet corresponding to each span of reference text from a predefined list of five facets, and (3) construct a summary of at most 250 words for D based on the reference spans. The shared task provided annotated training and test sets for these problems. This paper describes our efforts and the results achieved for each problem, and discusses some interesting parameters of the datasets, which may spur further improvements and innovations.
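
Problem (1) can be read as a sentence-retrieval task: the citance acts as the query and the sentences of D are the candidates. The sketch below illustrates that framing with plain TF-IDF cosine similarity. It is a generic baseline chosen for illustration, not the system described in the paper; the function names, the tokenizer, the smoothed IDF formula and the `top_k` parameter are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (an illustrative choice).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(token_lists):
    # Smoothed IDF, so a term found in every sentence keeps a positive weight.
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_reference_sentences(citance, reference_sentences, top_k=3):
    # Hypothetical helper: returns (sentence_index, score) pairs, best first.
    vectors, idf = tfidf_vectors([tokenize(s) for s in reference_sentences])
    # Citance terms that never occur in the reference document are ignored.
    query = {
        term: tf * idf[term]
        for term, tf in Counter(tokenize(citance)).items()
        if term in idf
    }
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_k]]


if __name__ == "__main__":
    sentences = [
        "We propose a positional language model for retrieval.",
        "Experiments were run on three corpora.",
        "The model smooths term counts by position.",
    ]
    citance = "Their positional language model improves retrieval."
    for index, score in rank_reference_sentences(citance, sentences):
        print(index, round(score, 3))
```

In this toy input the first sentence shares the most weighted terms with the citance and is ranked first, while the sentence with no overlapping terms receives a score of zero.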

Notes

A short piece of text from D.
I.e., what specific information has been cited.
Some examples of how this difference in rankings can occur for Task 1A with the two different metrics were given in Jaidka et al. (2017).
Only alphanumeric characters remain unfiltered.
https://aclweb.org/aclwiki/Recognizing_Textual_Entailment.
The latest version of the software is at https://github.com/tomtung/tifmo.
http://aclweb.org/anthology/.
https://sourceforge.net/p/lemur/wiki/RankLib/.
Normalized Discounted Cumulative Gain is a metric for search results that takes into account the position of relevant items.
Recall that we use the citances to solve Task 1B.
Citances for which all our systems failed to identify the correct reference spans.
Total number of unique words in the document.
Ratio of vocabulary size to the total number of words in the document.
Citances for which none of the retrieved sentences were relevant across all our proposed systems.
The annotator information for the 2017 test set is not yet available.
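
Three of the quantities defined in the notes above (Normalized Discounted Cumulative Gain, vocabulary size, and the vocabulary-to-length ratio) can be sketched in a few lines. The notes do not say which NDCG variant was used, so the linear-gain, log2-discount form below is an assumption, as are the function names and the optional cutoff `k`.

```python
import math


def dcg(relevances):
    # Discounted cumulative gain: items at lower ranks contribute less.
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    # relevances: graded relevance of the retrieved items, in ranked order.
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0


def vocabulary_size(tokens):
    # Total number of unique words in the document.
    return len(set(tokens))


def type_token_ratio(tokens):
    # Ratio of vocabulary size to the total number of words in the document.
    return vocabulary_size(tokens) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    print(ndcg([0, 1, 0]))  # the single relevant sentence was retrieved at rank 2
    print(type_token_ratio("the model ranks the sentences".split()))
```

A ranking that places every relevant item first scores 1.0, and the score drops as relevant items move down the list, which is why the metric is position-aware.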

References

AbuRaed, A., Chiruzzo, L., Saggion, H., Accuosto, P., & Bravo Serrano, À. (2017). Lastus/taln @ clscisumm-17: Cross-document sentence matching and scientific text summarization systems. In Proceedings of the Computational Linguistics Scientific Summarization Shared Task (CL-SciSumm 2017) organized as a part of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017) and co-located with the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), August 11, 2017, Tokyo, Japan (pp. 55–66).
Barrera, A., & Verma, R. (2012). Combining syntax and semantics for automatic extractive single-document summarization. In CICLING (Vol. LNCS 7182, pp. 366–377).
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
Blitzer, J., McDonald, R. T., & Pereira, F. (2006). Domain adaptation with structural correspondence learning. In EMNLP 2006, Proceedings of the 2006 conference on empirical methods in natural language processing, 22–23 July 2006 (pp. 120–128). Sydney.
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324.
Burges, C. J. (2010). From RankNet to LambdaRank to LambdaMART: An overview. Microsoft Research Technical Report MSR-TR-2010-82.
Cao, Z., Li, W., & Wu, D. (June 2016). Polyu at CL-SciSumm 2016. In Proceedings of the joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2016) (pp. 132–138). Newark, NJ.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Dong, Y., Tian, R., & Miyao, Y. (2014). Encoding generalized quantifiers in dependency-based compositional semantics. In PACLIC.
Dubay, W. H. (2004). The principles of readability. Costa Mesa, CA.
Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D., & Radev, D. (2008). Blind men and elephants: What do citation summaries tell us about a research article? Journal of the American Society for Information Science and Technology, 59(1), 51–62. https://doi.org/10.1002/asi.20707.
Felber, T., & Kern, R. (August 2017). Graz university of technology at CL-SciSumm 2017: Query generation strategies. In Proceedings of the 2nd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2017). Tokyo.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. https://doi.org/10.1037/h0031619.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Gambhir, M., & Gupta, V. (2017). Recent automatic text summarization techniques: A survey. Artificial Intelligence Review, 47(1), 1–66.
Hoffman, M. D., Blei, D. M., & Bach, F. R. (2010). Online learning for latent Dirichlet allocation. In Advances in neural information processing systems 23: 24th annual conference on neural information processing systems 2010. Proceedings of a meeting held 6–9 December 2010 (pp. 856–864). Vancouver, BC.
Jaidka, K., Chandrasekaran, M. K., Jain, D., & Kan, M. Y. (2017). Overview of the CL-SciSumm 2017 shared task. In Proceedings of joint workshop on bibliometric-enhanced information retrieval and NLP for digital libraries (BIRNDL 2017). Tokyo: CEUR.
Jones, E., Oliphant, T., & Peterson, P., et al. (2001). SciPy: Open source scientific tools for Python. http://www.scipy.org/.
Karimi, S., Moraes, L., Das, A., & Verma, R. (August 2017). University of Houston@ CL-SciSumm 2017: Positional language models, structural correspondence learning and textual entailment. In Proceedings of the 2nd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2017). Tokyo.
Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count, and flesch reading ease formula) for navy enlisted personnel. Research Branch Report 875, Chief of Naval Technical Training: Naval Air Station Memphis.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
Kusner, M. J., Sun, Y., Kolkin, N. I., Weinberger, K. Q., et al. (2015). From word embeddings to document distances. ICML, 15, 957–966.
Lauscher, A., Glavaš, G., & Eckert, K. (August 2017). University of mannheim@ CL-SciSumm-17: Citation-based summarization of scientific articles using semantic textual similarity. In Proceedings of the 2nd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2017). Tokyo.
Li, L., Zhang, Y., Mao, L., Chi, J., Chen, M., & Huang, Z. (August 2017). Cist@ CL-SciSumm-17: Multiple features based citation linkage, classification and summarization. In Proceedings of the 2nd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2017). Tokyo.
Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. https://www.microsoft.com/en-us/research/publication/rouge-a-package-for-automatic-evaluation-of-summaries/.
Lu, K., Mao, J., Li, G., & Xu, J. (2016). Recognizing reference spans and classifying their discourse facets. In Proceedings of the joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2016).
Lv, Y., & Zhai, C. (2009). Positional language models for information retrieval. In Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval SIGIR 09.
Ma, S., Xu, J., Wang, J., & Zhang, C. (August 2017). Njust@ CL-SciSumm-17. In Proceedings of the 2nd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2017). Tokyo.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50–60. https://doi.org/10.1214/aoms/1177730491.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511809071.
McLaughlin, G. H. (1969). SMOG grading—A new readability formula. Journal of Reading, 12(8), 639–646. http://www.jstor.org/stable/40011226.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR. arXiv:1301.3781.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.
Mohammad, S., Dorr, B., Egan, M., Hassan, A., Muthukrishnan, P., Qazvinian, V., et al. (2009). Using citations to generate surveys of scientific paradigms. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics (pp. 584–592). Association for Computational Linguistics.
Moraes, L., Baki, S., Verma, R., & Lee, D. (2017). Identifying reference spans: Topic modeling and word embeddings help IR. International Journal on Digital Libraries. https://doi.org/10.1007/s00799-017-0220-z.
Moraes, L. F. T., Baki, S., Verma, R. M., & Lee, D. (2016). University of Houston at CL-SciSumm 2016: SVMs with tree kernels and sentence similarity. In Proceedings of the joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL) co-located with the joint conference on digital libraries 2016 (JCDL 2016) (pp. 113–121). Newark, NJ, June 23, 2016. http://ceur-ws.org/Vol-1610/paper13.pdf.
Nakov, P. I., Schwartz, A. S., & Hearst, M. (2004). Citances: Citation sentences for semantic analysis of bioscience text. Proceedings of the SIGIR, 4, 81–88.
Nanba, H., Kando, N., & Okumura, M. (2000). Classification of research papers using citation links and citation types: Towards automatic review article generation. Advances in Classification Research Online, 11(1), 117–134.
Pontius, R., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32, 4407–4429.
Pramanick, A., Mandi, S., Dey, M., & Das, D. (August 2017). Employing word vectors for identifying, classifying and summarizing scientific documents. In Proceedings of the 2nd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2017). Tokyo.
Prasad, A. (August 2017). Wing-nus at CL-SciSumm 2017: Learning from syntactic and semantic similarity for citation contextualization. In Proceedings of the 2nd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2017). Tokyo.
Qazvinian, V., Radev, D. R., Mohammad, S., Dorr, B. J., Zajic, D. M., Whidby, M., et al. (2013). Generating extractive summaries of scientific paradigms. Journal of Artificial Intelligence Research (JAIR), 46, 165–201.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.
Shivam Bansal, C. A. (2017). Textstat 0.4.1. https://pypi.python.org/pypi/textstat.
Tian, R., Miyao, Y., & Matsuzaki, T. (2014). Logical inference on dependency-based compositional semantics. In Proceedings of the 52nd annual meeting of the association for computational linguistics. ACL.
Verma, R. M., & Lee, D. (2017). Extractive summarization: Limits, compression, generalized model and heuristics. Computacion y Sistemas, 21(4), 787–798.
Wessa, P. (2017). Spearman rank correlation (v1.0.3) in free statistics software (v1.2.1) office for research development and education. https://www.wessa.net/rwasp_spearman.wasp/.
Zhang, D., & Li, S. (August 2017). PKU@CL-SciSumm-17: Citation contextualization. In Proceedings of the 2nd joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2017). Tokyo.
Zhao, K., Huang, L., & Ma, M. (2016). Textual entailment with structured attentions and composition. In COLING 2016, 26th international conference on computational linguistics, proceedings of the conference: Technical papers, December 11–16, 2016, Osaka, Japan (pp. 2248–2258). http://aclweb.org/anthology/C/C16/C16-1212.pdf.

Author information

Authors and Affiliations

1. School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Samaneh Karimi & Azadeh Shakery
2. Computer Science Department, University of Houston, Houston, TX, 77204, USA
Samaneh Karimi, Luis Moraes, Avisha Das & Rakesh Verma
3. School of Computer Science, Institute for Research in Fundamental Sciences (IPM), University of Tehran, Tehran, Iran
Azadeh Shakery

Corresponding author

Correspondence to Samaneh Karimi.

Additional information

Research supported in part by NSF Grants DUE 1241772, CNS 1319212, DGE 1433817 and DUE 1356705.

About this article

Cite this article

Karimi, S., Moraes, L., Das, A. et al. Citance-based retrieval and summarization using IR and machine learning. Scientometrics 116, 1331–1366 (2018). https://doi.org/10.1007/s11192-018-2785-8

Received: 04 October 2017
Published: 04 July 2018
Issue Date: August 2018
DOI: https://doi.org/10.1007/s11192-018-2785-8
