Scholarly literature mining with information retrieval and natural language processing: Preface

Cabanac, Guillaume; Frommholz, Ingo; Mayr, Philipp

doi:10.1007/s11192-020-03763-4

Published: 17 November 2020

Scientometrics, volume 125, pages 2835–2840 (2020)

Introduction

This special issue features the work of authors from different communities: bibliometrics/scientometrics (SCIM), information retrieval (IR) and, as an emerging player of growing relevance to both of these fields, natural language processing (NLP). The work presented in their papers combines ideas from all three fields: the papers share the use of scholarly data well known in scientometrics and address problems typical of scientometric research. They model and mine citations as well as the metadata of bibliographic records (authorships, titles, and sometimes abstracts), which is common practice in SCIM. They also mine and process fulltexts (including in-text references and equations), which is common practice in IR and requires established NLP text mining techniques. IR collections are utilised to ensure reproducible evaluations; creating and sharing test collections in evaluation initiatives such as CLEF eHealth^{Footnote 1} is a common IR tradition that is also prominent in NLP, e.g., in the CL-SciSumm shared task.^{Footnote 2}

From an IR perspective, it is surprising that scholarly information retrieval and recommendation, though gaining momentum, have not always been a focus of research. Besides offering a rich set of data for researchers in all three disciplines to work with, scholarly search poses particular challenges for IR: its complex information needs require different approaches from those known from, e.g., Web search, where information needs are in many cases simpler. As an example, the current COVID-19 crisis shows that hybrid SCIM/IR/NLP approaches are increasingly required to ensure that researchers quickly get access to relevant, high-quality information, which is often available only on preprint servers (Brainard 2020; Fraser et al. 2020; Kwon 2020; Palayew et al. 2020). The challenges posed by such complex information needs have been recognised by the Information Retrieval community, which quickly launched the TREC-COVID initiative run by NIST (Roberts et al. 2020), demonstrating the timeliness of our endeavour and of this special issue. Working on scholarly material thus offers incentives for researchers in Information Retrieval, but we believe the challenges can only be tackled effectively by all three communities together. The NLP community has initiated a similar activity with a dedicated workshop series, the NLP COVID-19 Workshop,^{Footnote 3} which runs at major NLP conferences (ACL and EMNLP) in 2020.

With the surge of “scholarly big data” (Giles 2013), Bibliometrics and Information Retrieval in combination with NLP methods have seen a recent renaissance that resulted in a series of special issues:

“Combining Bibliometrics and Information Retrieval” (Mayr and Scharnhorst 2015) in Scientometrics (2015).
“Bibliometric-enhanced Information Retrieval” (Cabanac et al. 2018) in Scientometrics (2018).
“Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries” (Mayr et al. 2018) in International Journal on Digital Libraries (2018).
“Mining Scientific Papers: NLP-enhanced Bibliometrics” (Atanassova et al. 2019) in Frontiers in Research Metrics and Analytics (2019).

Special issue papers

This special issue on “Scholarly literature mining with Information Retrieval and Natural Language Processing” presents works at the intersection of Bibliometrics and Information Retrieval that utilise Natural Language Processing (NLP). The special issue was announced via an open call for papers (CFP).^{Footnote 4} In response to the CFP, we received 24 submissions, each of which was reviewed by two to three reviewers (for papers spanning several domains, e.g., IR and NLP, we selected reviewers from both domains). Eventually, the guest editors accepted 14 papers. Nine papers were rejected and one paper was withdrawn by the authors during the reviewing rounds.

In the following we provide an overview of the 14 papers, organised into three clusters. Table 1 gives the ordering of the papers in the special issue. To give a lightweight overview of the variety of the papers, we identified for each contribution its research Task and Area of Application as well as the Corpus, Objects, and Methods used.

The papers in this special issue appear in the following sequence. We decided to start with a set of more classical papers featuring scientometric methods like network analysis and bibliographic data from the Web of Science, Scopus or similar resources. The second set of papers is more IR oriented: papers mine fulltexts and they use techniques like embeddings and neural networks. The third cluster of papers contains NLP-oriented papers which are, for instance, specialised in summarisation and utilise scholarly documents.

Cluster 1. SCIM with IR and NLP

Lietz: Drawing impossible boundaries: field delineation of Social Network Science.
Schneider et al.: Continued post-retraction citation of a fraudulent clinical trial report, eleven years after it was retracted for falsifying data.
Kreutz et al.: Evaluating semantometrics from computer science publications.
Haunschild & Marx: Discovering seminal works with marker papers.
Lamirel et al.: An overview of the history of Science of Science in China based on the use of bibliographic and citation data: a new method of analysis based on clustering with feature maximization and contrast graphs.

Cluster 2. IR and Text-mining of scholarly literature

Nogueira et al.: Navigation-based candidate expansion and pretrained language models for citation recommendation.
Greiner-Petter et al.: Math-word embedding in math search and semantic extraction.
Carvallo et al.: Automatic document screening of medical literature using word and text embeddings in an active learning setting.
Saier & Färber: unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata.

Cluster 3. NLP-oriented papers on scholarly literature

Zerva et al.: Cited text span identification for scientific summarisation using pre-trained encoders.
La Quatra et al.: Exploiting pivot words to classify and summarize discourse facets of scientific papers.
AbuRa’ed et al.: Automatic related work section generation: experiments in scientific document abstracting.
Jimenez et al.: Automatic prediction of citability of scientific articles by stylometry of their titles and abstracts.
Portenoy & West: Constructing and evaluating automated literature review systems.

Table 1 Overview of the articles in this special issue

We hope the selection of papers in this special issue will be interesting and enjoyable for researchers from all relevant fields and will provide a starting point for future explorations in the field.^{Footnote 5}

Notes

1. https://clefehealth.imag.fr.
2. https://github.com/WING-NUS/scisumm-corpus.
3. https://www.nlpcovid19workshop.org/acl2020/.
4. https://sites.google.com/view/scientometrics-si2019-bir.
5. Since 2016 we have maintained the “Bibliometric-enhanced-IR Bibliography” (https://github.com/PhilippMayr/Bibliometric-enhanced-IR_Bibliography/), which collects scientific papers that have appeared in collaboration with the BIR/BIRNDL organizers.

References

Atanassova, I., Bertin, M., & Mayr, P. (2019). Editorial: mining scientific papers: NLP-enhanced bibliometrics. Frontiers in Research Metrics and Analytics. https://doi.org/10.3389/frma.2019.00002.
Brainard, J. (2020). New tools aim to tame pandemic paper tsunami. Science, 368(6494), 924–925. https://doi.org/10.1126/science.368.6494.924.
Cabanac, G., Frommholz, I., & Mayr, P. (2018). Bibliometric-enhanced information retrieval: Preface. Scientometrics, 116(2), 1225–1227. https://doi.org/10.1007/s11192-018-2861-0.
Cabanac, G., Frommholz, I., & Mayr, P. (2020). Bibliometric-Enhanced Information Retrieval 10th Anniversary Workshop Edition. In J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, & F. Martins (Eds.), Advances in Information Retrieval, LNCS, (Vol. 12036, pp. 641–647). Berlin: Springer International Publishing. https://doi.org/10.1007/978-3-030-45442-5_85.
Fraser, N., Brierley, L., Dey, G., Polka, J.K., Pálfy, M., Nanni, F., & Coates, J.A. (2020). Preprinting the COVID-19 pandemic. bioRxiv. https://doi.org/10.1101/2020.05.22.111294
Giles, C.L. (2013). Scholarly big data. In CIKM’13: Proceedings of the 22nd ACM international conference on information and knowledge management, p. 1. ACM, New York, NY. https://doi.org/10.1145/2505515.2527109
Kwon, D. (2020). How swamped preprint servers are blocking bad coronavirus research. Nature, 581(7807), 130–131. https://doi.org/10.1038/d41586-020-01394-6.
Mayr, P., Frommholz, I., Cabanac, G., Chandrasekaran, M. K., Jaidka, K., Kan, M. Y., et al. (2018). Introduction to the special issue on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL). International Journal on Digital Libraries, 19(2–3), 107–111. https://doi.org/10.1007/s00799-017-0230-x.
Mayr, P., & Scharnhorst, A. (2015). Combining bibliometrics and information retrieval: Preface. Scientometrics, 102(3), 2191–2192. https://doi.org/10.1007/s11192-015-1529-2.
Palayew, A., Norgaard, O., Safreed-Harmon, K., Andersen, T. H., Rasmussen, L. N., & Lazarus, J. V. (2020). Pandemic publishing poses a new COVID-19 challenge [Comment]. Nature Human Behaviour, 4(7), 666–669. https://doi.org/10.1038/s41562-020-0911-0.
Roberts, K., Alam, T., Bedrick, S., Demner-Fushman, D., Lo, K., Soboroff, I., et al. (2020). TREC-COVID: Rationale and structure of an information retrieval shared task for COVID-19. Journal of the American Medical Informatics Association, 27(9), 1431–1436. https://doi.org/10.1093/jamia/ocaa091.

Acknowledgements

We wish to thank all contributors to this special issue: the researchers who submitted papers, the many reviewers who generously offered their time and expertise, and the participants of the BIR and BIRNDL workshops (Cabanac et al. 2020).

Author information

Authors and Affiliations

Computer Science Department, IRIT UMR 5505 CNRS, University of Toulouse, 118 Route de Narbonne, 31062, Toulouse Cedex 9, France
Guillaume Cabanac
University of Bedfordshire, Luton, LU1 3JU, UK
Ingo Frommholz
GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
Philipp Mayr

Corresponding author

Correspondence to Guillaume Cabanac.

About this article

Cite this article

Cabanac, G., Frommholz, I. & Mayr, P. Scholarly literature mining with information retrieval and natural language processing: Preface. Scientometrics 125, 2835–2840 (2020). https://doi.org/10.1007/s11192-020-03763-4

Received: 09 October 2020
Published: 17 November 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11192-020-03763-4
