Using neural-network based paragraph embeddings for the calculation of within and between document similarities

Thijs, Bart

doi:10.1007/s11192-020-03583-6

Published: 13 July 2020

Volume 125, pages 835–849, (2020)

Scientometrics

Bart Thijs (ORCID: orcid.org/0000-0003-0446-8332)

Abstract

Science mapping with document networks often comes with the implicit assumption that scientific papers are indivisible units with unique links to neighbouring documents. Research on proximity in co-citation analysis and studies of the lexical properties of sections and citation contexts indicate that this assumption does not always hold. Moreover, the meaning of words and co-words depends on the context in which they appear. This study proposes using a neural network architecture for word and paragraph embeddings (Doc2Vec) to measure similarity among these smaller units of analysis. It is shown that paragraphs in the “Introduction” and “Discussion” sections are more similar to the abstract, and that the similarity among paragraphs is related, though not linearly, to the distance between them. The “Methodology” section is the least similar to the other sections. Abstracts of citing and cited documents are more similar than random pairs, and the context in which a reference appears is most similar to the abstract of the cited document. This novel approach, with its higher granularity, can be used for bibliometric-aided retrieval and to assist in measuring interdisciplinarity through the application of network-based centrality measures.
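As a rough illustration of the kind of comparison the abstract describes, the sketch below scores paragraph vectors against an abstract vector. It assumes cosine similarity as the measure, which the abstract does not name, and it uses made-up four-dimensional vectors in place of real Doc2Vec output. The function name, the labels and the numbers are all hypothetical and are not results from the article.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only.
# Real paragraph vectors would come from a trained embedding model.
abstract_vec = [0.9, 0.1, 0.3, 0.0]
paragraph_vecs = {
    "introduction_p1": [0.8, 0.2, 0.4, 0.1],
    "methodology_p1": [0.1, 0.9, 0.0, 0.5],
}

# Similarity of each paragraph to the abstract.
similarities = {
    name: cosine_similarity(abstract_vec, vec)
    for name, vec in paragraph_vecs.items()
}

# Print paragraphs from most to least similar to the abstract.
for name, score in sorted(similarities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

With real embeddings, the same loop could be run over every paragraph of a paper, or over pairs of abstracts from citing and cited documents.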

Acknowledgements

This paper is an extended version of an article presented at the 17th International Conference on Scientometrics and Informetrics, held between September 2nd and 5th, 2019, at the Sapienza University in Rome, Italy. The research was conducted by the author during his stay at the Université Grenoble-Alpes (France) as a visiting researcher. The author is now affiliated full-time with KU Leuven, Belgium.

Author information

Authors and Affiliations

KU Leuven, ECOOM, FEB, Naamsestraat 61, 3000, Leuven, Belgium
Bart Thijs
Univ Grenoble Alpes, LIG, SIGMA, 38000, Grenoble, France
Bart Thijs

Corresponding author

Correspondence to Bart Thijs.

About this article

Cite this article

Thijs, B. Using neural-network based paragraph embeddings for the calculation of within and between document similarities. Scientometrics 125, 835–849 (2020). https://doi.org/10.1007/s11192-020-03583-6

Received: 13 January 2020
Published: 13 July 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s11192-020-03583-6
