Using neural-network based paragraph embeddings for the calculation of within and between document similarities

Scientometrics

Abstract

Science mapping using document networks often comes with the implicit assumption that scientific papers are indivisible units with unique links to neighbouring documents. Research on proximity in co-citation analysis and studies of the lexical properties of sections and citation contexts indicate that this assumption does not always hold. Moreover, the meaning of words and co-words depends on the context in which they appear. This study proposes the use of a neural network architecture for word and paragraph embeddings (Doc2Vec) to measure similarity among these smaller units of analysis. It is shown that paragraphs in the “Introduction” and “Discussion” sections are more similar to the abstract, and that the similarity among paragraphs is related, though not linearly, to the distance between them. The “Methodology” section is the least similar to the other sections. Abstracts of citing-cited document pairs are more similar than random pairs, and the context in which a reference appears is most similar to the abstract of the cited document. This novel approach with higher granularity can be used for bibliometric-aided retrieval and to assist in measuring interdisciplinarity through the application of network-based centrality measures.
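
The kind of comparison described in the abstract can be illustrated with a minimal sketch. It assumes that each paragraph and the abstract have already been mapped to fixed-length vectors (for instance by a trained Doc2Vec model) and that similarity is measured as cosine similarity, which the abstract does not state explicitly. The function names and the three-dimensional vectors are hypothetical placeholders chosen for illustration; they are not data or results from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_similarity_by_section(abstract_vec, paragraphs):
    """Average the paragraph-to-abstract similarity within each section.

    `paragraphs` is a list of (section_name, vector) pairs.
    """
    sims = {}
    for section, vec in paragraphs:
        sims.setdefault(section, []).append(cosine_similarity(abstract_vec, vec))
    return {section: sum(vals) / len(vals) for section, vals in sims.items()}


# Arbitrary placeholder vectors (assumption): real embeddings would come
# from a trained model and have many more dimensions.
abstract_vec = [1.0, 0.0, 0.0]
paragraphs = [
    ("Introduction", [1.0, 0.0, 0.0]),
    ("Methodology", [0.0, 1.0, 0.0]),
    ("Methodology", [0.0, 0.0, 1.0]),
    ("Discussion", [1.0, 1.0, 0.0]),
]

for section, score in mean_similarity_by_section(abstract_vec, paragraphs).items():
    print(f"{section}: {score:.3f}")
```

The same `cosine_similarity` helper could be applied to pairs of abstracts (citing versus cited documents) or to a citation context and the abstract of the cited document, since all of these are comparisons between two vectors of equal length.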

Fig. 1 Harris et al. (2006). In-text references are marked in bold and underlined

References

  • Abramo, G., D'Angelo, C. A., & Costa, F. D. (2012). Identifying interdisciplinarity through the disciplinary classification of coauthors of scientific publications. Journal of the Association for Information Science & Technology, 63(11), 2206–2222.

  • Ahlgren, P., & Colliander, C. (2009). Document-document similarity approaches and science mapping: Experimental comparison of five approaches. Journal of Informetrics, 3(1), 49–63.

  • Bertin, M., Atanassova, I., Larivière, V., & Gingras, Y. (2013). The distribution of references in scientific papers: An analysis of the IMRaD structure. In: Proceedings of the 14th International Conference of the International Society for Scientometrics and Informetrics. Vienna, Austria, (pp. 591–603).

  • Bertin, M., Atanassova, I., Sugimoto, C. R., & Larivière, V. (2016). The linguistic patterns and rhetorical structure of citation context: an approach using n-grams. Scientometrics, 109, 1417–1434.

  • Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.

  • Boyack, K. W. (2017). Investigating the effect of global data on topic detection. Scientometrics, 111(2), 999–1015.

  • Chen, D., & Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. In: Proceedings of EMNLP 2014. Doha, Qatar.

  • Gal, D., Thijs, B., Sipido, K., & Glänzel, W. (2017). Topic modelling based network maps in cardiovascular research. In: Proceedings of the 16th International Conference of the International Society for Scientometrics and Informetrics. Wuhan, China, (pp. 591–603).

  • Gipp, B., & Beel, J. (2007). Citation Proximity Analysis (CPA) – A new approach for identifying related work based on co-citation analysis. In: Proceedings of the 12th International Conference of the International Society for Scientometrics and Informetrics. Rio de Janeiro, Brazil, (pp. 571–575).

  • Glänzel, W., & Thijs, B. (2017). Using hybrid methods and 'core documents' for the representation of clusters and topics: the astronomy dataset. Scientometrics, 111(2), 1071–1087.

  • Harris, J. A., Arabzadeh, E., Fairhall, A. L., Benito, C., & Diamond, M. E. (2006). Factors affecting frequency discrimination of vibrotactile stimuli: implications for cortical encoding. PLoS ONE, 1(1), e100.

  • Kiss, J. Z., Aanes, G., Schiefloe, M., Coelho, L. H. F., Millar, K. D. L., & Edelmann, R. E. (2014). Changes in operational procedures to improve spaceflight experiments in plant biology in the European Modular Cultivation System. Advances in Space Research, 53(5), 818–827.

  • Leydesdorff, L., & Hellsten, I. (2006). Measuring the meaning of words in contexts: An automated analysis of controversies about 'Monarch butterflies', 'Frankenfoods', and 'stem cells'. Scientometrics, 67(2), 231–258.

  • Leydesdorff, L., & Rafols, I. (2011). Indicators of the interdisciplinarity of journals: diversity, centrality, and citations. Journal of Informetrics, 5(1), 87–100.

  • Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. (available at: https://nlp.stanford.edu/pubs/glove.pdf)

  • Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning, ICML. Beijing, China, (pp. 1188–1196).

  • Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In: Proceedings LREC Workshop on New Challenges for NLP Frameworks.

  • Small, H. (1994). A SCI-map case-study—building a map of AIDS research. Scientometrics, 30(1), 229–241.

  • Thijs, B. (2017). Drakkar: A graph based all-nearest neighbour search algorithm for bibliographic coupling. CEUR Workshop Proceedings, 1823, art. no. 10.

  • Thijs, B., Glänzel, W., & Meyer, M. S. (2017). Improved lexical similarities for hybrid clustering through the use of noun phrases extraction. MSI working paper series. University of Leuven, Leuven, Belgium.

  • Thijs, B., & Glänzel, W. (2018). The contribution of the lexical component in hybrid clustering, the case of four decades of "Scientometrics". Scientometrics, 115(1), 21–33.

  • Wang, J., Thijs, B., & Glänzel, W. (2015). Interdisciplinarity and impact: Distinct effects of variety, balance, and disparity. PLoS ONE, 10(5), e0127298.

  • Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. Scientometrics, 111(2), 1017–1031.

  • Zhang, L., Rousseau, R., & Glänzel, W. (2016). Diversity of references as an indicator of the interdisciplinarity of journals: Taking similarity between subject fields into account. Journal of the Association for Information Science and Technology, 67(5), 1257–1265.

Acknowledgements

This paper is an extended version of an article presented at the 17th International Conference on Scientometrics and Informetrics, held on 2–5 September 2019 at Sapienza University in Rome, Italy. The research was conducted during the author's stay as a visiting researcher at the Université Grenoble-Alpes (France). The author is now affiliated full-time with KU Leuven, Belgium.

Corresponding author

Correspondence to Bart Thijs.

Cite this article

Thijs, B. Using neural-network based paragraph embeddings for the calculation of within and between document similarities. Scientometrics 125, 835–849 (2020). https://doi.org/10.1007/s11192-020-03583-6
