Semantic and relational spaces in science of science: deep learning models for article vectorisation

Over the last century, we have observed a steady and exponential growth of scientific publications globally. The overwhelming amount of available literature makes a holistic analysis of the research within a field and between fields based on manual inspection impossible. Automatic techniques to support the process of literature review are required to find the epistemic and social patterns that are embedded in scientific publications. In computer science, new tools have been developed to deal with large volumes of data. In particular, deep learning techniques open the possibility of automated end-to-end models that project observations to a new, low-dimensional space where the most relevant information of each observation is highlighted. Using deep learning to build new representations of scientific publications is a growing but still emerging field of research. The aim of this paper is to discuss the potential and limits of deep learning for gathering insights about scientific research articles. We focus on document-level embeddings based on the semantic and relational aspects of articles, using Natural Language Processing (NLP) and Graph Neural Networks (GNNs), and explore the different outcomes generated by those techniques. Our results show that with NLP we can encode a semantic space of articles, while GNNs enable us to build a relational space where the social practices of a research community are also encoded.


Introduction
The relation between two research articles is a multidimensional phenomenon. Research articles can be related because of their topics, authors, or the organisational affiliation of the authors. This implies that there is no unique measurement to compare the relatedness of scientific publications.
When a human compares a pair of articles, he or she can recognise their relatedness in its complexity, given his or her own biases and expertise in the field. The relatedness of a pair of articles is not only interesting for a pairwise comparison, but also for building a holistic representation of a field of research or a discipline. To handle the massively increasing volume and production rate of scientific research articles, we discuss automated ways to relate research articles to each other.
One of the most important dimensions of relatedness is the semantic content of articles. Using text as data, Natural Language Processing (hereafter NLP) studies the ways in which textual meaning can be extracted from the documents within a text corpus, in a summarised way (Jurafsky et al. 2008). This operationalises the concept of semantic relatedness (Mikolov, Yih, et al. 2013). One of those techniques is Topic Modelling, which uses the co-occurrences of words in documents to detect the distribution of topics in a given corpus (Blei, Ng, et al. 2003). Daenekindt et al. 2020 analysed the field of higher education using correlated topic modelling (Blei and Lafferty 2007), which allowed the authors to compare 17,000 abstracts published in journals in the field of higher education research. Schwemmer et al. 2020 studied the methodological divergence in sociology between quantitative and qualitative analysis in more than 8,700 research articles in top journals of the field, using the wordfish model (Slapin et al. 2008). Another study analysed more than 20 million article titles to investigate the expansion of cognitive boundaries in physics, astronomy, and biomedicine. Findings include that the number of publications grows exponentially, but that the space of ideas expands linearly (Milojević 2015).
Another important aspect of the relatedness of articles is the overarching network structure of science (Boyack et al. 2010). A network of articles can be made explicit by their references to previous work. Considering direct citations, two articles are linked if one of them contains a reference to the other. This network has a strong temporal dependency, as nodes can only have outgoing links to older articles. The links can also be based on co-citations (Kessler 1963; Small 1973). The network structure has the property of defining the distance between two articles by the length of the path between them, i.e., the number of articles needed to get from one to the other, only following the reference lists of the corresponding article at each step. Intertwined with the articles' network, a collaboration network of authors can be created (Moody 2004).
New deep learning models promise to revolutionise the way we approach both text and network data. Nevertheless, there is an ongoing debate in the Artificial Intelligence community on the bias introduced by algorithms, e.g. Bolukbasi et al. 2016; Buolamwini et al. 2018; Caliskan et al. 2017; Whittaker et al. 2018. This implies that, when we introduce new methodologies, and especially black-box models like those in deep learning, it is important to study the implicit biases they carry.
Together with other deep learning techniques, embeddings are an important breakthrough in the field of machine learning. An embedding is a low-dimensional dense vector, i.e., a real-valued representation that encodes the relevant information from the original, high-dimensional data (Mikolov, Chen, et al. 2013). Nevertheless, not much research has been carried out to bring these techniques closer to the field of Science of Science, except for the following examples: A recent article used Word2Vec (Mikolov, Chen, et al. 2013) to distinguish the most relevant terms in quantitative and qualitative research in the field of Science of Science (Kang et al. 2020). Paper2Vector (Zhang, Zhao, et al. 2019) is a model that trains word-word, document-word, and document-document relations based on the Skip-Gram model (Mikolov, Chen, et al. 2013). This model, although it uses the textual content as well as the citation network, cannot use both at the same time, nor can it use other metadata features. Another approach used Graph Neural Network (hereafter GNN) models, a family of deep learning models that leverage the network structure of the data, together with the pre-trained BERT model for the textual embedding, which constitutes the state of the art in NLP; however, it does not consider the possibility of using the textual embedding as input for the GNN (Jeong et al. 2020).
We have selected Science of Science as a case study to explore these new methodological approaches because it is a highly complex and multidisciplinary field of research (Fortunato et al. 2018) that aims to explore the driving forces of science and to develop new methods to better understand its evolution over time. The emergence of the field has sped up with the availability of large-scale data sets on science production and the disruption of disciplinary boundaries, which encouraged scientists from different disciplines to closely collaborate. This makes it an ideal playground to study different types of embeddings as an innovative tool for methodological development in Science of Science, for an in-depth analysis of scientific research articles. We identified this as an important research gap in this developing field of research, and will discuss potential uses and limitations of these new methods. The article focuses on document-level embeddings based on the semantic and relational aspects of articles, using NLP and GNNs to explore semantic and relational spaces in Science of Science. Four different aspects will be analysed: (a) collaboration patterns (Sooryamoorthy 2009); (b) the cumulative effect on citations, i.e., the Matthew effect (de Solla Price 1963; Garfield 1972; Garfield and Merton 1979); (c) the position of countries in global science production (King 2004); and (d) the quantitative-qualitative divide (Kang et al. 2020).
Our main hypothesis is that, while textual embeddings build a representation of the semantic space, GNN embeddings focus on the relational and structural space of a network of research articles.Therefore, textual embeddings help to identify similar content, whereas GNN embeddings are useful to study the embedded social relations in the production of scientific knowledge.An in-depth investigation of this hypothesis is an important step for Science of Science as a field of research, allowing us to pair powerful analysis techniques from computer science with a thematic disciplinary foundation.
The main objective of this methodological contribution is to present a first approximation to the use of embeddings in the field of Science of Science. We propose two families of models that use different types of inputs and generate different types of insights: First, we build article embeddings based on their textual characteristics, including titles, keywords, and abstracts. For this family of models, we use three different techniques: Topic Modelling (Blei, Ng, et al. 2003), Doc2Vec (Mikolov, Sutskever, et al. 2013), and BERT (Devlin et al. 2019). Doc2Vec was selected because it is specifically designed for document-level embeddings. BERT achieves the state of the art for various NLP tasks (Tenney et al. 2019). As a non-deep learning benchmark, we selected the Topic Modelling approach, which is a widely used framework. Second, we build GNN models that include text, metadata and the citation network of selected articles. The GNN models are trained on the link prediction task, i.e., predicting whether two articles are linked by a citation, and therefore focus on the network properties. This study does not develop new methodologies for text or network embeddings, but intends to act as a bridge between the new developments made in the field of Deep Learning and studies in Science of Science. In our case study, we apply the different models to a data set of 22,151 articles from Science of Science, involving different fields, ranging from history and philosophy of science to library and information sciences.
The following two research questions are guiding our analysis: How can we encode the relational dimension of articles, and what are its properties? How can we encode the semantic dimension of articles, and what are its properties?
Structure of the paper. First, we will present the data set characteristics, while in Section 3 we provide an overview of embedding techniques in both text and networks, the experimental setup, and the performance metrics that are used to evaluate the models. In Section 4 we present our results. We finish the paper with a conclusion and final remarks for future research in Section 5.

Data Set and Network Statistics
The data set was built on a "journal-based" approach. First, we defined a set of core journals in the field of Science of Science. This selection is based on a recommendation of journals by the "International Society for Scientometrics and Informetrics" (ISSI)1, and has been expanded to include related journals from the social sciences, and history and philosophy of science, to show the wide variability and multidisciplinarity of this field of research. Our main goal was to include all journals that focus exclusively on Science of Science, independently of their disciplinary approach. Second, from this selected set of journals, we included all articles that are available in Elsevier's Scopus journal database. Methodologically, we used the Scopus API2 to extract the data. By including all articles from these journals, we avoided a potential selection bias towards keywords, and ensured a comprehensive investigation of the field of Science of Science. Given that this study focuses on the methodological aspects of the use of embeddings, we decided to work with this non-exhaustive corpus of research articles, which gives us the advantage of investigating the distribution of embeddings at the journal level. As we carry out a country-level analysis in Section 4.3, it is important to mention the limitations of the Scopus database. These include a bias towards English-speaking and Western-oriented journals, and a lack of coverage of the social sciences. We only include peer-reviewed articles and do not investigate other publication formats (e.g., monographs, contributions to edited volumes, conference proceedings, etc.).
Table 1 displays information on the journals in our sample, including the number of articles retrieved, the mean and maximum number of citations by journal, and the years of the first and last publication. We have used the "Scopus Subject Areas and All Science Journal Classification Codes" (ASJC) for a first differentiation of the journals by discipline3. Having investigated the repetition of these areas across the different journals, and additionally based on the results of Topic Modelling (see Section 4.1), we assigned each journal to one of four fields of study: Management; Library and Information Sciences; History and Philosophy of Science; and Other Social Sciences, composed of Education, Communication and Anthropology. We consider the repetition of areas and acknowledge that some journals are more multidisciplinary in nature than others, which could lead to an assignment to another field. For example, Social Studies of Science has as subject areas History, Social Sciences (all) and History and Philosophy of Science, so it could be assigned to both Other Social Sciences and History and Philosophy of Science. This implies that the defined fields cannot perfectly match and characterise each journal. In addition, fields are not equidistant from each other. For example, the relation between Management and Library and Information Sciences is closer than the relation between those and Philosophy. Methodologically, these fields are not used as features in the subsequent models, so they do not introduce biases to the models; they are a helpful tool for the analysis of results, as they allow us to visually study the projection of the embeddings in Sections 4.2 and 4.2, and to study the journals' division in epistemic practices in Section 4.3. In Section 4.1, we show that we can partially infer these fields from the Topic Modelling results.
The distribution of the number of research articles and citations per journal is skewed (see also Bornmann et al. 2008), with a tendency for History and Philosophy of Science to have fewer citations per article than the rest of the fields, which corresponds to different citation and publication behaviours across disciplines (Lillquist et al. 2010). Overall, the data set includes 22,151 articles, with an average of 20.7 citations per article.
From the collection of 22,151 articles, we retrieved the references to build the citation network. 75% of the articles either cited or were cited by another article in our sample. This subset builds the basis for the citation network4. Figure 1 presents a summary of the characteristics of the resulting network and the log-log degree distribution of nodes with the fitted power-law distribution. The network has 16,578 nodes (articles) and 68,797 links (citations). The average degree of the network is 8.3, compared to 20.7 connections when using the entire Scopus database. While the network is not connected, the giant component includes most of the edges and vertices. We built 100 replications of a random Erdős–Rényi network (Erdős et al. 1960) with the same number of vertices and edges, and computed the average clustering coefficient as well as the average mean path length for comparison. The ratio of our network's clustering coefficient to that of the random Erdős–Rényi networks is 162, while the ratio between our network's mean path length and that of the random Erdős–Rényi networks is 1.27. This means that our network has the properties of small-world networks (Davis et al. 2003; Iyer et al. 2006).
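The small-world comparison described above can be sketched as follows. This is an illustrative reconstruction using networkx, not the study's actual code; the function name, default number of replications, and seeding are our own choices.

```python
import networkx as nx

def small_world_ratios(g, n_random=20, seed=0):
    """Compare clustering coefficient and mean path length of g against
    Erdos-Renyi random graphs with the same number of nodes and edges."""
    n, m = g.number_of_nodes(), g.number_of_edges()
    # mean path length is only defined on a connected graph, so use the giant component
    giant = g.subgraph(max(nx.connected_components(g), key=len))
    cc = nx.average_clustering(g)
    pl = nx.average_shortest_path_length(giant)
    cc_rand, pl_rand = [], []
    for i in range(n_random):
        r = nx.gnm_random_graph(n, m, seed=seed + i)
        r_giant = r.subgraph(max(nx.connected_components(r), key=len))
        cc_rand.append(nx.average_clustering(r))
        pl_rand.append(nx.average_shortest_path_length(r_giant))
    return cc / (sum(cc_rand) / n_random), pl / (sum(pl_rand) / n_random)
```

A clustering ratio far above 1 with a path-length ratio close to 1, as reported above (162 and 1.27), is the small-world signature.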

Methods
This section introduces different approaches for dense vector representations of documents, called embeddings, which will be used for the analysis of scientific research articles in Science of Science. Then, the implementation details and performance metrics will be presented.

Classic Machine Learning Approaches
First, we focus on classic machine learning approaches for encoding research articles based on feature engineering, to highlight the differences between those and deep learning-based approaches. Feature engineering refers to the way in which measurable attributes of an observation can be encoded. In a traditional data set, each observation is defined as a vector, and the full data set is therefore a composition of these vectors, building a matrix with observations as rows and features as columns (Broman et al. 2018). If the data consists of a network, besides the feature matrix, another matrix describing the network structure has to be implemented, the so-called adjacency matrix A (Barabási 2016). An article's vector representation can be any metric way to summarise its measurable characteristics. Such a vector can describe its metadata features, for example, the year of publication, the number of citations at a moment in time, the number of authors, or a label assigned to the organisation or journal. It can also include descriptors of the network in which the article is embedded: degree, betweenness, or other centrality measures.
The classic treatment of the textual content of an article, for example the title, abstract, keywords, or full text, is based on the bag of words: a Document-Term Matrix (DTM) where each document is represented as a vector indicating which of the words in the vocabulary are present in a given document (Jurafsky et al. 2008). In the DTM, the i-th row represents the i-th document and the j-th column represents the j-th word in the vocabulary of the corpus. The value x_{i,j} can either indicate whether the j-th word is present in the i-th document as a binary value, the number of times it appears, or a normalised count, like Term Frequency-Inverse Document Frequency (TF-IDF) (Jurafsky et al. 2008).
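The three weighting schemes above can be sketched with a minimal hand-rolled DTM builder. This is an illustration only (the helper name, sparse dict-per-row layout, and unsmoothed IDF variant are our own choices, not the study's implementation):

```python
import math
from collections import Counter

def build_dtm(docs, scheme="count"):
    """Build a Document-Term Matrix as a list of sparse rows (dicts).
    scheme: 'binary', 'count', or 'tfidf', matching the three weightings
    described in the text."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenised for w in toks})
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(w for toks in tokenised for w in set(toks))
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = {}
        for w, c in counts.items():
            if scheme == "binary":
                row[w] = 1
            elif scheme == "count":
                row[w] = c
            else:  # tf-idf: term frequency times inverse document frequency
                row[w] = (c / len(toks)) * math.log(n_docs / df[w])
        rows.append(row)
    return vocab, rows
```

Note that a term appearing in every document gets a TF-IDF weight of zero, which is exactly the down-weighting of uninformative words that motivates the scheme.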
As any n-dimensional vector, an article constitutes a point in an n-dimensional space. Each of these n dimensions keeps its independent meaning, which is useful for interpreting the position of different documents in the space. The representation might be affected by problems of high dimensionality and sparsity. This is especially true for encoding text, as the vocabulary size can rise to tens of thousands of words, with a high probability that most words will not appear in most documents. For research articles, this might also be true for features like authors, organisational affiliations, or journals, if each value is encoded as a dummy variable, i.e., as many variables that take only the value 0 or 1 to indicate the absence or presence of each category. In highly dimensional spaces, the notions of distance become blurred and it is hard to generate new insights from the data (Bellman 1966). Therefore, when studying the relations between articles, the analysis tends to focus on a restricted subset of the multiplicity of dimensions that exist. Compared to other methodological approaches, deep learning models can take a multiplicity of dimensions of analysis into account. Applying deep learning methods contributes to closing this research gap, as these models are able to use multiple features and select those that are most relevant based on the optimisation problem.
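The dummy-variable encoding mentioned above can be sketched as follows (the helper name is ours). With thousands of distinct authors or affiliations, each resulting row is almost entirely zeros, which is the sparsity problem described in the text:

```python
def one_hot(values):
    """Encode a categorical feature (e.g. a journal or author ID) as dummy
    variables: one 0/1 column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]
    return categories, rows
```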
Once encoded, descriptive statistics can be used to investigate the relations between observations and features. When the number of dimensions to be considered increases, descriptive statistics can no longer deal with the studied phenomenon, and some type of model becomes necessary to assess the importance of different features. Modelling relations is an entrenched way to reduce the complexity of the problem by selecting the dimensions in focus. One possibility is to use models that measure the relation between two aspects of the phenomena, for example, linear models. These types of models demand a lot of time and expert knowledge, as they require a careful design and hand-engineered features to perform well. New developments in machine learning, like deep learning models, can learn from the data to encode the most relevant information, allowing researchers to focus on the analysis of the results.

Deep learning models
Deep learning models are designed in an end-to-end way. Their goal is to minimise the feature engineering steps, and to let the model itself define which the most important latent features are. These models have greater flexibility in terms of the inputs they can receive and the outputs they generate (Goodfellow et al. 2016). In this article, we focus on those models where the output is a subspace projection in which observations are represented. This encoding is defined by the models to optimise an associated task that changes between models. If the model works properly, it will encode those aspects of the original data that are most relevant for solving that associated task. Using this new, data-driven way of encoding information, it is possible to generate new insights comparing aggregated levels of representation like countries or journals. For this type of modelling, we consider two sub-types: First, textual embeddings that use textual features as their input. From these, we will infer the semantic space of articles. We aim to encode the conceptual sense of each article as a low-dimensional vector. Second, we will train models that use both text and the citation network as input, from which we will infer the relational space of research articles. In the relational space, we aim to encode citation practices as a social phenomenon into a low-dimensional vector.
Here we consider the semantic and relational spaces as the mathematical objects that represent those properties of articles.The embeddings are the deep learning implementations that we train over our data set to approximate those mathematical objects.

The Semantic Space of a Research Article
In this section, we explain the three models that we have used to build the semantic space of research articles. Doc2Vec (Mikolov, Sutskever, et al. 2013) and BERT (Devlin et al. 2019) are deep learning models based on the Word2Vec model (Mikolov, Chen, et al. 2013). The Latent Dirichlet Allocation model (hereafter LDA) is a non-deep learning approach that is extensively used for topic modelling, but that can also be considered as an embedding.

Word embedding. The representation of documents for training deep learning models is commonly based on word embeddings (Bojanowski et al. 2017; Mikolov, Chen, et al. 2013; Pennington et al. 2014). In word embeddings, each word is represented as a dense vector, and documents are a concatenation of those vectors. To build this representation, Mikolov, Chen, et al. 2013 propose Word2Vec and its Skip-Gram implementation. Given a corpus of text and a window size, the model defines the context of a word as the surrounding words within the window size. Then, for each word, it tries to predict its context, internally building a vector for each word. When trained, it learns to project words with similar meaning close to each other. This means that when we use this model on our Science of Science data set, for example, words like technology and innovation have a closer representation to each other than to the word student. When using word embeddings, the document is represented as a matrix of word vectors. Our goal is to make an embedding of articles, instead of words. Mikolov, Sutskever, et al. 2013 include the identifier of the corresponding document as an additional token in the context window. In this way, the model creates both an embedding for words and for documents.

BERT embedding. BERT (Devlin et al. 2019) improves on the Word2Vec model using attention mechanisms (Vaswani et al. 2017). This means that not every word in the context is equally considered when predicting a word from its context, which implies that the embedding of a word is also determined by its context. BERT is also based on the principle of transfer learning. This means that the word vectors can be learned on a big general-domain corpus, instead of the specific corpus for each task. This is useful because, to build a robust representation of concepts, word embeddings need billions of observations (Mikolov, Chen, et al. 2013).

LDA embedding. For comparison, we also use the LDA model proposed by Blei, Ng, et al. 2003. This model is not based on a deep learning architecture. Instead, it is a generative Bayesian model. It starts from the premise that a corpus is a collection of topics, and that each document is a mixture of those topics. The model takes the distribution of words within documents as input and generates the topics as probability distributions over words. For example, if science policy is a recurrent discussion in our corpus, LDA would define a topic where words like "technology", "innovation" or "policy" are the most relevant. Given these topics as distributions over words, the model will also define each document as a distribution over those topics. LDA is not usually seen as generating an embedding, as this terminology is usually found within the deep learning community. Nonetheless, its output can be thought of as an embedding for articles, as we can use the articles' distribution over topics as their low-dimensional representation5.
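Using the document-topic distributions as article embeddings can be sketched with scikit-learn (which the paper also uses for its LDA implementation); the function name and parameter values here are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_embeddings(texts, n_topics=20, seed=0):
    """Fit LDA on raw word counts and return the per-document topic
    distributions, used as a low-dimensional article embedding."""
    dtm = CountVectorizer().fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    # each row is a probability distribution over topics (rows sum to 1)
    return lda.fit_transform(dtm)
```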
These three models are compared based on their T-SNE projection (van der Maaten et al. 2008), and also based on how much the GNN model improves when using each of these as input.

The Relational Space of a Research Article
The semantic embeddings are built to study exclusively the textual content of documents. An analysis of research articles can consider more than their textual content: we have the opportunity to use other metadata features and the citation network to make relationships between articles explicit. GNNs have the potential to build a more holistic representation of research articles.
GNNs are a developing field in the deep learning community that tries to apply techniques that have proven useful in computer vision and NLP to problems where the data has a network structure. Multi-layer perceptrons work well with flat inputs, where they learn the compositionality of features, but they do not consider the specific dependencies between features explicitly (Goodfellow et al. 2016). Recurrent Neural Networks (RNNs) are designed for sequential inputs, like text, where the order of the input features is explicitly fed to the network (Sutskever et al. 2011). Convolutional Neural Networks (CNNs) are useful when dealing with images, where spatial relations within the image are detected independently of the specific position in the grid (LeCun et al. 1989). The problem with graphs is that, while they have explicit relations among nodes, which we would like to incorporate into our models, these relations are not as regular as those in images or text, so traditional RNNs and CNNs cannot deal with this type of data structure. The two main differences are: 1. nodes have a variable number of neighbours; and 2. neighbours do not have a predefined order.
Deep learning on graphs deals with these issues, trying to generalise RNNs and CNNs to the complex relations in networks. Although GNNs can be used for a number of tasks, like node and graph classification, we approach the embedding generation as an unsupervised problem, aiming to reconstruct each node's neighbourhood. This means that while training our deep learning models we are not trying to predict some node feature, but to rebuild the network structure. Following Hamilton et al. 2017b, we approach the problem with an Encoder-Decoder framework. Figure 2 shows the outline of this architecture. The encoder produces Z = ENC(X, A), where Z is the low-level matrix representation of nodes with dimension n × d, with n the number of articles and d the dimension of the article embedding. This is combined with a pairwise decoder DEC(z_u, z_v) = z_u^T z_v, where for each node we reconstruct its relation with all other nodes, generating an n × n matrix.

[Figure 2: Encoder → z → Decoder architecture]
If nodes share a similar low-level representation, then the inner product will give a higher value for their pairwise relation.
If on top of the inner product function we apply a sigmoid activation layer, the decoder will produce the pairwise relation of nodes expressed as a probability, i.e., the probability of the two nodes being linked in the network: Â = σ(Z Z^T), where Â is the reconstructed adjacency matrix. The goal is to optimise the encoder in order to minimise the reconstruction loss L = Σ_{(u,v)} ℓ(Â_{u,v}, A_{u,v}), where ℓ is the loss function, in our case the binary cross entropy, between the reconstructed link of nodes u and v and the true value in the adjacency matrix. Optimising this loss function implies training an encoder that will generate an embedding representation of nodes that preserves their similarities in terms of the network structure.
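The inner-product decoder and reconstruction loss can be sketched with NumPy. This is a simplified dense version for illustration, ignoring the negative sampling and train/test splitting used in practice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(Z):
    """Pairwise inner-product decoder: A_hat = sigmoid(Z Z^T),
    an n x n matrix of link probabilities."""
    return sigmoid(Z @ Z.T)

def reconstruction_loss(Z, A, eps=1e-9):
    """Binary cross entropy between the reconstructed adjacency A_hat
    and the true adjacency matrix A."""
    A_hat = reconstruct(Z)
    return float(-np.mean(A * np.log(A_hat + eps)
                          + (1 - A) * np.log(1 - A_hat + eps)))
```

Nodes with similar embeddings get a reconstructed link probability close to 1, while orthogonal embeddings give an inner product of 0, i.e., a probability of 0.5.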
The main difference between the used models is how they define the encoder step.We use as encoders the Graph Convolutional Network (GCN) (Kipf et al. 2017), GraphSAGE (Hamilton et al. 2017a), Graph Isomorphic Network (GIN) (Xu et al. 2019), Graph Attention Network (GAT) (Veličković et al. 2018), Attention-based Graph Neural Networks (AGNN) (Thekumparampil et al. 2018) and GraphUNet (Gao et al. 2019) layers, which constitute the current state of the art in the field.In Appendix C.2, we present an overview of these models.

Implementation Steps
In this section we present the pre-processing, data cleaning, features, hyperparameters, and network architectures that we have used to build and evaluate the models.
All textual models were built using a combination of the title, abstract, and keywords suggested by the authors of each article. Given that the words in the title and keywords are good representations of the content of an article, we use them in triplicate: we build a text that contains the title three times, each keyword three times, and the abstract once. To clean the data, we remove stopwords and trademarks of the journals. After this, we replace numbers with the special token "num". We also applied stemming, later replacing each stem with the most frequent word sharing that stem.
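The document construction and cleaning described above might look like the following sketch. The stopword list is an illustrative subset and the tokenisation rule is our own simplification; the stemming step is omitted:

```python
import re

# illustrative subset, not the full stopword list used in the study
STOPWORDS = {"the", "of", "a", "an", "and", "in", "on", "for", "to"}

def build_document(title, keywords, abstract):
    """Repeat the title and each keyword three times, append the abstract
    once, then lowercase, tokenise, replace numbers with 'num', and drop
    stopwords."""
    text = " ".join([title] * 3
                    + [kw for kw in keywords for _ in range(3)]
                    + [abstract])
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    tokens = ["num" if t.isdigit() else t for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]
```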
The Doc2Vec model was implemented using Gensim (Rehurek et al. 2010, May) with the 'distributed memory' learning algorithm, a vector size of twenty and a window size of ten, and using the concatenation of context vectors.To train the BERT embedding, we used the HuggingFace implementation (Wolf et al. 2019) with the 'bert-base-uncased' pre-trained model.Given that BERT generates word vectors, the sentence embedding was built as the mean of the token embeddings in each sentence.The LDA model was implemented using Scikit-learn (Pedregosa et al. 2011) with twenty components, removing tokens that appear in less than five documents or more than 65% of the documents.
The GNN models were trained using the following features:
• First author affiliation ID
• First author ID
• Year of publication
• Journal subject area (three per journal, as dummy variables)
• Topic distribution (from LDA)
• Keywords, title, and abstract information, either as TF-IDF, or using the sentence embedding from Word2Vec or BERT
• Cumulative citations after t years from publication, for t ∈ {1, ..., 10}, and total number of citations6.
The GNN architectures were implemented using the Pytorch Geometric implementation (Fey et al. 2019).In all cases, we use the Graph Autoencoder (Kipf et al. 2016) with the inner-product decoder.The main difference between the GNN models (GCN, GraphSage, GIN, GAT, AGNN, GraphUNet) is on the encoder side.To find the best set of hyper-parameters, we replicated the specifications from the original papers.The following hyper-parameters have been used in each model: The GCN has an output dimension of 32 and has been built with two GCN convolutions, with a Relu activation layer (Agarap 2018).GraphSage was trained with a similar architecture, using the Sage convolutional layer.The GIN model was trained with five GIN convolutional layers, using ELU activation layer (Clevert et al. 2016), batch normalisation and a normalisation layer in the end.The GAT model was trained with two GAT convolutional layers, with a dropout of 0.6 and a normalisation layer in the end, with an ELU activation layer after the first convolution; the embedding dimension was 16.The AGNN model also has two convolutions, starting with a linear projection followed by a Relu activation layer.The GraphUNet has the same dropout as the one mentioned in the original paper (Gao et al. 2019), with a depth of 4 and an embedding dimension of 16.
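What each GCN convolution in the encoder computes can be sketched as a single propagation step with NumPy, using the symmetric normalisation with self-loops from Kipf et al. 2017; weight initialisation and training are omitted, so this is a sketch rather than the study's implementation:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
    where A is the adjacency matrix, H the node features, W the layer weights."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalisation
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```

Stacking two such layers, as in the GCN encoder above, lets each article's embedding aggregate information from its two-hop citation neighbourhood.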

Performance Metrics
We use two metrics traditionally used to evaluate GNNs on the link prediction task: the Average Precision (AP) and the Area Under the Receiver Operating Characteristic Curve (AUC). To explain AP, we have to introduce some preliminary notions for a binary classification model, where an observation can be either positive or negative. A true positive (tp) is an observation predicted as positive that is a real positive case. In the same way, we can define true negative (tn), false positive (fp), and false negative (fn). We then define Precision = tp / (tp + fp), which measures how many of the predicted links are actual links. Recall, or True Positive Rate (TPR), measures how many of the actual links are predicted by the model as positive cases: Recall = tp / (tp + fn). The False Positive Rate (FPR) measures the "false alarm" rate, i.e., the ratio of false positives to all actual negatives: FPR = fp / (fp + tn). The models predict a ranked sequence of link candidates, i.e., each potential link is associated with a probability, so there is an implicit trade-off between Precision and Recall, or between TPR and FPR. For each model, using the ranked predictions, we can build a curve of Precision against Recall. The AP is the area under that curve, and can be computed as AP = Σ_n (R_n − R_{n−1}) P_n: for every candidate, it computes the precision at that point, weighted by the change in the Recall. In a similar way, the AUC is the area under the TPR against FPR curve.
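The AP computation can be sketched directly from this definition. A minimal pure-numpy version (scikit-learn's `average_precision_score` implements the same weighted sum):

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP as the sum over ranked candidates of precision weighted by
    the change in recall, following the definition above."""
    order = np.argsort(-np.asarray(y_score))   # highest probability first
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                          # true positives at each cut-off
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

# Three candidate links ranked by predicted probability; the first and
# third candidates are real links.
ap = average_precision([1, 0, 1], [0.9, 0.8, 0.7])  # -> 5/6 ≈ 0.833
```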
After having presented our data and methods as well as the experimental setup of our study, we now present our results.

Topic Modelling
To present a first characterisation of the data set, we use Topic Modelling to find the latent space of sub-fields within the corpus, using the LDA model (Blei, Ng, et al. 2003). LDA provides two different outputs: first, the distribution of words over topics, which is useful for defining the meaning of each topic; second, the distribution of topics over documents. Using LDAvis, we built an interactive visualisation of the distribution of words over topics (Sievert et al. 2014). Figure 3 shows the relative importance of each topic per field, calculated as the proportion of the topic in the field over the proportion of the topic in the entire data set.
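The relative-importance measure behind Figure 3 can be sketched as follows, using a hypothetical toy document-by-topic matrix from LDA and one field label per document:

```python
import numpy as np

def relative_importance(doc_topics, fields, field):
    """Share of each topic within one field divided by its share in the
    whole corpus; values above 1 mean the topic is over-represented
    in that field."""
    doc_topics = np.asarray(doc_topics, dtype=float)
    mask = np.asarray(fields) == field
    return doc_topics[mask].mean(axis=0) / doc_topics.mean(axis=0)

# Two topics, four documents, two fields (labels are illustrative only).
doc_topics = [[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]]
fields = ["LIS", "LIS", "HPS", "HPS"]
ri = relative_importance(doc_topics, fields, "HPS")
# Topic 1 is over-represented in HPS (ri[1] > 1), topic 0 under-represented.
```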
If we compare the most relevant words per topic with the distribution per field in Figure 3, we can see that in History and Philosophy of Science the topics discussed are education (topic 5), shared also with the field of Other Social Sciences, history (topic 15), and logic (topic 19). In the field of Library and Information Sciences many different topics are discussed, with particular emphasis on bibliometrics (topic 9) and on universities and scientists (topic 4), the latter also shared with Other Social Sciences. In Management we can see the focus on technology policy (topic 1) and on patents and firms (topic 10). In Other Social Sciences there is also a variety of topics, and besides the shared topics 4 and 5, there is also attention on public decision making (topic 8).
The LDA results show the wide variety of topics covered in the field of Science of Science, and also within the different disciplines involved in it. Some topics are shared across fields, and some journals, like Scientometrics, cover a wide range of thematic studies. This analysis also shows that there is no unique, unequivocal way of grouping journals by field.

Evaluation of Embeddings
In this section, we perform an evaluation of the multiple models proposed. For this, we quantitatively compare the performance of the models on the link prediction task, including how the different textual embeddings used as features improve the results. We also visualise our results based on the T-SNE projection of the embeddings.

Relational Space

As the GNN models are trained on a measurable task, we can first compare the models' results, and then analyse the resulting embedding of the best-performing one. In Table 2 we show the Area Under the ROC Curve and the Average Precision, two metrics widely used to evaluate GNNs on the link prediction task (Kipf et al. 2016; Zhang and Chen 2018). For the six different architectures, we use three different ways of encoding text: the traditional TF-IDF features (Kipf et al. 2017), the sentence embedding using Doc2Vec, and BERT. As shown, the best architecture is in every case the GCN from Kipf et al. 2017, closely followed by the GIN (Xu et al. 2019) and GraphSage (Hamilton et al. 2017a). GCN is also the one that improves the most when using BERT, achieving the best performance of 0.91 in both AUC and Average Precision. Using either TF-IDF or D2V achieves approximately equal results. We performed ablation studies to understand the relevance of the features. In Table 3 we present the impact of removing one of the features from the GCN model with BERT embeddings. Results show that adding or removing the label of the first author ID does not change the performance, which corresponds with the high cardinality of this feature. Similarly, removing the organisational affiliation, subject area, topic distribution, or year only gives a minor decrease in performance. When we remove the cumulative distribution of citations, the model loses 2% of its predictive power, and 5% to 6% if we remove the BERT embedding, which means these two features are the most relevant. Not having the BERT embedding gives a result close to those obtained using Doc2Vec and TF-IDF, which also shows the small impact of those ways of encoding text on the GNN. To conclude, the GCN is robust to the use of different features, and it mostly relies on the network structure. We expect the embedding representation to be highly related to the network properties of the citation patterns and only mildly related to the semantic patterns through BERT.

Table 3: Link prediction results when removing a feature. Area Under the Curve and Average Precision. Average result of 10 runs; standard deviation in parentheses.

Removed               AUC          AP
First Author          0.91 (0.0)   0.91 (0.0)
Affiliation           0.9  (0.01)  0.9  (0.0)
Subject Area          0.9  (0.01)  0.9  (0.01)
Topic Distribution    0.9  (0.01)  0.9  (0.01)
Year                  0.9  (0.0)   0.9  (0.0)
Citations at 1:10     0.89 (0.0)   0.89 (0.0)
BERT embedding        0.86 (0.0)   0.87 (0.0)
In Figures 4 and 5, we show the T-SNE projection (van der Maaten et al. 2008) for a sample of two journals per field, coloured by field and journal, and sized by number of citations. We also show the 95% ellipses of each journal, i.e., containing 95% of the articles of the corresponding journal (Fox et al. 2018). Figure 4 is based on the GNN embedding using GCN and BERT. Results show that there is a correlation between the number of citations and the position in space. For example, the articles from the journals Research Policy and Scientometrics that present the highest number of citations are located in the bottom-right and top of the T-SNE projection. Journals from the field of History and Philosophy of Science, like Synthese and Studies in History and Philosophy of Science, where the citing culture and the selection bias of the data set (see Section 2) imply lower citation counts, are located on the left of the plot. Within this field, articles with a higher number of citations cluster together in the top-left of the T-SNE representation. All fields but History and Philosophy of Science form a uniform point cloud that correlates more with the number of citations than with the corresponding journal. This can also be seen in the overlapping ellipses of most journals, except for those from History and Philosophy of Science. That the plot is organised by citation patterns rather than thematically becomes evident when compared with the semantic embeddings in Figure 5. This result is in line with the expectation that the GNN pays more attention to the relational patterns than to the semantic content of research articles. Citation rates differ among disciplines; research articles in the social sciences and humanities in particular have a lower citation rate than articles in other disciplines. Further, the selection bias of the data set in favour of specific journals might cause an underestimation of citation rates for those journals, which might have a higher number of cross-references to journals that are not included in our sample, as well as to other publication formats (monographs, contributions to edited volumes, etc.) that have not been analysed in this study.
Semantic Space

Figure 5 shows the results for the three textual embeddings. The Doc2Vec model is the only one of the three proposed models that was designed for document-level embeddings. Nevertheless, it does not show any correlation with journals or number of citations; the shape of the point cloud is spherical. This might be an indication that Doc2Vec is not able to learn a good representation of the documents in our data set, probably because this type of model is usually trained on millions of data points. As outlined above, pre-trained embeddings built on a general-purpose corpus can improve the word representations, but this is not possible at the document level in Doc2Vec.
The LDA embedding correctly delimits the four sub-fields of the data set. Articles from the field of History and Philosophy of Science are located on the left side of the projection, with articles from History and Philosophy of Science located closer to the border with the field of Other Social Sciences. This latter field sits between History and Philosophy of Science on its left and the field of Management on its right, where articles, especially from Research Evaluation, tend to merge. We can also see a small number of articles from this field in the middle of the cluster of Library and Information Sciences. Finally, the fields of Management and Library and Information Sciences, although well defined, share a point of contact in the centre of the plot, and some articles from each of these fields can be located in the cluster of the other. Compared with Figure 4, the grouping of highly cited articles is driven by the journals, as within each field we cannot see any clustering of highly cited articles. The BERT embedding, where we were able to use pre-trained word vectors, shows a better performance than Doc2Vec, even though the model is not originally designed for document-level representation and we are averaging the word embeddings in each document. The T-SNE representation shows a stronger delimitation between fields, and a delimitation between journals within each field. Articles from History and Philosophy of Science are again located on the left side of the figure, but with a stronger delimitation. At the top of the figure we can see the field of Management, where articles from Research Policy are placed towards the centre of the figure, and those from Science and Public Policy are shifted to the top. The field of Library and Information Sciences is split into three groups: most of the articles from Scientometrics lie in the centre of the figure, mixed with those from Research Evaluation. On the left, a small proportion of the articles from this journal stays closer to the field of History and Philosophy of Science. On the right, another group of articles from this journal, and most of the articles from the Journal of Informetrics, is closer to those from Science and Public Policy. This might be a reflection of a methodological field that develops technical aspects, but also applies the developed methodologies to thematically specific research questions. Finally, the field of Other Social Sciences is found at the bottom of the plot, with articles from Public Understanding of Science more delimited, and articles from Research Evaluation closer to the field of Library and Information Sciences.

Comparing the Differences between the Relational and Semantic Spaces
After studying the overall quality of the different models, we focus on those that have the best performance in both the semantic and the relational space.
For the latter, we select the GCN using BERT embeddings as features. For the semantic space, we mainly focus on the BERT model, but also show some results for the LDA model for comparison. In this section, we show how the embedding representation of articles changes between the semantic and relational spaces. For this, we compare the results on four different topics widely studied in the field of Science of Science: first, the representation of collaboration patterns; second, the Matthew effect in science; third, a country-level analysis; and fourth, the epistemic practice division in the field. Our goal is to compare how these different phenomena are encoded in the resulting embeddings, and how their representations differ between the proposed models.

Collaboration Patterns
The pairwise similarity between articles can be observed through the cosine similarity between their vectors. To compare higher-level groups, we can calculate the average cosine similarity between groups.
One possible dimension of analysis is the collaboration pattern in terms of co-authored papers, and whether it is encoded in the embeddings. For this, we divide the articles into four groups by their forms of collaboration: A) single author; B) internal collaborations within a single organisation; C) collaborations between authors with different organisational affiliations from the same country; and D) international collaborations, including authors from different countries and organisational affiliations. To avoid biases due to different collaboration patterns by journal, the results will be illustrated by a comparison of the journals Research Policy and Scientometrics, i.e., comparing the average cosine similarity of articles from one journal to the other, and within each journal. Each of the three embeddings uses the space differently, and hence for each we define a specific colour scale.
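The group comparison described above can be sketched as the mean pairwise cosine similarity between (or within) groups of article vectors. A minimal numpy version, with hypothetical group labels:

```python
import numpy as np

def mean_cosine(X, labels, a, b):
    """Average cosine similarity between articles of group a and group b
    (within-group if a == b, excluding self-comparisons)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    labels = np.asarray(labels)
    S = Xn[labels == a] @ Xn[labels == b].T
    if a == b:
        n = S.shape[0]
        return float((S.sum() - np.trace(S)) / (n * (n - 1)))
    return float(S.mean())

# Toy article vectors; "D" and "A" stand in for two collaboration groups.
X = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
labels = ["D", "D", "A", "A"]
within = mean_cosine(X, labels, "D", "D")   # close to 1
between = mean_cosine(X, labels, "D", "A")  # close to 0
```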
Figure 6a shows the results for the GNN. Results show a higher cosine similarity for international collaboration. Average similarity decreases from this type of collaboration towards single authorship, and this pattern holds independently of the journal, i.e., articles from Research Policy are not closer to other articles in Research Policy than to articles in Scientometrics, except for single-author papers, where we find a small difference. This means that international collaborations are closer to all other articles. If we consider the relational space as a hyper-sphere of articles, i.e., a sphere that lies in more than three dimensions, international collaborations are in the centre of this hyper-sphere, and single-authored publications tend to be on the periphery. This is in line with the bibliometric analysis of the higher impact of international collaborations measured through higher citation rates (Adams 2013; Persson et al. 2004; Van Raan 1998). Figure 6b shows the results for the LDA model, while Figure 6c shows the results for the BERT model. In both cases, these semantic embeddings show no strong correlation between collaboration patterns and article similarity. Articles from Research Policy are closer to each other, as are Scientometrics articles among themselves. In both cases, articles from Research Policy tend to be closer together, and in the case of BERT, single-authored articles from Scientometrics tend to be closer to Research Policy publications. This means that while the semantic embeddings build similar representations for articles in the same journal, the GNN builds a representation of articles where international collaborations are similarly encoded and in a central position.
The Matthew Effect in Science

The Matthew effect in science, introduced by Robert K. Merton (1974), states that articles that are already highly cited have a higher chance of being cited again. Figure 6a reflects this via the collaboration patterns, while Figure 4 also shows a correlation of articles by the number of citations. To test whether the embeddings are able to capture the Matthew effect, we divided the articles by their number of total citations in the Scopus data set into quartiles, plus a separate group for those with zero citations (five groups in total). Then, we calculated the Frobenius norm of each article's embedding, and aggregated the results for these five groups. When studying the distribution of the Frobenius norm by citation level in the different models, we found that while the GNN generates a higher value for the highly cited articles, the BERT, Doc2Vec, and LDA models do not follow this pattern. These results mean that the GNN is systematically representing highly cited articles differently. This is expected, given that the GNN is trained on a link prediction task, and a higher Frobenius norm is associated with a higher probability of a link, i.e., a citation link, via the inner-product decoder (see Section 3). Nevertheless, these results show that when we design the GNN embeddings for the link prediction task, instead of trying to predict subject areas, as in Kipf et al. 2017, we are able to capture the Matthew effect in the embedding. This is an important conclusion for future research, as it shows the way in which we can use the embeddings framework for studying the Matthew effect in science. We also found that the semantic embeddings do not encode this phenomenon, which is important for predictive modelling: if the Matthew effect is encoded in the embedding, this would imply reinforcing inequalities.
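For a single article vector, the Frobenius norm reduces to the Euclidean norm of the embedding. The grouped comparison can be sketched as follows, with hypothetical citation-group labels:

```python
import numpy as np

def mean_norm_by_group(X, groups):
    """Mean embedding norm per citation group. Under the inner-product
    decoder, a larger norm pushes all of an article's link probabilities
    upwards, which is how the GNN can encode the Matthew effect."""
    norms = np.linalg.norm(X, axis=1)
    groups = np.asarray(groups)
    return {g: float(norms[groups == g].mean()) for g in set(groups)}

# Toy data: one highly cited article (norm 5) and one uncited article
# (norm 0.5); the group labels are illustrative only.
X = np.array([[3.0, 4.0],
              [0.3, 0.4]])
out = mean_norm_by_group(X, ["Q4", "zero"])
```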

Country-level Analysis
In the same way we built a BERT embedding by averaging the word embeddings of each document, we can build a hierarchical representation of entities by averaging their components. One of the dimensions of analysis is the role of countries in the field of Science of Science. For this, we took the first author's organisational affiliation to ascribe a geographical location to an article. This does not necessarily mean that an article has been written in that country, but it gives us a proxy for the geographical distribution of scientific work and allows us to reconstruct the average position of countries in the embedding. Using the cosine similarity between countries, Figure 7 shows the average similarity of a country with respect to all others in the GNN (horizontal axis) and BERT (vertical axis) embeddings. This means that we are comparing the semantic proximity on the vertical axis against the structural, network-based proximity on the horizontal axis. Results show that there is a centre of gravity of science production (Zhang, Powell, et al. 2015) that includes most of the English-speaking countries, (Western) Europe, and (East) Asia. Close to the core, we can also find some countries from (South) America and (East) Europe. South Africa is the only country from its continent close to the centre, which might be an indication of the research activities of the Centre for Research on Evaluation Science and Technology (CREST) at Stellenbosch University. Results also show that the BERT cosine similarity is almost always higher than 0.8, while the GNN similarity ranges between −0.5 and 0.5. This means that the semantic representation is in general very similar between all countries, while in the structural representation countries are never too close, and many of them even point in the opposite direction of most of the other countries. The presented results can be interpreted as follows: while researchers in Science of Science from all countries, within these journals, work on more or less similar content, the relevance that the academic community gives to their work is highly skewed. For example, in the case of Uruguay, the average BERT cosine similarity, i.e., based on the textual content of the articles, between this country and all others is almost 0.95, a very high value considering that cosine similarity ranges between −1 and 1. On the other hand, its citation-based cosine similarity is less than −0.35, which means that it points in the opposite direction with respect to most of the countries. As we mentioned in Section 2, this analysis is limited by the limits of the data set. We cannot fully account for scientific production outside the countries that appear here as peripheral. Including journals from other regions and languages would most probably change the layout of the results, especially for the semantic embeddings (Beigel 2014). In this sense, we have to limit the scope of interpretation to the fact that, within these journals, the topics discussed do not vary much. Nevertheless, this result is in line with many other studies in the field on the unequal distribution of citations, at least in international journals (Bonitz et al. 1997; Demeter et al. 2020; King 2011; Merton 1974). This analysis answers our first research question: if we use the GAE with the GCN, we can encode the relational dimension of articles. With the analysis of collaboration patterns, the Frobenius norm, and the country-level analysis, we can see that the idea of prestige is captured by the GNN embeddings. This concept unfolds into different expressions, such as the different positions articles have in the embedding according to their collaboration patterns and citation levels, and also at hierarchical levels of analysis, like the distribution of countries.
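The hierarchical averaging used for the country-level analysis can be sketched as: build each country's vector as the mean of its articles' embeddings, then report the average cosine similarity of that country to all others, i.e., one axis of the comparison in Figure 7. A toy numpy version with hypothetical country codes:

```python
import numpy as np

def country_centroids(X, countries):
    """Average the article vectors of each country into one centroid."""
    countries = np.asarray(countries)
    return {c: X[countries == c].mean(axis=0) for c in set(countries)}

def mean_similarity_to_others(centroids, c):
    """Cosine similarity of country c's centroid to every other
    country's centroid, averaged."""
    v = centroids[c] / np.linalg.norm(centroids[c])
    sims = [float(v @ (u / np.linalg.norm(u)))
            for k, u in centroids.items() if k != c]
    return sum(sims) / len(sims)

# Toy data: two countries near each other, one pointing the other way.
X = np.array([[1.0, 0.0], [1.0, 0.2], [0.9, 0.1], [-1.0, 0.0]])
cents = country_centroids(X, ["US", "US", "NL", "UY"])
core = mean_similarity_to_others(cents, "US")
outlier = mean_similarity_to_others(cents, "UY")  # strongly negative
```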
Projection of Journals on the Epistemic Spectrum

Word embeddings have shown impressive results on analogy tasks. Mikolov, Yih, et al. 2013 show that, for word embeddings, we can solve the task "man is to woman as king is to ___" by computing vec(king) + vec(woman) − vec(man), which returns a vector close to vec(queen). This implies that there is a latent dimension of gender in the word embedding, which can be reconstructed by the subtraction vec(woman) − vec(man). Kozlowski et al. 2019 suggested using analogies to describe social dimensions. Kang et al. 2020 applied this in the field of Science of Science: the authors define two word embeddings and compare how different concepts, like "theory" and "measure", are projected onto the "good-bad" dimension, based on a selection of journals assigned to the quantitative and qualitative research communities. For our analysis, we use the analogies approach proposed by Mikolov, Yih, et al. 2013 to study the differences between journals. But instead of building different embeddings based on a pre-classification of journals, as in Kang et al. 2020, we define a single data set and build the analogy on two pivotal journals. The "quantitative-qualitative" division is an open discussion (Leydesdorff et al. 2020; Weber 2004), and a purely quantitative approach does not seem able to settle it. Given this, our analysis does not intend to prove that there is such a division. A more accurate interpretation is to think of the analogy we propose simply as the latent dimension that separates the epistemic practices of two journals, be it methodological, epistemic, ontological, or simply related to the different vocabularies typically used in different research fields.
For this, we first generate a vector representation of each journal in our data set, in the same way we built the representation of countries. Then, we select two journals as pivots for building the latent dimension. The selection of the pivotal journals is necessarily an arbitrary one, but compared with the approach proposed by Kang et al. 2020, we do not need to previously assign each journal to one of the two poles. After this, we project the other journals onto this latent dimension using cosine similarity; in this way each journal's projection will be closer to one of the two poles. If the way in which journals order themselves along this dimension seems to be random, then there is no latent dimension between the two pivotal journals. In Figure 8 we consider this exercise using the journals ISIS and Journal of Informetrics as pivots. The latter defines its scope as "research on quantitative aspects of information science". ISIS is a long-standing journal on the "history of science, medicine, and technology and their cultural influences". In the BERT embedding we can see that the Journal of Informetrics is at one extreme, while at the other extreme we find the British Journal for the History of Science. We can also see that there is a strong division of journals along the axis, with seven journals very close to the ISIS pole and nine close to the Journal of Informetrics pole. Almost all journals at the ISIS pole are from History and Philosophy of Science, except for Minerva, a multidisciplinary journal. All journals from Management and Library and Information Sciences are at the Journal of Informetrics pole, although the journal Scientometrics is not as close to the extreme as one would expect. Except for Minerva, all other journals from Other Social Sciences are also at the Journal of Informetrics pole. On the other hand, the same exercise using the GNN embedding gives a different result. In this case, both journals from the field of Library and Information Sciences sit together in the middle range, indicating that the Journal of Informetrics-ISIS dimension is poorly defined in this embedding, and that the GNN embedding does not capture the semantic information as well as the BERT embedding does. If we consider the mean citations by journal, the ordering in the GNN embedding seems to correlate more with citations than with the epistemic differences between journals. Given this, we can conclude that while the BERT embedding can correctly capture this phenomenon, the GNN embedding is driven by the relational structure of the citation network rather than the epistemic content of articles, and is therefore not a proper tool for this type of analysis.
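The pivot construction can be sketched as: take the difference of the two pivot journals' vectors as a latent axis, then place every other journal by the cosine similarity of its vector with that axis. A toy numpy version with hypothetical journal vectors (the abbreviations are illustrative only):

```python
import numpy as np

def project_on_axis(journal_vecs, pivot_a, pivot_b):
    """Project journal vectors onto the latent dimension defined by two
    pivot journals; positive values lean towards pivot_a, negative
    values towards pivot_b."""
    axis = journal_vecs[pivot_a] - journal_vecs[pivot_b]
    axis = axis / np.linalg.norm(axis)
    return {name: float((v / np.linalg.norm(v)) @ axis)
            for name, v in journal_vecs.items()}

vecs = {
    "JOI":  np.array([1.0, 0.0]),   # quantitative pivot
    "ISIS": np.array([0.0, 1.0]),   # historical pivot
    "SCIM": np.array([0.9, 0.3]),   # leans quantitative
    "BJHS": np.array([0.2, 0.9]),   # leans historical
}
proj = project_on_axis(vecs, "JOI", "ISIS")
# proj["SCIM"] > 0 > proj["BJHS"]: each journal falls towards one pole.
```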
The outlined results answer research question two: using text-based embeddings, we can encode a semantic representation at the article level. Using analogies between journals, we can reflect the epistemic difference between them and project other journals or articles onto that spectrum.

Conclusions
In this paper, we explored the use of embeddings as representations of research articles. We presented an overview of techniques designed around the different elements that compose an article: its text, metadata, and citations. The objective of this article is to study the use of new methodologies that are currently being developed in the field of computer science to analyse journals and research articles in the field of Science of Science.
In Section 3, we presented two approaches for building an embedding space of articles: NLP techniques and GNN techniques. In Section 4.2, we found that, using textual content in document embeddings, we are able to build a semantic space. There, we also evaluated the performance of the different models and concluded that, when we use the network structure in a GNN, we define a relational space that embeds the underlying social relations. We also presented an extensive comparative analysis within each group of models to find the best-performing architecture and set of hyper-parameters. Our results show that for the semantic embedding, the BERT model gives the clearest results, while for the link prediction task on the direct citation network, the GAE with GCN, using the BERT embedding along with the metadata features, gives the best performance. In Section 4.3, we compared the semantic and relational spaces along four different dimensions: collaboration patterns, the country-level analysis, the Matthew effect, and the journals' epistemic practice division. From this analysis, we found that the relational space captures differences in collaboration status, citation levels, and aggregated levels of analysis, like differences between countries. The semantic space, on its side, captures phenomena related to article topics, their corresponding journal, and epistemological aspects, like the journals' epistemic practice division.
These are promising results, as they open the possibility of many different in-depth analyses using these different techniques, according to the respective research question. A responsible use of embeddings as tools for analysis does not imply that we should not use those embeddings that reproduce existing biases, but that we should acknowledge them. For example, a recommendation system based on a GNN would reproduce the biased attention researchers give to different articles. On the other hand, using GNNs can be of great help to better understand the biases themselves. These biases, as seen in the country-level analysis, might not only be a reflection of the scientific community, but are likely related to the chosen Scopus database and the sampling of journals and articles itself. This paper is based on a small data set in the field of Science of Science, and therefore represents a case study. More research on other fields and cross-disciplinary data sets has to be done to explore whether the present results are also valid in other fields of research. The main contributions of this paper are fourfold:

1. Data: We have built a data set with 16 core journals from the field of Science of Science, from a range of disciplines. The data set includes 22,151 research articles, including their metadata (abstract, title, keywords, authors, organisational affiliations of the authors, year of publication, and journal subject areas), the results of the LDA model (the topic distribution of each article), and the cumulative citations from the network. From this data set, we build a network of direct citations with 16,578 nodes and 68,797 links. This data set is suitable for models that work with metadata features, NLP, and networks, allowing us to compare the performance of different approaches.
2. Semantic Modelling: We tested three different approaches for the sentence embedding of titles, abstracts, and keywords: Doc2Vec, LDA, and BERT, and compared their performance as features for a link prediction model. We presented the Topic Model results as an interactive plot, which gives an overview of the topics discussed within the data set. We came to the conclusion that for a data set of these characteristics it is more powerful to use a BERT model, because it can benefit from pre-training on a large corpus.

3. Relational Modelling:
We trained the GNN on the link prediction task and made an extensive comparison of multiple possible encoders, using GNN layers that are currently the state of the art in computer science. We showed that the GCN is the best-performing architecture for this task in the present context. We also performed ablation studies and found that the BERT encoding of text and the cumulative citations are the most relevant features for the model.

4. Models Comparison:
We compared the latent information in the different types of embeddings and arrived at the conclusion that the textual embedding generates a semantic space, while the GNN generates a relational space. This is an important methodological conclusion, as it distinguishes which type of modelling should be used for a given research question.
Multiple recommendations for future research arise from this article. First, methodological research needs to be done on other GNN architectures. In this article, we presented GNN models for the link prediction task, but node prediction tasks are also suitable for this network. Predicting the article's journal, for example, would generate an embedding representation much closer to the semantic space. Predicting the article's author with a GNN can be an important improvement for the author-name disambiguation problem (Schulz et al. 2014). Besides, direct citation networks are not the only network structure to be considered. Second, the networks that emerge from bibliographic coupling and co-citation can be compared with the results from direct citations. Third, the study of co-authorship and mobility networks is a promising line of research, in order to explore how communities of co-authors and institutions are organised in the embeddings. Fourth, the use of this methodology can generate important new insights for the field of Science of Science: the flexibility of the low-level representations allows a quantitative approach to many research questions. Problems like the Matthew or the Matilda effects (Rossiter 1993) can be approached using the cosine similarity analysis presented in this article. The time dynamics of these phenomena can also be studied by simply splitting the data set into decades and building multiple embeddings, as in Garg et al. 2018. Also, as we have seen in this article, field classification at the journal level can be a problematic task; embedding methodologies can be used for field classification at the article level. Finally, the development of methodologies based on relational embeddings enables researchers to detect gender and ethnic biases, and also carries potential for use in policy recommendations.
If we consider these documents as a realisation of this chain of random processes, we can use Bayes' theorem to infer the probability distributions:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β),

where θ is the Dirichlet process that defines the distribution of topics over documents, z are the topic assignments, α is its parameter, w are the observed words, and β parametrises the topic-word distributions.
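The generative process that this posterior inverts can be sketched with numpy's Dirichlet and categorical samplers. This is a minimal illustration, not the inference procedure itself; all sizes and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, doc_len = 3, 8, 20
alpha = np.full(n_topics, 0.5)    # Dirichlet parameter over topic mixtures
beta = np.full(vocab_size, 0.1)   # Dirichlet parameter over words per topic

# Topic-word distributions phi: one categorical distribution per topic.
phi = rng.dirichlet(beta, size=n_topics)

def generate_document():
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet(alpha)                     # topic mixture of this document
    z = rng.choice(n_topics, size=doc_len, p=theta)  # topic assignment per word
    words = [rng.choice(vocab_size, p=phi[t]) for t in z]
    return theta, z, words

theta, z, words = generate_document()
```

Inference then works backwards: given only the observed words, it recovers a posterior over theta and z.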

C.2 GNN Models

C.2.1 First Approaches
In this section, we briefly present the first approaches to GNN. These models are not used in the subsequent experiments, as they are no longer the state of the art. Nevertheless, they are all conceptually important. For the task of node embedding, given the developments in word embeddings (Mikolov, Yih, et al. 2013), one of the first strategies proposed was to use random walks over nodes to define sequences that can later be used as the input for a Word2Vec model, as is normally done with texts in NLP. The first model that proposed this technique was DeepWalk (Perozzi et al. 2014). Later, Grover et al. 2016 proposed node2vec, which defines flexible biased random walks, with parameters that adjust the paths taken by the walks to search for structural roles or community structures. The major problem with these approaches is that they do not consider the features of the nodes, so they miss potentially useful information. Scarselli et al. 2009 proposed the Graph Neural Network model, which iteratively updates each node's state by looking at its neighbours until it converges. This recurrent model uses a single layer which is iteratively applied.
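The random-walk idea can be sketched in a few lines of pure Python. The graph and walk length below are hypothetical; DeepWalk uses uniform choices as shown, while node2vec would bias the choice of the next node.

```python
import random

# Toy undirected graph as adjacency lists (hypothetical node ids).
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d"],
}

def random_walk(graph, start, length, rng=random):
    """Uniform random walk over the graph, as in DeepWalk."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

# One walk per node; in practice many walks per node are generated, and the
# resulting "sentences" of node ids are fed to a Word2Vec/skip-gram model.
walks = [random_walk(graph, node, length=5) for node in graph]
```

Each walk plays the role of a sentence, with nodes as words, so the downstream embedding step is unchanged NLP machinery.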
Convolutional Graph Network (CGN) models instead use a stack of layers. In this way, the number of updates is fixed, and the parameters of each layer are allowed to change, giving more flexibility to the model. Spectral-based methods were the first type of CGN (Bruna et al. 2014). They use the graph Fourier transform of the Laplacian matrix (a normalised adjacency matrix), which can be thought of as the effect of a signal over the network. This model, while conceptually important, suffers from many problems. In particular, as it uses the full graph structure, it can only work in transductive settings, i.e., it cannot use different graphs for training and testing. More importantly, the eigendecomposition requires O(n^3) operations, where n is the number of nodes. This is prohibitively expensive when the network has billions of nodes, like social networks.
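The object at the heart of the spectral approach can be computed directly with numpy. The adjacency matrix below is a hypothetical 4-node example; the eigendecomposition in the last line is the O(n^3) step that makes spectral methods impractical at scale.

```python
import numpy as np

# Toy adjacency matrix of a connected, undirected 4-node graph (illustrative).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

# Symmetric normalised Laplacian: L = I - D^{-1/2} A D^{-1/2}.
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Spectral methods filter signals in the eigenbasis of L; computing it
# costs O(n^3) for n nodes.
eigvals, eigvecs = np.linalg.eigh(L)
```

For a connected graph the smallest eigenvalue is 0, and the spectrum as a whole describes how smoothly a signal can vary over the network.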

C.2.2 State Of The Art
In this section, we present the current state of the art in GNN. A combination of these models will be used in the experimental analysis for the task of link prediction.
GCN Many models improve on the limitations of the spectral method proposed by Bruna et al. 2014. Defferrard et al. 2016 build an approximation of the original model with Chebyshev polynomials. Kipf et al. 2017 introduce the Graph Convolutional Network (GCN), which further simplifies the model and includes self-connections: in the iterative process of building a representation based on its neighbours, the node will also look at itself, which is a desirable property. The GCN simplifies the model by only looking at the first-order neighbourhood. If the representation of a node is initialised with its feature vector, the GCN update builds an average representation based on itself, due to self-connections, and its neighbours. Instead of iterating the update step until convergence, like recurrent models, in GCN a stack of layers is built. The stacking of GCN layers allows the node representation to be based on more distant nodes. Here, simplicity is the key to building a powerful model.
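A single GCN layer can be sketched in numpy. This follows the propagation rule of Kipf et al. 2017, H' = σ(D̃^{-1/2} Ã D̃^{-1/2} H W) with Ã = A + I; the graph, feature matrix, and weights below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # toy 3-node path graph
X = rng.normal(size=(3, 4))              # node features (hypothetical)
W = rng.normal(size=(4, 2))              # learnable weights of this layer

# Add self-connections so each node also looks at itself.
A_tilde = A + np.eye(3)
d = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)

# One GCN layer: normalised neighbourhood averaging, projection, ReLU.
H = np.maximum(0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ W)
```

Stacking such layers lets information flow from k-hop neighbours after k layers.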
GraphSage Hamilton et al. 2017a propose several variations on the GCN in the GraphSage model. As their formulation of the problem is useful for understanding how graph convolutional networks work, we present their algorithm in Algorithm 1. The model needs the following inputs:
• the graph G(V, E), with a list of vertices V and edges E, and
• the input features x, where x_v is the feature vector of node v.

We also need to define the number of layers the model will have, K, a set of weight matrices W_k for each layer (which will later be trained with the data), and an activation function σ. We also need a way in which a group of node embeddings will be aggregated, and a way of defining the neighbourhood of a node.
The model initialises the node embeddings with their feature vectors. After this, in each of the K layers, for each node v ∈ V, it first defines its neighbours. One of the changes with respect to GCN is that GraphSage samples a fixed number of neighbours to control the computational footprint. Given the neighbourhood, the AGGREGATE function is used to build a new vectorised representation of those neighbours (line 4), which is then concatenated with the embedding of node v from the previous layer (line 5). A projection is made with the W_k matrix, and an activation function is applied. This corresponds to the typical structure of a deep learning layer. When the model is fitted with back-propagation for a specific task, the W_k are updated in order to optimise the loss function (Kelley 1960). Line 6 is simply a normalisation step.
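The update for a single node can be sketched in numpy with a mean aggregator, one of the aggregators proposed by Hamilton et al. 2017a. The graph, features, and weight matrix are hypothetical; the comments map each step to the algorithm lines discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 nodes with 3-dimensional features (hypothetical values).
h = rng.normal(size=(4, 3))                  # embeddings from the previous layer
neighbours = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
W = rng.normal(size=(6, 5))                  # maps concat(3 + 3) dims -> 5 dims

def graphsage_update(v):
    """One GraphSage step for node v with a mean aggregator."""
    h_neigh = h[neighbours[v]].mean(axis=0)   # AGGREGATE over neighbours (line 4)
    h_cat = np.concatenate([h[v], h_neigh])   # CONCAT with own embedding (line 5)
    h_new = np.tanh(h_cat @ W)                # projection + activation
    return h_new / np.linalg.norm(h_new)      # L2 normalisation (line 6)

h_next = np.stack([graphsage_update(v) for v in range(4)])
```

In the full model this update runs once per layer, with neighbour sampling keeping the cost per node bounded.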
Algorithm 1 GraphSage embedding generation (Hamilton et al. 2017a)
input: graph G(V, E); input features {x_v, ∀v ∈ V}; depth K; weight matrices W_k; activation function σ; aggregator functions; neighbourhood function N(v).
output: vector representations z_v, ∀v ∈ V.
1: h_v^0 ← x_v, ∀v ∈ V
2: for k = 1 . . . K do
3:   for v ∈ V do
4:     h_{N(v)}^k ← AGGREGATE_k({h_u^{k-1}, ∀u ∈ N(v)})
5:     h_v^k ← σ(W_k · CONCAT(h_v^{k-1}, h_{N(v)}^k))
6:   h_v^k ← h_v^k / ||h_v^k||_2, ∀v ∈ V
7: z_v ← h_v^K, ∀v ∈ V

GIN For tasks at the graph level, a readout function takes the node embeddings as input and generates a single embedding representation for the entire network. If for two non-isomorphic graphs G_1 and G_2 we can build different embedding representations, we are able to distinguish between the two. Xu et al. 2019 prove that the power of GNNs for discriminating between non-isomorphic networks is at most that of the Weisfeiler-Lehman test. However, not every GNN reaches this power; the Graph Isomorphism Network (GIN) is a proposal that achieves the maximum discriminative capacity by using a specific update function, which replaces the AGGREGATE functions proposed by Hamilton et al. 2017a (mean, LSTM or max) with a summation over the neighbourhood. This model has the advantage of a theoretically robust choice of AGGREGATE function.
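A toy example (with hypothetical 1-dimensional features) shows why the summation matters: a mean aggregator maps two different neighbourhood multisets to the same vector, while a GIN-style sum keeps them apart.

```python
import numpy as np

# Two different neighbourhood multisets of 1-dimensional node features.
neigh_1 = np.array([[1.0], [1.0]])          # two neighbours with feature 1
neigh_2 = np.array([[1.0], [1.0], [1.0]])   # three neighbours with feature 1

# The mean aggregator collapses both neighbourhoods to the same value,
# so a mean-based GNN cannot tell these structures apart.
mean_1, mean_2 = neigh_1.mean(axis=0), neigh_2.mean(axis=0)

# The GIN-style sum preserves the multiset size and keeps them distinct.
sum_1, sum_2 = neigh_1.sum(axis=0), neigh_2.sum(axis=0)
```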
GAT In GCN, node embeddings are updated using their neighbours' embeddings. Up to this point, all models consider that every neighbour has the same influence, which might not be true. Graph Attention Networks (GAT), introduced by Veličković et al. 2018, use attention mechanisms (which are currently the state of the art in other problems, like NLP) to assign a different influence to each neighbour. The update of the node representation becomes:

h_v' = σ( Σ_{u ∈ N(v)} α_{u,v} W h_u ),

where W is still a learned weight matrix and α_{u,v} is the normalised attention between nodes u and v.
Following Vaswani et al. 2017, GAT uses multi-head attention, which implies applying K independent attention mechanisms and concatenating their results (except for the final layer, where they are averaged).
This model has the advantage of learning the different importance of the neighbours, given their feature vectors. Moreover, the attention mechanisms have the potential of increasing the interpretability of the model. Thekumparampil et al. 2018 propose a variation of this model, the Attention-based Graph Neural Network (AGNN), where the relevance is defined based on cosine similarity.
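A single-head attention computation can be sketched in numpy, following the additive scoring of Veličković et al. 2018 (a LeakyReLU over a learned vector applied to the concatenated projections, then a softmax over the neighbourhood). All matrices and the neighbourhood below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

h = rng.normal(size=(3, 4))   # node features (hypothetical)
W = rng.normal(size=(4, 2))   # shared weight matrix
a = rng.normal(size=(4,))     # attention vector over concat(Wh_v, Wh_u)

def attention_weights(v, neigh):
    """Normalised GAT attention of node v over its neighbourhood."""
    scores = []
    for u in neigh:
        e = np.concatenate([h[v] @ W, h[u] @ W]) @ a   # unnormalised score
        scores.append(np.maximum(0.2 * e, e))          # LeakyReLU(0.2)
    scores = np.array(scores)
    scores = np.exp(scores - scores.max())             # softmax over neighbours
    return scores / scores.sum()

alpha = attention_weights(0, [0, 1, 2])  # attention over self + two neighbours
h_new = alpha @ (h[[0, 1, 2]] @ W)       # attention-weighted update with W
```

Because the weights α come out of a softmax, they are positive and sum to one, which is what makes them readable as the relative influence of each neighbour.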
GraphUNet Gao et al. 2019 proposed GraphUNet, based on a new definition of pooling layers. Convolutional layers in computer vision are normally used together with pooling layers. This is because the convolutional filters are trained to detect the presence of a specific feature in a portion of the image, and if the feature is found, the result will have high positive values. The max pooling layer downsizes the input and captures the highest values. By doing this, it generates a clear indication of whether or not that particular feature was present. However, as happens with traditional convolutions, pooling layers are defined based on the regular pattern of the image. Defining a pooling layer in the graph domain could be useful for building better representations of hierarchical patterns. Gao et al. 2019 proposed a new definition of pooling, gPool. The gPool layer makes a linear projection of the node features, followed by a k-max pooling selection. With the identifiers of the selected nodes, it builds the new (reduced) adjacency matrix and feature matrix. Gao et al. 2019 also propose an unpooling layer, which rebuilds the original network; used together, the two can build an encoder-decoder architecture. The benefit of such an architecture is that it lets the node embeddings be built based on the hierarchical properties of their neighbourhoods.
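The gPool selection step can be sketched in numpy: project features to scalar scores, keep the k highest-scoring nodes, gate their features, and slice the adjacency matrix down to the survivors. The graph, features, and projection vector are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(5, 4))           # node feature matrix (hypothetical)
A = (rng.random((5, 5)) < 0.5).astype(float)
A = np.triu(A, 1)
A = A + A.T                           # symmetric adjacency, no self-loops
p = rng.normal(size=4)                # learnable projection vector of gPool

# gPool: scalar score per node, then k-max selection.
k = 3
scores = X @ p / np.linalg.norm(p)
idx = np.argsort(scores)[-k:]                      # indices of the top-k nodes

X_pool = X[idx] * np.tanh(scores[idx])[:, None]    # gate features by score
A_pool = A[np.ix_(idx, idx)]                       # reduced adjacency matrix
```

The gating with tanh(scores) keeps the selection differentiable with respect to p, so the projection vector can be trained with the rest of the network.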
Autoencoders The autoencoder is a deep learning architecture where the network tries to learn the input, but goes through a compressed state. The network can be divided into two elements:
• the encoder, which can be a regular stack of layers that ends with a vectorised representation of the input, and
• the decoder, where the representation received from the encoder is sized up to the original form.
The idea is that, if the network is able to reconstruct the original input with a small error margin, then most of the information from the original input is correctly compressed in the low-dimensional state at the end of the encoder. Kipf et al. 2016 proposed the Graph Autoencoder (GAE) and the Variational GAE (a probabilistic implementation of the GAE), where the encoder can be any way of building a node embedding, such as a stack of GCN layers, and the decoder is (for the GAE) the reconstructed adjacency matrix Â:

Â = σ(Z Z^T), with Z = GCN(X, A),

where Z represents the embedding matrix built using the GCN and σ is the logistic sigmoid function. The model uses the inner product of the node embeddings to reconstruct a probabilistic adjacency matrix. This transforms the embeddings into edge probabilities for the given node pairs. This is a particularly useful architecture for the task of link prediction.
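The GAE decoder is simple enough to sketch directly in numpy. Here Z stands in for the output of any encoder (e.g. stacked GCN layers) and is random for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Z stands in for the node embedding matrix produced by the encoder.
Z = rng.normal(size=(4, 2))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# GAE decoder: A_hat = sigmoid(Z Z^T), a probabilistic adjacency matrix.
A_hat = sigmoid(Z @ Z.T)

# Entry (u, v) is read as the predicted probability of a link between u and v.
p_link_01 = A_hat[0, 1]
```

Because Z Z^T is symmetric, Â is too, matching the undirected reading of a reconstructed citation link.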
In this paper, we train our models for this task in a transductive setting. We randomly remove some citation links for the test and validation sets, and evaluate how well the reconstructed adjacency matrix can predict those removed links. We use as encoders the GCN, GraphSAGE, GIN, GAT, AGNN, and GraphUNet layers defined above, which constitute the current state of the art in the field.
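The edge-splitting step of this evaluation protocol can be sketched in pure Python. The edge list and split fraction are hypothetical; in the transductive setting the node set stays fixed and only links are hidden.

```python
import random

# Toy citation edge list (hypothetical article ids).
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5), (2, 5), (1, 4)]

rng = random.Random(0)
shuffled = edges[:]
rng.shuffle(shuffled)

n_test = len(edges) // 4          # hold out 25% of the links
test_edges = shuffled[:n_test]
train_edges = shuffled[n_test:]

# The encoder is trained on train_edges only; the reconstructed adjacency
# matrix is then scored (e.g. AUC, average precision) on test_edges plus an
# equal number of sampled non-edges.
```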

Figure 1: Network Statistics, log-log degree distribution and power law fit.

Figure 3: Relative importance of topics per field.

Figure 4: GNN Embedding, with GCN and BERT encoded text. T-SNE projection. Journals from the same field are presented as variations in the luminosity of a similar chromaticity. The size corresponds to the number of citations. 95% ellipses by journal.

Figure 5: Semantic Embeddings. T-SNE projection. Journals from the same field are presented as variations in the luminosity of a similar chromaticity. The size corresponds to the number of citations. 95% ellipses by journal.

11 See Figure E.1 in the Appendix.

Figure 6: Cosine Similarity by collaboration status, controlled by journal.A: Single authorship; B: Collaborations between authors from the same institution; C: Collaboration between authors from different institutions from the same country; D: Collaborations between authors from institutions in different countries.

Figure 7: Average cosine similarity by countries. BERT and GNN. Size by number of citations.

Figure 8: Cosine similarity with the Journal of Informetrics-ISIS dimension.

Figure D.1: Relative importance of topics per journal.

Table 1: Statistics of the data set.

Table 2: Link prediction results. Area Under the Curve and Average Precision. Mean result of 10 runs and standard deviation in parentheses.
The first approaches were based on building sequences using random walks (Perozzi et al. 2014) or recurrent models (Scarselli et al. 2009). After this, graph convolutions were defined using spectral methods (Bruna et al. 2014) and spatial methods (Hamilton et al. 2017a). More recently, new architectures were proposed that incorporate attention mechanisms (Veličković et al. 2018), U-Nets (Gao et al. 2019) and autoencoders (Kipf et al. 2016). For the remainder of this section, we present the intuitions behind these models; for an in-depth literature review, we refer the readers to Bacciu et al. 2020; Hamilton et al. 2017b; Wu et al. 2020; Zhou et al. 2018.