Human-in-the-loop latent space learning for biblio-record-based literature management

Watanabe, Shingo; Ito, Hiroyoshi; Matsubara, Masaki; Morishima, Atsuyuki

doi:10.1007/s00799-023-00389-8


Open access
Published: 04 January 2024

Volume 25, pages 123–136, (2024)

International Journal on Digital Libraries


Abstract

Every researcher must conduct a literature review, and the document management needs of researchers working on various research topics vary. However, there are two major challenges. First, traditional methods such as the tree hierarchy of document folders and tag-based management are no longer effective with the enormous volume of publications. Second, although their bibliographic information is available to everyone, many papers can only be accessed through paid services. This study attempts to develop an interactive tool for personal literature management based solely on their bibliographic records. To make such a tool possible, we developed a principled “human-in-the-loop latent space learning” method that estimates the management criteria of each researcher based on his or her feedback to calculate the positions of documents in a two-dimensional space on the screen. As a set of bibliographic records forms a graph, our model is naturally designed as a graph-based encoder–decoder model that connects the graph and the space. In addition, we also devised an active learning framework using uncertainty sampling for it. The challenge here is to define the uncertainty in a problem setting. Experiments with ten researchers from the humanities, science, and engineering domains show that the proposed framework provides superior results to a typical graph convolutional encoder–decoder model. In addition, we found that our active learning framework was effective in selecting good samples.


1 Introduction

Every researcher must conduct a literature review, and researchers working on different topics have personalized needs in document management. They must organize publications according to their own criteria to find relevant research and understand trends in their field.

However, personalized literature management faces two significant challenges. First, researchers must manage a very large number of publications. Fire [6] found that more than seven million new scholarly studies are now published annually.

Therefore, traditional approaches, such as the tree hierarchy of document folders and tag-based management, are no longer effective. There is a need for automated literature management techniques.

Second, accessing the content of a paper is challenging. Although bibliographic information is available to everyone, many papers can only be accessed through paid services. According to Nicolson et al. [15], 65% of the 100 most cited papers were paywalled. This is a major barrier to researchers accessing relevant papers.

Therefore, methods for automatic literature management that use the literature contents [24, 31, 40] have limited applicability.

With this background, this study attempts to develop an interactive tool for personal literature management based on bibliographic records, without the need to access the contents of papers.^{Footnote 1} The tool asks the researcher to place icons corresponding to papers in a two-dimensional space on the screen using their own criteria, and then predicts the positions where the user would place newly arrived papers. Figure 1 illustrates this process. First, since the relationships among bibliographic records are naturally modeled as a graph, the set of biblio-records is represented as a heterogeneous graph whose nodes correspond to papers, authors, conference names, years, etc. (Fig. 1(1)). The graph connects papers that share the same authors, the same years, and so on. Subsequently, the machine learner that implements our human-in-the-loop latent space learning method (Sect. 4) computes and visualizes the positions in a two-dimensional space on the screen that corresponds to the space for papers that exists in the researcher’s mind (see Fig. 1(2)). Next, the researcher provides feedback on the suggested positions by moving papers that are incorrectly placed according to their criteria to the correct positions. In the feedback phase, the researcher is shown details about each paper, including the title, authors, publication venue, and year. Then, the learner receives the feedback and updates the criteria in the space so that it can correctly predict the positions of newly arrived papers.

The interactive nature not only captures the current latent space of papers in each researcher’s mind, but also allows the system to follow the researcher’s criteria that are evolving over time [1].

Thus, our problem can be considered as latent space learning with a graph convolutional encoder–decoder model [18]. Here, the encoder and decoder map the paper nodes in the graph to points in the latent space and vice versa, and the objective function is the cross-entropy loss for generating adjacency matrices for document clusters in the space. However, existing models do not support our human-in-the-loop approach; that is, they do not allow the user to provide interactive feedback to the latent space. Therefore, we developed a principled “human-in-the-loop latent space learning” method that estimates the management criteria of each researcher based on their feedback on the estimated positions of documents in a two-dimensional space on the screen. Our challenge is how to make the model capture the characteristics of the latent space for literature management.

1.1 Challenges and contributions

(1) We present a principled framework for interactive latent space learning in literature management. It is based on a common graph convolutional encoder–decoder model, in which the criteria for individual literature management are represented by the weights of a set of meta-paths (i.e., sequences of attributes in the schema of the biblio-record data), which are a popular means of capturing the semantics of heterogeneous graphs [24, 32]. Our model is unique in that it is based on the following two assumptions. First, the user’s criteria in the latent space are consistent only locally. This was inspired by results in psychology such as [35]. Thus, our first research question (RQ1) is whether each researcher has different criteria for different sub-spaces in the latent space.

Second, two papers are connected through paths on the graph if they are close to each other in the latent space. Therefore, unlike other popular graph convolutional encoder–decoder models, our decoder is based on the Euclidean distance between the latent vectors. Thus, our second question (RQ2) is whether our decoder is effective.

(2) We show the experimental results for ten academic researchers from the science, engineering, and humanities domains. The results answer the two research questions positively and show that the approach is much superior to a typical graph convolutional model. The resulting quality is practically good in that the method can place a new paper in a position close to the correct one, although not necessarily at the exact position. This implies that our tool can help researchers manage relevant publications based on their own criteria.

(3) Based on the above experimental results, a natural question is whether we can devise an active learning method to improve the learning efficiency. We devised an active learning approach using uncertainty sampling. The challenge here is to define “uncertainty” in our problem setting: what the system learns is the importance of each meta-path for each cluster of each user. Thus, our third research question (RQ3) is whether we can develop an effective active learning method for this setting. We formally define our uncertainty in the setting and then evaluate the framework experimentally. The results show that the uncertainty sampling strategy allows the system to boost the performance compared to random sampling, with statistical significance.

1.2 Limitations

This study does not intend to find the best feature set or the best performance in learning the latent space using bibliographic data that are potentially available to the public.

2 Related work

2.1 Literature management tools

Tools to assist researchers in organizing related papers are widely used, and studies have been conducted on such tools. Francese [7] conducted a survey at the University of Turin to determine how students and researchers manage their bibliographies. The results of the survey showed that EndNote was the most popular bibliography management software for researchers to manage their electronic literature online, used by 49% of the respondents, followed by BibTeX (11%) and Mendeley (9%). In general, such tools can automatically classify documents by objective criteria such as years and authors, but require explicit inputs from users (such as tags given to each paper) to manage them according to the users’ own criteria. By contrast, our system automatically estimates the user’s document management criteria and can map new documents onto the space so that the user can easily grasp how they are related to other papers.

2.2 Document classification, clustering, recommendation

Document classification, clustering and recommendations are of increasing interest because of the increasing number of academic papers that researchers must manage. Various methods have been proposed, such as hierarchical Bayesian clustering [14] and metric learning [25, 37]; however, almost all these approaches use natural language processing methods [12, 30]. Unlike our method, most existing methods classify, cluster, and recommend documents by analyzing the abstracts and content of papers assuming that the document contents can be accessed, which limits their applicability in the current digital library situation.

Studies have been conducted on personalized paper recommendation methods that do not require document contents [16, 21, 24, 36, 38]. Paper recommendation is orthogonal to the latent space learning problem in that the former does not identify any criteria for how researchers manage the papers, and our method does not address the problem of identifying papers to recommend. Combining these two approaches is an interesting topic for future research.

2.3 Active learning

Active learning is utilized in various machine learning techniques, including latent space learning [4, 26]. Typically, sampling strategies for active learning are designed to improve classification and regression performance in terms of their evaluation measures. In contrast, our feedback system on the placement of documents in the latent space serves as an oracle for latent space learning while allowing the criteria for organizing documents in each cluster to evolve with the interactive interface. In addition, some studies have used active learning for graph convolutional encoder–decoder models [2, 3]. In these studies, the information entropy and the graph structure are used to select the most informative nodes for the next iteration. Whereas the methods in these studies ask users about nodes with uncertain labels, our method determines the data to be asked about based on the uncertainty regarding the importance of the meta-paths.

2.4 Latent space learning

Latent space learning has been used to learn data features and comprehend data patterns and/or structural similarities in various contexts. For example, PTE [33] is a semi-supervised latent space learning technique used for textual data. In addition, doc2vec [20] creates representations for each document using latent space learning.

Network-embedding techniques that consider latent semantics have attracted considerable attention for graphs [8, 11, 13, 28, 29]. Some of the techniques, such as DeepWalk [27] and node2vec [9], rely on random walks to produce a distributed representation of nodes; LINE [34] considers and embeds nodes that are indirectly connected by edges; the Kipf and Welling GCN [18] method learns the latent vectors of the nodes while considering the network structure. In addition, to fit autoencoders [17] to network data, the GCN was used in graph autoencoders (GAE) and variational graph autoencoders (VGAE) [19]. Both methods involve a two-layer graph convolutional network and reconstruction of the adjacency matrix using an encoder–decoder algorithm. Our model is unique in that it addresses the local consistency of criteria in the latent space and adopts a distance-based decoder tailored for literature management.

3 Definitions and the problem

Table 1 Notations used in this paper


We discuss our problem using the notations listed in Table 1. First, we define the important concepts we used in the discussion, and then define our problem.

3.1 Heterogeneous information network

Real-world systems, such as bibliographic information networks, are structured into HINs [5, 32]. A heterogeneous information network (HIN) is a special type of network structure that has multiple types of nodes and edges.

Definition 1

(Heterogeneous Information Network) An HIN is defined as a directed graph $G(\mathcal {V}, \mathcal {E})$ with an object-type mapping function $\tau : \mathcal {V} \rightarrow \mathcal {A}$ and a relation-type mapping function $\phi : \mathcal {E} \rightarrow \mathcal {R}$, where $\mathcal {V}$ and $\mathcal {E}$ represent the sets of nodes and edges, and $\mathcal {A}$ and $\mathcal {R}$ are the sets of object types and relation types, respectively. In general, $|\mathcal {A}| + |\mathcal {R}| > 2 $. For example, in a bibliographic information network, there are object types such as paper (P), author (A), term (T), year (Y), and venue (V), and relation types such as “an author published a paper” (A-P) or “a paper is published in a venue” (P-V). By constructing a schema of paths called a meta-path from these types of objects and relations, we can explain the rich semantics of an HIN.

3.2 Meta-path

Intuitively, a meta-path is a sequence of object types that can have an instance in the graph. For example, A-P and P-A-P (abbreviated PAP) are meta-paths. Meta-paths are commonly used to capture the rich semantics of HINs [24, 32].

Definition 2

(Meta-Path) The meta-path P is defined as $ \mathcal {A}_1 \xrightarrow {\mathcal {R}_1} \mathcal {A}_2 \xrightarrow {\mathcal {R}_2} \cdots \xrightarrow {\mathcal {R}_l} \mathcal {A}_{l+1} $ and defines a composite relation $ \mathcal {R} = \mathcal {R}_1 \circ \mathcal {R}_2 \circ \cdots \circ \mathcal {R}_l $ between types $\mathcal {A}_1$ and $\mathcal {A}_{l+1}$, where $\circ $ denotes the composition operator on relations. As this study focuses on the relationships between papers, we consider meta-paths in which both the starting and ending points are papers (P). For example, the meta-path “Paper (P) − Author (A) − Paper (P)” indicates the relationship between papers written by the same author.

3.3 Problem

We assume that a set of documents forms an HIN and that each document has features. In this study, we use the index of each document as its feature, so the features are represented as a matrix $ \textbf{X} \in \mathbb {R}^{|\mathcal {D}|\times |\mathcal {D}|} $. We construct adjacency matrices $ \{\textbf{A}^p\in \mathbb {R}^{|\mathcal {D}|\times |\mathcal {D}|} \}_{p\in \mathcal {P}} $, each of which represents the relationships between documents under a meta-path p. Additionally, user interactions are provided to estimate the user’s document management criteria. These interactions are denoted as a set of tuples $(\vec {z}_{d_i}, \hat{\vec {z}}_{d_i})\in \hat{\mathcal {Z}}$, where $\vec {z}_{d_i}$ represents the initial position of the latent vector of document $d_i$ and $\hat{\vec {z}}_{d_i}$ represents its position after the interaction. We formally define our problem as follows. Given a set of adjacency matrices $\{\textbf{A}^p\}_{p\in \mathcal {P}}$, the document features $\textbf{X}$, and a set of interactions $\hat{\mathcal {Z}}$, we find $\textbf{Z}_\mathcal {Q}$, the set of latent vectors of a set of unknown documents $\mathcal {Q}$.
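As an illustration of how the meta-path adjacency matrices $\textbf{A}^p$ might be obtained from bibliographic records, the following minimal sketch links two papers whenever they share an attribute value. The record fields, the toy data, and the helper name `shared_attribute_adjacency` are hypothetical and chosen only for this example; leaving the diagonal at zero (no self-loops) is likewise an assumption.

```python
from itertools import combinations

# Hypothetical toy records; the field names are assumptions for illustration.
records = [
    {"title": "Paper 0", "authors": ["Sato", "Kim"], "venue": "JCDL", "year": 2019},
    {"title": "Paper 1", "authors": ["Kim"], "venue": "CIKM", "year": 2019},
    {"title": "Paper 2", "authors": ["Lopez"], "venue": "JCDL", "year": 2021},
]


def shared_attribute_adjacency(records, key):
    """Return a 0/1 matrix with entry (i, j) = 1 when papers i and j share a value of `key`."""
    n = len(records)
    adj = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        vi, vj = records[i][key], records[j][key]
        # Scalar attributes (venue, year) are treated as one-element sets.
        si = set(vi) if isinstance(vi, list) else {vi}
        sj = set(vj) if isinstance(vj, list) else {vj}
        if si & sj:
            adj[i][j] = adj[j][i] = 1  # symmetric; diagonal stays 0 (assumption)
    return adj


# One adjacency matrix per meta-path p.
adjacency = {
    "PAP": shared_attribute_adjacency(records, "authors"),
    "PVP": shared_attribute_adjacency(records, "venue"),
    "PYP": shared_attribute_adjacency(records, "year"),
}
```

Here papers 0 and 1 are linked under PAP (shared author) and PYP (same year), while papers 0 and 2 are linked only under PVP (same venue).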

4 Proposed learning method

This section explains our proposed learning method, called ISLE (Interactive latent Space Learning). Algorithm 1 illustrates the structure of ISLE. The components of the algorithm are explained below.

To enable the model to capture the problem of identifying the positions of documents in the latent space in the user’s mind, our method was designed based on two assumptions. First, the criteria for managing documents are local in the space in the user’s mind; when the researcher moves papers to a place near some of the other papers, the criteria are consistent in that neighborhood, but consistency is not guaranteed in other places. Second, two papers are connected through many paths on the graph if they are close to each other in the latent space. Therefore, unlike other popular graph convolutional network-based encoder–decoder models, our model’s decoder is based on the Euclidean distance between the latent vectors.

The learning phase of our proposed framework comprises three steps:

1.
Clustering the latent vectors;
2.
Estimating document management criteria in each cluster;
3.
Learning the latent vectors of documents based on graph autoencoders and obtaining the latent vector for the new document.

These steps are included in the iterations of our human-in-the-loop framework. Each time a user provides feedback, these steps update the clusters and fine-tune the models. Figure 2 illustrates the learning phase in one “move-learn-display” iteration of the framework shown in Fig. 1.

4.1 Clustering the latent vectors

The first step in our proposed method is to cluster the latent space in which the user provides feedback. (This corresponds to Lines 22–23 in Algorithm 1.) We use the k-means clustering method [10]. Clustering by k-means yields a cluster assignment for each document and a center of mass (centroid) for each cluster. The k-means optimization problem is expressed as follows:

$$\begin{aligned} \{\textbf{r}_{d_i}\}_{d_i\in \mathcal {D}},\{\vec {\mu }_k\}_{k\in [n_c]} = \mathop {\mathrm{arg\,min}}\limits _{\{\textbf{r}_{d_i}\},\{\vec {\mu }_k\}} J, \end{aligned}$$

(1)

where $\textbf{r}_{d_i} = (r_{d_i,1},\ldots ,r_{d_i,n_c})^\top $ represents the cluster assignment vector of the document $d_i$ and $n_c$ is the number of clusters. Each element $r_{d_i,j}$ is one if document $d_i$ belongs to cluster j and zero otherwise. $\vec {\mu }_k \in \mathbb {R}^L$ is the centroid vector of cluster k. The objective function J is defined as follows:

$$\begin{aligned} J = \sum _{d_i \in \mathcal {D}}\sum _{k \in [n_c]} r_{d_i,k} \left\Vert \vec {z}_{d_i} - \vec {\mu }_k \right\Vert _2^2, \end{aligned}$$

(2)

where $\vec {z}_{d_i}$ is the latent vector of document $d_i$. After solving the k-means problem, we obtain the k-th cluster as

$$\begin{aligned} \mathcal {C}_k = \{d_i \in \mathcal {D} \mid r_{d_i, k}=1\}. \end{aligned}$$

(3)
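The clustering step in Eqs. (1)–(3) can be sketched with the standard Lloyd iteration, which alternates between the assignment vectors $\textbf{r}_{d_i}$ and the centroids $\vec {\mu }_k$. This is a minimal illustration rather than the implementation used in the paper: the function name `kmeans`, the random initialization, and the fixed iteration count are assumptions.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Lloyd's algorithm: returns (clusters, centroids), clusters as index lists C_k."""
    rng = random.Random(seed)
    # Assumption: initialise centroids with randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: r_{d_i,k} = 1 for the nearest centroid (minimises J for fixed mu).
        for i, p in enumerate(points):
            assign[i] = min(range(n_clusters), key=lambda k: sq_dist(p, centroids[k]))
        # Update step: each centroid becomes the mean of its members (minimises J for fixed r).
        for k in range(n_clusters):
            members = [points[i] for i in range(len(points)) if assign[i] == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    clusters = [[i for i, a in enumerate(assign) if a == k] for k in range(n_clusters)]
    return clusters, centroids


# Two visually separate groups of icon positions in the 2-D space.
positions = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = kmeans(positions, n_clusters=2)
```

Each returned index list plays the role of $\mathcal {C}_k$ in Eq. (3).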

4.2 Estimation of document management criteria in a given cluster

The second step in our proposed method is to estimate the user’s document management criteria in each cluster (corresponding to Lines 24–25 of Algorithm 1). We used meta-paths as the management criteria for documents and weighted the meta-paths based on the user’s space.

We assume that the management criteria are specific to each cluster created by a user. The fundamental idea for determining the weight of a meta-path is that when documents in a cluster are related through a meta-path, the user manages the cluster by considering that meta-path. Based on this insight, we calculate the weight of meta-path p for the k-th cluster from the adjacency matrix as follows:

$$\begin{aligned} w^p_k = \frac{n^p_k}{\sum _{ p \in \mathcal {P} } n^p_k} \end{aligned}$$

(4)

where $n_k^p$ counts the occurrences of meta-path p within cluster k. Concretely, it is the number of documents in cluster k that are connected by meta-path p to at least one document in the same cluster:

$$\begin{aligned} n_k^p = \left| \left\{ d_i \in \mathcal {C}_k \mid \exists d_j \in \mathcal {C}_k : \textbf{A}^p_{d_i, d_j}=1\right\} \right| . \end{aligned}$$

(5)

Once the weights of the meta-paths within a cluster k are determined, the adjacency matrices are weighted accordingly. The weighted adjacency matrix is defined as follows:

$$\begin{aligned} \tilde{\textbf{A}} = \displaystyle \sum _{k \in [n_c]} \displaystyle {\sum _{ p \in \mathcal {P} }} w^p_k \, \textbf{A}_{\mathcal {C}_k}^{p}. \end{aligned}$$

(6)
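A small sketch may help to make Eqs. (4)–(6) concrete. It assumes that $\textbf{A}_{\mathcal {C}_k}^{p}$ denotes $\textbf{A}^p$ restricted to pairs of documents that both lie in cluster $\mathcal {C}_k$, and that a cluster without any links falls back to uniform weights; both choices, as well as the function names, are assumptions made for illustration.

```python
def metapath_weights(adjacency, cluster):
    """Eqs. (4)-(5): normalised counts n_k^p of linked documents per meta-path."""
    counts = {
        # n_k^p: documents in the cluster linked by p to some document in the cluster.
        p: sum(1 for i in cluster if any(adj[i][j] == 1 for j in cluster))
        for p, adj in adjacency.items()
    }
    total = sum(counts.values())
    if total == 0:
        # Assumption: uniform weights when the cluster contains no links at all.
        return {p: 1.0 / len(counts) for p in counts}
    return {p: c / total for p, c in counts.items()}


def weighted_adjacency(adjacency, clusters, n_docs):
    """Eq. (6): sum of cluster-restricted adjacency matrices, weighted by w_k^p."""
    a_tilde = [[0.0] * n_docs for _ in range(n_docs)]
    for cluster in clusters:
        weights = metapath_weights(adjacency, cluster)
        for p, adj in adjacency.items():
            for i in cluster:
                for j in cluster:  # only pairs inside the same cluster (assumption)
                    a_tilde[i][j] += weights[p] * adj[i][j]
    return a_tilde


# Toy example with four documents and two meta-paths.
toy_adjacency = {
    "PAP": [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    "PYP": [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]],
}
toy_clusters = [[0, 1, 2], [3]]
w0 = metapath_weights(toy_adjacency, toy_clusters[0])
a_tilde = weighted_adjacency(toy_adjacency, toy_clusters, n_docs=4)
```

In the first cluster, two documents are linked under PAP and three under PYP, so the weights are 2/5 and 3/5. The edge between documents 2 and 3 is dropped because the two documents lie in different clusters.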

4.3 Learning the latent vector of documents based on graph autoencoders

The third step in our proposed method is to learn the latent vectors of documents based on graph autoencoders. (This corresponds to Lines 26–30 of Algorithm 1.) We used the weighted adjacency matrix $\tilde{\textbf{A}}$ to obtain a latent vector in the latent space for each document. To this end, we constructed a graph convolutional network (GCN)-based encoder–decoder model with supervision from the user’s interactions.

4.3.1 Encoder

Our encoder is a two-layer GCN [18].

Specifically, the latent vectors are calculated as follows:

$$\begin{aligned} \textbf{Z} = GCN_{\phi }(\textbf{X}, \tilde{\textbf{A}}), \end{aligned}$$

(7)

where $\textbf{X}$ is the feature matrix. GCN is defined as

$$\begin{aligned} GCN_{\phi }(\textbf{X}, \tilde{\textbf{A}}) = \hat{\textbf{A}} ReLU(\hat{\textbf{A}} \textbf{X} \textbf{W}^{(0)})\textbf{W}^{(1)} \end{aligned}$$

(8)

with the GCN parameter set $\phi =\left\{ \textbf{W}^{(0)}, \textbf{W}^{(1)}\right\} $, where $\textbf{W}^{(0)}\in \mathbb {R}^{|\mathcal {D}|\times h_1} $ is the weight of the first layer and $\textbf{W}^{(1)} \in \mathbb {R}^{h_1\times L} $ is the weight of the second layer. $\hat{\textbf{A}}$ is the symmetrically normalized adjacency matrix; with $\textbf{D}$ denoting the diagonal degree matrix of $\tilde{\textbf{A}}$, it is defined as

$$\begin{aligned} \hat{\textbf{A}} = \textbf{D}^{-\frac{1}{2}} \tilde{\textbf{A}} \textbf{D}^{-\frac{1}{2}}. \end{aligned}$$

(9)
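The forward pass of Eqs. (7)–(9) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the trained model: the helper names are hypothetical, the weights are fixed by hand instead of being learned, and rows of isolated documents (degree zero) are left as zeros to avoid division by zero.

```python
import math


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(a_tilde):
    """Eq. (9): D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix of A."""
    deg = [sum(row) for row in a_tilde]
    # Assumption: isolated nodes get a zero scaling factor instead of 1/sqrt(0).
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(a_tilde)
    return [[inv_sqrt[i] * a_tilde[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def gcn_encoder(x, a_tilde, w0, w1):
    """Eq. (8): Z = A_hat ReLU(A_hat X W0) W1."""
    a_hat = normalize_adjacency(a_tilde)
    hidden = [[max(0.0, v) for v in row] for row in matmul(matmul(a_hat, x), w0)]
    return matmul(matmul(a_hat, hidden), w1)


# Three documents in a chain 0 - 1 - 2, with index (identity) features X.
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
chain = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
w_first = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]  # |D| x h_1, hand-picked
w_second = [[1.0, 0.0], [0.0, 1.0]]               # h_1 x L with L = 2
z = gcn_encoder(x, chain, w_first, w_second)      # one 2-D latent vector per document
```

Because $\textbf{X}$ is the identity matrix here, the first layer effectively assigns each document its own learnable row of $\textbf{W}^{(0)}$, which is then mixed with the rows of its neighbors.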

The decoder reconstructs the adjacency matrix $\tilde{\textbf{A}}$ by computing the probability $p_{\theta } (\tilde{\textbf{A}}|\textbf{Z}_\mathcal {D})$ of the edge generation based on the latent vector of each document where

$$\begin{aligned} p_{\theta } (\tilde{\textbf{A}}|\textbf{Z}) = \prod _{d_i \in \mathcal {D}}\prod _{d_j \in \mathcal {D}} p_{\theta } (\tilde{\textbf{A}}_{{d_i,d_j}} | \vec {z}_{d_i},\vec {z}_{d_j}). \end{aligned}$$

(10)

The decoder in the generative model was configured using the Euclidean distance between the latent vectors. This is intended to increase the probability of generating edges between documents that are placed closer together because the user provides feedback to the system based on the distance between documents. The decoder is expressed as follows:

$$\begin{aligned} p_{\theta } (\tilde{\textbf{A}}_{{d_i,d_j}} | \vec {z_{d_i}},\vec {z_{d_j}}) = \sigma \left( \frac{a}{\Vert \vec {z_{d_i}}-\vec {z_{d_j}}\Vert ^2_2}+ b\right) , \end{aligned}$$

(11)

where $\sigma (\cdot )$ denotes a sigmoid function and $\theta =\{a, b\}$ denotes a set of parameters used in the decoder.
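To see how the distance-based decoder of Eq. (11) behaves, the following sketch evaluates the edge probability for a near and a far pair of documents. The values of `a` and `b` are arbitrary illustrative choices (in the model they are learned), and the small constant `eps` that guards against division by zero for coincident points is an assumption not stated in the text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def edge_probability(z_i, z_j, a, b, eps=1e-9):
    """Eq. (11): sigma(a / ||z_i - z_j||^2 + b); eps avoids division by zero (assumption)."""
    d2 = sum((u - v) ** 2 for u, v in zip(z_i, z_j))
    return sigmoid(a / (d2 + eps) + b)


# Illustrative parameter values only.
p_near = edge_probability((0.0, 0.0), (1.0, 0.0), a=1.0, b=-1.0)
p_far = edge_probability((0.0, 0.0), (2.0, 0.0), a=1.0, b=-1.0)
```

With a positive `a`, the probability decreases monotonically as the squared distance grows, so documents placed close together are more likely to be connected. The inner-product decoder of VGAE, by contrast, depends on the norms and the angle of the latent vectors rather than on their on-screen distance.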

4.3.2 Objective function

The objective function consists of a cross-entropy term for generating the adjacency matrix and a supervision term from the user’s interactions. Parameters $\varvec{\phi }=\{\phi \}$ and $\varvec{\theta }=\{\theta \}$ are learned to maximize their weighted sum (Eq. (17)).

The cross-entropy used to generate the adjacency matrix is defined as follows:

$$\begin{aligned} \mathcal {L}_{GAE}&= {\sum _{d_i \in \mathcal {D}}\sum _{d_j\in \mathcal {D}}} \tilde{\textbf{A}}_{{d_i,d_j}} \log p_{\theta } \left( \tilde{\textbf{A}}_{{d_i,d_j}} \mid \vec {z_{d_i}},\vec {z_{d_j}}\right) \end{aligned}$$

(12)

$$\begin{aligned}&= \mathbb {E}_{GCN_{\phi }(\textbf{X}, \tilde{\textbf{A}})}[\log p_{\theta } (\tilde{\textbf{A}}|\textbf{Z})] \end{aligned}$$

(13)

Moreover, we define a term that measures the agreement between the user’s feedback and the learned latent vectors, so that maximizing it minimizes their disagreement. We measure this agreement using the conditional probability of generating the latent vectors given the user feedback. The term is defined as follows:

$$\begin{aligned} \mathcal {L}_{feedback}&= \log p\left( GCN_{\phi }(\textbf{X}, \tilde{\textbf{A}}) \mid \hat{\mathcal {Z}} \right) \end{aligned}$$

(14)

$$\begin{aligned}&= \sum _{(\vec {z_{d_i}},\hat{\vec {z_{d_i}}})\in \hat{\mathcal {Z}}} \log \mathcal {N}(\vec {z_{d_i}} \mid \hat{\vec {z_{d_i}}}, \sigma ^2 \textbf{I}) \end{aligned}$$

(15)

$$\begin{aligned}&= - \frac{1}{2\sigma ^2} \sum _{(\vec {z_{d_i}},\hat{\vec {z_{d_i}}})\in \hat{\mathcal {Z}}} \left\Vert \vec {z_{d_i}}-\hat{\vec {z_{d_i}}} \right\Vert ^2_2 + const., \end{aligned}$$

(16)

where $\mathcal {N}(\vec {x} \mid \vec {\mu },\varvec{\Sigma })$ denotes the multivariate normal distribution and $\vec {z_{d_i}}$ represents a latent vector generated by the encoder. The overall optimization problem is defined as follows:

$$\begin{aligned} \varvec{\phi },\varvec{\theta } = \mathop {\mathrm{arg\,max}}\limits _{\varvec{\phi },\varvec{\theta }} \mathcal {L}_{GAE} + \alpha \mathcal {L}_{feedback}, \end{aligned}$$

(17)

where $\alpha $ is a hyper-parameter that balances the two terms.
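The objective in Eq. (17) can be evaluated for a given set of latent vectors as in the following sketch. It makes several simplifying assumptions that should be read as such: the feedback term is taken as the negative squared distance with any constant scale absorbed into $\alpha $, diagonal pairs are skipped because their distance is zero, `eps` guards the division, and the function and argument names are hypothetical.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two latent vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def objective(z, a_tilde, feedback, a, b, alpha, eps=1e-9):
    """L_GAE + alpha * L_feedback (Eqs. (12), (16), (17)) for fixed latent vectors z.

    feedback maps a document index to the position chosen by the user.
    """
    n = len(z)
    l_gae = 0.0
    for i in range(n):
        for j in range(n):
            # Assumption: skip the diagonal and entries without an edge weight.
            if i != j and a_tilde[i][j] != 0.0:
                logit = a / (sq_dist(z[i], z[j]) + eps) + b
                l_gae += a_tilde[i][j] * log_sigmoid(logit)
    # Feedback term: negative squared distance to the user-chosen positions.
    l_feedback = -sum(sq_dist(z[i], target) for i, target in feedback.items())
    return l_gae + alpha * l_feedback


edges = [[0.0, 1.0], [1.0, 0.0]]
user_feedback = {0: (0.0, 0.0)}  # the user placed document 0 at the origin
score_agree = objective([(0.0, 0.0), (1.0, 0.0)], edges, user_feedback, 1.0, -1.0, 100.0)
score_shift = objective([(0.5, 0.0), (1.5, 0.0)], edges, user_feedback, 1.0, -1.0, 100.0)
```

Both configurations keep the two documents at the same mutual distance, so the reconstruction term is identical; the shifted one is penalized only through the feedback term, by $\alpha \cdot 0.5^2 = 25$.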

4.4 Sampling strategy for active learning

In this section, we introduce an active learning sampling strategy for ISLE. (This step corresponds to Lines 20–21 in Algorithm 1.) Our strategy is a type of uncertainty sampling [22], which selects the sample whose answer is expected to be most informative for improving the model. The challenge here is to formalize the notion of uncertainty in our setting. To acquire a user’s literature management criteria efficiently, we ask the user about the documents that matter most for estimating those criteria. Therefore, we define our uncertainty as the uncertainty of the meta-path weights in the clusters, where the weights follow a Dirichlet distribution, and measure the uncertainty as the information entropy of that distribution. The difference between the entropies of the prior and posterior distributions is defined as the information gained by asking the user. In our query strategy, the system is designed to ask the user about the most informative document, regardless of the cluster in which that document will be placed.

The prior distribution of the meta-path weight in the k-th cluster is a Dirichlet distribution, which is defined as follows:

$$\begin{aligned} p(\varvec{\pi } | \textbf{n}_k) = Dir(\varvec{\pi } | \textbf{n}_k) = C_D(\textbf{n}_k) \prod _{p\in \mathcal {P}} \pi ^{n^p_k - 1}_p \end{aligned}$$

(18)

where $C_D(\textbf{n}_k)=\frac{\Gamma (\sum _p n^p_k)}{\prod _p \Gamma (n^p_k)}$ is the normalizing constant, $\textbf{n}_k=(n_k^1,n_k^2,\ldots ,n_k^{|\mathcal {P}|})$, and $n_k^p$ is defined in Eq. (5). After the user provides feedback on the position of a new paper, the posterior distribution is calculated as follows:

$$\begin{aligned} p(\varvec{\pi } | \textbf{n}_k, \textbf{n}_{new})&= \frac{p(\varvec{\pi }, \textbf{n}_{new} | \textbf{n}_k)}{p(\textbf{n}_{new} | \textbf{n}_k)} \propto p(\varvec{\pi }, \textbf{n}_{new} | \textbf{n}_k) \end{aligned}$$

(19)

$$\begin{aligned}&= Multi(\textbf{n}_{new} | \varvec{\pi }) Dir(\varvec{\pi }| \textbf{n}_k) \end{aligned}$$

(20)

$$\begin{aligned}&\propto \prod _{p\in \mathcal {P}} \pi _p^{n^p_{new}} \prod _{p\in \mathcal {P}} \pi _p^{n^p_k-1} \end{aligned}$$

(21)

$$\begin{aligned}&= \prod _{p\in \mathcal {P}} \pi _p^{n^p_k+n^p_{new} - 1} \end{aligned}$$

(22)

$$\begin{aligned}&\propto Dir(\varvec{\pi } | \textbf{n}_k + \textbf{n}_{new}) \end{aligned}$$

(23)

where $\textbf{n}_{new}=(n^1_{new}, n^2_{new},\ldots , n^{|\mathcal {P}|}_{new})$ and $n^p_{new}$ is the number of new paths of meta-path p created in cluster k when a new document $d_{new}$ is added to the cluster:

$$\begin{aligned} n^p_{new} = \left| \left\{ d_i \in \mathcal {C}_k \mid \textbf{A}^p_{d_i, d_{new}}=1\right\} \right| \end{aligned}$$

(24)

Based on the prior and posterior distribution, the information gain in a cluster k is defined as follows:

$$\begin{aligned} \Delta H_k =\,&\mathbb {E}[-\log Dir(\varvec{\pi } | \textbf{n}_k)]\nonumber \\&- \mathbb {E}[-\log Dir(\varvec{\pi } | \textbf{n}_k + \textbf{n}_{new})] \end{aligned}$$

(25)

where the entropy of the Dirichlet distribution is calculated as follows:

$$\begin{aligned}&\mathbb {E}[-\log Dir(\varvec{\pi }|\varvec{\alpha })] \nonumber \\&= -\sum ^{K}_{k=1}(\alpha _k - 1)(\psi (\alpha _k)-\psi (\sum _{i=1}^K \alpha _i)) - \ln C_D(\varvec{\alpha }) \end{aligned}$$

(26)

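where $\psi $ is the digamma function and $\varvec{\alpha }$ denotes the Dirichlet parameter vector (distinct from the hyper-parameter $\alpha $ in Eq. (17)). Based on the decrease in entropy in each cluster k, we define the overall information gain of the queried document as the sum over all clusters: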

$$\begin{aligned} \phi _{gain} = \sum _{k\in [n_c]} \Delta H_k \end{aligned}$$

(27)

Based on the above criterion, we ask the user about the most informative document and obtain feedback.
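The information gain in Eqs. (25)–(27) can be computed with the standard library alone, as in the sketch below. Since Python's `math` module has no digamma function, a simple approximation (recurrence plus asymptotic series) is included. The optional `smoothing` pseudo-count is an assumption added here because a Dirichlet distribution is undefined when a count $n_k^p$ is zero; with `smoothing=0.0` the code follows the equations as written. All function names are hypothetical.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def dirichlet_entropy(alpha):
    """Eq. (26): differential entropy of Dir(alpha)."""
    if any(a <= 0 for a in alpha):
        raise ValueError("Dirichlet parameters must be positive")
    a0 = sum(alpha)
    log_c = math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)  # ln C_D(alpha)
    return -sum((a - 1.0) * (digamma(a) - digamma(a0)) for a in alpha) - log_c


def information_gain(n_k, n_new, smoothing=0.0):
    """Eq. (25): entropy of the prior minus entropy of the posterior for one cluster."""
    prior = [n + smoothing for n in n_k]  # smoothing > 0 is an assumption for zero counts
    posterior = [p + m for p, m in zip(prior, n_new)]
    return dirichlet_entropy(prior) - dirichlet_entropy(posterior)


def total_gain(cluster_counts, new_counts, smoothing=0.0):
    """Eq. (27): sum of the per-cluster information gains for one candidate document."""
    return sum(
        information_gain(n_k, n_new, smoothing)
        for n_k, n_new in zip(cluster_counts, new_counts)
    )


# One cluster, two meta-paths: prior counts (1, 1); the candidate adds three paths of the first.
gain = information_gain([1.0, 1.0], [3.0, 0.0])
```

To select a query, one would evaluate `total_gain` for every candidate document and ask the user about the one with the largest value.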

5 Experiment

We conducted an experiment to answer our three research questions and determine the effectiveness of the method. For RQ1, we compared our method with a variant that assumes the criteria are consistent across the whole latent space. For RQ2, we compared our framework with a baseline, a popular encoder–decoder model for graphs that uses an inner-product-based decoder. For RQ3, we compared the effectiveness of the proposed method using the active query strategy with that using randomly selected queries.

5.1 Settings

5.1.1 Interface

We developed a prototype of a two-dimensional literature management tool to conduct our experiments. Figure 3 shows the actual interface used in the experiments. The interface displays a two-dimensional space in which the participants place the icons of the papers.

5.1.2 Participants

We recruited ten researchers: one from the humanities, two from data engineering, three from human-computer interaction (HCI), and four from machine learning (ML).

5.1.3 Data collection

First, we asked each of the participants to send us the BibTeX records of any 50 papers related to his or her research. Second, we asked them to use our tool, in such a way that the tool shows the biblio-records in random order and the user places each one in the two-dimensional space. As a result, we obtained the history of how they behaved in the 50 iterations, that is, how they moved their papers to incrementally place all 50 papers in their spaces. In this phase, the user sees the title, author, conference, and year of publication of each paper. Figure 4 shows the spaces created by three participants. The distributions of the paper icons and the placement schemes differ among the participants, indicating that document management criteria vary from person to person.

5.1.4 Evaluation

First, we randomly selected 10 of the 50 documents and treated them as the unknown documents, $\mathcal {Q}$. The remaining 40 documents were used as training data: as the user’s interactions with them were replayed one by one, we evaluated whether the documents in $\mathcal {Q}$ could be placed in the positions expected by the user. In the experiments, we set the hyper-parameters to $L=2$, $\alpha =100$, $n_c=6$, and $h_1=4$.

5.1.5 Metrics

We used Recall@k and nDCG@k [23] with $k=6$. Recall@k is expressed using the following equation:

$$\begin{aligned} Recall@k = \frac{|\mathcal {U}\cap \mathcal {P}_k|}{|\mathcal {U}|}, \end{aligned}$$

(28)

where $\mathcal {U}$ denotes the set of the k documents in the latent space that are closest (in Euclidean distance) to the test data placed by the user, and $\mathcal {P}_k$ is the set of the k latent vectors closest to the position of the test data predicted by the model. nDCG@k [23] is obtained by dividing the value of DCG@k by the ideal value of DCG@k, that is, the value obtained if all model predictions are correct. The inverse of the distance from the correct position was used as the relevance value.
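The two metrics can be sketched as follows. This is one possible reading of the definitions above, not the original evaluation code: the ranking for DCG@k is assumed to be the k documents nearest to the predicted position, the standard $1/\log_2(\text{rank}+1)$ discount is assumed, and the small constant `eps` is added only to avoid division by zero. All function names are hypothetical.

```python
import numpy as np


def k_nearest(positions, point, k):
    """Indices of the k rows of `positions` closest to `point` (Euclidean)."""
    dists = np.linalg.norm(np.asarray(positions, float) - point, axis=1)
    return np.argsort(dists, kind="stable")[:k]


def recall_at_k(positions, true_pos, pred_pos, k=6):
    """|U ∩ P_k| / |U|, following Eq. (28)."""
    u = set(k_nearest(positions, true_pos, k).tolist())    # neighbours of the user's placement
    p_k = set(k_nearest(positions, pred_pos, k).tolist())  # neighbours of the prediction
    return len(u & p_k) / len(u)


def ndcg_at_k(positions, true_pos, pred_pos, k=6, eps=1e-9):
    """DCG@k of the predicted neighbourhood divided by the ideal DCG@k."""
    positions = np.asarray(positions, float)
    # Relevance: inverse distance from the correct (user-given) position.
    rel = 1.0 / (np.linalg.norm(positions - true_pos, axis=1) + eps)
    ranked = k_nearest(positions, pred_pos, k)
    discounts = 1.0 / np.log2(np.arange(2, len(ranked) + 2))
    dcg = float(np.sum(rel[ranked] * discounts))
    ideal = np.sort(rel)[::-1][: len(ranked)]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg
```

When the predicted position coincides with the user's position, both functions return 1; a prediction on the opposite side of the space yields a recall of 0.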

5.1.6 Active learning

Since our active learning strategy (Sect. 4.4) requires a seed to choose the next query, we randomly chose the first query in the experiment.

5.2 Baselines and variations

(1) VGAE. VGAE is a popular encoder–decoder model for graphs [19]. In our experiment, VGAE is a variant of ISLE, in which the decoder is replaced by the decoder used in ordinary VGAE. That is, the decoder expressed in Eq. (11) in Sect. 5 is replaced with the inner product of each latent representation, which is expressed as follows:

$$\begin{aligned} p_{\theta } (\textbf{A}_{d_i,d_j} | \vec {z_{d_i}},\vec {z_{d_j}}) = \sigma (\vec {z_{d_i}} \cdot \vec {z_{d_j}}), \end{aligned}$$

(29)

where $\sigma $ denotes the sigmoid function. Note that the meta-paths are considered to create the adjacency matrix, and Step 1 (clustering) (Sect. 4.1) is applied.

Table 2 Meaning of each meta-path

(2) ISLE and VGAE without clustering. To address RQ1, we used variants of VGAE and ISLE that omit Step 1 (clustering) (Sect. 4.1).

(3) ISLE with different sets of meta-paths. The meta-paths used are listed in Table 2. We compared the following five cases for ISLE, while we used all meta-paths for VGAE.

(a) ALL: The adjacency matrix comprises the PAP, PTP, PYP, and PVP meta-paths.
(b) PAP Only: The adjacency matrix is composed of PAP only.
(c) PTP Only: The adjacency matrix is composed of PTP only.
(d) PYP Only: The adjacency matrix is composed of PYP only.
(e) PVP Only: The adjacency matrix is composed of PVP only.

(4) Active ISLE. ISLE with the sampling strategy introduced in Sect. 4.4.
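How a meta-path-based adjacency matrix can be built from biblio-records is sketched below. The sketch assumes that each meta-path links two papers that share at least one attribute value (for example, a common author for PAP) and that the ALL case sums the individual matrices with equal weights. Both are simplifying assumptions for illustration; the function names are hypothetical, and the proposed method estimates its own weights per cluster rather than using equal ones.

```python
import numpy as np


def metapath_adjacency(incidence):
    """Paper-paper adjacency for one meta-path P-X-P.

    `incidence[p][x] == 1` means paper p has attribute value x
    (e.g., author x). Two distinct papers are linked if they share
    at least one attribute value.
    """
    b = np.asarray(incidence, dtype=int)
    shared = b @ b.T                  # number of shared attribute values
    adj = (shared > 0).astype(int)
    np.fill_diagonal(adj, 0)          # no self-loops
    return adj


def combine(adjs, weights=None):
    """Weighted sum of meta-path adjacency matrices (equal weights assumed)."""
    if weights is None:
        weights = [1.0] * len(adjs)
    return sum(w * a for w, a in zip(weights, adjs))


# Toy example with three papers: papers 0 and 1 share an author,
# papers 0 and 2 share a publication year.
pap = metapath_adjacency([[1, 0, 0], [1, 1, 0], [0, 0, 1]])
pyp = metapath_adjacency([[1, 0], [0, 1], [1, 0]])
both = combine([pap, pyp])
```

Restricting the list passed to `combine` to a single matrix corresponds to the "Only" cases above.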

5.3 Results: passive setting

Figures 5a–6b show the results. The solid line in each figure represents the mean, and the shaded area represents the 95% confidence interval. The red lines in each figure indicate the results of our proposed method when the input adjacency matrix consists of ALL, as described in Sect. 5.2. The blue lines in Fig. 5 indicate the results of VGAE with the same ALL input. The yellow and olive lines in Fig. 5 indicate the results of estimating the document management criteria without using clusters in the proposed method. The green, peach, purple, and gray lines in Fig. 6 show the results of the proposed method when only limited types of meta-paths are given as inputs. The figures demonstrate that ISLE outperformed all the other methods and that the accuracy improved as the amount of feedback increased.

Note that in our context, Recall@k indicates how close the predicted position is to the correct position, whereas nDCG@k indicates how well the prediction preserves the order of distances. Unlike in the ordinary information retrieval context, Recall@k is more critical for our problem because the order of distances can change dramatically even if the position moves only slightly.

Figure 6a compares the results with different sets of meta-paths. The results show that ISLE performs best when all four meta-paths are used. As we noted in the limitations (Sect. 1.2), finding the best feature set was not our research question. However, this result implies that researchers are aware of multiple criteria when managing papers and that the proposed method can flexibly express these criteria by using multiple meta-paths.

5.4 Results: active setting

Figure 8 compares the results of ISLE and Active ISLE, where the blue and orange lines show the results for Active ISLE and ISLE, respectively. As Fig. 8 shows, Active ISLE achieves a higher recall value with fewer interactions. This implies that the sampling strategy works well for quickly identifying the criteria for each document cluster.

6 Discussion

6.1 The locality of criteria in the latent space (RQ1)

The results shown in Fig. 5a and b clearly indicate that methods with a clustering phase are superior to those without clustering. This suggests that each researcher's clusters have different sets of weights for meta-paths, which means that researchers use different criteria in different sub-spaces of their latent space. Figure 7 shows the normalized distribution of the meta-path weights in each cluster for three of the ten subjects. Although PTP accounts for a large proportion of their distributions, the weights often differ considerably between clusters, even for the same researcher.

6.2 Effectiveness of the Euclidean distance-based decoder (RQ2)

The idea behind our second assumption is that the user provides feedback in the latent space based on the Euclidean distance rather than the angle between documents (which is the principle of the VGAE’s decoder). Therefore, the ISLE decoder, which calculates the generation probability of edges based on the Euclidean distance between documents, is more accurate. The results presented in Fig. 5a and b clearly support this assumption.
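The difference between the two decoding principles can be illustrated numerically. In the sketch below, `inner_product_edge_prob` follows Eq. (29), whereas `distance_edge_prob` is only a hypothetical stand-in for a distance-based decoder (an exponential decay of the Euclidean distance); it is not the ISLE decoder of Eq. (11), and its `scale` parameter is an assumption.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def inner_product_edge_prob(z_i, z_j):
    """Edge probability of the VGAE decoder, Eq. (29): sigma(z_i . z_j)."""
    return float(sigmoid(np.dot(z_i, z_j)))


def distance_edge_prob(z_i, z_j, scale=1.0):
    """Hypothetical distance-based decoder: exp(-scale * ||z_i - z_j||)."""
    return float(np.exp(-scale * np.linalg.norm(np.subtract(z_i, z_j))))


# Three documents on a 2-D screen: c lies next to a, b lies far away
# from a but in the same direction from the origin.
a = np.array([1.0, 1.0])
b = np.array([5.0, 5.0])
c = np.array([1.2, 0.8])

ip_far, ip_near = inner_product_edge_prob(a, b), inner_product_edge_prob(a, c)
d_far, d_near = distance_edge_prob(a, b), distance_edge_prob(a, c)
```

Here the inner-product decoder assigns a higher edge probability to the distant pair (a, b) than to the neighbouring pair (a, c), while the distance-based stand-in ranks them the other way round, which matches how a user reads proximity on the screen.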

6.3 Effectiveness of the active learning (RQ3)

Figure 8 shows that the accuracy of the active learning method is generally higher than that of the random sampling method. The main objective of introducing active learning is to achieve high accuracy with a small number of interactions by asking the user about an informative node. Figure 8 shows that this objective was achieved. When comparing the recall values at the 10th interaction, a statistically significant difference was observed. This implies the potential effectiveness of query strategies that focus on the uncertainty of meta-path importance. In this experiment, the first data point was randomly selected. An effective way to select the first data point is a subject for future studies.

6.4 Individual difference

We collected data from ten researchers, and Fig. 9 shows how the accuracy changed with each researcher’s feedback. Figure 9a, b shows that the accuracy generally improves as the number of feedback cycles increases. Our findings show that the management criteria of each researcher can be captured using meta-paths, although there are individual differences. In addition, the accuracy of the active learning method was higher than that of the random sampling method for almost all researchers.

The degree of accuracy improvement through interaction varies from user to user. We assume that this is due to the manner in which users create their latent space. Figure 4 illustrates how each subject creates a latent space. If the user has cluster regions that are clearly divided in the latent space, we can estimate the criteria for managing the literature. However, if the clusters are ambiguous and the documents cannot be clustered according to the user’s expected document management criteria, we believe that additional interactions do not dramatically improve the accuracy. The following are possible reasons for the decrease in accuracy during the experiment: (1) the criteria changed during the experiment and the clusters were reconstructed; (2) English papers were mixed with papers in another language.

7 Conclusion and future work

In this study, we proposed a method that estimates a user’s document management criteria based on human-in-the-loop latent space learning.

The experimental results showed that the proposed method placed unknown documents at the user’s desired position more accurately than the baseline method. In addition, experiments with multiple and limited sets of meta-paths showed that the proposed method (ISLE) is more accurate when multiple meta-paths are used, indicating that ISLE is effective even when users manage documents according to various criteria. Based on the above results, we added an active learning framework to estimate a user’s document management criteria with fewer interactions. The experimental results demonstrated the effectiveness of active learning.

In the future, we intend to (1) develop a document management system based on ISLE that can be used in the real world and (2) consider using longer meta-paths. The realization of these goals will not only provide a better method for human-in-the-loop latent space learning but will also support researchers in literature management.

Notes

This paper is an extended version of our ICADL paper [39].

References

Bates, M.J.: The design of browsing and berrypicking techniques for the online search interface. Online Rev. (1989). https://doi.org/10.1108/eb024320
Cai, H., Zheng, V.W., Chang, K.C.C.: Active learning for graph embedding (2017). arXiv preprint arXiv:1705.05085
Chen, X., Yu, G., Wang, J., Domeniconi, C., Li, Z., Zhang, X.: Activehne: active heterogeneous network embedding (2019). arXiv preprint arXiv:1905.05659
Deng, Y., Yuan, Y., Fu, H., Qu, A.: Query-augmented active metric learning. J. Am. Stat. Assoc. (2022). https://doi.org/10.1080/01621459.2021.2019045
Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144 (2017). https://doi.org/10.1145/3097983.3098036
Fire, M., Guestrin, C.: Over-optimization of academic publishing metrics: observing Goodhart’s law in action. GigaScience 8(6), giz053 (2019). https://doi.org/10.1093/gigascience/giz053
Francese, E.: Usage of reference management software at the University of Torino. In: Usage of Reference Management Software at the University of Torino, pp. 145–174 (2013). https://doi.org/10.4403/jlis.it-8679
Fu, X., Zhang, J., Meng, Z., King, I.: Magnn: metapath aggregated graph neural network for heterogeneous graph embedding. In: Proceedings of the Web Conference 2020, pp. 2331–2341 (2020). https://doi.org/10.1145/3366423.3380297
Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016). https://doi.org/10.1145/2939672.2939754
Hartigan, J.A., Wong, M.A.: Algorithm as 136: a k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Applied Statistics) 28(1), 100–108 (1979). https://doi.org/10.2307/2346830
Hsu, Y.L., Tsai, Y.C., Li, C.T.: Fingat: financial graph attention networks for recommending top-k profitable stocks. IEEE Trans. Knowl. Data Eng. (2021). https://doi.org/10.1109/TKDE.2021.3079496
Hu, X., Yoo, I.: A comprehensive comparison study of document clustering for a biomedical digital library medline. In: Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries (JCDL’06), pp. 220–229. IEEE (2006). https://doi.org/10.1145/1141753.1141802
Huang, X., Qian, S., Fang, Q., Sang, J., Xu, C.: Meta-path augmented sequential recommendation with contextual co-attention network. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 16(2), 1–24 (2020). https://doi.org/10.1145/3382180
Iwayama, M., Tokunaga, T.: Hierarchical Bayesian clustering for automatic text classification. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence-Volume 2, pp. 1322–1327 (1995)
Josh, N., Pepe, A.: 65 out of the 100 most cited papers are paywalled (2019). https://www.authorea.com/users/8850/articles/125400-65-out-of-the-100-most-cited-papers-arepaywalled. Accessed 30 June 2022
Kang, Y., Hou, A., Zhao, Z., Gan, D.: A hybrid approach for paper recommendation. IEICE Trans. Inf. Syst. 104(8), 1222–1231 (2021). https://doi.org/10.1587/transinf.2020BDP0008
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes (2013). arXiv preprint arXiv:1312.6114
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2016). arXiv preprint arXiv:1609.02907
Kipf, T.N., Welling, M.: Variational graph auto-encoders (2016). arXiv preprint arXiv:1611.07308
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196. PMLR (2014)
Lee, J., Lee, K., Kim, J.G.: Personalized academic research paper recommendation system (2013). arXiv preprint arXiv:1304.5457
Lewis, D.D.: A sequential algorithm for training text classifiers: corrigendum and additional data. In: Acm Sigir Forum, vol. 29, pp. 13–19. ACM New York (1995). https://doi.org/10.1145/219587.219592
Liang, D., Krishnan, R.G., Hoffman, M.D., Jebara, T.: Variational autoencoders for collaborative filtering. In: Proceedings of the 2018 World Wide Web Conference, pp. 689–698 (2018). https://doi.org/10.1145/3178876.3186150
Ma, X., Wang, R.: Personalized scientific paper recommendation based on heterogeneous graph representation. IEEE Access 7, 79887–79894 (2019). https://doi.org/10.1109/ACCESS.2019.2923293
Mikawa, K., Goto, M.: Regularized distance metric learning for document classification and its application. J. Jpn. Ind. Manag. Assoc. 66(2E), 190–203 (2015). https://doi.org/10.11221/jima.66.190
Nadagouda, N., Xu, A., Davenport, M.A.: Active metric learning and classification using similarity queries. In: Uncertainty in Artificial Intelligence, pp. 1478–1488. PMLR (2023)
Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014). https://doi.org/10.1145/2623330.2623732
Salehi, A., Davulcu, H.: Graph attention auto-encoders (2019). arXiv preprint arXiv:1905.10715.
Salha-Galvan, G., Hennequin, R., Chapus, B., Tran, V.A., Vazirgiannis, M.: Cold start similar artists ranking with gravity-inspired graph autoencoders. In: Fifteenth ACM Conference on Recommender Systems, pp. 443–452 (2021). https://doi.org/10.1145/3460231.3474252
Scharpf, P., Schubotz, M., Youssef, A., Hamborg, F., Meuschke, N., Gipp, B.: Classification and clustering of arxiv documents, sections, and abstracts, comparing encodings of natural and mathematical language. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pp. 137–146 (2020). https://doi.org/10.1145/3383583.3398529
Sherkat, E., Nourashrafeddin, S., Milios, E.E., Minghim, R.: Interactive document clustering revisited: a visual analytics approach. In: 23rd International Conference on Intelligent User Interfaces, pp. 281–292 (2018). https://doi.org/10.1145/3172944.3172964
Sun, Y., Han, J.: Meta-path-based search and mining in heterogeneous information networks. Tsinghua Sci. Technol. 18(4), 329–338 (2013). https://doi.org/10.1109/TST.2013.6574671
Tang, J., Qu, M., Mei, Q.: Pte: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1165–1174 (2015). https://doi.org/10.1145/2783258.2783307
Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015). https://doi.org/10.1145/2736277.2741093
Vlaev, I.: Local choices: rationality and the contextuality of decision-making. Brain Sci. 8(1), 8 (2018). https://doi.org/10.3390/brainsci8010008
Waheed, W., Imran, M., Raza, B., Malik, A.K., Khattak, H.A.: A hybrid approach toward research paper recommendation using centrality measures and author ranking. IEEE Access 7, 33145–33158 (2019). https://doi.org/10.1109/ACCESS.2019.2900520
Wang, J., Wu, S., Vu, H.Q., Li, G.: Text document clustering with metric learning. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 783–784 (2010). https://doi.org/10.1145/1835449.1835614
Wang, W., Tang, T., Xia, F., Gong, Z., Chen, Z., Liu, H.: Collaborative filtering with network representation learning for citation recommendation. IEEE Trans. Big Data 8(5), 1233–1246 (2020). https://doi.org/10.1109/TBDATA.2020.3034976
Watanabe, S., Ito, H., Matsubara, M., Morishima, A.: Bibrecord-based literature management with interactive latent space learning. In: Proceeding of 24th International Conference on Asian Digital Libraries, ICADL, pp. 155–171. Springer (2022). https://doi.org/10.1007/978-3-031-21756-2_13
Wei, C.P., Chiang, R.H., Wu, C.C.: Accommodating individual preferences in the categorization of documents: a personalized clustering approach. J. Manag. Inf. Syst. 23(2), 173–201 (2006). https://doi.org/10.2753/MIS0742-1222230208

Acknowledgements

This work was partially supported by Kumagai Gumi Co., Ltd., JSPS KAKENHI Grant Numbers 22H00508, 22K17944, and 21H03552, and JST CREST. This study was approved by the IRB of the University of Tsukuba. We thank Masao Takaku for his valuable comments.

Author information

Hiroyoshi Ito, Masaki Matsubara and Atsuyuki Morishima have contributed equally to this work.

Authors and Affiliations

University of Tsukuba, Tsukuba, Japan
Shingo Watanabe, Hiroyoshi Ito, Masaki Matsubara & Atsuyuki Morishima

Corresponding author

Correspondence to Shingo Watanabe.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Watanabe, S., Ito, H., Matsubara, M. et al. Human-in-the-loop latent space learning for biblio-record-based literature management. Int J Digit Libr 25, 123–136 (2024). https://doi.org/10.1007/s00799-023-00389-8

Received: 25 July 2023
Revised: 03 November 2023
Accepted: 04 November 2023
Published: 04 January 2024
Issue Date: March 2024
DOI: https://doi.org/10.1007/s00799-023-00389-8
