Skip-thought
We begin by introducing the general architecture of the encoder-decoder design (Kiros et al. 2015) that we employ. Here, a recurrent neural network (RNN) with the Gated Recurrent Unit (GRU) architecture (Chung et al. 2014) is used as the encoder, while a pair of RNNs with conditional GRUs are used as decoders. The model is trained using the Adam stochastic optimization algorithm (Kingma and Ba 2015).
The input to the model is a triplet of sequences (si−1,si,si+1), where \(\mathbf {x}^{t}_{i}\) denotes the t-th value in the sequence si. In the case where the sequences are sentences, each input x is simply the embedding of a word in a sentence. The vectors \(\mathbf {x}^{t}_{i}\) that make up the middle sequence, si, are fed sequentially into the encoder. The encoder produces a hidden vector \(\mathbf {h}^{t}_{i}\) at each time step t; this is the information the model has retained after processing the sub-sequence \(\mathbf {x}^{1}_{i},\cdots, \mathbf {x}^{t}_{i}\) and can be thought of as the representation of that sub-sequence. The hidden state \(\mathbf {h}^{N_{i}}_{i}\), where Ni is the length of sequence si, can thus be considered the representation of the entire sequence. Given a sequence to encode, the encoder iterates through the following equations (the subscripts i are dropped for simplicity).
$$\begin{array}{*{20}l} \mathbf{r}^{t} &= \sigma\left(\mathbf{W}_{r}\mathbf{x}^{t} + \mathbf{U}_{r}\mathbf{h}^{t-1}\right) \end{array} $$
(1)
$$\begin{array}{*{20}l} \mathbf{z}^{t} &= \sigma\left(\mathbf{W}_{z}\mathbf{x}^{t} + \mathbf{U}_{z}\mathbf{h}^{t-1}\right) \end{array} $$
(2)
$$\begin{array}{*{20}l} \bar{\mathbf{h}}^{t} &= \text{tanh}\left(\mathbf{W}\mathbf{x}^{t} + \mathbf{U}\left(\mathbf{r}^{t} \odot \mathbf{h}^{t-1}\right)\right) \end{array} $$
(3)
$$\begin{array}{*{20}l} \mathbf{h}^{t} &= \left(1 - \mathbf{z}^{t}\right) \odot \mathbf{h}^{t-1} + \mathbf{z}^{t} \odot \bar{\mathbf{h}}^{t} \end{array} $$
(4)
where rt is the reset gate, zt is the update gate, \(\bar {\mathbf {h}}^{t}\) is the proposed hidden state, and ⊙ denotes the component-wise product. The reset gate rt decides what information to discard from the previous state, the update gate zt controls how much of the proposed state is used to update the hidden vector, and the new hidden vector ht is computed accordingly. The values in rt and zt lie in the range [0,1].
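To make Eqs. (1)–(4) concrete, the following is a minimal NumPy sketch of the encoder pass over one sequence; the weight matrices here are random stand-ins for parameters that, in the actual model, are learned jointly with the decoders using Adam.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_encode(xs, Wr, Ur, Wz, Uz, W, U):
    """Iterate Eqs. (1)-(4) over a list of input vectors xs and return the
    final hidden state, i.e. the representation of the whole sequence."""
    h = np.zeros(Ur.shape[0])                     # h^0
    for x in xs:
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate, Eq. (1)
        z = sigmoid(Wz @ x + Uz @ h)              # update gate, Eq. (2)
        h_bar = np.tanh(W @ x + U @ (r * h))      # proposed state, Eq. (3)
        h = (1 - z) * h + z * h_bar               # new hidden state, Eq. (4)
    return h

# Toy usage with random parameters (learned in practice).
rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
params = [rng.normal(scale=0.1, size=(d_hid, d_in if i % 2 == 0 else d_hid))
          for i in range(6)]                      # Wr, Ur, Wz, Uz, W, U
xs = [rng.normal(size=d_in) for _ in range(5)]    # a length-5 input sequence
h_final = gru_encode(xs, *params)                 # the sequence representation
```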
Two decoders with separate parameters are then used to reconstruct the previous sequence si−1 and the next sequence si+1. The computation of the decoder is similar to that of the encoder, except that the decoders are also conditioned on the final encoder output, i.e., the representation hi (which is \(\mathbf {h}^{N_{i}}_{i}\)). Decoding involves iterating through the following equations. Again the subscript i+1 (similarly, i−1) is dropped.
$$\begin{array}{*{20}l} \mathbf{r}^{t} &= \sigma\left(\mathbf{W}^{d}_{r}\mathbf{x}^{t-1} + \mathbf{U}^{d}_{r}\mathbf{h}^{t-1} + \mathbf{C}_{r}\mathbf{h}_{i}\right) \end{array} $$
(5)
$$\begin{array}{*{20}l} \mathbf{z}^{t} &= \sigma\left(\mathbf{W}^{d}_{z}\mathbf{x}^{t-1} + \mathbf{U}^{d}_{z}\mathbf{h}^{t-1} + \mathbf{C}_{z}\mathbf{h}_{i}\right) \end{array} $$
(6)
$$\begin{array}{*{20}l} \bar{\mathbf{h}}^{t} &= \text{tanh}\left(\mathbf{W}^{d}\mathbf{x}^{t-1} + \mathbf{U}^{d}\left(\mathbf{r}^{t} \odot \mathbf{h}^{t-1}\right) + \mathbf{C}\mathbf{h}_{i}\right) \end{array} $$
(7)
$$\begin{array}{*{20}l} \mathbf{h}^{t}_{i+1} &= \left(1 - \mathbf{z}^{t}\right) \odot \mathbf{h}^{t-1} + \mathbf{z}^{t} \odot \bar{\mathbf{h}}^{t} \end{array} $$
(8)
Here the C matrices bias the decoder computation with the representation produced by the encoder. Note also that the input values x come from the previous time step, since the decoder’s job is to reconstruct the sequence si+1 (similarly, si−1) one step at a time. The probability of value \(\mathbf {x}_{i+1}^{t}\) can then be calculated by
$$\begin{array}{*{20}l} P\left(\mathbf{x}^{t}_{i+1} | \mathbf{x}^{< t}_{i+1}, \mathbf{h}_{i}\right) \propto \text{exp}\left(\mathbf{v}_{\mathbf{x}^{t}_{i+1}} \mathbf{h}^{t}_{i+1}\right) \end{array} $$
(9)
where \(\mathbf {v}_{\mathbf {x}^{t}_{i+1}}\) is the row of the “vocabulary” matrix V corresponding to the input \(\mathbf {x}^{t}_{i+1}\). The vocabulary matrix V is a weight matrix shared by both decoders; it connects the decoder’s hidden state to a distribution over the inputs.
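As a sketch of how the conditioning works in practice, the following NumPy fragment implements one decoder step, Eqs. (5)–(8), together with the output distribution of Eq. (9); the C matrices inject the encoder representation hi at every step, and V is the vocabulary matrix shared by both decoders. All parameter names are illustrative stand-ins for learned weights.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def decoder_step(x_prev, h_prev, h_enc, Wr, Ur, Cr, Wz, Uz, Cz, W, U, C):
    """One conditional-GRU step (Eqs. (5)-(8)), biased by the encoder output h_enc."""
    r = sigmoid(Wr @ x_prev + Ur @ h_prev + Cr @ h_enc)          # Eq. (5)
    z = sigmoid(Wz @ x_prev + Uz @ h_prev + Cz @ h_enc)          # Eq. (6)
    h_bar = np.tanh(W @ x_prev + U @ (r * h_prev) + C @ h_enc)   # Eq. (7)
    return (1 - z) * h_prev + z * h_bar                          # Eq. (8)

def output_distribution(h_dec, V):
    """Eq. (9): softmax of V h over the vocabulary, one row of V per input value."""
    scores = V @ h_dec
    scores -= scores.max()            # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```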
Finally, given a triplet of sequences, the training objective is then given by
$$\begin{array}{*{20}l} \sum\limits_{t} \text{log}P\left(\mathbf{x}^{t}_{i+1} | \mathbf{x}^{< t}_{i+1}, \mathbf{h}_{i}\right) + \sum\limits_{t} \text{log}P\left(\mathbf{x}^{t}_{i-1} | \mathbf{x}^{< t}_{i-1}, \mathbf{h}_{i}\right) \end{array} $$
(10)
which is the sum of the log-probabilities of the values in the previous and next sequences, si−1 and si+1, conditioned on the final representation of si. The total objective is the above summed over all triplets in the training data.
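For a single triplet, Eq. (10) can be evaluated by teacher-forcing each decoder through its target sequence and accumulating log-probabilities. The sketch below does this for one decoder, reusing decoder_step and output_distribution from the fragment above; target_ids (indices into V) and embed (an input embedding matrix) are assumed helpers, not names from the paper.

```python
import numpy as np

def sequence_log_prob(target_ids, embed, h_enc, dec_params, V, d_hid):
    """One term of Eq. (10): sum_t log P(x^t | x^{<t}, h_i) for a single decoder."""
    h = np.zeros(d_hid)                       # initial decoder state
    x_prev = np.zeros(embed.shape[1])         # a zero vector stands in for the start input
    total = 0.0
    for t in target_ids:                      # teacher forcing over the target sequence
        h = decoder_step(x_prev, h, h_enc, *dec_params)
        p = output_distribution(h, V)
        total += np.log(p[t])
        x_prev = embed[t]                     # feed the true previous value at the next step
    return total

# Objective for one triplet: sequence_log_prob over the next sequence plus
# sequence_log_prob over the previous sequence, each with its own decoder parameters.
```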
Skip-graph
We now discuss how an encoder-decoder can be used to learn useful representations for sub-sequences derived from walks over a labeled graph. Figure 3 shows an example of how a random walk over a graph can be fed into the encoder-decoder introduced above.
Training set generation
Given a set of graphs \(\mathcal {D}\), a sample size K, a minimum random walk length lmin, and a maximum random walk length lmax, we take each graph \(\mathcal {G} \in \mathcal {D}\) and generate K random walk sequences. Specifically, for a graph \(\mathcal {G}\), K sequences of the form
$$ \ell_{v}\left(v_{1}\right),\cdots, \ell_{v}\left(v_{k}\right), \ell_{v}\left(v_{k+1}\right),\cdots, \ell_{v}\left(v_{k + k^{'}}\right), \ell_{v}\left(v_{k + k^{'} + 1}\right),\cdots, \ell_{v}\left(v_{k + k^{'} + k^{\prime\prime}}\right) $$
(11)
are generated. Here, \(v_{1} \in \mathcal {V}\) is a randomly selected start node, \(\left(v_{i}, v_{i+1}\right) \in \mathcal {E}\) for \(i = 1,\cdots, k + k^{'} + k^{\prime\prime} - 1\), and \(l_{min} \leq k, k^{'}, k^{\prime\prime} \leq l_{max}\). Each sequence can then be split into a triplet of sub-sequences with \(s_{1} = \ell_{v}(v_{1}),\cdots, \ell_{v}(v_{k})\), \(s_{2} = \ell_{v}\left(v_{k+1}\right),\cdots, \ell_{v}\left(v_{k + k^{'}}\right)\), and \(s_{3} = \ell_{v}\left(v_{k + k^{'} + 1}\right),\cdots, \ell_{v}\left(v_{k + k^{'} + k^{\prime\prime}}\right)\).
When generating sequences, \(k, k^{'}\), and \(k^{\prime\prime}\) are randomly drawn to lie between the constraints lmin and lmax each time. This ensures that the sub-sequences need not have fixed lengths and can instead vary, allowing graph sub-structures or regions of varying sizes to be easily processed by the model.
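A minimal sketch of this sampling step, assuming each graph is given as an adjacency list adj (mapping every node to its list of neighbours, each node having at least one neighbour) together with a mapping label implementing ℓv:

```python
import random

def sample_triplet(adj, label, l_min, l_max):
    """Sample one random walk of length k + k' + k'' and split its label
    sequence into the triplet (s1, s2, s3) of Eq. (11)."""
    k, k2, k3 = (random.randint(l_min, l_max) for _ in range(3))
    v = random.choice(list(adj))              # random start node v_1
    walk = [v]
    for _ in range(k + k2 + k3 - 1):          # follow randomly chosen edges
        v = random.choice(adj[v])
        walk.append(v)
    labels = [label[u] for u in walk]
    return labels[:k], labels[k:k + k2], labels[k + k2:]

def build_training_set(graphs, K, l_min, l_max):
    """Generate K triplets for every (adj, label) pair in the dataset."""
    return [sample_triplet(adj, label, l_min, l_max)
            for adj, label in graphs for _ in range(K)]
```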
In the above formulation, we assume that only the vertices of the graph are labeled and that node and edge features are not given. When nodes, or edges, are labeled and feature vectors are provided, we can use a one-hot embedding to represent each unique combination of labels and features. However, this treats each distinct combination as a unique “word” and does not capture the relationship between nodes or edges that share labels or certain features. A better approach is to use a one-of-\(|\mathcal {L}|\) vector to encode the label and concatenate it with the feature vector; this allows the node or edge embedding to capture shared features and labels.
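For example, with |L| distinct node labels and an f-dimensional feature vector per node, the concatenated embedding can be built as in this small sketch (the one-hot-per-combination alternative is omitted):

```python
import numpy as np

def node_embedding(label_index, features, num_labels):
    """Concatenate a one-of-|L| label encoding with the node's feature vector,
    so nodes sharing a label or individual features overlap in their embeddings."""
    one_hot = np.zeros(num_labels)
    one_hot[label_index] = 1.0
    return np.concatenate([one_hot, np.asarray(features, dtype=float)])

# e.g. label 2 out of 5 possible labels plus a 3-dimensional feature vector
x = node_embedding(2, [0.7, 1.0, 0.0], num_labels=5)   # an 8-dimensional input
```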
Once all the triplets of random walk sequences have been generated, they can be used to train the encoder-decoder in an unsupervised fashion. The intuition behind this is quite simple: if the encoder-decoder is trained on a large number of random walks, the sub-sequences corresponding to graph sub-structures that frequently co-appear will have similar learned embeddings. This allows us to learn more compact representations for sub-structures, since the different sub-structures are not considered independently of one another. Figure 5 illustrates this idea.
Obtaining final graph representation
After the encoder-decoder has been trained, we can freeze the model and use the encoder to generate a representation, hi, for any arbitrary random walk sequence. Ultimately, however, we are interested in obtaining representations for entire graphs, so we try several strategies for aggregating the encoder representations obtained from a set of independent random walks sampled from a given graph. Sampling multiple short walks from a graph allows us to obtain a relatively accurate profile of it; the resulting representations can then be aggregated into a representation of the graph as a whole. While one could certainly try more sophisticated aggregation approaches, such as a neural network that learns a final graph representation from the sampled representations, we choose relatively simple aggregation techniques, listed below, to highlight the usefulness of the model; a short sketch of these aggregators follows the list.
1. Single walk: In this approach, we do not use several encoder representations. Instead, we train the model on long (relative to the size of the graphs in the dataset) random walk sequences and use a single long walk over the graph to obtain its representation.
2. Average: We compute the component-wise average of the encoder representations of the sampled random walk sequences. This is then used as the graph representation.
3. Max: As in (Kiela and Bottou 2014), we take the component-wise absolute maximum of all encoder representations.
4. Cluster: The encoder representations are first fed to a clustering method such as K-means (Hamerly and Elkan 2003), and we use the cluster assignments to create a bag-of-clusters vector that serves as the graph’s representation.
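The following sketch, referenced above, illustrates the Average, Max, and Cluster aggregators applied to a matrix H whose rows are the encoder representations of the sampled walks; K-means is taken from scikit-learn here purely for illustration, and the parameter names are assumptions rather than the paper's notation.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_average(H):
    """Component-wise mean of the walk representations (rows of H)."""
    return H.mean(axis=0)

def aggregate_max(H):
    """Component-wise absolute maximum of the walk representations."""
    return np.abs(H).max(axis=0)

def aggregate_cluster(H, n_clusters=10, kmeans=None):
    """Bag-of-clusters histogram: assign each walk representation to a cluster
    (ideally using a K-means model fit on training representations) and count."""
    if kmeans is None:
        kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(H)
    assignments = kmeans.predict(H)
    return np.bincount(assignments, minlength=n_clusters).astype(float)
```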
The procedure for obtaining the graph embeddings is summarized in Algorithm 1. The calculated graph embeddings can now be used with any off-the-shelf classifier.
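Putting the pieces together, the fragment below is a rough sketch of the inference-time procedure in the spirit of Algorithm 1 (not the algorithm verbatim): sample K′ walks, encode each with the frozen encoder, then aggregate. Here encode_walk is an assumed handle to the trained encoder, while sample_triplet and aggregate_average come from the sketches above.

```python
import numpy as np

def graph_embedding(adj, label, encode_walk, aggregate, K_prime, l_min, l_max):
    """Sample K' random walks from a graph, encode each with the trained (frozen)
    encoder, and aggregate the resulting representations into one graph embedding."""
    H = []
    for _ in range(K_prime):
        s1, s2, s3 = sample_triplet(adj, label, l_min, l_max)
        H.append(encode_walk(s1 + s2 + s3))       # encode the full sampled walk
    return aggregate(np.vstack(H))

# e.g. g = graph_embedding(adj, label, encode_walk, aggregate_average, 25, 5, 10)
```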
Time complexity
The overall time it takes to train an encoder-decoder model depends on two things: the size of the training set \(\mathcal {D}\) and the average length of the walks in each triplet. In previous work, an encoder-decoder was trained on a very large dataset of 74,004,228 sequences with an average length of 13, demonstrating that the model can be trained in a reasonable amount of time on large datasets (Kiros et al. 2015).
Once the unsupervised training of the model is complete, we can proceed to compute a graph’s embedding (even for an unseen sample) in time \(\mathcal {O}(K^{\prime } \cdot T \cdot d^{2})\). As mentioned previously, K′ is the number of random walks we use to calculate the final graph embedding, T is the average length of the random walks, and d is the embedding size (for simplicity we assume that the input size is equal to the embedding size).