Gaussian conditional random fields extended for directed graphs
Abstract
For many real-world applications, structured regression is commonly used for predicting output variables that have some internal structure. Gaussian conditional random fields (GCRF) are a widely used type of structured regression model that incorporates the outputs of unstructured predictors and the correlation between objects in order to achieve higher accuracy. However, applications of this model are limited to objects that are symmetrically correlated, while interaction between objects is asymmetric in many cases. In this work we propose a new model, called Directed Gaussian conditional random fields (DirGCRF), which extends GCRF to allow modeling of asymmetric relationships (e.g. friendship, influence, love, solidarity). DirGCRF models the response variable as a function of both the outputs of unstructured predictors and the asymmetric structure. The effectiveness of the proposed model is characterized on six types of synthetic datasets and four real-world applications, where DirGCRF was consistently more accurate than the standard GCRF model and baseline unstructured models.
Keywords
Structured regression, Gaussian conditional random fields, Asymmetric structure, Directed Gaussian conditional random fields
1 Introduction
Structured regression models are designed to use relationships between objects for predicting output variables. In other words, they use both the given attributes and the dependencies between the outputs to make predictions. This prior knowledge about relationships among the outputs is application-specific. For example, relationships between hospitals can be based on the similarity of their specialization (Polychronopoulou and Obradovic 2014), relationships between pairs of scientific papers can be represented as the similarity of their sequences of citations (Slivka et al. 2014), and relationships between documents can be quantified based on the similarity of their contents (Radosavljevic et al. 2014). The Gaussian conditional random fields (GCRF) model is a type of structured regression model that incorporates the outputs of unstructured predictors (based on the given attribute values) and the correlation between output variables in order to achieve higher prediction accuracy. This model was first applied in computer vision (Liu et al. 2007), but since then it has been used in different applications (Polychronopoulou and Obradovic 2014; Radosavljevic et al. 2010; Uversky et al. 2013) and extended for various purposes (Glass et al. 2015; Slivka et al. 2014; Stojkovic et al. 2016). A main assumption in the GCRF model is that if two objects are closely related, they should have similar values of the output variable. The similarity considered in GCRF is symmetric. However, in many real-world networks objects are asymmetrically linked (Beguerisse-Díaz et al. 2014). Therefore, one limitation of the GCRF model is that the direction of a link is neglected.
Networked data (such as social networks, traffic networks, information networks, etc.) are naturally modeled as graphs, where objects are represented as nodes, and relations are represented as edges between nodes. Many of these objects have directed links. For example, friendship strength is often not symmetric. In empirical studies (Michell and Amos 1997; Snijders et al. 2010) of friendship networks, participants are typically asked to identify their friends and to mark how close friends they are, which results in a directed graph in which friendships often run in only one direction between a pair of individuals. Another example is in social networks, such as Twitter or GitHub, where a user could follow all tweets posted by another user, or a developer could follow the work conducted by another developer. Also, in the email system, each individual communicates with one or more individuals by sending and receiving email messages, which results in a directed graph in which each edge has the number of sent emails as its weight.
In this work, we propose a new model, called Directed Gaussian Conditional Random Fields (DirGCRF), which extends the GCRF model by considering asymmetric similarity. DirGCRF models the response variable as a function of both the outputs of unstructured predictors and the asymmetric structure. To evaluate the proposed model, we tested it on both synthetic and real-world datasets and compared its accuracy with that of the standard GCRF, of two unstructured predictors (Neural Networks and Linear Regression), and of the simple Last and Average methods. All datasets and code are publicly available.
The main contributions of this work are:
1. This is the first work that considers asymmetric links between objects in GCRF-based structured regression.
2. The proposed model considers both the asymmetric structure and the outputs of an unstructured predictor.
3. The effectiveness of the proposed directed model is characterized by experiments on six types of synthetic datasets and four real-world applications.
2 Related work
Table 1 Survey of graph models literature
Method                               Network         Year  Purpose          Cold start
-----------------------------------  --------------  ----  ---------------  ------------------
GLS (Aitken 1935)                    Discriminative  1935  Bias reduction   Yes, \(O(f)\)
GCRF (Radosavljevic et al. 2010)     Discriminative  2010  Multiple output  Yes, \(O(n^3)\)
SpGCRF (Wytock and Kolter 2013)      Generative      2013  Multiple output  No
Network lasso (Hallac et al. 2015)   Discriminative  2015  Multiple output  Yes, \(O(nf)\)
None of the above models can handle asymmetric link weights. This work focuses on advancing the GCRF model because it produces high accuracy and is the most scalable learning approach of those listed above (Glass et al. 2015). GCRF has been used in a broad set of applications: climate (Radosavljevic et al. 2010, 2014; Djuric et al. 2015), energy forecasting (Wytock and Kolter 2013; Guo 2013), healthcare (Gligorijevic et al. 2015; Polychronopoulou and Obradovic 2014), speech recognition (Khorram et al. 2014), and computer vision (Tappen et al. 2007; Wang et al. 2014). Other works capture asymmetric dependencies, such as the AsymMRF model (Heesch and Petrou 2010). Since these models are outside the scope of this paper, we refer the reader to Heesch and Petrou (2010) and Wang et al. (2005) for details. Below we give a brief description of CRF and GCRF.
In this work, the restriction to symmetric link weights is relaxed, which alters the model so that it can no longer use a precision matrix. Additionally, convexity is no longer guaranteed. We will show convexity in special cases and demonstrate it empirically in Sect. 4.4.
3 Methodology
The proposed model DirGCRF is described in this section. Since asymmetric influence between objects violates some of the fundamental assumptions of the GCRF model (Radosavljevic et al. 2010), we re-derive the pseudo-Gaussian form and explain where the new formulation differs from the original. Below are the details of the derivation of a new matrix Q.
4 Experiments
4.1 Datasets and experimental setup
4.1.1 Synthetic datasets

Fully connected directed graph: Each pair of distinct nodes is connected by a pair of edges (one in each direction) with different weights.

Directed graph with edge probability p: Directed graphs with different densities. For each pair of distinct nodes, a random number between 0 and 1 is generated. If the number is less than p, the selected node pair is connected with an edge, so that a smaller p yields a sparser graph.

Directed graph without direct loop: Each pair of distinct nodes is connected by a single edge, whose direction is chosen randomly. For example, if there is an edge from node A to node B, there cannot be an edge from node B to node A.

Directed acyclic graph: A graph with no cycles. That is, there is no path that starts from a node A and follows a consistently directed sequence of edges that loops back to node A.

Chain: All nodes are connected in a single sequence, from one node to another.

Binary tree: A graph with a tree structure in which each node could have at most two children.
In all experiments on generated synthetic datasets one graph is used for training and five graphs for testing. For evaluating accuracy, experiments were conducted on graphs with 200 nodes. For testing run time, experiments were conducted on fully connected directed graphs with 500, 1K, 5K, 10K and 15K nodes.
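As an illustration of the six graph types listed above, the following Python sketch builds a weighted adjacency matrix for each structure. It is not the authors' Java generator. The function names, the uniform edge weights, the reading of "pair" as an ordered pair (so that an edge is drawn with probability p), and the parent-to-child direction of tree edges are all assumptions made for this example.

```python
import random


def _weight(rng):
    # Assumed weight distribution: uniform on (0, 1]; the source does not specify one.
    return 1.0 - rng.random()


def _empty(n):
    return [[0.0] * n for _ in range(n)]


def random_directed_graph(n, p, rng):
    """S[i][j] > 0 encodes an edge i -> j. Each ordered pair of distinct
    nodes receives an edge with probability p; p = 1 gives the fully
    connected directed graph."""
    S = _empty(n)
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < p:
                S[i][j] = _weight(rng)
    return S


def without_direct_loop(n, rng):
    """Exactly one edge per unordered pair, with a randomly chosen direction."""
    S = _empty(n)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < 0.5:
                S[i][j] = _weight(rng)
            else:
                S[j][i] = _weight(rng)
    return S


def acyclic(n, p, rng):
    """Edges only run forward along a random node ordering, so no directed cycle can form."""
    order = list(range(n))
    rng.shuffle(order)
    S = _empty(n)
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < p:
                S[order[a]][order[b]] = _weight(rng)
    return S


def chain(n, rng):
    """Single sequence 0 -> 1 -> ... -> n-1."""
    S = _empty(n)
    for i in range(n - 1):
        S[i][i + 1] = _weight(rng)
    return S


def binary_tree(n, rng):
    """Heap-style layout: node k has parent (k - 1) // 2, so each node has at most two children."""
    S = _empty(n)
    for child in range(1, n):
        S[(child - 1) // 2][child] = _weight(rng)
    return S
```

For example, `random_directed_graph(200, 0.2, random.Random(0))` would produce a 200-node graph in which roughly 20% of the possible directed edges are present.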
4.1.2 Real-world datasets
We also evaluated our model on four real-world datasets: Delinquency (Snijders et al. 2010), Teenagers (Michell and Amos 1997), Glasgow (Bush et al. 1997) and Geostep (Scepanovic et al. 2015). The first three datasets contain data about the habits of students (e.g. tobacco and alcohol consumption) and their friendship networks at different observation time points. The Geostep dataset contains data about treasure hunt games. Node attributes, edge weights, and response variables are extracted from the data. All values were normalized to fit in the range from 0 to 1. The experimental procedure and the obtained results are described in more detail in Sects. 4.3.1–4.3.4.
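The text does not state which normalization was applied. One common choice that maps values into the range from 0 to 1 is min-max scaling, sketched below under that assumption; the function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1.
    Min-max scaling is an assumption; the source only says values were
    normalized to the range from 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applied to an alcohol consumption scale from 1 to 5, this maps 1 to 0, 3 to 0.5 and 5 to 1.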
4.1.3 Baselines

GCRF: In order to apply the standard GCRF to the directed graphs, the S matrix was converted from asymmetric to symmetric. In the symmetric matrix each pair of distinct nodes is connected by a single undirected edge, whose weight was calculated as the average of the two weights in the corresponding asymmetric matrix. The Neural Network unstructured predictor was used for both DirGCRF and the standard GCRF.

NN: Neurons in feedforward artificial neural networks are grouped in three layers: input, hidden and output. The number of neurons in the input layer was the same as the number of features in the considered dataset. The number of neurons in the output layer was 1 for all datasets. The number of neurons in the hidden layer was selected based on accuracy on the training data.

LR or MLR: Linear regression or multivariate linear regression is used depending on the number of features in the considered dataset. Coefficients of the predictors were trained on the features of all nodes in the training data, and then applied to the features in the test data to form the prediction.

Last: In the real-world datasets, the graphs evolve over time. Therefore, we consider one simple method, Last, which assigns to the response variables the same values as in the previous time point.

Average: Another simple technique, which predicts the \(\mathbf {y}\) value at each time stamp as the average of the \(\mathbf {y}\) values in all previous time stamps.
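The three simplest steps in the baseline list above (the symmetric conversion of S for GCRF, and the Last and Average predictors) can be sketched in a few lines of Python. The function names and the list-of-lists data layout are assumptions made for illustration and are not taken from the authors' Java code.

```python
def symmetrize(S):
    """Undirected weight for each pair = average of the two directed weights."""
    n = len(S)
    return [[(S[i][j] + S[j][i]) / 2.0 for j in range(n)] for i in range(n)]


def last_baseline(history):
    """history: per-time-point lists of y values, oldest first.
    Predict the next time point by repeating the most recent one."""
    return list(history[-1])


def average_baseline(history):
    """Predict each node's next y as the mean of all its previous y values."""
    n_nodes = len(history[0])
    return [sum(y[i] for y in history) / len(history) for i in range(n_nodes)]
```

Note that `symmetrize` turns a one-way edge of weight 1 into an undirected edge of weight 0.5, so the information about who influences whom is lost before GCRF sees the graph.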
Table 2 Average (± standard deviation) \(R^2\) of DirGCRF and GCRF on different types of asymmetric structures, with parameter values \(\alpha =5\) and \(\beta =1\)
Graph type                           DirGCRF            GCRF
-----------------------------------  -----------------  -----------------
Directed graphs (fully connected)    0.9176 (±0.00625)  0.5893 (±0.02680)
Directed graphs with \(p=0.5\)       0.9799 (±0.00332)  0.6582 (±0.06063)
Directed graphs with \(p=0.2\)       0.9951 (±0.00074)  0.8880 (±0.00846)
Directed graphs without direct loop  0.9865 (±0.00084)  0.4608 (±0.03497)
Acyclic graphs                       0.9881 (±0.00019)  0.2580 (±0.03584)
Chains                               0.9995 (±0.00001)  0.9987 (±0.00009)
Binary trees                         0.9995 (±0.00004)  0.9988 (±0.00008)
All methods are implemented in Java, and experiments were run on Windows with 32 GB of memory (28 GB for the JVM) and a 3.4 GHz CPU. All code is publicly available.^{1}
4.2 Performance on synthetic datasets
4.2.1 Effectiveness of DirGCRF
We first tested the accuracy of the DirGCRF model and compared its performance against the standard GCRF model. Experiments were conducted on all synthetic datasets described in Sect. 4.1.1. The outputs of the unstructured predictor (R) and the similarity matrix (S) are randomly generated. For each type of graph, one graph is used for training the model, and five graphs for testing. All graphs contain 200 nodes. In this experiment \(\alpha \) was set to 5 and \(\beta \) was set to 1. The average \(R^2\) and standard deviations of both models are presented in Table 2.
The results show that DirGCRF produces higher accuracy than the standard GCRF on all synthetic directed graphs. On the fully connected directed graph, the \(R^2\) of DirGCRF is 0.33 higher than that of GCRF. With decreasing probability of edge existence, the graphs become sparser, and the difference in accuracy between DirGCRF and GCRF becomes smaller. For graphs that do not have a direct loop or a cycle, DirGCRF performs much better than GCRF, with \(R^2\) values that are 0.53 and 0.73 higher, respectively, which indicates the superiority of DirGCRF on directed graphs. We also noticed that in all experiments DirGCRF has a very low standard deviation of \(R^2\) (from 0.007 to 0.00004).
The only exceptions are the results on the chains and binary trees, where both algorithms have similar accuracy. This is expected, since these structures are very sparse: every node has at most two nodes that directly affect its output.
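For readers who want to reproduce the summary statistics, the sketch below computes \(R^2\) for one test graph and then the mean and standard deviation over several test graphs. It assumes the usual coefficient of determination, \(R^2 = 1 - \sum (y_i - \hat{y}_i)^2 / \sum (y_i - \bar{y})^2\), and the sample standard deviation; the source does not state either choice explicitly, and the function names are hypothetical.

```python
from statistics import mean, stdev


def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2).
    It equals 1 for a perfect fit, 0 for a predictor no better than the
    mean of y_true, and is negative for a predictor worse than that mean."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def summarize(scores):
    """Mean and sample standard deviation of R^2 over the test graphs."""
    return mean(scores), stdev(scores)
```

Under this definition a negative \(R^2\) is possible, which is how a simple baseline can score below zero on a real-world dataset.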
4.2.2 Accuracy with respect to different \(\alpha \) and \(\beta \) values
The purpose of this experiment is to find out how the values of the \(\alpha \) and \(\beta \) parameters in the data generation process affect the accuracy of the DirGCRF and GCRF models. In this experiment, we tested three different setups to generate synthetic graphs. In the first one, \(\alpha \) has the higher value (\(\alpha = 5\) and \(\beta =1\)), which means that more emphasis is put on the unstructured predictor value and less on the structure. In the second one, both parameters have the same value: \(\alpha = 1\) and \(\beta =1\). In the third one, \(\beta \) has the higher value (\(\alpha = 1\) and \(\beta =5\)), that is, more emphasis is put on the structure.
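To make the roles of \(\alpha \) and \(\beta \) concrete, the sketch below uses the standard symmetric GCRF with a single \(\alpha \) and a single \(\beta \), not the DirGCRF matrix Q derived in Sect. 3. It assumes the energy \(\alpha \sum _i (y_i - R_i)^2 + \beta \sum _{i<j} S_{ij}(y_i - y_j)^2\), whose minimizer \(\mu \) solves \((\alpha I + \beta L)\mu = \alpha R\) with \(L\) the graph Laplacian of S. The function names and the small linear solver are illustrative only.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gcrf_mean(R, S, alpha, beta):
    """Mean of the standard (symmetric) GCRF under the assumed energy:
    Q[i][i] = alpha + beta * sum_j S[i][j], Q[i][j] = -beta * S[i][j],
    and mu solves Q mu = alpha * R."""
    n = len(R)
    Q = [[-beta * S[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = alpha + beta * sum(S[i][j] for j in range(n) if j != i)
    return solve(Q, [alpha * r for r in R])


# Two linked nodes whose unstructured predictions disagree (0 versus 1).
R = [0.0, 1.0]
S = [[0.0, 1.0], [1.0, 0.0]]
for alpha, beta in [(5.0, 1.0), (1.0, 1.0), (1.0, 5.0)]:
    mu = gcrf_mean(R, S, alpha, beta)
    print(alpha, beta, [round(m, 3) for m in mu])
```

In this two-node example the gap between the two predicted values is 5/7 for \(\alpha =5, \beta =1\), 1/3 for \(\alpha =\beta =1\), and 1/11 for \(\alpha =1, \beta =5\): the larger \(\beta \) is relative to \(\alpha \), the more the structure pulls linked outputs together.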
4.2.3 Run time
The time complexity of DirGCRF is the same as that of the standard GCRF (Radosavljevic et al. 2014). If the number of nodes in the training set is N and the learning process lasts T iterations, training the model takes \(O(TN^3)\) time. The main computational cost is matrix inversion.
Table 3 Run time of DirGCRF for different numbers of nodes
No. of nodes  Run time
------------  --------
500           8 s
1000          48 s
5000          2 h
10,000        17 h
15,000        2.2 days
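As a rough consistency check of the reported run times against cubic growth in the number of nodes, the sketch below estimates the empirical scaling exponent between consecutive rows of the run-time table. The conversion of hours and days to seconds and the function name are the only additions; the estimate ignores any change in the number of iterations T between runs.

```python
import math

# (number of nodes, run time in seconds), transcribed from the run-time table.
RUN_TIMES = [
    (500, 8.0),
    (1000, 48.0),
    (5000, 2 * 3600.0),
    (10000, 17 * 3600.0),
    (15000, 2.2 * 86400.0),
]


def scaling_exponents(rows):
    """Exponent k such that time ~ n**k between each pair of consecutive rows."""
    exponents = []
    for (n0, t0), (n1, t1) in zip(rows, rows[1:]):
        exponents.append(math.log(t1 / t0) / math.log(n1 / n0))
    return exponents


for k in scaling_exponents(RUN_TIMES):
    print(round(k, 2))
```

The four estimated exponents lie between roughly 2.6 and 3.1, which is in line with a cost dominated by an \(N^3\) operation such as matrix inversion.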
4.3 Performance on realworld datasets
Table 4 Real-world datasets
Dataset (nodes)  Time points  \(\mathbf {x}\)                         \(\mathbf {y}\)                   S
---------------  -----------  --------------------------------------  --------------------------------  ------------------
Delinquency      4            1. Previous delinquency                 Delinquency level                 Friendship network
(26 students)                 2. Alcohol consumption
Teenagers        4            1. Previous alcohol consumption         Alcohol consumption               Friendship network
(50 teenagers)
Glasgow          3            1. Alcohol consumption                  Tobacco consumption               Friendship network
(129 students)                2. Cannabis consumption
                              3. Romantic relationship
                              4. Pocket money per month
Geostep          N/A          1. No. of clues in social category      Relevance for touristic purposes  Games similarity
(50 games)                    2. No. of clues in business category
                              3. No. of clues in travel category
                              4. No. of clues in irrelevant category
                              5. Privacy scope
                              6. Duration
4.3.1 Delinquency dataset
The goal was to predict the delinquency level for each student. Training was performed on the observation points 2 and 3. Alcohol consumption and previous delinquency level were used as attribute values x. The models were tested on the observation point 4.
From the results presented in Fig. 5 we can see that the DirGCRF model outperforms all competing models. Its accuracy is 8% higher than that of the standard GCRF model and 4% higher than that of the Neural Network, which was the second best model. Multivariate Linear Regression was less accurate, but better than the Last and Average methods, which produced negative \(R^2\) values. The GCRF model produces a lower \(R^2\) than NN, which means that using the converted symmetric friendship network did not help to improve the regression.
4.3.2 Teenagers dataset
The Teenagers (Michell and Amos 1997) dataset^{3} consists of three temporal observations of 50 teenagers (aged 13) in a school in the West of Scotland over a 3-year period (1995–1997). Just as in the Delinquency dataset, the teenagers were asked to identify up to 12 best friends. The total number of edges in these observations was between 113 and 122 (density around 5%). On average, 60% of the teenagers’ friendships were one-directional. The same approach (Eq. 8) as in the Delinquency dataset was used to calculate the similarity matrix. Besides friendship networks, the dataset contains information about each teenager’s alcohol consumption (ranging from 1 to 5). The goal in this dataset was to predict alcohol consumption at observation time point 3, based on the two previous observations.
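The density figure can be checked by hand: a directed graph on 50 nodes has 50 × 49 = 2450 possible edges, so 113 to 122 edges correspond to a density of about 4.6% to 5.0%. The sketch below computes the same quantity, together with the share of one-directional friendships, from a 0/1 adjacency matrix. The function names are hypothetical, and measuring the one-directional share over connected node pairs (rather than over edges) is an assumption, since the text does not define it.

```python
def directed_density(A):
    """Number of directed edges divided by the n * (n - 1) possible ones."""
    n = len(A)
    edges = sum(1 for i in range(n) for j in range(n) if i != j and A[i][j])
    return edges / (n * (n - 1))


def one_directional_share(A):
    """Fraction of connected node pairs that are linked in one direction only
    (assumed definition)."""
    n = len(A)
    one_way = mutual = 0
    for i in range(n):
        for j in range(i + 1, n):
            forward, backward = bool(A[i][j]), bool(A[j][i])
            if forward and backward:
                mutual += 1
            elif forward or backward:
                one_way += 1
    connected = one_way + mutual
    return one_way / connected if connected else 0.0
```

For a three-node example in which nodes 0 and 1 name each other and node 1 also names node 2, the density is 3/6 and half of the connected pairs are one-directional.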
4.3.3 Glasgow dataset

Alcohol consumption (from 1 to 5).

Cannabis consumption (from 1 to 4).

Romantic relationship (indicates whether the student had a romantic relation at the specific time point).

Amount of pocket money per month.
4.3.4 Geostep dataset
Figs. 5, 6, 7 and 8 show that the relative accuracies of DirGCRF, GCRF and NN are consistent across all four real-world datasets. In each dataset DirGCRF has the highest accuracy, while GCRF has lower accuracy than the Neural Network.
4.4 Convexity
5 Conclusions
In this paper, we addressed the problem of using structured regression to predict output variables that are asymmetrically linked. A new model, called directed Gaussian conditional random fields (DirGCRF), is proposed. This model extends the GCRF model by considering asymmetric similarities among objects. To evaluate the proposed model, we tested it on both synthetic and real-world datasets. A significant accuracy improvement is achieved compared to the standard GCRF: from 5 to 19% on real-world datasets and, on average, 30% on synthetic datasets. If the data puts more emphasis on structure than on the values provided by the unstructured predictor, the DirGCRF model even doubles the accuracy of GCRF for some types of directed graphs. The experimental results also confirmed that the simple approach of converting an asymmetric similarity matrix to a symmetric one for GCRF has a negative impact on regression performance. Since this model is implemented in Java, which takes time to handle large matrix computations, our plan for future work is to implement the model in a procedural or functional programming language in order to speed it up and make it more efficient for large datasets. We also plan to apply the DirGCRF model to other real-world applications and to demonstrate that our model can use multiple unstructured predictors (multiple \(\alpha \) parameters) and multiple graphs (multiple \(\beta \) parameters).
Notes
Acknowledgements
This research was supported in part by DARPA Grant FA95501210406 negotiated by AFOSR, NSF BIGDATA Grant 14476570 and ONR Grant N000141512729.
References
Aitken, A. (1935). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48.
Beguerisse-Díaz, M., Garduno-Hernández, G., Vangelov, B., Yaliraki, S. N., & Barahona, M. (2014). Interest communities and flow roles in directed networks: The Twitter network of the UK riots. Journal of the Royal Society Interface, 11(101), 20140940.
Bush, H., West, P., & Michell, L. (1997). The role of friendship groups in the uptake and maintenance of smoking amongst preadolescent and adolescent children: Distribution of frequencies. Working Paper No. 62. MRC Medical Sociology Unit Glasgow.
Djuric, N., Radosavljevic, V., Obradovic, Z., & Vucetic, S. (2015). Gaussian conditional random fields for aggregation of operational aerosol retrievals. IEEE Geoscience and Remote Sensing Letters, 12, 761–765.
Glass, J., Ghalwash, M., Vukicevic, M., & Obradovic, Z. (2015). Extending the modelling capacity of Gaussian conditional random fields while learning faster. In Proceedings 30th AAAI conference on artificial intelligence (AAAI-16), pp. 1596–1602.
Gligorijevic, D., Stojanovic, J., & Obradovic, Z. (2015). Improving confidence while predicting trends in temporal disease networks. In 4th workshop on data mining for medicine and healthcare, SIAM international conference on data mining (SDM).
Guo, H. (2013). Modeling short-term energy load with continuous conditional random fields. In European conference on machine learning and principles and practice of knowledge discovery in databases (ECML/PKDD), pp. 433–448.
Hallac, D., Leskovec, J., & Boyd, S. (2015). Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 387–396). ACM.
Haykin, S. S. (2009). Neural networks and learning machines (Vol. 3). Upper Saddle River: Pearson.
Heesch, D., & Petrou, M. (2010). Markov random fields with asymmetric interactions for modelling spatial context in structured scene labelling. Journal of Signal Processing Systems, 61(1), 95–103.
Khorram, S., Bahmaninezhad, F., & Sameti, H. (2014). Speech synthesis based on Gaussian conditional random fields. In Artificial intelligence and signal processing, pp. 183–193.
Liu, C., Adelson, E. H., & Freeman, W. T. (2007). Learning Gaussian conditional random fields for low-level vision. In Proceedings of CVPR (p. 7). Citeseer.
Michell, L., & Amos, A. (1997). Girls, pecking order and smoking. Social Science & Medicine, 44(12), 1861–1869.
Polychronopoulou, A., & Obradovic, Z. (2014). Hospital pricing estimation by Gaussian conditional random fields based regression on graphs. In 2014 IEEE international conference on bioinformatics and biomedicine (BIBM) (pp. 564–567). IEEE.
Radosavljevic, V., Vucetic, S., & Obradovic, Z. (2010). Continuous conditional random fields for regression in remote sensing. In ECAI, pp. 809–814.
Radosavljevic, V., Vucetic, S., & Obradovic, Z. (2014). Neural Gaussian conditional random fields. In Joint European conference on machine learning and knowledge discovery in databases (pp. 614–629). Springer.
Scepanovic, S., Vujicic, T., Matijevic, T., & Radunovic, P. (2015). Game based mobile learning: Application development and evaluation. In Proceedings of an 6th conference on e-learning, pp. 142–147.
Slivka, J., Nikolić, M., Ristovski, K., Radosavljević, V., & Obradović, Z. (2014). Distributed Gaussian conditional random fields based regression for large evolving graphs. In Proceedings of 14th SIAM international conference on data mining, workshop on mining networks and graphs.
Snijders, T. A., Van de Bunt, G. G., & Steglich, C. E. (2010). Introduction to stochastic actor-based models for network dynamics. Social Networks, 32(1), 44–60.
Stojkovic, I., Jelisavcic, V., Milutinovic, V., & Obradovic, Z. (2016). Distance based modeling of interactions in structured regression, pp. 2032–2038.
Tappen, M. F., Liu, C., Adelson, E. H., & Freeman, W. T. (2007). Learning Gaussian conditional random fields for low-level vision. In IEEE conference on computer vision and pattern recognition (CVPR), pp. 1–8.
Uversky, A., Ramljak, D., Radosavljević, V., Ristovski, K., & Obradović, Z. (2013). Which links should I use?: A variogram-based selection of relationship measures for prediction of node attributes in temporal multigraphs. In Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining (pp. 676–683). ACM.
Wang, S., Wang, S., Greiner, R., Schuurmans, D., & Cheng, L. (2005). Exploiting syntactic, semantic and lexical regularities in language modeling via directed Markov random fields. In Proceedings of the 22nd international conference on machine learning (pp. 948–955). ACM.
Wang, S., Zhang, L., & Urtasun, R. (2014). Transductive Gaussian processes for image denoising. In 2014 IEEE international conference on computational photography (ICCP) (pp. 1–8). IEEE.
Weisberg, S. (2005). Applied linear regression (Vol. 528). Hoboken: Wiley.
Wytock, M., & Kolter, J. Z. (2013). Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. ICML, 3, 1265–1273.