From Relational Data to Graphs: Inferring Significant Links Using Generalized Hypergeometric Ensembles
Abstract
The inference of network topologies from relational data is an important problem in data analysis. Exemplary applications include the reconstruction of social ties from data on human interactions, the inference of gene co-expression networks from DNA microarray data, or the learning of semantic relationships based on co-occurrences of words in documents. Solving these problems requires techniques to infer significant links in noisy relational data. In this short paper, we propose a new statistical modeling framework to address this challenge. The framework builds on generalized hypergeometric ensembles, a class of generative stochastic models that give rise to analytically tractable probability spaces of directed, multi-edge graphs. We show how this framework can be used to assess the significance of links in noisy relational data. We illustrate our method in two data sets capturing spatio-temporal proximity relations between actors in a social system. The results show that our analytical framework provides a new approach to infer significant links from relational data, with interesting perspectives for the mining of data on social systems.
Keywords
Statistical analysis, Graph theory, Network inference, Statistical ensemble, Relational data, Graph mining, Graph analysis, Network analysis, Social network, Social network analysis, Community structures, Data mining, Social interactions
1 Motivation
Advances in data sensing and collection give rise to an increasing volume of data that capture dyadic relations between elements or actors in social, natural, and technical systems. While it is common to apply graph mining and network analysis to such relational data, it is often questionable whether the application of these techniques is actually justified. Consider, for instance, various forms of time series data, which not only tell us which elements of a complex system are related but also when or in which order relations occur. Such data give rise to temporal networks, which question the application of widely used network-based modeling and data mining techniques [13, 24, 26, 27, 30]. Apart from temporal information, we often have access to data that capture multiple types of relations or interactions. The resulting multi-layer network topologies give rise to complications that threaten standard techniques, e.g., to infer and analyze social networks, detect community structures, or to model and control dynamical processes in networked systems [3, 7, 16, 28, 35].
The challenges outlined above are due to the growing availability of additional information (such as time-stamped, sequential or multi-dimensional relational data), which must be incorporated into network-based techniques to model and analyze relational data. However, we are often confronted with situations in which we lack information that is needed to interpret observed relations. Consider, for instance, data sets that capture the simultaneous presence of two users at the same location, the joint expression of two genes in a DNA microarray, or the co-occurrence of two words in the same document. Each of these observed relations can be due to an underlying social tie, a functional relationship between genes, or a semantic link between two words, or it could simply have occurred by mere chance. Rather than naïvely analyzing such data from the perspective of graphs or networks, we should thus treat them as noisy observations that may or may not indicate true relations between a system’s elements.
Principled and efficient methods to solve this network inference problem are of major importance for the modeling and analysis of social networks, the reconstruction of biological networks, and the mining of semantic structures in information systems. The problem has received significant attention from the data mining and machine learning community, as well as from researchers in graph theory and network science. Especially in the latter community, the problem is commonly addressed using statistical ensembles, i.e., generative stochastic models of graphs that can be used for inference, learning and modeling tasks. A common issue of these techniques is that the underlying statistical ensembles are not analytically tractable, thus requiring time-consuming numerical simulations and Monte-Carlo sampling techniques.
To address this problem, in this short paper we propose generalized hypergeometric ensembles (gHypE), a novel framework of statistical ensembles to infer significant links in relational data. The framework can be viewed as a generalization of the configuration model, which is commonly used to generate random graph topologies with a given sequence of node degrees. Our framework extends this state-of-the-art graph-theoretic approach in two ways. First, it provides analytically tractable probability spaces of directed and undirected multi-edge graphs, eliminating the need for expensive numerical simulations. Second, it allows us to account for known factors that influence the occurrence of interactions, such as known group structures, similarities between elements, or other forms of biases. We demonstrate our framework in two real-world data sets that capture spatio-temporal proximities of actors in a social system. The results show that our framework provides interesting new perspectives for mining and learning in graphs.
2 Background and Related Work
The problem of inferring significant links in relational data has been addressed in a number of works. In the following, we coarsely categorize them into three lines of research.
Applying predictive analytics techniques, a first set of works studied the problem from the perspective of link prediction [17]. In [29], a supervised learning technique is used to predict types of social ties based on unlabeled interactions. The authors of [25] show that tensor factorization techniques make it possible to infer international relations from data that capture how often two countries co-occur in news reports. In [33], a link-based latent variable model is used to predict friendship relations using data on social interactions.
Using the special characteristics of time-stamped social interactions or geographical co-occurrences, a second line of works has additionally accounted for spatio-temporal information. Studying data on time-stamped proximities of students at MIT campus, the authors of [8] show that the temporal and spatial distribution of proximity events makes it possible to infer social ties with high accuracy. In [5], a model that captures location diversity, regularity, intensity and duration is used to predict social ties based on co-location events. An entropy-based approach that takes into account the diversity of the locations of interactions has been used in [22].
Addressing scenarios where neither training data nor spatio-temporal information is available, a third line of works is based on generative models for random graphs. Such models can be used as null models for observed dyadic interactions, which help us to assess whether the relations between a given pair of elements occur significantly more often than expected. Existing works in this area typically rely on standard modeling frameworks, such as exponential random graphs [4, 23], or the configuration model for graphs with given degree sequence or distribution [18]. On the one hand, these approaches provide statistically principled network inference and learning methods for general relational data [2, 12, 19, 32]. On the other hand, the underlying generative models are often not analytically tractable, thus requiring expensive numerical simulations [19, 23]. Proposing a framework of analytically tractable generative models for directed and undirected multi-edge graphs, in this work we close this research gap.
3 Generalized Hypergeometric Ensembles
In the following we introduce our framework step by step. For this, let us first consider a data set consisting of repeated dyadic interactions (i, j), which have been observed between two nodes i and j. Such a data set can be represented as a multi-edge, or weighted, network \(G=(V,E)\), where V is a set of n nodes, and \(E \subseteq V \times V\) is a multi-set of (directed or undirected) edges. Let us further define an adjacency matrix \(\hat{\mathbf {A}}\), where entries \(\hat{A}_{ij}\in \mathbb N_{0}\) capture the weight of an edge \((i,j)\in V \times V\), i.e., the multiplicity of an edge (i, j) in the multi-set E. For each node \(i \in V\) we further define the (weighted) in-degree \(\hat{k}_{\mathrm {in}}(i) := \sum _{j \in V} \hat{A}_{ji}\) and the (weighted) out-degree \(\hat{k}_{\mathrm {out}}(i) := \sum _{j \in V} \hat{A}_{ij}\).
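The definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `adjacency_and_degrees` and the toy interaction list are hypothetical, and the dyads are treated as directed.

```python
from collections import Counter

def adjacency_and_degrees(interactions, nodes):
    """Build the weighted adjacency matrix and weighted degrees.

    interactions: iterable of directed dyads (i, j); repeated dyads are multi-edges.
    nodes: list of node labels that fixes the row/column order.
    """
    index = {v: pos for pos, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    # Entry A[i][j] is the multiplicity of edge (i, j) in the multi-set E.
    for (i, j), weight in Counter(interactions).items():
        A[index[i]][index[j]] = weight
    # Weighted out-degree: row sums; weighted in-degree: column sums.
    k_out = [sum(row) for row in A]
    k_in = [sum(A[r][c] for r in range(n)) for c in range(n)]
    return A, k_in, k_out

# Hypothetical toy data: five observed interactions among three nodes.
interactions = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
A, k_in, k_out = adjacency_and_degrees(interactions, ["a", "b", "c"])
```

Both degree sequences sum to the total number of observed interactions, which is a quick sanity check on any input data.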
Rather than directly applying graph mining and learning techniques to such a weighted graph G, in the following we are interested in a crucial question: Which of the links between nodes are significant, i.e., which of the observed weights \(\hat{A}_{ij}\) go beyond what is expected at random, given (i) the total number of observed interactions, and (ii) the number of times individual nodes engage in interactions? To answer this question, we take the common approach of defining a stochastic model that generates a so-called statistical ensemble, i.e., a probability space of graphs. Different from existing approaches, where link weights are assumed to be continuous (e.g. [1, 6]), we are interested in a statistical ensemble that (i) can handle directed and multi-edge graphs, (ii) is analytically tractable, and (iii) thus allows us to assess the significance of links in a theoretically principled way.
Our construction of a statistical ensemble follows the general idea of the Molloy-Reed configuration model, which is to randomly shuffle the topology of a given network G while preserving the observed node degrees. For this, the configuration model generates edges between randomly sampled pairs of nodes in such a way that the exact observed degrees of all nodes are preserved. Different from this approach, we assume a sampling of m multi-edges such that the sequence of expected degrees of nodes is preserved. For this, for each pair of nodes i and j, we first define the maximum number \(\varXi _{ij}\) of multi-edges that can possibly exist between nodes i and j as \(\varXi _{ij} := \hat{k}_{\mathrm {out}}(i) \hat{k}_{\mathrm {in}}(j)\) (cf. [15, 20]). The maximally possible numbers of links between all pairs of nodes can then be conveniently represented in matrix form as \(\mathbf {\Xi } := \left( \varXi _{ij}\right) _{i,j \in V}\).
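As a small worked example of the matrix \(\mathbf {\Xi }\), the following sketch computes it from given degree sequences. The function name `max_multi_edges` and the degree values are hypothetical; the sketch also assumes a directed graph in which self-loops are allowed, so the diagonal entries are kept.

```python
def max_multi_edges(k_out, k_in):
    """Return Xi with Xi[i][j] = k_out[i] * k_in[j] (outer product of the degrees)."""
    return [[ko * ki for ki in k_in] for ko in k_out]

# Hypothetical weighted degrees of a three-node graph with m = 5 multi-edges.
k_out = [3, 1, 1]
k_in = [1, 2, 2]
Xi = max_multi_edges(k_out, k_in)

# Total number of possible multi-edges. Because both degree sequences
# sum to m, this total equals m**2 when the diagonal is included.
M = sum(sum(row) for row in Xi)
```

Under these assumptions, the sum over all entries of \(\mathbf {\Xi }\) factorizes into the product of the two degree sums, which is why it equals \(m^2\).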
Note that for the special case of a uniform dyadic propensity matrix \(\mathbf {\Omega } \equiv \text {const}\), we recover Eq. 1 for the unbiased case, i.e., where all dyadic propensities are identical. We thus obtain a general framework of statistical ensembles which (i) allows us to encode arbitrary a priori tendencies of nodes to interact, and (ii) provides an analytical expression for the probability of observing a given number of interactions between any pair of nodes.
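To make the significance test concrete for the unbiased case, the sketch below computes a one-sided p-value for a single link. It rests on an explicit assumption: that the unbiased ensemble referred to as Eq. 1 is a multivariate hypergeometric distribution in which m multi-edges are drawn without replacement from \(M = \sum _{i,j} \varXi _{ij}\) possible ones, so that each entry has the marginal \(\Pr (A_{ij}=a) = \binom{\varXi _{ij}}{a}\binom{M-\varXi _{ij}}{m-a} / \binom{M}{m}\). The function name `link_p_value` and all numeric values are hypothetical. The biased case with non-uniform \(\mathbf {\Omega }\) requires a non-central hypergeometric distribution and is not covered here.

```python
from math import comb

def link_p_value(a_obs, xi, M, m):
    """P(A_ij >= a_obs) for a hypergeometric marginal.

    Assumed model: m multi-edges drawn without replacement from M possible
    ones, of which xi connect the pair (i, j).
    """
    upper = min(xi, m)
    # math.comb returns 0 when k > n, so impossible outcomes contribute nothing.
    tail = sum(comb(xi, a) * comb(M - xi, m - a) for a in range(a_obs, upper + 1))
    return tail / comb(M, m)

# Hypothetical toy values: m = 5 observed multi-edges, M = 25 possible ones,
# and a pair (i, j) with Xi_ij = 6 and observed weight 2.
M, m = 25, 5
xi_ab, a_ab = 6, 2
alpha = 0.01  # significance threshold

p_ab = link_p_value(a_ab, xi_ab, M, m)
is_significant = p_ab < alpha
```

Applying this test to every pair of nodes and keeping only the entries with p-values below the threshold yields a filtered adjacency matrix. In this toy example the link is not significant, because an observed weight of 2 is well within what the degrees alone would produce.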
4 Inferring Significant Social Ties
Fig. 1. Illustration of our approach in the (RM) data set capturing proximity of students and staff at MIT campus. For the observed weighted adjacency matrix (a) and a given significance threshold, our framework allows us to establish a high-pass noise filter matrix (b), which can be used to obtain a filtered adjacency matrix containing only significant links (c). A visual comparison of the output of a community detection algorithm on the unfiltered (d) and filtered (f) graphs shows that the partitions detected in the filtered graph correspond better to the ground truth lab affiliations and classes (e). Panels: (a) unfiltered weighted adjacency matrix; (b) high-pass noise filter matrix; (c) filtered adjacency matrix containing only significant links; (d) unfiltered graph; (e) comparison of ground truth lab affiliations (center column) vs. detected communities in the unfiltered (left column) and filtered (right column) graph; (f) filtered graph.
Fig. 2. Observed (a) and filtered (b) weighted graphs for the (ZKC) data set, capturing encounters between members of a Karate club. The filtered graph shows that most of the observed encounters can be explained by random effects resulting from the club members’ separation into two classes.
A major advantage of gHypEs is that, by specifying a non-uniform matrix \(\mathbf {\Omega }\), we can additionally encode known factors that influence the occurrence of interactions between nodes, while still obtaining an analytically tractable ensemble. In our second illustrative example, we use this to encode the known structure of two separate Karate classes in the (ZKC) data. These two classes naturally influence the frequency of encounters between actors beyond what would be expected “at random”. We incorporate this prior knowledge via a block matrix \(\mathbf {\Omega }\) that assigns higher dyadic propensities to pairs of actors in the same class (cf. [3]). This approach allows us to establish a “random baseline” that accounts for both (i) combinatorial effects due to heterogeneous node degrees, and (ii) the known group structure in the data. Using a significance threshold of \(\alpha =0.01\), for (ZKC) this yields the striking result that only 8 out of 78 observed links are significant (\(\sim 90\%\) of 231 observed multi-edges are filtered out, cf. Fig. 2). In other words, once the partitioning of members into two classes is taken into account, almost all encounters between club members in (ZKC) can simply be explained by random effects. Figure 2 compares the original weighted network, illustrated in Fig. 2(a), and the filtered network, in Fig. 2(b).
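A block propensity matrix of the kind described above can be sketched as follows. The function name `block_propensities`, the class labels, and the propensity values 2.0 and 1.0 are hypothetical placeholders, not the values used for the (ZKC) data, which are not stated here.

```python
def block_propensities(classes, within, between):
    """Return Omega with Omega[i][j] = within if i and j share a class, else between."""
    return [
        [within if ci == cj else between for cj in classes]
        for ci in classes
    ]

# Hypothetical example: four actors, two per class.
classes = [0, 0, 1, 1]
Omega = block_propensities(classes, within=2.0, between=1.0)

# Setting within == between gives a constant matrix, i.e. the unbiased case.
Omega_uniform = block_propensities(classes, within=1.0, between=1.0)
```

Only the ratio between the within-class and between-class propensities should matter for a biased sampling scheme of this kind, so a single free parameter is enough for two classes.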
5 Conclusion
In this short paper we introduce gHypEs, a broad class of statistical ensembles of graphs that can be used to infer significant links from noisy data. Our work makes three important contributions: First, we provide an analytically tractable statistical model of directed and undirected multi-edge graphs that can be used for inference and learning tasks. Second, the formulation of our ensemble highlights a relation, to the best of our knowledge previously unknown, between random graph theory and Wallenius’ non-central hypergeometric distribution. And finally, different from existing statistical ensembles such as the configuration model, our framework can be used to encode prior knowledge on factors that influence the formation of relations. This flexible approach allows for a tuning of the “random baseline”, opening perspectives for a statistically principled network inference that accounts for effects that are not purely random. We thus argue that our work advances the theoretical foundation for the mining of relational data on social systems. It further highlights that principled model selection and hypothesis testing are crucial prerequisites that should precede the application of network-based data mining and modeling techniques.
Footnotes
- 1. Note that we do not distinguish between the \(n\times n\) adjacency matrix \(\mathbf {A}\) and the \(n^{2}\times 1\) vector obtained by stacking.
Acknowledgments
The authors acknowledge support from the Swiss State Secretariat for Education, Research and Innovation (SERI), Grant No. C14.0036, the MTEC Foundation project “The Influence of Interaction Patterns on Success in Socio-Technical Systems”, and EU COST Action TD1210 KNOWeSCAPE. The authors thank Rebekka Burkholz, Giacomo Vaccario, and Simon Schweighofer for helpful discussions.
References
- 1. Aicher, C., Jacobs, A.Z., Clauset, A.: Learning latent block structure in weighted networks. J. Complex Netw. 3(2), 221–248 (2015). https://academic.oup.com/comnet/article-lookup/doi/10.1093/comnet/cnu026
- 2. Anand, K., Bianconi, G.: Entropy measures for networks: toward an information theory of complex topologies. Phys. Rev. E 80, 045102 (2009)
- 3. Casiraghi, G.: Multiplex network regression: how do relations drive interactions? arXiv preprint arXiv:1702.02048, February 2017. http://arxiv.org/abs/1702.02048
- 4. Cimini, G., Squartini, T., Garlaschelli, D., Gabrielli, A.: Systemic risk analysis on reconstructed economic and financial networks. Sci. Rep. 5(1), 15758 (2015). http://arxiv.org/abs/1411.7613, http://dx.doi.org/10.1038/srep15758, http://www.nature.com/articles/srep15758
- 5. Cranshaw, J., Toch, E., Hong, J., Kittur, A., Sadeh, N.: Bridging the gap between physical location and online social networks. In: Proceedings of the 12th ACM International Conference on Ubiquitous Computing, UbiComp 2010, pp. 119–128. ACM, New York (2010)
- 6. De Choudhury, M., Mason, W.A., Hofman, J.M., Watts, D.J.: Inferring relevant social networks from interpersonal communication. In: Proceedings of the 19th International Conference on World Wide Web, WWW 2010, pp. 301–310. ACM, New York (2010)
- 7. De Domenico, M., Lancichinetti, A., Arenas, A., Rosvall, M.: Identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems. Phys. Rev. X 5(1), 011027 (2015)
- 8. Eagle, N., Pentland, A.S., Lazer, D.: Inferring friendship network structure by using mobile phone data. Proc. Nat. Acad. Sci. 106(36), 15274–15278 (2009)
- 9. Eagle, N., (Sandy) Pentland, A.: Reality mining: sensing complex social systems. Pers. Ubiquit. Comput. 10(4), 255–268 (2006)
- 10. Erdös, P., Rényi, A.: On random graphs I. Publ. Math. Debrecen 6, 290–297 (1959)
- 11. Fog, A.: Calculation methods for Wallenius’ noncentral hypergeometric distribution. Commun. Stat. - Simul. Comput. 37(2), 258–273 (2008)
- 12. Gemmetto, V., Cardillo, A., Garlaschelli, D.: Irreducible network backbones: unbiased graph filtering via maximum entropy, June 2017. http://arxiv.org/abs/1706.00230
- 13. Holme, P.: Modern temporal network theory: a colloquium. Europ. Phys. J. B 88(9), 1–30 (2015)
- 14. Jacod, J., Protter, P.E.: Probability Essentials. Springer Science & Business Media, Heidelberg (2003)
- 15. Karrer, B., Newman, M.E.J.: Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107 (2011)
- 16. Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J.P., Moreno, Y., Porter, M.A.: Multilayer networks. J. Complex Netw. 2(3), 203–271 (2014)
- 17. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inform. Sci. Technol. 58(7), 1019–1031 (2007)
- 18. Molloy, M., Reed, B.: A critical point for random graphs with a given degree sequence. Random Struct. Algorithms 6(2–3), 161–180 (1995)
- 19. Newman, M.E.J., Peixoto, T.P.: Generalized communities in networks. Phys. Rev. Lett. 115, 088701 (2015)
- 20. Newman, M.E.J.: Modularity and community structure in networks. Proc. Nat. Acad. Sci. 103(23), 8577–8582 (2006)
- 21. Peixoto, T.P.: Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E 89, 012804 (2014)
- 22. Pham, H., Shahabi, C., Liu, Y.: EBM: an entropy-based model to infer social strength from spatiotemporal data. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, pp. 265–276. ACM (2013)
- 23. Robins, G., Pattison, P., Kalish, Y., Lusher, D.: An introduction to exponential random graph (p*) models for social networks. Soc. Netw. 29(2), 173–191 (2007)
- 24. Rosvall, M., Esquivel, A.V., Lancichinetti, A., West, J.D., Lambiotte, R.: Memory in network flows and its effects on spreading dynamics and community detection. Nat. Commun. 5, 4630 (2014)
- 25. Schein, A., Paisley, J., Blei, D.M., Wallach, H.: Bayesian Poisson tensor factorization for inferring multilateral relations from sparse dyadic event counts. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2015. ACM (2015)
- 26. Scholtes, I.: When is a network a network? multi-order graphical model selection in pathways and temporal networks. In: KDD 2017 - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, February 2017, to appear
- 27. Scholtes, I., Wider, N., Garas, A.: Higher-order aggregate networks in the analysis of temporal networks: path structures and centralities. Europ. Phys. J. B 89(3), 1–15 (2016). http://link.springer.com/article/10.1140:2016-60663-0
- 28. Szell, M., Lambiotte, R., Thurner, S.: Multirelational organization of large-scale social networks in an online world. Proc. Natl. Acad. Sci. 107(31), 13636–13641 (2010)
- 29. Tang, J., Lou, T., Kleinberg, J.: Inferring social ties across heterogenous networks. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM 2012, pp. 743–752. ACM, New York (2012)
- 30. Vidmer, A., Medo, M.: The essential role of time in network-based recommendation. EPL (Europhys. Lett.) 116(3), 30007 (2016)
- 31. Wallenius, K.T.: Biased Sampling: The Noncentral Hypergeometric Probability Distribution. Ph.D. thesis, Stanford University (1963)
- 32. Wilson, J.D., Wang, S., Mucha, P.J., Bhamidi, S., Nobel, A.B.: A testing based extraction algorithm for identifying significant communities in networks. Ann. Appl. Stat. 8(3), 1853–1891 (2014)
- 33. Xiang, R., Neville, J., Rogati, M.: Modeling relationship strength in online social networks. In: Proceedings of the 19th International Conference on World Wide Web, WWW 2010, pp. 981–990. ACM, New York (2010)
- 34. Zachary, W.W.: An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33(4), 452–473 (1977)
- 35. Zhang, Y., Garas, A., Schweitzer, F.: Value of peripheral nodes in controlling multilayer scale-free networks. Phys. Rev. E 93, 012309 (2016). https://journals.aps.org/pre/abstract/10.1103/PhysRevE.93.012309

