Increasing sampling efficiency for the fixed degree sequence model with phase transitions
Real-world network data is often very noisy and contains erroneous or missing edges. These superfluous and missing edges can be identified statistically by assessing the number of common neighbors of the two incident nodes. To evaluate whether this number of common neighbors, the so-called co-occurrence, is statistically significant, a comparison with the expected co-occurrence in a suitable random graph model is required. For networks with a skewed degree distribution, including most real-world networks, it is known that the fixed degree sequence model (FDSM), which maintains the degrees of nodes, is preferable to simplified graph models that are based on an independence assumption. However, using the FDSM requires sampling from the space of all graphs with the given degree sequence and measuring the co-occurrence of each pair of nodes in each of the samples, since no closed formula is known for this statistic. While there exist log-linear approaches such as Markov chain Monte Carlo sampling, the computational complexity still depends on the length of the Markov chain and the number of samples, which is significant in large-scale networks. In this article, we show, based on ground truth data for different data sets, that there are various phase transition-like tipping points that enable us to choose a comparatively low number of samples and to reduce the length of the Markov chains without reducing the quality of the significance test. As a result, the computational effort can be reduced by an order of magnitude. Furthermore, we present and evaluate practically usable strategies for speeding up the randomization process of input graphs and heuristics for phase transition-based computation stopping.
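The sampling procedure summarized above can be illustrated with a minimal sketch. It assumes a bipartite input graph stored as a dictionary of neighbor sets and uses plain edge-swap randomization as the Markov chain. The function names (`cooccurrence`, `swap_randomize`, `empirical_p_values`) and the parameters `num_samples` and `num_swaps` are hypothetical choices made for illustration; they are not taken from the article and do not reproduce its stopping heuristics.

```python
import random
from itertools import combinations


def cooccurrence(adj):
    """Count common neighbors for every pair of left-side nodes.

    adj maps each left-side node to the set of its right-side neighbors
    (an assumed bipartite representation).
    """
    return {(u, v): len(adj[u] & adj[v]) for u, v in combinations(sorted(adj), 2)}


def swap_randomize(adj, num_swaps, rng):
    """Return a randomized copy of adj after num_swaps attempted edge swaps.

    Each accepted swap turns edges (a, x), (b, y) into (a, y), (b, x),
    which keeps the degree of every node unchanged.
    """
    adj = {u: set(nbrs) for u, nbrs in adj.items()}
    edges = [(u, w) for u, nbrs in adj.items() for w in nbrs]
    for _ in range(num_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, x), (b, y) = edges[i], edges[j]
        # Reject swaps that would change nothing or create a duplicate edge.
        if a == b or x == y or y in adj[a] or x in adj[b]:
            continue
        adj[a].remove(x)
        adj[a].add(y)
        adj[b].remove(y)
        adj[b].add(x)
        edges[i], edges[j] = (a, y), (b, x)
    return adj


def empirical_p_values(adj, num_samples, num_swaps, seed=0):
    """Estimate, per node pair, how often a random sample reaches the
    observed co-occurrence. Small values suggest a significant pair."""
    rng = random.Random(seed)
    observed = cooccurrence(adj)
    hits = {pair: 0 for pair in observed}
    current = adj
    for _ in range(num_samples):
        # Continue the chain from the previous sample instead of restarting.
        current = swap_randomize(current, num_swaps, rng)
        sample = cooccurrence(current)
        for pair in observed:
            if sample[pair] >= observed[pair]:
                hits[pair] += 1
    return {pair: hits[pair] / num_samples for pair in observed}
```

In this sketch the cost grows with `num_samples` times `num_swaps`, plus one co-occurrence pass over all node pairs per sample. These are the two quantities the abstract proposes to reduce.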
Keywords: Graph processing, Fixed degree sequence model, Link assessment, Online heuristics, Phase transitions, Data cleaning, Randomization strategies, Markov chain Monte Carlo
We thank Andreas Spitz for helpful feedback and ground truth data. We would like to thank Emőke-Ágnes Horvát for helpful discussions. The simulations were executed on the high performance cluster “Elwetritsch” at the TU Kaiserslautern which is part of the “Alliance of High Performance Computing Rheinland-Pfalz” (AHRP). We kindly acknowledge the support.