Deterministic Coresets for Stochastic Matrices with Applications to Scalable Sparse PageRank

  • Harry Lang
  • Cenk Baykal
  • Najib Abu Samra
  • Tony Tannous
  • Dan Feldman
  • Daniela Rus
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11436)


The PageRank algorithm is used by search engines to rank websites in their search results. It outputs a probability distribution giving the likelihood that a person randomly clicking on links arrives at any particular page. Intuitively, a node in the center of the network should be visited with high probability even if it has few edges, while an isolated node that has many (local) neighbours will be visited with low probability. The idea of PageRank is to rank nodes according to a stable state of this random process rather than according to a local count of a node's incoming and outgoing edges, which can be manipulated more easily than the corresponding entry in the stable state.
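To make the "stable state" concrete, here is a minimal sketch of the classical power-iteration method for PageRank on a graph stored as out-link lists. It is the textbook baseline, not the coreset algorithm of this paper. The function name `pagerank`, the damping factor of 0.85, the tolerance, and the uniform treatment of dangling nodes are all assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=1000):
    """Textbook power iteration (illustrative baseline, not the paper's method).

    out_links[u] is the list of nodes that node u links to.
    damping, tol and max_iter are assumed defaults, not values from the paper.
    """
    n = len(out_links)
    rank = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(max_iter):
        new = [(1.0 - damping) / n] * n       # random-jump ("teleport") mass
        dangling = 0.0                        # mass held by nodes with no out-links
        for u, targets in enumerate(out_links):
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share           # follow one of u's links uniformly
            else:
                dangling += damping * rank[u]
        # spread dangling mass uniformly so the vector stays a distribution
        new = [r + dangling / n for r in new]
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank


if __name__ == "__main__":
    # Tiny hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
    print(pagerank([[1, 2], [2], [0]]))
```

Each sweep touches every edge once, so with out-degree at most s a single iteration costs O(ns) time and O(n) working memory beyond the graph itself.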

In this paper we present a deterministic and completely parallelizable algorithm for computing an \(\varepsilon \)-approximation to the PageRank of a graph of n nodes. Typical inputs consist of millions of pages, but the average number of links per page is less than ten. Our algorithm takes advantage of this sparsity: assuming the out-degree of each node is at most s, it terminates in \(O(n s / \varepsilon ^2)\) time. Beyond the input graph, which may be stored in read-only storage, our algorithm uses only O(n) memory. This is the first algorithm whose complexity takes advantage of sparsity. Real data exhibits an average out-degree of 7 while n is in the millions, so the advantage is immense. Moreover, our algorithm is simple and robust to floating-point precision issues. Our sparse solution (coreset) is based on reducing the PageRank problem to an \(\ell _2\) approximation of the Carathéodory problem, which independently has many applications, such as in machine learning and game theory. We hope that our approach will be useful for many other applications that involve learning from sparse data and graphs.
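The \(\ell _2\) approximate Carathéodory problem asks for a sparse convex combination of given points that lies close to a target point in their convex hull. One standard deterministic way to build such a combination is the Frank-Wolfe method. The sketch below shows that generic technique; the abstract does not confirm it is the construction used in the paper. The name `sparse_convex_approx` and the step-size rule 2/(k+2) are assumptions made for illustration.

```python
def sparse_convex_approx(points, target, iterations):
    """Frank-Wolfe sketch for the l2 approximate Caratheodory problem.

    Minimizes 0.5 * ||x - target||^2 over the convex hull of `points`.
    Illustrative only: not confirmed to be the paper's construction.
    Returns (weights, x): weights maps point index -> coefficient, and x is
    the corresponding convex combination.
    """
    dim = len(target)
    x = list(points[0])                 # start at an arbitrary input point
    weights = {0: 1.0}
    for k in range(iterations):
        grad = [x[j] - target[j] for j in range(dim)]
        # Linear minimization oracle: the input point most aligned with -grad.
        best = min(range(len(points)),
                   key=lambda i: sum(g * c for g, c in zip(grad, points[i])))
        step = 2.0 / (k + 2)            # standard (assumed) step-size schedule
        weights = {i: (1.0 - step) * w for i, w in weights.items()}
        weights[best] = weights.get(best, 0.0) + step
        x = [(1.0 - step) * xj + step * pj for xj, pj in zip(x, points[best])]
    # Drop indices whose coefficient became exactly zero.
    weights = {i: w for i, w in weights.items() if w > 0.0}
    return weights, x


if __name__ == "__main__":
    n = 50
    basis = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    centroid = [1.0 / n] * n
    w, approx = sparse_convex_approx(basis, centroid, 20)
    print(len(w), sum((a - b) ** 2 for a, b in zip(approx, centroid)))
```

With this step size, the standard Frank-Wolfe guarantee bounds the squared distance after k iterations by 4D²/(k+2), where D is the diameter of the point set, so on the order of 1/ε² points suffice for error εD. This is consistent with the \(1/\varepsilon ^2\) factor in the running time stated above.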

Algorithm, analysis, and open code with experimental results are provided.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Harry Lang (1)
  • Cenk Baykal (1, corresponding author)
  • Najib Abu Samra (2)
  • Tony Tannous (2)
  • Dan Feldman (2)
  • Daniela Rus (1)

  1. MIT CSAIL, Cambridge, USA
  2. Computer Science Department, University of Haifa, Haifa, Israel
