The VLDB Journal, Volume 20, Issue 3, pp 445–470

Index design and query processing for graph conductance search

Regular Paper

Abstract

Graph conductance queries, also known as personalized PageRank and related to random walks with restarts, were originally proposed to assign a hyperlink-based prestige score to Web pages. More general forms of such queries are also very useful for ranking in entity-relation (ER) graphs used to represent relational, XML and hypertext data. Evaluation of PageRank usually involves a global eigen computation. If the graph is even moderately large, interactive response times may not be possible. Recently, the need for interactive PageRank evaluation has increased. The graph may be fully known only when the query is submitted. Browsing actions of the user may change some inputs to the PageRank computation dynamically. In this paper, we describe a system that analyzes query workloads and the ER graph, invests in limited offline indexing, and exploits those indices to achieve essentially constant-time query processing, even as the graph size scales. Our techniques (data and query statistics collection, index selection and materialization, and query-time index exploitation) have parallels in the extensive relational query optimization literature, but are applied to supporting novel graph data repositories. We report on experiments with five temporal snapshots of the CiteSeer ER graph having 74–702 thousand entity nodes, 0.17–1.16 million word nodes, 0.29–3.26 million edges between entities, and 3.29–32.8 million edges between words and entities. We also used two million actual queries from CiteSeer’s logs. Queries run 3–4 orders of magnitude faster than whole-graph PageRank, with the gap growing with graph size. Index size is smaller than a text index. Ranking accuracy is 94–98% with reference to whole-graph PageRank.
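To make the baseline concrete, the following is a minimal sketch of whole-graph personalized PageRank (random walk with restart) computed by power iteration. It is not the indexing system described in the abstract; it only illustrates the kind of global computation that the paper's indices are meant to avoid at query time. The function name, the damping value `alpha=0.85`, the convergence tolerance, and the handling of dangling nodes are all assumptions made for illustration.

```python
def personalized_pagerank(out_edges, restart, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for a random walk with restart (illustrative sketch).

    out_edges: dict mapping each node to a list of successor nodes.
    restart:   dict mapping nodes to restart probabilities (should sum to 1).
    alpha:     probability of following an out-edge instead of restarting
               (0.85 is an assumed, conventional value).
    """
    # Collect every node mentioned as a source, a target, or a restart node.
    nodes = set(out_edges)
    for targets in out_edges.values():
        nodes.update(targets)
    nodes.update(restart)

    # Start from the restart distribution itself.
    scores = {v: restart.get(v, 0.0) for v in nodes}

    for _ in range(max_iter):
        # Restart mass arrives at the restart nodes on every step.
        nxt = {v: (1.0 - alpha) * restart.get(v, 0.0) for v in nodes}
        for u in nodes:
            targets = out_edges.get(u, [])
            if targets:
                # Spread the walker's mass uniformly over out-edges.
                share = alpha * scores[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Assumed policy: a dangling node sends its mass back
                # to the restart distribution.
                for v, r in restart.items():
                    nxt[v] += alpha * scores[u] * r
        delta = sum(abs(nxt[v] - scores[v]) for v in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores


# Hypothetical three-node graph; the walk restarts only at node "a".
example_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
example_scores = personalized_pagerank(example_graph, {"a": 1.0})
```

Each iteration touches every edge of the graph, which is why the cost of this approach grows with graph size and why interactive response times become difficult on large graphs.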

Keywords

Personalized PageRank, Graph conductance, Proximity search in graph databases

Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  1. IIT Bombay, Powai, Mumbai, India
