Frontiers of Computer Science, Volume 10, Issue 3, pp 387–398

Big graph search: challenges and techniques

  • Shuai Ma
  • Jia Li
  • Chunming Hu
  • Xuelian Lin
  • Jinpeng Huai
Review Article

Abstract

On the one hand, graphs have more expressive power than traditional relational and XML models and are widely used today. On the other hand, various social computing applications have created a pressing need for a new search paradigm. In this article, we argue that big graph search fills this gap. We first introduce applications of graph search in various scenarios. We then formalize the graph search problem and analyze graph search from an evolutionary point of view, followed by evidence from both industry and academia. After that, we analyze the difficulties and challenges of big graph search. Finally, we present three classes of techniques for big graph search: query techniques, data techniques, and distributed computing techniques.

Keywords

graph search, big data, query techniques, data techniques, distributed computing

Copyright information

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Shuai Ma (1)
  • Jia Li (1)
  • Chunming Hu (1)
  • Xuelian Lin (1)
  • Jinpeng Huai (1)

  1. State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, China
