DS 2016: Discovery Science pp 67-82 | Cite as
Min-Hashing for Probabilistic Frequent Subtree Feature Spaces
Abstract
We propose a fast algorithm for approximating graph similarities. For its advantageous semantic and algorithmic properties, we define the similarity between two graphs by the Jaccard-similarity of their images in a binary feature space spanned by the set of frequent subtrees generated for some training dataset. Since the feature space embedding is computationally intractable, we use a probabilistic subtree isomorphism operator based on a small sample of random spanning trees and approximate the Jaccard-similarity by min-hash sketches. The partial order on the feature set defined by subgraph isomorphism allows for a fast calculation of the min-hash sketch, without explicitly performing the feature space embedding. Experimental results on real-world graph datasets show that our technique results in a fast algorithm. Furthermore, the approximated similarities are well-suited for classification and retrieval tasks in large graph datasets.
Keywords
Feature Space Span Tree Tree Pattern Subgraph Isomorphism Naive AlgorithmReferences
- 1.Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of the Compression and Complexity of Sequences, pp. 21–29. IEEE (1997)Google Scholar
- 2.Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)MathSciNetCrossRefMATHGoogle Scholar
- 3.Deshpande, M., Kuramochi, M., Wale, N., Karypis, G.: Frequent substructure-based approaches for classifying chemical compounds. Trans. Knowl. Data Eng. 17(8), 1036–1050 (2005)CrossRefGoogle Scholar
- 4.Diestel, R.: Graph Theory. Graduate Texts in Mathematics, vol. 173, 4th edn. Springer, Heidelberg (2012). http://dblp.dagstuhl.de/rec/bib/books/daglib/0030488 MATHGoogle Scholar
- 5.Geppert, H., Horváth, T., Gärtner, T., Wrobel, S., Bajorath, J.: Support-vector-machine-based ranking significantly improves the effectiveness of similarity searching using 2D fingerprints and multiple reference compounds. J. Chem. Inf. Model. 48(4), 742–746 (2008)CrossRefGoogle Scholar
- 6.Horváth, T., Bringmann, B., Raedt, L.: Frequent hypergraph mining. In: Inoue, K., Ohwada, H., Yamamoto, A. (eds.) ILP 2006. LNCS (LNAI), vol. 4455, pp. 244–259. Springer, Heidelberg (2007). doi: 10.1007/978-3-540-73847-3_26 CrossRefGoogle Scholar
- 7.Horváth, T., Ramon, J.: Efficient frequent connected subgraph mining in graphs of bounded tree-width. Theor. Comput. Sci. 411(31–33), 2784–2797 (2010)MathSciNetCrossRefMATHGoogle Scholar
- 8.Ralaivola, L., Swamidass, S.J., Saigo, H., Baldi, P.: Graph kernels for chemical informatics. Neural Netw. 18(8), 1093–1110 (2005)CrossRefGoogle Scholar
- 9.Shamir, R., Tsur, D.: Faster subtree isomorphism. J. Algorithms 33(2), 267–280 (1999). doi: 10.1006/jagm.1999.1044 MathSciNetCrossRefMATHGoogle Scholar
- 10.Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A.J., Vishwanathan, S.V.N.: Hash kernels for structured data. J. Mach. Learn. Res. 10, 2615–2637 (2009)MathSciNetMATHGoogle Scholar
- 11.Teixeira, C.H.C., Silva, A., Meira Jr., W.: Min-hash fingerprints for graph kernels: a trade-off among accuracy, efficiency, and compression. J. Inf. Data Manag. 3(3), 227–242 (2012)Google Scholar
- 12.Welke, P., Horváth, T., Wrobel, S.: Probabilistic frequent subtree kernels. In: Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z.W. (eds.) NFMCP 2015. LNCS (LNAI), vol. 9607, pp. 179–193. Springer, Heidelberg (2016). doi: 10.1007/978-3-319-39315-5_12 CrossRefGoogle Scholar