GTED: Graph Traversal Edit Distance

  • Ali Ebrahimpour Boroojeny
  • Akash Shrestha
  • Ali Sharifi-Zarchi
  • Suzanne Renick Gallagher
  • S. Cenk Sahinalp
  • Hamidreza Chitsaz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10812)

Abstract

Many problems in applied machine learning deal with graphs (also called networks), in areas such as social networks, security, web data mining, protein function prediction, and genome informatics. The kernel paradigm beautifully decouples the learning algorithm from the underlying geometric space, which makes graph kernels important for these applications.

In this paper, we give a new graph kernel which we call graph traversal edit distance (GTED). We introduce the GTED problem and give the first polynomial time algorithm for it. Informally, the graph traversal edit distance is the minimum edit distance between two strings formed by the edge labels of respective Eulerian traversals of the two graphs. Also, GTED is motivated by and provides the first mathematical formalism for sequence co-assembly and de novo variation detection in bioinformatics.
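The informal definition above can be illustrated with a brute-force sketch for very small graphs. This is not the polynomial time algorithm of the paper: it simply enumerates every Eulerian trail of each graph and takes the minimum Levenshtein distance over all pairs of label strings, which takes exponential time. The edge-list representation `(tail, head, label)`, the function names, and the choice to allow a trail to start at any vertex are assumptions made for illustration only.

```python
from itertools import product


def eulerian_label_strings(edges):
    """Return the set of label strings spelled by Eulerian trails.

    edges: list of (tail, head, label) triples of a directed multigraph.
    Assumption of this sketch: a trail may start at any vertex, so every
    rotation of an Eulerian cycle is included.
    """
    n = len(edges)
    results = set()

    def extend(node, used, labels):
        # A trail that has consumed every edge exactly once is Eulerian.
        if len(labels) == n:
            results.add("".join(labels))
            return
        for i, (tail, head, label) in enumerate(edges):
            if not used[i] and tail == node:
                used[i] = True
                labels.append(label)
                extend(head, used, labels)
                labels.pop()
                used[i] = False

    for start in {tail for tail, _, _ in edges}:
        extend(start, [False] * n, [])
    return results


def edit_distance(s, t):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def gted_bruteforce(edges1, edges2):
    """Minimum edit distance over all pairs of Eulerian label strings."""
    strings1 = eulerian_label_strings(edges1)
    strings2 = eulerian_label_strings(edges2)
    if not strings1 or not strings2:
        raise ValueError("each graph needs at least one Eulerian trail")
    return min(edit_distance(a, b) for a, b in product(strings1, strings2))


# Two hypothetical 3-cycles whose labels differ in one edge.
g1 = [(0, 1, "A"), (1, 2, "C"), (2, 0, "G")]
g2 = [(0, 1, "A"), (1, 2, "C"), (2, 0, "T")]
print(gted_bruteforce(g1, g2))  # 1
```

In this toy case the first cycle spells ACG, CGA, or GAC depending on the starting vertex, and the closest pair (ACG, ACT) differs by a single substitution.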

We demonstrate that GTED admits a polynomial time algorithm using a linear program in the graph product space that is guaranteed to yield an integer solution. To the best of our knowledge, this is the first approach to this problem. We also give a linear programming relaxation algorithm for a lower bound on GTED. We use GTED as a graph kernel and evaluate it by computing the accuracy of an SVM classifier on a few datasets from the literature. Our results suggest that our kernel outperforms many of the common graph kernels on the tested datasets. As a second set of experiments, we successfully cluster viral genomes using GTED on their assembly graphs obtained from de novo assembly of next-generation sequencing reads. Our GTED implementation can be downloaded from http://chitsazlab.org/software/gted/.
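The abstract does not say how distances are converted into kernel values, so the following is only a sketch of one common choice, an exponentiated negative distance. The function name, the `gamma` parameter, and the toy distance matrix are hypothetical, and a matrix built this way from an edit-style distance is not guaranteed to be positive semidefinite.

```python
import math


def distance_to_similarity(dist, gamma=1.0):
    """Map a symmetric distance matrix to similarities exp(-gamma * d).

    Assumption of this sketch: this is a generic transformation, not
    necessarily the construction used in the paper.
    """
    n = len(dist)
    return [[math.exp(-gamma * dist[i][j]) for j in range(n)]
            for i in range(n)]


# Made-up pairwise distances between three graphs, for illustration only.
toy_distances = [
    [0, 1, 4],
    [1, 0, 3],
    [4, 3, 0],
]
similarity = distance_to_similarity(toy_distances, gamma=0.5)
```

A matrix of this form could then be passed to an SVM implementation that accepts precomputed kernels, after checking or correcting for positive semidefiniteness.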

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Ali Ebrahimpour Boroojeny (1)
  • Akash Shrestha (1)
  • Ali Sharifi-Zarchi (1, 2, 3)
  • Suzanne Renick Gallagher (1)
  • S. Cenk Sahinalp (4)
  • Hamidreza Chitsaz (1)

  1. Colorado State University, Fort Collins, USA
  2. Royan Institute, Tehran, Iran
  3. Sharif University of Technology, Tehran, Iran
  4. Indiana University, Bloomington, USA