Customizable HMM-based measures to accurately compare tree sets

  • Theoretical advances
Pattern Analysis and Applications

Abstract

Trees have attracted much interest for many decades because of the many emerging applications whose data are represented as trees. Several techniques have been developed to compare two trees, but there is a serious lack of metrics for comparing weighted trees. Existing approaches also do not allow the targeted node properties on which the comparison should be performed to be specified explicitly. Furthermore, existing techniques do not specifically address the problem of comparing two tree sets. This paper attempts to solve these problems by first proposing a distance and a similarity for comparing two finite sets of rooted ordered trees, which may be labeled or unlabeled and weighted or unweighted. To achieve this goal, a hidden Markov model is associated with each tree set for each targeted node property. The model associated with a tree set T for the targeted node property p learns how much the nodes of the trees in T satisfy property p. The resulting models are finally compared to derive a distance and a similarity between the two sets of trees. These measures are then generalized to the comparison of unrooted and unordered trees. Flat classification experiments were carried out on two synthetic databases, FirstLast-L and FirstLast-LW, which are available online. They both contain four classes of 100 rooted ordered trees whose specific and non-trivial node properties are clearly defined. When the distance proposed in this paper is selected as the metric for the Nearest Neighbor classifier, a perfect accuracy of \(100\%\) is obtained on both databases. This is \(41\%\) higher than the accuracy obtained on FirstLast-L when the widespread tree Edit distance is selected.
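
The classification setup mentioned in the abstract can be illustrated with a minimal sketch, assuming a 1-nearest-neighbor rule that accepts any distance function. The function names, the tuple-based tree encoding, and the toy `size_distance` below are hypothetical; `size_distance` is only a placeholder and is not the HMM-based distance proposed in the paper, which the abstract does not describe in enough detail to reproduce.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical encoding: a tree is (label, [child trees]).
Tree = Tuple[str, list]


def node_count(tree: Tree) -> int:
    """Count the nodes of a tree encoded as (label, children)."""
    _label, children = tree
    return 1 + sum(node_count(child) for child in children)


def size_distance(a: Tree, b: Tree) -> float:
    """Toy placeholder distance: absolute difference in node counts.

    This is NOT the paper's measure; any distance between trees
    (or tree sets) could be plugged in instead.
    """
    return abs(node_count(a) - node_count(b))


def nearest_neighbor_label(
    query: Tree,
    training: Sequence[Tuple[Tree, str]],
    distance: Callable[[Tree, Tree], float],
) -> str:
    """Return the label of the training tree closest to `query`."""
    if not training:
        raise ValueError("training set must not be empty")
    _best_tree, best_label = min(
        training, key=lambda pair: distance(query, pair[0])
    )
    return best_label


# Small illustrative data set (made up for this sketch).
leaf: Tree = ("a", [])
small: Tree = ("r", [leaf])                         # 2 nodes
big: Tree = ("r", [leaf, leaf, ("b", [leaf])])      # 5 nodes
training: List[Tuple[Tree, str]] = [(small, "small"), (big, "big")]
query: Tree = ("r", [leaf, leaf, leaf])             # 4 nodes

predicted = nearest_neighbor_label(query, training, size_distance)
```

Because the distance is passed in as a parameter, swapping the placeholder for another measure, such as a tree edit distance or a model-based distance, changes only the argument and not the classifier.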

Notes

  1. See page 15, Section F

  2. http://www.simotime.com/asc2ebc1.htm.

  3. http://perso-etis.ensea.fr/sylvain.iloga/FirstLast/index.html.

  4. http://tree-edit-distance.dbresearch.uni-salzburg.at/.

Author information

Corresponding author

Correspondence to Sylvain Iloga.

About this article

Cite this article

Iloga, S. Customizable HMM-based measures to accurately compare tree sets. Pattern Anal Applic 24, 1149–1171 (2021). https://doi.org/10.1007/s10044-021-00971-3

  • DOI: https://doi.org/10.1007/s10044-021-00971-3