Abstract
Trees have attracted sustained interest for decades because of the many emerging applications whose data are represented as trees. Several techniques have been developed to compare two trees, but there is a serious lack of metrics for comparing weighted trees. Existing approaches also do not allow the user to explicitly specify the targeted node properties on which the comparison should be performed. Furthermore, existing techniques do not specifically address the problem of comparing two sets of trees. This paper attempts to solve these problems by first proposing a distance and a similarity for comparing two finite sets of rooted ordered trees, which may be labeled or unlabeled and weighted or unweighted. To achieve this goal, a hidden Markov model is associated with each tree set for each targeted node property. The model associated with a tree set T for the targeted node property p learns how much the nodes of the trees in T satisfy property p. The resulting models are then compared to derive a distance and a similarity between the two sets of trees. These measures are then generalized to the comparison of unrooted and unordered trees. Flat classification experiments were carried out on two synthetic databases, FirstLast-L and FirstLast-LW, which are available online. Each contains four classes of 100 rooted ordered trees whose specific and non-trivial node properties are clearly defined. When the distance proposed in this paper is selected as the metric for the Nearest Neighbor classifier, a perfect accuracy of \(100\%\) is obtained on both databases. For FirstLast-L, this is \(41\%\) higher than the accuracy obtained when the widespread tree edit distance is selected.
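The evaluation setting described in the abstract, a Nearest Neighbor classifier whose metric can be swapped, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: trees are encoded as nested tuples, and `toy_distance` (the difference in node counts) is a hypothetical stand-in that is neither the HMM-based distance proposed in the paper nor the tree edit distance. The sketch only shows where a custom distance would plug into the classifier.

```python
from typing import Callable, Sequence, Tuple

# Assumed encoding (not from the paper): a tree is (label, child, child, ...).
Tree = tuple


def count_nodes(tree: Tree) -> int:
    """Count the nodes of a tree encoded as (label, *children)."""
    return 1 + sum(count_nodes(child) for child in tree[1:])


def toy_distance(a: Tree, b: Tree) -> float:
    """Hypothetical placeholder metric: absolute difference in node counts."""
    return abs(count_nodes(a) - count_nodes(b))


def nearest_neighbor_label(
    query: Tree,
    training: Sequence[Tuple[Tree, str]],
    distance: Callable[[Tree, Tree], float],
) -> str:
    """Return the label of the training tree closest to the query."""
    if not training:
        raise ValueError("training set must not be empty")
    _, label = min(training, key=lambda pair: distance(query, pair[0]))
    return label


# Tiny labeled training set with 1, 2 and 6 nodes respectively.
training = [
    (("a",), "small"),
    (("a", ("b",)), "small"),
    (("a", ("b", ("c",)), ("d", ("e",), ("f",))), "large"),
]

# Query tree with 5 nodes; its nearest neighbor is the 6-node tree.
query = ("x", ("y",), ("z", ("w",)), ("v",))
predicted = nearest_neighbor_label(query, training, toy_distance)
print(predicted)
```

Because the classifier takes the metric as a parameter, comparing two distances on the same data only requires passing a different function as `distance`.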
Cite this article
Iloga, S. Customizable HMM-based measures to accurately compare tree sets. Pattern Anal Applic 24, 1149–1171 (2021). https://doi.org/10.1007/s10044-021-00971-3
DOI: https://doi.org/10.1007/s10044-021-00971-3