Abstract
Defining appropriate distance functions is a crucial aspect of effective and efficient similarity-based prediction and retrieval. Relational data are especially challenging in this regard. By viewing relational data as multi-relational graphs, one can easily see that a distance between a pair of nodes can be defined in terms of a virtually unlimited class of features, including node attributes, attributes of node neighbors, structural aspects of the node neighborhood and arbitrary combinations of these properties. In this paper we propose a rich and flexible class of metrics on graph entities based on earth mover’s distance applied to a hierarchy of complex counts-of-counts statistics. We further propose an approximate version of the distance using sums of marginal earth mover’s distances. We show that the approximation is correct for many cases of practical interest and allows efficient nearest-neighbor retrieval when combined with a simple metric tree data structure. An experimental evaluation on two real-world scenarios highlights the flexibility of our framework for designing metrics representing different notions of similarity. Substantial improvements in similarity-based prediction are reported when compared to solutions based on state-of-the-art graph kernels.
Notes
Preliminary experiments showed that a plain summation indeed achieves poor performance on TETs where different branches have very different numbers of children.
Note that a classification accuracy of 99.9% corresponds to an F1 score of 99.9, substantially higher than the one we achieved in Jaeger et al. (2013) for the same task (on a different data set) using a discriminant function and nearest-neighbor retrieval.
References
Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY (1998) An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J ACM 45(6):891–923
Assche AV, Vens C, Blockeel H, Dzeroski S (2006) First order random forests: learning relational classifiers with complex aggregates. Mach Learn 64:149–182
Barla A, Odone F, Verri A (2003) Histogram intersection kernel for image classification. In: Proceedings of the 2003 international conference on image processing (ICIP), vol 3, pp III-513
Bellet A, Habrard A, Sebban M (2013) A survey on metric learning for feature vectors and structured data. CoRR arXiv:1306.6709
Berg C, Christensen JP, Ressel P (1984) Harmonic analysis on semigroups: theory of positive definite and related functions, Graduate Texts in Mathematics, vol 100, 1st edn. Springer, Berlin
Chan T, Esedoglu S, Ni K (2007) Histogram based segmentation using Wasserstein distances. In: International conference on scale space and variational methods in computer vision, Springer, pp 697–708
Clarkson KL (2006) Nearest-neighbor searching and metric space dimensions. In: Nearest-neighbor methods for learning and vision: theory and practice. MIT Press, Cambridge
Cuturi M, Avis D (2014) Ground metric learning. J Mach Learn Res 15(1):533–564
Datta R, Joshi D, Li J, Wang JZ (2008) Image retrieval: ideas, influences, and trends of the new age. ACM Comput Surv 40(2):5:1–5:60
Egghe L (2006) Theory and practise of the g-index. Scientometrics 69(1):131–152
Gardner A, Duncan CA, Kanno J, Selmic RR (2018) On the definiteness of earth mover’s distance and its relation to set intersection. IEEE Trans Cybern 48(11):3184–3196
Grover A, Leskovec J (2016) Node2vec: scalable feature learning for networks. In: Proceedings of the 22Nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, KDD ’16, pp 855–864
Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102(46):16569–16572
Hoff PD (2009) Multiplicative latent factor models for description and prediction of social networks. Comput Math Organ Theory 15(4):261–272
Jaeger M, Lippi M, Passerini A, Frasconi P (2013) Type extension trees for feature construction and learning in relational domains. Artif Intell 204:30–55
Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, ACM, pp 41–48
Jeh G, Widom J (2002) Simrank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, KDD ’02, pp 538–543
Khan A, Li N, Yan X, Guan Z, Chakraborty S, Tao S (2011) Neighborhood based fast graph search in large networks. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, pp 901–912
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations (ICLR)
Knobbe AJ, Siebes A, van der Wallen D (1999) Multi-relational decision tree induction. In: Proceedings of PKDD-99, pp 378–383
Leicht EA, Holme P, Newman ME (2006) Vertex similarity in networks. Phys Rev E 73(2):026120
Liu T, Moore AW, Yang K, Gray AG (2005) An investigation of practical approximate nearest neighbor algorithms. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems, vol 17. MIT Press, Cambridge, pp 825–832
Liu Z, Zheng VW, Zhao Z, Zhu F, Chang KC, Wu M, Ying J (2017) Semantic proximity search on heterogeneous graph by proximity embedding. In: Singh SP, Markovitch S (eds) Proceedings of the thirty-first AAAI conference on artificial intelligence, February 4–9, 2017, San Francisco, CA, AAAI Press, pp 154–160
Ljosa V, Bhattacharya A, Singh AK (2006) Indexing spatially sensitive distance measures using multi-resolution lower bounds. In: International conference on extending database technology, Springer, Berlin, pp 865–883
Loosli G, Canu S, Ong CS (2016) Learning SVM in Kreĭn spaces. IEEE Trans Pattern Anal Mach Intell 38(6):1204–1216
Mottin D, Lissandrini M, Velegrakis Y, Palpanas T (2014) Exemplar queries: give me an example of what you need. Proc VLDB Endow 7(5):365–376
Muja M, Lowe DG (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans Pattern Anal Mach Intell 36(11):2227–2240
Naor A, Schechtman G (2007) Planar earthmover is not in \(L_1\). SIAM J Comput 37(3):804–826
Neumann M, Garnett R, Bauckhage C, Kersting K (2016) Propagation kernels: efficient graph kernels from propagated information. Mach Learn 102(2):209–245
Neville J, Jensen D, Friedland L, Hay M (2003) Learning relational probability trees. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-03)
Newman ME (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 74(3):036,104
Oglic D, Gaertner T (2018) Learning in reproducing kernel Krein spaces. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, proceedings of machine learning research, vol 80, pp 3856–3864
Pele O, Werman M (2008) A linear time histogram metric for improved SIFT matching. In: Forsyth DA, Torr PHS, Zisserman A (eds) Computer vision–ECCV 2008, 10th European conference on computer vision, Marseille, France, October 12–18, 2008, proceedings, Part III, Springer, Lecture Notes in Computer Science, vol 5304, pp 495–508
Richards DS (1985) Positive definite symmetric functions on finite-dimensional spaces II. Stat Probab Lett 3(6):325–329
Richardson M, Domingos P (2006) Markov logic networks. Mach Learn 62(1):107–136
Rubner Y, Tomasi C, Guibas LJ (1998) A metric for distributions with applications to image databases. In: Sixth international conference on computer vision, 1998, IEEE, pp 59–66
Schölkopf B, Smola A (2002) Learning with Kernels. The MIT Press, Cambridge, MA
Shervashidze N, Schweitzer P, van Leeuwen EJ, Mehlhorn K, Borgwardt KM (2011) Weisfeiler–Lehman graph kernels. J Mach Learn Res 12:2539–2561
Sun Y, Han J, Yan X, Yu PS, Wu T (2011) Pathsim: meta path-based top-k similarity search in heterogeneous information networks. Proc VLDB Endow 4(11):992–1003
Tong H, Faloutsos C, Gallagher B, Eliassi-Rad T (2007) Fast best-effort pattern matching in large attributed graphs. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, KDD ’07, pp 737–746
Uhlmann JK (1991) Satisfying general proximity/similarity queries with metric trees. Inf Process Lett 40(4):175–179
Vens C, Gassen SV, Dhaene T, Saeys Y (2014) Complex aggregates over clusters of elements. In: Davis J, Ramon J (eds) Inductive logic programming–24th international conference, ILP 2014, Nancy, France, September 14–16, 2014, Revised Selected Papers, Springer, Lecture Notes in Computer Science, vol 9046, pp 181–193
Wang F, Guibas LJ (2012) Supervised earth mover’s distance learning and its computer vision applications. In: Fitzgibbon AW, Lazebnik S, Perona P, Sato Y, Schmid C (eds) Computer vision–ECCV 2012–12th European conference on computer vision, Florence, Italy, October 7–13, 2012, Proceedings, Part I, Springer, Lecture Notes in Computer Science, vol 7572, pp 442–455
Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. CoRR arXiv:1408.2927
Wang J, Zhang T, Song J, Sebe N, Shen HT (2017) A survey on learning to hash. IEEE Trans Pattern Anal Mach Intell PP(99):1–1
Wichterich M, Assent I, Kranen P, Seidl T (2008) Efficient emd-based similarity search in multimedia databases via flexible dimensionality reduction. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, ACM, pp 199–212
Yanardag P, Vishwanathan S (2015) Deep graph kernels. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1365–1374
Zhang CT (2009) The e-index, complementing the h-index for excess citations. PLoS One 4(5):e5429
Additional information
Responsible editors: Dr. Bringmann, Dr. Davis, Dr. Fromont and Dr. Greene
Appendices
Proofs
Proposition 1
\(d_{\text {r-count}}\) is a scale-invariant pseudo-metric with values in [0, 1].
Proof
The minimum of two counts is a positive semi-definite kernel, called histogram intersection kernel (Barla et al. 2003). The normalization is called cosine normalization, and the result is also a kernel (Schölkopf and Smola 2002). Let us refer to this kernel as \(k(h_1,h_2)=\frac{\min (c(h_1),c(h_2))}{\sqrt{c(h_1)\cdot c(h_2)}}\).
A kernel induces a pseudo-metric \(d(h_1,h_2)=\sqrt{k(h_1,h_1)+k(h_2,h_2)-2k(h_1,h_2)}\).
For the normalized histogram intersection kernel we have that \(0 \le k(h_1,h_2) \le 1\) and \(k(h_1,h_1)=k(h_2,h_2)=1\), thus \(d(h_1,h_2)=\sqrt{2-2k(h_1,h_2)}\). The count distance is obtained as \(d_{\textit{r-count}}(h_1,h_2)=\frac{1}{2}d(h_1,h_2)^2=1-k(h_1,h_2)\), a simplified version of the induced distance that preserves its properties: it takes values in [0, 1], and it is scale-invariant because \(k\) depends only on the ratio of the two counts. Non-negativity and symmetry are immediate. For the triangle inequality, write \(c_1=c(h_1)\), \(c_2=c(h_2)\) and note that \(k(h_1,h_2)=\sqrt{\min (c_1,c_2)/\max (c_1,c_2)}=e^{-\frac{1}{2}|\ln c_1-\ln c_2|}\), so that \(d_{\textit{r-count}}(h_1,h_2)=f(|\ln c_1-\ln c_2|)\) with \(f(t)=1-e^{-t/2}\). Since \(f\) is non-decreasing, concave, and satisfies \(f(0)=0\), it is subadditive, and composing it with the pseudo-metric \((h_1,h_2)\mapsto |\ln c(h_1)-\ln c(h_2)|\) again yields a pseudo-metric. Finally, \(d_{\textit{r-count}}\) is only a pseudo-metric, because any two distinct histograms with the same total count are at distance zero.\(\square \)
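To make the definition concrete, here is a minimal sketch (our illustration, not code from the paper; function names are hypothetical) that computes \(d_{\textit{r-count}}\) directly from the two total counts:

```python
# Minimal sketch of the count distance from Proposition 1 (hypothetical
# helper names; counts are assumed to be positive).
import math

def k_rcount(c1: float, c2: float) -> float:
    """Cosine-normalized histogram intersection kernel on total counts."""
    return min(c1, c2) / math.sqrt(c1 * c2)

def d_rcount(c1: float, c2: float) -> float:
    """Count distance d_r-count = 1 - k; scale-invariant, in [0, 1]."""
    return 1.0 - k_rcount(c1, c2)

assert d_rcount(10, 10) == 0.0                           # equal counts
assert abs(d_rcount(1, 4) - d_rcount(10, 40)) < 1e-12    # scale invariance
```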
Proposition 4
\(d_{{\textit{memd}}}\) is a pseudo-metric with \(d_{{\textit{memd}}}\le d_{{\textit{emd}}}\).
Proof
We recall and introduce the following notation: \(\bar{h}_1,\bar{h}_2\) are normalized D-dimensional histograms with N bins in each dimension. Histogram cells are indexed by index vectors \(\varvec{i},\varvec{j},\ldots \in \{1,\ldots ,N\}^D\). The kth component of the index vector \(\varvec{i}\) is denoted \(\varvec{i}(k)\).
For \(k=1,\ldots ,D\), the map \(d_{{\textit{memd}}}^{\downarrow k}(\bar{h}_1,\bar{h}_2):=d_{{\textit{emd}}}(\bar{h}_1^{\downarrow k},\bar{h}_2^{\downarrow k})\) is a pseudo-metric on the D-dimensional histograms \(\bar{h}_1,\bar{h}_2\), because it is induced by the metric \(d_{{\textit{emd}}}\) under the non-injective mapping \(\bar{h}\mapsto \bar{h}^{\downarrow k}\). Hence \(d_{{\textit{memd}}}\) is a sum of pseudo-metrics, and therefore itself a pseudo-metric.
We denote by \(EMD(\bar{h}_1,\bar{h}_2)\) the constrained optimization problem defining the earth mover’s distance, i.e., \(d_{{\textit{emd}}}(\bar{h}_1,\bar{h}_2)\) is the cost of the optimal solution of \(EMD(\bar{h}_1,\bar{h}_2)\). A feasible solution for \(EMD(\bar{h}_1,\bar{h}_2)\) is given by \(\varvec{f}=(f_{\varvec{i},\varvec{j}})_{\varvec{i},\varvec{j}}\), where \(f_{\varvec{i},\varvec{j}}\ge 0,\quad \sum _{\varvec{j}}f_{\varvec{i},\varvec{j}}=\bar{h}_1(\varvec{i}),\quad \sum _{\varvec{i}}f_{\varvec{i},\varvec{j}}=\bar{h}_2(\varvec{j}).\)
The cost of a feasible solution is \(cost (\varvec{f})=\sum _{\varvec{i},\varvec{j}}f_{\varvec{i},\varvec{j}}\,d(\varvec{i},\varvec{j}),\)
where d is the underlying metric on histogram cells. In our case, d is the Manhattan distance. However, all we require for this proof is that d is additive in the sense that there exist metrics \(d^{(k)}\) on \(\{1,\ldots ,N\}\)\((k=1,\ldots ,D)\) such that \(d(\varvec{i},\varvec{j})=\sum _{k=1}^D d^{(k)}(\varvec{i}(k),\varvec{j}(k)).\)
In the case of Manhattan distance, \(d^{(k)}(\varvec{i}(k),\varvec{j}(k))=|\varvec{i}(k)-\varvec{j}(k)|\).
Let \(\varvec{f}\) be a feasible solution for \(EMD(\bar{h}_1,\bar{h}_2)\). For \(k=1,\ldots ,D\) we define the marginal solutions \(f^{\downarrow k}_{i,j}:=\sum _{\varvec{i}:\varvec{i}(k)=i}\;\sum _{\varvec{j}:\varvec{j}(k)=j}f_{\varvec{i},\varvec{j}}\qquad (i,j\in \{1,\ldots ,N\}).\)
Then \(\varvec{f}^{\downarrow k}=( f^{\downarrow k}_{i,j})\) is a feasible solution of \(EMD(\bar{h}_1^{\downarrow k},\bar{h}_2^{\downarrow k})\), and we have \(cost (\varvec{f})=\sum _{\varvec{i},\varvec{j}}f_{\varvec{i},\varvec{j}}\sum _{k=1}^D d^{(k)}(\varvec{i}(k),\varvec{j}(k))=\sum _{k=1}^D\sum _{i,j}f^{\downarrow k}_{i,j}\,d^{(k)}(i,j)=\sum _{k=1}^D cost (\varvec{f}^{\downarrow k}).\)
In particular, when \(\varvec{f}\) is a minimal cost solution of \(EMD(\bar{h}_1,\bar{h}_2)\), then we have \(d_{{\textit{emd}}}(\bar{h}_1,\bar{h}_2)=cost (\varvec{f})\), and \(d_{{\textit{memd}}}(\bar{h}_1,\bar{h}_2)=\sum _{k=1}^D d_{{\textit{emd}}}(\bar{h}_1^{\downarrow k},\bar{h}_2^{\downarrow k})\le \sum _{k=1}^D cost (\varvec{f}^{\downarrow k})=cost (\varvec{f})=d_{{\textit{emd}}}(\bar{h}_1,\bar{h}_2).\)
\(\square \)
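For intuition, the marginal distance of Proposition 4 is also cheap to compute: with a Manhattan ground distance, the 1-D EMD between normalized histograms reduces to the L1 distance between their cumulative sums. A minimal sketch (our illustration with hypothetical names, not the authors' implementation):

```python
# Sketch of the marginal-EMD approximation d_memd, assuming a Manhattan
# ground distance on the bin grid and normalized input histograms.
import numpy as np

def emd_1d(h1: np.ndarray, h2: np.ndarray) -> float:
    """Exact 1-D EMD between normalized histograms with ground distance
    |i - j|: the L1 distance between the cumulative distributions."""
    return float(np.abs(np.cumsum(h1) - np.cumsum(h2)).sum())

def memd(H1: np.ndarray, H2: np.ndarray) -> float:
    """Sum of 1-D EMDs between the marginals of two normalized
    D-dimensional histograms of equal shape (Proposition 4's lower bound)."""
    assert H1.shape == H2.shape
    total = 0.0
    for k in range(H1.ndim):
        axes = tuple(a for a in range(H1.ndim) if a != k)
        total += emd_1d(H1.sum(axis=axes), H2.sum(axis=axes))
    return total
```

Each 1-D EMD is linear in the number of bins, whereas the full D-dimensional transportation problem requires solving a linear program.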
Proposition 5
If \(\bar{h}_1,\bar{h}_2\) are product histograms, then \(d_{{\textit{memd}}}(\bar{h}_1,\bar{h}_2 ) = d_{{\textit{emd}}}(\bar{h}_1,\bar{h}_2 )\).
Proof
Let \(\varvec{f}^{(k)}\) be feasible solutions for \(EMD(\bar{h}_1^{\downarrow k},\bar{h}_2^{\downarrow k})\) (\(k=1,\ldots ,D\)). Define \(f_{\varvec{i},\varvec{j}}:=\prod _{k=1}^D f^{(k)}_{\varvec{i}(k),\varvec{j}(k)}.\)
Then \(\varvec{f}=(f_{\varvec{i},\varvec{j}})\) is a feasible solution for \(EMD(\bar{h}_1,\bar{h}_2)\): since \(\bar{h}_2\) is a product histogram, \(\sum _{\varvec{i}}f_{\varvec{i},\varvec{j}}=\prod _{k=1}^D\sum _{i}f^{(k)}_{i,\varvec{j}(k)}=\prod _{k=1}^D \bar{h}_2^{\downarrow k}(\varvec{j}(k))=\bar{h}_2(\varvec{j}),\)
and similarly \(\sum _{\varvec{j}}f_{\varvec{i},\varvec{j}}=\bar{h}_1(\varvec{i})\). For the cost of the solutions we obtain, using that each \(\varvec{f}^{(l)}\) has total mass 1: \(cost (\varvec{f})=\sum _{\varvec{i},\varvec{j}}\Bigl (\prod _{l}f^{(l)}_{\varvec{i}(l),\varvec{j}(l)}\Bigr )\sum _{k=1}^D d^{(k)}(\varvec{i}(k),\varvec{j}(k))=\sum _{k=1}^D\sum _{i,j}f^{(k)}_{i,j}\,d^{(k)}(i,j)=\sum _{k=1}^D cost (\varvec{f}^{(k)}).\)
Choosing minimal cost solutions \(\varvec{f}^{(k)}\), this implies \(d_{{\textit{emd}}}(\bar{h}_1,\bar{h}_2)\le \sum _k d_{{\textit{emd}}}(\bar{h}_1^{\downarrow k},\bar{h}_2^{\downarrow k})=d_{{\textit{memd}}}(\bar{h}_1,\bar{h}_2)\), which together with Proposition 4 proves the proposition.\(\square \)
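A quick numeric sanity check of Proposition 5 (our illustration; it reuses memd from the previous sketch and computes the exact EMD by solving the transportation linear program directly):

```python
# Sanity check for Proposition 5: for product histograms, memd (above)
# coincides with the exact EMD under the Manhattan ground distance.
import numpy as np
from scipy.optimize import linprog

def emd_exact(H1: np.ndarray, H2: np.ndarray) -> float:
    """Exact EMD between normalized equal-shape histograms via the
    transportation LP (suitable only for small histograms)."""
    cells = np.array(list(np.ndindex(H1.shape)))              # all bin indices
    cost = np.abs(cells[:, None, :] - cells[None, :, :]).sum(-1).ravel()
    n = len(cells)
    A_eq, b_eq = [], []
    for i in range(n):                                        # mass leaving cell i
        row = np.zeros((n, n)); row[i, :] = 1
        A_eq.append(row.ravel()); b_eq.append(H1.ravel()[i])
    for j in range(n):                                        # mass arriving at cell j
        col = np.zeros((n, n)); col[:, j] = 1
        A_eq.append(col.ravel()); b_eq.append(H2.ravel()[j])
    return linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                   bounds=(0, None)).fun

H1 = np.outer([.2, .8], [.3, .7])     # 2-D product histograms
H2 = np.outer([.5, .5], [.6, .4])
assert abs(emd_exact(H1, H2) - memd(H1, H2)) < 1e-7           # Proposition 5
```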
Proposition 6
\(d_{\textit{c-memd}}\) on node histogram trees is conditionally negative definite.
Proof
Let us recall that \(d_{\textit{c-memd}}\) on node histogram trees is built from two component distances, the count distance \(d_{\textit{r-count}}\) (Eq. 15) and the earth mover’s distance \(d_{{\textit{emd}}}\) (Eq. 14), combined as a positively weighted sum over the nodes of the trees.
Let us prove the statement in a bottom-up fashion:
- \(d_{\textit{r-count}}(h_1,h_2)\) (Eq. 15) is conditionally negative definite: \(\frac{\min (c(h_1),c(h_2))}{\sqrt{c(h_1)\cdot c(h_2)}}\) is positive semi-definite (see the proof of Proposition 1), the negation of a positive semi-definite function is conditionally negative definite (Berg et al. 1984), and adding a constant does not affect conditional negative definiteness.
- \(d_{{\textit{emd}}}(\bar{h}_1,\bar{h}_2)\) (Eq. 14) is a Manhattan distance and thus conditionally negative definite (the same holds for other distances such as the Euclidean one; see Richards 1985 for a classical proof).
It follows that \(d_{\textit{c-memd}}(H_1,H_2)\) is conditionally negative definite, as it is a positively weighted sum of conditionally negative definite functions, and the property is closed under summation and multiplication by a positive scalar. \(\square \)
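A standard consequence worth spelling out (classical kernel theory, cf. Berg et al. 1984; not an additional claim of the proposition): by Schoenberg’s theorem, conditional negative definiteness is exactly the property that makes a distance usable inside exponential kernels, i.e.,
\[
K_\gamma (H_1,H_2)=\exp \bigl (-\gamma \, d_{\textit{c-memd}}(H_1,H_2)\bigr )
\]
is positive semi-definite for every \(\gamma >0\), so \(d_{\textit{c-memd}}\) can be plugged into Gaussian-type kernels for SVMs or Gaussian processes.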
Procedures for metric tree building and retrieval
In the following we briefly review the procedures for building and searching MTs, mostly following Uhlmann (1991).
An MT is built from a dataset of node histogram trees by recursively splitting the data until a stopping condition is met. Algorithm 1 describes the procedure for building the MT. The algorithm has two parameters, the maximal tree depth (\(d_{max}\)) and the maximal bucket size (\(n_{max}\)), and two additional arguments, the current depth (initialized at \(d=1\)) and the data to be stored (data), represented as a set of node histogram trees, one for each entity. An MT is made of two types of nodes: internal nodes and leaves. An internal node contains two entities and two branches; a leaf node contains a set of entities (the bucket). The MT construction proceeds by splitting the data and recursively calling the procedure on each of the subsets, until a stopping condition is met: if the maximal tree depth is reached, or the current set to be split is no larger than the maximal bucket size, a leaf node is returned. Otherwise, two entities z1 and z2 with non-zero distance are chosen at random from the data, and the data are split according to their distances to these entities: entities closer to z1 go to the left branch, the others go to the right one, and the procedure recurses on each branch in turn. A minimal sketch is given below.
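The following sketch mirrors the construction just described (our illustration under assumed parameter names; Algorithm 1 in the paper is the authoritative version):

```python
# Minimal sketch of metric tree construction; `dist` is the NHT distance.
import random

class Leaf:
    def __init__(self, bucket):
        self.bucket = bucket                   # entities stored at this leaf

class Internal:
    def __init__(self, z1, z2, left, right):
        self.z1, self.z2 = z1, z2              # the two pivot entities
        self.left, self.right = left, right    # subtrees

def build_mt(data, dist, d_max=20, n_max=50, d=1):
    # Stopping condition: maximal depth reached or bucket small enough.
    if d >= d_max or len(data) <= n_max:
        return Leaf(list(data))
    # Pick two random pivots with non-zero distance (give up after a few tries).
    z1, z2 = random.sample(data, 2)
    for _ in range(20):
        if dist(z1, z2) > 0:
            break
        z1, z2 = random.sample(data, 2)
    else:
        return Leaf(list(data))                # all sampled pairs coincide
    # Split: entities closer to z1 go left, the others go right.
    left = [x for x in data if dist(x, z1) <= dist(x, z2)]
    right = [x for x in data if dist(x, z1) > dist(x, z2)]
    return Internal(z1, z2,
                    build_mt(left, dist, d_max, n_max, d + 1),
                    build_mt(right, dist, d_max, n_max, d + 1))
```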
Once the MT has been built, the fastest solution for approximate k-nearest-neighbor retrieval for a query instance H amounts to traversing the tree, following at each node the branch whose corresponding entity is closer to the query, until a leaf node is reached. The entities in the bucket contained in the leaf node are then sorted according to their distance to the query entity, and the k nearest neighbors are returned; see Algorithm 2 for the pseudocode, and the sketch below. Notice that this is a greedy approximate solution: exact search would require backtracking over alternative branches, pruning a branch only when it cannot contain entities closer to the query than the current \(k\mathrm{th}\) neighbor (see Liu et al. 2005 for details). Here we trade effectiveness for efficiency, as our goal is to quickly find high-quality solutions rather than to discover the actual nearest neighbors. Alternative solutions can be implemented in the latter case (Liu et al. 2005; Muja and Lowe 2014).
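A matching sketch of the greedy descent (again our illustration, reusing the Leaf and Internal classes above):

```python
# Minimal sketch of greedy approximate k-NN retrieval in the metric tree.
def knn_greedy(tree, query, dist, k):
    node = tree
    # Greedy descent: at each internal node follow the closer pivot.
    while isinstance(node, Internal):
        node = node.left if dist(query, node.z1) <= dist(query, node.z2) \
               else node.right
    # Rank the bucket by distance to the query and return the k closest.
    return sorted(node.bucket, key=lambda x: dist(query, x))[:k]
```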
Both algorithms take as an additional implicit parameter the distance function over NHTs, which can be the exact EMD-based NHT metric or its approximate version based on marginal EMDs (exact for product histograms; see Proposition 5). Notice that for large databases, explicitly storing the NHT representation of each entity in the leaf buckets can be infeasible. In this case buckets only contain entity identifiers, and the corresponding NHTs are computed on the fly when scanning the bucket for the nearest neighbors. Standard caching solutions can be implemented to speed up this step.
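For the identifier-only buckets just mentioned, standard memoization suffices; a sketch, where compute_nht is a hypothetical constructor building an entity's NHT from the database:

```python
# Hypothetical sketch: leaf buckets hold entity identifiers only; NHTs are
# built on demand and memoized. compute_nht is a placeholder, not a real API.
from functools import lru_cache

@lru_cache(maxsize=100_000)
def get_nht(entity_id):
    return compute_nht(entity_id)   # hypothetical DB-backed NHT constructor

def scan_bucket(bucket_ids, query_nht, dist, k):
    # Rank bucket entities by distance to the query, computing NHTs lazily.
    return sorted(bucket_ids, key=lambda e: dist(query_nht, get_nht(e)))[:k]
```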
Details on actor retrieval results
See Fig. 11. The table below reports, for each test actor, the nearest neighbor (NN) retrieved under the genre-based metric and under the business-based metric.
Test actor | NN (genre metric) | NN (business metric)
---|---|---
Muhammad I Ali | John III Kerry | Justin Ferrari |
Kevin I Bacon | Lance E. Nichols | Charlie Sheen |
Christian Bale | Channing Tatum | Hugh I Grant |
Warren I Beatty | Art I Howard | Christopher Reeve |
Humphrey Bogart | Eddie I Graham | Tony I Curtis |
David I Bowie | Ethan I Phillips | Adam I Baldwin |
Adrien Brody | Mark I Camacho | Kevin I Kline |
Steve Buscemi | Vincent I Price | Keith I David |
Michael I Caine | Robert De Niro | Robert De Niro |
David Carradine | Clint Howard | Rutger Hauer |
Jim Carrey | Jason I Alexander | Jake Gyllenhaal |
Vincent Cassel | Keith Szarabajka | Dougray Scott |
James I Coburn | Ned Beatty | Louis Gossett Jr. |
Robbie Coltrane | Rene Auberjonois | H.B. Warner |
Sean Connery | Gene Hackman | Paul I Newman |
Kirk I Douglas | Eli Wallach | Burt Lancaster |
Rupert Everett | Brian Blessed | Omar Sharif |
Henry Fonda | Dick I Curtis | James I Mason |
John I Goodman | Christopher I Plummer | Ron I Perlman |
Al I Gore | Jeroen Willems | Dwight D. Eisenhower |
Dustin Hoffman | Rip Torn | Pierce Brosnan |
Stan Laurel | Billy Franey | Oliver Hardy |
Jude Law | Michael I Sheen | Omar Sharif |
Jack Lemmon | Charles Dorety | William I Holden |
John Malkovich | William H. Macy | Mickey Rourke |
Marcello Mastroianni | James I Payne | Ajay Devgn |
Malcolm I McDowell | Clint Howard | Martin Sheen |
Alfred Molina | William H. Macy | George I Kennedy |
David I Niven | Ivan F. Simpson | William I Powell |
Philippe Noiret | Dominique Zardi | Pat I O’Brien |
Al I Pacino | Jeremy Piven | Tom Cruise |
Chazz Palminteri | Bobby Cannavale | Norman Reedus |
Gregory Peck | James Seay | Christopher I Lambert |
Sean I Penn | Andy I Garcia | Michael I Douglas |
Anthony I Perkins | Nicholas I Campbell | George C. Scott |
Joe Pesci | Stephen Marcus | Anton Yelchin |
Elvis Presley | Berton Churchill | Lee I Marvin |
Robert I Redford | Roscoe Ates | Michael Keaton |
Keanu Reeves | Kevin I Pollak | Antonio Banderas |
Geoffrey Rush | Jim I Carter | Ian I McShane |
Steven Seagal | Frank Pesce | Marlon Brando |
Joseph Stalin | Jimmy I Carter | Tom I Herbert |
Sylvester Stallone | Nicolas Cage | Johnny Depp |
Ben Stiller | Bill I Murray | Antonio Banderas |
David Suchet | Danny Nucci | James I Nesbitt |
John Turturro | Danny DeVito | Bruce I Dern |
Lee Van Cleef | Robert I Peters | Jack Warden |
Christoph Waltz | Frank I Gorshin | Demián Bichir |
Denzel Washington | Michael V Shannon | Tom Cruise |
Orson Welles | Donald Pleasence | Rod Steiger |
Cite this article
Jaeger, M., Lippi, M., Pellegrini, G. et al. Counts-of-counts similarity for prediction and search in relational data. Data Min Knowl Disc 33, 1254–1297 (2019). https://doi.org/10.1007/s10618-019-00621-7