Metrics based on average distance between sets

This paper presents a distance function between sets based on an average of the distances between their elements. The distance function is a metric if the sets are non-empty finite subsets of a metric space. It includes the Jaccard distance as a special case, and can be generalized by using the power mean so as to also include the Hausdorff metric on finite sets. It can be extended to deal with non-null measurable sets, and applied to measuring distances between fuzzy sets and between probability distributions. These distance functions are useful for measuring similarity between data in computer science and information science. In instructional systems design and information retrieval, for example, they are likely to be useful for analyzing and processing text documents that are modeled as hierarchical collections of sets of terms. A distance measure of learners' knowledge is also discussed in connection with quantities of information.
Key words and phrases. Metric, distance between sets, average distance, power mean, Hausdorff metric.


Introduction
A metric defined in general topology [1,2], based on a natural notion of distance between points, is generally extensible to a distance between sets or more complex elements. The Hausdorff metric is a typical example and is practically used for image data analysis [3], but it has some problems. With the Euclidean metric on R, for example, the Hausdorff distance between bounded subsets of R often depends only on their suprema or infima, no matter how the other elements of the sets are distributed within a certain range; that is, it places importance on extremes and disregards the middle parts of the sets. This is a drawback because it makes the distance sensitive to noise, errors and outliers in analyzing real-world data. There is a need to develop another metric that reflects the overall characteristics of the elements of the sets.
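The drawback noted above can be made concrete with a small numerical sketch (not from the paper; the sets and values are illustrative): two finite subsets of R with identical extremes but different middle elements are at the same Hausdorff distance from a third set, while the group-average distance of Section 2 distinguishes them.

```python
# Illustrative only: d(x, y) = |x - y| on R.

def hausdorff(A, B):
    """Hausdorff distance (2.3) between non-empty finite subsets of R."""
    dist = lambda x, S: min(abs(x - s) for s in S)
    return max(max(dist(a, B) for a in A), max(dist(b, A) for b in B))

def group_average(A, B):
    """Mean of all pairwise distances (the group-average distance of Section 2)."""
    return sum(abs(a - b) for a in A for b in B) / (len(A) * len(B))

B = [20, 30]
A1 = [0, 10]             # extremes only
A2 = [0, 1, 2, 3, 10]    # same extremes, different middle

print(hausdorff(A1, B), hausdorff(A2, B))          # 20 20   (identical)
print(group_average(A1, B), group_average(A2, B))  # 20.0 21.8 (different)
```

The Hausdorff values coincide because only the extreme points matter, whereas the average reflects every element.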
In computer science, especially in the fields of pattern recognition, classification, information retrieval and artificial intelligence, it is important for data analysis to measure the similarity or difference between data objects such as documents, images and signals. If the data objects can be represented by vectors, a conventional distance between vectors is a proper measure in their vector space. In practice, however, there are various data objects that should be dealt with in the form of collections of sets, probability distributions, graph-structured data, or collections consisting of more complex data elements. To analyze these data objects, numerous distance-like functions have been developed [4], such as the Mahalanobis distance and the Kullback-Leibler divergence, even though they do not necessarily satisfy symmetry and/or the triangle inequality.
As a true metric, besides the Hausdorff metric, there is another type of distance function between sets, such as the Jaccard distance, based on the cardinality of the symmetric difference between sets or its variations. However, it measures only the size of the set difference, and takes no account of qualitative differences between individual elements. Thus, both metrics are insufficient for analyzing informative data sets in which each element has its own specific meaning.
This paper presents a new distance function between sets based on an average distance. It takes all elements into account. It is a metric if the sets are non-empty finite subsets of a metric space, and includes the Jaccard distance as a special case. By using the power means [5], we obtain generalized forms that also include the Hausdorff metric. Extensions of the metric to hierarchical collections of infinite subsets will be useful for treating fuzzy sets and probability distributions.


Preliminaries
The metric is extended to various types of generalized metrics. To avoid confusion in terminology, the following definition is used.
Definition 1. Let X be a non-empty set and let d be a real-valued function on X × X. Consider the following conditions: for all a, b, c ∈ X,
M1. d(a, b) ≥ 0 (non-negativity),
M2. d(a, a) = 0 (zero self-distance),
M3. d(a, b) = 0 implies a = b,
M4. d(a, b) = d(b, a) (symmetry),
M5. d(a, b) + d(b, c) ≥ d(a, c) (triangle inequality).
If d satisfies all of the conditions M1 to M5, then d is called a metric on X.
The set X is called a metric space and denoted by (X, d). The function d is called a distance function or simply a distance.
The metric is generalized by relaxing the conditions as follows:
• If d satisfies M1, M2, M4 and M5, then it is called a pseudo-metric.
• If d satisfies M1, M2, M3 and M5, then it is called a quasi-metric.
• If d satisfies M1, M2, M3 and M4, then it is called a semi-metric.
This terminology follows [1,2], though the term "semi-metric" is sometimes used as a synonym of pseudo-metric [4].
A set-to-set distance is usually defined as follows (see, e.g., [6]): Let A and B be two non-empty subsets of X. For each x ∈ X, the distance from x to A, denoted by dist(x, A), is defined by the equation
(2.1) dist(x, A) = inf{d(x, a) | a ∈ A}.
This is fundamental not only to the definitions of a boundary point and an open set in metric spaces but also to the generalization of a metric space to an approach space [7]. Similarly, the distance from A to B can be straightforwardly defined by
(2.2) dist(A, B) = inf{d(a, b) | a ∈ A, b ∈ B}.
The function dist() is neither a pseudo-metric nor a semi-metric. However, let S(X) be the collection of all non-empty closed bounded subsets of X. Then, for A, B ∈ S(X), the function h(A, B) defined by
(2.3) h(A, B) = max{ sup{dist(a, B) | a ∈ A}, sup{dist(b, A) | b ∈ B} }
is a metric on S(X), and h is called the Hausdorff metric. The collection S(X) topologized by the metric h is called a hyperspace in general topology.
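A quick sanity check on why dist() fails the metric axioms can be sketched for finite subsets of R (the sets are illustrative): overlapping but unequal sets are at distance 0.

```python
def dist_point(x, A):
    # (2.1): dist(x, A) = inf{d(x, a) | a in A}, with d(x, y) = |x - y|
    return min(abs(x - a) for a in A)

def dist_sets(A, B):
    # (2.2): dist(A, B) = inf{d(a, b) | a in A, b in B}
    return min(dist_point(a, B) for a in A)

A, B = [0, 1, 5], [1, 9]
print(dist_point(3, A))  # 2
print(dist_sets(A, B))   # 0, although A != B: dist() is not even a semi-metric
```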
In computer science, data sets are generally discrete and finite. A popular metric is the Jaccard distance (or Tanimoto distance, Marczewski-Steinhaus distance [4]), defined by
(2.4) j(A, B) = |A △ B| / |A ∪ B|,
where |A| is the cardinality of A, and △ denotes the symmetric difference: A △ B = (A \ B) ∪ (B \ A). In addition, |A △ B| itself is also used as a metric. In cluster analysis [8], the distance (2.2) is used as the minimum distance between data clusters for single-linkage clustering, and likewise the maximum distance is defined by replacing the infimum with the maximum for complete-linkage clustering. Moreover, the group-average distance (or average distance, mean distance), defined as g(A, B) in the following, is also typically used for hierarchical clustering. Although these three distance functions are not metrics, the group-average distance plays an important role in this paper.
Lemma 2. Suppose (X, d) is a non-empty metric space. Let S(X) denote the collection of all non-empty finite subsets of X. For each A and B in S(X), define g(A, B) on S(X) × S(X) to be the function
(2.5) g(A, B) = (1 / (|A||B|)) ∑_{a∈A} ∑_{b∈B} d(a, b).
Then g satisfies the triangle inequality.
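The definitions (2.4) and (2.5) can be sketched for small finite sets (the sets are illustrative); the last line shows the non-zero self-distance that keeps g from being a metric.

```python
def jaccard(A, B):
    # (2.4): |A symmetric-difference B| / |A union B|
    return len(A ^ B) / len(A | B)

def group_average(A, B, d):
    # (2.5): g(A, B) = (1 / (|A||B|)) * sum of d(a, b) over a in A, b in B
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

discrete = lambda a, b: 0 if a == b else 1
A, B = {1, 2, 3}, {2, 3, 4}
print(jaccard(A, B))                  # 2/4 = 0.5
print(group_average(A, B, discrete))  # 7/9
print(group_average(A, A, discrete))  # 6/9: g(A, A) != 0, so g is not a metric
```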
Proof. The triangle inequality for d gives d(a, c) ≤ d(a, b) + d(b, c). Summing over all a ∈ A, b ∈ B and c ∈ C, for all A, B, C ∈ S(X) we have
(2.6) g(A, B) + g(B, C) ≥ g(A, C).
For ease of notation, let s(A, B) be the sum of all pairwise distances between A and B, such that
(2.7) s(A, B) = ∑_{a∈A} ∑_{b∈B} d(a, b),
so that g(A, B) = s(A, B) / (|A||B|). The function s is additive over disjoint decompositions:
(2.8) s(A_1 ∪ A_2, B_1 ∪ B_2) = s(A_1, B_1) + s(A_1, B_2) + s(A_2, B_1) + s(A_2, B_2),
where A_i ∩ A_j = ∅ = B_i ∩ B_j for i ≠ j. Furthermore, we define t(A, B, C) by the following equation
(2.9) t(A, B, C) = |C| s(A, B) + |A| s(B, C) − |B| s(A, C).
It follows from Lemma 2 that t(A, B, C) ≥ 0 for A, B, C ∈ S(X), which is a shorthand notation for the triangle inequality (2.6).

Metric based on average distance
Theorem 3. Suppose (X, d) is a non-empty metric space. Let S(X) denote the collection of all non-empty finite subsets of X. For each A and B in S(X), define f(A, B) on S(X) × S(X) to be the function (3.1). Then f is a metric on S(X).
Proof. The function f can be rewritten, using s in (2.7), as (3.2). It is non-negative and symmetric. Moreover, f(A, B) = 0 holds if, and only if, A = B. The triangle inequality is proved straightforwardly, in the form f(A, B) + f(B, C) − f(A, C) ≥ 0, by showing that the left-hand side can be transformed into a sum of non-negative terms of s and t in (2.9), after decomposing A ∪ B ∪ C into disjoint parts. The details are given in Appendix A.
The function f in (3.1) can be rewritten using g in (2.5). In (S(X), f), for all a, b ∈ X, we have f({a}, {b}) = d(a, b).
Corollary 5. Suppose (X, d) is a non-empty metric space. Let S(X) denote the collection of all non-empty finite subsets of X. For each A and B in S(X), define e(A, B) on S(X) × S(X) to be the corresponding function. Then e is a semi-metric on S(X).
Proof. Let e(A, B) be rewritten using s in (2.7). In a similar manner to the proof of Theorem 3, it can be proved that the conditions M1 to M4 are satisfied; M5 (the triangle inequality) is not guaranteed.
It is noted that there exist A, B, C ∈ S(X) for which e(A, B) + e(B, C) < e(A, C), so that the condition M5 is not generally satisfied.

Extensions
This section discusses future directions for generalization of the average distance based on the power mean, and extensions to metrics on collections of infinite sets.
4.1. Generalization based on the power mean. The distance function (3.1) can be unified with the Hausdorff metric for finite sets by using the power mean. To simplify expressions, we use the following notation. Let M^(i)_p(x ∈ A, ψ, w) be an extended weighted power mean of ψ(x), given by (4.1), together with its variation (4.2) using the exponential transform of ψ, where i ∈ {0, 1} indicates one of the two types (4.1) and (4.2), p is an extended real number, ψ is a non-negative function of x ∈ A, and w is a weight such that w(x) ∈ (0, 1] for each x and ∑_{x∈A} w(x) > 0. In addition, let the weight default to the indicator function 1_A, defined by 1_A(x) = 1 for x ∈ A and 1_A(x) = 0 for x ∉ A. If there exists x ∈ A such that ψ(x) = 0 for p < 0 in (4.1), then we define M^(1)_p = 0, which is consistent with taking the limit ψ(x) → 0+, though such a case is left undefined in the conventional power mean in order to avoid division by zero.
The power mean includes various types of means [5], which are parameterized by p. By also taking limits for p = 0 and p = ±∞, we have the following: the limit p → 0 gives the geometric mean, and the limits p → +∞ and p → −∞ give the maximum and the minimum, respectively. There are some forms of function composition of M^(i)_p that include the distance function (3.1) as a special case, for example (4.3) and (4.4), where i, j, k ∈ {0, 1} and S^∁ is the complement of S.
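The limiting behavior above can be sketched with the ordinary weighted power mean that (4.1) extends; the paper's additional conventions for zero values and zero weights are omitted here, and the inputs are illustrative.

```python
import math

def power_mean(values, weights, p):
    """Weighted power mean of strictly positive values, with limit cases."""
    W = sum(weights)
    if p == 0:                 # limit p -> 0: weighted geometric mean
        return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / W)
    if p == math.inf:          # limit p -> +inf: maximum
        return max(values)
    if p == -math.inf:         # limit p -> -inf: minimum
        return min(values)
    return (sum(w * v ** p for v, w in zip(values, weights)) / W) ** (1 / p)

vals, wts = [1.0, 4.0], [1.0, 1.0]
print(power_mean(vals, wts, 1))         # arithmetic mean: 2.5
print(power_mean(vals, wts, 0))         # geometric mean: 2.0
print(power_mean(vals, wts, -1))        # harmonic mean: 1.6
print(power_mean(vals, wts, math.inf))  # 4.0
```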
In addition, let w be extended to w ∈ [0, 1] to include zero, though the weight of at least one summand must still be positive. Furthermore, it is assumed that 0 · ∞ = 0, in order to ensure 0 · 0^p = 0 for p < 0 in M^(1)_p, so that a zero weight can be used to exclude terms from averaging even if ψ(x) = 0 (i.e., distance zero) in those terms. Then, the function (4.3) can be simply expressed as (4.5), taken over y ∈ A ∪ B with ψ = d(x, y), by using the weight function w(x, y) defined with the Iverson bracket [·], a quantity defined to be 1 whenever the statement within the brackets is true, and 0 otherwise. The distance function f in (3.1) and the Hausdorff metric (2.3) are then both expressed in this generalized form, where i, j, k ∈ {0, 1}.
Although it is unclear, at present, what conditions on the parameters i, j, k, p, q and r are necessary for (4.3) and (4.4) to be metrics, these generalized forms are in fact capable of generating various distance functions, as follows.
Example 6. The exponential types u^(0,0)_{p,q}(A, B) and v^(0,0,0)_{r,p,q}(A, B) are written with averages of e^{q d(x,y)} over y ∈ A and y ∈ B. If d is the discrete metric multiplied by a positive constant λ, then we obtain the functions (4.6) and (4.7), which are metrics for p > 0 and p < 0, respectively. The proofs of the triangle inequalities are outlined in Appendix B. If p = 0, then the situation is the same as in Example 4. By taking the limit as p → 0, we can see that both functions are equal to the Jaccard distance (2.4) up to the coefficient λ.

4.2. Hierarchical metric spaces. Suppose that (X, d) is a metric space and S(X) is the collection of all non-empty finite subsets of X. Let k be a non-negative integer, let S^{k+1}(X) denote the collection of all non-empty finite subsets of S^k(X), and let f_k be a metric on S^k(X), where S^1(X), S^0(X), f_1 and f_0 correspond, respectively, to S(X), X, f and d in Theorem 3. For k > 1, in much the same way, for each A and B in S^k(X), the function f_k(A, B) can be defined by (4.8), which generates a metric space (S^k(X), f_k) based on (S^{k−1}(X), f_{k−1}). This metric will be useful for constructing hierarchical hyperspaces.
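The level-by-level recursion can be sketched as follows. For illustration only, the group-average distance g is lifted instead of the metric f (whose closed form (3.1) is not repeated in this excerpt); g is not a metric, but the recursion has the same shape.

```python
def lifted(d):
    """Given a distance on level k-1, return the group-average distance on level k."""
    def dk(A, B):
        return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))
    return dk

d0 = lambda x, y: abs(x - y)   # base metric on R
d1 = lifted(d0)                # distance between finite sets of reals
d2 = lifted(d1)                # distance between finite collections of such sets

print(d1({0, 1}, {2, 3}))           # 2.0
print(d2([{0, 1}], [{2, 3}, {4}]))  # 2.75
```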

4.3. Duality. There is a kind of duality between sets and elements with respect to their distance functions. For example, we can define the functions D and d symmetrically as in (4.9) and (4.10), where C(a) = {A | a ∈ A}. The set-to-set distance D in (4.9) is a metric due to the axiom of extensionality, and the element-to-element distance d in (4.10) is a pseudo-metric.
According to Theorem 3, D can be defined by f in (3.1), instead of (4.9), so that we have (4.11). In this case, D depends on d, so that D is also a pseudo-metric. Furthermore, in the situation of Section 4.2, let (X, d) be a metric space, let S^1(X) be the collection of all non-empty finite subsets of X, and let C(a) = {A ∈ S^1(X) | a ∈ A} be an element of S^2(X). If d is the discrete metric, then we have (4.12), where κ is a certain positive real number such that κ < 1, and f_2 and d can be regarded as functions of D. In general, there may exist D and d that are formally expressed by means of functions F and G, which may be generalized functions of the kind given in (4.5). It is interesting to consider whether such F and G really exist and what features they have. In numerical analysis, D and d that are consistent with each other may be obtained by the iterative computation of (4.11) for all A, B ∈ S^1(X) and (4.12) for all a, b ∈ X, starting from an initial metric space (X, d), provided that each converges to a non-trivial function.
4.4. Generalized metrics. The group-average distance g(A, B) in (2.5) can be regarded as a generalized metric that satisfies conditions M1 (non-negativity), M4 (symmetry) and M5 (triangle inequality) in Definition 1. In conventional topology, there has been no such generalization obtained by dropping both conditions M2 and M3, which are usually combined into the single axiom d(a, b) = 0 ⇔ a = b (identity, reflexivity, or coincidence). Although M3 can be dropped for the pseudo-metric so as to allow d(a, b) = 0 for a ≠ b, the zero self-distance d(a, a) = 0 (M2) seems to be indispensable in point-set topology, where an element is a point having no size. An exception is a partial metric [9], which is defined to satisfy M1, M3, M4 and, instead of M5, the partial-metric triangularity (4.13). In computer science or information science, the elements of data sets are not merely simple points. They may have rich internal content. Some elements may have internal structures that cause non-zero self-distance, and some elements may differ from each other in their properties even though they are indiscernible from a metric point of view. The concept of distance can be used for measuring not only the difference between objects but also the cost of moving or the energy of transition between states. This is why the generalization toward non-zero self-distance is worth considering. If the triangle inequality holds, it provides upper and lower bounds for such quantities. The group-average distance g(A, B) is a typical example, and it is simpler and more natural than the metric f(A, B) in (3.1).
Incidentally, the function g(A, B) is not a partial metric because it does not satisfy (4.13). On the other hand, f(A, B) leads to an instance of a partial metric from a special case of (4.7) in Example 6. By taking the limit as λ → ∞ for p < 0 and multiplying by a positive constant, for non-empty finite sets A and B, we obtain a metric D_ν. This suggests that, for ν ∈ [0, 1/2), D_ν is a partial metric on a collection of non-empty finite sets.
4.5. Extension to infinite sets. If S(X) is the collection of all non-null measurable subsets of (X, d), and d is Lebesgue integrable on each element of S(X), then the group-average distance g(A, B), for A, B ∈ S(X), can be defined by
(4.14) g(A, B) = (1 / (µ(A)µ(B))) ∫_A ∫_B d(x, y) dµ(y) dµ(x),
where µ is a measure on X, and the distance function (3.2) can then be extended to (4.15). If d is the discrete metric, then (4.15) is equal to the Steinhaus distance [4].
Example 7. Let (R, d) be a metric space with d(x, y) = |x − y|. For two intervals A and B, the distance function (4.15) can be expressed in closed form; in particular, when A and B are disjoint, the average distance is equal to the distance between the centers of A and B. This is consistent with an intuitive notion of the distance between balls in this (R, d).
If S(X) is the collection of all non-empty, countably infinite subsets (measure-zero sets) of X, then g(A, B) and f(A, B) should be defined by taking limits in (2.5) and (3.2), provided that both have definite values. In order to determine the average distance, we have to define a proper condition, which might be called "averageable". The average distance will depend strongly on the accumulation points of A and B, and it will require additional assumptions on the difference in strength between the accumulation points. This requirement is closely related to a "relative measure" that is needed to obtain the ratio of the cardinality of an infinite set to the cardinality of its superset in (3.2). In conventional measure theory, however, any set of cardinality ℵ_0 is a null set having measure zero, so that both the counting measure and the Lebesgue measure are useless for computing the ratio. It is necessary to use another measure. A feasible approach is discussed in the following subsection.
4.6. Estimation by sampling. In application to computational data analysis, statistical estimation by sampling is very useful for obtaining an approximate value of f(A, B) when the sets are very large. According to the law of large numbers, if enough sample elements are selected randomly, an average generated by those samples should approximate the average over the total population. The procedure is as follows:
(1) Choose a superset P of A ∪ B as a population such that P ⊇ A ∪ B.
(2) Select a finite subset S of P as a sample obtained by random sampling.
(3) Let S A = S ∩ A and S B = S ∩ B. Then, compute f (S A , S B ) for approximation of f (A, B).
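Steps (1) to (3) can be sketched as follows; for concreteness the estimate is applied to the group-average distance g (the closed form of f is not repeated in this excerpt), and the populations are illustrative uniform samples on the line.

```python
import random

def group_average(A, B, d):
    """g(A, B): mean of all pairwise distances between A and B."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

random.seed(0)
d = lambda x, y: abs(x - y)

# Illustrative data: two large disjoint subsets of R.
A = [random.uniform(0, 1) for _ in range(20000)]
B = [random.uniform(5, 6) for _ in range(20000)]

# (1) P = A ∪ B; (2) draw a random sample S; (3) S_A = S ∩ A, S_B = S ∩ B.
# Since A and B are disjoint lists here, subsampling each directly is equivalent.
S_A = random.sample(A, 200)
S_B = random.sample(B, 200)

estimate = group_average(S_A, S_B, d)
print(estimate)   # close to the exact expected distance E|a - b| = 5.0
```

Computing g on the full population would take 4 × 10^8 pairwise distances; the sample needs only 4 × 10^4.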
The sampling process and its randomness are crucial for efficiently obtaining a good approximation. Useful hints can be found in the various sampling techniques developed for Monte Carlo methods [10]. In most cases, the sampling error is expected to decrease as the sample size increases, except in situations where the distribution of d has no mean (e.g., the Cauchy distribution). The notion of sampling suggests an intuitive way to define a relative measure on a σ-algebra over a set X, which could be called a "sample counting measure". Suppose A and B are subsets of X. Let ρ(A : B) denote the ratio of the cardinality of A to the cardinality of B, let P be a superset of A ∪ B, and let S_n be a non-empty finite subset of P such that S_n = ∪_{i=1}^{n} Y_i, where Y_i is the i-th non-empty sample randomly selected from P. Then, the ratio ρ(A : B) can be determined by taking the limit as n approaches ∞:
ρ(A : B) = lim_{n→∞} |S_n ∩ A| / |S_n ∩ B|,
if such a limit exists and there exists a choice function that performs random sampling. Otherwise, instead of random sampling, systematic sampling may be available if the elements of X are supposed to be distributed with uniform density in its measurable metric space. For example, suppose there exists a finite partition of P in which every part has an almost equal diameter. It then seems better for S_n to have exactly one element within each of the parts.

4.7. Metrics for fuzzy sets and probability distributions. A fuzzy set can be represented by a collection of crisp sets, so that the distance between fuzzy sets can be defined by the distance between the collections of such crisp sets. Let A be a fuzzy set: A = {(x, m_A(x)) | x ∈ X}, where m_A(x) is a membership function, and let A_α be a crisp set called an α-level set [11] such that A_α = {x ∈ X | m_A(x) ≥ α}. Then, A can be represented by a set C(A) of ordered pairs (α, A_α). The distance between two fuzzy sets A and B can be defined by f_2(C(A), C(B)), where there may be various ways to treat α. This notion is also applicable to the distance between probability distributions, where probability density functions are used instead of the membership function.
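The α-level representation can be sketched as follows; the membership values and the grid of α values are illustrative, since the paper leaves the treatment of α open.

```python
def alpha_level(membership, alpha):
    """A_alpha = {x | m_A(x) >= alpha} for a finite fuzzy set."""
    return frozenset(x for x, m in membership.items() if m >= alpha)

m_A = {'a': 1.0, 'b': 0.6, 'c': 0.2}          # membership function m_A
levels = [0.25, 0.5, 0.75, 1.0]                # illustrative grid of alpha values
C_A = {alpha: alpha_level(m_A, alpha) for alpha in levels}
for alpha in levels:
    print(alpha, sorted(C_A[alpha]))
# 0.25 and 0.5 give {a, b}; 0.75 and 1.0 give {a}
```

The collection C_A of pairs (α, A_α) is then a crisp hierarchical object on which a set-level metric such as f_2 can act.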

Concluding Remarks
We have found that, for a metric space (X, d), there exists a distance function between non-empty finite subsets of X that is a metric based on the average distance of d. The distance function (3.1) in Theorem 3 is the most typical one, and it includes the Jaccard distance as the special case where d is the discrete metric. Its extensions based on the power mean will be useful for developing generalized forms that also include the Hausdorff metric and various other distance functions. Furthermore, the extensions to infinite subsets of X will provide metrics for measuring the dissimilarity of fuzzy sets and probability distributions.

Appendix A. Triangle Inequality in Theorem 3
The triangle inequality for f can be proved by showing that f(A, B) + f(B, C) − f(A, C) ≥ 0. Let A ∪ B ∪ C be decomposed into seven disjoint sets, and let θ = B \ β = B ∩ (A ∪ C). Taking account of (2.8) and (2.9), the left-hand side can be transformed into a sum of non-negative terms of s and t, where equality holds if all terms of s and t are zero.

Appendix B. Triangle Inequalities of Example 6
The triangle inequality for (4.6) can be proved as follows. Let x = e^{pλ} and let τ(x) denote the left-hand side of the triangle inequality expressed as a function of x. Then the triangle inequality for p > 0 is equivalent to τ(x) ≥ 0 for x > 1. The first derivative of τ(x) with respect to x can be bounded below in terms of the Jaccard distance j in (2.4). Since τ(1) = 0 and τ′(x) ≥ 0 for x ≥ 1, we have τ(e^{pλ}) ≥ 0 for p > 0.

Introduction
A metric defined in general topology [1,2], based on a natural notion of distance between points, is generally extensible to distance between sets or more complex elements.The Hausdorff metric is such a typical one and practically used for image data analysis [3], but it has some problems.In the Euclidean metric on R, for example, the Hausdorff distance between bounded subsets of R often depends only on their suprema or infima, no matter how the other elements of the sets are distributed within a certain range, which means it places importance on extremes and disregards the middle parts of the sets.This is a drawback because it is sensitive to noises, errors and outliers in analyzing real world data.There is a need to develop another metric that reflects the overall characteristics of elements of the sets.
In computer science, especially in the fields of pattern recognition, classification, information retrieval and artificial intelligence, it is important for data analysis to measure similarity or difference between data objects such as documents, images and signals.If the data objects can be represented by vectors, a conventional distance between vectors is a proper measure in their vector space.In practice, however, there are various data objects that should be dealt with in the form of collections of sets, probability distributions, graph structured data, or collections consisting of more complex data elements.To analyze these data objects, numerous distance-like functions have been developed [4], like the Mahalanobis distance and the Kullback-Leibler divergence, even though they do not necessarily satisfy symmetry and/or the triangle inequality.
As a true metric, besides the Hausdorff metric, there is another type of distance functions of sets, such as the Jaccard distance, based on the cardinality of the symmetric difference between sets or its variations.However, it measures only the size of the set difference, and takes no account of qualitative differences between individual elements.Thus, both metrics are insufficient to analyze informative data sets in which each element has its own specific meaning.
Key words and phrases.Metric, distance between sets, average distance, power mean, Hausdorff metric.
This paper presents a new distance function between sets based on an average distance.It takes all elements into account.It is a metric if the sets are non-empty finite subsets of a metric space, and includes the Jaccard distance as a special case.By using the power means [5], we obtain generalized forms that also include the Hausdorff metric.Extensions of the metric to hierarchical collections of infinite subsets will be useful for treating fuzzy sets and probability distributions.

Preliminaries
The metric is extended to various types of generalized metrics.To avoid confusion in terminology, the following definition is used.The set X is called a metric space and denoted by (X, d).The function d is called distance function or simply distance.
The metric is generalized by relaxing the conditions as follows: • If d satisfies M1, M2, M4 and M5, then it is called a pseudo-metric.
• If d satisfies M1, M2, M3 and M5, then it is called a quasi-metric.
• If d satisfies M1, M2, M3 and M4, then it is called a semi-metric.This terminology follows [1,2], though the term "semi-metric" is sometimes referred to as a synonym of pseudo-metric [4].
A set-to-set distance is usually defined as follows (see, e.g., [6]): Let A and B be two non-empty subsets of X.For each x ∈ X, the distance from x to A, denoted by dist(x, A), is defined by the equation This is fundamental not only to the definitions of a boundary point and an open set in metric spaces but also to the generalization of a metric space to approach space [7].Similarly, the distance from A to B can be straightforwardly defined by The function dist() is neither a pseudo-metric nor a semi-metric.However, let S(X) be the collection of all non-empty closed bounded subsets of X.Then, for A, B ∈ S(X), the function h(A, B) defined by is a metric on S(X), and h is called the Hausdorff metric.The collection S(X) topologized by the metric h is called a hyperspace in general topology.
In computer science, data sets are generally discrete and finite.A popular metric is the Jaccard distance (or Tanimoto distance, Marczewski-Steinhaus distance [4]) that is defined by where |A| is the cardinality of A, and △ denotes the symmetric difference: A△B = (A \ B) ∪ (B \ A).In addition, |A△B| is also used as a metric.In cluster analysis [8], the distance (2.2) is used as the minimum distance between data clusters for single-linkage clustering, and likewise the maximum distance is defined by replacing infimum with maximum for complete-linkage clustering.Moreover, the group-average distance (or average distance, mean distance) defined as g(A, B) in the following is also typically used for hierarchical clustering.Although these three distance functions are not metrics, the group-average distance plays an important role in this paper.Lemma 2. Suppose (X, d) is a non-empty metric space.Let S(X) denote the collection of all non-empty finite subsets of X.For each A and B in S(X), define g(A, B) on S(X) × S(X) to be the function Then g satisfies the triangle inequality.For ease of notation, let s (A, B) be the sum of all pairwise distances between A and B such that (2.8) where A i ∩ A j = ∅ = B i ∩ B j for i = j.Furthermore, we define t(A, B, C) by the following equation It follows from Lemma 2 that t(A, B, C) ≥ 0 for A, B, C ∈ S(X), which is a shorthand notation of the triangle inequality (2.6).

Metric based on average distance
Theorem 3. Suppose (X, d) is a non-empty metric space.Let S(X) denote the collection of all non-empty finite subsets of X.For each A and B in S(X), define f (A, B) on S(X) × S(X) to be the function Then f is a metric on S(X).
Proof.The function f can be rewritten, using s in (2.7), as It is non-negative and symmetric.
. This holds if, and only if, The triangle inequality is straightforwardly proved to be f (A, B) + f (B, C) − f (A, C) ≥ 0 by showing that the left-hand terms are transformed into the sum of non-negative terms of s and t in (2.9).Let A ∪ B ∪ C be decomposed into five disjoint partitions: The details are given in Appendix A.
The function f in (3.1) can be rewritten, using g in (2.5), as In S(X), f , for all a, b ∈ X, we have Corollary 5. Suppose (X, d) is a non-empty metric space.Let S(X) denote the collection of all non-empty finite subsets of X.For each A and B in S(X), define e(A, B) on S(X) × S(X) to be the function Then e is a semi-metric on S(X).
Proof.Let e(A, B) be rewritten as In a similar manner to the proof of Theorem 3, it can be proved that the conditions from M1 to M4, except for M5 (triangle inequality), are satisfied.
It is noted that the triangle inequality e(A, B) so that the condition M5 is not generally satisfied.

Extensions
This section discusses future directions for generalization of the average distance based on the power mean and extensions to metrics on collections of infinite sets.4.1.Generalization based on the power mean.The distance function (3.1) can be unified with the Hausdorff metric for finite sets by using the power mean.To simplify expressions, we use the following notation.Let M (i) p (x ∈ A, ψ, w) be an extended weighted-power-mean of ψ(x) such that (4.1) , and its variation using the exponential transform of ψ, where i ∈ {0, 1} indicates one of the two types (4.1) and (4.2), p is an extended real number, ψ is a non-negative function of x ∈ A, and w is a weight such that w(x) ∈ (0, 1] for each x and x∈A w(x) > 0. In addition, let , where 1 A (x) is the indicator function defined by 1 A (x) = 1 for x ∈ A and 1 A (x) = 0 for x / ∈ A. If there exists x ∈ A such that ψ(x) = 0 for p < 0 in (4.1), then we define M (1) p = 0, which is consistent with taking the limit ψ(x) → 0 + , though such a case is undefined in the conventional power mean to avoid division by zero.
The power mean includes various types of means [5], which are parameterized by p.By taking limits also for p = 0, ±∞, we have the following: gives the geometric mean, and neither There are some forms of function composition of M (i) p that include the distance function (3.1) as a special case, for example, as follows: where i, j, k ∈ {0, 1} and S ∁ is the complement of S.
In addition, let w be extended to w ∈ [0, 1] to include zero, though a weight for at least one summand must still be positive.Furthermore, it is assumed that 0 • ∞ = 0, in order to ensure 0 • 0 p = 0 for p < 0 in M (1) p , so that the zero-weight can be used for excluding terms from averaging even if ψ(x) = 0 (i.e., distance zero) in the terms.Then, the function (4.3) can be simply expressed as (4.5) q y ∈ A ∪ B, d(x, y), w(x, y) , by using the weight function defined by where [•] denotes the Iverson bracket, that is a quantity defined to be 1 whenever the statement within the brackets is true, and 0 otherwise.The distance function f in (3.1) and the Hausdorff metric (2.3) are expressed by respectively, where i, j, k ∈ {0, 1}.
Although it is unclear, at present, what conditions on the parameters i, j, k, p, q, and r are necessary for (4.3) and (4.4) to be metrics, these generalized forms are capable of generating various distance functions in fact as follows: Example 6.The exponential types u (0,0) p,q (A, B) and v (0,0,0) r,p,q (A, B) are written as |A| y∈A e qd(x,y) |B| y∈B e qd(x,y) The functions (4.6) and (4.7) are metrics for p > 0 and p < 0, respectively.The proofs of each triangle inequality are outlined in Appendix B. If p = 0, then it is the same situation as Example 4. By taking the limit for p = 0, we can see that both functions are equal to the Jaccard distance (2.4) except for the coefficient λ.
4.2.Hierarchical metric spaces.Suppose that (X, d) is a metric space and S(X) is the collection of all non-empty finite subsets of X.Let k be a non-negative integer, let S k+1 (X) denote the collection of all non-empty finite subsets of S k (X), and let f k be a metric on S k (X), where S 1 (X), S 0 (X), f 1 and f 0 correspond to, respectively, S(X), X, f , and d in Theorem 3.For k > 1, in much the same way, for each A and B in S k (X), the function f k (A, B) can be defined by (4.8) which generates a metric space S k (X), f k based on S k−1 (X), f k−1 .This metric will be useful for constructing hierarchical hyperspaces.

4.3.
Duality.There is a kind of duality between sets and elements with respect to their distance functions.For example, we can define the functions D and d symmetrically as follows: where C(a) = {A | a ∈ A}.The set-to-set distance D in (4.9) is a metric due to the axiom of extensionality, and the element-to-element distance d in (4.10) is a pseudo-metric.
According to Theorem 3, D can be defined by f in (3.1), instead of (4.9), so that we have In this case, D is a pseudo-metric, depending on d, and there exist a condition that satisfy d(a, b) = f ( C(a), C(b)).Furthermore, in the situation of Section 4.2, let (X, d) be a metric space, let S 1 (X) be the collection of all non-empty finite subsets of X, and let C(a) = {A ∈ S 1 (X) | a ∈ A} be an element of S 2 (X).If d is the discrete metric, then we have where F and G may be such a generalized function given in (4.5).It is interesting to consider whether F and G really exist and what features they have.In numerical analysis, D and d that are consistent with each other will be obtained by the iterative computation of (4.11) for all A, B ∈ S 1 (X) and (4.12) for all a, b ∈ X, starting with an initial metric space (X, d), if each converges to a non-trivial function.
4.4.Generalized metrics.The group average distance g(A, B) in (2.5) can be regarded as a generalized metric that satisfies conditions M1 (non-negativity), M4 (symmetry) and M5 (triangle inequality) in Definition 1.In conventional topology, there has been no such generalization by dropping both conditions M2 and M3, which are usually combined together into the single axiom d(a, b) = 0 ⇔ a = b (identity, reflexivity, or coincidence).Although M3 can be dropped for the pseudometric so as to allow d(a, b) = 0 for a = b, the self-distance d(a, a) = 0 (M2) seems to be indispensable in point-set topology where the element is a point having no size.An exception is a partial metric [9] that is defined to satisfy M1, M3, M4 and, instead of M5, the following partial metric triangularity, In computer science or information science, the element of data sets is not merely a simple point.It may have rich contents inside.Some elements may have internal structures which cause non-zero self-distance, and some elements may have different properties each other, even though they are indiscernible from a metric point of view.The concept of distance can be used for measuring not only the difference between objects but also the cost of moving or the energy of transition between states.This is the reason why the generalization toward non-zero self-distance is worth considering.If the triangle inequality holds, it provides an upper and lower bound for them.The group average distance g(A, B) can be such a typical one, and that it is simpler and more natural than the metric f (A, B) in (3.1).
Incidentally, the function g(A, B) is not a partial metric because it does not satisfy (4.13). On the other hand, f(A, B) gives an approach to an instance of partial metrics from a special case of (4.7) in Example 6. By taking the limit as λ → ∞ for p < 0 and multiplying by a positive constant, for non-empty finite sets A and B, we obtain a metric D_ν. This suggests that, for ν ∈ [0, 1/2), D_ν is a partial metric on a collection of non-empty finite sets.
4.5. Extension to infinite sets. If S(X) is the collection of all non-null measurable subsets of (X, d), and d is Lebesgue integrable on each element of S(X), then the group average distance, for A, B ∈ S(X), can be defined by
g(A, B) = (µ(A) µ(B))^{-1} ∫_A ∫_B d(x, y) dµ(y) dµ(x),
where x ∈ A, y ∈ B, and µ is a measure on X; the distance function (3.2) can then be extended accordingly to its integral form (4.15). If d is the discrete metric, then (4.15) is equal to the Steinhaus distance [4].
Example 7. Let (R, d) be a metric space with d(x, y) = |x − y|. For two intervals A and B, the distance function (4.15) can be expressed in closed form; in particular, the group average distance between disjoint intervals is equal to the distance between the centers of A and B. This is consistent with an intuitive notion of the distance between balls in (R, d).
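This can be checked numerically. The sketch below approximates the double integral defining g(A, B) for two disjoint intervals by the midpoint rule and compares it with the distance between their centers (the interval endpoints are arbitrary illustrative values):

```python
def g_interval(a0, a1, b0, b1, n=200):
    """Approximate g(A, B) = (1/(|A||B|)) * double integral of |x - y|
    over A = [a0, a1], B = [b0, b1], using the midpoint rule."""
    hx, hy = (a1 - a0) / n, (b1 - b0) / n
    total = 0.0
    for i in range(n):
        x = a0 + (i + 0.5) * hx
        for j in range(n):
            y = b0 + (j + 0.5) * hy
            total += abs(x - y)
    return total * hx * hy / ((a1 - a0) * (b1 - b0))

# A = [0, 1] and B = [3, 5] are disjoint; centers 0.5 and 4.0, distance 3.5
approx = g_interval(0.0, 1.0, 3.0, 5.0)
```

For disjoint intervals the integrand is linear, so the midpoint rule reproduces the exact value 3.5 up to floating-point error.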
If S(X) is the collection of all non-empty, countably infinite subsets (measure-zero sets) of X, then g(A, B) and f(A, B) should be defined by taking limits in (2.5) and (3.2), provided that both limits have definite values. In order to determine the average distance, we have to define a proper condition under which the sets can be said to be "averageable". The average distance will depend strongly on the accumulation points of A and B, and it will require additional assumptions on the relative strength of those accumulation points. This requirement is closely related to a "relative measure" that is needed to obtain the ratio of the cardinality of an infinite set to that of its superset in (3.2). In conventional measure theory, however, any set of cardinality ℵ_0 is a null set of measure zero, so both the counting measure and the Lebesgue measure are useless for computing this ratio; another measure is necessary. A feasible solution is discussed in the following section.
4.6. Estimation by sampling. In application to computational data analysis, statistical estimation by sampling is very useful for obtaining an approximate value of f(A, B) when the sets are very large. By the law of large numbers, if enough sample elements are selected at random, the average over those samples approximates the average over the whole population. The procedure is as follows: (1) Choose a superset P of A ∪ B as a population such that P ⊇ A ∪ B.
(2) Select a finite subset S of P as a sample obtained by random sampling.
(3) Let S_A = S ∩ A and S_B = S ∩ B. Then compute f(S_A, S_B) as an approximation of f(A, B).
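The three steps above can be sketched as follows. Since the formula for f from (3.1) is not reproduced in this excerpt, the sketch estimates the group average distance g from (2.5) in its place; the population and set choices in the usage below are arbitrary illustrative values.

```python
import random

def g(A, B, d):
    """Group average distance (2.5)."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def estimate_g(A, B, d, P, sample_size, seed=0):
    """Steps (1)-(3) applied to g: sample S from the population P (step 2),
    intersect with A and B (step 3), and evaluate on the sampled subsets."""
    rng = random.Random(seed)
    S = set(rng.sample(sorted(P), sample_size))
    SA, SB = S & A, S & B
    if not SA or not SB:
        raise ValueError("sample missed A or B; increase sample_size")
    return g(SA, SB, d)
```

When the sample is the whole population the estimate is exact, and for smaller samples the law of large numbers makes it approach g(A, B) as the sample grows.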
The sampling process and its randomness are crucial for efficiently obtaining a good approximation. Useful hints can be found in the various sampling techniques developed for Monte Carlo methods [10]. In most cases, the sampling error is expected to decrease as the sample size increases, except in situations where the distribution of d has no mean (e.g., the Cauchy distribution). The notion of sampling also suggests an intuitive way to define a relative measure on a σ-algebra over a set X, which could be called a "sample counting measure". Suppose A and B are subsets of X. Let ρ(A : B) be the ratio of the cardinality of A to the cardinality of B, let P be a superset of A ∪ B, and let S_n be a non-empty finite subset of P such that S_n = ∪_{i=1}^{n} Y_i, where Y_i is the i-th non-empty sample randomly selected from P. Then the ratio ρ(A : B) can be determined by taking the limit as n approaches ∞:
ρ(A : B) = lim_{n→∞} |S_n ∩ A| / |S_n ∩ B|,
if such a limit and a random choice function that performs the random sampling both exist. Otherwise, instead of random sampling, systematic sampling may be available if the elements of X can be supposed to be distributed with uniform density in its measurable metric space. For example, suppose there exists a finite partition of P in which every part has almost equal diameter. It is then preferable for S_n to contain exactly one element from each of the parts.
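A small sketch of this sample counting ratio, estimating ρ(A : B) by the fraction of random draws that land in A versus B (the sets in the usage below are illustrative: the multiples of 2 and of 4 below 100000, so the true ratio is 2):

```python
import random

def rho(A, B, P, n, seed=0):
    """Estimate rho(A : B) as |S_n ∩ A| / |S_n ∩ B| from n random draws of P."""
    rng = random.Random(seed)
    S = [rng.choice(P) for _ in range(n)]
    hits_A = sum(1 for y in S if y in A)
    hits_B = sum(1 for y in S if y in B)
    if hits_B == 0:
        raise ValueError("no sample fell in B; increase n")
    return hits_A / hits_B

r = rho(set(range(0, 100000, 2)), set(range(0, 100000, 4)),
        range(100000), n=20000, seed=2)
```

The estimate fluctuates around 2 and tightens as n grows, which is the limiting behavior the definition requires.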

4.7. Metrics for fuzzy sets and probability distributions. A fuzzy set can be represented by a collection of crisp sets, so that the distance between fuzzy sets can be defined by the distance between the collections of such crisp sets. Let A be a fuzzy set, A = {(x, m_A(x)) | x ∈ X}, where m_A(x) is a membership function, and let A_α be a crisp set called an α-level set [11] such that A_α = {x ∈ X | m_A(x) ≥ α}. Then A can be represented by the following set of ordered pairs: C(A) = {(A_α, α) | α ∈ (0, 1]}. The distance between two fuzzy sets A and B can be defined by f_2(C(A), C(B)), where there may be various ways to treat α. This notion is also applicable to the distance between probability distributions, where probability density functions are used instead of the membership function.
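A minimal sketch of this construction for discrete fuzzy sets: it computes α-level sets and, as one of the "various ways to treat α" (an assumed choice, not the paper's f_2), averages the Jaccard distance of the α-cuts over a uniform grid of α values.

```python
def alpha_cut(m, alpha):
    """alpha-level set A_alpha = {x | m_A(x) >= alpha} of a fuzzy set
    given as a dict mapping x to its membership grade."""
    return {x for x, grade in m.items() if grade >= alpha}

def jaccard(A, B):
    """Jaccard distance between crisp sets; two empty sets are at distance 0."""
    return 1 - len(A & B) / len(A | B) if A | B else 0.0

def fuzzy_distance(mA, mB, levels=100):
    """Average the Jaccard distance of the alpha-cuts over a uniform grid
    of alpha in (0, 1] -- one simple, assumed way to treat alpha."""
    alphas = [(i + 1) / levels for i in range(levels)]
    return sum(jaccard(alpha_cut(mA, a), alpha_cut(mB, a)) for a in alphas) / levels
```

Identical fuzzy sets are at distance 0, and fuzzy sets with disjoint supports at full membership are at distance 1, matching the crisp Jaccard extremes.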

Concluding Remarks
We have found that, for a metric space (X, d), there exists a distance function between non-empty finite subsets of X that is a metric based on the average distance of d. The distance function (3.1) in Theorem 3 is the most typical one; it includes the Jaccard distance as a special case where d is the discrete metric. Its extensions based on the power mean will be useful for developing generalized forms that also include the Hausdorff metric and various other distance functions. Furthermore, the extensions to infinite subsets of X will provide metrics for measuring the dissimilarity of fuzzy sets and of probability distributions.

(2.7) s(A, B) = Σ_{a∈A} Σ_{b∈B} d(a, b), so that g(A, B) = (|A| |B|)^{-1} s(A, B). Since d is a metric, we have s(A, B) ≥ 0, s(A, B) = s(B, A), and s({x}, {x}) = 0 for all x ∈ X. If A = ∅ or B = ∅, then s(A, B) = 0 due to the empty sum. If A and B are countable unions of disjoint sets, A = ∪_i A_i and B = ∪_j B_j, then s can be decomposed as follows: s(A, B) = Σ_i Σ_j s(A_i, B_j).
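A quick numerical check of these properties of s, assuming the natural pairwise decomposition over disjoint parts (the sets used are arbitrary illustrative values):

```python
def s(A, B, d):
    """s(A, B): the sum of d(a, b) over all pairs in A x B, as in (2.7);
    an empty A or B yields the empty sum 0."""
    return sum(d(a, b) for a in A for b in B)

d = lambda x, y: abs(x - y)
A1, A2, B = {0, 1}, {5}, {2, 3}
whole = s(A1 | A2, B, d)             # sum over all pairs of {0, 1, 5} x {2, 3}
parts = s(A1, B, d) + s(A2, B, d)    # decomposition over the disjoint parts
```

Because s is a plain double sum, splitting A into disjoint parts splits the sum accordingly, which is the decomposition stated above.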

Example 4. If d is the discrete metric, where d(x, y) = 0 if x = y and d(x, y) = 1 otherwise, then f(A, B) is equal to the Jaccard distance (2.4).

Corollary 5. Suppose (X, d) is a non-empty metric space. Let S(X) denote the collection of all non-empty finite subsets of X. For each A and B in S(X), define e(A, B) on S(X) × S(X) to be the function
e(A, B) = (1 / (|A| |B|)) ( Σ_{a∈A} Σ_{b∈B} d(a, b) − Σ_{a∈A∩B} Σ_{b∈A∩B} d(a, b) ).

4.1. Generalization based on the power mean. The distance function (3.1) can be unified with the Hausdorff metric for finite sets by using the power mean. To simplify expressions, we use the following notation. Let M_p^{(i)}(x ∈ A, ψ, w) be an extended weighted power mean of ψ(x).
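A small sketch of the weighted power mean that underlies this unification, including its standard limiting cases: p → +∞ and p → −∞ recover max and min (the mechanism by which average-based distances can interpolate toward Hausdorff-style extremes), and p → 0 gives the weighted geometric mean.

```python
import math

def power_mean(values, weights, p):
    """Weighted power mean M_p of positive values.
    p = +inf -> max, p = -inf -> min, p = 0 -> weighted geometric mean."""
    if p == math.inf:
        return max(values)
    if p == -math.inf:
        return min(values)
    W = sum(weights)
    if p == 0:
        return math.exp(sum(w * math.log(v)
                            for v, w in zip(values, weights)) / W)
    return (sum(w * v ** p for v, w in zip(values, weights)) / W) ** (1 / p)
```

For example, over the values 1 and 4 with equal weights, p = 1 gives the arithmetic mean 2.5, p = 0 the geometric mean 2, and p = ±∞ the extremes 4 and 1.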

Definition 1. (Metric) Suppose X is a set and d is a function from X × X into R. Then d is called a metric on X if it satisfies the following conditions, for all a, b, c ∈ X:
M1: d(a, b) ≥ 0 (non-negativity);
M2: d(a, a) = 0;
M3: d(a, b) = 0 ⇒ a = b;
M4: d(a, b) = d(b, a) (symmetry);
M5: d(a, b) + d(b, c) ≥ d(a, c) (triangle inequality).
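On a finite set, conditions M1 through M5 can be verified exhaustively, which is convenient for checking candidate distance functions such as those constructed in this paper; a minimal brute-force checker:

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check M1-M5 of Definition 1 exhaustively on a finite set of points."""
    for a, b, c in product(points, repeat=3):
        if d(a, b) < -tol:                       # M1: non-negativity
            return False
        if a == b and abs(d(a, b)) > tol:        # M2: zero self-distance
            return False
        if a != b and abs(d(a, b)) <= tol:       # M3: identity of indiscernibles
            return False
        if abs(d(a, b) - d(b, a)) > tol:         # M4: symmetry
            return False
        if d(a, b) + d(b, c) < d(a, c) - tol:    # M5: triangle inequality
            return False
    return True
```

For instance, |x − y| passes on {0, 1, 2}, while the squared difference (x − y)^2 fails M5 there, since d(0, 2) = 4 exceeds d(0, 1) + d(1, 2) = 2.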


f_2(C(a), C(b)) = κ d(a, b), where κ is a certain positive real number such that κ < 1. If there exists d such that d(a, b) = f_2(C(a), C(b)), then an isometric copy of (X, d) is contained in S_2(X), and f_2 and d can be regarded as functions of D. In general, there may exist D and d that are formally expressed by
D(A, B) = F({d(a, b) | a ∈ A, b ∈ B}), (4.11)
d(a, b) = G({D(A, B) | a ∈ A, b ∈ B}), (4.12)