Asymptotic properties of Lee distance

A Correction to this article was published on 20 February 2021

This article has been updated

Abstract

Distances on permutations are often convenient tools for analyzing and modeling rank data. They measure the closeness between two rankings and can be very useful and informative for revealing the main structure and features of the data. In this paper, some statistical properties of the Lee distance are studied. Asymptotic results for the random variable induced by Lee distance are derived and used to compare the Distance-based probability model and the Marginals model for complete rankings. Three rank datasets are analyzed as an illustration of the presented models.

Acknowledgements

The work of the first author was supported by the Support Program of Bulgarian Academy of Sciences for Young Researchers under Grant 17-95/2017. The work of the second author was supported by the National Science Fund of Bulgaria under Grant DH02-13.

Author information

Corresponding author

Correspondence to Nikolay I. Nikolov.

Appendix

To prove Theorem 3, consider the random variables \(D_{N,k}=d_{L}\left( \pi ,e_{N}\right) \), where \(k=1,2,\ldots ,N\) and \(\pi \) is selected uniformly at random from \({\mathbf {S}}_{N,k}=\left\{ \sigma \in {\mathbf {S}}_{N}: \sigma (N)=k\right\} \), i.e. \(\pi \sim Uniform({\mathbf {S}}_{N,k})\). Then, for fixed \(k\),

$$\begin{aligned} D_{N,k}(\pi )=\sum \limits _{i=1}^{N}c_{N}(\pi (i),i)=\sum \limits _{i=1}^{N-1}c_{N}(\pi (i),i) + c_{N}(k,N)=\sum \limits _{i=1}^{N-1}\tilde{c}_{N}(\sigma (i),i) + c_{N}(k,N), \end{aligned}$$

where \(\sigma \in {\mathbf {S}}_{N-1}\) and for \(i,j=1,2,\ldots ,N-1\),

$$\begin{aligned} \sigma (i)= {\left\{ \begin{array}{ll} \pi (i), &{} \text{ if } \pi (i)<k\\ \pi (i)-1, &{} \text{ if } \pi (i)>k, \end{array}\right. } \qquad \tilde{c}_{N}(j,i)= {\left\{ \begin{array}{ll} c_{N}(j,i), &{} \text{ if } j<k \\ c_{N}(j+1,i), &{} \text{ if } j\ge k. \end{array}\right. } \end{aligned} \tag{21}$$
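
The reduction from \(\pi \in {\mathbf {S}}_{N,k}\) to \(\sigma \in {\mathbf {S}}_{N-1}\) in (21) can be checked by brute force for small N. The Python sketch below is illustrative only. It assumes the standard Lee cost \(c_{N}(i,j)=\min (|i-j|,N-|i-j|)\); the names `lee_cost` and `check_decomposition` are hypothetical helpers introduced here, not part of the paper.

```python
from itertools import permutations


def lee_cost(i, j, n):
    """Assumed Lee cost c_N(i, j) = min(|i - j|, N - |i - j|), labels 1..N."""
    d = abs(i - j)
    return min(d, n - d)


def check_decomposition(n, k):
    """Check D_{N,k}(pi) = sum_i c~_N(sigma(i), i) + c_N(k, N) for all pi with pi(N) = k."""
    others = [v for v in range(1, n + 1) if v != k]
    for head in permutations(others):
        pi = list(head) + [k]  # enforce pi(N) = k
        lhs = sum(lee_cost(pi[i - 1], i, n) for i in range(1, n + 1))
        # sigma as in (21): values above k are shifted down by one
        sigma = [v if v < k else v - 1 for v in head]
        # c~_N(j, i) = c_N(j, i) if j < k, and c_N(j + 1, i) if j >= k
        rhs = sum(
            lee_cost(s if s < k else s + 1, i, n)
            for i, s in enumerate(sigma, start=1)
        ) + lee_cost(k, n, n)
        # sigma must be a permutation of 1..N-1 and both sides must agree
        if lhs != rhs or sorted(sigma) != list(range(1, n)):
            return False
    return True
```

Calling `check_decomposition(n, k)` for every k with n up to 6 or so enumerates at most 120 permutations per call, so the check is cheap.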

Lemma 1

Let \( \tilde{D}_{N-1}\left( \sigma \right) =\sum \nolimits _{i=1}^{N-1}\tilde{c}_{N}(\sigma (i),i)\), where \(\sigma \sim Uniform({\mathbf {S}}_{N-1})\) and \(\tilde{c}_{N}(\cdot ,\cdot )\) is as in (21). Then the distribution of \(\tilde{D}_{N-1}\) is asymptotically normal and the mean and variance of \(\tilde{D}_{N-1}\) are

$$\begin{aligned} {\mathbf {E}}\left( \tilde{D}_{N-1}\right)&= \displaystyle \frac{c_{N}(k,N)}{N-1}+\frac{N-2}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] , \\ {\mathbf {Var}} \left( \tilde{D}_{N-1}\right)&= \displaystyle \frac{ \displaystyle N^{2} \left( c_{N}\left( k,N\right) \right) ^{2}- 2N\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] c_{N}\left( k,N\right) }{\left( N-2\right) \left( N-1\right) ^{2}} + \beta _{N-1}, \end{aligned}$$

where

$$\begin{aligned} \beta _{N-1} = {\left\{ \begin{array}{ll} \displaystyle \frac{N^{2}\left( N^{3}-2N^{2}+10N-12\right) }{48(N-1)^{2}}, &{}\quad \text{ for } N \text{ even } \\ \displaystyle \frac{\left( N+1\right) \left( N^{3}-3N^{2}+6N-6\right) }{48(N-2)}, &{}\quad \text{ for } N \text{ odd. } \end{array}\right. } \end{aligned} \tag{22}$$
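
As a sanity check on the closed forms in Lemma 1, the mean and variance of \(\tilde{D}_{N-1}\) can be compared with exact values obtained by enumerating \({\mathbf {S}}_{N,k}\) for small N. The sketch below is a hedged illustration: it assumes the Lee cost \(c_{N}(i,j)=\min (|i-j|,N-|i-j|)\), uses exact rational arithmetic, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def lee_cost(i, j, n):
    """Assumed Lee cost c_N(i, j) = min(|i - j|, N - |i - j|)."""
    d = abs(i - j)
    return min(d, n - d)


def exact_moments(n, k):
    """Exact mean and variance of D~_{N-1}, enumerating all pi with pi(N) = k."""
    others = [v for v in range(1, n + 1) if v != k]
    values = [
        sum(lee_cost(v, i, n) for i, v in enumerate(head, start=1))
        for head in permutations(others)
    ]
    mean = Fraction(sum(values), len(values))
    var = Fraction(sum(v * v for v in values), len(values)) - mean * mean
    return mean, var


def lemma1_moments(n, k):
    """Mean and variance of D~_{N-1} from the closed forms in Lemma 1 and (22)."""
    c = lee_cost(k, n, n)
    r = ((n + 1) // 2) * (n // 2)  # [(N+1)/2][N/2] with integer parts
    mean = Fraction(c, n - 1) + Fraction(n - 2, n - 1) * r
    if n % 2 == 0:
        beta = Fraction(n**2 * (n**3 - 2 * n**2 + 10 * n - 12), 48 * (n - 1) ** 2)
    else:
        beta = Fraction((n + 1) * (n**3 - 3 * n**2 + 6 * n - 6), 48 * (n - 2))
    var = Fraction(n**2 * c**2 - 2 * n * r * c, (n - 2) * (n - 1) ** 2) + beta
    return mean, var
```

For example, `exact_moments(4, 1)` and `lemma1_moments(4, 1)` both give mean 3 and variance 4/3. The closed forms require N of at least 3 because of the factor N-2 in the denominator.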

Proof

From (6) of Theorem 1 and formulas (21) and (10), it follows that

$$\begin{aligned}&{\mathbf {E}}\left( \tilde{D}_{N-1}\right) {\mathop {=}\limits ^{(6)}}\frac{1}{N-1} \sum _{i=1}^{N-1}\sum _{j=1}^{N-1}\tilde{c}_{N}(i,j) {\mathop {=}\limits ^{(21)}}\frac{1}{N-1}\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}\sum _{j=1}^{N-1}c_{N}(i,j)\\&\quad =\frac{1}{N-1}\sum _{i=1}^{N}\sum _{j=1}^{N}c_{N}(i,j)-\frac{1}{N-1} \sum _{i=1}^{N}c_{N}(i,N)-\frac{1}{N-1}\sum _{j=1}^{N}c_{N}(k,j)+\frac{c_{N}(k,N)}{N-1}\\&\quad {\mathop {=}\limits ^{(10)}}\frac{N}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] -\frac{1}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] -\frac{1}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] +\frac{c_{N}(k,N)}{N-1}\\&\quad =\frac{c_{N}(k,N)}{N-1}+\frac{N-2}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] . \end{aligned}$$

Using (7) of Theorem 1,

$$\begin{aligned} {\mathbf {Var}} \left( \tilde{D}_{N-1}\right)&= \frac{1}{N-2}\sum _{i=1}^{N-1}\sum _{j=1}^{N-1}\tilde{b}_{N}^{2}(i,j)=\frac{1}{N-2}\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}\sum _{j=1}^{N-1}b_{N}^{2}(i,j), \quad \text{ where } \\ b_{N}(i,j)&= c_{N}(i,j)- \sum _{\begin{array}{c} g=1 \\ g\ne k \end{array}}^{N}\frac{c_{N}(g,j)}{N-1}-\sum _{h=1}^{N-1}\frac{c_{N}(i,h)}{N-1}+\frac{1}{\left( N-1\right) ^{2}} \sum _{\begin{array}{c} g=1 \\ g\ne k \end{array}}^{N}\sum _{h=1}^{N-1}c_{N}(g,h), \end{aligned} \tag{23}$$

for \(i,j=1,2,\ldots ,N\). Simplifying expression (23) gives

$$\begin{aligned} b_{N}(i,j)=c_{N}(i,j)+\frac{c_{N}(i,N)+c_{N}(k,j)}{N-1}+\frac{c_{N}(k,N)}{\left( N-1\right) ^{2}}-\frac{N}{\left( N-1\right) ^{2}}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] . \end{aligned} \tag{24}$$
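
The simplification from (23) to (24) rests on the fact that every row and column of the Lee cost matrix sums to \(\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] \). A small Python sketch, assuming \(c_{N}(i,j)=\min (|i-j|,N-|i-j|)\) and using hypothetical function names, evaluates both expressions exactly so that they can be compared entry by entry.

```python
from fractions import Fraction


def lee_cost(i, j, n):
    """Assumed Lee cost c_N(i, j) = min(|i - j|, N - |i - j|)."""
    d = abs(i - j)
    return min(d, n - d)


def b_from_23(i, j, n, k):
    """b_N(i, j) as the doubly centred cost in (23): rows g != k, columns 1..N-1."""
    rows = [g for g in range(1, n + 1) if g != k]
    cols = range(1, n)
    col_part = Fraction(sum(lee_cost(g, j, n) for g in rows), n - 1)
    row_part = Fraction(sum(lee_cost(i, h, n) for h in cols), n - 1)
    grand = Fraction(
        sum(lee_cost(g, h, n) for g in rows for h in cols), (n - 1) ** 2
    )
    return lee_cost(i, j, n) - col_part - row_part + grand


def b_from_24(i, j, n, k):
    """b_N(i, j) from the simplified expression (24)."""
    r = ((n + 1) // 2) * (n // 2)  # common row/column sum of the Lee cost matrix
    return (
        lee_cost(i, j, n)
        + Fraction(lee_cost(i, n, n) + lee_cost(k, j, n), n - 1)
        + Fraction(lee_cost(k, n, n), (n - 1) ** 2)
        - Fraction(n * r, (n - 1) ** 2)
    )
```

Both functions return `Fraction` values, so equality can be tested exactly rather than up to rounding.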

When N is even, the variance of \(\tilde{D}_{N-1}\) can be calculated by

$$\begin{aligned} {\mathbf {Var}} \left( \tilde{D}_{N-1}\right)&=\frac{1}{N-2}\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}\left\{ \sum _{j=1}^{k-\frac{N}{2}}b_{N}^{2}(i,j)+\sum _{j=k-\frac{N}{2}+1}^{\frac{N}{2}}b_{N}^{2}(i,j)+\sum _{j=\frac{N}{2}+1}^{k}b_{N}^{2}(i,j)\right. \\&\quad \left. +\sum _{j=k+1}^{N-1}b_{N}^{2}(i,j)\right\} =\frac{1}{N-2}\left( Q_{1}+Q_{2}+Q_{3}+Q_{4}\right) , \end{aligned}$$

where a sum \(\sum _{j=l_{1}}^{l_{2}}\) is taken to be 0 if \(l_{1}>l_{2}\). Since the computations for \(Q_{1}\), \(Q_{2}\), \(Q_{3}\) and \(Q_{4}\) are similar, only the steps for \(Q_{1}\) are presented here.

$$\begin{aligned} Q_{1}&=\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}\sum _{j=1}^{k-\frac{N}{2}}b_{N}^{2}(i,j)=\sum _{j=1}^{k-\frac{N}{2}}\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\left\{ \sum _{i=1}^{j-1}b_{N}^{2}(i,j)+ \sum _{i=j}^{\frac{N}{2}}b_{N}^{2}(i,j)\right. \\&\quad \left. +\sum _{i=\frac{N}{2}+1}^{\frac{N}{2}+j-1}b_{N}^{2}(i,j)+\sum _{i=\frac{N}{2}+j}^{N}b_{N}^{2}(i,j)-b_{N}^{2}(k,j)\right\} =Q_{1}^{(1)}+Q_{1}^{(2)}+Q_{1}^{(3)}+Q_{1}^{(4)}-Q_{1}^{(5)}, \end{aligned}$$

where

$$\begin{aligned} Q_{1}^{(1)}&= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=1}^{j-1}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=1}^{j-1} \left( j-i+\frac{i+(N-k+j)}{N-1}+B_{N}(k)\right) ^{2}, \\ Q_{1}^{(2)}&= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=j}^{\frac{N}{2}}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=j}^{\frac{N}{2}} \left( i-j+\frac{i+(N-k+j)}{N-1}+B_{N}(k)\right) ^{2}, \\ Q_{1}^{(3)}&= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=\frac{N}{2}+1}^{\frac{N}{2}+j-1}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=\frac{N}{2}+1}^{\frac{N}{2}+j-1} \left( i-j+\frac{N-i+(N-k+j)}{N-1}+B_{N}(k) \right) ^{2}, \\ Q_{1}^{(4)}&= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=\frac{N}{2}+j}^{N}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=\frac{N}{2}+j}^{N} \left( N-i+j+\frac{N-i+(N-k+j)}{N-1}+B_{N}(k)\right) ^{2}, \\ Q_{1}^{(5)}&= \sum _{j=1}^{k-\frac{N}{2}}b_{N}^{2}(k,j)= \sum _{j=1}^{k-\frac{N}{2}} \left( N-k+j+\frac{N-k+(N-k+j)}{N-1}+B_{N}(k)\right) ^{2}, \end{aligned}$$

for \( B_{N}(k)=\frac{c_{N}(k,N)}{\left( N-1\right) ^{2}}-\frac{N}{\left( N-1\right) ^{2}}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] =\frac{4(N-k)-N^{3}}{4\left( N-1\right) ^{2}}\) and \(\sum _{i=l_{1}}^{l_{2}}=0\), if \(l_{1}>l_{2}\). The calculation of \(Q_{1}\) is completed by repeatedly using the formula

$$\begin{aligned} \sum _{i=1}^{n}\left( i-a\right) ^{2}=na^{2}+\frac{n(n+1)(2n+1-6a)}{6} \end{aligned} \tag{25}$$

for appropriate values of a and n.
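
Formula (25) follows from expanding the square and using the standard sums of \(i\) and \(i^{2}\). The short Python sketch below, with a hypothetical function name, evaluates the closed form in exact arithmetic so it can be compared with the direct sum for any rational shift a.

```python
from fractions import Fraction


def sum_of_shifted_squares(n, a):
    """Closed form (25) for sum_{i=1}^{n} (i - a)^2, with a any rational number."""
    a = Fraction(a)
    return n * a * a + Fraction(n * (n + 1), 6) * (2 * n + 1 - 6 * a)


def direct_sum(n, a):
    """Term-by-term evaluation of sum_{i=1}^{n} (i - a)^2 for comparison."""
    a = Fraction(a)
    return sum((i - a) ** 2 for i in range(1, n + 1))
```

For instance, with n = 3 and a = 1 both functions give 0 + 1 + 4 = 5.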

The quantities \(Q_{2}\), \(Q_{3}\) and \(Q_{4}\) can be decomposed and calculated in a similar fashion as shown for \(Q_{1}\). The final result for the variance of \(\tilde{D}_{N-1}\), when N is even, is

$$\begin{aligned} {\mathbf {Var}} \left( \tilde{D}_{N-1}\right) = \displaystyle \frac{ \displaystyle 2N^{2} \left( c_{N}\left( k,N\right) \right) ^{2}- N^{3}c_{N}\left( k,N\right) }{2\left( N-2\right) \left( N-1\right) ^{2}} + \frac{N^{2}\left( N^{3}-2N^{2}+10N-12\right) }{48(N-1)^{2}}. \end{aligned}$$

When N is odd, the variance \({\mathbf {Var}} \left( \tilde{D}_{N-1}\right) \) can be obtained in the same way as for even N, by decomposing it into four double sums and applying formula (25).

From (24) and (2), it follows that

$$\begin{aligned} \displaystyle \max _{1 \le i,j \le N}b_{N}^{2}(i,j) \le \left( \left[ \frac{N}{2}\right] +\frac{\displaystyle \left[ \frac{N}{2}\right] +\left[ \frac{N}{2}\right] }{N-1}+\frac{\displaystyle \left[ \frac{N}{2}\right] }{\left( N-1\right) ^{2}}-\frac{N}{\left( N-1\right) ^{2}}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] \right) ^{2}. \end{aligned}$$

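By using (22) and Lemma 1, and noting that \(0\le c_{N}(k,N)\le \left[ \frac{N}{2}\right] \) makes the first term of \({\mathbf {Var}} \left( \tilde{D}_{N-1}\right) \) nonpositive and at most of order \(N\) in absolute value,

$$\begin{aligned} \displaystyle \frac{1}{N-1}\sum _{i=1}^{N-1}\sum _{j=1}^{N-1}\tilde{b}_{N}^{2}(i,j)=\frac{N-2}{N-1} {\mathbf {Var}} \left( \tilde{D}_{N-1}\right) = N^{3}\left( \frac{1}{48}+O\left( \frac{1}{N}\right) \right) , \end{aligned}$$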

where \(\lim _{N \rightarrow \infty }O\left( \frac{1}{N}\right) =0\). Therefore,

$$\begin{aligned} \lim _{N \rightarrow \infty } \frac{ \max _{1 \le i,j \le N-1}\tilde{b}_{N}^{2}(i,j)}{ \frac{1}{N-1}\sum _{i=1}^{N-1}\sum _{j=1}^{N-1}\tilde{b}_{N}^{2}(i,j)}\le \lim _{N \rightarrow \infty } \frac{N^{2}\left( \frac{1}{16}+O\left( \frac{1}{N}\right) \right) }{N^{3}\left( \frac{1}{48}+O\left( \frac{1}{N}\right) \right) }=0, \end{aligned}$$

i.e. the condition (8) of Theorem 1 is fulfilled and the distribution of \(\tilde{D}_{N-1}\) is asymptotically normal. \(\square \)
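
The ratio appearing in the limit above can also be evaluated numerically to see how it behaves as N grows. The sketch below is an illustration under stated assumptions: it takes \(c_{N}(i,j)=\min (|i-j|,N-|i-j|)\), builds the \((N-1)\times (N-1)\) matrix \(\tilde{c}_{N}\) by deleting row k and column N, and computes the doubly centred entries \(\tilde{b}_{N}\) directly. The name `hoeffding_ratio` is hypothetical.

```python
def lee_cost(i, j, n):
    """Assumed Lee cost c_N(i, j) = min(|i - j|, N - |i - j|)."""
    d = abs(i - j)
    return min(d, n - d)


def hoeffding_ratio(n, k):
    """max b~^2 divided by (1/(N-1)) * sum of b~^2 for the reduced cost matrix."""
    m = n - 1
    rows = [g for g in range(1, n + 1) if g != k]
    # reduced matrix c~_N: row k and column N removed
    c = [[lee_cost(g, h, n) for h in range(1, n)] for g in rows]
    row_mean = [sum(row) / m for row in c]
    col_mean = [sum(c[a][b] for a in range(m)) / m for b in range(m)]
    grand = sum(row_mean) / m
    squares = [
        (c[a][b] - row_mean[a] - col_mean[b] + grand) ** 2
        for a in range(m)
        for b in range(m)
    ]
    return max(squares) / (sum(squares) / m)


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, hoeffding_ratio(n, n // 2))
```

The bounds in the proof suggest that the ratio decays roughly like 3/N, so the printed values should shrink as N increases.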

Proof of Theorem 3

From (14), (19) and (15), it follows that

$$\begin{aligned} m_{ij}(\theta ,N)=\sum _{\pi (i)=j} \exp \left( \theta d(\pi ,e_{N})-\psi _{N}(\theta )\right) =\frac{(N-1)!\tilde{g}_{N-1}(\theta )}{N!g_{N}(\theta )}=\frac{1}{N}\frac{\tilde{g}_{N-1}(\theta )}{g_{N}(\theta )}, \end{aligned}$$

where \(g_{N}(\cdot )\) and \(\tilde{g}_{N-1}(\cdot )\) are the moment generating functions of \(D_{L}(\pi )\) and \(D_{i,j}(\sigma )\), for \(\pi \sim Uniform({\mathbf {S}}_{N})\) and \(\sigma \sim Uniform({\mathbf {S}}_{i,j})\). Since \(D_{i,j}\) depends on i and j only through \(c_{N}(i,j)\), the random variables \(D_{i,j}\) and \(D_{N,k}\) are identically distributed for \({k=N-c_{N}(i,j)}\). From Theorem 2 and Lemma 1, \(g_{N}(\cdot )\) and \(\tilde{g}_{N-1}(\cdot )\) can be approximated, so

$$\begin{aligned} m_{ij}(\theta ,N) \frac{N}{ \exp \left( \theta \mu + \displaystyle \frac{\theta ^{2}\nu ^{2}}{2}\right) } \xrightarrow [N \rightarrow \infty ]{} 1, \end{aligned}$$

where \(\mu ={\mathbf {E}}\left( D_{i,j}\right) -{\mathbf {E}}(D_{L})\) and \(\nu ^{2}={\mathbf {Var}}\left( D_{i,j}\right) -{\mathbf {Var}}(D_{L})\).

According to Lemma 1,

$$\begin{aligned} {\mathbf {E}}\left( D_{i,j}\right)&= \displaystyle \frac{c_{N}(i,j)}{N-1}+\frac{N-2}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] +c_{N}(i,j),\\ {\mathbf {Var}} \left( D_{i,j}\right)&= \displaystyle \frac{ \displaystyle N^{2} \left( c_{N}\left( i,j\right) \right) ^{2}- 2N\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] c_{N}\left( i,j\right) }{\left( N-2\right) \left( N-1\right) ^{2}} + \beta _{N-1}. \end{aligned}$$

The values of \(\mu \) and \(\nu ^{2}\) are obtained by combining the results above with formulas (10) and (12). \(\square \)
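
For small N the marginals \(m_{ij}(\theta ,N)\) can be computed exactly by enumeration and set beside the approximation \(\frac{1}{N}\exp \left( \theta \mu +\theta ^{2}\nu ^{2}/2\right) \). The sketch below is illustrative and makes several assumptions: the Lee cost is \(c_{N}(i,j)=\min (|i-j|,N-|i-j|)\), the model assigns probability proportional to \(\exp (\theta d_{L}(\pi ,e_{N}))\), and \(\mu \) and \(\nu ^{2}\) are obtained by enumeration rather than from formulas (10) and (12). All function names are hypothetical.

```python
import math
from itertools import permutations
from statistics import mean, pvariance


def lee_cost(i, j, n):
    """Assumed Lee cost c_N(i, j) = min(|i - j|, N - |i - j|)."""
    d = abs(i - j)
    return min(d, n - d)


def lee_distance(pi, n):
    """d_L(pi, e_N) for pi given as a tuple of images of 1..N."""
    return sum(lee_cost(v, i, n) for i, v in enumerate(pi, start=1))


def marginals(n, theta):
    """Exact matrix m[i-1][j-1] = P(pi(i) = j) under weights exp(theta * d_L)."""
    m = [[0.0] * n for _ in range(n)]
    z = 0.0  # normalising constant, i.e. exp(psi_N(theta))
    for pi in permutations(range(1, n + 1)):
        w = math.exp(theta * lee_distance(pi, n))
        z += w
        for i, v in enumerate(pi, start=1):
            m[i - 1][v - 1] += w
    return [[x / z for x in row] for row in m]


def normal_approximation(n, i, j, theta):
    """(1/N) exp(theta*mu + theta^2*nu^2/2) with mu, nu^2 found by enumeration."""
    all_d, cond_d = [], []
    for pi in permutations(range(1, n + 1)):
        d = lee_distance(pi, n)
        all_d.append(d)
        if pi[i - 1] == j:  # restrict to permutations with pi(i) = j
            cond_d.append(d)
    mu = mean(cond_d) - mean(all_d)
    nu2 = pvariance(cond_d) - pvariance(all_d)
    return math.exp(theta * mu + theta * theta * nu2 / 2) / n
```

Since the result is asymptotic, agreement between `marginals(n, theta)[i - 1][j - 1]` and `normal_approximation(n, i, j, theta)` should only be expected to improve as N grows; enumeration itself is practical only for N up to about 8.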

About this article

Cite this article

Nikolov, N.I., Stoimenova, E. Asymptotic properties of Lee distance. Metrika 82, 385–408 (2019). https://doi.org/10.1007/s00184-018-0687-7

Keywords

  • Lee distance
  • Rank data
  • Distance-based models
  • Marginals model
  • Asymptotic distribution