Approximate sorting of data streams with limited storage

Abstract

We consider the problem of approximately sorting a data stream in one pass with limited internal storage, where the goal is not to rearrange the data but to output a permutation that reflects the ordering of the elements of the data stream as closely as possible. Our main objective is to study the relationship between the quality of the sorting and the amount of available storage. To measure quality, we use permutation distortion metrics, namely the Kendall tau, Chebyshev, and weighted Kendall metrics, as well as mutual information, between the output permutation and the true ordering of the data elements. We provide bounds on the performance of algorithms with limited storage and present a simple algorithm whose storage requirement is asymptotically within a constant factor of that of an optimal algorithm in terms of mutual information and average Kendall tau distortion. We also study the case in which only information about the most recent elements of the stream is available. This setting has applications to learning user preference rankings in services such as Netflix, where items are presented to the user one at a time.

References

  • Apostol TM (1976) Introduction to analytic number theory. Springer, New York

  • Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of 21st ACM symposium on principles of database systems (PODS), New York

  • Carterette B (2009) On rank correlation and the distance between rankings. In: Proceedings of 32nd international SIGIR conference on research and development in information retrieval, ACM Press, New York, pp 436–443

  • Chakrabarti A, Jayram TS, Pǎtraşcu M (2008) Tight lower bounds for selection in randomly ordered streams. In: ACM-SIAM symposium on discrete algorithms (SODA), Society for Industrial and Applied Mathematics, Philadelphia, pp 720–729

  • Chen CP, Qi F (2008) The best lower and upper bounds of harmonic sequence. Glob J Appl Math Math Sci 1(1):41–49

  • Corless RM, Gonnet GH, Hare DEG, Jeffrey DJ, Knuth DE (1996) On the Lambert W function. Adv Comput Math 5(1):329–359. doi:10.1007/BF02124750

  • Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York

  • Diaconis P (1988) Group representations in probability and statistics, vol 11. Institute of Mathematical Statistics, Hayward

  • Farnoud F, Milenkovic O (2013) Aggregating rankings with positional constraints. In: Proceedings of IEEE information theory workshop (ITW), Seville

  • Farnoud F, Schwartz M, Bruck J (2014a) Rate-distortion for ranking with incomplete information. arXiv preprint: http://arxiv.org/abs/1401.3093

  • Farnoud F, Schwartz M, Bruck J (2014b) Bounds for permutation rate-distortion. In: Proceedings of IEEE international symposium on information theory (ISIT), Honolulu

  • Greenwald M, Khanna S (2001) Space-efficient online computation of quantile summaries. In: Proceedings of ACM SIGMOD international conference on management of data, ACM, New York, pp 58–66. doi:10.1145/375663.375670

  • Hassanzadeh F (2013) Distances on rankings: from social choice to flash memories. Ph.D. thesis, University of Illinois at Urbana–Champaign. http://hdl.handle.net/2142/44268

  • Holst L (1980) On the lengths of the pieces of a stick broken at random. J Appl Probab 17(3):623–634

  • Kemeny JG (1959) Mathematics without numbers. Daedalus 88(4):577–591

  • Kumar R, Vassilvitskii S (2010) Generalized distances between rankings. In: Proceedings of 19th international world wide web conference, Raleigh, pp 571–580

  • Manku GS, Rajagopalan S, Lindsay BG (1998) Approximate medians and other quantiles in one pass and with limited memory. In: Proceedings of ACM SIGMOD international conference on management of data, ACM, New York, pp 426–435. doi:10.1145/276304.276342

  • McGregor A, Valiant P (2012) The shifting sands algorithm. In: ACM-SIAM symposium on discrete algorithms (SODA), SIAM, pp 453–458. http://www.dl.acm.org/citation.cfm?id=2095116.2095155

  • Munro J, Paterson M (1980) Selection and sorting with limited storage. Theor Comput Sci 12(3):315–323. http://www.sciencedirect.com/science/article/pii/0304397580900614

  • Sedgewick R, Wayne K (2011) Algorithms, 4th edn. Addison-Wesley Professional, Reading

  • Shieh GS (1998) A weighted Kendall’s tau statistic. Stat Probab Lett 39(1):17–24

  • Yilmaz E, Aslam JA, Robertson S (2008) A new rank correlation coefficient for information retrieval. In: Proceedings of 31st annual international SIGIR conference on research and development in information retrieval, ACM, New York, pp 587–594

Acknowledgments

The authors would like to thank Ryan Gabrys and Yue Li for useful discussions and comments. Furthermore, the authors thank anonymous reviewers whose comments greatly improved this paper.

Author information

Corresponding author

Correspondence to Farzad Farnoud.

Appendix

Proof of (10)

Lemma 6

We have

$$\begin{aligned} \sum _{k=1}^{n-m+1}\left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) k\ln k\le \left( {\begin{array}{c}n\\ m\end{array}}\right) \left( H_{n}-H_{m}+1-\frac{m}{n}\right) . \end{aligned}$$

Proof

To prove the upper bound on \(\sum _{k=1}^{n-m+1}\left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) k\ln k\), we use Abel’s identity (Apostol 1976, Theorem 4.2), which states that for an arithmetic function \(a\), a real number \(x\), and a function \(f\) with a continuous derivative on \(\left[ 1,x\right] \), we have

$$\begin{aligned} \sum _{1\le k\le x}a\left( k\right) f\left( k\right) =A\left( x\right) f\left( x\right) -\int _{1}^{x}A\left( y\right) f'\left( y\right) dy, \end{aligned}$$

where \(A\left( y\right) =\sum _{1\le k\le y}a\left( k\right) \) for \(y\in {\mathbb {R}}\). To use Abel’s identity, we let \(a\left( k\right) =k\left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) \) and \(f\left( k\right) =\ln k\). Hence,

$$\begin{aligned} A\left( y\right)&=\sum _{1\le k\le y}k\left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) \\&=\left( {\begin{array}{c}n\\ m\end{array}}\right) -\frac{(\left\lfloor y\right\rfloor (m-1)+n)}{m}\left( {\begin{array}{c}n-\left\lfloor y\right\rfloor -1\\ m-1\end{array}}\right) , \end{aligned}$$

for \(y\ge 1\) and \(A\left( y\right) =0\) for \(y<1\). By Abel’s identity, we have

$$\begin{aligned} \sum _{k=1}^{n-m+1}\left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) k\ln k=A(n-m+1)f(n-m+1)-\int _{y=1}^{n-m+1}A(y)f'(y)dy. \end{aligned}$$

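The first term on the right side equals \(\left( {\begin{array}{c}n\\ m\end{array}}\right) \ln \left( n-m+1\right) \), since \(\left( {\begin{array}{c}m-2\\ m-1\end{array}}\right) =0\) implies \(A(n-m+1)=\left( {\begin{array}{c}n\\ m\end{array}}\right) \). For the second term, we find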

$$\begin{aligned}&\int _{y=1}^{n-m+1}A(y)f'(y)dy\\&\quad =\int _{y=1}^{n-m+1}\left( \left( {\begin{array}{c}n\\ m\end{array}}\right) -\frac{(\left\lfloor y\right\rfloor (m-1)+n)}{m}\left( {\begin{array}{c}n-\left\lfloor y\right\rfloor -1\\ m-1\end{array}}\right) \right) \frac{1}{y}dy\\&\quad =\left( {\begin{array}{c}n\\ m\end{array}}\right) \ln \left( n-m+1\right) -\int _{y=1}^{n-m+1} \frac{(\left\lfloor y\right\rfloor (m-1)+n)}{m}\left( {\begin{array}{c}n-\left\lfloor y\right\rfloor -1\\ m-1\end{array}}\right) \frac{1}{y}dy. \end{aligned}$$

Hence,

$$\begin{aligned} \sum _{k=1}^{n-m+1}\left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) k\ln k=\int _{y=1}^{n-m+1}\frac{(\left\lfloor y\right\rfloor (m-1)+n)}{m}\left( {\begin{array}{c}n-\left\lfloor y\right\rfloor -1\\ m-1\end{array}}\right) \frac{1}{y}dy. \end{aligned}$$
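The closed form for \(A(y)\) used above can be checked numerically. The following is a minimal sketch, assuming \(2\le m\le n\); the function names `partial_sum` and `closed_form` are hypothetical and chosen only for illustration. It compares direct summation of \(a(k)\) with the stated closed form, using exact rational arithmetic.

```python
from fractions import Fraction
from math import comb, floor


def partial_sum(n, m, y):
    """A(y) by direct summation of a(k) = k * C(n-k-1, m-2) over 1 <= k <= y."""
    return sum(k * comb(n - k - 1, m - 2) for k in range(1, floor(y) + 1))


def closed_form(n, m, y):
    """Closed form for A(y) quoted in the text, for 1 <= y < n-m+2."""
    fy = floor(y)
    return comb(n, m) - Fraction(fy * (m - 1) + n, m) * comb(n - fy - 1, m - 1)


# Compare both expressions on a small grid (assumption: 2 <= m <= n).
for n in range(2, 13):
    for m in range(2, n + 1):
        for k in range(1, n - m + 2):
            # A non-integer argument exercises the floor in the closed form.
            assert partial_sum(n, m, k + 0.5) == closed_form(n, m, k + 0.5)
        # At the upper limit the binomial factor vanishes, so A = C(n, m).
        assert closed_form(n, m, n - m + 1) == comb(n, m)
```

The check is not a proof; it only confirms the formula on small cases, including the boundary value \(A(n-m+1)=\left( {\begin{array}{c}n\\ m\end{array}}\right) \) used for the first term of Abel’s identity.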

We proceed as follows:

$$\begin{aligned}&\sum _{k=1}^{n-m+1} \left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) k\ln k\\&\quad \le \int _{y=1}^{n-m+1}\frac{(\left\lfloor y\right\rfloor (m-1)+n)}{m}\left( {\begin{array}{c}n-\left\lfloor y\right\rfloor -1\\ m-1\end{array}}\right) \frac{1}{\left\lfloor y\right\rfloor }dy\\&\quad =\sum _{k=1}^{n-m}\frac{k(m-1)+n}{m}\left( {\begin{array}{c}n-k-1\\ m-1\end{array}}\right) \frac{1}{k}\\&\quad =\frac{m-1}{m}\sum _{k=1}^{n-m}\left( {\begin{array}{c}n-k-1\\ m-1\end{array}}\right) +\frac{n}{m}\sum _{k=1}^{n-m}\left( {\begin{array}{c}n-k-1\\ m-1\end{array}}\right) \frac{1}{k}\\&\quad =\frac{m-1}{m}\left( {\begin{array}{c}n-1\\ m\end{array}}\right) +\frac{n}{m}\left( {\begin{array}{c}n-1\\ m-1\end{array}}\right) \left( H_{n-1}-H_{m-1}\right) \\&\quad =\frac{m-1}{m}\left( {\begin{array}{c}n-1\\ m\end{array}}\right) +\left( {\begin{array}{c}n\\ m\end{array}}\right) \left( H_{n}-H_{m}\right) +\frac{1}{m}\left( {\begin{array}{c}n-1\\ m-1\end{array}}\right) \left( \frac{n-m}{m}\right) \\&\quad =\left( {\begin{array}{c}n-1\\ m\end{array}}\right) +\left( {\begin{array}{c}n\\ m\end{array}}\right) \left( H_{n}-H_{m}\right) = \left( {\begin{array}{c}n\\ m\end{array}}\right) \left( H_{n}-H_{m}+1-\frac{m}{n}\right) , \end{aligned}$$

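where the third equality follows from the identity \(\sum _{k=1}^{n-m}\left( {\begin{array}{c}n-k-1\\ m-1\end{array}}\right) =\left( {\begin{array}{c}n-1\\ m\end{array}}\right) \) together with the fact that, for nonnegative integers \(\ell ,j\) with \(j\le \ell \), we have

$$\begin{aligned} \sum _{i=1}^{\ell -j}\left( {\begin{array}{c}\ell -i\\ j\end{array}}\right) \frac{1}{i}= \left( {\begin{array}{c}\ell \\ j\end{array}}\right) \left( H_{\ell }-H_{j}\right) , \end{aligned}$$

applied with \(\ell =n-1\) and \(j=m-1\). This fact is proved below.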
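Before turning to the proof, the harmonic-number identity can be sanity-checked on small cases. This is a minimal sketch in exact rational arithmetic; the helper names `harmonic`, `identity_lhs`, and `identity_rhs` are hypothetical, and the convention \(H_{0}=0\) is an assumption made here for the case \(j=0\).

```python
from fractions import Fraction
from math import comb


def harmonic(n):
    """H_n as an exact fraction, with the convention H_0 = 0."""
    return sum((Fraction(1, i) for i in range(1, n + 1)), Fraction(0))


def identity_lhs(l, j):
    """Sum over i = 1..l-j of C(l-i, j) / i."""
    return sum((Fraction(comb(l - i, j), i) for i in range(1, l - j + 1)),
               Fraction(0))


def identity_rhs(l, j):
    """C(l, j) * (H_l - H_j)."""
    return comb(l, j) * (harmonic(l) - harmonic(j))


# Exhaustive check for 0 <= j <= l <= 15.
for l in range(0, 16):
    for j in range(0, l + 1):
        assert identity_lhs(l, j) == identity_rhs(l, j)
```

Because the arithmetic is exact, equality is tested without tolerances.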

To prove

$$\begin{aligned} \sum _{i=1}^{\ell -j}\left( {\begin{array}{c}\ell -i\\ j\end{array}}\right) \frac{1}{i}= \left( {\begin{array}{c}\ell \\ j\end{array}}\right) \left( H_{\ell }-H_{j}\right) \end{aligned}$$

let us write it as

$$\begin{aligned} \sum _{i=j+1}^{\ell }\frac{\left( \ell -i+j\right) !}{\left( \ell -i\right) !} \frac{1}{i-j}=\frac{\ell !}{\left( \ell -j\right) !}\sum _{i=j+1}^{\ell }\frac{1}{i}. \end{aligned} \tag{19}$$

The proof is by induction on \(j\). The equality (19) holds for \(j=0\) as both sides reduce to \(\sum _{i=1}^{\ell }\frac{1}{i}\). As the induction hypothesis, suppose (19) holds for a certain value of \(j\). We show that it also holds for \(j+1\). We have

$$\begin{aligned}&\sum _{i=j+2}^{\ell } \frac{\left( \ell -i+j+1\right) !}{\left( \ell -i\right) !}\frac{1}{i-j-1}\\&\quad =\sum _{i=j+1}^{\ell -1}\frac{\left( \ell -i+j\right) !}{\left( \ell -i-1 \right) !}\frac{1}{i-j}\\&\quad =\sum _{i=j+1}^{\ell }\frac{\left( \ell -i\right) \left( \ell -i+j \right) !}{\left( \ell -i\right) !}\frac{1}{i-j}\\&\quad =\sum _{i=j+1}^{\ell }\frac{\left( \ell -j\right) \left( \ell -i+j \right) !}{\left( \ell -i\right) !}\frac{1}{i-j}-\sum _{i=j+1}^{\ell } \frac{\left( i-j\right) \left( \ell -i+j\right) !}{\left( \ell -i\right) !}\frac{1}{i-j}\\&\quad =\left( \ell -j\right) \sum _{i=j+1}^{\ell }\frac{\left( \ell -i+j \right) !}{\left( \ell -i\right) !}\frac{1}{i-j}-\sum _{i=j+1}^{\ell } \frac{\left( \ell -i+j\right) !}{\left( \ell -i\right) !}\\&\quad \mathop {=}\limits ^{\mathsf {(a)}}\left( \ell -j\right) \frac{\ell !}{\left( \ell -j\right) !} \sum _{i=j+1}^{\ell }\frac{1}{i}-j!\sum _{i=j+1}^{\ell } \left( {\begin{array}{c}\ell -i+j\\ j\end{array}}\right) \\&\quad =\frac{\ell !}{\left( \ell -j-1\right) !}\sum _{i=j+1}^{\ell } \frac{1}{i}-j!\sum _{i=j}^{\ell -1}\left( {\begin{array}{c}i\\ j\end{array}}\right) \\&\quad =\frac{\ell !}{\left( \ell -j-1\right) !}\sum _{i=j+1}^{\ell } \frac{1}{i}-\frac{\ell !}{(j+1)\left( \ell -j-1\right) !}\\&\quad =\frac{\ell !}{\left( \ell -j-1\right) !}\sum _{i=j+2}^{\ell }\frac{1}{i}, \end{aligned}$$

where for \(\mathsf {(a)}\) we have used the induction hypothesis. \(\square \)
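With the proof complete, the bound of Lemma 6 can also be checked numerically. The sketch below assumes \(2\le m\le n\) and uses floating-point arithmetic with a small tolerance; the names `lemma_lhs` and `lemma_rhs` are hypothetical.

```python
from math import comb, log


def harmonic(n):
    """H_n as a float, with H_0 = 0."""
    return sum(1.0 / i for i in range(1, n + 1))


def lemma_lhs(n, m):
    """Sum over k = 1..n-m+1 of C(n-k-1, m-2) * k * ln(k)."""
    return sum(comb(n - k - 1, m - 2) * k * log(k)
               for k in range(1, n - m + 2))


def lemma_rhs(n, m):
    """C(n, m) * (H_n - H_m + 1 - m/n)."""
    return comb(n, m) * (harmonic(n) - harmonic(m) + 1 - m / n)


# Check the inequality on a grid (assumption: 2 <= m <= n <= 40).
for n in range(2, 41):
    for m in range(2, n + 1):
        lhs, rhs = lemma_lhs(n, m), lemma_rhs(n, m)
        # Small relative and absolute slack for rounding error.
        assert lhs <= rhs * (1 + 1e-12) + 1e-12
```

For \(m=n\) both sides are zero, so the bound holds with equality in that case.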

Proof of (14)

In this subsection, we prove (14) as follows:

$$\begin{aligned}&P \left( D_{u,v}\left( X,Y_{1}''\right) \right) \\&\quad \mathop {=}\limits ^{\mathsf {(a)}}\frac{1}{t}\sum _{a=0}^{ \left\lfloor (3t-2)/4\right\rfloor }\frac{\left( {\begin{array}{c}a+2\\ 2\end{array}}\right) +\left( {\begin{array}{c}t-a+1\\ 2\end{array}}\right) }{2\left( {\begin{array}{c}t+2\\ 2\end{array}}\right) }+ \frac{1}{t}\sum _{a=\left\lfloor (3t-2)/4\right\rfloor +1}^{t-1}\frac{\left( {\begin{array}{c}a+2\\ 3\end{array}}\right) +\left( {\begin{array}{c}t+2\\ 3\end{array}}\right) -\left( {\begin{array}{c}t-a+2\\ 3\end{array}}\right) }{2a\left( {\begin{array}{c}t+2\\ 2\end{array}}\right) }\\&\quad =\frac{1}{t^{3}+O\left( t^{2}\right) }\biggl [\sum _{a=0}^{\left\lfloor (3t-2)/4\right\rfloor }\left( \left( {\begin{array}{c}a+2\\ 2\end{array}}\right) +\left( {\begin{array}{c}t-a+1\\ 2\end{array}}\right) \right) \\&\qquad +\, \sum _{a=\left\lfloor (3t-2)/4\right\rfloor +1}^{t-1}\frac{1}{a}\left( \left( {\begin{array}{c}a+2\\ 3\end{array}}\right) +\left( {\begin{array}{c}t+2\\ 3\end{array}}\right) -\left( {\begin{array}{c}t-a+2\\ 3\end{array}}\right) \right) \biggr ]\\&\quad =\frac{1}{t^{3}+O\left( t^{2}\right) }\biggl [\frac{\left( 3t/4 \right) ^{3}}{6}+\frac{t^{3}-\left( t/4\right) ^{3}}{6}+\frac{t^{3}-\left( 3t/4\right) ^{3}}{18}+O\left( t^{2}\right) \\&\qquad +\, \sum _{a=\left\lfloor (3t-2)/4\right\rfloor +1}^{t-1} \frac{t^{3}-\left( t-a\right) ^{3}+O\left( t^{2}\right) }{6a}\biggr ]\\&\quad =\frac{1}{1+O\left( t^{-1}\right) }\left( \frac{307}{1152}+\frac{1}{6} \sum _{a=\left\lfloor (3t-2)/4\right\rfloor +1}^{t-1}\left( \frac{3}{t} -\frac{3a}{t^{2}}+\frac{a^{2}}{t^{3}}+O\left( \frac{1}{at}\right) \right) \right) \\&\quad =\frac{1}{1+O\left( t^{-1}\right) }\left( \frac{307}{1152} +\frac{1}{6}\left( \frac{3}{4}-\frac{21}{32}+\frac{37}{192} +O\left( \frac{1}{t}\right) \right) \right) \\&\quad =\frac{181}{576}+o(1), \end{aligned}$$

where \(\mathsf {(a)}\) follows from (13) and where we have used \(\sum _{a=0}^{k+O\left( 1\right) }\left( {\begin{array}{c}a+O\left( 1\right) \\ 2\end{array}}\right) =\frac{k^{3}}{6}+O\left( k^{2}\right) \) and \(\frac{1}{a}\left( {\begin{array}{c}a+2\\ 3\end{array}}\right) =\frac{1}{3}\left( {\begin{array}{c}a+2\\ 2\end{array}}\right) \).
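The constant \(\frac{181}{576}\) and the asymptotic claim can be checked with a short computation. The sketch below is an illustration only: it evaluates the right-hand side of step \(\mathsf {(a)}\) exactly for a finite \(t\ge 1\) and compares it with the limiting value. The function name `p_exact` is hypothetical, and the tolerance in the final check is an arbitrary choice, not a bound from the paper.

```python
from fractions import Fraction
from math import comb


def p_exact(t):
    """Right-hand side of step (a), evaluated exactly for an integer t >= 1."""
    split = (3 * t - 2) // 4          # floor((3t - 2) / 4)
    denom = 2 * comb(t + 2, 2)
    first = sum((Fraction(comb(a + 2, 2) + comb(t - a + 1, 2), denom)
                 for a in range(0, split + 1)), Fraction(0))
    second = sum((Fraction(comb(a + 2, 3) + comb(t + 2, 3) - comb(t - a + 2, 3),
                           a * denom)
                  for a in range(split + 1, t)), Fraction(0))
    return (first + second) / t


# Arithmetic of the last two lines of the derivation, done exactly.
limit = Fraction(307, 1152) + Fraction(1, 6) * (
    Fraction(3, 4) - Fraction(21, 32) + Fraction(37, 192))
assert limit == Fraction(181, 576)

# The finite-t value should approach the limit as t grows.
assert abs(p_exact(200) - limit) < Fraction(1, 100)
```

The first assertion confirms the fraction arithmetic in the final steps; the second only illustrates convergence for one moderately large \(t\).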

About this article

Cite this article

Farnoud, F., Yaakobi, E. & Bruck, J. Approximate sorting of data streams with limited storage. J Comb Optim 32, 1133–1164 (2016). https://doi.org/10.1007/s10878-015-9930-6

  • DOI: https://doi.org/10.1007/s10878-015-9930-6

Keywords

  • Approximate sorting
  • Data stream
  • Limited storage
  • Permutation distortion metrics
  • Weighted Kendall distortion
  • User preference ranking