Journal of Combinatorial Optimization, Volume 32, Issue 4, pp 1133–1164

Approximate sorting of data streams with limited storage

  • Farzad Farnoud
  • Eitan Yaakobi
  • Jehoshua Bruck

Abstract

We consider the problem of approximately sorting a data stream in one pass with limited internal storage, where the goal is not to rearrange the data but to output a permutation that reflects the ordering of the stream's elements as closely as possible. Our main objective is to study the relationship between the quality of the sorting and the amount of available storage. To measure quality, we use permutation distortion metrics, namely the Kendall tau, Chebyshev, and weighted Kendall metrics, as well as the mutual information between the output permutation and the true ordering of the data elements. We provide bounds on the performance of algorithms with limited storage and present a simple algorithm whose storage requirement is asymptotically within a constant factor of that of an optimal algorithm, in terms of mutual information and average Kendall tau distortion. We also study the case in which only information about the most recent elements of the stream is available. This setting has applications to learning user preference rankings in services such as Netflix, where items are presented to the user one at a time.
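To make two of the distortion measures named above concrete, the following minimal sketch computes the Kendall tau and Chebyshev distances between two rankings using their standard textbook definitions. It is an illustration only, not the paper's algorithm: the function names, the representation of a ranking as a list in which position `i` holds the rank of item `i`, and the sample data are all assumptions made for this example.

```python
from itertools import combinations


def kendall_tau_distance(p, q):
    """Count the item pairs that rankings p and q place in opposite order.

    p[i] and q[i] are the ranks assigned to item i (assumed representation).
    """
    if len(p) != len(q):
        raise ValueError("rankings must have the same length")
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (q[i] - q[j]) < 0
    )


def chebyshev_distance(p, q):
    """Largest displacement of any single item between rankings p and q."""
    if len(p) != len(q):
        raise ValueError("rankings must have the same length")
    return max((abs(a - b) for a, b in zip(p, q)), default=0)


# Hypothetical example: the true ordering versus an approximate output
# in which two adjacent pairs of items have been swapped.
true_ranks = [0, 1, 2, 3, 4]
approx_ranks = [1, 0, 2, 4, 3]

print(kendall_tau_distance(true_ranks, approx_ranks))  # 2
print(chebyshev_distance(true_ranks, approx_ranks))    # 1
```

The Kendall tau distance counts pairwise disagreements, so it reflects how much overall reordering separates the output from the truth, while the Chebyshev distance reports only the worst misplacement of a single element. The pairwise loop takes quadratic time, which is acceptable for a small illustration.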

Keywords

  • Approximate sorting
  • Data stream
  • Limited storage
  • Permutation distortion metrics
  • Weighted Kendall distortion
  • User preference ranking

Notes

Acknowledgments

The authors would like to thank Ryan Gabrys and Yue Li for useful discussions and comments. Furthermore, the authors thank anonymous reviewers whose comments greatly improved this paper.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. California Institute of Technology, Pasadena, USA