Journal of Combinatorial Optimization, Volume 11, Issue 2, pp 155–175

Finding longest increasing and common subsequences in streaming data

  • David Liben-Nowell
  • Erik Vee
  • An Zhu

Abstract

We present algorithms and lower bounds for the Longest Increasing Subsequence (LIS) and Longest Common Subsequence (LCS) problems in the data-streaming model. To decide if the LIS of a given stream of elements drawn from an alphabet Σ has length at least k, we discuss a one-pass algorithm using O(k log |Σ|) space, with update time either O(log k) or O(log log |Σ|); for |Σ| = O(1), we can achieve O(log k) space and constant-time updates. We also prove a lower bound of Ω(k) on the space requirement for this problem for general alphabets Σ, even when the input stream is a permutation of Σ. For finding the actual LIS, we give a ⌈log(1 + 1/ε)⌉-pass algorithm using O(k^(1+ε) log |Σ|) space, for any ε > 0. For LCS, there is a trivial Θ(1)-approximate O(log n)-space streaming algorithm when |Σ| = O(1). For general alphabets Σ, the problem is much harder. We prove several lower bounds on the LCS problem, of which the strongest is the following: it is necessary to use Ω(n/ρ²) space to approximate the LCS of two n-element streams to within a factor of ρ, even if the streams are permutations of each other.
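
The one-pass decision problem can be illustrated with the standard patience-sorting technique. The sketch below is not taken from the paper and may differ from the authors' algorithm; the function name `lis_at_least` and the choice of a strictly increasing order over integers are assumptions made for illustration. It keeps at most k stored elements (each needing O(log |Σ|) bits) and spends O(log k) time per element on a binary search, which is consistent with the bounds quoted in the abstract.

```python
from bisect import bisect_left
from typing import Iterable


def lis_at_least(stream: Iterable[int], k: int) -> bool:
    """Return True if the stream has a strictly increasing subsequence of length >= k.

    Hypothetical helper for illustration: reads the stream once and stores
    at most k elements.
    """
    if k <= 0:
        return True
    # tails[i] is the smallest last element of any increasing subsequence
    # of length i + 1 seen so far; tails is always sorted.
    tails = []
    for x in stream:
        i = bisect_left(tails, x)  # first position whose tail is >= x
        if i == len(tails):
            # x extends the longest subsequence found so far.
            tails.append(x)
            if len(tails) >= k:
                return True  # stop early; no more than k values are ever kept
        else:
            # x is a smaller tail for subsequences of length i + 1.
            tails[i] = x
    return False
```

For example, `lis_at_least([3, 1, 4, 1, 5, 9, 2, 6], 4)` finds the subsequence 1, 4, 5, 9 and returns `True`, while asking for length 5 on the same stream returns `False`.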

Keywords

LIS LCS Data-streaming model Algorithms Lower bounds 

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Department of Mathematics and Computer Science, Carleton College, USA
  2. IBM, Almaden Research Center, New York, USA
  3. Google, Inc., USA