Linear-Time Off-Line Text Compression by Longest-First Substitution

  • Conference paper
String Processing and Information Retrieval (SPIRE 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2857)

Abstract

Given a text, grammar-based compression constructs a grammar that generates the text. There are many text compression techniques of this type. Each compression scheme is categorized as either off-line or on-line, according to how the text is processed. One representative tactic for off-line compression is to substitute the longest repeated factors of a text with a production rule. In this paper, we present an algorithm that compresses a text based on this longest-first principle in linear time. The algorithm employs a suitable index structure for the text and involves technically efficient operations on the structure.
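
To make the longest-first principle concrete, the following is a minimal illustrative sketch, not the algorithm of the paper. It assumes a naive polynomial-time search rather than the linear-time index structure the abstract refers to, it only searches the start sequence for repeats, and all function names and the `<k>` nonterminal naming are hypothetical choices made for this example.

```python
def longest_repeat(seq):
    """Return the longest factor (length >= 2) of seq that has two
    non-overlapping occurrences, or None if there is none.
    Naive search; NOT linear time."""
    n = len(seq)
    for length in range(n // 2, 1, -1):      # try longest candidates first
        first_seen = {}                      # factor -> leftmost start index
        for i in range(n - length + 1):
            key = tuple(seq[i:i + length])
            j = first_seen.setdefault(key, i)
            if i - j >= length:              # occurrences do not overlap
                return list(key)
    return None


def substitute(seq, factor, symbol):
    """Replace non-overlapping occurrences of factor, scanning left to right."""
    out, i, k = [], 0, len(factor)
    while i < len(seq):
        if seq[i:i + k] == factor:
            out.append(symbol)
            i += k
        else:
            out.append(seq[i])
            i += 1
    return out


def longest_first_grammar(text):
    """Repeatedly replace the longest repeated factor by a new nonterminal.
    Returns (start_sequence, rules); nonterminals are named '<1>', '<2>', ..."""
    start, rules = list(text), {}
    while True:
        factor = longest_repeat(start)
        if factor is None:
            return start, rules
        name = f"<{len(rules) + 1}>"         # hypothetical naming scheme
        rules[name] = factor
        start = substitute(start, factor, name)


def expand(seq, rules):
    """Regenerate the original text from a start sequence and its rules."""
    out = []
    for s in seq:
        out.append(expand(rules[s], rules) if s in rules else s)
    return "".join(out)


start, rules = longest_first_grammar("abcabcabc")
print(start, rules)
print(expand(start, rules))
```

For the input `abcabcabc`, the longest factor with two non-overlapping occurrences is `abc`, so the sketch produces a single rule and a start sequence of three nonterminals. Expanding the grammar recovers the original text.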

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Inenaga, S., Funamoto, T., Takeda, M., Shinohara, A. (2003). Linear-Time Off-Line Text Compression by Longest-First Substitution. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2003. Lecture Notes in Computer Science, vol 2857. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39984-1_11

  • DOI: https://doi.org/10.1007/978-3-540-39984-1_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20177-9

  • Online ISBN: 978-3-540-39984-1

  • eBook Packages: Springer Book Archive
