
FEMTO: Fast Search of Large Sequence Collections

  • Michael P. Ferguson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7354)

Abstract

We present FEMTO, a new system for indexing and searching large collections of sequence data. We used FEMTO to index and search three large collections, including one 182 GB collection. We compare the performance of FEMTO indexing and search with Bowtie and with Lucene, and we compare performance with indexes stored on hard disks and in flash memory. To our knowledge, we report on the first compressed suffix array storing more than 100 GB. Even for the largest collection, most searches completed in under 10 seconds.
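For readers unfamiliar with the index family named in the keywords, the following is a minimal sketch of FM-index backward search, the counting query that compressed suffix arrays of this kind support. It is not FEMTO's implementation: the function names (`build_fm_index`, `count_occurrences`) are hypothetical, the index is held uncompressed in memory, and the suffix array is built by naive sorting, which is only practical for tiny inputs.

```python
def build_fm_index(text):
    """Build a toy, uncompressed FM-index for `text` (illustrative only)."""
    text = text + "\0"  # sentinel assumed absent from the text; sorts before all other characters
    n = len(text)
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(n), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in the text strictly smaller than c.
    first = {}
    total = 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (n + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return {"sa": sa, "first": first, "occ": occ, "n": n}


def suffix_range(index, pattern):
    """Return the half-open suffix array range [lo, hi) matching `pattern`."""
    lo, hi = 0, index["n"]
    # Backward search: extend the match one character at a time, right to left.
    for c in reversed(pattern):
        if c not in index["first"]:
            return 0, 0
        lo = index["first"][c] + index["occ"][c][lo]
        hi = index["first"][c] + index["occ"][c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count_occurrences(index, pattern):
    """Count how many times `pattern` occurs in the indexed text."""
    lo, hi = suffix_range(index, pattern)
    return hi - lo


def locate(index, pattern):
    """Return the sorted start positions of `pattern` in the indexed text."""
    lo, hi = suffix_range(index, pattern)
    return sorted(index["sa"][lo:hi])
```

The number of steps in the counting query depends on the pattern length rather than the text length, which is one reason this index family is attractive for very large collections. A production system would replace the plain `occ` table and suffix array with compressed, sampled structures.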

Keywords

FM-index, document retrieval, external memory, regular expression, compressed suffix array



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Michael P. Ferguson
    Laboratory for Telecommunications Sciences, College Park, Maryland
