Scalable Sequence Similarity Search and Join in Main Memory on Multi-cores

  • Astrid Rheinländer
  • Ulf Leser
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7156)

Abstract

Similarity-based queries play an important role in many large-scale applications. In bioinformatics, DNA sequencing produces huge collections of strings that need to be compared and merged. We present PeARL, a data structure and algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries that are held entirely in main memory. Searches and joins are parallelized using MapReduce as the underlying execution paradigm. We show that our data structure can support many real-world sequence comparison applications in main memory. Our evaluation reveals that PeARL achieves a significant performance gain over single-threaded solutions. However, the evaluation also shows that scalability should be further improved, e.g., by reducing sequential parts of the algorithms.
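
To make the idea of trie-based similarity search concrete, the following is a minimal sketch and not PeARL's actual implementation. It assumes edit distance as the similarity measure and uses a plain, uncompressed trie that is searched sequentially; the abstract does not specify these details, and the compression and MapReduce-based parallelization it describes are omitted here. All names (`TrieNode`, `build_trie`, `similarity_search`) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of a plain (uncompressed) trie; hypothetical, for illustration."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set when a stored string ends here


def build_trie(strings: List[str]) -> TrieNode:
    """Insert every string character by character, sharing common prefixes."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def similarity_search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return (string, distance) for all stored strings within edit distance k.

    Assumption: edit (Levenshtein) distance is the similarity measure.
    """
    results: List[Tuple[str, int]] = []
    # Distances between the empty prefix and every prefix of the query.
    first_row = list(range(len(query) + 1))

    def descend(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # Compute one dynamic-programming row for the prefix ending in ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,          # insertion
                           prev_row[j] + 1,         # deletion
                           prev_row[j - 1] + cost)) # match or substitution
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Prune: no extension of this prefix can get back below the threshold.
        if min(row) <= k:
            for next_ch, child in node.children.items():
                descend(child, next_ch, row)

    if root.word is not None and first_row[-1] <= k:
        results.append((root.word, first_row[-1]))
    for ch, child in root.children.items():
        descend(child, ch, first_row)
    return sorted(results)
```

For example, with the strings `ACGT`, `ACGA`, `TTTT` and `ACG` indexed, `similarity_search(root, "ACGT", 1)` returns `ACG` and `ACGA` at distance 1 and `ACGT` at distance 0. Because strings with a common prefix share one dynamic-programming row, a whole subtree can be discarded as soon as every entry in its row exceeds the threshold.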

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Astrid Rheinländer (1)
  • Ulf Leser (1)
  1. Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
