Abstract
Similarity-based queries play an important role in many large-scale applications. In bioinformatics, DNA sequencing produces huge collections of strings that need to be compared and merged. We present PeARL, a data structure and a set of algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries that are held entirely in main memory. Searches and joins are parallelized using MapReduce as the underlying execution paradigm. We show that our data structure can support many real-world sequence comparison applications in main memory. Our evaluation reveals that PeARL achieves a significant performance gain over single-threaded solutions. However, the evaluation also shows that scalability should be improved further, e.g., by reducing the sequential parts of the algorithms.
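The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not PeARL itself. It assumes an uncompressed trie (PeARL uses compressed tries), edit distance as the similarity measure, and a thread pool standing in for the map step of MapReduce. All names (`TrieNode`, `build_trie`, `search`, `parallel_search`) are hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the child node
        self.word = None    # set to the stored string if one ends here


def build_trie(strings):
    """Index a collection of strings in a plain (uncompressed) trie."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search(root, query, k):
    """Return sorted (string, distance) pairs with edit distance <= k."""
    results = []
    # Row for the empty trie prefix: distance to each query prefix.
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Extend the dynamic-programming table by one row for character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,          # skip a query character
                           prev_row[j] + 1,         # skip a trie character
                           prev_row[j - 1] + cost)) # match or substitute
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Prune: no extension of this prefix can get back under k.
        if min(row) <= k:
            for c, child in node.children.items():
                walk(child, c, row)

    if root.word is not None and len(query) <= k:
        results.append((root.word, len(query)))  # the empty string is indexed
    for c, child in root.children.items():
        walk(child, c, first_row)
    return sorted(results)


def parallel_search(root, queries, k, workers=4):
    """Map step: run independent searches against the shared, read-only trie."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(lambda q: search(root, q, k), queries)
        return dict(zip(queries, hits))


index = build_trie(["ACGT", "ACGA", "TTTT", "ACG"])
print(parallel_search(index, ["ACGT", "TTT"], k=1))
```

Because the trie shares common prefixes, each row of the edit distance table is computed once per prefix rather than once per stored string, and whole subtrees are skipped as soon as every entry in a row exceeds `k`. In CPython, threads do not run this CPU-bound work truly in parallel; the thread pool here only illustrates how independent queries can be distributed over workers.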
Keywords
- Main Memory
- Edit Distance
- Approximate String Match
- PeARL Index
- Runtime Improvement
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rheinländer, A., Leser, U. (2012). Scalable Sequence Similarity Search and Join in Main Memory on Multi-cores. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_3
DOI: https://doi.org/10.1007/978-3-642-29740-3_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29739-7
Online ISBN: 978-3-642-29740-3
eBook Packages: Computer Science (R0)
