Scalable Sequence Similarity Search and Join in Main Memory on Multi-cores
Similarity-based queries play an important role in many large-scale applications. In bioinformatics, DNA sequencing produces huge collections of strings that need to be compared and merged. We present PeARL, a data structure and a set of algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries that are held entirely in main memory. Searches and joins are parallelized using MapReduce as the underlying execution paradigm. We show that our data structure can support many real-world sequence-comparison applications in main memory. Our evaluation reveals that PeARL achieves a significant performance gain over single-threaded solutions. However, the evaluation also shows that scalability should be improved further, e.g., by reducing the sequential parts of the algorithms.
Keywords: Main Memory, Edit Distance, Approximate String Match, PeARL Index, Runtime Improvement
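The abstract describes trie-indexed similarity search and join with parallel execution. The following is a minimal sketch of that general idea, not PeARL's actual implementation: it uses an uncompressed trie rather than a compressed one, takes edit distance as the similarity measure (suggested by the keywords), and replaces MapReduce with a thread pool that maps queries to matches and then merges the results. All names (`TrieNode`, `build_trie`, `search`, `similarity_join`) are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.word = None     # set on nodes where an indexed string ends


def build_trie(strings):
    """Index all strings in a plain (uncompressed) in-memory trie."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search(root, query, k):
    """Return sorted (string, distance) pairs with edit distance <= k."""
    results = []
    first_row = list(range(len(query) + 1))

    def visit(node, ch, prev_row):
        # Compute one row of the edit-distance table for this trie node.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,          # insertion
                           prev_row[j] + 1,         # deletion
                           prev_row[j - 1] + cost)) # match or substitution
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Prune: if every cell exceeds k, no descendant can match.
        if min(row) <= k:
            for c, child in node.children.items():
                visit(child, c, row)

    if root.word is not None and first_row[-1] <= k:
        results.append((root.word, first_row[-1]))
    for c, child in root.children.items():
        visit(child, c, first_row)
    return sorted(results)


def similarity_join(left, right, k, workers=4):
    """Map each string in `left` to its matches in `right`, then merge."""
    root = build_trie(right)

    def probe(q):
        return [(q, s, d) for s, d in search(root, q, k)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_query = pool.map(probe, left)   # "map" phase
        return sorted(t for matches in per_query for t in matches)  # merge
```

For example, `search(build_trie(["ACGT", "ACGA", "TTTT"]), "ACGT", 1)` returns the two strings within one edit of the query. In standard CPython the thread pool only illustrates the map-then-merge shape of the computation; it does not provide the multi-core speedup that the paper evaluates.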