Exploiting thread-level and instruction-level parallelism to cluster mass spectrometry data using multicore architectures

Saeed, Fahad; Hoffert, Jason D.; Pisitkun, Trairak; Knepper, Mark A.

doi:10.1007/s13721-014-0054-1

Original Article
Published: 15 April 2014

Volume 3, article number 54, (2014)

Network Modeling Analysis in Health Informatics and Bioinformatics

Fahad Saeed^1,2,4, Jason D. Hoffert^2, Trairak Pisitkun^2,3 & Mark A. Knepper^2

Abstract

Modern mass spectrometers can produce large numbers of peptide spectra from complex biological samples in a short time. These data sets contain substantial redundancy, because the same peptide may be selected multiple times in a liquid chromatography tandem mass spectrometry experiment. In addition, many spectra cannot be mapped to specific peptide sequences because of their low signal-to-noise ratio. Clustering is one way to mitigate the problems of these complex mass spectrometry data sets. Recently, we presented a graph-theoretic framework, known as CAMS, for clustering large-scale mass spectrometry data. CAMS uses a novel metric that exploits spatial patterns in the mass spectrometry peaks, which yields highly accurate clustering results. However, comparing each spectrum with every other spectrum makes the clustering problem computationally expensive. In this paper, we present a parallel algorithm, called P-CAMS, that uses thread-level and instruction-level parallelism on multicore architectures to substantially decrease running times. P-CAMS relies on intelligent matrix completion to reduce the number of comparisons, on threads that run on each core, and on the single instruction, multiple data (SIMD) paradigm inside each thread to exploit the parallelism available on multicore architectures. A carefully crafted load-balancing scheme, which uses the spatial locations of the mass spectrometry peaks and maps them to the nearest-level cache and core, allows super-linear speedups. We study the scalability of the algorithm across a wide variety of mass spectrometry data and architecture-specific parameters. The results show that SIMD-style data parallelism combined with thread-level parallelism is a powerful combination on multicore architectures, allowing substantial reductions in run time even for all-to-all comparison algorithms. Quality is assessed on a real-world data set and is shown to be consistent with the serial version of the same algorithm.
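
As a rough illustration of the all-to-all comparison pattern described in the abstract, the Python sketch below computes only the pairs with i < j, mirrors each score into the symmetric half of the matrix, and deals the pairs out to a pool of threads. Everything here is an assumption made for illustration: the `similarity` function is a simple Jaccard index standing in for the CAMS metric (which the abstract does not specify), the symmetry shortcut is only one plausible reading of "matrix completion", and the round-robin split is not the paper's cache-aware load-balancing scheme. SIMD is not shown, and in CPython the global interpreter lock limits the speedup from threads for pure-Python work, so this shows the structure of the computation rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def similarity(a, b):
    """Hypothetical stand-in metric: Jaccard index over sets of peak positions."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty spectra are treated as identical
    return len(sa & sb) / len(sa | sb)


def pairwise_similarity(spectra, workers=4):
    """Fill a symmetric similarity matrix by computing only the pairs with i < j."""
    n = len(spectra)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    pairs = list(combinations(range(n), 2))
    # Deal the pairs out round-robin so each worker gets a similar amount of work.
    chunks = [pairs[k::workers] for k in range(workers)]

    def compare_chunk(chunk):
        return [(i, j, similarity(spectra[i], spectra[j])) for i, j in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for results in pool.map(compare_chunk, chunks):
            for i, j, score in results:
                sim[i][j] = score
                sim[j][i] = score  # mirror instead of recomputing
    return sim


# Toy spectra given as lists of (already rounded) peak positions.
spectra = [[100, 200, 300], [100, 200, 400], [500]]
matrix = pairwise_similarity(spectra, workers=2)
```

For n spectra this evaluates n(n-1)/2 pairs instead of n², which is the kind of saving that makes the remaining comparisons worth distributing across cores.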

Notes

Notations of Fset and F-set are used interchangeably in this paper.

Acknowledgements

This work was funded by the operating budget of the Division of Intramural Research, National Heart, Lung and Blood Institute, National Institutes of Health (NIH), Project ZO1-HL001285, and by the National Science Foundation (NSF) under grant CNS-1250264. All mass spectrometry data were produced at the Proteomics Core at System Biology Center (SBC), NHLBI, NIH.

Author information

Authors and Affiliations

Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
Fahad Saeed
Epithelial Systems Biology Laboratory, National Heart Lung and Blood Institute (NHLBI), National Institutes of Health (NIH), Bethesda, MD, USA
Fahad Saeed, Jason D. Hoffert, Trairak Pisitkun & Mark A. Knepper
Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
Trairak Pisitkun
Department of Electrical and Computer Engineering, Western Michigan University, Kalamazoo, MI, USA
Fahad Saeed

Corresponding author

Correspondence to Fahad Saeed.

About this article

Cite this article

Saeed, F., Hoffert, J.D., Pisitkun, T. et al. Exploiting thread-level and instruction-level parallelism to cluster mass spectrometry data using multicore architectures. Netw Model Anal Health Inform Bioinforma 3, 54 (2014). https://doi.org/10.1007/s13721-014-0054-1

Received: 06 November 2013
Revised: 29 January 2014
Accepted: 03 February 2014
Published: 15 April 2014
DOI: https://doi.org/10.1007/s13721-014-0054-1
