Content-based search of gene expression databases using binary fingerprints of differential expression profiles

  • Original Article
  • Journal: Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

The availability and rapid growth of microarray databases have made integrated analysis of these databases computationally challenging. We present a novel approach to content-based searching in microarray databases that uses binary vector representations and is inspired by the chemoinformatics field. A benchmark compendium of microarray datasets is established for evaluating content-based searching. Differential expression profiles from microarray experiments are represented either as floating point vectors or as concise binary vectors. The benchmark compendium is searched using several distance measures to determine similarity. We demonstrate that binary vector representations achieve accuracies equivalent to or better than those of floating point measures, while significantly reducing the time required to search a microarray database, owing to fast bitwise operations and reduced memory requirements. Experiments on a large database of binary vector representations demonstrate that a modified Tanimoto distance measure is best suited for content-based search of differential microarray profiles. The search method is available as a web service at: http://sacan.biomed.drexel.edu/mageoindex/.
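As a rough illustration of why binary fingerprints can be searched with fast bitwise operations, the following minimal sketch packs a differential expression profile into an integer bit vector and compares two such vectors. The binarization rule (one bit per gene, set when the absolute log fold change reaches a threshold), the threshold value, and all function names are assumptions made for this example and are not taken from the article. Only the standard Hamming and Tanimoto distances are shown; the modified Tanimoto measure that the abstract reports as best suited is not reproduced here.

```python
def fingerprint(log_fold_changes, threshold=1.0):
    """Pack one bit per gene into an int.

    Assumed rule (hypothetical, not the article's): bit i is 1 when
    |log fold change| of gene i is at least `threshold`.
    """
    bits = 0
    for i, value in enumerate(log_fold_changes):
        if abs(value) >= threshold:
            bits |= 1 << i
    return bits


def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def hamming_distance(a, b):
    """Number of bit positions where the two fingerprints differ."""
    return popcount(a ^ b)


def tanimoto_distance(a, b):
    """1 - |a AND b| / |a OR b|; defined as 0.0 when both are empty."""
    union = popcount(a | b)
    if union == 0:
        return 0.0
    return 1.0 - popcount(a & b) / union


# Two toy profiles of five genes each (made-up numbers).
query = fingerprint([2.1, -0.2, 1.5, 0.0, -1.8])
other = fingerprint([1.9, 0.1, 0.3, 0.0, -2.2])

print(hamming_distance(query, other))
print(round(tanimoto_distance(query, other), 3))
```

Each comparison reduces to a handful of XOR, AND, OR, and bit-count operations on packed words, which is the kind of saving in time and memory that the abstract attributes to binary representations.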


Abbreviations

GEO: Gene Expression Omnibus


Conflict of interest

None declared.

Author information

Corresponding author

Correspondence to Ahmet Sacan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

13721_2015_76_MOESM1_ESM.doc

Supplementary material 1: Validation Data Sets. Lists the profiles from which subsets were derived for validation and indicates which samples belong to which subsets. (DOC 132 kb)

13721_2015_76_MOESM2_ESM.xlsx

Supplementary material 2: Results of Confirmation. A spreadsheet containing confirmation results for the Hamming, Modified Tanimoto, and Tanimoto distance measures. (XLSX 45 kb)


Cite this article

Bell, F., Sacan, A. Content-based search of gene expression databases using binary fingerprints of differential expression profiles. Netw Model Anal Health Inform Bioinforma 4, 4 (2015). https://doi.org/10.1007/s13721-015-0076-3

  • DOI: https://doi.org/10.1007/s13721-015-0076-3
