Abstract
This chapter outlines key considerations for constructing and implementing an EST database. Rather than walking through the technical details step by step, it emphasizes how to design an EST database suited to the specific needs of an EST project and how to choose the most suitable tools. Using TBestDB as an example, we illustrate the essential factors to consider in database construction and the steps for data population and annotation. This process uses PostgreSQL, Perl, and PHP to build the database and interface, and tools such as AutoFACT for data processing and annotation. We compare these with other available technologies and tools, and explain the reasons for our choices.
Copyright information
© 2009 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Shen, Y.-Q., O’Brien, E., Koski, L., Lang, B.F., Burger, G. (2009). EST Databases and Web Tools for EST Projects. In: Parkinson, J. (ed.) Expressed Sequence Tags (ESTs). Methods in Molecular Biology, vol 533. Humana Press. https://doi.org/10.1007/978-1-60327-136-3_11
DOI: https://doi.org/10.1007/978-1-60327-136-3_11
Publisher Name: Humana Press
Print ISBN: 978-1-58829-759-4
Online ISBN: 978-1-60327-136-3
eBook Packages: Springer Protocols