Algorithmic Problems Arising from Genome Informatics
Important information about genes and proteins is encoded in nucleic acid sequences and amino acid sequences. Genome databases such as PIR, GenBank, PDB, EMBL, and PROSITE compile these sequences together with their functions and structures, drawing on information published in the open literature or submitted directly to the database owners by researchers. For proteins, about 50,000 amino acid sequences are currently registered in these databases, and this number is expected to exceed 100,000 within a few years.
The Human Genome Project started around 1990 in the United States, Japan, and a number of European countries, and is marked by international cooperation among researchers in a wide range of fields. The project is now entering a second stage, in which the goal is to determine all information about human genes by about 2005 or 2010. Genome Informatics, or Molecular Bioinformatics, is a new and rapidly evolving field that considers the various problems involved in processing these enormous amounts of genome data. Although great strides have been made in the last 30 years in gathering and analyzing genome data, in some ways Computer Science and Molecular Biology seem to have developed on different evolutionary branches of the Science family tree. However, it is now understood that Genome Informatics is concerned with, and embraces, almost all fields of Computer Science.
This talk discusses, from the viewpoint of algorithm design and complexity, some algorithmic problems arising from our research on developing BONSAI Garden, a knowledge discovery system for amino acid sequences and nucleic acid sequences. Algorithmic challenges for knowledge discovery may provide new directions for experimental investigations in Molecular Biology.