Abstract
Biological research frequently requires specialist databases to support in-depth analysis of specific subjects. With the rapid growth of biological sequences in public-domain data sources, it is difficult to keep these databases current with their sources. Simple queries formulated to retrieve relevant sequences typically return a large number of false matches and thus demand manual filtering. In this paper, we propose a novel methodology that supports automatic incremental updating of specialist databases. Complex queries for incremental updating of relevant sequences are learned using Association Rule Mining (ARM), resulting in a significant reduction in false positive matches. This is the first time ARM has been used to formulate descriptive queries for the incremental maintenance of specialised biological databases. We have implemented and tested our methodology on two real-world databases. Our experiments show that the methodology achieves an F-score of up to 80% in detecting new sequences for these two databases.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lam, K.T., Koh, J.L.Y., Veeravalli, B., Brusic, V. (2006). Incremental Maintenance of Biological Databases Using Association Rule Mining. In: Rajapakse, J.C., Wong, L., Acharya, R. (eds) Pattern Recognition in Bioinformatics. PRIB 2006. Lecture Notes in Computer Science, vol 4146. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11818564_16
DOI: https://doi.org/10.1007/11818564_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37446-6
Online ISBN: 978-3-540-37447-3
eBook Packages: Computer Science, Computer Science (R0)