A Method for Similarity-Based Grouping of Biological Data

  • Vaida Jakonienė
  • David Rundqvist
  • Patrick Lambrix
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4075)


Similarity-based grouping of data entries in one or more data sources underlies many data management tasks, such as structuring search results, removing redundancy from databases, and integrating data. For life science data sources it is not a trivial task, because the stored data is complex, highly correlated, and represented at different levels of granularity. The contribution of this paper is two-fold: 1) we propose a method for similarity-based grouping, and 2) we show results from test cases. The method consists of four main steps: specification of grouping rules, pairwise grouping between entries, actual grouping of similar entries, and evaluation and analysis of the results. Often, different strategies can be used in each step. The method enables exploration of the influence of these choices and supports evaluation of the results with respect to given classifications. The grouping method is illustrated by test cases based on different strategies and classifications. The results show the complexity of similarity-based grouping tasks and give deeper insights into the selected grouping tasks, the analyzed data source, and the influence of different strategies on the results.
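The four steps named in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the token-set Jaccard similarity, the threshold of 0.5, the use of connected components as the grouping strategy, and purity as the evaluation measure are all assumptions chosen for brevity, and the entries and class labels are invented toy data.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed stand-in similarity function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pairwise_grouping(entries, rule, threshold):
    """Step 2: return the pairs of entry ids that the grouping rule judges similar."""
    return [(i, j) for i, j in combinations(sorted(entries), 2)
            if rule(entries[i], entries[j]) >= threshold]


def group(entries, pairs):
    """Step 3: merge similar pairs into groups (connected components, union-find)."""
    parent = {i: i for i in entries}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for i in entries:
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=min)


def purity(groups, classification):
    """Step 4: fraction of entries that belong to the majority class of their group."""
    total = sum(len(g) for g in groups)
    agree = 0
    for g in groups:
        counts = {}
        for i in g:
            counts[classification[i]] = counts.get(classification[i], 0) + 1
        agree += max(counts.values())
    return agree / total


# Step 1: the grouping rule is the pair (similarity function, threshold).
# Toy entries and a toy reference classification (hypothetical data).
entries = {
    "e1": "serine protease inhibitor",
    "e2": "protease inhibitor serine type",
    "e3": "glucose transporter",
    "e4": "glucose transporter member 1",
}
classification = {"e1": "A", "e2": "A", "e3": "B", "e4": "B"}

pairs = pairwise_grouping(entries, jaccard, threshold=0.5)
groups = group(entries, pairs)
score = purity(groups, classification)
```

Swapping in a different similarity function, threshold, or grouping strategy and comparing the resulting scores against a given classification is the kind of exploration the abstract describes.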


Keywords: Gene Ontology, Mutual Information, Data Entry, Biological Data, Semantic Similarity
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Vaida Jakonienė (1)
  • David Rundqvist (1)
  • Patrick Lambrix (1)
  1. Department of Computer and Information Science, Linköpings universitet, Linköping, Sweden
