WekaBioSimilarity—Extending Weka with Resemblance Measures

  • César Domínguez
  • Jónathan Heras (corresponding author)
  • Eloy Mata
  • Vico Pascual
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9868)


The classification of organisms is a routine task in biology and in other fields. It is usually carried out by comparing a set of descriptors associated with each object. However, general-purpose statistical packages offer only a limited number of methods for such comparisons, so specific tools are required for each concrete problem. Weka is a freely available framework that supports both supervised and unsupervised machine-learning algorithms. Here, we present WekaBioSimilarity, an extension of Weka that implements several resemblance measures for comparing different kinds of descriptors: binary, multi-value, string, numerical, and heterogeneous data. Together with Weka, WekaBioSimilarity makes it possible to classify objects using different resemblance measures in combination with clustering and classification algorithms. The two systems can be used together as a standalone application or incorporated into the workflow of other software systems that require a classification process. WekaBioSimilarity is available at
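
To make the notion of a resemblance measure on binary descriptors concrete, the following Python sketch computes two classical coefficients, Jaccard and simple matching, from presence/absence vectors. It is only an illustration under stated assumptions: it does not use Weka or WekaBioSimilarity, the function names are hypothetical, and the abstract does not say which coefficients the tool actually provides.

```python
from typing import Sequence, Tuple


def match_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two binary descriptor vectors.

    a: present in both, b: present only in x,
    c: present only in y, d: absent in both.
    """
    if len(x) != len(y):
        raise ValueError("descriptor vectors must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = sum(1 for p, q in zip(x, y) if not p and not q)
    return a, b, c, d


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard coefficient a / (a + b + c); joint absences are ignored."""
    a, b, c, _ = match_counts(x, y)
    denom = a + b + c
    # Assumption: two all-absent vectors are treated as identical.
    return a / denom if denom else 1.0


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Simple matching coefficient (a + d) / (a + b + c + d)."""
    a, b, c, d = match_counts(x, y)
    n = a + b + c + d
    if n == 0:
        raise ValueError("descriptor vectors must not be empty")
    return (a + d) / n


if __name__ == "__main__":
    # Hypothetical presence/absence profiles for two objects.
    u = [1, 1, 0, 1, 0]
    v = [1, 0, 0, 1, 1]
    print(jaccard(u, v), simple_matching(u, v))
```

The two coefficients differ only in whether joint absences count as agreement, which is one reason a choice of several measures is useful when the meaning of an absent descriptor depends on the application.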



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • César Domínguez (1)
  • Jónathan Heras (1), corresponding author
  • Eloy Mata (1)
  • Vico Pascual (1)

  1. Department of Mathematics and Computer Science, University of La Rioja, Logroño, Spain
