Mining Biological Data on the Cloud – A MapReduce Approach

  • Zafeiria-Marina Ioannou
  • Nikolaos Nodarakis
  • Spyros Sioutas
  • Athanasios Tsakalidis
  • Giannis Tzimas
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 437)


Over the last decades, bioinformatics has emerged as an active field of research, leading to the development of a wide variety of applications. The primary goal of bioinformatics is to uncover useful knowledge hidden in large volumes of biological and biomedical data, gain greater insight into the relationships within them and, thereby, enhance the discovery and comprehension of biological processes. To achieve this, a great number of text mining techniques have been developed that efficiently manage biological and biomedical data repositories and reveal meaningful patterns and correlations in them. However, as the volume of data grows rapidly, these techniques cannot cope with the resulting computational burden, since they apply only in centralized environments. Consequently, a turn to distributed and parallel solutions is indispensable. In this work, we propose an efficient and scalable solution, in the MapReduce framework, for mining and analyzing biological and biomedical data.


Bioinformatics, Data mining, Text mining, Clustering, MapReduce, Hadoop



Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Zafeiria-Marina Ioannou (1)
  • Nikolaos Nodarakis (1)
  • Spyros Sioutas (2)
  • Athanasios Tsakalidis (1)
  • Giannis Tzimas (3)

  1. Computer Engineering and Informatics Department, University of Patras, Patras, Greece
  2. Department of Informatics, Ionian University, Corfu, Greece
  3. Computer & Informatics Engineering Department, Technological Educational Institute of Western Greece, Patras, Greece
