Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters

  • Mouhamadou Ba
  • Sébastien Ferré
  • Mireille Ducassé
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9510)

Abstract

The heterogeneity of data and data formats in bioinformatics entails mismatches between the inputs and outputs of different services, making it difficult to compose them into workflows. To reduce those mismatches, bioinformatics platforms provide ad hoc converters, called shims. When shims are written by hand, they are time-consuming to develop and cannot anticipate all needs. When shims are generated automatically, they miss some transformations, for example data composition from multiple parts or parallel conversion of list elements.

This article proposes to systematically detect convertibility from output types to input types. Convertibility detection relies on a rule system based on abstract types, close to XML Schema. Types make it possible to abstract data while precisely accounting for their composite structure. Detection is accompanied by the automatic generation of converters between input and output XML data. We show the applicability of our approach by abstracting concrete bioinformatics types (e.g., complex biosequences) for a number of bioinformatics services (e.g., blast). We illustrate how our automatically generated converters help to resolve data mismatches when composing workflows. We conducted an experiment on bioinformatics services and datatypes, using an implementation of our approach, as well as a survey with domain experts. The detected convertibilities and the produced converters were validated as relevant from a biological point of view. Furthermore, the automatically produced graph of potentially compatible services exhibited higher connectivity than with the ad hoc approaches. Indeed, the experts discovered previously unknown possible connections.
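To make the idea of rule-based convertibility detection more concrete, here is a minimal Python sketch. It is not the authors' rule system: the type constructors (Prim, ListOf, Record), the four rules, and the field names in the usage example are all assumptions chosen for illustration. It only shows how detecting convertibility between two structured types can at the same time yield a converter, including element-wise conversion of lists and the construction of a target record from parts of a source record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prim:
    """A primitive type, identified by name (hypothetical abstraction)."""
    name: str


@dataclass(frozen=True)
class ListOf:
    """A homogeneous list type."""
    elem: object


@dataclass(frozen=True)
class Record:
    """A composite type given as a tuple of (field_name, type) pairs."""
    fields: tuple


def converter(src, dst):
    """Return a function converting src values to dst values, or None.

    Illustrative rules only (assumed, not taken from the chapter):
      1. identical primitive types convert by identity;
      2. a list converts if its elements do (element-wise conversion);
      3. a record converts to a record if every target field can be
         built from the source field of the same name;
      4. a record converts to another type if one of its fields does.
    """
    if isinstance(src, Prim) and isinstance(dst, Prim):
        return (lambda v: v) if src == dst else None
    if isinstance(src, ListOf) and isinstance(dst, ListOf):
        conv = converter(src.elem, dst.elem)
        if conv is None:
            return None
        return lambda values: [conv(v) for v in values]
    if isinstance(src, Record) and isinstance(dst, Record):
        src_fields = dict(src.fields)
        parts = {}
        for name, ftype in dst.fields:
            if name not in src_fields:
                return None
            conv = converter(src_fields[name], ftype)
            if conv is None:
                return None
            parts[name] = conv
        return lambda rec: {n: c(rec[n]) for n, c in parts.items()}
    if isinstance(src, Record):
        # Projection: use the first field that is convertible to dst.
        for name, ftype in src.fields:
            conv = converter(ftype, dst)
            if conv is not None:
                return lambda rec, n=name, c=conv: c(rec[n])
    return None


# Usage with made-up types: a list of scored hits feeds a service that
# expects a list of plain (id, sequence) records.
string, number = Prim("string"), Prim("float")
hit = Record((("id", string), ("sequence", string), ("score", number)))
plain = Record((("id", string), ("sequence", string)))

to_plain_list = converter(ListOf(hit), ListOf(plain))
result = to_plain_list([{"id": "P1", "sequence": "MKV", "score": 42.0}])
```

In this sketch a result of None means that no rule applies, so the two types are reported as not convertible; any other result is a converter that can be inserted between the two services.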

Keywords

Biological Sequence; Output Type; Type Expression; Rule System; Primitive Type
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

We thank Olivier Collin, Yvan Le Bras, Olivier Dameron, Francois Moreews and Olivier Sallou for their expertise in bioinformatics services and workflows, as well as for enriching discussions.

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Mouhamadou Ba (1)
  • Sébastien Ferré (2)
  • Mireille Ducassé (1)
  1. IRISA/INSA Rennes, Rennes Cedex, France
  2. IRISA/Université de Rennes 1, Rennes Cedex, France