Athena: Text Mining Based Discovery of Scientific Workflows in Disperse Repositories

  • Flavio Costa
  • Daniel de Oliveira
  • Eduardo Ogasawara
  • Alexandre A. B. Lima
  • Marta Mattoso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6799)

Abstract

Scientific workflows are abstractions used to model and execute in silico scientific experiments. They are key resources for scientists and are enacted and managed by engines called Scientific Workflow Management Systems (SWfMS). Each SWfMS has its own workflow language. This heterogeneity of languages and formats makes it difficult for scientists to search for or discover workflows in distributed repositories for reuse. The existing workflows in these repositories can be used to identify and construct families of workflows (clusters) that share a particular goal. However, it is hard to compare the structure of these workflows, since they are modeled in different formats. An alternative is to compare workflow metadata, such as the natural language descriptions usually found in workflow repositories, instead of comparing workflow structure. In this scenario, we expect that classical text mining techniques can cluster a set of workflows into families, allowing scientists to find and reuse existing workflows, which may reduce the complexity of modeling a new experiment. This paper presents Athena, a cloud-based approach to support workflow clustering from disperse repositories using their natural language descriptions, thus integrating these repositories and providing an easier way to search for and reuse workflows.
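
To make the idea of clustering workflows by their descriptions concrete, the following is a minimal sketch in Python. It is not Athena's implementation: the abstract only says "classical text mining techniques", so the TF-IDF weighting, cosine similarity, single-link grouping, similarity threshold, stop-word list, and sample descriptions below are all illustrative assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "for", "with", "from", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(descriptions):
    """Return one sparse TF-IDF vector (dict: term -> weight) per description."""
    docs = [Counter(tokenize(d)) for d in descriptions]
    n = len(docs)
    # Document frequency: in how many descriptions each term occurs.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({t: (c / total) * math.log(n / df[t]) for t, c in doc.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(descriptions, threshold=0.2):
    """Group descriptions into families: any pair with similarity >= threshold is linked."""
    vectors = tfidf_vectors(descriptions)
    parent = list(range(len(vectors)))  # union-find forest over description indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    families = {}
    for i in range(len(vectors)):
        families.setdefault(find(i), []).append(i)
    return sorted(families.values())


# Invented descriptions standing in for repository metadata.
descriptions = [
    "Align protein sequences and build a phylogenetic tree",
    "Build phylogenetic tree from aligned protein sequences",
    "Render volume visualization of climate simulation data",
    "Visualization of climate simulation output as volume rendering",
]

print(cluster(descriptions))  # [[0, 1], [2, 3]]
```

Because only the descriptions are compared, the four workflows could come from different SWfMS with incompatible workflow languages and still be grouped into two families. A realistic pipeline would likely add stemming and a fuller stop-word list, and the threshold would need tuning on real repository data.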



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Flavio Costa (1)
  • Daniel de Oliveira (1)
  • Eduardo Ogasawara (1, 2)
  • Alexandre A. B. Lima (1)
  • Marta Mattoso (1)

  1. COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
  2. Federal Center of Technological Education (CEFET/RJ), Rio de Janeiro, Brazil
