Athena: Text Mining Based Discovery of Scientific Workflows in Disperse Repositories

  • Conference paper
Resource Discovery (RED 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6799)

Abstract

Scientific workflows are abstractions used to model and execute in silico scientific experiments. They are key resources for scientists and are enacted and managed by engines called Scientific Workflow Management Systems (SWfMS). Each SWfMS has its own workflow language. This heterogeneity of languages and formats makes it hard for scientists to search for or discover workflows in distributed repositories for reuse. The existing workflows in these repositories can be used to identify and construct families of workflows (clusters) that aim at a particular goal. However, it is hard to compare the structure of these workflows, since they are modeled in different formats. An alternative is to compare workflow metadata, such as the natural language descriptions usually found in workflow repositories, instead of comparing workflow structure. In this scenario, we expect that classical text mining techniques can cluster a set of workflows into families, allowing scientists to find and reuse existing workflows, which may decrease the complexity of modeling a new experiment. This paper presents Athena, a cloud-based approach to support workflow clustering from disperse repositories using their natural language descriptions, thus integrating these repositories and making it easier to search for and reuse workflows.
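
The idea of grouping workflows by their natural language descriptions, rather than by their structure, can be illustrated with a minimal sketch. This is not Athena's implementation: the abstract does not say which text mining techniques are used, so the choices here (stop-word removal, TF-IDF weighting, cosine similarity, and single-link grouping above a threshold) are assumptions picked as common classical techniques. The stop-word list, the `threshold` value, the function names, and the sample descriptions are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "and", "as", "the", "of", "to", "in", "for", "from", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per description."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of descriptions containing each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF so that no term receives a zero or negative weight.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster(docs, threshold=0.2):
    """Group descriptions whose similarity reaches the threshold (single link)."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))  # union-find forest over description indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical workflow descriptions, as might be collected from different repositories.
DESCRIPTIONS = [
    "Align protein sequences with BLAST and build a phylogenetic tree",
    "Build phylogenetic tree from aligned protein sequences",
    "Render volume visualization of climate simulation data",
    "Visualization of ocean climate simulation output as volume rendering",
]

print(cluster(DESCRIPTIONS))
```

With these four sample descriptions the script prints `[[0, 1], [2, 3]]`: the two phylogeny workflows form one family and the two visualization workflows another, regardless of which workflow language each was written in. The sketch does no stemming, so "align" and "aligned" count as different terms; a fuller pipeline would likely normalize such variants.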

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Costa, F., de Oliveira, D., Ogasawara, E., Lima, A.A.B., Mattoso, M. (2012). Athena: Text Mining Based Discovery of Scientific Workflows in Disperse Repositories. In: Lacroix, Z., Vidal, M.E. (eds) Resource Discovery. RED 2010. Lecture Notes in Computer Science, vol 6799. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27392-6_8

  • DOI: https://doi.org/10.1007/978-3-642-27392-6_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27391-9

  • Online ISBN: 978-3-642-27392-6

  • eBook Packages: Computer Science, Computer Science (R0)
