A First Study on Clustering Collections of Workflow Graphs

  • Emanuele Santos
  • Lauro Lins
  • James P. Ahrens
  • Juliana Freire
  • Cláudio T. Silva
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5272)


As workflow systems become more widely used, the number of workflows and the volume of provenance they generate have grown considerably. New tools and infrastructure are needed to allow users to interact with, reason about, and re-use this information. In this paper, we explore the use of clustering techniques to organize large collections of workflow and provenance graphs. We propose two different representations for these graphs and present an experimental evaluation on a collection of 1,700 workflow graphs, in which we study the trade-offs of these representations and the effectiveness of alternative clustering techniques.
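The abstract does not say which two graph representations or which clustering techniques the paper uses, so the following is only an illustrative sketch of the general idea of clustering workflows by similarity. It assumes, hypothetically, that each workflow is reduced to a bag of module labels, that similarity is the cosine of the resulting count vectors, and that workflows are grouped by a simple single-link threshold. The workflow names, module labels, and the `min_sim` parameter are invented for illustration and are not taken from the paper.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy workflows: each is a list of module labels (assumed data,
# not drawn from the paper's collection).
workflows = {
    "wf_a": ["ReadFile", "Isosurface", "Render"],
    "wf_b": ["ReadFile", "Isosurface", "Smooth", "Render"],
    "wf_c": ["HttpFetch", "ParseXML", "Filter"],
    "wf_d": ["HttpFetch", "ParseXML", "Sort", "Filter"],
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse label-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def threshold_clusters(vectors, min_sim):
    """Single-link grouping: join two workflows when similarity >= min_sim."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if cosine(vectors[u], vectors[v]) >= min_sim:
                parent[find(u)] = find(v)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Bag-of-modules representation (an assumption made for this sketch).
vectors = {name: Counter(mods) for name, mods in workflows.items()}
clusters = threshold_clusters(vectors, min_sim=0.5)
print(clusters)
```

A bag-of-labels vector ignores how modules are connected, so a structure-aware representation could separate workflows that this sketch would merge. That kind of difference is presumably part of the trade-off the abstract refers to.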



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Emanuele Santos (1)
  • Lauro Lins (1)
  • James P. Ahrens (3)
  • Juliana Freire (2)
  • Cláudio T. Silva (1, 2)

  1. Scientific Computing and Imaging Institute, University of Utah, USA
  2. School of Computing, University of Utah, USA
  3. Los Alamos National Lab, USA
