Advertisement

Modeling Job Arrivals in a Data-Intensive Grid

  • Hui Li
  • Michael Muskulus
  • Lex Wolters
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4376)

Abstract

In this paper we present an initial analysis of job arrivals in a production data-intensive Grid and investigate several traffic models to characterize the interarrival time processes. Our analysis focuses on the heavy-tail behavior and autocorrelation structures, and the modeling is carried out at three different levels: Grid, Virtual Organization (VO), and region. A set of m-state Markov modulated Poisson processes (MMPP) is investigated, while Poisson processes and hyperexponential renewal processes are evaluated for comparison studies. We apply the transportation distance metric from dynamical systems theory to further characterize the differences between the data trace and the simulated time series, and estimate errors by bootstrapping. The experimental results show that MMPPs with a certain number of states are successful to a certain extent in simulating the job traffic at different levels, fitting both the interarrival time distribution and the autocorrelation function. However, MMPPs are not able to match the autocorrelations for certain VOs, in which strong deterministic semi-periodic patterns are observed. These patterns are further characterized using different representations. Future work is needed to model both deterministic and stochastic components in order to better capture the correlation structure in the series.

Keywords

Renewal Process Interarrival Time Virtual Organization Data Trace Hurst Parameter 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Asmussen, S., Nerman, O., Olsson, M.: Fitting phase-type distribution via the EM algorithm. Scand. J. Statist. 23, 419–441 (1996)zbMATHGoogle Scholar
  2. 2.
    Barabasi, A.-L.: The origin of bursts and heavy tails in human dynamics. Nature 435, 207–211 (2005)CrossRefGoogle Scholar
  3. 3.
    Basu, S., Foufoula-Georgiou, E.: Detection of nonlinearity and chaoticity in time series using the transportation distance function. Physics Letters A 301, 413–423 (2002)CrossRefMathSciNetzbMATHGoogle Scholar
  4. 4.
    Beran, J.: Statistics for Long Memory Processes. Chapman and Hall, New York (1994)zbMATHGoogle Scholar
  5. 5.
    Brémaud, P.: Markov Chains. Gibbs Fields, Monte Carlo Simulation, and Queues. Springer, New York (2001)Google Scholar
  6. 6.
    Chapin, S.J., et al.: Benchmarks and standards for the evaluation of parallel job schedulers. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1999, IPPS-WS 1999, and SPDP-WS 1999. LNCS, vol. 1659, pp. 67–90. Springer, Heidelberg (1999)CrossRefGoogle Scholar
  7. 7.
    Cirne, W., Berman, F.: A comprehensive model of the supercomputer workload. In: IEEE 4th Annual Workshop on Workload Characterization, IEEE Computer Society Press, Los Alamitos (2001)Google Scholar
  8. 8.
    Davison, A.C., Hinkley, D.V.: Bootstrap Methods and Their Applications. Cambridge University Press, Cambridge (1997)Google Scholar
  9. 9.
    Downey, A.B., Feitelson, D.G.: The elusive goal of workload characterization. Performance Evaluation Review 26(4), 14–29 (1999)CrossRefGoogle Scholar
  10. 10.
    Dumitrescu, C., Raicu, I., Foster, I.: DI-GRUBER: A Distributed Approach to Grid Resource Brokering. In: Proceedings of Supercomputing ’05, ACM Press, New York (2005)Google Scholar
  11. 11.
    Workload Management in EGEE and gLite. http://lxmi.mi.infn.it/egee-jra1-wm/
  12. 12.
  13. 13.
    Feitelson, D.G.: Workload modeling for performance evaluation. In: Calzarossa, M.C., Tucci, S. (eds.) Performance 2002. LNCS, vol. 2459, pp. 114–141. Springer, Heidelberg (2002)CrossRefGoogle Scholar
  14. 14.
    Fischer, W., Meier-Hellstern, K.: The Markov-modulated Poisson process (MMPP) cookbook. Performance Evaluation 18(2), 149–171 (1993)CrossRefMathSciNetzbMATHGoogle Scholar
  15. 15.
    Li, H.: Tools for Workload Modeling in the Grid. http://www.liacs.nl/home/hli/gwm/
  16. 16.
    Heffes, H., Lucantoni, D.M.: A Markov modulated characterization of packetized voice and data traffic and related statistical multiplexer performance. IEEE J. on Sel. Areas in Comm. 4(6), 856–868 (1986)CrossRefGoogle Scholar
  17. 17.
    Jagerman, D.L., Melamed, B., Willinger, W.: Stochastic modeling of traffic processes. In: Frontiers in Queueing: Models, Methods and Problems, CRC Press, Boca Raton (1996)Google Scholar
  18. 18.
    Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis. Cambridge University Press, Cambridge (2003)Google Scholar
  19. 19.
    Karagiannis, T., Faloutsos, M.: SELFIS: A Tool For Self-Similarity and Long-Range Dependence Analysis. In: 1st Workshop on Fractals and Self-Similarity in Data Mining: Issues and Approaches, Canada (2002)Google Scholar
  20. 20.
    Künsch, H.R.: The jackknife and bootstrap for general stationary observations. The Annals of Statistics 17, 1217–1241 (1989)CrossRefMathSciNetzbMATHGoogle Scholar
  21. 21.
    The Worldwide LHC Computing Grid project. http://lcg.web.cern.ch/LCG/
  22. 22.
    Leland, W., et al.: On the self-similar nature of ethernet traffic (extended version). IEEE/ACM Trans. on Networking 2(1), 1–15 (1994)CrossRefGoogle Scholar
  23. 23.
    Li, H., Groep, D., Wolters, L.: Workload Characteristics of a Multi-cluster Supercomputer. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2004. LNCS, vol. 3277, pp. 176–193. Springer, Heidelberg (2005)Google Scholar
  24. 24.
  25. 25.
    Lublin, U., Feitelson, D.G.: The workload on parallel supercomputers: modeling the characteristics of rigid jobs. J. Para. and Dist. Comput. 63(11), 1105–1122 (2003)CrossRefzbMATHGoogle Scholar
  26. 26.
    Löbel, A.: Solving large-scale real-world minimum-cost flow problems by a network simplex method. Technical Report SC 96-7, Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB) (February 1996), Software available at http://www.zib.de/Optimization/Software/Mcf/
  27. 27.
    Medernach, E.: Workload analysis of a cluster in a Grid environment. In: Feitelson, D.G., et al. (eds.) JSSPP 2005. LNCS, vol. 3834, Springer, Heidelberg (2005)CrossRefGoogle Scholar
  28. 28.
    Moeckel, R., Murray, B.: Measuring the distance between timeseries. Physica D 102, 187–194 (1997)CrossRefMathSciNetzbMATHGoogle Scholar
  29. 29.
    Muskulus, M., et al.: Estimating differences between probability densities and time series (In preparation), Software available at http://www.math.leidenuniv.nl/~muskulus/
  30. 30.
    Neuts, M.F.: Structured Stochastic Matrices of M/G/1-type and their Applications. Marcel Dekker, New York (1989)zbMATHGoogle Scholar
  31. 31.
    Nabrzyski, J., Schopf, J.M., Weglarz, J.: Grid Resource Management: State of the Art and Future Trends. International Series in Operations Research & Management Science. Springer, Heidelberg (2003)Google Scholar
  32. 32.
    Politis, D.N.: The Impact of Bootstrap Methods on Time Series Analysis. Statistical Science 18(2), 219–230 (2003)CrossRefMathSciNetGoogle Scholar
  33. 33.
  34. 34.
    Riska, A.: Aggregate Matrix-analytic Techniques and their Applications. PhD thesis, Department of Computer Science, College of William and Mary (2002)Google Scholar
  35. 35.
    Roberts, W.J.J., Ephraim, Y., Dieguez, E.: On Ryden’s EM algorithm for estimating MMPP’s. IEEE Sig. Proc. Let. (to appear)Google Scholar
  36. 36.
    The LCG Real Time Monitor. http://gridportal.hep.ph.ic.ac.uk/rtm/
  37. 37.
    Ryden, T.: Parameter estimation for Markov modulated Poisson processes. Communications in Statistics - Stochastic Models 10(4), 795–829 (1994)CrossRefMathSciNetzbMATHGoogle Scholar
  38. 38.
    Ryden, T.: An EM algorithm for estimation in Markov-modulated Poisson processes. Comp. Stat. and Data Analysis 21, 431–447 (1996)CrossRefMathSciNetzbMATHGoogle Scholar
  39. 39.
    Schrijver, A.: Theory of Linear and Integer Programming. Wiley, Chichester (1998)zbMATHGoogle Scholar
  40. 40.
    Scott, S.L.: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. J. Am. Stat. Assoc. 97(457), 337–351 (2002)CrossRefzbMATHGoogle Scholar
  41. 41.
    Song, B., Ernemann, C., Yahyapour, R.: Parallel Computer Workload Modeling with Markov Chains. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2004. LNCS, vol. 3277, pp. 47–62. Springer, Heidelberg (2005)Google Scholar
  42. 42.
    Squillante, M.S., Yao, D.D., Zhang, L.: The impact of job arrival patterns on parallel scheduling. ACM SIGMETRICS Performance Evaluation Review 26(4), 52–59 (1999)CrossRefGoogle Scholar
  43. 43.
    Takeuchi, J.I., Yamanishi, K.: A Unified Framework for Detecting Outliers and Change Points from Time Series. IEEE Transactions on Knowledge and Data Engineering 18(4), 482–492 (2006)CrossRefGoogle Scholar
  44. 44.
    Willinger, W., Taqqu, M.S., Erramilli, A.: A Bibliographical Guide to Self-Similar Traffic and Performance Modeling for Modern High-Speed Networks. In: Stochastic Networks: Theory and Applications, pp. 339–366. Oxford University Press, Oxford (1996)Google Scholar

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Hui Li
    • 1
  • Michael Muskulus
    • 2
  • Lex Wolters
    • 1
  1. 1.Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Niels Bohrweg 1, 2333 CA LeidenThe Netherlands
  2. 2.Mathematical Institute, Leiden University, Niels Bohrweg 1, 2333 CA LeidenThe Netherlands

Personalised recommendations