Creating Realistic Synthetic Incident Data

  • Nico Roedder
  • Paul Karaenke
  • Christof Weinhardt
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 258)

Abstract

Utilising the full flexibility of on-demand IT service provisioning requires in-depth knowledge of service performance; otherwise, cost reductions accompanied by increased availability cannot be achieved. Thus, IT service decision methods incorporating IT service incident data are required. However, many of these models cannot be evaluated in a satisfactory fashion due to the lack of real-world incident data. To address this problem, we identify the need for realistic synthetic incident data for IT services. We specify the composition of this incident data and propose a procedure enabling the creation of realistic synthetic incident data for IT services, allowing for a thorough evaluation of any formal decision model that relies on such data sources.

Keywords

On-demand services · Cloud computing · Simulation · Evaluation · Decision support

1 Introduction

When services fail, service management tends to be interested in answering three questions. First: How do we fix it fast? Second: Who is responsible? And third: How do we stop it from happening again? It is a generally accepted fact that the most significant failures in information technology (IT) environments (as in other industry environments) are due to human error [1, 2, 3]. Famous examples range from the June 29, 2001 NASDAQ integrity failure, caused during the routine testing of a development system by an administrator, to more recent service disruptions such as the April 29, 2011 Amazon Web Services (AWS) 47 h downtime that started with an incorrectly performed network traffic shift by a network technician [4]. So what should be the consequence of this perception, other than a twist on “What can go wrong will go wrong”? Failures are always an interaction of inchoate elements where even a small error can lead to a disaster, and it is hardly possible to foresee all risks, even for the most formidable group of experts. With computational power ever increasing and analytical methods for data streams becoming better and better, there is a soaring amount of research that deals with the analysis of incident and monitoring data to improve stability on the one hand and to support IT service decisions on the other.

The evaluation of these new models, however, is very cumbersome, as there is hardly any usable real-world IT service incident or monitoring data available. A lot of research in cloud computing and decision support systems relies on a series of case studies or is tailored specifically to a scenario with a project partner who is (understandably) not willing to publish their service incident data. Other available data, e.g. from cloud providers like Amazon Web Services or Salesforce, is aggregated in a way that makes it unusable for most cases. Consequently, there is a need for the creation of synthetic incident data.

We thus state the following: it should be the aim of any researcher requiring service monitoring or incident data to validate their work with real-world data. Unfortunately, this data is very hard to come by, or, to be more precise, almost impossible to come by with the properties necessary to make it comparable.

The goal of this research is the identification of characteristics of a procedure to create realistic, comparable and reproducible incident data to validate formal models from the realm of service science research.

This work is structured as follows: In Sect. 2 we analyse related works. Section 3 summarises the requirements for realistic incident data and proclaims its characteristics. We conclude our work in Sect. 4.

2 Literature Review

As stated in the previous section, it should be the aim of a researcher to validate their work with real-world data if there is any way to do so. Because of this, research dealing with the creation of artificial incident data is sparse. Nevertheless, there has been significant research on identifying the failure patterns of service incidents. Franke has analysed empirical data sets and concluded that the Weibull distribution and the log-normal distribution are well suited for fitting such data [5, 6].
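Such a fit is straightforward to reproduce. The following is a minimal sketch (ours, not taken from the cited works) of a maximum-likelihood log-normal fit to a sample of recovery durations; it relies on the fact that the logarithm of a log-normal variate is normally distributed, so the fit reduces to the mean and standard deviation of the logged data:

```python
import math
import statistics

def fit_lognormal(durations):
    """Maximum-likelihood fit of a log-normal distribution to positive
    outage durations. Since the log of a log-normal sample is normal,
    the MLE of (mu, sigma) is the mean and population standard
    deviation of the logged observations."""
    logs = [math.log(d) for d in durations]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)  # MLE uses the population stdev
    return mu, sigma
```

Applied to simulated outage durations, the recovered parameters converge to the generating ones as the sample size grows.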

When exploring the connection between business impact costs and service incidents, Kieninger et al. found the Beta distribution to fit their empirical data sets [7, 8]. Their focus is on finding the relation between these incidents and business costs.

Google researchers have conducted analyses on the failure of hard drives in their data centres [9]. However, they restrict themselves to presenting their findings without fitting them to probability distributions. Judging from these results, it can be assumed that disk failures are approximately normally distributed.

While these findings are very interesting indeed, their aim is never to reproduce the observed incident patterns for future experiments and simulation studies. They should, however, serve as a basis when deciding on the correct distributions, or rather the incident patterns, that should be analysed.

3 Model Description

We first identify reasons why real-world monitoring data is insufficient or too hard to come by when dealing with decisions in IT service settings. The subsection thereafter comprises the characteristics necessary for simulated services.

3.1 The Need for Synthetic Incident Data

In this section we identify why incident data needs to be synthetic when validating decision models that rely on IT service incident and monitoring data.

Disposal and Aggregation. Most service providers dispose of their monitoring and incident data as soon as it is no longer needed for contractual obligations or internal analysis. In most cases, fine-grained monitoring data is deleted after a set time period and only aggregated data is stored over a longer course of time. Smaller service providers, however, might not even keep this aggregated data.

Service Changes. In realistic settings IT infrastructures change over the course of time. A provider might increase computing power, change other hardware components or improve/change the software running on the infrastructure. This makes fair comparisons impossible.

Different Parameters. Besides their functionality, IT services have certain parameters that are non-functional (e.g. availability). Comparisons of these parameters can only be conducted by extracting them to a common format (e.g. WS-Agreement [10]) and reducing the set of parameters to those available for all services.

Segment Lengths. Depending on what kind of service is monitored, the time segment length might differ significantly. Some services are monitored on a millisecond basis, while the availability of other services is tested once an hour. This, too, makes comparisons between services questionable.

These limitations are most prominent when dealing with real-world incident data and create the need for synthetic data. For this data to be realistic and of use in IT service decision scenarios, a series of characteristics has to be simulated. These are listed in the following section.

3.2 Service Characteristics

It is assumed that IT services have unique generic types i (e.g. storage or database service) and are offered by multiple service providers j. Specific services are the combination of an IT service type and a provider offering that service \(s_{ij}\). Each service has a price \(p_{ij}\) per unit of time t it is contracted.

IT service incidents have a frequency \(m_f(s_{ij},t)\) and an expected failure duration \(d_f(s_{ij},t)\). This is common practice in reliability engineering, where the frequency is often labelled the Rate Of Occurrence Of Failures (ROCOF) and the failure duration the Mean Time To Repair (MTTR) [11]. Both are combined into \(\lambda _{ij}^t := (m_f(s_{ij},t),d_f(s_{ij},t)) \in \varLambda \). Some services \(s_{ij}\) come with a penalty agreement \(\mu _{ij}(\cdot )\) in case service objectives are not met. The penalty that has to be paid by the service provider depends on \(\lambda _{ij}^t\).
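To make these definitions concrete, a specific service and its incident parameters could be represented as follows; the class and field names are our own illustrative choices, not notation from any standard library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Service:
    """A specific service s_ij with its incident parameters lambda_ij^t."""
    service_type: str   # generic type i, e.g. "storage"
    provider: str       # provider j
    price: float        # p_ij per contracted unit of time
    rocof: float        # m_f(s_ij, t): expected failures per unit of time
    mttr: float         # d_f(s_ij, t): expected failure duration

    def expected_downtime(self, t: float) -> float:
        """Expected cumulative downtime over t units of time."""
        return self.rocof * t * self.mttr
```

A service failing on average 0.1 times per day with a mean repair time of 2 h would then accumulate an expected 20 h of downtime over 100 days.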

For incident data to be realistic in a service decision scenario, the components introduced above have to be simulated.

Pricing Strategy. IT services tend to be priced in tiers. Providers offer, e.g., gold and platinum plans, where the platinum plan is significantly more expensive than the gold plan and offers a higher quality. Additionally, usage-based pricing, performance-based pricing, user-based pricing and flat pricing should be implemented [12, 13].
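A tiered plan combined with usage-based pricing can be sketched as follows; all plan names and figures are invented for illustration:

```python
def tier_price(tier: str, usage: float) -> float:
    """Illustrative tiered plus usage-based pricing: each tier has a
    flat base fee and a per-unit rate (all figures are invented)."""
    tiers = {
        "gold":     {"base": 100.0, "per_unit": 0.05},
        "platinum": {"base": 250.0, "per_unit": 0.03},
    }
    plan = tiers[tier]
    return plan["base"] + plan["per_unit"] * usage
```

With such a scheme, the more expensive tier only pays off above a break-even usage level, which is exactly the kind of trade-off a decision model would weigh against incident behaviour.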

Penalty Agreements. Public cloud computing providers in particular offer no penalty payments when a service is unavailable. Traditional (outsourcing) service providers, however, are contractually obligated to pay a fee if their service is not usable. In practice, only a limited set of penalty functions is encountered, and these are generally capped by an upper bound [14, 15].
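One such capped penalty function can be sketched as follows; the agreed availability, penalty rate and cap are illustrative values of our own, not figures from [14, 15]:

```python
def penalty(availability: float, agreed: float = 0.995,
            rate: float = 1000.0, cap: float = 5000.0) -> float:
    """Linear penalty per percentage point of availability below the
    agreed level, limited by an upper bound (illustrative parameters)."""
    shortfall_pct = max(0.0, agreed - availability) * 100
    return min(cap, rate * shortfall_pct)
```

The upper bound means a catastrophic month costs the provider no more than the cap, which is precisely why penalty agreements alone do not fully compensate the customer's business impact.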

Service Failures. It is assumed that services fail with certain probabilities that can be approximated through failure distributions. This is a well-established fact for systems in reliability engineering research [16, 17], and it is also valid across the wide range of different IT services, whether considering human error [2], hardware failures [9] or other unanticipated failures [6, 8].
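Putting the distributional assumptions together, one synthetic incident trace could be generated as follows, with Weibull-distributed times between failures (in line with the fits reported in [5, 6]) and log-normal outage durations; all parameter values are placeholders that would in practice come from fits to empirical data:

```python
import random

def simulate_incidents(horizon: float, weibull_scale: float,
                       weibull_shape: float, log_mu: float,
                       log_sigma: float, seed: int = 0):
    """Sketch of one synthetic incident trace: Weibull-distributed
    times between failures and log-normal outage durations. Returns
    a list of (start_time, duration) pairs within [0, horizon)."""
    rng = random.Random(seed)  # seeded for reproducible traces
    t, incidents = 0.0, []
    while True:
        t += rng.weibullvariate(weibull_scale, weibull_shape)
        if t >= horizon:
            break
        duration = rng.lognormvariate(log_mu, log_sigma)
        incidents.append((t, duration))
        t += duration  # the service is down while the incident lasts
    return incidents
```

Fixing the seed makes the generated traces reproducible, which is one of the comparability properties real-world incident data lacks.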

Each of the above characteristics is necessary for a simulative comparison of different IT services and their incident behaviour. The correct parametrisation of the different failure distributions is vital for meaningful incident time series.

4 Conclusion

In this work we have primarily shown that there is a need for synthetic incident data. Having evaluated a previously introduced decision method [18, 19, 20], we aim to improve an existing implementation significantly, provide a thorough survey of the generated data, and package our model to make it available for use. Hence, the work at hand is conducted as research in progress to gather further input from fellow researchers and enhance our model.

References

  1. Kirwan, B.: Human reliability assessment. Encyclopedia of Quantitative Risk Analysis and Assessment (2008)
  2. Reason, J.: Human error: models and management. BMJ 320(7237), 768–770 (2000)
  3. Ayachitula, N., Buco, M., Diao, Y., Maheswaran, S., Pavuluri, R., Shwartz, L., Ward, C.: IT service management automation - a hybrid methodology to integrate and orchestrate collaborative human centric and automation centric workflows. In: IEEE International Conference on Services Computing, SCC 2007, pp. 574–581. IEEE (2007)
  4. AWS-Team: Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region (2011). http://aws.amazon.com/message/65648/. Accessed 09 Nov 2015
  5. Franke, U., Holm, H., König, J.: The distribution of time to recovery of enterprise IT services. IEEE Trans. Reliab. 63(4), 858–867 (2014)
  6. Franke, U.: Optimal IT service availability: shorter outages, or fewer? IEEE Trans. Netw. Serv. Manag. 9(1), 22–33 (2012)
  7. Kieninger, A., Straeten, D., Kimbrough, S.O., Schmitz, B., Satzger, G.: Leveraging service incident analytics to determine cost-optimal service offers. In: Wirtschaftsinformatik, 64 (2013)
  8. Kieninger, A., Berghoff, F., Fromm, H., Satzger, G.: Simulation-based quantification of business impacts caused by service incidents. In: Falcão e Cunha, J., Snene, M., Nóvoa, H. (eds.) IESS 2013. LNBIP, vol. 143, pp. 170–185. Springer, Heidelberg (2013)
  9. Pinheiro, E., Weber, W.-D., Barroso, L.A.: Failure trends in a large disk drive population. In: FAST, vol. 7, pp. 17–23 (2007)
  10. Andrieux, A., Czajkowski, K., Dan, A., Keahey, K., Ludwig, H., Nakata, T., Pruyne, J., Rofrano, J., Tuecke, S., Xu, M.: Web Services Agreement Specification (WS-Agreement). Open Grid Forum (OGF) Proposed Recommendation GFD.107 (2007)
  11. Yeh, L.: The rate of occurrence of failures. J. Appl. Probab. 34(1), 234–247 (1997)
  12. Harmon, R., Demirkan, H., Hefley, B., Auseklis, N.: Pricing strategies for information technology services: a value-based approach. In: 42nd Hawaii International Conference on System Sciences, HICSS 2009, pp. 1–10. IEEE (2009)
  13. Kesidis, G., Das, A., de Veciana, G.: On flat-rate and usage-based pricing for tiered commodity internet services. In: 42nd Annual Conference on Information Sciences and Systems, CISS 2008, pp. 304–308 (2008)
  14. Moon, H.J., Chi, Y., Hacigumus, H.: SLA-aware profit optimization in cloud services via resource scheduling. In: 2010 6th World Congress on Services (SERVICES-1), pp. 152–153. IEEE (2010)
  15. Buco, M.J., Chang, R.N., Luan, L.Z., Ward, C., Wolf, J.L., Yu, P.S.: Utility computing SLA management based upon business objectives. IBM Syst. J. 43(1), 159–178 (2004)
  16. Elsayed, E.A.: Reliability Engineering, vol. 88. Wiley, Hoboken (2012)
  17. Kapur, K.C., Pecht, M.: Reliability Engineering. Wiley, Hoboken (2014)
  18. Roedder, N., Knapper, R., Martin, J.: Risk in modern IT service landscapes: towards a dynamic model. In: 2012 5th IEEE International Conference on Service-Oriented Computing and Applications (SOCA), pp. 1–4 (2012)
  19. Roedder, N., Karaenke, P., Knapper, R.: A risk-aware decision model for service sourcing (short paper). In: 2013 IEEE 6th International Conference on Service-Oriented Computing and Applications (SOCA), pp. 135–139 (2013)
  20. Roedder, N., Karaenke, P., Knapper, R., Weinhardt, C.: Decision-making based on incident data analysis. In: 2014 IEEE 16th Conference on Business Informatics (CBI), vol. 1, pp. 46–53 (2014)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Nico Roedder, Information Process Engineering, FZI Research Center for Information Technology, Karlsruhe, Germany
  2. Paul Karaenke, Department of Informatics, TU München, Garching, Germany
  3. Christof Weinhardt, Institute of Information Systems and Marketing, Karlsruhe Institute of Technology, Karlsruhe, Germany