
An architecture framework for enterprise IT service availability analysis


Abstract

This paper presents an integrated enterprise architecture framework for qualitative and quantitative modeling and assessment of enterprise IT service availability. While most previous work has focused either on formal availability methods such as fault trees or on qualitative methods such as maturity models, this framework offers a combination. First, a modeling and assessment framework is described. In addition to metamodel classes, relationships and attributes suitable for availability modeling, the framework features a formal computational model written in a probabilistic version of the object constraint language. The model is based on 14 systemic factors impacting service availability and also accounts for the structural features of the service architecture. Second, the framework is empirically tested in nine enterprise information system case studies. Based on an initial availability baseline and the annual evolution of the 14 factors of the model, annual availability predictions are made and compared with the actual outcomes as reported in SLA reports and system logs. The practical usefulness of the method is discussed based on the outcomes of a workshop conducted with the participating enterprises, and some directions for future research are offered.


Notes

  1. http://www.ics.kth.se/eaat.

  2. http://www.ics.kth.se/eaat.

References

  1. Object constraint language, version 2.2. Technical report, Object Management Group, OMG, February 2010. http://www.omg.org/spec/OCL/2.2. OMG Document Number: formal/2010-02-01

  2. Aier, S., Buckl, S., Franke, U., Gleichauf, B., Johnson, P., Närman, P., Schweda, C.M., Ullberg, J.: A survival analysis of application life spans based on enterprise architecture models. In: 3rd International Workshop on Enterprise Modelling and Information Systems Architectures, pp. 141–154 (2009)

  3. Askåker, J., Kulle, M.: Miljardaffärer gick förlorade. Dagens Industri, 4 June 2008, pp. 6–7 (in Swedish)

  4. Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput. 1(1), 11–33 (2004)


  5. Bernardi, S., Merseguer, J.: A UML profile for dependability analysis of real-time embedded systems. In: Proceedings of the 6th International Workshop on Software and Performance, pp. 115–124. ACM, New York (2007)

  6. Bocciarelli, P., D’Ambrogio, A.: A model-driven method for describing and predicting the reliability of composite services. Softw. Syst. Model. 10, 265–280 (2011). ISSN: 1619-1366. doi:10.1007/s10270-010-0150-3

  7. Buschle, M., Ullberg, J., Franke, U., Lagerström, R., Sommestad, T.: A tool for enterprise architecture analysis using the PRM formalism. In: CAiSE 2010 Forum Post-Proceedings, Oct 2010

  8. Buschle, M., Holm, H., Sommestad, T., Ekstedt, M., Shahzad, K.: A tool for automatic enterprise architecture modeling. In: Proceedings of the CAiSE Forum 2011, pp. 25–32 (2011)

  9. Charette, R.: Bank of America Suffered Yet Another Online Banking Outage. http://spectrum.ieee.org/riskfactor/telecom/internet/bank-of-america-suffered-yet-another-online-banking-outage. Accessed Jan 2011. IEEE Spectrum “Risk factor” blog

  10. Cortellessa, V., Pompei, A.: Towards a UML profile for qos: a contribution in the reliability domain. In: ACM SIGSOFT Software Engineering Notes, vol. 29, pp. 197–206. ACM, New York (2004)

  11. Cortellessa, V., Singh, H., Cukic, B.: Early reliability assessment of UML based software models. In: Proceedings of the 3rd International Workshop on Software and Performance, WOSP ’02, pp. 302–309. ACM, New York (2002). ISBN: 1-58113-563-7. http://doi.acm.org/10.1145/584369.584415

  12. Forbus, K.D.: Qualitative modeling. In: van Harmelen, F., Lifschitz, V., Porter, B. (eds.) Handbook of Knowledge Representation. Foundations of Artificial Intelligence, vol. 3, pp. 361–393. Elsevier, Amsterdam (2008). doi:10.1016/S1574-6526(07)03009-X. http://www.sciencedirect.com/science/article/pii/S157465260703009X

  13. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly 13(3), 319–340 (1989)

  14. Durkee, D.: Why cloud computing will never be free. Queue 8(4), 20 (2010)


  15. Ekstedt, M., Franke, U., Johnson, P., Lagerström, R., Sommestad, T., Ullberg, J., Buschle, M.: A tool for enterprise architecture analysis of maintainability. In: Proceedings of the 13th European Conference on Software Maintenance and Reengineering (2009)

  16. Ericson, C.: Fault tree analysis—a history. In: 17th International System Safety Conference (1999)

  17. Franke, U., Johnson, P., König, J., von Würtemberg, L.V.: Availability of enterprise IT systems: an expert-based Bayesian framework. Softw. Qual. J. 20, 369–394 (2011). ISSN: 0963-9314. doi:10.1007/s11219-011-9141-z

  18. Gray, J.: Why Do Computers Stop and What Can Be Done About It? Technical report. Tandem Computers Inc., Cupertino (1985)

  19. Henrion, M.: Some practical issues in constructing belief networks. In: Kanal, L.N., Levitt, T.S., Lemmer, J.F. (eds.) Uncertainty in Artificial Intelligence 3, pp. 161–173. Elsevier Science Publishers B.V, North Holland (1989)

  20. Hillier, F.S., Lieberman, G.J.: Introduction to Operations Research, 8th edn. McGraw-Hill, New York (2005)

  21. Hochstein, A., Tamm, G., Brenner, W.: Service-oriented it management: benefit, cost and success factors. In: Proceedings of the 13th European Conference on Information Systems (ECIS), Regensburg (2005)

  22. Holub, E.: Embracing ITSM to Build a Customer Service Provider Culture in IT I&O. Technical Report, Gartner, Inc. (2009)

  23. IBM Global Services: Improving systems availability. Technical Report, IBM Global Services (1998)

  24. Immonen, A.: A method for predicting reliability and availability at the architecture level. In: Käköla, T., Duenas, J.C. (eds.) Software Product Lines, pp. 373–422. Springer, Berlin (2006). ISBN: 978-3-540-33253-4. doi:10.1007/978-3-540-33253-4_10

  25. Irani, Z., Themistocleous, M., Love, P.E.D.: The impact of enterprise application integration on information system lifecycles. Inf. Manag. 41(2), 177–187 (2003). ISSN: 0378-7206. doi:10.1016/S0378-7206(03)00046-6. http://www.sciencedirect.com/science/article/pii/S0378720603000466

  26. Johnson, P.: Enterprise Software System Integration: An Architectural Perspective. PhD thesis, Royal Institute of Technology (KTH) (2002)

  27. Johnson, P., Johansson, E., Sommestad, T., Ullberg, J.: A tool for enterprise architecture analysis. In: Proceedings of the 11th IEEE International Enterprise Computing Conference (EDOC 2007) (2007)

  28. Johnson, P., Lagerström, R., Närman, P., Simonsson, M.: Enterprise architecture analysis with extended influence diagrams. Inf. Syst. Front. 9(2) (2007)

  29. Johnson, P., Ullberg, J., Buschle, M., Shahzad, K., Franke, U.: P\(^2\)AMF: Predictive, Probabilistic Architecture Modeling Framework (2012, Submitted manuscript)

  30. Jordan, D., Evdemon, J., Alves, A., Arkin, A., Askary, S., Barreto, C., Bloch, B., Curbera, F., Ford, M., Goland, Y. et al.: Web services business process execution language version 2.0. Technical Report, OASIS (2007)

  31. Lankhorst, M.: Enterprise Architecture at Work: Modelling, Communication and Analysis. Springer, New York (2009)


  32. Lankhorst, M.M., Proper, H.A., Jonkers, H.: The architecture of the ArchiMate language. In: Enterprise, Business-Process and Information Systems Modeling, pp. 367–380 (2009)

  33. Leangsuksun, C., Song, H., Shen, L.: Reliability modeling using UML. In: Proceedings of the 2003 International Conference on Software Engineering Research and Practice (2003)

  34. Liang, Y.H.: Analyzing and forecasting the reliability for repairable systems using the time series decomposition method. Int. J. Qual. Reliabil. Manag. 28(3), 317–327 (2011). ISSN: 0265-671X


  35. Majzik, I., Pataricza, A., Bondavalli, A.: Stochastic dependability analysis of system architecture based on UML models. In: de Lemos, R., Gacek, C., Romanovsky, A. (eds.) Architecting Dependable Systems. Lecture Notes in Computer Science, vol. 2677, pp. 219–244. Springer, Berlin (2003). doi:10.1007/3-540-45177-3_10

  36. Malik, B., Scott, D.: How to Calculate the Cost of Continuously Available IT Services. Technical Report, Gartner, Inc. (2010)

  37. Marcus, E., Stern, H.: Blueprints for High Availability, 2nd edn. Wiley, Indianapolis (2003)

  38. Marrone, M., Kiessling, M., Kolbe, L.M.: Are we really innovating? An exploratory study on innovation management and service management. In: IEEE International Conference on Management of Innovation and Technology (ICMIT), 2010, pp. 378–383 (2010). doi:10.1109/ICMIT.2010.5492719

  39. Marrone, M., Kolbe, L.: Uncovering ITIL claims: IT executives' perception on benefits and business-IT alignment. Inf. Syst. E-Bus. Manag., pp. 1–18 (2010). ISSN: 1617-9846. doi:10.1007/s10257-010-0131-7

  40. Milanovic, N.: Models, Methods and Tools for Availability Assessment of IT-Services and Business Processes. Universitätsbibliothek, Habilitationsschrift (2010)

  41. Närman, P., Franke, U., König, J., Buschle, M., Ekstedt, M.: Enterprise architecture availability analysis using fault trees and stakeholder interviews. Enterp. Inf. Syst. (2011, to appear)

  42. Onisko, A., Druzdzel, M.J., Wasyluk, H.: Learning Bayesian network parameters from small data sets: application of Noisy-OR gates. Int. J. Approx. Reason. 27(2), 165–182 (2001). ISSN: 0888-613X. doi:10.1016/S0888-613X(01)00039-1

  43. Oppenheimer, D.: Why do internet services fail, and what can be done about it? In: Proceedings of USITS 03: 4th USENIX Symposium on Internet Technologies and Systems (2003)

  44. Österlind, M.: Validering av verktyget “Enterprise Architecture Analysis Tool”. Master’s thesis, Royal Institute of Technology (KTH) (2011)

  45. Pertet, S., Narasimhan, P.: Causes of failure in web applications. Technical Report, Parallel Data Laboratory, Carnegie Mellon University, CMU-PDL-05-109 (2005)

  46. Phelan, P., Prior, D.: Additional Tools for a World-Class ERP Infrastructure. Technical Report, Gartner, Inc. (2011)

  47. Rausand, M., Høyland, A.: System Reliability Theory: Models, Statistical Methods, and Applications, 2nd edn. Wiley, Hoboken (2004). http://www.ntnu.no/ross/srt

  48. Rodrigues, G.N.: A Model Driven Approach for Software Reliability Prediction. PhD thesis, University College London (2008)

  49. Sahner, R.A., Trivedi, K.S.: Reliability modeling using sharpe. IEEE Trans. Reliabil. 36(2), 186–193 (1987)


  50. Scott, D.: Benchmarking Your IT Service Availability Levels. Technical Report, Gartner, Inc. (2011)

  51. Singh, H., Cortellessa, V., Cukic, B., Gunel, E., Bharadwaj, V.: A Bayesian approach to reliability prediction and assessment of component based systems. In: Proceedings of the 12th International Symposium on Software Reliability Engineering (ISSRE 2001), pp. 12–21 (2001). doi:10.1109/ISSRE.2001.989454

  52. Sundkvist, F.: Efter haveriet: Tieto granskas. Computer Sweden (2012, in Swedish)

  53. Taylor, S., Cannon, D., Wheeldon, D.: Service Operation (ITIL). The Stationery Office, TSO (2007a). ISBN: 9780113310463

  54. Taylor, S., Case, G., Spalding, G.: Continual Service Improvement (ITIL). The Stationery Office, TSO (2007b). ISBN: 9780113310494

  55. Taylor, S., Iqbal, M., Nieves, M.: Service Strategy (ITIL). The Stationery Office, TSO (2007c). ISBN: 9780113310456

  56. Taylor, S., Lacy, S., Macfarlane, I.: Service Transition (ITIL). The Stationery Office, TSO (2007d). ISBN: 9780113310487

  57. Taylor, S., Lloyd, V., Rudd, C.: Service Design (ITIL). The Stationery Office, TSO (2007e). ISBN: 9780113310470

  58. The Open Group: ArchiMate 1.0 Specification. http://www.opengroup.org/archimate/doc/ts_archimate/ (2009)

  59. Ullberg, J., Franke, U., Buschle, M., Johnson, P.: A tool for interoperability analysis of enterprise architecture models using pi-OCL. In: Proceedings of the International Conference on Interoperability for Enterprise Software and Applications (I-ESA) (2010)

  60. Van Buuren, R., Jonkers, H., Iacob, M.E., Strating, P.: Composition of relations in enterprise architecture models. In: Graph Transformations, pp. 183–186 (2004)

  61. Varian, H.: System reliability and free riding. In: Camp, L., Lewis, S. (eds.) Economics of Information Security. Advances in Information Security, vol. 12, pp. 1–15. Springer US, New York (2004). ISBN: 978-1-4020-8090-6. doi:10.1007/1-4020-8090-5_1

  62. Wang, J.: Timed Petri nets: theory and application, vol. 39. Kluwer Academic Publishers, Norwell (1998)


  63. Zambon, E., Etalle, S., Wieringa, R., Hartel, P.: Model-based qualitative risk assessment for availability of IT infrastructures. Softw. Syst. Model., pp. 1–28 (2010). ISSN: 1619-1366. doi:10.1007/s10270-010-0166-8

  64. Zhang, X., Pham, H.: An analysis of factors affecting software reliability. J. Syst. Softw. 50(1), 43–56 (2000). ISSN: 0164-1212. doi:10.1016/S0164-1212(99)00075-8



Acknowledgments

The authors wish to thank Per Närman for valuable input on metamodels for availability analysis, Johan Ullberg for help on the P\(^2\)AMF language and for reviewing the whole manuscript, Nicholas Honeth for reviewing the whole manuscript, Khurram Shahzad whose conscientious and timely programming of the EA\(^2\)T tool was a prerequisite for this paper, Michael Mirbaha and Jakob Raderius for their valuable input on the ITIL operationalizations and the five enterprises that kindly allowed us to conduct the case studies. In addition, the comments of the three anonymous referees improved the paper.


Corresponding author

Correspondence to Ulrik Franke.


Communicated by Prof. Dr. Dorina Petriu and Dr. Jens Happe.

Appendices

Appendix A: ITIL operationalization of the Bayesian expert model

A major challenge in the use of the system-level model [17] is the operationalization of the 16 factors. In the expert survey, the factor descriptions were—deliberately—kept short and general. As the survey respondents were all selected based on academic publications, detailed specifications in terms of enterprise operating procedures, processes, and activities were not deemed appropriate. However, as we now turn to practical use, there is a need to offer unambiguous operationalizations of the factors. This is a prerequisite for being able to assess whether a company meets the “best practice” level or not.

Making “best practice” an unambiguous notion might seem futile. However, in the area of IT service management (ITSM), the IT infrastructure library (ITIL) [53–57] has become the de facto ITSM framework [21, 38, 39]. ITIL adoption is also a prescription offered to enterprises by influential consultancies such as Gartner [22].

To make the factors understandable and applicable to practitioners, they were translated into the ITIL language. This appendix explains the meaning of the causal factors, using ITIL as a frame of reference. During the case studies, the texts offered in this appendix were used to explain to practitioners how to think about the factors, to be able to assess their level of best practice.

To cast the factors into the ITIL language, the five ITIL volumes were studied, and the ITIL recommendations, processes, activities, and examples were mapped to the factors in Table 1. In so doing, care was taken to retain the original meaning of the factors, as first articulated in the expert survey. In order to make sure that the interpretation of ITIL was correct, two certified ITIL experts were—independently—asked to offer input and feedback. As a result, a number of changes were made so as to better express the factors in ITIL terminology. The biggest change made during this feedback phase was the merging of the factors physical environment and infrastructure redundancy into a single factor, and of data redundancy and storage architecture redundancy into another. Each pair of factors was deemed close to indistinguishable in ITIL wording. Thus, following the expert validation of the ITIL operationalization, the 16 survey factors were converted into 14, as illustrated in Table 1.

If two causal factors \(i\) and \(j\) in the leaky Noisy-OR model described by Eq. (3) are to be merged into a single factor \(k\), it is reasonable to require that the original model and the new, merged model make the same availability predictions in the semantically equivalent cases, i.e.,

$$\begin{aligned}&P(y | \bar{x}_1, \bar{x}_2, \ldots , x_i, \ldots , x_j, \ldots \bar{x}_n)\nonumber \\&\quad = P(y | \bar{x}_1, \bar{x}_2, \ldots , x_k, \ldots \bar{x}_n) \end{aligned}$$
(7)
$$\begin{aligned}&P(y | \bar{x}_1, \bar{x}_2, \ldots , \bar{x}_i, \ldots , \bar{x}_j, \ldots \bar{x}_n) \nonumber \\&\quad =P(y | \bar{x}_1, \bar{x}_2, \ldots , \bar{x}_k, \ldots \bar{x}_n) \end{aligned}$$
(8)

It follows from Eq. (3) that such a model-preserving merger requires the following equality to hold (equating the two sides of Eq. (7) gives \(1-p_k = (1-p_i)(1-p_j)/(1-p_0)\), which, solved for \(p_k\), yields):

$$\begin{aligned} p_k = \frac{p_i + p_j -p_i p_j - p_0}{1-p_0} \end{aligned}$$
(9)

This relation has been used when performing the mergers recommended by the ITIL experts. In all reasonable models, of course, \(p_0 \ne 1\), so the divisor is non-zero.
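The merger rule is also easy to check numerically. The following Python sketch (not part of the paper's tool chain; the parameter values are illustrative, not the calibrated ones of Table 1) verifies that the merged parameter of Eq. (9) reproduces the original model's prediction in the case of Eq. (7); the case of Eq. (8) holds by construction, since with no active causal factors both models predict the leak probability \(p_0\).

def unavailability(p0, active):
    # Leaky Noisy-OR P(y | active causal factors), cf. Eqs. (3) and (10):
    # each active factor multiplies the avoided unavailability by (1-p)/(1-p0).
    avoided = 1 - p0
    for p in active:
        avoided *= (1 - p) / (1 - p0)
    return 1 - avoided

p0, p_i, p_j = 0.01, 0.12, 0.07                 # illustrative values only
p_k = (p_i + p_j - p_i * p_j - p0) / (1 - p0)   # Eq. (9)

# Eq. (7): factors i and j active in the original model vs. the single
# merged factor k active in the new model.
assert abs(unavailability(p0, [p_i, p_j]) - unavailability(p0, [p_k])) < 1e-12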

A.1 Descriptions of the 14 ITIL factors

In the following section, each of the 14 factors is described in more detail. The introductory italic text of each factor presents the description used in the expert survey, when the probability of the factor causing system unavailability was estimated. Then follows a description presenting each factor in further detail with reference to the appropriate ITIL documentation [53–57]. Wordings are re-used from the ITIL volumes and the certified ITIL experts to the largest extent possible.

Physical environment and Infrastructure redundancy

The physical environment, including such things as electricity supply, cooling and cable handling, can affect the availability of a system. Infrastructure redundancy goes further than data and storage redundancy. Separate racks and cabinets may not help if all electricity cables follow the same path liable to be severed or if separate cables end at the same power source.

Factors to be considered include the following:

  • Building/site

  • Major equipment room

  • Major data centres

  • Regional data centres and major equipment centres

  • Server or network equipment rooms

  • Office environments

For a more detailed description of each factor, cf. ITIL Service Design [57] Appendix E, Table E.1-6.

Remark Physical security/access control is part of this category as well.

Requirements and procurement

Requirements and procurement reflect the early phases of system development and administration. This includes return on investment analyses, re-use of existing concepts, procuring software designed for the task at hand, negotiating service level agreements, etc.

This factor is about development of new systems and services, not about those already taken into operation. Business requirements for IT availability should at least contain (ITIL Service Design [57], p. 112):

  • “A definition of the vital business functions (VBFs) supported by the IT service

  • A definition of IT service downtime, i.e. the conditions under which the business considers the IT service to be unavailable

  • The business impact caused by loss of service, together with the associated risk

  • Quantitative availability requirements, i.e. the extent to which the business tolerates IT service downtime or degraded service

  • The required service hours, i.e. when the service is to be provided

  • An assessment of the relative importance of different working periods

  • Specific security requirements

  • The service backup and recovery capability.”

There should also be service level requirements (SLR).

Note that poor requirements management is often a root cause behind other faults, covered by the other factors.

Operations

Operations is everyday system administration. This includes removing single points of failure, maintaining separate environments for development, testing and production, consolidating servers, etc.

“All IT components should be subject to a planned maintenance strategy.” (ITIL service design [57], p. 120).

“Once the requirements for managing scheduled maintenance have been defined and agreed, these should be documented as a minimum in

  • Service level agreements (SLAs)

  • Operational level agreements (OLAs)

  • Underpinning contracts

  • Change management schedules

  • Release and deployment management schedules” (ITIL service design [57], p. 120.)

“Availability Management should produce and maintain the projected service outage (PSO) document. This document consists of any variations from the service availability agreed within SLAs.” (ITIL service design [57], p. 121.)

Incident management aims to “restore normal service as quickly as possible and minimize the adverse impact on business operations” (ITIL service operation [53], p. 46.). Incidents are managed using an incident model which should include

  • “The steps that should be taken to handle the incident

  • The chronological order these steps should be taken in, with any dependences or co-processing defined

  • Responsibilities—who should do what

  • Timescales and thresholds for completion of the actions

  • Escalation procedures; who should be contacted and when

  • Any necessary evidence-preservation activities” (ITIL service operation [53], p. 47).

“Service operation functions must be involved in the following areas:

  • Risk assessment, using its knowledge of the infrastructure and techniques such as component failure impact analysis (CFIA) and access to information in the configuration management system (CMS) to identify single points of failure or other high-risk situations

  • Execution of any risk management measures that are agreed, e.g. implementation of countermeasures, or increased resilience to components of the infrastructures, etc.

  • Assistance in writing the actual recovery plans for systems and services under its control

  • Participation in testing of the plans (such as involvement in off-site testing, simulations, etc) on an ongoing basis under the direction of the IT service continuity manager (ITSCM)

  • Ongoing maintenance of the plans under the control of IT service continuity manager ITSCM and change management

  • Participation in training and awareness campaigns to ensure that they are able to execute the plans and understand their roles in a disaster

  • The service desk will play a key role in communicating with the staff, customers and users during an actual disaster” (ITIL service operation [53], p. 77).

The problem management process should contain the following steps (ITIL service operation [53], p. 60):

  • Problem detection

  • Problem logging

  • Categorization

  • Prioritization

  • Investigation and diagnosis

  • Create known error record

  • Resolution

  • Closure

Change control

Change control is the process of controlling system changes. This applies to both hardware and software and includes documentation of the actions taken.

This factor is about systems and services already taken into operation, not about those under development. This is just a subset of the full ITIL Change Management process.

The seven Rs of change management: “The following questions must be answered for all changes:

  • Who RAISED the change?

  • What is the REASON for the change?

  • What is the RETURN required from the change?

  • What are the RISKS involved in the change?

  • What RESOURCES are required to deliver the change?

  • Who is RESPONSIBLE for the build, test and implementation of the change?

  • What is the RELATIONSHIP between this change and other changes?” (ITIL service transition [56], p. 53).

The following is a list of activities from an example process for a normal change (ITIL service transition [56], p. 49):

  • Record the request for change (RFC)

  • Review request for change (RFC)

  • Assess and evaluate change

  • Authorize change

  • Plan updates

  • Co-ordinate change implementation

  • Review and close change record

Questions to be addressed include the following: Is there unavailability caused by standard changes that are pre-approved and do not go through the cycle above? Is there unavailability caused by unauthorized changes? Is there access control, or can unauthorized people make changes? Is there a service asset and configuration management (SACM, described in ITIL service transition [56], p. 65 ff.) process?

Technical solution of backup

The technical solution of backup includes the choice of back-up media, whether commercial or a proprietary software is used, whether old media can still be read, whether full, cumulative or differential backup is chosen, etc.

The technical aspects of a backup strategy should cover the following:

  • “How many generations of data have to be retained—this may vary by the type of data being backed up, or what type of file (e.g. data file or application executable)

  • The type of backup (full, partial, incremental) and checkpoints to be used

  • The locations to be used for storage (likely to include disaster recovery sites) and rotation schedules

  • Transportation methods (e.g. file transfer via the network, physical transportation in magnetic media)

  • Testing/checks to be performed, such as test-reads, test restores, check-sums etc.” (ITIL service operation [53], p. 93).

Process solution of backup

The process solution of backup regulates the use of the technical solution. This includes routines such as whether backups are themselves backed up, whether the technical equipment is used in accordance with its specifications, what security measures (logical and physical) are used to guard backups, etc.

The business requirements for IT service continuity must be properly determined to define the strategy. The requirements elicitation has two steps: 1. business impact analysis and 2. risk analysis (ITIL service design [57], p. 128 ff).

The process aspects of a backup strategy should cover the following:

  • “What data have to be backed up and the frequency and intervals to be used

\([\ldots ]\)

  • Recovery point objective This describes the point to which data will be restored after recovery of an IT service. This may involve loss of data. For example, a recovery point objective of one day may be supported by daily backups and up to 24 h of data may be lost. Recovery point objectives for each IT service should be negotiated, agreed, and documented in operational level agreements (OLAs), Service Level Agreements (SLAs) and underpinning contracts (UCs).

  • Recovery time objective This describes the maximum time allowed for recovery of an IT service following an interruption. The service level to be provided may be less than normal service level targets. Recovery time objectives for each IT service should be negotiated, agreed, and documented in OLAs, SLAs and UCs” (ITIL service operation [53], p. 93–94, emphasis in original.)

The restore process must include these steps:

  • “Location of the appropriate data/media

  • Transportation or transfer back to the physical recovery location

  • Agreement on the checkpoint recovery point and the specific location for the recovered data (disk, directory, folder etc)

  • Actual restoration of the file/data (copy-back and any roll-back/roll-forward needed to arrive at the agreed checkpoint)

  • Checking to ensure successful completion of the restore—with further recovery action if needed until success has been achieved

  • User/customer sign-off” (ITIL service operation [53], p. 94).

Data and storage architecture redundancy

Data redundancy means that data stored on a disk remain available even if a particular disk crashes. Such data redundancy is often achieved through RAID. Storage architecture redundancy refers to redundancy at the level above disks: RAID may not help if all raided disks are placed in a single cabinet or rack, or if disks are connected through the same data paths and controller.

Is there a separate team or department to manage the organization’s data storage technology such as (ITIL service operation [53], p. 97):

  • Storage devices (disks, controllers, tapes etc.)

  • Network attached storage (NAS), storage area network (SAN), direct attached storage (DAS) and content addressable storage (CAS).

Is there someone responsible for

  • “Definition of data storage policies and procedures

  • File storage naming conventions, hierarchy, and placement decisions

  • Design, sizing, selection, procurement, configuration, and operation of all data storage infrastructure

  • Maintenance and support for all utility and middleware data-storage software

  • Liaison with information lifecycle management team(s) or governance teams to ensure compliance with freedom of information, data protection, and IT governance regulations

  • Involvement with definition and agreement of archiving policy

  • Housekeeping of all data storage facilities

  • Archiving data according to rules and schedules defined during service design. The storage teams or departments will also provide input into the definition of these rules and will provide reports on their effectiveness as input into future design

  • Retrieval of archived data as needed (e.g. for audit purposes, for forensic evidence, or to meet any other business requirements)

  • Third-line support for storage- and archive-related incidents” (ITIL service operation [53], p. 97).

All redundancy decisions should have been through an appropriate requirements engineering process (the requirements engineering process is defined in ITIL service design [57], pp. 167 ff.).

Avoidance of internal application failures

Systems can become unavailable because of internal application failures, e.g. because of improper memory access or hanging processes.

When new software is released and deployed, the plans should define

  • “Scope and content of the release

  • Risk assessment and risk profile for the release

  • Organizations and stakeholders affected by the release

  • Stakeholders that approved the change request for the release and/or deployment

  • Team responsible for the release

  • Approach to working with stakeholders and deployment groups to determine the:

    • Delivery and deployment strategy

    • Resources for the release and deployment

    • Amount of change that can be absorbed” (ITIL service transition [56], p. 91).

For software in operation, the following must be in place or considered (ITIL service operation [53], pp. 133–134):

  • Modeling, workload forecasting and workload testing

  • Testing by an independent tester

  • Up-to-date design, management and user manuals

  • Process of application bug tracking and patch management

  • Error code design and error messaging

  • Process for application sizing and performance

  • Process for enhancement to existing software (functionality and manageability)

  • Documentation of the type or brand of technology used.

Avoidance of external services that fail

Systems can become unavailable because they depend on external services that fail.

Service level management includes (ITIL continual service improvement [54]):

  • “Identifying existing contractual relationships with external vendors. Verifying that these underpinning contracts (UCs) meet the revised business requirements. Renegotiating them, if necessary.” (pp. 28–29)

  • “Create a service improvement plan (SIP) to continually monitor and improve the levels of services” (p. 29)

  • An implemented monitor and data collection procedure (service monitoring should also address both internal and external suppliers since their performance must be evaluated and managed as well (p. 46)).

  • Service level management (SLM)

    • “Analyze the service level achievements compared to SLAs and service level targets that may be associated with the service catalogue

    • Document and review trends over a period of time to identify any consistent patterns

    • Identify the need for service improvement plans

    • Identify the need to modify existing operational level agreements (OLAs) or underpinning contracts (UCs)” (ITIL continual service improvement [54] p. 59).

Key elements of successful supplier management:

  • “Clearly written, well-defined and well-managed contract

\([\ldots ]\)

  • Clearly defined (and communicated) roles and responsibilities on both sides

  • Good interfaces and communications between the parties

  • Well-defined service management processes on both sides

  • Selecting suppliers who have achieved certification against internationally recognized certifications, such as ISO 9001, ISO/IEC 20000, etc.” (ITIL service design [57], p. 164).

Network redundancy

Network redundancy, e.g. by multiple connections, multiple live networks or multiple networks in an asymmetric configuration, is often used to increase availability.

Important configurations for redundancy include

  • Diversity of channels: “provide multiple types of access channels so that demand goes through different channels and is safe from a single cause of failure.” (ITIL Service Strategy [55], p. 177).

  • Density of network: “add additional service access points, nodes, or terminals of the same type to increase the capacity of the network with density of coverage.” (ITIL Service Strategy [55], p. 177)

  • Loose coupling: “design interfaces based on public infrastructure, open source technologies and ubiquitous access points such as mobile phones and browsers so that the marginal cost of adding a user is low.” (ITIL service strategy [55], p. 177).

Avoidance of network failures

Network failures include the simplest networking failure modes, e.g. physical device failures, IP level failures, and congestion failures.

Important factors include

  • “Third-level support for all network related activities, including investigation of network issues (e.g. pinging or trace route and/or use of network management software tools—although it should be noted that pinging a server does not necessarily mean that the service is available!) and liaison with third-parties as necessary. This also includes the installation and use of ’sniffer’ tools, which analyze network traffic, to assist in incident and problem resolution.

  • Maintenance and support of network operating system and middleware software including patch management, upgrades, etc.

  • Monitoring of network traffic to identify failures or to spot potential performance or bottleneck issues.

  • Reconfiguring or rerouting of traffic to achieve improved throughput or better balance—definition of rules for dynamic balancing/routing” (ITIL service operation [53], p. 96).

Physical location

The physical location of hardware components can affect the recovery time of a malfunctioning system. This is the case, for instance, when a system crashes and requires technicians to travel to a remote data center to get it up again.

Has physical location been taken into account when fixing the recovery time objective (RTO)? The overall backup strategy must include

  • “Recovery time objective This describes the maximum time allowed for recovery of an IT service following an interruption. The service level to be provided may be less than normal service level targets. Recovery time objectives for each IT service should be negotiated, agreed, and documented in OLAs, SLAs and UCs” (ITIL service operation [53], p. 93–94, emphasis in original.)

Physical location is partly addressed in the detailed description of facility management (ITIL service operation [53], Appendix E: E2, E3, E4 and E6).

Resilient client/server solutions

In some client/server solutions, a server failover results in the client crashing. In more resilient client/server solutions, clients do not necessarily fail when the server fails.

Typically, the following activities should be undertaken:

  • “Third-level support for any mainframe-related incidents/problems”

\([\ldots ]\)

  • Interfacing to hardware (H/W) support; arranging maintenance, agreeing slots, identifying H/W failure, liaison with H/W engineering.

  • Provision of information and assistance to capacity management to help achieve optimum throughput, utilization and performance from the mainframe.” (ITIL service operation [53], p. 95).

Other activities include:

  • “Providing transfer mechanisms for data from various applications or data sources

  • Sending work to another application or procedure for processing

  • Transmitting data or information to other systems, such as sourcing data for publication on websites

  • Releasing updated software modules across distributed environments

  • Collation and distribution of system messages and instructions, for example events or operational scripts that need to be run on remote devices

  • Multicast setup with networks. Multicast is the delivery of information to a group of destinations simultaneously using the most efficient delivery route

  • Managing queue sizes.

  • Working as part of service design and transition to ensure that the appropriate middleware solutions are chosen and that they can perform optimally when they are deployed

  • Ensuring the correct operation of middleware through monitoring and control

  • Detecting and resolving incidents related to middleware

  • Maintaining and updating middleware, including licensing, and installing new versions

  • Defining and maintaining information about how applications are linked through middleware. This should be part of the configuration management system (CMS)” (ITIL service operation [53], p. 99).

There should exist a transition strategy for how to release client/server systems from testing into production (ITIL service transition [56], p. 38 ff.). Do not forget that requirements engineering and change control might be the root cause of failures in this domain!

Monitoring of the relevant components

Failover is the capability to switch from a primary system to a secondary in case of failure, thus limiting the interruption of service to only the takeover time. A prerequisite for quick failover is monitoring of the relevant components.

Instrumentation is about “defining and designing exactly how to monitor and control the IT infrastructure and IT services” (ITIL service operation [53], p. 45). The following needs to be answered:

  • “What needs to be monitored?

  • What type of monitoring is required (e.g. active or passive; performance or output)?

  • When do we need to generate an event?

  • What type of information needs to be communicated in the event?

  • Who are the messages intended for?” (ITIL service operation [53], p. 45)

Examples of important monitoring needs include:

  • “CPU utilization (overall and broken down by system/service usage)

  • Memory utilization

  • IO rates (physical and buffer) and device utilization

  • Queue length (maximum and average)

  • File storage utilization (disks, partition, segments)

  • Applications (throughput rates, failure rates)

  • Databases (utilization, record locks, indexing, contention)

  • Network transaction rates, error and retry rates

  • Transaction response time

  • Batch duration profiles

  • Internet response times (external and internal to firewalls)

  • Number of system/application log-ons and concurrent users

  • Number of network nodes in use, and utilization levels.” (ITIL service operation [53], p. 74).

For each incident: Was it/could it reasonably have been foreseen, or does lack of foresight indicate poor monitoring? Is there end-to-end monitoring?

Appendix B: OCL code for derived metamodel attributes

B.1 BehaviorElement

B.1.1 ArchitecturalAvailability:Real

figure a7

B.1.2 AvoidedUnavailability:Real

figure a8

B.1.3 HolisticAvailability:Real

figure a9

It should be noted that the final divisor is zero (i.e. the expression is undefined) if \(\mathtt{AvoidedUnavailabilityBaseline}=1\). However, as the leakage of the Noisy-OR model prevents AvoidedUnavailability from ever reaching 1, a baseline value of 1 is also impermissible. This is enforced using the baselineCheck invariant.

Appendix C: OCL code for metamodel operations

C.1 BehaviorElement

C.1.1 isTopService():Boolean

This operation checks whether the current BehaviorElement is the top service, i.e. whether it has no causal successors. The reason for this check is that the holistic availability ought only to be evaluated at a single point in the architecture, viz. at its “top”. This follows from the fact that holistic availability accounts for non-localizable properties.

figure a10
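The P\(^2\)AMF listing itself is the figure above; purely as an illustration, a Python analogue of the described check might read as follows (the causal_successors collection is an assumed name, not taken from the metamodel):

def is_top_service(elem):
    # elem is the top service iff no BehaviorElement is causally after it,
    # i.e. its set of causal successors is empty.
    return len(elem.causal_successors) == 0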

C.1.2 getBehaviorElements(BehaviorElement): BehaviorElement[*]

This operation returns the set of all BehaviorElements causally prior to the one given as argument, including itself. This is implemented in a recursive fashion.

figure a11

curr is the BehaviorElement given as argument.
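As an aside, a Python analogue of this recursion could look as follows (attribute names are illustrative assumptions, not the P\(^2\)AMF originals; getGates in C.1.3 below follows the same pattern over gates):

def get_behavior_elements(curr, seen=None):
    # Recursively collect all BehaviorElements causally prior to curr,
    # including curr itself; `seen` prevents revisiting shared predecessors.
    if seen is None:
        seen = set()
    if curr not in seen:
        seen.add(curr)
        for pred in curr.causal_predecessors:
            get_behavior_elements(pred, seen)
    return seen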

C.1.3 getGates(Gate):Gate[*]

This operation returns the set of all gates connected to the one given as argument through other gates (not through BehaviorElements).

figure a12

curr is the Gate given as argument.

C.1.4 getBestPracticeCausalFactor(BehaviorElement): Real

This set includes 14 operations, one for each of the best practice attribute factors of Table 2. While similar, the 14 P\(^2\)AMF implementations differ slightly from each other. Each operation traverses the architectural model, using the getBehaviorElements and getGates operations, to find all attributes of the relevant kind. The arithmetic mean of the attribute values is returned. If no attributes of the relevant kind are found, the default value 0.5 is returned. getBestPracticeAvoidanceOfExternalServiceFailures represents the basic case:

figure a13

top is the BehaviorElement given as argument.
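In Python-like terms, the basic case could be sketched as follows (a sketch only, reusing get_behavior_elements from C.1.2; attribute names are assumptions, not the P\(^2\)AMF originals):

def get_best_practice_factor(top, attr_name):
    # Traverse the model from `top`, collect every attribute of the relevant
    # kind, and return the arithmetic mean; if none exists, default to 0.5.
    values = [getattr(e, attr_name)
              for e in get_behavior_elements(top)
              if hasattr(e, attr_name)]
    return sum(values) / len(values) if values else 0.5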

If the attributes sought belong to more than one class, all of these class types must be found, as in getBestPracticeResilientClientServerSolutions:

figure a14

Sometimes the attributes belong to a class that has to be accessed through a relationship, e.g. isGovernedBy in getBestPracticeChangeControl:

figure a15

C.1.5 avoidedUnavailability():Real

This operation calculates the avoided unavailability based on the Noisy-OR model from [17]. The model originally gives the avoided unavailability as

$$\begin{aligned} A(\mathbf{X}_p) = 1 - P(y | \mathbf{X}_p) = (1- p_0) \prod _{i:X_i \in \mathbf{X}_p} \frac{(1- p_i)}{(1- p_0)} \end{aligned}$$
(10)

where \(A(\mathbf{X}_p)\) is the avoided unavailability of an architecture lacking the best practice factors listed in the vector \(\mathbf{X}_p\). The \(p_i\) variables are the ones given in Table 1. In this implementation, each factor to the right of the product sign is calculated using the operation noisyOrFactor.

figure a16

C.1.6 noisyOrFactor(Real,Real,Real):Real

This operation calculates factors for the avoidedUnavailability operation. In the formulation given in Eq. 10, each best practice factor is either present or not (i.e. listed in the vector \(\mathbf{X}_p\), or not). However, noisyOrFactor allows factors to be present probabilistically, i.e. be present with a probability \(q\) and absent with a probability \(1-q\). A causal factor being present means that it is not listed in the index set \(\mathbf{X}_p\), which is equivalent to noisyOrFactor returning 1, i.e. the multiplicative identity. This consideration gives noisyOrFactor the following form:

figure a17

noisyOrFactor takes three arguments: \(p\), \(q\), and \(p_0\). Here \(p_0=1~\%\) [17], \(p\) is the appropriate \(p_i\) from Table 1, and \(q\) is the value returned from the appropriate getBestPracticeCausalFactor operation.

It is a feature of Noisy-OR models as such that the impact of the systemic causal factors is multiplicatively separable. This allows the probabilistic presence of each factor to be treated independently from the rest, as in noisyOrFactor.
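Putting C.1.5 and C.1.6 together, a deterministic Python analogue can exploit exactly this separability by multiplying each factor's expected contribution (a sketch under assumed names; in P\(^2\)AMF the presence of each factor is instead sampled probabilistically):

def noisy_or_factor(p, q, p0):
    # Expected contribution of one causal factor: with probability q the
    # factor meets best practice (not listed in X_p, contribution 1);
    # otherwise it contributes (1 - p)/(1 - p0), as in Eq. (10).
    return q + (1 - q) * (1 - p) / (1 - p0)

def avoided_unavailability(factors, p0=0.01):
    # Eq. (10) with probabilistically present factors: `factors` is a list
    # of (p_i, q_i) pairs, p_i taken from Table 1 and q_i returned by the
    # corresponding getBestPracticeCausalFactor operation.
    a = 1 - p0
    for p, q in factors:
        a *= noisy_or_factor(p, q, p0)
    return a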

Appendix D: OCL code for metamodel invariants

A number of OCL invariants are included in the metamodel, in order to express constraints that cannot be expressed in a regular entity-relationship diagram.

D.1 BehaviorElement

D.1.1 baselineCheck

This invariant requires that \(\mathtt{AvoidedUnavailabilityBaseline} \in [0,1)\). The baseline cannot be unity, since the Noisy-OR model behind AvoidedUnavailability has leakage, so that attribute never reaches unity.

figure a18

D.1.2 mostOneType

This invariant requires that a BehaviorElement is connected to at most one gate through the gateToService relation or to at most one ActiveStructureElement through the assigned relation, but not to both. Being connected to both would lead to a conflict in the calculation of architectural availability.

figure a19
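As an illustration of the constraint (the OCL itself is the figure above), a Python-style check could read as follows, with relation names taken from the description:

def most_one_type(elem):
    # At most one gate via gateToService, at most one ActiveStructureElement
    # via assigned, and never both at once.
    return (len(elem.gate_to_service) <= 1
            and len(elem.assigned) <= 1
            and not (elem.gate_to_service and elem.assigned))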

D.2 Node

D.2.1 onlyInfrastructureService

This invariant requires that only an InfrastructureService, and no other type of BehaviorElement, can be assigned to a Node.

figure a20

D.3 ApplicationComponent

D.3.1 onlyFunction

This invariant requires that only an ApplicationFunction, and no other type of BehaviorElement, can be assigned to an ApplicationComponent.

figure a21

D.4 CommunicationPath

D.4.1 onlyInfrastructureService

This invariant requires that only an InfrastructureService, and no other type of BehaviorElement, can be assigned to a CommunicationPath.

figure a22


Cite this article

Franke, U., Johnson, P. & König, J. An architecture framework for enterprise IT service availability analysis. Softw Syst Model 13, 1417–1445 (2014). https://doi.org/10.1007/s10270-012-0307-3

