# Top Five Challenges Facing the Practice of Fault-tolerance

Ram Chillarege
IBM Thomas J. Watson Research Center, 1994

Abstract -- This paper identifies key problem areas for the fault-tolerant community to address. Changes in technology, expectation of society, and needs of the market pressure the design point for fault-tolerance in their own special manner.  A developer, who has only a finite set of resources and limited time, responds to these pressures with a set of priorities. I believe that the top five challenges, which ultimately drive the exploitation of fault-tolerant technology are:
(1)   Shipping a product on schedule
(2)   Reducing unavailability
(3)   Non-disruptive change management
(4)   Human fault-tolerance
(5)   All over again in the distributed world.
Each of these is discussed to explore its influence on the choice of fault-tolerance. Understanding them is key to guiding research investment and maximizing its derivatives.

Lecture Notes in Computer Science 774, "Hardware and Software Architectures for Fault Tolerance," Springer-Verlag, 1994, ISBN 3-540-57767-X

# 1. The Area of Fault-tolerance

The area of fault-tolerance is never clearly defined; in some quarters, however, it is assumed that fault tolerant computing appears in a box. This is misleading given that the ideas of fault-tolerance permeate the entire industry, in hardware, software, and applications. Yet, it is not uncommon for industry segmentation efforts to divvy up the market and identify one segment as the fault tolerant market. This market, when quantified by adding up the revenue from fault-tolerant boxes, is only in the range of two billion dollars [1], in an industry that is estimated at more than two hundred billion dollars. However, as most engineers would agree, the perception that fault tolerant computing comes in a box, either hardware or software, is only a very narrow view of the area.

A larger view, one I believe to be more accurate, is that the ideas and concepts of fault-tolerance permeate every segment of the industry -- starting with the device, the machine and following through with systems software, sub-system software, application software and including the end user. However, this larger vision is confounded by the fact that there are several different forces and expectations on what is considered fault-tolerance and what is not. The single hardest problem that continues to persist in this community is the definition of the faults that need to be tolerated [2] [3]. An engineering effort to design fault-tolerance is effective only when there is a clear picture of what faults need to be tolerated. These questions need to be answered at every level of the system. There are trade-offs in cost, manufacturability, design time and capability in arriving at a design point. Ultimately, like any decision, there is a substantial amount of subjective judgement used for that purpose.

Independent of the technical challenges that face the designer of fault tolerant machines, there is another dimension which is based on society and its expectations placed on computers. In the long run this has a more significant impact on what needs to be designed into machines than is ordinarily given credit. Let us for a moment go back in time and revisit the Apollo 1 disaster that took place more than two decades ago. At that time the tragedy brought about a grave sadness in our society. We let it pass, hoped it would not happen again, and continued the pursuit of scientific accomplishments for mankind. Contrast this with the Challenger disaster that took place only a few years ago, when there was a very different perception in society. It was considered unacceptable that such a disaster could take place. The expectations on technology had changed in the minds of people. Technology had significantly advanced, and the average person trusted it a lot more, whether or not that trust was rightfully placed.

These changes in expectation place an enormous pressure on the designers of equipment which is used in everyday life. People expect them to work and expect them to be reliable, whether or not the product has such specifications. In cases where safety is critical, there may exist an elaborate process and specification to ensure safety. However, there are computers embedded in consumer devices which may or may not have gone through the design and scrutiny to ensure reliability, safety and dependability.

## 1.1 Operating Just Below the Threshold of Pain

The question of engineering fault-tolerance is a very critical one. Just how much fault-tolerance is needed for a system or an application is a hard question. Ideally, this question should have an engineering answer but in reality that is rare. It is one that has to be answered bearing in mind the expectation of a customer, the capability of a technology, and what is considered competitive in the marketplace. I propose that a realistic answer to this question is to recognize that a system need be fault tolerant only to the extent that it operates just below the threshold of pain.

Fault-tolerance does not come free. Developing a system which is fault tolerant beyond customer expectation is excessive in cost and cannot be competitive. On the other hand, if the system fails too often, causing customer dissatisfaction, one will lose market share. The trick is to understand what that threshold of pain is and ensure that the system operates just below that threshold. This would then be the perfect engineering solution.

Understanding the threshold of pain, knowing the limits of the technology and the capability of alternate offerings is critical. One approach is to dollarize lost business opportunity to provide a quantitative mechanism to arrive at a reasonable specification. However, when the impact is customer satisfaction and lost market share, it is more complicated. When the impact is perception, it is truly difficult. When safety is in question, all bets are off. Nevertheless, the  bottom line is that one has to maintain a careful balance in designing fault-tolerance capability.
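As an illustration of dollarizing lost business opportunity, the following minimal sketch picks the cheapest design whose expected yearly loss stays below a chosen threshold of pain. The design options, costs, outage hours, revenue per hour and threshold are all hypothetical values invented for the example, not data from this paper.

```python
from dataclasses import dataclass


@dataclass
class DesignOption:
    name: str
    yearly_cost: float            # added cost of the fault-tolerance features (hypothetical)
    outage_hours_per_year: float  # expected outage with this design (hypothetical)


def yearly_loss(option, revenue_per_hour):
    """Dollarized lost business opportunity for one design option."""
    return option.outage_hours_per_year * revenue_per_hour


def cheapest_below_threshold(options, revenue_per_hour, pain_threshold):
    """Return the lowest-cost option whose yearly loss stays below the threshold."""
    acceptable = [o for o in options
                  if yearly_loss(o, revenue_per_hour) < pain_threshold]
    if not acceptable:
        return None  # no design operates below the threshold of pain
    return min(acceptable, key=lambda o: o.yearly_cost)


# Hypothetical design points, from no redundancy to full duplexing.
options = [
    DesignOption("no redundancy", 0.0, 40.0),
    DesignOption("redundant power and disks", 20_000.0, 8.0),
    DesignOption("fully duplexed system", 150_000.0, 0.5),
]
choice = cheapest_below_threshold(options,
                                  revenue_per_hour=5_000.0,
                                  pain_threshold=50_000.0)
print(choice.name)
```

The sketch only covers the case where the impact can be expressed in dollars. When the impact is customer satisfaction, perception or safety, no such simple loss function is available.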

# 2. Forces Driving the Prioritization

The design of fault-tolerant computing is influenced by several changes in the current industrial environment.  These changes come from different directions and pressure the fault tolerant community in their own special manner.  For the purposes of this paper we will discuss a few of the pressures that influence the current environment.

Component reliability and speed have both improved dramatically. The hard failure rate for IBM mainframes (measured in failures per year per MIPS) decreased almost two orders of magnitude in 10 years [4]. The dramatic improvement in reliability decreases the sense of criticality of fault-tolerance, although it does not completely go away. However, this trend coupled with other effects does tend to change the focus. One of the other significant effects is standardization and commoditization, yielding a very competitive market. Standardization increases competition, reduces profit margins and puts a very strong focus on cost. This competitiveness is experienced in almost every segment, more so in the lower price segments than the higher price segments. The net result is a tremendous focus on cost.

The belief that fault-tolerance increases cost is quite pervasive, although it is not clear that it does so from a life-cycle-cost perspective. Fundamentally, fault-tolerance in hardware is  achieved through the use of redundant resources.  Redundant resources cost money (or performance) and no matter how small come under perpetual debate.

The drive towards standardization has decreased product differentiation. As a result, in a market with standardized offerings every vendor is looking for product differentiators. Fault-tolerance is certainly a key differentiator amongst products of equivalent function, and this should help drive the need for greater fault-tolerance.

The customer's perspective on these problems is driven from a different set of forces.  There is a much greater dependency on information technology as time progresses, and this dependency is not as well recognized until things go wrong. This subliminal nature of computing in the work place has resulted in much higher expectations on delivery of the service. Customers do not write down specifications on dependability, reliability, availability or serviceability.  They have expectations which have grown to the point that dependability is expected as a given. As a result, the focus of the customer is on the solution and not so much on how we got to the solution.  Unless a vendor recognizes these expectations and designs appropriately, the resulting solution may be far from the expectations.

Given the downturn in the economy in the last few years there has been a trend to reduce expenditure on information systems. One of the tendencies is to move off equipment which is considered higher priced and on to alternatives considered lower priced. These moves are usually not matched with a corresponding shift in expectation as far as reliability and availability are concerned. This is particularly true where applications on centralized mainframe computers are moved to networks of distributed personal computers. While the mainframe systems had experienced staff who understood systems management, dependencies, and recovery strategies, the equivalent function or skill may not exist for the distributed setup. With computing in the hands of non-experts, the expectations continue to carry through while the underlying enablers may not exist and the risks may not be adequately comprehended.

Given these various forces one can see that the design point for fault tolerant equipment is indeed a very nebulous issue. Although it is good to "ask the customer," realistically, questions on the expectation of reliability and availability, with the corresponding cost and risk, are extremely hard to quantify.

# 3. The Top 5 Challenges

There is only a finite amount of time and resource that can be juggled to produce product and profit. Fault-tolerance, as a technology or a methodology to enhance a product, plays a key role in it. Without the techniques of fault-tolerance it is unlikely that any of the products would function at levels of acceptability.  Yet, since the use of fault-tolerance takes a finite amount of resource (parts and development) its application is always debated. Thus, there are times when its use, though arguably wise and appropriate has to be traded due to its impact on the practicality of producing product.

There is no magic answer to the tradeoffs that are debated. Sometimes, there are costing procedures and data to understand the tradeoffs. Most times, these decisions have to be made with less than perfect information. Thus judgment, experience, and vision drive much of the decision making. What results is a set of priorities. The priorities change with time and are likely to be different for the different product lines. However, stepping back and observing the trends, there are some generalizations that become apparent. The following sections list five items, in order, that I believe drive most of the priorities. There are numerous data sources and facts from technical, trade and news articles that provide bits and pieces of information. Although some are cited, the arguments are developed over a much larger body of information. The compilation is purely subjective and the prioritization is a further refinement of it.

## 3.1 Shipping a Product on Schedule

This is by far the single largest force that drives what gets built or not, and for the most part, rightly so. What has that to do with fault-tolerance? It is worthwhile noting that producing product is always king - the source of revenue and sustenance. The current development process is under extreme pressure to reduce cycle time. Getting to market first provides advantage and the price of not being first is significant. This is driven by the need in a competitive market to introduce products faster, and a technological environment where product life times are shrinking. The reduction in cycle time impacts market opportunity realised and life-cycle costs.

The development cycle time is proportional to the amount of function being designed. The shrinking of development cycle time brings under scrutiny any function that could be considered additional. Unfortunately, this pressure does not spare function used to provide fault-tolerance.

Fault-tolerance requires additional resource not only in hardware but also in design and verification, which add to development cycle time. Any extra function that doesn't directly correspond to a marketable feature comes under scrutiny. Thus, the pressure of reducing cycle time can indirectly work against functionality such as fault-tolerance, which is usually a support type of function in the background. Until fault-tolerance becomes a feature which directly translates to customer gain, the cycle time pressures do not work in its favor.

Systems have become very complex. The complexity exists at almost every level of the system -- the hardware, the software, the applications and user interface levels. Designing complex systems increases development cycle time, and also creates correspondingly complex failure mechanisms. Although automation and computer aided design techniques have helped reduce  the burden, especially in hardware, they cause a new kind of problem. Errors that are inserted and the ones that dominate the development process are the higher level specification and design errors.  These design errors have a large impact on the overall development cycle time, significantly impacting cost. There are no easy solutions to these problems and an understanding of the fault models is only emerging.

The classical positioning of dependable computing and fault-tolerance has been not to address the faults that escape the development process but to address the random errors attributed to nature.  Unfortunately, this is probably a fairly major oversight in this industry.  One of the critical paths in a business is the development cycle time.  The compressed schedules can result in a greater number of errors that actually escape into the field. Unfortunately, this has never been the focus and is not easy to make  the focus. It would make a difference if these error escapes were also the focus of the fault-tolerant computing research community.

## 3.2 Reducing Unavailability

From the perspective of a commercial customer, it is the loss of availability that causes a large impact. The causes of outage need to be carefully understood before one can develop a strategy for where fault tolerance needs to be applied. There have been quite a few studies that identify the various causes of outage and their impact. A widely recognized conclusion is that the Pareto is dominated by software and procedural issues, such as operator errors or user errors. Next to these errors are hardware and environmental problems. Studies show that a decade ago, hardware outage dominated the Pareto but improvements in technology and manufacturing have decreased that contribution. However, there have not been similar improvements in software, which is why it now dominates the cause of outage [5]. It is commonplace in the industry to separate outage causes into scheduled and unscheduled outage. Given the Pareto, this split is more relevant in software than in hardware.

An unscheduled outage is an act of technology or nature and is the kind of fault that is commonly the target of fault-tolerant design. Typically, these faults are due to manufacturing defects or marginal performance which result in transient or intermittent errors. Unscheduled outage can also occur due to software bugs or defects. Although there is considerable effort expended on de-bugging software prior to release, there is no such thing as the last bug. Bugs cause failures, but these do not always result in a complete outage [6]. In fact, severity 1 problems (on a scale of 1-4), implying a complete loss of function, are less than 10%, and severity 2 problems, which require some circumvention to restore operation, are typically between 20% and 40%. The severity 3's and 4's correspond to an annoyance and are usually the bulk of the problems. Not every software defect hits every customer. However, it is common practice to upgrade a release of software with a maintenance release. A maintenance release includes recent bug-fixes, and the time required for periodic maintenance is usually accounted under scheduled outage.

The largest part of outage due to software is what may be called planned or scheduled outage. Primarily, these are for maintenance, reconfiguration, upgrade etc. Over the past few years we have seen that the proportion of scheduled outage, especially in software, has greatly increased. The mean scheduled outage in the commercial data processing center is at least twice that of an unscheduled outage. It is also the case that the total amount of outage caused by scheduled down time far exceeds unscheduled outage, particularly for software. Typically commercial systems have scheduled down time to reorganize data bases, accommodate new configurations or tune for performance. This is an aspect of outage that has not been adequately studied in the academic community. Although it may sound like a topic for systems management, it impacts availability most directly.

As the industry places greater emphasis on reducing software defects and their impact, the proportion of the scheduled outage will rise. Reducing the scheduled outage down to zero is becoming a requirement in some commercial applications that call for 24x7 operation, i.e., 24 hours a day, 7 days a week [7]. To reach this design point one has to reduce outage from all sources. The difficulty is in reducing scheduled outage, since most old designs assume the availability of a window for repair and maintenance. To reach the goal of 24x7 operation one has to broaden the vision and scope of fault-tolerance to include all sources of outage. This calls for rethinking the design point. The task is much simpler for hardware, where each machine design starts with almost a clean slate. Designing software, in contrast, is much more constrained, since it builds on a base of code whose design might not all be understood or documented.
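The arithmetic behind this argument can be sketched as follows. The outage hours are hypothetical; the only relation taken from the text is that scheduled outage is assumed to be twice the unscheduled outage.

```python
HOURS_PER_YEAR = 24 * 365  # a 24x7 operation counts every hour of the year


def availability(scheduled_hours, unscheduled_hours, total_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the system is available."""
    return 1.0 - (scheduled_hours + unscheduled_hours) / total_hours


# Hypothetical yearly outage figures.
unscheduled = 10.0
scheduled = 2 * unscheduled  # assumption: scheduled outage is twice unscheduled

# Under 24x7 operation, scheduled outage counts against availability.
with_scheduled = availability(scheduled, unscheduled)

# Eliminating every defect-driven outage still leaves the scheduled part.
only_scheduled = availability(scheduled, 0.0)

print(f"all outage counted:       {with_scheduled:.5f}")
print(f"zero unscheduled outage:  {only_scheduled:.5f}")
```

Under these assumptions, removing all unscheduled outage recovers only a third of the lost hours, which is why the design point has to include scheduled outage as well.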

## 3.3 Non-Disruptive Change Management

The earlier discussion on scheduled outage brings to focus a very important aspect of software maintenance. Software will always need to be maintained: the installation of patches, upgrade to a newer release, establishing of new interfaces, etc. All these cause disruption and more often than not demand an outage. Unless software has been designed to be maintained non-disruptively, it is unlikely this capability can be retrofitted. The increase in network applications creates situations where products communicate with different releases and functionality, requiring N to N+1 compatibility. This requirement has serious implications on how software is designed, control structures maintained, and data shared. Architecting this from the very beginning makes the task of designing upgradability much easier. Trying to do this in a legacy system is invariably a hard exercise and sometimes infeasible.

There are some techniques that can be adopted towards non-disruptive change management. Broadly, they fall into a couple of major categories: one being a hot standby and the other the mythical modular construction that can be maintained on-line. With legacy systems the choices are more limited given a base architecture which is inherited, and a hot standby approach is easier to conceive [8], [9]. In a hot standby, a second version of the application is brought up and users are migrated from one application to the other while the first version is taken down for rework. To do this one has to maintain communication between the applications, consistency of data, and a failover capability. Alternatively, applications can be built so they are more modular and the shared resources managed to permit online maintenance.
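The hot standby sequence described above can be sketched in a few lines. This is a toy model under stated assumptions, not the mechanism of any product cited here: `AppInstance` and `switch_to_standby` are hypothetical names, sessions are plain user names, and data consistency is reduced to copying a dictionary.

```python
class AppInstance:
    """Hypothetical application instance holding user sessions and shared data."""

    def __init__(self, version):
        self.version = version
        self.running = False
        self.sessions = set()
        self.data = {}

    def start(self):
        self.running = True

    def stop(self):
        if self.sessions:
            raise RuntimeError("cannot stop an instance that still serves users")
        self.running = False


def switch_to_standby(primary, standby):
    """Migrate users from primary to standby, then take primary down for rework."""
    standby.start()
    standby.data = dict(primary.data)      # bring the standby's data up to date
    for user in sorted(primary.sessions):  # iterate over a copy while migrating
        standby.sessions.add(user)
        primary.sessions.discard(user)
    primary.stop()                         # primary is now free for maintenance
    return standby


old = AppInstance("N")
old.start()
old.sessions.update({"alice", "bob"})
old.data["balance"] = 100

new = switch_to_standby(old, AppInstance("N+1"))
print(new.version, sorted(new.sessions), old.running)
```

A real system would also have to handle updates that arrive during the migration and a failure of the standby itself, which is where most of the design difficulty lies.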

A related problem that impacts non-disruptive change management is the very first step, namely, problem isolation and diagnosis. Unlike hardware, software failures do not always result in adequate information to identify the fault or the cause of failure. In IBM parlance, capturing this information is commonly called first failure data capture. Studies have shown that first failure data capture is usually quite poor. Barring some of the mainframe software, which has traditionally had a lot of instrumentation [10], most software does not trap, trace or log adequate information to help diagnose the failure the first time it occurs. Furthermore, error propagation and latency make it hard to identify the root cause. The problem then has to be re-created, which, at the customer site, causes further disruption and outage.
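One common way to improve first failure data capture is to keep a small in-memory trace and dump it, together with the error, the first time a failure is seen. The sketch below illustrates the idea only; `FlightRecorder` and `handle_request` are hypothetical names and do not describe the instrumentation of any IBM product.

```python
from collections import deque
import traceback


class FlightRecorder:
    """Keeps the last few trace events in memory and dumps them on a failure."""

    def __init__(self, capacity=100):
        self.events = deque(maxlen=capacity)  # oldest events fall off automatically
        self.dumps = []

    def trace(self, message):
        self.events.append(message)

    def capture(self, exc):
        """Record the error, its stack and the events that led up to it."""
        self.dumps.append({
            "error": repr(exc),
            "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
            "recent_events": list(self.events),
        })


recorder = FlightRecorder(capacity=3)


def handle_request(request_id, payload):
    """Hypothetical request handler instrumented with the recorder."""
    recorder.trace(f"request {request_id}: received")
    try:
        result = 100 / payload
        recorder.trace(f"request {request_id}: ok")
        return result
    except Exception as exc:
        recorder.capture(exc)  # capture at the first occurrence, then propagate
        raise


handle_request(1, 4)
try:
    handle_request(2, 0)
except ZeroDivisionError:
    pass

print(recorder.dumps[0]["error"])
print(recorder.dumps[0]["recent_events"])
```

With the context captured at the point of failure, the problem may not need to be re-created at the customer site.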

In a network environment, an application can be spread across the network in a client-server relationship with data from distributed databases. Providing a non-disruptive solution for change management becomes more complicated. To reduce outage, change management has to be carefully architected. Current trends in this area are mostly ad hoc, and a unifying theme and architecture is certainly an opportune area for research. It would also provide for better inter-operability across multi-vendor networks.

## 3.4 Human Fault-Tolerance

With the current focus on the defect problem and unscheduled outage, their impact will eventually be decreased. The scheduled down time will also decrease with improved systems management. However, a new problem will then start to dominate. This problem has to do with the human comprehension of tasks being performed. In IBM parlance, we call these the non-defect oriented problems. As the name suggests, a non-defect is one that does not require a code change to fix the problem. The non-defect problems also include tasks such as installation and migration, provided they are problems related to comprehensibility of instructions and tasks, as opposed to defects in the code.

A non-defect can cause work to be stopped by the human, resulting in an eventual loss of availability. This disruption in the work can also result in calls to the vendor, increasing service costs. More importantly, these problems can eventually impact the perception of the product. Increasingly, information on a product is integrated with the application, making documentation more accessible and available. New graphical user interfaces have paradigms that make the execution of a task far more intuitive. Additionally, there can evolve a culture of "the user is always right." In this environment, the concept of availability needs to be re-thought and correspondingly the concepts of fault-tolerance. The classical user error will quickly become passe. Nevertheless, designing systems to tolerate human error is only part of the story. Designing systems to ensure a certain degree of usability perceived by a user is certainly a new challenge for the fault-tolerant community.

## 3.5 All Over Again in the Distributed World

One of the philosophies in fault-tolerance goes back to John von Neumann: "the synthesis of reliable organisms from unreliable components." Stretching this to the present day, we often think of designing distributed systems using parts from the single system era. If it were single systems with no fault-tolerance being used to build distributed systems, we might be luckier. However, single high end commercial systems are amazingly fault-tolerant. When we lash a few of them together, one has to be careful in understanding the failure and recovery semantics before designing a higher level protocol, for we are no longer synthesizing a reliable system with unreliable components.

The problems emanate because there are several layers of recovery management, each one optimized locally, which may not prove to be a good global optimum. For example, assume there are two paths to a disk via two different fault-tolerant controllers. If an error condition presented on a request is retried repeatedly by the controller, that would be a poor choice given the configuration. Failing the request with a reported error and re-issuing it on a different path would be preferred. However, this implies understanding the recovery semantics and disabling them to develop yet another higher level policy. The above situation only illustrates the tip of the iceberg. There are several nuances that need to be dealt with.
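The two-path example can be sketched as a higher level policy that fails over instead of retrying. The controllers are modeled here as plain functions, and `PathError`, `read_with_failover` and both controller stubs are hypothetical names; the sketch assumes the lower level retries have already been disabled so that an error is reported promptly.

```python
class PathError(Exception):
    """Raised by a (hypothetical) controller path when a request fails."""


def read_with_failover(paths, block):
    """Try each path once; fail over to the next path instead of retrying."""
    errors = []
    for path in paths:
        try:
            return path(block)   # lower level retries are assumed disabled
        except PathError as exc:
            errors.append(exc)   # record the error and move to the alternate path
    raise PathError(f"all {len(paths)} paths failed: {errors}")


def broken_controller(block):
    """Stub for a controller path that reports an error on every request."""
    raise PathError("controller A: device error")


def healthy_controller(block):
    """Stub for a working controller path."""
    return f"data for block {block}"


print(read_with_failover([broken_controller, healthy_controller], 7))
```

The point of the sketch is the division of labor: the lower layer detects and reports, while the layer that knows the configuration decides how to recover.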

In essence, one has to think of the design point and the strategy all over again in the distributed world. There are several benefits, one of them being the availability of a substantial number of spares. With plenty of spares, shoot and restart might be a better policy than trying to go through an elaborate recovery process. Assuming that error detection is available, sparing provides a nice repair policy. Contrast this with a high-end commercial processor such as the IBM ES/9000 Model 900 [11], which has extensive checking but limited spares, whereas a network of workstations, in today's technology, has minimal checking but can provide a lot of spares. On the other hand, the ES/9000 provides some of the highest integrity in computing. Solutions to the integrity problem in a network of workstations, when designed on the granularity of a machine, have questionable performance. This leaves open the very important question of integrity. The design for fault-tolerance in the distributed world needs to look carefully at integrity, detection, recovery and reconfiguration at an appropriate level of granularity.
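A shoot and restart policy over a pool of spares might look like the sketch below. It assumes, as the text does, that error detection is available; here that is the caller-supplied `detect_error` function. The machine names and the task are hypothetical.

```python
def run_with_spares(task, machines, detect_error):
    """Run task on one machine; on a detected error, abandon it and restart on a spare."""
    for attempt, machine in enumerate(machines, start=1):
        result = task(machine)
        if not detect_error(result):
            return result, machine, attempt
        # "Shoot": discard the suspect machine and make no attempt at recovery.
    raise RuntimeError("ran out of spares")


# Hypothetical pool in which the first two workstations are faulty.
faulty = {"ws1", "ws2"}


def task(machine):
    """Stub task that yields None when it runs on a faulty machine."""
    return None if machine in faulty else f"done on {machine}"


result, machine, attempts = run_with_spares(
    task, ["ws1", "ws2", "ws3", "ws4"], detect_error=lambda r: r is None
)
print(result, attempts)
```

The policy is only as good as `detect_error`: an undetected wrong result is returned as if it were correct, which is the integrity question raised above.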

# 4. Summary

The goal of this paper is to bring to the fault-tolerant community a perspective on what I believe are the top five priorities for a developer in today's environment. The issues identify the factors that help or hinder the exploitation of fault-tolerant technology. Understanding the issues and placing a focus on them could eventually lead to innovation and research that will benefit the industry.

1. Shipping a product on schedule dominates the list and is further accentuated due to the compressed development cycle times. In an intensely competitive market with very short product life times, any extra function that might stretch the cycle time can be argued as non critical and end up on the chopping block. Fault-tolerance function is no exception to it unless the resulting reliability is essential to the survival of the product line, or is a feature that is clearly added value. The dramatic improvements in component reliability probably do not help it. On the other hand, a crisp articulation of the life-cycle-cost reduction due to fault-tolerance and the overall improvement in customer satisfaction are driving forces, when applicable.

2. Reducing Unavailability is critical as more segments of the market bet their business on data processing and information technology. Today, given the consolidations in commercial computing and globalization of the economy, the window for outage is quickly disappearing, driving towards the requirements of 24x7 operations. The outage due to software dominates the causes of unavailability, and is commonly separated into scheduled and unscheduled outage. In the commercial area, scheduled dominates the two. Research in fault-tolerant computing does not directly address some of these issues, but they are a relevant topic for investigation.

3. Non-disruptive change management will be a key to achieving continuous availability and dealing with the largest fraction of problems associated with software. Given that most software in the industry is legacy code there is an important question of how one retrofits such capability. It is likely that a networked environment, with several spares, could effectively employ a "shoot and restart" policy to reduce unavailability and provide change management.

4. Human Fault-tolerance will eventually start dominating the list of causes for unavailability and the consequent loss of productivity. Currently there is a significant focus in the industry on the defect problem and the associated unavailability problems due to scheduled downtime. Eventually these will be reduced, in that order, leaving the non-defect oriented problems to dominate. This problem is accentuated by the fact that there is a significant component of graphical user interface in today's applications meant for the non-computer person. Usability will be synonymous with availability, creating this new dimension for fault-tolerance research to focus on.

5. All over again in the distributed world summarizes the problems we face in a distributed computing environment. The difficulty is that the paradigms of providing fault-tolerance do not naturally map over from the single system to the distributed system. The design point, cost structure, failure modes, resources for sparing, checking and recovery are all different. So long as that is recognized, hopefully, gross errors in design will not be committed.

## References

[1] J. Bozman, "Identifies the sources as International Data Corporation", Computerworld, Mar 30, 1992, pp75-78.
[2] J. J. Stiffler, "panel: On establishing fault tolerance objectives", The 21st Intl. Symposium on Fault-Tolerant Computing, June 1991.
[3] IEEE International Workshop on Fault and Error Models, Palm Beach, FL, January 1993.
[4] D. Siewiorek and R. Swarz, Reliable Computer Systems, Digital Press, 1992.
[5] J. Gray, "A census of Tandem System availability between 1985 and 1990," IEEE Transactions on Reliability, Vol 39, October 1990.
[6] M. Sullivan and R. Chillarege, "Software defects and their impact on system availability - a study of field failures in operating systems," The 21st International Symposium on Fault-Tolerant Computing, June 1991.
[7] J.F. Isenberg, "Panel: Evolving systems for continuous availability," The 21st International Symposium on Fault-Tolerant Computing, June 1991.
[8] IMS/VS Extended Recovery Facility: Technical Reference. IBM GC24-3153, 1987.
[9] D. Gupta and P. Jalote, "Increasing system availability through on-line software version change," The 23rd International Symposium on Fault-Tolerant Computing, 1993.
[10] R. Chillarege, B.K. Ray, A.W. Garrigan and D. Ruth, "Estimating the recreate problem in software failures," The 4th International Symposium on Software Reliability Engineering, 1993.
[11] L. Spainhower, J. Isenberg, R. Chillarege, and J. Berding, "Design for fault-tolerance in System ES/9000 Model 900," The 22nd International Symposium on Fault-Tolerant Computing, 1992.