Self-Sustaining Systems Wiki


KristenProposal

Project Proposal: Redundant Robustness

Problem

Over the last 40 or more years, computers have evolved from relatively slow, large, and expensive devices dedicated to specific purposes into ubiquitous, fast, and cheap parts of our daily lives. We use them both for specific, dedicated tasks and for general-purpose storage, retrieval, and transformation of information, among other things. While some characteristics of computing hardware have changed dramatically over this period, many of the assumptions buried deep inside the software architectures that evolved around this hardware have not changed since the time when hardware was slow, large, and expensive. Reflections of this are observable in the operating systems, programming languages, and computational models we use when trying to get useful work done with our computers.

Aside from speed and power, software artifacts reflect other assumptions that influence our thinking and, subsequently, our designs. For example, an assumption of reliability can be observed in the construction of most programming languages (even so-called 'modern' languages). When we write "x = y" in most common (imperative) languages, the language specification implies that the value in y will be copied to x without failure. The programming model says that if there is a problem executing this assignment, then the program itself will fail. As a result, we program as if most things we say in our program will happen according to specification, without fail, each and every time. There are exceptions to this assumption, but they appear as special cases.

If we look at the other extreme, where every operation can potentially fail, other problems arise. In this case, the programmer must constantly be aware of the failure potential of each and every operation. The number of potential error states rises dramatically in this framework, and with it the cognitive load on the programmer. This cognitive load could easily grow to the point where the programmer or user of such a system spends more time concentrating on error recovery than on the task at hand. Neither of these situations seems ideal.
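
To make the contrast concrete, here is a small, hypothetical Python sketch of the same assignment written under each assumption; copy_value and CopyError are invented stand-ins, not part of any real API.

    # A minimal, hypothetical sketch of the two programming extremes described above.

    class CopyError(Exception):
        """Raised by the hypothetical fallible copy below."""

    def copy_value(value, fail=False):
        """Stand-in for an operation that may fail in a fault-aware model."""
        if fail:
            raise CopyError("copy did not complete")
        return value

    # Extreme 1: the usual imperative assumption -- assignment simply succeeds,
    # and any failure would abort the program rather than be handled here.
    y = 42
    x = y

    # Extreme 2: every operation is treated as potentially failing, so each step
    # carries its own error handling and the space of error states multiplies.
    try:
        x = copy_value(y)
    except CopyError:
        x = None  # one of many recovery paths the programmer must now design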

To live with this tension, systems have evolved to include some degree of redundancy. An example is how "reliable" web services are built for a typical large system installation in 2004. To deliver even simple static HTML pages, it would not be unusual to see an array of redundant HTTP servers, redundant database servers, multiple network connections, load-balancing routers, and so on. Each essential function of the system (divided at functional boundaries) is replicated, and some fail-over mechanism is employed to detect failure and shift the work to the remaining components performing that function. Systems built around this paradigm appear more reliable at delivering web services than a single server for each function; however, they suffer from a related architectural flaw: a meta-regress of coarse-grained functionality.
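
As a simplified illustration of the fail-over pattern described above, the sketch below tries a hypothetical list of replicated HTTP servers in turn and shifts the request to a surviving replica when one fails; the hostnames and the fetch_page helper are invented for the example.

    # A minimal sketch of coarse-grained fail-over across redundant backends.
    import urllib.request
    import urllib.error

    BACKENDS = [
        "http://web-1.example.com/",   # hypothetical replicated HTTP servers
        "http://web-2.example.com/",
        "http://web-3.example.com/",
    ]

    def fetch_page(path="index.html", timeout=2.0):
        """Try each replica in turn; shift the work to the survivors on failure."""
        last_error = None
        for base in BACKENDS:
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    return resp.read()
            except (urllib.error.URLError, OSError) as err:
                last_error = err   # detected failure: fall through to the next replica
        raise RuntimeError(f"all replicas failed: {last_error}")

Note that even in this small sketch the replica list and the fail-over policy are themselves global, engineered artifacts, which is exactly where the meta-regress discussed next creeps in.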

When such a system encounters a failed functional component, it can react in many ways. In the best case the failure is detected, traffic is re-routed, loads are rebalanced, and the system, after some delay, recovers functionality. This best case appears difficult to realize, however, when the goal of reliability through redundancy competes with the goal of scalability. The meta-regress referred to earlier appears when we attempt to do things like load balance the load balancers and then encounter a failure in the load balancer for the load balancers. Or perhaps the control process for balancing the load balancers fails, and we then need a control process to watch that control process. In the scalable systems we typically build, the regress bottoms out at some point, and we are left with the equivalent of a single point of failure.

Other scaling issues often arise in architectures that attempt to employ redundancy to increase reliability. When optimizing for increasing performance, it is easy to create an architecture where, at a certain scale, the processing and/or data transfer necessary for redundancy management exceeds the amount necessary for the base level of processing and/or data transfer. Such architectures quickly become untenable when a customer's desire for more capacity (typically an ever-increasing function of time) exceeds some threshold. The reasons for these scalability issues may be rooted, at least partially, in the separation of function between redundancy management and base-level system function.
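
To see why such a threshold exists, consider a deliberately simple, assumed model in which base work grows linearly with the number of nodes while redundancy-management traffic grows quadratically, as it would for all-to-all state replication; the constants below are illustrative only.

    # Back-of-the-envelope sketch of the scaling crossover under an assumed model:
    # base work is linear in the node count, redundancy management is quadratic.

    BASE_UNITS_PER_NODE = 100.0   # work units of useful processing per node
    SYNC_UNITS_PER_PAIR = 0.5     # work units spent keeping each pair of nodes in sync

    def base_work(n):
        return BASE_UNITS_PER_NODE * n

    def redundancy_work(n):
        return SYNC_UNITS_PER_PAIR * n * (n - 1)

    for n in (10, 100, 200, 400, 800):
        b, r = base_work(n), redundancy_work(n)
        flag = "  <-- overhead now dominates" if r > b else ""
        print(f"n={n:4d}  base={b:9.0f}  redundancy={r:9.0f}{flag}")

Under these made-up constants the crossover arrives just past 200 nodes; the exact numbers are unimportant, but the existence of a crossover is a direct consequence of the mismatched growth rates.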

Another aspect of systems with explicitly architected redundancy is how the system performs while attempting to recover from the failure of replicated functional components. In these systems, as they are typically built today, failure recovery appears as a change in the state of the system. This change in state is accompanied by a change in performance and/or responsiveness as resources are applied to repair the failure. Latency increases during the recovery phase as resources are re-allocated, followed by a resumption of functionality at reduced performance.

Lastly, systems with functional redundancy run the risk of depleting the resources dedicated to particular functions. An implication of dedicating resources in this way is that a relatively small number of failed units can cripple the system unless the failures are evenly distributed among the functional groups. If correlated failures follow functional divisions, system-wide failure will not be far behind.
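
A small Monte-Carlo sketch can illustrate the point, under the assumed model that the system has a few functional groups with dedicated units and dies as soon as any one group loses all of its units, while a fully interchangeable pool dies only when too few units survive in total; the group counts and failure rates are invented for illustration.

    # Illustrative comparison of function-dedicated pools versus a shared pool.
    import random

    GROUPS = 4            # hypothetical functional divisions
    UNITS_PER_GROUP = 5   # redundant units dedicated to each function
    TRIALS = 10_000

    def survives(failed_fraction, dedicated, rng):
        units = [(g, u) for g in range(GROUPS) for u in range(UNITS_PER_GROUP)]
        alive = [unit for unit in units if rng.random() > failed_fraction]
        if dedicated:
            # dedicated pools: every group must retain at least one live unit
            return all(any(g == grp for g, _ in alive) for grp in range(GROUPS))
        # interchangeable pool: any GROUPS live units can cover the functions
        return len(alive) >= GROUPS

    rng = random.Random(1)
    for frac in (0.3, 0.5, 0.7):
        ded = sum(survives(frac, True, rng) for _ in range(TRIALS)) / TRIALS
        pool = sum(survives(frac, False, rng) for _ in range(TRIALS)) / TRIALS
        print(f"failed={frac:.0%}  dedicated pools survive {ded:.1%}, shared pool {pool:.1%}")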

Investigating Solutions

Investigations of systems using redundancy to achieve robustness have often focused on explicit global control. In this paradigm, a feedback loop is established to control, insert, and reallocate the redundant components when failure occurs. Many of the problems mentioned above have emerged from these architectures.

This project proposes to build a prototype system that does not rely on explicitly engineered redundant recovery or global feedback loops to assemble a robust system. The proposed prototype would deliver some type of service (a web service is one example) to a client and would do so even under adverse conditions where a significant fraction of the components making up the system were randomly disabled.

This prototype would be composed of a number of replicated component pieces. These components would be similar or identical in their capabilities and would not have access to any global information about the architecture or interconnections other than what they could glean from their local neighborhood. It is proposed that the system be viewed as always being in failure recovery, and that it attempt to achieve robust behavior with that assumption in mind.
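
As a rough sketch of the kind of component envisioned here, the hypothetical Node class below knows only its immediate neighbors and locates a service by limited flooding, with no global view of the topology; it illustrates the local-knowledge constraint rather than a proposed design.

    # A node that serves requests using only its local neighborhood.

    class Node:
        def __init__(self, name, services=()):
            self.name = name
            self.services = set(services)   # capabilities this replica offers
            self.neighbors = []             # the only topology knowledge it has
            self.alive = True

        def link(self, other):
            self.neighbors.append(other)
            other.neighbors.append(self)

        def handle(self, service, ttl=4, seen=None):
            """Serve the request locally if possible, otherwise ask the neighborhood."""
            seen = seen if seen is not None else set()
            if not self.alive or self.name in seen or ttl < 0:
                return None
            seen.add(self.name)
            if service in self.services:
                return f"{service} served by {self.name}"
            for peer in self.neighbors:
                answer = peer.handle(service, ttl - 1, seen)
                if answer is not None:
                    return answer
            return None

    # Usage: a small ring of similar nodes; disable one replica offering the
    # service and the request still finds a live one through local forwarding.
    nodes = [Node(f"n{i}") for i in range(5)]
    nodes[2].services.add("web")
    nodes[4].services.add("web")
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        a.link(b)
    nodes[2].alive = False
    print(nodes[0].handle("web"))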

There are several sub-problems to be solved on the way to realizing a prototype system as described. Some of these sub-problems follow:

- introducing a new resource to the system

- routing algorithms

- service discovery

- synchronization

- partial / full failure of components

- resource / load management

Solutions to all of these remain to be discovered; however, there are interesting directions in which to proceed based on current work in the fields of robustness and redundancy. Complex adaptive systems will be the starting point for viewing how this might be achieved. Some work has been done on routing and load balancing using the ant pheromone model, and that research may provide a starting point for work in this area. There are several explorations of service discovery (including Jxta) from which to begin.
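
As one possible starting point, the sketch below shows pheromone-style next-hop selection loosely in the spirit of ant-colony routing work: each candidate link carries a pheromone level that evaporates over time and is reinforced when a delivery succeeds. The class, constants, and update rule are illustrative assumptions, not a description of any particular published algorithm.

    # Pheromone-weighted next-hop selection with evaporation and reinforcement.
    import random

    EVAPORATION = 0.9    # pheromone retained per update step
    REINFORCEMENT = 1.0  # pheromone deposited on a link that carried a successful delivery

    class PheromoneRouter:
        def __init__(self, next_hops):
            # one pheromone level per candidate next hop, all starting equal
            self.pheromone = {hop: 1.0 for hop in next_hops}

        def choose_next_hop(self, rng=random):
            """Pick a next hop with probability proportional to its pheromone level."""
            hops = list(self.pheromone)
            weights = [self.pheromone[h] for h in hops]
            return rng.choices(hops, weights=weights, k=1)[0]

        def update(self, hop, succeeded):
            """Evaporate all trails, then reinforce the link that just worked."""
            for h in self.pheromone:
                self.pheromone[h] *= EVAPORATION
            if succeeded:
                self.pheromone[hop] += REINFORCEMENT

    # Usage: if hop "b" starts failing, its trail evaporates and traffic
    # drifts toward "a" and "c" without any global rerouting decision.
    router = PheromoneRouter(["a", "b", "c"])
    for _ in range(200):
        hop = router.choose_next_hop()
        router.update(hop, succeeded=(hop != "b"))
    print({h: round(p, 2) for h, p in router.pheromone.items()})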

Proposed Research Agenda

Work in this area would begin with an investigation of current methodology for building resilient systems, starting with architectures in common use and proceeding through research in the area of systems with emergent properties, robust communication paths, and fault tolerance. Some research questions we hope to answer are: