Self-Sustaining Systems Wiki

GroupProposal

Here are the ideas from RonProposal refactored to try to answer Glenn's 5 questions (plus an extra question at the end). We need to add concrete examples to illustrate things wherever possible.

What do we want to change?

Programming distributed applications today is like the space shuttle: it is very expensive, takes many people to make it work, yields handcrafted solutions, is very complex, and is brittle, so one small failure can cause total, catastrophic failure. We want to change that.

Non-expert programmers should be able to easily create an application that involves multiple computers and devices (e.g. sensors) and understand what it is doing. This application should be robust by default.

What is the problem we are trying to solve?

Several problems:

How can we make software systems so that they actively maintain/sustain themselves?
How can we build distributed systems that need to exhibit global behavior through local action (i.e. emergence)?
How can we make it easier to create such systems?

What is unique about our approach?

We feel that the current assumptions underlying software development & computer science are inadequate, and actually harmful, for creating robust complex systems. We seek inspiration from the study of biological systems, from the Santa Fe Institute's work on complex adaptive systems, from the principles and techniques used in high-reliability telecom networks, and from other areas outside the normal software development world.

In particular, we are focused on the question of how systems actively work to sustain/maintain their activity. In living systems a large fraction of the overall system activity is concerned with preservation, conservation and repair, while only a small amount takes care of the basic functionality. Yet performing that basic functionality robustly seems to require that the system devote the bulk of its resources to self-sustaining activities. This is in vivid contrast to typical software, where basic functionality makes up most of the code and only a small amount is devoted to error detection and correction.

In current systems, small changes to the code can affect distant modules, damage to internal data structures can snowball, and a failure in one component can bring the whole application to a halt. The only force opposing this destructive entropy is the programmer crafting the program; any slip-ups in development or omissions in testing result in bugs in the final version. Error handling is usually added toward the end of the development cycle and tends to be local in scope.

A self-sustaining system is one where each module is constantly checking & repairing any damage to its internal data structures and where modules are more loosely coupled (i.e. less trusting & more defensive of other modules). The assumption is that errors are a fact of life & occur all the time. So the system needs to actively repair any damage before it can spread.
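To make the idea of a module that constantly checks and repairs its own data structures more concrete, here is a minimal Python sketch. It assumes one particular repair strategy (keeping two copies of every entry plus a SHA-256 checksum, and rebuilding whichever copy no longer matches); the class `SelfRepairingStore` and its methods `put`, `get`, `repair` and `audit` are hypothetical names chosen for illustration, not an existing API. It also shows the "less trusting" side of loose coupling: inputs from other modules are validated, and callers only ever receive copies of internal data.

```python
# Hypothetical sketch of a self-checking, self-repairing module.
# Assumed strategy: duplicate each entry and verify both copies against a checksum.
import hashlib
import json


def _copy(value):
    """Deep, independent copy via a JSON round trip (values must be JSON data)."""
    return json.loads(json.dumps(value))


def _digest(value):
    """Stable SHA-256 checksum of a JSON-serializable value."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()


class SelfRepairingStore:
    """Key-value store that keeps every entry twice, alongside a checksum,
    so a damaged copy can be detected and rebuilt from the healthy one."""

    def __init__(self):
        self._primary = {}
        self._replica = {}
        self._checksums = {}

    def put(self, key, value):
        # Defensive toward other modules: reject keys and values we cannot protect.
        if not isinstance(key, str):
            raise TypeError("keys must be strings")
        clean = _copy(value)  # raises TypeError for non-JSON data
        self._primary[key] = clean
        self._replica[key] = _copy(clean)
        self._checksums[key] = _digest(clean)

    def get(self, key):
        self.repair(key)
        # Hand out a copy so callers cannot corrupt our internal state.
        return _copy(self._primary[key])

    def repair(self, key):
        """Verify one entry; rebuild a damaged copy from the healthy one.
        Returns True if something was repaired."""
        expected = self._checksums[key]
        primary_ok = key in self._primary and _digest(self._primary[key]) == expected
        replica_ok = key in self._replica and _digest(self._replica[key]) == expected
        if primary_ok and replica_ok:
            return False
        if primary_ok:
            self._replica[key] = _copy(self._primary[key])
        elif replica_ok:
            self._primary[key] = _copy(self._replica[key])
        else:
            raise RuntimeError(f"entry {key!r} is damaged beyond repair")
        return True

    def audit(self):
        """Sweep every entry and return the number of repairs made."""
        return sum(self.repair(key) for key in list(self._checksums))


if __name__ == "__main__":
    store = SelfRepairingStore()
    store.put("sensor-7", {"reading": 21.5, "unit": "C"})
    store._primary["sensor-7"]["reading"] = -999  # simulate a stray write
    print(store.audit())          # 1
    print(store.get("sensor-7"))  # {'reading': 21.5, 'unit': 'C'}
```

In a real system `audit` would presumably run continually (for example on a timer or a background thread) rather than only on demand, so damage is caught before it spreads. Duplicating every entry roughly doubles memory use, which is the kind of trade of resources for robustness described above.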

Some examples:

ECC (error-correcting code) parity bits on memory (RAM)
garbage collection = active memory maintenance
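The ECC example shows detection and repair at the lowest level. Below is a toy Python sketch of a Hamming(7,4) code, which can correct any single flipped bit in a 7-bit codeword, plus a loop that walks a simulated "memory" and rewrites corrected words, loosely mirroring the idea of memory scrubbing. This is a simplification: real ECC memory typically uses wider codes (such as SEC-DED codes over 64-bit words), and `hamming74_encode`, `hamming74_correct` and `scrub` are hypothetical helper names.

```python
# Hypothetical toy sketch: Hamming(7,4) single-error correction plus a scrubber.
# Bit positions are 1..7; parity bits sit at positions 1, 2 and 4.


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword."""
    if len(data) != 4 or any(b not in (0, 1) for b in data):
        raise ValueError("expected four bits")
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits, position of the corrected bit or 0 if none)."""
    bits = list(codeword)
    # XOR of the positions of all set bits is 0 for a valid codeword;
    # after a single flip it equals the position of the flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome - 1] ^= 1
    return [bits[2], bits[4], bits[5], bits[6]], syndrome


def scrub(memory):
    """Actively walk every stored word, rewrite any single-bit errors,
    and return how many words were repaired."""
    repaired = 0
    for i, word in enumerate(memory):
        data, pos = hamming74_correct(word)
        if pos:
            memory[i] = hamming74_encode(data)
            repaired += 1
    return repaired


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    print(word)                     # [0, 1, 1, 0, 0, 1, 1]
    word[4] ^= 1                    # flip the bit at position 5
    print(hamming74_correct(word))  # ([1, 0, 1, 1], 5)
```

Passive checking only fixes a word when it happens to be read; the scrubbing loop repairs errors proactively, before a second flip in the same word makes it uncorrectable.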

How will we know when we are done?

Since the complexity of software applications continues to grow, this sort of work is open-ended. However, some milestones that would indicate our progress include:

acceptance of our new assumptions & aesthetics
it becoming noticeably easier to create robust complex applications, requiring less effort and less expertise
applications & runtime environments including far more mechanisms to maintain their integrity and to keep functioning despite continual failures

What is our artifact?

various papers describing new approaches to software architecture & development, including how to program emergent behavior
new ideas for tools to support developers dealing with complexity
new programming techniques and frameworks for more reliable computing
new language features, or even a new language, to better express solutions
tools to explore this vision including a simulation to illustrate some of these ideas

What does this mean for Sun?

Today a major problem holding back development is the complexity of the current generation of distributed network applications, in particular how to offer network services reliably. Any help Sun can give developers in building distributed network applications, and in making the resulting software more robust, will increase Sun's mindshare and create product opportunities.

This is similar to the opportunity in the mid-1990s, when people were struggling to create basic network-enabled applications. Java addressed that sweet spot directly through its support for network computing: it is architecture-neutral, portable and secure, and its standard library provides APIs for both low-level and high-level network operations. Making it easier for developers to write programs that use the network was one of the key contributors to Java's success as a programming language. Now the problem is how to build more complex applications, and whoever can help developers solve that problem stands to reap the benefits.