ComputerScience
Fault Tolerance
- Nancy Leveson, MIT Professor of Aeronautics and Astronautics, especially her papers on fault tolerance:
- An Experimental Evaluation of the Assumption of Independence in Multi-Version Programming, by John Knight and Nancy Leveson, IEEE Transactions on Software Engineering, Vol. SE-12, No. 1, January 1986, pp. 96-109
@ "Our original paper that got us in such hot water for the next ten years until everyone who tried to show we were wrong, got the same results and grudgingly admitted we were right. Unfortunately, the same idea keeps popping up again like a bad penny among people who do not bother to learn anything about what has been done in the past."
@ It's too bad the MIT server won't serve up this paper. -rpg
@ Downloading the paper now works. -RonGoldman
- The Use of Self Checks and Voting in Software Error Detection: An Empirical Study, by Nancy Leveson, Stephen Cha, John Knight, and Timothy Shimeall, IEEE Transactions on Software Engineering, Vol. SE-16, No. 4, April 1990
@ "While we were on a roll, we decided to compare the use of self-checks (assertions) and voting (n-versions)."
- An Empirical Comparison of Software Fault Tolerance and Fault Elimination, by Timothy Shimeall and Nancy Leveson, IEEE Transactions on Software Engineering, Vol. SE-17, No. 2, February 1991, pp. 173-183
@ "Before bowing out gracefully (and bloodied) from the software fault-tolerance community and taking a break from running experiments, I decided to try one more. This paper compares the effectiveness of two software fault tolerance techniques (embedded self-checks and multi-version programming) with some common fault elimination techniques."
- Robust by Gerry Sussman at MIT is an interesting read on this topic. -rpg
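For readers who have not met the two techniques these papers compare, here is a minimal sketch of voting across versions and of an embedded self-check. It is not taken from any of the papers: the three integer-square-root versions, `majority_vote`, and `passes_self_check` are all hypothetical names invented for illustration. Two of the versions are deliberately written to share the same floating-point mistake, so the vote can be outvoted by a common error, which is the kind of non-independent failure the first paper is about.

```python
import math
from collections import Counter

# Three hypothetical "independently written" versions of integer square root.
def isqrt_a(x: int) -> int:
    return math.isqrt(x)            # exact integer arithmetic

def isqrt_b(x: int) -> int:
    return int(math.sqrt(x))        # goes through a float, loses precision

def isqrt_c(x: int) -> int:
    return math.floor(math.sqrt(x)) # same float mistake, written differently

VERSIONS = [isqrt_a, isqrt_b, isqrt_c]

def majority_vote(versions, x):
    """Voting: run every version and return the strict-majority answer."""
    results = [version(x) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise RuntimeError(f"no majority among {results}")
    return value

def passes_self_check(x: int, r: int) -> bool:
    """Self-check: test the defining property of the result, r*r <= x < (r+1)^2."""
    return r * r <= x < (r + 1) * (r + 1)

# 2**60 - 1 is not exactly representable as a float; it rounds up to 2**60.
x = 2**60 - 1
winner = majority_vote(VERSIONS, x)
print(winner == math.isqrt(x))        # False: two versions agree on a wrong answer
print(passes_self_check(x, winner))   # False: the self-check rejects that answer
```

In this contrived case the self-check catches what the vote lets through, but that is a property of the example and not a general ranking of the two techniques; the papers above report the actual experimental comparisons.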
Interesting System Failures
- NASA's Mars Climate Orbiter - failed to enter Mars orbit due to the failure to use metric units in the coding of a ground software file, Small Forces, used in trajectory models.
- NASA's Mars Polar Lander - crashed during landing due to premature shutdown of the descent engines, resulting from a vulnerability of the software to transient signals.
- Ariane 5 disaster - due to reuse of Ariane 4 software without adapting it to the different requirements of the Ariane 5.
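The Mars Climate Orbiter entry is a unit mismatch across a software interface: one side produced impulse data in pound-force seconds while the other expected newton-seconds. One way to make that class of mistake harder is to carry the unit in the type, so a bare number cannot cross the interface. The sketch below is only an illustration of that idea, not a description of the actual flight or ground software; the `Impulse` class and its method names are hypothetical.

```python
from dataclasses import dataclass

# 1 lbf = 0.45359237 kg * 9.80665 m/s^2 = 4.4482216152605 N (exact by definition)
LBF_S_TO_N_S = 4.4482216152605

@dataclass(frozen=True)
class Impulse:
    """Impulse stored internally in newton-seconds (hypothetical type)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(float(value))

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens once, at the boundary where the unit is known.
        return cls(float(value) * LBF_S_TO_N_S)

    def __add__(self, other: object) -> "Impulse":
        if not isinstance(other, Impulse):
            return NotImplemented  # a bare number has no unit, so refuse it
        return Impulse(self.newton_seconds + other.newton_seconds)

# Values from both unit systems combine safely once they are wrapped.
total = Impulse.from_newton_seconds(10.0) + Impulse.from_pound_force_seconds(1.0)

# An unlabelled number is rejected instead of being silently misread.
try:
    Impulse.from_newton_seconds(10.0) + 1.0
except TypeError as error:
    print("rejected:", error)
```

The design choice is that callers must name the unit when they construct a value, so the conversion factor lives in one place and is never implied.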
Related Articles
- Software by Dan Bricklin
- This paper calls for a new style of development to better meet the long-term requirements for software used by society as part of the infrastructure (e.g. robustness, long-term stability & security).
- One point he gets wrong: Open Source does not necessarily mean unpaid individual volunteers. Many organizations that derive value from an open source application will pay their employees to customize it. The big advantage is that they do not need to negotiate with other organizations about who pays how much & for what. Small improvements can be done by a single organization, while large efforts may involve recruiting enough resources (= people) from multiple organizations in a decentralized fashion. That is often easier than having a central board decide what gets worked on. -- RonGoldman
Programming Language Ideas
I think this paper is worth reading. It presents a new set of programming constructs that are, I think, not obvious. -rpg