Downtime Incidents
 | Slashdot stories on
bugs, viruses and downtime |
 | Ariane 5
Flight 501 Failure, Inquiry Board Report, Prof. J. L. Lions, July 1996
|
 | Therac-25 Failure
Description, Nancy Leveson, MIT |
 | Navy Smart Ship USS Yorktown
|
 |
Human error called culprit in 3 rocket launch failures, Florida Today
Space Online, June 16, 1999 |
 |
Stock Market Outages Highlight Software Availability Issues, The Payne
Report, July/August 2001 |
 | CERT/CC. Advisories. |
Downtime Costs and Statistics
 | Improving Systems
Availability, IBM Global Services |
 |
Comparison of High Availability Software and Hardware Clusters, D. H.
Brown Associates, Inc. |
 |
Making Smart Investments to Reduce Unplanned Downtime, D. Scott, Gartner
Group, 1999 |
 | D. Scott. Assessing the costs of application downtime. Gartner Group, May
1998. |
 | Five T's of
Database Availability, Standish Group Research Note, 1999 |
 | TCO (Total
Cost of Ownership) in the Trenches, Standish Group Research Note, 1999
|
General Surveys and Classic Papers
 | Jim Gray,
Why Do
Computers Stop and What Can Be Done About It? Symposium on Reliability in
Distributed Software and Database Systems 1986, pp. 3-12. (Link is to a 1985
technical report version of the paper from Tandem Computers.) |
 | M. Sullivan and R. Chillarege. Software defects and their impact on system
availability – A study of field failures in operating systems. In FTCS, Jun
1991. |
 | Fundamental
Concepts of Dependability, Avizienis, Laprie & Randell |
 | Lessons
Learned From Delta-4, David Powell |
 | Failure
Analysis of an ORB in the Presence of Faults, LAAS, France |
 |
Fault Injection Tools and Techniques, Iyer et al. |
 | Fault-Tolerant
CORBA Standard, Object Management Group |
 |
Software Fault Tolerance: A Tutorial, Wilfredo Torres-Pomales,
NASA/TM-2000-210616, October 2000, 66 pp. |
 | J. F. Bartlett. A NonStop kernel. In SOSP, Dec 1981. |
 | A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault tolerance
under UNIX. ACM TOCS, 7(1), Feb 1989. |
 | E. Marcus and H. Stern. Blueprints for High Availability. John Wiley &
Sons, 2000. |
Fault Injection
Fault Analysis
 | Norman E. Fenton and
Niclas Ohlsson.
Quantitative Analysis of Faults and Failures in a Complex Software System.
2000. IEEE Transactions on Software Engineering, 26(8):797-814.
|
 | Sharon E. Perl and Richard
L. Sites.
Studies of Windows NT Performance Using Dynamic Execution Traces.
1996. in Operating Systems Design and Implementation, pages 169-183.
|
 | Barton Miller and David
Koski and Cjin Pheow Lee and Vivekananda Maganty and Ravi Murthy and Ajitkumar
Natarajan and Jeff Steidl.
Fuzz Revisited: A Re-examination of the Reliability of UNIX Utilities and
Services. 1995. in Technical Report CS-TR-1995-1268. |
 | I. Lee and R. Iyer.
Faults, Symptoms, and Software Fault Tolerance in the Tandem GUARDIAN
Operating System. 1993. in 23rd Int. Symp. on Fault-Tolerant
Computing (FTCS-23), pages 20-29. IEEE Computer Society Press.
|
 | M. Sullivan and R.
Chillarege.
A Comparison of Software Defects in Database Management Systems and Operating
Systems. 1992. in 22nd Int. Symp. on Fault-Tolerant Computing
(FTCS-22), pages 475-484. IEEE Computer Society Press. |
 | M. Sullivan and R.
Chillarege.
Software defects and their impact on system availability - a study of field
failures in operating systems. 1991. 21st Int. Symp. on
Fault-Tolerant Computing (FTCS-21), pages 2-9. |
 | Barton P. Miller and Lars
Fredriksen and Bryan So.
An empirical study of the reliability of UNIX utilities. 1990.
Communications of the Association for Computing Machinery, 33(12):32-44.
|
 | Jim Gray.
A Census of Tandem System Availability Between 1985 and 1990. 1990. in
Technical Report 90.1. Tandem Computers Incorporated,
Cupertino, Calif. |
 | W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z.-Y. Yang. Characterization of
Linux kernel behavior under errors. In DSN, 2003. |
 | S. Chandra and P. M. Chen. Whither generic recovery from application
faults? A fault study using open-source software. In DSN/FTCS, Jun 2000. |
 | D. E. Lowell, S. Chandra, and P. M. Chen. Exploring failure transparency
and the limits of generic recovery. In OSDI, 2000. |
 | P.J. Koopman and J. DeVale.
Comparing the Robustness of POSIX Operating Systems. Proc. of 29th IEEE
Symposium on Fault-Tolerant Computing (FTCS), 1999. |
Testing
Checkpointing
 | Y. Chen, J. S. Plank, and K. Li. Clip: A checkpointing tool for message
passing parallel programs. In SC, 1997. |
 | Milos Prvulovic and Zheng
Zhang and Josep Torrellas.
ReVive: cost-effective architectural support for rollback recovery in
shared-memory multiprocessors. 2002. in Proceedings of the 29th
annual International Symposium on Computer Architecture (ISCA), pages
111-122. IEEE Computer Society. |
 | Daniel J. Sorin and Milo
M. K. Martin and Mark D. Hill and David A. Wood.
SafetyNet: improving the availability of shared memory multiprocessors with
global checkpoint/recovery. 2002. in Proceedings of the 29th annual
international symposium on Computer architecture, pages 123-134. IEEE
Computer Society. |
 | D. E. Lowell and P. M. Chen. Discount checking: Transparent, low-overhead
recovery for general applications. Technical report, CSE-TR-410-99, University
of Michigan, Jul 1998. |
 | J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE TPDS,
9(10), 1998. |
 | Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. M. R. Kintala.
Checkpointing and its applications. In FTCS, Jun 1995. |
Failure Recovery
 | Feng Qin, Joe Tucek, Jagadeesan Sundaresan and
Yuanyuan Zhou. "Rx: Treating bugs as allergies - a safe
method to survive software failures". To appear in the 20th
ACM Symposium on Operating Systems
Principles (SOSP'05), October 2005. |
 | L. Alvisi and K. Marzullo. Trade-offs in implementing optimal message
logging protocols. In SPDS, 1996. |
 | C. Amza, A. Cox, and W. Zwaenepoel. Data replication strategies for
fault tolerance and availability on commodity clusters. In DSN, Jun 2000. |
 | A. Avizienis. The N-version approach to fault-tolerant software. IEEE
TSE, SE-11(12), 1985. |
 | A. Avizienis and L. Chen. On the implementation of N-version programming
for software fault tolerance during execution. In COMPSAC, Nov 1977. |
 | A. Bobbio and M. Sereno. Fine grained software rejuvenation models. In
IPDS, Sep 1998. |
 | A. Borg, J. Baumbach, and S. Glazer. A message system supporting fault
tolerance. In SOSP, 1983. |
 | T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. ACM
Transactions on Computer Systems, 14(1):80–107, Feb. 1996. |
 | G. Candea, J. Cutler, A. Fox, R. Doshi, P. Garg, and R. Gowda. Reducing
recovery time in a small recursively restartable system. In DSN, Jun 2002.
|
 | G. Candea and A. Fox. Recursive restartability: Turning the reboot
sledge hammer into a scalpel. In HotOS, May 2001. |
 | G. Candea and A. Fox. Crash-only software. In HotOS, May 2003. |
 | G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. A
micro-rebootable system – Design, implementation, and evaluation. In OSDI, Dec
2004. |
 | M. Castro and B. Liskov. Practical Byzantine fault tolerance. In OSDI, 1999.
|
 | M. Castro and B. Liskov. Proactive recovery in a Byzantine-fault-tolerant
system. In OSDI, 2000. |
 | S. Chandra and P. M. Chen. The impact of recovery mechanisms on the
likelihood of saving corrupted state. In 13th International Symposium on
Software Reliability Engineering, 2002. |
 | E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. A survey of
rollback-recovery protocols in message-passing systems. Technical report, TR
CMU-CS-96-181, Carnegie Mellon University, 1996. |
 | S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi. On the analysis of
software rejuvenation policies. In COMPASS, Jun 1997. |
 | Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton. Software
rejuvenation: Analysis, module and applications. In FTCS, Jun 1995. |
 | D. Johnson and W. Zwaenepoel. Recovery in distributed systems using
optimistic message logging and checkpointing. In PODC, Aug 1988. |
 | D. B. Johnson and W. Zwaenepoel. Recovery in distributed systems using
optimistic message logging and checkpointing. Journal of Algorithms,
11(3):462–491, 1990. |
 | K. Li, J. Naughton, and J. Plank. Real-time, concurrent checkpoint for
parallel programs. In PPoPP, Mar 1990. |
 | D. E. Lowell and P. M. Chen. Free transactions with Rio Vista. In SOSP.
ACM Press, 1997. |
 | D. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P.
Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, N. Sastry, W.
Tetzlaff, J. Traupman, and N. Treuhaft. Recovery oriented computing (ROC):
Motivation, definition, techniques, and case studies. Technical Report
UCB//CSD-02-1175, U.C. Berkeley, Mar 2002. |
 | B. Randell. System structure for software fault tolerance. IEEE TSE, 1(2),
Jun 1975. |
 | B. Randell, P. A. Lee, and P. C. Treleaven. Reliability issues in
computing system design. ACM Computing Surveys, 10(2), Jun 1978. |
 | M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee, Jr.
Enhancing server availability and security through failure-oblivious
computing. In OSDI, Dec 2004. |
 | R. Rodrigues, M. Castro, and B. Liskov. BASE: Using abstraction to improve
fault tolerance. In SOSP, 2001. |
 | M. Russinovich and B. Cogswell. Replay for concurrent nondeterministic
shared-memory applications. In PLDI, 1996. |
 | S. Sidiroglou, M. E. Locasto, S. W. Boyd, and A. D. Keromytis. Building a
reactive immune system for software services. In USENIX ATC, Apr 2005. |
 | R. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM
TOCS, 3(3):204–226, 1985. |
 | M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy. Recovering
device drivers. In OSDI, 2004. |
 | W. Vogels, D. Dumitriu, K. Birman, R. Gamache, M. Massa, R. Short, J. Vert,
J. Barrera, and J. Gray. The design and architecture of the Microsoft Cluster
Service. In FTCS, Jun 1998. |
 | Y.-M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for software error
recovery in distributed systems. In FTCS, Jun 1993. |
 | Y. Zhou, P. M. Chen, and K. Li. Fast cluster failover using virtual
memory-mapped communication. In ICS, Jun 1999. |