CS598YYZ Fall 2005

Reliable and Robust Software Systems

Reading List

  • Downtime Incidents

    - Slashdot stories on bugs, viruses and downtime
    - Ariane 5 Flight 501 Failure, Inquiry Board Report, Prof. J. L. Lions, July 1996
    - Therac-25 Failure Description, Nancy Leveson, MIT
    - Navy Smart Ship USS Yorktown
    - Software glitches leave Navy Smart Ship dead in the water, Government Computer News, July 13, 1998
    - The Smart Ship is not the answer, U.S. Naval Institute Proceedings, June 1998
    - Human error called culprit in 3 rocket launch failures, Florida Today Space Online, June 16, 1999
    - Stock Market Outages Highlight Software Availability Issues, The Payne Report, July/August 2001
    - CERT/CC Advisories

  • Downtime Costs and Statistics

    - Improving Systems Availability, IBM Global Services
    - Comparison of High Availability Software and Hardware Clusters, D. H. Brown Associates, Inc.
    - Making Smart Investments to Reduce Unplanned Downtime, D. Scott, Gartner Group, 1999
    - Assessing the Costs of Application Downtime, D. Scott, Gartner Group, May 1998
    - Five T's of Database Availability, Standish Group Research Note, 1999
    - TCO (Total Cost of Ownership) in the Trenches, Standish Group Research Note, 1999

  • General Surveys and Classic Papers

    - Jim Gray. Why Do Computers Stop and What Can Be Done About It? Symposium on Reliability in Distributed Software and Database Systems, 1986, pp. 3-12. (Link is to a 1985 technical report version of the paper from Tandem Computers.)
    - M. Sullivan and R. Chillarege. Software defects and their impact on system availability – A study of field failures in operating systems. In FTCS, Jun 1991.
    - Fundamental Concepts of Dependability, Avizienis, Laprie & Randell
    - Lessons Learned From Delta-4, David Powell
    - Failure Analysis of an ORB in the Presence of Faults, LAAS, France
    - Fault Injection Tools and Techniques, Iyer et al.
    - Fault-Tolerant CORBA Standard, Object Management Group
    - Software Fault Tolerance: A Tutorial, Wilfredo Torres-Pomales, NASA/TM-2000-210616, October 2000, 66 pp.
    - J. F. Bartlett. A NonStop kernel. In SOSP, Dec 1981.
    - A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault tolerance under UNIX. ACM TOCS, 7(1), Feb 1989.
    - E. Marcus and H. Stern. Blueprints for High Availability. John Wiley & Sons, 2000.

  • Fault Injection

    - Jeffrey M. Voas, Gary McGraw, Lora Kassab, and Larry Voas. A 'Crystal Ball' for Software Liability. IEEE Computer, 30(6):29-36, 1997.
    - Jeffrey Voas, Frank Charron, Gary McGraw, Keith Miller, and Michael Friedman. Predicting How Badly "Good" Software Can Behave. IEEE Software, 14(4):73-83, 1997.
    - Ghani A. Kanawati, Nasser A. Kanawati, and Jacob A. Abraham. FERRARI: A Flexible Software-Based Fault and Error Injection System. IEEE Transactions on Computers, 44(2):248-260, 1995.
    - J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J. C. Fabre, J. C. Laprie, E. Martins, and D. Powell. Fault Injection for Dependability Validation: A Methodology and Some Applications. IEEE Transactions on Software Engineering, 16(2):166-182, 1990.
    - R. Chillarege and N. Bowen. Understanding Large System Failures - A Fault Injection Experiment. In 19th Int. Symp. on Fault-Tolerant Computing (FTCS-19), pages 356-363. IEEE Computer Society Press, 1989.

  • Fault Analysis

    - Norman E. Fenton and Niclas Ohlsson. Quantitative Analysis of Faults and Failures in a Complex Software System. IEEE Transactions on Software Engineering, 26(8):797-814, 2000.
    - Sharon E. Perl and Richard L. Sites. Studies of Windows NT Performance Using Dynamic Execution Traces. In Operating Systems Design and Implementation (OSDI), pages 169-183, 1996.
    - Barton Miller, David Koski, Cjin Pheow Lee, Vivekananda Maganty, Ravi Murthy, Ajitkumar Natarajan, and Jeff Steidl. Fuzz Revisited: A Re-examination of the Reliability of UNIX Utilities and Services. Technical Report CS-TR-1995-1268, 1995.
    - I. Lee and R. Iyer. Faults, Symptoms, and Software Fault Tolerance in the Tandem GUARDIAN Operating System. In 23rd Int. Symp. on Fault-Tolerant Computing (FTCS-23), pages 20-29. IEEE Computer Society Press, 1993.
    - M. Sullivan and R. Chillarege. A Comparison of Software Defects in Database Management Systems and Operating Systems. In 22nd Int. Symp. on Fault-Tolerant Computing (FTCS-22), pages 475-484. IEEE Computer Society Press, 1992.
    - M. Sullivan and R. Chillarege. Software defects and their impact on system availability - a study of field failures in operating systems. In 21st Int. Symp. on Fault-Tolerant Computing (FTCS-21), pages 2-9, 1991.
    - Barton P. Miller, Lars Fredriksen, and Bryan So. An empirical study of the reliability of UNIX utilities. Communications of the Association for Computing Machinery, 33(12):32-44, 1990.
    - Jim Gray. A Census of Tandem System Availability Between 1985 and 1990. Technical Report 90.1, Tandem Computers Incorporated, Cupertino, Calif., 1990.
    - W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z.-Y. Yang. Characterization of Linux kernel behavior under errors. In DSN, 2003.
    - S. Chandra and P. M. Chen. Whither generic recovery from application faults? A fault study using open-source software. In DSN/FTCS, Jun 2000.
    - D. E. Lowell, S. Chandra, and P. M. Chen. Exploring failure transparency and the limits of generic recovery. In OSDI, 2000.
    - P. J. Koopman and J. DeVale. Comparing the Robustness of POSIX Operating Systems. Proc. of 29th IEEE Symposium on Fault-Tolerant Computing (FTCS), 1999.

  • Testing

    - Nathan P. Kropp, Philip J. Koopman, and Daniel P. Siewiorek. Automated Robustness Testing of Off-the-Shelf Software Components. Proc. of 28th IEEE Symposium on Fault-Tolerant Computing (FTCS), 1998.

  • Checkpointing

    - Y. Chen, J. S. Plank, and K. Li. Clip: A checkpointing tool for message passing parallel programs. In SC, 1997.
    - Milos Prvulovic, Zheng Zhang, and Josep Torrellas. ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA), pages 111-122. IEEE Computer Society, 2002.
    - Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood. SafetyNet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA), pages 123-134. IEEE Computer Society, 2002.
    - D. E. Lowell and P. M. Chen. Discount checking: Transparent, low-overhead recovery for general applications. Technical Report CSE-TR-410-99, University of Michigan, Jul 1998.
    - J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE TPDS, 9(10), 1998.
    - Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In FTCS, Jun 1995.

  • Failure Recovery

    - Feng Qin, Joe Tucek, Jagadeesan Sundaresan, and Yuanyuan Zhou. Rx: Treating bugs as allergies - a safe method to survive software failures. To appear in the 20th ACM Symposium on Operating Systems Principles (SOSP'05), October 2005.
    - L. Alvisi and K. Marzullo. Trade-offs in implementing optimal message logging protocols. In SPDS, 1996.
    - C. Amza, A. Cox, and W. Zwaenepoel. Data replication strategies for fault tolerance and availability on commodity clusters. In DSN, Jun 2000.
    - A. Avizienis. The N-version approach to fault-tolerant software. IEEE TSE, SE-11(12), 1985.
    - A. Avizienis and L. Chen. On the implementation of N-version programming for software fault tolerance during execution. In COMPSAC, Nov 1977.
    - A. Bobbio and M. Sereno. Fine grained software rejuvenation models. In IPDS, Sep 1998.
    - A. Borg, J. Baumbach, and S. Glazer. A message system supporting fault tolerance. In SOSP, 1983.
    - T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. ACM Transactions on Computer Systems, 14(1):80–107, Feb. 1996.
    - G. Candea, J. Cutler, A. Fox, R. Doshi, P. Garg, and R. Gowda. Reducing recovery time in a small recursively restartable system. In DSN, Jun 2002.
    - G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In HotOS, May 2001.
    - G. Candea and A. Fox. Crash-only software. In HotOS, May 2003.
    - G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. A microrebootable system – Design, implementation, and evaluation. In OSDI, Dec 2004.
    - M. Castro and B. Liskov. Practical Byzantine fault tolerance. In OSDI, 1999.
    - M. Castro and B. Liskov. Proactive recovery in a Byzantine-fault-tolerant system. In OSDI, 2000.
    - S. Chandra and P. M. Chen. The impact of recovery mechanisms on the likelihood of saving corrupted state. In 13th International Symposium on Software Reliability Engineering, 2002.
    - E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. A survey of rollback-recovery protocols in message-passing systems. Technical Report CMU-CS-96-181, Carnegie Mellon University, 1996.
    - S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi. On the analysis of software rejuvenation policies. In COMPASS, Jun 1997.
    - Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton. Software rejuvenation: Analysis, module and applications. In FTCS, Jun 1995.
    - D. Johnson and W. Zwaenepoel. Recovery in distributed systems using optimistic message logging and checkpointing. In PODC, Aug 1988.
    - D. B. Johnson and W. Zwaenepoel. Recovery in distributed systems using optimistic message logging and checkpointing. Journal of Algorithms, 11(3):462–491, 1990.
    - K. Li, J. Naughton, and J. Plank. Concurrent real-time checkpoint for parallel programs. In PPoPP, Mar 1990.
    - D. E. Lowell and P. M. Chen. Free transactions with Rio Vista. In SOSP. ACM Press, 1997.
    - D. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P. Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, N. Sastry, W. Tetzlaff, J. Traupman, and N. Treuhaft. Recovery oriented computing (ROC): Motivation, definition, techniques, and case studies. Technical Report UCB//CSD-02-1175, U.C. Berkeley, Mar 2002.
    - B. Randell. System structure for software fault tolerance. IEEE TSE, 1(2), Jun 1975.
    - B. Randell, P. A. Lee, and P. C. Treleaven. Reliability issues in computing system design. ACM Computing Surveys, 10(2), Jun 1978.
    - M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee, Jr. Enhancing server availability and security through failure-oblivious computing. In OSDI, Dec 2004.
    - R. Rodrigues, M. Castro, and B. Liskov. BASE: Using abstraction to improve fault tolerance. In SOSP, 2001.
    - M. Russinovich and B. Cogswell. Replay for concurrent nondeterministic shared-memory applications. In PLDI, 1996.
    - S. Sidiroglou, M. E. Locasto, S. W. Boyd, and A. D. Keromytis. Building a reactive immune system for software services. In USENIX ATC, Apr 2005.
    - R. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM TOCS, 3(3):204–226, 1985.
    - M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy. Recovering device drivers. In OSDI, 2004.
    - W. Vogels, D. Dumitriu, K. Birman, R. Gamache, M. Massa, R. Short, J. Vert, J. Barrera, and J. Gray. The design and architecture of the Microsoft Cluster Service. In FTCS, Jun 1998.
    - Y.-M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for software error recovery in distributed systems. In FTCS, Jun 1993.
    - Y. Zhou, P. M. Chen, and K. Li. Fast cluster failover using virtual memory-mapped communication. In ICS, Jun 1999.


    Last updated: 08/15/05.