2009 Fault Tolerance Workshop

Fault Tolerance for Extreme-Scale Computing


Abstract

The purpose of this workshop is to discuss fault tolerance on large systems running large, possibly long-running applications. The main goal is to have systems people, middleware people (including FT experts), and applications people discuss the issues and determine what needs to be done, mostly at the middleware and application levels, to run such applications on the coming petascale systems without faults causing large numbers of application failures.


One way of looking at the topic, and the potential discussion, is through the following questions (as suggested by Bill Gropp):

  1. How serious is the problem? This must *not* be the usual simplistic analysis based on the failure rates of commodity server nodes; it needs to look more closely at what happens in large systems where the designers take into account the scale of the system.
  2. How does one know if there is a fault (does one need to do more than trust the system)?
  3. What automatic methods (e.g., system level checkpoint/restart) are there, and what are their pros and cons, particularly with respect to performance impact and portability?
  4. What tools are available to manage user-level checkpoint/restarts?
  5. What algorithmic approaches are there to detect and repair faults as alternatives to checkpoint/restart?
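
As a point of reference for questions 1 and 3, the short Python sketch below works through the "usual simplistic analysis" that question 1 cautions against, followed by Young's first-order approximation for a checkpoint interval. It is only a baseline under stated assumptions: node failures are taken to be independent and exponentially distributed, and every number and function name here is hypothetical rather than drawn from any system discussed at the workshop.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Naive system MTBF, assuming independent, exponentially
    distributed node failures: the node MTBF divided by the node count."""
    return node_mtbf_hours / node_count


def young_checkpoint_interval_hours(checkpoint_cost_hours: float,
                                    mtbf_hours: float) -> float:
    """Young's first-order approximation of the checkpoint interval:
    sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    node_mtbf = 50_000.0    # hours per node
    nodes = 10_000          # nodes in the system
    checkpoint_cost = 0.1   # hours to write one checkpoint

    mtbf = system_mtbf_hours(node_mtbf, nodes)
    interval = young_checkpoint_interval_hours(checkpoint_cost, mtbf)
    # Rough fraction of run time spent writing checkpoints.
    overhead = checkpoint_cost / interval

    print(f"naive system MTBF: {mtbf:.2f} h")
    print(f"checkpoint interval (Young): {interval:.2f} h")
    print(f"approximate checkpoint overhead: {overhead:.0%}")
```

The naive model scales the failure rate linearly with node count, which is the simplification question 1 asks participants to look past; measured failure data from large systems would replace the hypothetical inputs.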

Organizers

  • Daniel S. Katz, LSU (Blue Waters/TeraGrid)
  • Scott Lathrop, ANL (Blue Waters/TeraGrid)
  • Bob Wilhelmson, NCSA (Blue Waters)
  • Nick Nystrom, PSC (TeraGrid)
  • Amit Majumdar, SDSC (TeraGrid)
  • Sergiu Sanielevici, PSC (TeraGrid)
  • Patrick Bridges, UNM
  • Carolyn Peters, ANL (TeraGrid)

Report

Media:FT_workshop_report.pdf

Dates and Location

March 19 - 20, 2009

Marriott Pyramid, Albuquerque

A room block is available for the nights of March 18 and 19, with rooms at $80.

Call 1-800 262-2043 and ask for the TeraGrid/Blue Waters PetaScale Workshop when making reservations.

The cut-off date for reservations is March 2nd.

Registration

http://www.teragrid.org/eot/petascale/registration.html

There is a $50 registration fee, and the username and password from the invitation are required to access the registration form.

Agenda & Presentations

Thursday, March 19th

8:00 - Opening/Context, Daniel S. Katz (LSU) - Introductions

8:30 - Fault Tolerance 101, Zbigniew Kalbarczyk (U. Illinois)

9:00 - Large Systems Session I

9:00 - Faults and Fault-Tolerance on the Argonne BG/P System, Rinku Gupta (ANL)

9:25 - Essential Feedback Loops, Jon Stearley (SNL)

Questions for each speaker:

  • what can you tell us about your fault/error situation currently?
  • what types of errors do you see?
  • what rate of errors?
  • what are you worried about in the future?
  • what do you think is needed to help?
  • what do you think is reasonable for apps people to do?

9:50 - break

10:20 - Large Systems Session II

10:20 - Experiences with Kraken, Patricia Kovatch (NICS)

10:45 - Growing Pains of Petascale Computing: Integrating Hardware, Software and Middleware for Successful Capacity and Throughput, Kent Milfeld (TACC)

11:10 - Rollback-Recovery in the Petascale Era, Mootaz Elnozahy (IBM)

11:35 - Large Systems Experience, Bill Kramer (NCSA)

12:00 - discussion

Questions for each speaker:

  • what can you tell us about your fault/error situation currently?
  • what types of errors do you see?
  • what rate of errors?
  • what are you worried about in the future?
  • what do you think is needed to help?
  • what do you think is reasonable for apps people to do?

12:30 - lunch

1:30 - I/O system session

1:30 - The Role of Storage in Exascale Fault Tolerance, Garth Gibson (CMU)

1:55 - I/O Fault Diagnosis in Software Storage Systems, Eric Schrock (Sun)

2:20 - stdchk: A Checkpoint Storage System for HPC Applications, Sudharshan Vazhkudai (ORNL)

2:45 - discussion

Questions for each speaker:

  • what can you tell us about your fault/error situation currently?
  • what types of errors do you see?
  • what rate of errors?
  • what are you worried about in the future?
  • what do you think is needed to help?
  • what do you think is reasonable for apps people to do?

3:00 - break

3:30 - FT tech session

3:30 - System-level Checkpoint/Restart with BLCR, Paul Hargrove (LBL)

3:55 - Scalable Fault Tolerance Schemes using Adaptive Runtime Support, Celso Mendes (U. Illinois)

4:20 - Handling Faults in a Global Address Space Programming Model, Sriram Krishnamoorthy (PNL)

4:45 - Implications of System Errors in the Context of Numerical Accuracy, Patty Hough (SNL)

5:10 - The Scalable Checkpoint / Restart (SCR) Library: Approaching File I/O Bandwidth of 1 TB/s, Adam Moody (LLNL)

5:35 - discussion

Questions for each speaker:

  • what have you done that people should be aware of?
  • how have you tested it?
  • who's using it?
  • what do you want from users to help you develop/test it?
  • what do you think is reasonable for apps people to do?
  • what errors/faults do you want to be aware of/notified of?

6:30 - dinner


Friday, March 20th

8:30 - Mixed FT tech / app session

8:30 - Sustained Exascale: The Challenge, George Bosilca (U. Tennessee)

8:55 - Between Application and System: Fault Tolerance Mechanisms for the Cactus Software Framework, Erik Schnetter (LSU)

9:20 - Fault Tolerance Support for HPC Tools and Applications: Scalability and Sustainability, Tony Drummond (LBL)

9:45 - Fault Tolerance in CREATE Phase 1, Lawrence Votta

10:10 - discussion

Questions for each FT speaker:

  • what have you done that people should be aware of?
  • how have you tested it?
  • who's using it?
  • what do you want from users to help you develop/test it?
  • what do you think is reasonable for apps people to do?
  • what errors/faults do you want to be aware of/notified of?

Questions for each app speaker:

  • what science/computing are you doing now, focused on computing more than science?
  • what are you worried about, particularly thinking of future grand challenge science?
  • what errors/faults do you want to be aware of/notified of?
  • what do you want from tools/technologies?
  • what have you done about it so far?
  • what are you planning to do?
  • what do you think is reasonable for apps people to do?

10:30 - break

11:00 - apps session

11:00 - Application Resilience for Truculent Systems, John Daly (Center for Exceptional Computing)

11:25 - Insights into Fault Tolerance from FLASH Production Runs on Top 3 Supercomputing Platforms, Don Lamb (U. Chicago)

11:50 - Analysis of Cluster Failures on Blue Gene Supercomputing Systems, Christopher Carothers (RPI) & Thomas Hacker (Purdue)

12:15 - discussion

Questions for each app speaker:

  • what science/computing are you doing now, focused on computing more than science?
  • what are you worried about, particularly thinking of future grand challenge science?
  • what errors/faults do you want to be aware of/notified of?
  • what do you want from tools/technologies?
  • what have you done about it so far?
  • what are you planning to do?
  • what do you think is reasonable for apps people to do?

12:30 - closing discussion

1:00 - end

Individual Talk Abstracts