CTSS Version 4

From TeraGrid Wiki

CTSS 4 went into production on the TeraGrid in August 2007. The most significant differences between CTSS 4 and the earlier CTSS 1-3 releases are the introduction of the capability kit architecture and significantly better documentation that is more accessible to TeraGrid users.

CTSS 4 testing began in January 2007 with incremental production rollout during Q1-Q2 2007. CTSS 4's capabilities were defined, designed, and implemented by appropriate teams within the TeraGrid community. The Software Integration area provided overall coordination. Major contributions to CTSS 4 were made by the Software Integration team, the Data working group, the Software working group, the Visualization working group, and the Scheduling working group.

High-level Structure

The CTSS Design document describes CTSS's high-level structure. However, to understand the capabilities offered in CTSS 4, you will need to review this page (not the high-level design document).

It is important to understand that not all TeraGrid resources support all CTSS capabilities. TeraGrid resources are carefully designed HPC and storage systems and not all CTSS capabilities are appropriate for all resources. CTSS 4 includes a new software and service capability registry that allows TeraGrid users to see which capabilities are supported on each resource, and which resources provide each capability. (See the TeraGrid Core Integration section below.) This offers a clearer view and understanding of the suite of capabilities offered across the entire TeraGrid system.

Process

The CTSS Process describes the process by which CTSS 4 was defined, designed, implemented, and deployed. In a nutshell, each major capability was defined, designed, and implemented by appropriate teams within the TeraGrid community. The software integration area provided overall coordination. Major contributions to CTSS 4 were made by the software integration team, the DV team, the software working group, and the data working group.

The initial definition of CTSS 4 capabilities is complete. Deployment of CTSS 4 capabilities began in January 2007 and the transition to production operation happened in August 2007.

CTSS 4 Change Management documents.

CTSS 4 Obsolete Capabilities kits.

Capabilities

All TeraGrid capabilities are defined and implemented as "capability kits". Capability kits include a Definition Document that describes the capabilities and driving use cases for the kit and an Implementation Document that provides the technical details on how the capabilities will be delivered. This document describes TeraGrid capabilities in the Class "ctss".

All TeraGrid HPC and storage resources that support CTSS capability kits must deploy the TeraGrid Core Integration kit, which provides the minimal capabilities needed to integrate the resource into the rest of the TeraGrid community (mainly to advertise which CTSS capabilities are provided). Other CTSS capability kits are optional, but we expect that most resources will implement most of the capabilities.

To install any of the following capability kits on an existing TeraGrid resource, follow the instructions in the kit's implementation document.

The CTSS 4 Bare Metal Installation describes how to install CTSS 4 on a machine that doesn't have any CTSS.

TeraGrid Core Integration

Definition:CTSS 4 TeraGrid Core Integration Capabilities
Implementation:CTSS 4 TeraGrid Core Integration Implementation

The only part of CTSS that TeraGrid resource providers must implement is the part that deals with operational integration: the TeraGrid Core Integration kit. The capabilities in this area enforce consistency in key areas: security, system information, verification & validation, and software deployment. Many of these capabilities are focused more on TeraGrid operators than on TeraGrid users, though they provide the fundamentals that make it possible (and easy) to use TeraGrid in advanced ways that users appreciate very much. (E.g., the ability to log into any authorized TeraGrid system using the same ID and password, or to submit jobs from one TeraGrid system to another without entering any ID/password at all.)

The most important capability provided by the TeraGrid Core Integration kit is the software and service capability registry, which registers the CTSS capabilities provided by each TeraGrid resource, plus key configuration and accessibility information about each capability.

We expect that the following capabilities will be useful to both TeraGrid staff and TeraGrid users.

  • Identify the TeraGrid resources providing a specific CTSS capability (new in CTSS 4)
  • Identify the CTSS capabilities provided by a specific TeraGrid resource (new in CTSS 4)
  • Discover the resource name and resource provider for the TeraGrid resource
  • Discover a system's local policies
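
The two registry lookups listed above (resources by capability, and capabilities by resource) can be pictured as two indexes built from the same registrations. The following is only a sketch of that idea, not the registry's actual interface; the resource and kit names are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical registrations: (resource name, capability kit name).
# These names are invented for illustration and are not real registry entries.
REGISTRATIONS = [
    ("resource-a", "remote-login"),
    ("resource-a", "remote-compute"),
    ("resource-b", "remote-login"),
    ("resource-b", "data-movement"),
]


def build_indexes(registrations):
    """Build both lookup directions from a flat list of registrations."""
    by_resource = defaultdict(set)
    by_capability = defaultdict(set)
    for resource, capability in registrations:
        by_resource[resource].add(capability)
        by_capability[capability].add(resource)
    return by_resource, by_capability


by_resource, by_capability = build_indexes(REGISTRATIONS)

# Which resources provide a specific capability?
print(sorted(by_capability["remote-login"]))
# Which capabilities does a specific resource provide?
print(sorted(by_resource["resource-b"]))
```

Keeping both indexes means either question can be answered with a single lookup.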

The following CTSS capabilities are primarily of interest to TeraGrid's resource provider and operations personnel.

  • Authorize and provision a new TeraGrid user on the resource
  • Verify that a CTSS capability is working properly on the resource
  • Install a new CTSS capability kit on the resource (updated in CTSS 4)
  • Publish the availability and current support level for a CTSS capability on the resource (new in CTSS 4)

Remote Login

Definition:CTSS 4 Remote Login Capability Kit
Implementation:CTSS 4 Remote Login Capability Implementation

Most TeraGrid systems allow users to establish an interactive, command-line login session from which they may issue commands to the system. CTSS coordinates the following capabilities among these systems.

  • Login to a TeraGrid resource
  • Login to one TeraGrid resource and then login to others without re-authenticating
  • Obtain a grid proxy certificate after logging in to a TeraGrid resource
  • Use default TeraGrid software tools and environment variables in the login shell without special setup
  • Customize the login shell environment to include non-default software
  • Move data into or out of a TeraGrid login system (basic methods) (updated in CTSS 4)

Remote Computation

Definition:CTSS 4 Remote Compute Capability
Implementation:CTSS 4 Remote Compute Capability Implementation

Most TeraGrid systems allow computation jobs to be submitted for processing from a remote system. CTSS provides a consistent way to do the following things on TeraGrid systems.

  • Remotely submit a simple job to the resource (updated in CTSS 4)
  • Submit a job with file staging in and out (updated in CTSS 4)
  • Check the status of a remotely submitted job (updated in CTSS 4)
  • Signal (manage) a remotely submitted job (updated in CTSS 4)
  • Specify the environment in which to run a remotely submitted job (updated in CTSS 4)
  • Obtain accounting information about a remotely submitted job (new in CTSS 4)

Data Movement Clients and Servers

CTSS4 Data Movement Clients
Definition:CTSS 4 Data Movement Client Capabilities
Implementation:CTSS 4 Data Movement Client Implementation
CTSS4 Data Movement Server
Definition:CTSS 4 Data Movement Server Capabilities
Implementation:CTSS 4 Data Movement Server Implementation

Data is vital to science. Modern science produces significantly more data than earlier methods, and data often has to be moved from one TeraGrid system to another during the production/use/analysis cycle. CTSS provides the following data movement capabilities.

  • Move files between TeraGrid systems without ID/passwords (updated in CTSS 4).
    • uberftp, globus-url-copy, gsiscp, tgcp and rft.
  • Move very large files (>50 GB) between TeraGrid systems with very high performance (updated in CTSS 4).
    • GridFTP: uberftp, globus-url-copy, tgcp.
  • Automate the movement of a large number of files between TeraGrid systems
    • rft.
  • Login to a TeraGrid system and securely move files in/out of SRB storage systems without additional ID/password
    • srb.
  • Move moderately large files (1-10 GB) to/from login nodes from/to remote systems and other TG login nodes with high performance.
    • hpn-scp.
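
As a rough illustration of how the list above maps transfer sizes to tools, the sketch below returns the tools named for each case. The function name, the 1000-file cutoff for "a large number of files", and the fallback for sizes the list does not cover are all assumptions made for this example, not TeraGrid guidance.

```python
def suggest_tools(size_gb, file_count=1):
    """Suggest transfer tools using the size ranges given in the list above.

    Sizes that fall outside the stated ranges return the general-purpose
    tools; that fallback is an assumption of this sketch.
    """
    if file_count > 1000:  # arbitrary cutoff standing in for "a large number"
        return ["rft"]
    if size_gb > 50:  # very large files: the GridFTP-based clients
        return ["uberftp", "globus-url-copy", "tgcp"]
    if 1 <= size_gb <= 10:  # moderately large files via login nodes
        return ["hpn-scp"]
    return ["uberftp", "globus-url-copy", "gsiscp", "tgcp", "rft"]


print(suggest_tools(200))
print(suggest_tools(5))
print(suggest_tools(0.1, file_count=5000))
```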

Data Management Servers

Definition:CTSS 4 Data Management Capabilities
Implementation:CTSS 4 Data Management Capability Implementation

Managing large amounts of data is also a challenge, especially when so much of modern science is conducted as collaborations. CTSS provides the following data management capabilities.

  • Replicate data in multiple storage systems for redundancy or improved access time
  • Maintain a registry of where copies of individual files can be found
  • Maintain an archive of data for later use or for use by others
  • Locate data that matches application-specific metadata specifications (e.g., temperature, elevation, region, energy level, trial number, etc.)
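
The registry and metadata capabilities above can be sketched as a small in-memory catalog that maps a logical file name to its replica locations and metadata. This is a toy model under stated assumptions: the class, its methods, and the example.org URLs are hypothetical and do not reflect the actual CTSS data management software.

```python
class ReplicaRegistry:
    """Toy catalog: logical file name -> replica URLs and metadata."""

    def __init__(self):
        self._replicas = {}  # logical name -> list of storage URLs
        self._metadata = {}  # logical name -> dict of metadata values

    def register(self, logical_name, url, **metadata):
        """Record one more copy of a file, plus any metadata about it."""
        urls = self._replicas.setdefault(logical_name, [])
        if url not in urls:
            urls.append(url)
        self._metadata.setdefault(logical_name, {}).update(metadata)

    def locations(self, logical_name):
        """Return every known location of the file (empty if unknown)."""
        return list(self._replicas.get(logical_name, []))

    def find(self, **criteria):
        """Return logical names whose metadata matches all criteria."""
        return sorted(
            name
            for name, meta in self._metadata.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


registry = ReplicaRegistry()
# Hypothetical hosts and paths, for illustration only.
registry.register("run42.h5", "gsiftp://store-a.example.org/data/run42.h5",
                  region="arctic", trial=42)
registry.register("run42.h5", "gsiftp://store-b.example.org/archive/run42.h5")
registry.register("run43.h5", "gsiftp://store-a.example.org/data/run43.h5",
                  region="tropics", trial=43)

print(registry.locations("run42.h5"))
print(registry.find(region="arctic"))
```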

Data Collections

CTSS4 Data Collections Client
Definition:CTSS 4 Data Collections Client Capabilities
Implementation:CTSS 4 Data Collections Client Implementation
CTSS4 Data Collections Server
Definition:CTSS 4 Data Collections Server Capabilities
Implementation:CTSS 4 Data Collections Server Implementation

These capabilities are under development. The implementation will be based on iRODS. Contact Chris Jordan for details.

Wide Area GPFS File System

Definition:CTSS 4 Wide Area File System Capabilities
Implementation:CTSS 4 Wide Area File System Implementation

The shared filesystem model is unquestionably the simplest for most applications to adapt to because it has the same interfaces as a local filesystem, which most applications already use. High performance, security, and scalability present interesting engineering and coordination challenges.

  • Write and read files with standard I/O, MPI-IO and HDF calls to a high performance, parallel filesystem mounted on multiple TeraGrid resources

Application Development & Runtime Support

Definition:CTSS 4 Application Development & Runtime Capabilities
Implementation:CTSS 4 Application Development & Runtime Support Implementation

  • Identify the software tools and application development libraries that are available on a TeraGrid system. (updated in CTSS 4)
  • Compile an application written in the C programming language on a TeraGrid system.
  • Compile an application written in the FORTRAN programming language on a TeraGrid system.
  • Configure the login environment (or job submission environment) to include a particular tool or library, or a particular version of that tool or library. (updated in CTSS 4)
  • Run an application script that requires a Globus Toolkit command-line tool.
  • Run an application that uses the Storage Resource Broker (SRB) client to store or retrieve data from SRB.
  • Run an application script that uses the TGCP command to transfer files between TeraGrid systems.

Science Workflow Support

Definition:CTSS 4 Science Workflow Support Capabilities
Implementation:CTSS 4 Science Workflow Support Implementation

Computational science often involves performing a large number of computational tasks, sometimes with elaborate coordination among those tasks. ("Do X, Y, and Z in parallel, then collect the resulting data and do A and B in sequence, then do...") Automation tools can improve productivity dramatically, and CTSS provides the following capabilities to support automation.

  • Run and manage a large number of uncoordinated parallel tasks using a single TeraGrid system
  • Run and manage a set of coordinated tasks using a single TeraGrid system
  • Execute either of the above workflows using multiple TeraGrid systems
  • Enable an existing workflow application (with its own workflow manager) to manage tasks on multiple TeraGrid systems
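
The coordination pattern quoted above ("do X, Y, and Z in parallel, then collect the resulting data and do A and B in sequence") can be sketched on a single machine with Python's standard library. This only illustrates the pattern; it is not one of the CTSS workflow tools, and `run_task` is a hypothetical stand-in for a real computational task.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(name, value):
    """Stand-in for a real computational task; returns a labelled result."""
    return f"{name}({value})"


def run_workflow(inputs):
    # Stage 1: X, Y, and Z are independent, so they run in parallel.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(run_task, name, inputs[name])
                   for name in ("X", "Y", "Z")]
        # Collect the results in submission order.
        collected = [future.result() for future in futures]

    # Stage 2: A and B run in sequence, each consuming the previous output.
    a_out = run_task("A", ",".join(collected))
    return run_task("B", a_out)


result = run_workflow({"X": 1, "Y": 2, "Z": 3})
print(result)
```

A workflow manager does the same bookkeeping, tracking which tasks are independent and which must wait, but dispatches the tasks to batch systems rather than to local threads.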

Science Gateway Support

Definition:CTSS 4 Science Gateway Capability
Implementation:CTSS 4 Science Gateway Capability Implementation

The Science Gateway kit defines capabilities that support the use of Community Accounts by Science Gateways on TeraGrid systems. Community accounts introduce new challenges for the TeraGrid project because these accounts are shared by multiple users of the science gateway. TeraGrid RPs require the ability to identify individual users of the community account and block unwanted activity in the account without blocking an entire science gateway capability. The TeraGrid project also requires the ability to perform usage accounting at the individual user level for community accounts. Some TeraGrid RPs also require the ability to restrict the use of the community account to the specific needs of the science gateway, to limit exposure in the case of account compromise.

This kit implements the following use cases:

  • Blocking unwanted behavior
  • Blocking undesirable netspace
  • After hours contact of user
  • Counting of gateway users
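
To make the community account requirements concrete, the sketch below tracks usage per gateway user inside one shared account and blocks a single user without disabling the whole gateway. The class, method names, and the account and user names are hypothetical; the real kit's mechanisms are described in its implementation document.

```python
from collections import Counter


class CommunityAccount:
    """Toy model of a shared account used by many gateway users."""

    def __init__(self, name):
        self.name = name
        self.blocked = set()    # gateway users who may not submit
        self.usage = Counter()  # gateway user -> CPU hours charged

    def block(self, gateway_user):
        """Block one gateway user; everyone else keeps working."""
        self.blocked.add(gateway_user)

    def submit(self, gateway_user, cpu_hours):
        """Charge a job to the individual user unless that user is blocked."""
        if gateway_user in self.blocked:
            return False
        self.usage[gateway_user] += cpu_hours
        return True

    def user_count(self):
        """Number of distinct gateway users with recorded usage."""
        return len(self.usage)


account = CommunityAccount("gateway-community")  # hypothetical account name
account.submit("alice", 10)
account.submit("bob", 4)
account.block("bob")
accepted = account.submit("bob", 100)

print(accepted, account.user_count(), dict(account.usage))
```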

Parallel Application Support

Definition:CTSS 4 Parallel Application Support Capability
Implementation:CTSS 4 Parallel Application Support Implementation

Much of science and engineering involves simulation and analysis tasks. Simulation and analysis activities can often be sped up dramatically by employing parallelism: the ability to run parts of the problem on several processors simultaneously, thereby improving overall throughput. CTSS provides the following capabilities for supporting parallel applications.

  • Discover which versions of the most common parallel tools are available on a TeraGrid system (updated in CTSS 4)
  • Configure your environment on a TeraGrid system to use a specific version of a parallel tool

Distributed Parallel Application Support

Definition:CTSS 4 Distributed Parallel Application Support Capability
Implementation:CTSS 4 Distributed Parallel Application Support Implementation

Most simulation and analysis applications require fast communication between parallel subjobs, and thus cannot be distributed across multiple TeraGrid systems over the TeraGrid network. A few advanced applications have, however, been adapted to run well in a distributed mode, and require parallel tools that can support distributed parallel operations. CTSS offers an MPI implementation that provides the following capabilities related to running a parallel application on multiple TeraGrid systems over the TeraGrid network.

  • Discover which versions of distributed parallel tools are available on a TeraGrid system (updated in CTSS 4)
  • Configure your environment on a TeraGrid system to use a specific version of the distributed parallel tools

Data Visualization Support

Definition:VTSS Data Visualization Support Capability
Implementation:VTSS Data Visualization Support Implementation

The Data Visualization Support Kit defines capabilities that enable users to perform basic visualization tasks, and provides software for the development of visualization tools and applications. The components of this kit comprise the Visualization TeraGrid Software and Services (VTSS). Sample use cases include:

  • Image manipulation
  • Fundamental visualization and data exploration (through ParaView, an end-user application)
  • Developing custom applications for visualizing large data sets.

Advance Reservation

Definition:AdvanceReservation_Capability_Kit
Implementation:Advance_Reservation_Implementation_Plan

The Advance Reservation Kit defines capabilities that allow TeraGrid users to request and manage advance reservations on a TeraGrid system. An advance reservation sets aside a set of resources (typically nodes) on a single TeraGrid system for a specific duration at a future time. This kit includes only the server-side capabilities needed to support reservations. Client-side capabilities are in the Application Development & Runtime Support kit. Sample use cases include:

  • Requesting an advance reservation and receiving a reservation identifier
  • Monitoring a reservation that has been made
  • Canceling a reservation
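
The request, monitor, and cancel use cases above form a simple lifecycle, sketched here with an in-memory model. This is an assumption-laden illustration rather than the kit's interface: the class names, the `res-N` identifier format, and the state strings are invented for the example.

```python
import itertools
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Reservation:
    reservation_id: str
    nodes: int
    start: datetime
    end: datetime
    state: str = "confirmed"


class ReservationBook:
    """In-memory stand-in for a scheduler's reservation service."""

    def __init__(self, total_nodes):
        self.total_nodes = total_nodes
        self._counter = itertools.count(1)
        self._reservations = {}

    def _nodes_reserved(self, start, end):
        # Conservative: counts every confirmed reservation that overlaps
        # the requested window at any point.
        return sum(
            r.nodes
            for r in self._reservations.values()
            if r.state == "confirmed" and r.start < end and start < r.end
        )

    def request(self, nodes, start, duration):
        """Return a reservation identifier, or None if nodes are not free."""
        end = start + duration
        if nodes + self._nodes_reserved(start, end) > self.total_nodes:
            return None
        reservation_id = f"res-{next(self._counter)}"
        self._reservations[reservation_id] = Reservation(
            reservation_id, nodes, start, end)
        return reservation_id

    def status(self, reservation_id):
        return self._reservations[reservation_id].state

    def cancel(self, reservation_id):
        self._reservations[reservation_id].state = "cancelled"


book = ReservationBook(total_nodes=64)
start = datetime(2007, 9, 1, 8, 0)
rid = book.request(nodes=48, start=start, duration=timedelta(hours=2))
print(rid, book.status(rid))
```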

Co-Scheduling

Definition:CoScheduling_Capability_Kit
Implementation:CoScheduling_Implementation_Plan

The Co-Scheduling Kit defines capabilities that allow TeraGrid users to request and manage co-allocations on a TeraGrid system. Co-scheduling is the coordinated reservation of resources (typically nodes) on two or more TeraGrid systems. This kit includes only the server-side capabilities needed to support co-scheduling a TeraGrid system along with other TeraGrid systems. Client-side capabilities are in the Application Development & Runtime Support kit. Sample use cases include:

  • Requesting a co-allocation and receiving an identifier (and/or identifiers)
  • Monitoring a co-allocation that has been made
  • Canceling a co-allocation
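
At its core, co-scheduling means finding a time when every participating system can hold a reservation at once. The sketch below finds the earliest common start across two systems, assuming each system reports its free windows as (start, end) pairs in hours; that representation and the function name are assumptions for illustration only.

```python
def earliest_common_start(windows_a, windows_b, duration):
    """Return the earliest start at which both systems are free for
    `duration` hours, or None if no common window is long enough.

    Each argument is a list of (start, end) free windows in hours.
    """
    best = None
    for a_start, a_end in windows_a:
        for b_start, b_end in windows_b:
            # The overlap of two windows runs from the later start
            # to the earlier end.
            start = max(a_start, b_start)
            end = min(a_end, b_end)
            if end - start >= duration and (best is None or start < best):
                best = start
    return best


system_a = [(0, 4), (10, 20)]
system_b = [(3, 6), (12, 18)]
print(earliest_common_start(system_a, system_b, duration=3))
```

With more than two systems, the same overlap computation can be repeated pairwise against the running set of common windows.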

Metascheduling

Definition:Metascheduling_Capability_Kit
Implementation:Metascheduling_Implementation_Plan

The Metascheduling Kit defines capabilities that allow TeraGrid users to have a system automatically selected for a job, rather than the user having to specify a system for each job. In addition, this kit supports the management of such jobs. This kit includes only the server-side capabilities needed to support metascheduling on a TeraGrid system. Client-side capabilities are in the Application Development & Runtime Support kit and the Science Workflow Support kit. Sample use cases include:

  • Describing a job
  • Selecting a system for a job
  • Managing the execution of jobs on the systems selected for them
  • Canceling a job
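
The "selecting a system for a job" use case can be sketched as filtering the systems that can run the job and then ranking them. The selection policy here (shortest estimated wait), the field names, and the system names are all hypothetical; a real metascheduler may weigh many more factors.

```python
def select_system(job, systems):
    """Pick the eligible system with the shortest estimated queue wait.

    Returns the system name, or None if no system is large enough.
    """
    eligible = [s for s in systems if s["max_nodes"] >= job["nodes"]]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s["est_wait_minutes"])["name"]


# Hypothetical systems and wait estimates, for illustration only.
systems = [
    {"name": "system-a", "max_nodes": 128, "est_wait_minutes": 90},
    {"name": "system-b", "max_nodes": 512, "est_wait_minutes": 240},
    {"name": "system-c", "max_nodes": 64, "est_wait_minutes": 5},
]

print(select_system({"nodes": 100}, systems))
```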

Local Computation

Definition:CTSS 4 Local Compute Capability
Implementation:CTSS 4 Local Compute Capability Implementation

The purpose of the Local Compute Kit is to support computation on a local TeraGrid resource (e.g., from a login node to its associated cluster). The kit does this primarily by providing information about the compute resource. Users can access this information to help choose which systems to request allocations on, choose a system to use on a particular day, find the batch scheduler, and so on. The specific capabilities included are:

  • Retrieving manually-specified information about the compute system.
  • Supporting the manual specification of information about the batch scheduler.
  • Querying the batch scheduler for information about the system.
  • Providing information about the compute resource to TeraGrid-wide information services.

Distributed Programming Systems

Definition:CTSS 4 Distributed Programming Systems Capability - Definition
Implementation:CTSS 4 Distributed Programming Systems Capability - Implementation
Deployment:CTSS 4 Distributed Programming Systems Capability - Availability

The purpose of the Distributed Programming Systems Kit is to provide the functionality needed to build distributed applications, tools, and frameworks that are independent of the details of the underlying infrastructure. The software included in this kit can be used to provide access layers for distributed systems and abstractions for applications, and thereby address the fundamental application design objectives of interoperability across different infrastructures, distributed scale-out, and adaptivity, while preserving simplicity. The specific capabilities included are:

  • Provide a common access layer to the TeraGrid/XD resources, including IaaS clouds
  • Provide application-level interoperability across the TG/XD and with other national and international PGS
  • Write simple frameworks and applications that submit jobs to distributed resources: grids, Condor pools, clouds
  • Manage data in a unified manner

Resource Provider Deployment

Resource providers integrating their resources into the TeraGrid must deploy the CTSS 4 TeraGrid Core Integration Capabilities kit. They must also choose to deploy one or more user capability kits. The Wiki page below tracks the choices each TeraGrid Resource Provider has made regarding which user capability kits they will provide on each TeraGrid resource.

Detailed Design & Implementation

Kit Implementation Plans

Kit Change Plans

Kit Registration

SoftEnv

Build & Test

Packaging
