CTSS Version 4
From TeraGrid Wiki
CTSS 4 went into production on the TeraGrid in August 2007. The most significant differences between CTSS 4 and the earlier CTSS 1-3 releases are the introduction of the capability kit architecture and significantly better documentation that is more accessible to TeraGrid users.
CTSS 4 testing began in January 2007 with incremental production rollout during Q1-Q2 2007. CTSS 4's capabilities were defined, designed, and implemented by appropriate teams within the TeraGrid community. The Software Integration area provided overall coordination. Major contributions to CTSS 4 were made by the Software Integration team, the Data working group, the Software working group, the Visualization working group, and the Scheduling working group.
The CTSS Design document describes CTSS's high-level structure. However, to understand the capabilities offered in CTSS 4, you will need to review this page (not the high-level design document).
It is important to understand that not all TeraGrid resources support all CTSS capabilities. TeraGrid resources are carefully designed HPC and storage systems and not all CTSS capabilities are appropriate for all resources. CTSS 4 includes a new software and service capability registry that allows TeraGrid users to see which capabilities are supported on each resource, and which resources provide each capability. (See the TeraGrid Core Integration section below.) This offers a clearer view and understanding of the suite of capabilities offered across the entire TeraGrid system.
The CTSS Process describes the process by which CTSS 4 was defined, designed, implemented, and deployed. In a nutshell, each major capability was defined, designed, and implemented by appropriate teams within the TeraGrid community. The software integration area provided overall coordination. Major contributions to CTSS 4 were made by the software integration team, the DV team, the software working group, and the data working group.
The initial definition of CTSS 4 capabilities is complete. Deployment of CTSS 4 capabilities began in January 2007 and the transition to production operation happened in August 2007.
All TeraGrid capabilities are defined and implemented as "capability kits". Capability kits include a Definition Document that describes the capabilities and driving use cases for the kit and an Implementation Document that provides the technical details on how the capabilities will be delivered. This document describes TeraGrid capabilities in the Class "ctss".
All TeraGrid HPC and storage resources supporting CTSS capability kits must deploy the TeraGrid Core Integration kit, which provides the minimal capabilities needed to integrate the resource into the rest of the TeraGrid community (mainly the ability to advertise which CTSS capabilities are provided). Other CTSS capability kits are optional, but we expect that most resources will implement most of the capabilities.
To install any of the following capability kits on an existing TeraGrid resource, follow the instructions in the kit's implementation document.
The CTSS 4 Bare Metal Installation describes how to install CTSS 4 on a machine that doesn't have any CTSS installed.
TeraGrid Core Integration
|Definition:||CTSS 4 TeraGrid Core Integration Capabilities|
|Implementation:||CTSS 4 TeraGrid Core Integration Implementation|
The only part of CTSS that TeraGrid resource providers must implement is the part that deals with operational integration: the TeraGrid Core Integration kit. The capabilities in this area enforce consistency in key areas: security, system information, verification & validation, and software deployment. Many of these capabilities are focused more on TeraGrid operators than on TeraGrid users, though they provide the fundamentals that make it possible (and easy) to use TeraGrid in advanced ways that users appreciate very much. (E.g., the ability to log into any authorized TeraGrid system using the same ID and password, or to submit jobs from one TeraGrid system to another without entering any ID/password at all.)
The most important capability provided by the TeraGrid Core Integration kit is the software and service capability registry, which registers the CTSS capabilities provided by each TeraGrid resource, plus key configuration and accessibility information about each capability.
We expect that the following capabilities will be useful to both TeraGrid staff and TeraGrid users.
- Identify the TeraGrid resources providing a specific CTSS capability (new in CTSS 4)
- Identify the CTSS capabilities provided by a specific TeraGrid resource (new in CTSS 4)
- Discover the resource name and resource provider for the TeraGrid resource
- Discover a system's local policies
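The first two lookups above (resources by capability, and capabilities by resource) amount to a two-way index. The following is a minimal sketch of that idea only, not the registry's actual implementation or interface; the class name, method names, resource hostnames, and capability labels are all hypothetical.

```python
from collections import defaultdict


class CapabilityRegistry:
    """Toy in-memory stand-in for a capability registry (hypothetical API)."""

    def __init__(self):
        self._by_resource = defaultdict(set)    # resource -> capabilities
        self._by_capability = defaultdict(set)  # capability -> resources

    def register(self, resource, capability):
        """Record that `resource` advertises `capability`."""
        self._by_resource[resource].add(capability)
        self._by_capability[capability].add(resource)

    def resources_providing(self, capability):
        """Which resources provide this capability?"""
        return sorted(self._by_capability.get(capability, set()))

    def capabilities_of(self, resource):
        """Which capabilities does this resource provide?"""
        return sorted(self._by_resource.get(resource, set()))


# Hypothetical resources and capability labels, for illustration only.
registry = CapabilityRegistry()
registry.register("resource-a.example.org", "remote-login")
registry.register("resource-a.example.org", "data-movement")
registry.register("resource-b.example.org", "remote-login")

print(registry.resources_providing("remote-login"))
print(registry.capabilities_of("resource-a.example.org"))
```

Keeping both indexes up to date at registration time makes either query a single dictionary lookup, at the cost of storing each fact twice.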
The following CTSS capabilities are primarily of interest to TeraGrid's resource provider and operations personnel.
- Authorize and provision a new TeraGrid user on the resource
- Verify that a CTSS capability is working properly on the resource
- Install a new CTSS capability kit on the resource (updated in CTSS 4)
- Publish the availability and current support level for a CTSS capability on the resource (new in CTSS 4)
Remote Login
|Definition:||CTSS 4 Remote Login Capability Kit|
|Implementation:||CTSS 4 Remote Login Capability Implementation|
Most TeraGrid systems allow users to establish an interactive, command-line login session from which they may issue commands to the system. CTSS coordinates the following capabilities among these systems.
- Log in to a TeraGrid resource
- Log in to one TeraGrid resource and then log in to others without re-authenticating
- Obtain a grid proxy certificate after logging in to a TeraGrid resource
- Use default TeraGrid software tools and environment variables in the login shell without special setup
- Customize the login shell environment to include non-default software
- Move data into or out of a TeraGrid login system (basic methods) (updated in CTSS 4)
Remote Compute
|Definition:||CTSS 4 Remote Compute Capability|
|Implementation:||CTSS 4 Remote Compute Capability Implementation|
Most TeraGrid systems allow computation jobs to be submitted for processing from a remote system. CTSS provides a consistent way to do the following things on TeraGrid systems.
- Remotely submit a simple job to the resource (updated in CTSS 4)
- Submit a job with file staging in and out (updated in CTSS 4)
- Check the status of a remotely submitted job (updated in CTSS 4)
- Signal (manage) a remotely submitted job (updated in CTSS 4)
- Specify the environment in which to run a remotely submitted job (updated in CTSS 4)
- Obtain accounting information about a remotely submitted job (new in CTSS 4)
Data Movement Clients and Servers
|CTSS4 Data Movement Clients|
|Definition:||CTSS 4 Data Movement Client Capabilities|
|Implementation:||CTSS 4 Data Movement Client Implementation|
|CTSS4 Data Movement Server|
|Definition:||CTSS 4 Data Movement Server Capabilities|
|Implementation:||CTSS 4 Data Movement Server Implementation|
Data is vital to science. Modern science produces significantly more data than earlier methods, and data often has to be moved from one TeraGrid system to another during the production/use/analysis cycle. CTSS provides the following data movement capabilities.
- Move files between TeraGrid systems without ID/passwords (updated in CTSS 4).
  - Tools: uberftp, globus-url-copy, gsiscp, tgcp, and rft.
- Move very large files (>50 GB) between TeraGrid systems with very high performance (updated in CTSS 4).
  - GridFTP tools: uberftp, globus-url-copy, and tgcp.
- Automate the movement of a large number of files between TeraGrid systems
- Log in to a TeraGrid system and securely move files in/out of SRB storage systems without additional ID/password
- Move moderately large files (1-10 GB) to/from login nodes from/to remote systems and other TG login nodes with high performance.
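As an illustration of scripting a transfer with one of the GridFTP clients listed above, the sketch below assembles a `globus-url-copy` command line in Python. The `-p` (number of parallel streams) and `-vb` (performance output) options are taken from the Globus Toolkit's `globus-url-copy` usage and should be confirmed against the locally installed version. The helper name, hostnames, and paths are hypothetical, and the sketch only builds and prints the command rather than running it.

```python
import shlex


def build_gridftp_command(src_url, dst_url, parallel_streams=4):
    """Return an argv list for a globus-url-copy transfer (hypothetical helper).

    Assumes -p sets the number of parallel data streams and -vb prints
    transfer performance; verify both against the installed client.
    """
    if parallel_streams < 1:
        raise ValueError("parallel_streams must be at least 1")
    return ["globus-url-copy", "-vb", "-p", str(parallel_streams), src_url, dst_url]


# Hypothetical endpoints; real GridFTP server names are site-specific.
cmd = build_gridftp_command(
    "gsiftp://gridftp.site-a.example.org/scratch/run42.tar",
    "gsiftp://gridftp.site-b.example.org/scratch/run42.tar",
    parallel_streams=8,
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to execute it
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.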
Data Management Servers
|Definition:||CTSS 4 Data Management Capabilities|
|Implementation:||CTSS 4 Data Management Capability Implementation|
Managing large amounts of data is also a challenge, especially when so much of modern science is conducted as collaborations. CTSS provides the following data management capabilities.
- Replicate data in multiple storage systems for redundancy or improved access time
- Maintain a registry of where copies of individual files can be found
- Maintain an archive of data for later use or for use by others
- Locate data that matches application-specific metadata specifications (e.g., temperature, elevation, region, energy level, trial number, etc.)
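The replica registry and metadata search described above can be pictured as two small maps: logical file name to physical copies, and logical file name to attributes. The following is a minimal sketch under that assumption; it does not reflect the interface of any CTSS component, and the class, file names, URLs, and attribute names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaCatalog:
    """Toy replica catalog (hypothetical; not a CTSS interface)."""

    locations: dict = field(default_factory=dict)  # logical name -> set of URLs
    metadata: dict = field(default_factory=dict)   # logical name -> attributes

    def add_replica(self, logical_name, url, **attrs):
        """Register one physical copy and optionally attach metadata."""
        self.locations.setdefault(logical_name, set()).add(url)
        self.metadata.setdefault(logical_name, {}).update(attrs)

    def replicas(self, logical_name):
        """List every known copy of a logical file."""
        return sorted(self.locations.get(logical_name, set()))

    def find(self, **criteria):
        """Return logical names whose metadata matches all criteria."""
        return sorted(
            name
            for name, attrs in self.metadata.items()
            if all(attrs.get(key) == value for key, value in criteria.items())
        )


# Hypothetical storage hosts and application-specific metadata.
catalog = ReplicaCatalog()
catalog.add_replica("climate/run1.nc", "gsiftp://store-a.example.org/data/run1.nc",
                    region="arctic", trial=1)
catalog.add_replica("climate/run1.nc", "gsiftp://store-b.example.org/archive/run1.nc")
catalog.add_replica("climate/run2.nc", "gsiftp://store-a.example.org/data/run2.nc",
                    region="tropics", trial=2)

print(catalog.replicas("climate/run1.nc"))
print(catalog.find(region="arctic"))
```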
Data Collections
|CTSS4 Data Collections Client|
|Definition:||CTSS 4 Data Collections Client Capabilities|
|Implementation:||CTSS 4 Data Collections Client Implementation|
|CTSS4 Data Collections Server|
|Definition:||CTSS 4 Data Collections Server Capabilities|
|Implementation:||CTSS 4 Data Collections Server Implementation|
These capabilities are under development. The implementation will be based on iRODS. Contact Chris Jordan for details.
Wide Area GPFS File System
|Definition:||CTSS 4 Wide Area File System Capabilities|
|Implementation:||CTSS 4 Wide Area File System Implementation|
The shared filesystem model is the simplest for most applications to adapt to because it has the same interfaces as a local filesystem, which most applications already use. High performance, security, and scalability present interesting engineering and coordination challenges.
- Write and read files with standard I/O, MPI-IO and HDF calls to a high performance, parallel filesystem mounted on multiple TeraGrid resources
Application Development & Runtime Support
|Definition:||CTSS 4 Application Development & Runtime Capabilities|
|Implementation:||CTSS 4 Application Development & Runtime Support Implementation|
- Identify the software tools and application development libraries that are available on a TeraGrid system. (updated in CTSS 4)
- Compile an application written in the C programming language on a TeraGrid system.
- Compile an application written in the FORTRAN programming language on a TeraGrid system.
- Configure the login environment (or job submission environment) to include a particular tool or library, or a particular version of that tool or library. (updated in CTSS 4)
- Run an application script that requires a Globus Toolkit command-line tool.
- Run an application that uses the Storage Resource Broker (SRB) client to store or retrieve data from SRB.
- Run an application script that uses the TGCP command to transfer files between TeraGrid systems.
Science Workflow Support
|Definition:||CTSS 4 Science Workflow Support Capabilities|
|Implementation:||CTSS 4 Science Workflow Support Implementation|
Computational science often involves performing a large number of computational tasks, sometimes with elaborate coordination among those tasks. ("Do X, Y, and Z in parallel, then collect the resulting data and do A and B in sequence, then do...") Automation tools can improve productivity dramatically, and CTSS provides the following capabilities to support automation.
- Run and manage a large number of uncoordinated parallel tasks using a single TeraGrid system
- Run and manage a set of coordinated tasks using a single TeraGrid system
- Execute either of the above workflows using multiple TeraGrid systems
- Enable an existing workflow application (with its own workflow manager) to manage tasks on multiple TeraGrid systems
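The coordination pattern quoted above ("do X, Y, and Z in parallel, then do A and B in sequence") can be sketched with Python's standard library. This is only an illustration of the pattern on a single machine, not one of the workflow tools that CTSS provides; `run_stage` is a hypothetical placeholder for real work such as submitting a job.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(name, inputs=()):
    """Placeholder task: records which inputs the stage consumed."""
    return f"{name}({','.join(inputs)})"


def run_workflow():
    # Fan out: X, Y, and Z have no dependencies, so they run concurrently.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(run_stage, name) for name in ("X", "Y", "Z")]
        # result() blocks until each task finishes; order follows submission.
        parallel_results = [future.result() for future in futures]

    # Fan in: A needs all three results, and B needs A's output.
    a_result = run_stage("A", parallel_results)
    return run_stage("B", [a_result])


result = run_workflow()
print(result)
```

The returned string nests each stage's inputs, which makes the dependency order visible in the output.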
Science Gateway Support
|Definition:||CTSS 4 Science Gateway Capability|
|Implementation:||CTSS 4 Science Gateway Capability Implementation|
The Science Gateway kit defines capabilities that support the use of Community Accounts by Science Gateways on TeraGrid systems. Community accounts introduce new challenges for the TeraGrid project because these accounts are shared by multiple users of the science gateway. TeraGrid RPs require the ability to identify individual users of the community account and block unwanted activity in the account without blocking an entire science gateway capability. The TeraGrid project also requires the ability to perform usage accounting at the individual user level for community accounts. Some TeraGrid RPs also require the ability to restrict the use of the community account to the specific needs of the science gateway, to limit exposure in the case of account compromise.
This kit implements the following use cases:
- Blocking unwanted behavior
- Blocking undesirable netspace
- After hours contact of user
- Counting of gateway users
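To make the community account requirements concrete, the sketch below tracks activity per gateway user behind a single shared account, so that one user can be blocked or counted without affecting the rest of the gateway. It is a toy model under stated assumptions, not the kit's mechanism; the class, method names, and user identifiers are hypothetical.

```python
from collections import Counter


class CommunityAccount:
    """Toy model of a shared gateway account (hypothetical, illustrative only)."""

    def __init__(self, gateway_name):
        self.gateway_name = gateway_name
        self.blocked_users = set()
        self.jobs_per_user = Counter()

    def block(self, gateway_user):
        """Block one gateway user without disabling the whole account."""
        self.blocked_users.add(gateway_user)

    def submit(self, gateway_user, job_description):
        """Accept a job unless the individual user is blocked."""
        if gateway_user in self.blocked_users:
            return False
        self.jobs_per_user[gateway_user] += 1
        return True

    def distinct_users(self):
        """Count the gateway users who have run at least one job."""
        return len(self.jobs_per_user)


account = CommunityAccount("example-gateway")
account.submit("alice", "simulation 1")
account.submit("alice", "simulation 2")
account.submit("bob", "analysis 1")
account.block("mallory")
accepted = account.submit("mallory", "unwanted job")

print(accepted, account.distinct_users(), dict(account.jobs_per_user))
```

The key design point is that the gateway passes an individual user identifier with every request, which is what makes per-user blocking and accounting possible.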
Parallel Application Support
|Definition:||CTSS 4 Parallel Application Support Capability|
|Implementation:||CTSS 4 Parallel Application Support Implementation|
Much of science and engineering involves simulation and analysis tasks. Simulation and analysis activities can often be sped up dramatically by employing parallelism: the ability to run parts of the problem on several processors simultaneously, thereby improving overall throughput. CTSS provides the following capabilities for supporting parallel applications.
- Discover which versions of the most common parallel tools are available on a TeraGrid system (updated in CTSS 4)
- Configure your environment on a TeraGrid system to use a specific version of a parallel tool
Distributed Parallel Application Support
|Definition:||CTSS 4 Distributed Parallel Application Support Capability|
|Implementation:||CTSS 4 Distributed Parallel Application Support Implementation|
Most simulation and analysis applications require fast communication between parallel subjobs, and thus cannot be distributed across multiple TeraGrid systems over the TeraGrid network. A few advanced applications have, however, been adapted to run well in a distributed mode, and require parallel tools that can support distributed parallel operations. CTSS offers an MPI implementation that provides the following capabilities related to running a parallel application on multiple TeraGrid systems over the TeraGrid network.
- Discover which versions of distributed parallel tools are available on a TeraGrid system (updated in CTSS 4)
- Configure your environment on a TeraGrid system to use a specific version of the distributed parallel tools
Data Visualization Support
|Definition:||VTSS Data Visualization Support Capability|
|Implementation:||VTSS Data Visualization Support Implementation|
The Data Visualization Support Kit defines capabilities that enable users to perform basic visualization tasks, and provides software for the development of visualization tools and applications. The components of this kit comprise the Visualization TeraGrid Software and Services (VTSS). Sample use cases include:
- Image manipulation
- Fundamental visualization and data exploration (through ParaView, an end-user application)
- Developing custom applications for visualizing large data sets.
Advance Reservation
The Advance Reservation Kit defines capabilities that allow TeraGrid users to request and manage advance reservations on a TeraGrid system. An advance reservation sets aside a set of resources (typically nodes) on a single TeraGrid system for a specific duration at a future time. This kit includes only the server-side capabilities needed to support reservations. Client-side capabilities are in the Application Development & Runtime Support kit. Sample use cases include:
- Requesting an advance reservation and receiving a reservation identifier
- Monitoring a reservation that has been made
- Canceling a reservation
Co-Scheduling
The Co-Scheduling Kit defines capabilities that allow TeraGrid users to request and manage co-allocations on a TeraGrid system. Co-scheduling coordinates the reservation of resources (typically nodes) on two or more TeraGrid systems. This kit includes only the server-side capabilities needed to support co-scheduling a TeraGrid system along with other TeraGrid systems. Client-side capabilities are in the Application Development & Runtime Support kit. Sample use cases include:
- Requesting a co-allocation and receiving an identifier (and/or identifiers)
- Monitoring a co-allocation that has been made
- Canceling a co-allocation
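At its core, a co-allocation requires a time window in which every participating system is free. The sketch below finds the earliest such window by intersecting per-system free intervals. It is an illustration of the idea only, not the kit's algorithm; the function names and system names are hypothetical, and each system's windows are assumed to be sorted, non-overlapping `(start, end)` pairs in the same time unit.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) windows."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever window finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def earliest_common_start(free_windows_by_system, duration):
    """Return the earliest start at which all systems are free for `duration`."""
    systems = list(free_windows_by_system.values())
    if not systems:
        return None
    common = systems[0]
    for windows in systems[1:]:
        common = intersect(common, windows)
    for start, end in common:
        if end - start >= duration:
            return start
    return None


# Hypothetical availability, in hours from now.
free = {
    "system-a": [(0, 4), (6, 12)],
    "system-b": [(3, 8), (9, 15)],
}
print(earliest_common_start(free, duration=3))
```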
Metascheduling
The Metascheduling Kit defines capabilities that allow TeraGrid users to have a system automatically selected for a job, rather than the user having to specify a system for each job. In addition, this kit supports the management of such jobs. This kit includes only the server-side capabilities needed to support metascheduling on a TeraGrid system. Client-side capabilities are in the Application Development & Runtime Support kit and the Science Workflow Support kit. Sample use cases include:
- Describing a job
- Selecting a system for a job
- Managing the execution of jobs on the systems selected for them
- Canceling a job
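The "selecting a system for a job" step can be illustrated with a simple policy: among the systems large enough for the job, pick the one with the shortest estimated queue wait. This is one possible selection rule chosen for illustration, not the policy used by the kit; the data class, its fields, and the system names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemStatus:
    """Snapshot of one system (hypothetical fields)."""

    name: str
    total_nodes: int
    estimated_wait_minutes: float


def select_system(systems, nodes_needed):
    """Pick the large-enough system with the shortest estimated wait."""
    candidates = [s for s in systems if s.total_nodes >= nodes_needed]
    if not candidates:
        return None  # no system can run a job of this size
    return min(candidates, key=lambda s: s.estimated_wait_minutes)


# Hypothetical system snapshots.
systems = [
    SystemStatus("system-a", total_nodes=64, estimated_wait_minutes=15.0),
    SystemStatus("system-b", total_nodes=512, estimated_wait_minutes=90.0),
    SystemStatus("system-c", total_nodes=256, estimated_wait_minutes=40.0),
]

choice = select_system(systems, nodes_needed=128)
print(choice.name if choice else "no suitable system")
```

A production metascheduler would also weigh factors such as allocations, data locality, and software availability, which this sketch leaves out.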
Local Compute
|Definition:||CTSS 4 Local Compute Capability|
|Implementation:||CTSS 4 Local Compute Capability Implementation|
The purpose of the Local Compute Kit is to support computation on a local TeraGrid resource (e.g., from a login node to its associated cluster). The kit does this primarily by providing information about the compute resource. Users can use this information to help select systems on which to apply for an allocation, select a system to use on a particular day, find the batch scheduler, and so on. The specific capabilities included are:
- Retrieving manually-specified information about the compute system.
- Supporting the manual specification of information about the batch scheduler.
- Querying the batch scheduler for information about the system.
- Providing information about the compute resource to TeraGrid-wide information services.
Distributed Programming Systems
|Definition:||CTSS 4 Distributed Programming Systems Capability - Definition|
|Implementation:||CTSS 4 Distributed Programming Systems Capability - Implementation|
|Deployment:||CTSS 4 Distributed Programming Systems Capability - Availability|
The purpose of the Distributed Programming Systems Kit is to provide the functionality needed to build distributed applications, tools, and frameworks that are independent of the details of the underlying infrastructure. The software included in this kit can be used to provide access layers for distributed systems and abstractions for applications, and thereby address the fundamental application design objectives of interoperability across different infrastructures, distributed scale-out, and adaptivity, whilst preserving simplicity. The specific capabilities included are:
- Provide a common access layer to the TeraGrid/XD resources, including IaaS clouds
- Provide application-level interoperability across the TG/XD and with other national and international PGS
- Write simple frameworks and applications to submit jobs on distributed resources: grids, Condor pools, and clouds.
- Manage data in a unified manner.
Resource Provider Deployment
Resource providers integrating their resources into the TeraGrid must deploy the CTSS 4 TeraGrid Core Integration Capabilities kit. They must also choose to deploy one or more user capability kits. The Wiki page below tracks the choices each TeraGrid Resource Provider has made regarding which user capability kits they will provide on each TeraGrid resource.
Detailed Design & Implementation
Kit Implementation Plans
- CTSS 4 TeraGrid Core Integration Implementation
- CTSS 4 Remote Login Capability Implementation
- CTSS 4 Remote Compute Capability Implementation
- CTSS 4 Data Movement Implementation
- CTSS 4 Data Management Capability Implementation
- CTSS 4 Wide Area File System Implementation
- CTSS 4 Application Development & Runtime Support Implementation
- CTSS 4 Science Workflow Support Implementation
- CTSS 4 Parallel Application Support Implementation
- Distributed Parallel Application Support
Kit Change Plans
- CTSS 4 Change Plan - TeraGrid Core Integration Kit Deployment
- CTSS 4 Change Plan - Remote Login Kit Deployment
- CTSS 4 Change Plan - Remote Compute Kit Deployment
- CTSS 4 Change Plan - Science Workflow Support Kit Deployment
- CTSS 4 Change Plan - Parallel Application Support Kit Deployment
- CTSS 4 Change Plan - Data Movement Kit Deployment
- CTSS 4 Change Plan - Data Management Kit Deployment
- CTSS 4 Change Plan - Application Development and Runtime Support Kit Deployment
- Kit Registration Mock-up User Views
- Kit Registration Design
- Kit Registration Details
- Related Schema Design
- Kit Registration Information Servers