The CTSS Process
From TeraGrid Wiki
CTSS (Coordinated TeraGrid Software and Services) has developed over a number of years, with multiple groups participating in various pieces of the process. This wiki page documents the CTSS process. Documenting the process is important for several reasons: to improve the quality of planning in CTSS and related areas, to increase participation in the CTSS process, and to explain the process to new members of the TeraGrid community and to TeraGrid partners.
A companion document, CTSS Design, explains the high-level structure of CTSS and why it is structured this way. This document focuses on our current processes as opposed to the technical design or history of CTSS.
As described in CTSS Design, CTSS is not a list of software packages. It is a set of distinct capabilities that have been conceived of, designed, and deployed in order to benefit TeraGrid users. Only one CTSS capability set (the TeraGrid Core Integration kit) is required on all TeraGrid resources. The rest of the capabilities are added to, removed from, and updated on each TeraGrid resource in response to user needs.
Each CTSS capability is conceived of, designed, implemented, deployed, and operated independently of the others. This document explains how these things happen relative to a single CTSS capability.
CTSS capabilities go through the following major phases.
- Definition
- Implementation (implementation plan, software builds, software packaging)
- Deployment
- Operation and maintenance
- Changes (e.g., updates and decommissioning)
The definition and implementation phases are performed by whoever wants the capability to be available on TeraGrid. This is most often a TeraGrid working group (e.g., the Data Working Group), but this is not strictly required. There are several options for how the implementation can be performed (e.g., packages prepared by the GIG packaging team, packages provided by a vendor, or packages provided by resource provider (RP) software administrators). The deployment and operation/maintenance phases are mainly performed by TeraGrid resource providers who have chosen to offer the capability on their resources. The changes phase is performed by anyone who has an interest in a change to an existing capability.
The products of each phase are used as inputs to the next phase, but they are also useful for other activities. For example, the product of the implementation planning step (an implementation plan) can be used as the basis for interoperability testing as well as the starting point for the software build/package/deploy cycle.
The kit structure for CTSS allows independent teams to take ownership of specific feature sets within CTSS. For example, the CTSS capabilities that specifically support science gateways can be owned by the GIG's Science Gateway area, whereas the capabilities that support cross-site data transfer can be owned by the Data Working Group.
Defining CTSS Capabilities
To start the process of establishing a CTSS capability, a team first drafts a capability definition. This document outlines the high-level capability area, documents use cases and existing users who use TeraGrid in the manner described, and identifies the impact on various parts of the TeraGrid community regarding deployment of the new capabilities. The document structure is based closely on the template for TeraGrid policy documents.
Once a capability definition has been drafted, the team circulates the definition for review in increasingly broad circles. The team may review the definition internally, share it with closely related groups, share it with groups that are clearly impacted, share it with the intended users, and perhaps share it with the entire TeraGrid staff via the email@example.com mailing list. At all stages, feedback must be taken into consideration and appropriate changes made to the definition.
Capability definitions are critical because they form the basis for understanding the capabilities that are being proposed. This is important both to users and to members of other TeraGrid teams. It is also an important element of creating and maintaining interoperability, both within TeraGrid and externally.
Once a capability definition is complete, the following things can be done.
- User documentation (the major points of the user documentation should be clear at this point even before the implementation is ready)
- Gap analysis (original needs vs. actual capabilities, requirements traceability matrix)
- Implementation Plan (see next section)
- Interoperability analysis (for example, comparing TeraGrid vs. Open Science Grid remote job submission features)
- A TeraGrid community policy that defines the capability (see addendum at end of this page)
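To illustrate the gap analysis item in the list above, a requirements traceability matrix can be as simple as a mapping from each documented user need to the features meant to satisfy it. The following is a minimal Python sketch under that assumption, not a TeraGrid tool; the function name, needs, and feature names are all hypothetical.

```python
from typing import Dict, List


def find_gaps(matrix: Dict[str, List[str]]) -> List[str]:
    """Return the user needs that no feature traces back to.

    `matrix` maps each need from the capability definition to the
    list of features that address it (possibly empty).
    """
    return sorted(need for need, features in matrix.items() if not features)


# Hypothetical traceability matrix for an imaginary capability.
traceability = {
    "submit a job from a remote site": ["remote submission service"],
    "monitor a running job": ["status query command"],
    "cancel a running job": [],  # nothing addresses this need yet
}

print(find_gaps(traceability))  # ['cancel a running job']
```

Any need listed in the result is a candidate for either a revised implementation or a revised (and re-versioned) definition.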
Capability definitions must have a version number associated with them. Once a definition is "finalized," any change to the definition must use a new version number. A new version number must be assigned whenever user scenarios or requirements are added, changed, or removed from the definition.
The implementation phase of capability development includes: drafting an implementation plan, building the required software components on the target resources, and creating easy-to-use software installers.
Teams who define CTSS capabilities are also expected to develop a design and implementation for those capabilities. (This is not a strict rule, as the responsibilities could be separated, but it seems reasonable given the current organization of the project.) Teams should review requirements, survey available technologies, and select and document a reasonable path for implementing the capabilities.
The deliverable of this stage of capability development is an implementation plan. The implementation plan provides two important elements: (1) a requirements section that specifies the key technical requirements (e.g., user or system interfaces, interoperability requirements, operations and maintenance requirements), and (2) a recommended plan for implementing the requirements.
The requirements are non-negotiable. Whatever implementation a resource provider offers must be consistent with the requirements expressed here. For this reason, it is essential that these requirements be strictly defined by user needs, as opposed to arbitrary or debatable choices. If user needs cannot justify a requirement, it should probably be removed.
The recommended plan is provided as a guide, not as a mandatory implementation. It explains how to obtain the necessary software and instructions for deploying and configuring the software. The recommended plan should be guaranteed to work under a set of reference conditions (e.g., on a known platform with known configuration), but it might not work on every TeraGrid resource. Resource providers who decide to offer the capability kit are free to implement the kit in other ways, so long as the technical design requirements are satisfied.
The recommended plan should detail the degrees of flexibility for all aspects of the plan. For example, are any of the recommended software components optional? Will the recommended plan work with more than one version of the software, or is there only one that will work? Which configuration choices are key to meeting the technical requirements and which are left to the resource providers? Are there multiple sources for the software, and if so, which ones are known to work?
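One way to make these degrees of flexibility explicit is to record them as data alongside the plan. The sketch below is only an illustration in Python: the `PlannedComponent` type, the `compare_with_plan` function, and the component names and versions are hypothetical and do not correspond to any actual CTSS kit. Because the recommended plan is a guide rather than a mandate, a mismatch reported here is a prompt for closer review, not proof that the technical requirements are unmet.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class PlannedComponent:
    name: str
    optional: bool = False
    # An empty set means the plan does not restrict the version.
    known_good_versions: FrozenSet[str] = frozenset()


def compare_with_plan(plan: List[PlannedComponent],
                      installed: Dict[str, str]) -> List[str]:
    """List the ways an installation differs from the recommended plan.

    `installed` maps component name to the installed version string.
    """
    differences = []
    for component in plan:
        version = installed.get(component.name)
        if version is None:
            if not component.optional:
                differences.append(
                    f"missing required component: {component.name}")
        elif (component.known_good_versions
              and version not in component.known_good_versions):
            differences.append(
                f"{component.name} {version} is not a known-good version")
    return differences


# Hypothetical plan: one required component, one optional component.
plan = [
    PlannedComponent("transfer-server",
                     known_good_versions=frozenset({"2.1", "2.2"})),
    PlannedComponent("gui-client", optional=True),
]

print(compare_with_plan(plan, {"transfer-server": "1.0"}))
```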
Once an implementation plan is complete, the following actions may be taken.
- Identify specific software build inputs and recipes (see next section)
- Develop and review plans to use the TeraGrid capability in specific applications
- Establish testbeds that recreate the capability in a non-production setting (for interoperability testing, application prep & debugging, trying out improvements)
- Develop and deploy Inca tests that measure whether a resource's implementation satisfies use cases and technical requirements
- Recruit prospective users to test the capability when it is implemented
- Develop, run, and analyze results from interoperability tests (to measure interoperability with other systems)
The kit implementation plan may involve the use of software that is not pre-built for TeraGrid resources. In other words, the software may be available in source form but not in executable form. (This is true for most open source packages, for example. It is usually not true for software obtained from system vendors or other commercial sources.) When this is true, a decision needs to be made regarding who is responsible for building the software for use on TeraGrid resources where the capability kit is desired. Options include: the software developers or providers, resource provider personnel, the GIG software integration team, or other GIG personnel. The decision should be made by each kit's development lead in negotiation with the possible workers (GIG software integration, resource provider software admins, any others) and documented in the implementation plan.
The product of this phase of kit development is a build process for each of the target resources. The build process is a recipe or plan that can be brought into effect whenever a build of the software is needed for the target resource. (It is often necessary to rebuild software for resources if, for example, the resource platform changes or a critical patch is available for the software.) Options for the build process include: a promise that a specific team will build the software when needed, or a "build recipe" that can be used as input for the NMI Build & Test system to automatically generate a build.
In addition to creating executable software images, the build stage should also identify platform-specific bugs or build issues. Most software build processes execute unit tests (internal software validation tests) after the software is built, so building the software on each platform could result in a set of platform-specific issues that must be fed back to the software providers for resolution.
In most cases, the implementation plan will involve deploying specific software. This work is usually (though not always) performed by resource provider personnel. When it is performed by resource provider personnel, they typically expect to be provided with easy-to-use software installers: a small set of files that they can download, install, configure, and activate by following a brief set of written instructions, usually supported by an automated script or command.
The product of the package phase of kit development is a set of installable software packages that resource providers, science gateway developers, peer grids, or end users can obtain and install to provide the kit's capabilities. It also includes documentation materials (whatever documentation is necessary to make the kit's capabilities usable), installation and configuration instructions, and (of key importance) validation tests that can be run by resource providers or by the TeraGrid operations team via Inca to validate successful deployment of the kit on a given system.
The installable software packages created by the packaging phase of capability development can be used for any of the following purposes.
- Deploying the intended capability on TeraGrid resources (see "capability deployment" below)
- Deploying other capabilities on TeraGrid resources (some TeraGrid capabilities share components with others)
- Deploying the capability on non-TeraGrid resources (e.g., campus systems) in order to provide TeraGrid-like capabilities
- Deploying testbeds for experimentation and interoperability testing
The deliverables mentioned above are defined generally, because different kits may have different types of implementations. These differences may arise from the nature of the capability and the software available to implement it OR from the nature of the team that is promoting the capability. The following "best practices" are recommended for kit implementors because they have proven useful in negotiating TeraGrid's unique challenges of heterogeneity, distributed management, and governance.
- A prototyping phase between the implementation and deployment phases is usually helpful in validating that the proposed implementation meets user needs. This can be accomplished by deploying a prototype implementation on a small number of resources and inviting some of the targeted users to try it out. If successful, the prototype implementation may simply become the recommended implementation. If not, a temporary return to the implementation phase (hopefully with a quick turnaround) will be necessary. In any case, ironing out these kinds of issues before proposing a widespread deployment has many benefits, not least of which is avoiding the wasted effort of deploying a flawed capability. Note: This practice is also helpful when preparing to deploy a new version of an existing capability.
- The software packages, documentation, instructions, and validation tests that make up a kit's implementation should ideally be stored in the TeraGrid software repository (repo.teragrid.org) so that they are available in a familiar place for all members of the community to use when needed.
- The TeraGrid Pacman cache (pacman.teragrid.org) provides an excellent mechanism to simplify installation of software on TeraGrid resources. We have yet to explore using Pacman for delivery to science gateway developers, end users, or other grid operators, but we expect that it would work well for those scenarios also.
- The GIG software integration team has extensive experience porting software to TeraGrid's diverse resources. (TeraGrid currently includes more than 20 unique hardware/operating system platforms.) The software integration team can either provide guidance to RPs or do the work of getting software to build on these resources. (Neither style is best for all situations.)
- The software integration team is currently constructing a software build & test service for the TeraGrid community. This service allows its users to specify instructions for building and/or testing software, which it will then automatically execute on some or all of the TeraGrid resources using TeraGrid's Grid interfaces. The result of this is a set of executable software packages that can be easily installed on TeraGrid resources.
- TeraGrid's Inca service provides a harness for running validation tests on arbitrary TeraGrid resources. Validation tests for CTSS capability kits should be designed to work within the Inca service.
Capability Deployment
The Software working group coordinates the process of establishing agreements regarding which capabilities should be on which systems. Resource providers are responsible for deploying appropriate CTSS capabilities on their systems in accordance with those agreements. The Software working group also coordinates and tracks the deployment of new capabilities across TeraGrid resources until they are considered to be in production operation, at which point this responsibility is transferred to the operations team.
The Software working group also maintains a set of information about the technical details of each TeraGrid resource, the software administrators for those resources, and the GIG packaging team members who have experience in and responsibility for porting software to each resource.
The Software Integration area in the GIG provides a number of services and tools that assist in the deployment process.
- The TeraGrid CVS repository (repo.teragrid.org) is a place where the software integration team stores software, configuration files, and deployment documentation for CTSS components. CTSS 3 software and deployment instructions are in the public CTSS Software Home.
- Starting with CTSS 3, the TeraGrid adopted a standard packaging tool called Pacman. The Pacman Usage Guidelines document describes how Pacman is used on the TeraGrid.
- The CTSS Packaging Coordinators are responsible for packaging CTSS components.
The operations team provides the Inca service to help verify deployments of new or modified capabilities. (Inca is also used during the operations and maintenance phase for CTSS capabilities.)
Capability Operations and Maintenance
The Inca service (provided by the Inca component of the TeraGrid Operations team) is used to execute tests and compile test results in order to verify and validate (V&V) availability of CTSS capabilities on each TeraGrid resource. The Software working group coordinates the production of the lists that define which capabilities should be on which systems, and Inca validates that the systems have those capabilities by running tests.
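The relationship between the agreed capability lists and the test results can be pictured as a simple comparison. The Python sketch below is a conceptual illustration only: it does not use or imitate the Inca interfaces, and the function name, resource names, and capability names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def find_violations(
    agreements: Dict[str, Set[str]],
    test_results: Dict[Tuple[str, str], bool],
) -> List[Tuple[str, str, str]]:
    """Compare agreed capabilities with validation test outcomes.

    `agreements` maps a resource to the capabilities it has agreed to offer.
    `test_results` maps (resource, capability) to True if the tests passed.
    Returns (resource, capability, reason) for every unmet agreement.
    """
    violations = []
    for resource, capabilities in sorted(agreements.items()):
        for capability in sorted(capabilities):
            passed = test_results.get((resource, capability))
            if passed is None:
                violations.append((resource, capability, "no test result"))
            elif not passed:
                violations.append((resource, capability, "test failed"))
    return violations


# Hypothetical agreements and test outcomes.
agreements = {"site-a": {"core", "data-movement"}, "site-b": {"core"}}
test_results = {
    ("site-a", "core"): True,
    ("site-a", "data-movement"): False,
}

for violation in find_violations(agreements, test_results):
    print(violation)
```

Treating a missing result as a violation, rather than silently skipping it, reflects the idea that a capability cannot be considered validated until a test has actually run.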
The TeraGrid trouble ticket service (provided by the Help Desk component of the TeraGrid operations team) is used to track known issues with CTSS capabilities that have been declared part of TeraGrid's "production operation." (See "Defining CTSS Capabilities" above and "CTSS Changes" below.) This includes both the software capabilities on each resource and the documentation provided by the TeraGrid user services/documentation team.
Resource providers are responsible for maintaining the CTSS capabilities on their systems and for resolving software capability issues that are discovered by Inca or by other means. The operations team tracks and coordinates the resolution process as needed.
Once a CTSS capability is part of TeraGrid's production operation, changes to that capability are made using the CTSS Change Process, described in the next section.
CTSS Changes
CTSS capabilities are often updated to include new features, to support new user scenarios, or to correct problems.
Each CTSS capability definition should have an associated version number. Once a specific version of a CTSS capability is finalized, no changes should be made to that version of the definition. Instead, changes should be proposed/documented via a new version of the capability definition. New versions of capabilities are defined, implemented, deployed, and operated as if they were entirely new capabilities. (There is no requirement that older versions must be removed when newer versions become available.)
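The versioning rule can be sketched as an immutable record: a draft may be edited freely, but revising a finalized definition yields a new version and leaves the old one untouched. This is a minimal Python illustration, not an actual TeraGrid data model; the class, the capability name, and the use of integer version numbers are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Tuple


@dataclass(frozen=True)
class CapabilityDefinition:
    name: str
    version: int  # integer versions are an assumption of this sketch
    requirements: Tuple[str, ...]
    finalized: bool = False

    def finalize(self) -> "CapabilityDefinition":
        """Mark this version as finalized; it must not change afterwards."""
        return replace(self, finalized=True)

    def revise(self, requirements: Iterable[str]) -> "CapabilityDefinition":
        """Return a definition with new requirements.

        A draft keeps its version number. A finalized definition is never
        modified; the revision becomes a new draft with the next version.
        """
        new_requirements = tuple(requirements)
        if not self.finalized:
            return replace(self, requirements=new_requirements)
        return replace(self, version=self.version + 1,
                       requirements=new_requirements, finalized=False)


# Hypothetical capability: version 1 is finalized, then revised.
v1 = CapabilityDefinition("example-capability", 1, ("submit job",)).finalize()
v2 = v1.revise(["submit job", "cancel job"])

# Older versions need not be removed, so both can be kept side by side.
published = {definition.version: definition for definition in (v1, v2)}
print(sorted(published))  # [1, 2]
```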
Beginning with CTSS 4, changes to the deployed CTSS capabilities are documented using the CTSS change process. The philosophy of the change process is that the amount of effort required for the change process should be proportional to the size and substance of the proposed change.
The CTSS change process requires a change team (a specific team proposing a change) to describe (in writing) the proposed change and build a change plan based on templates provided by the Software Integration team.
- The change description explains the change and the purpose of the change in terms that can be understood broadly within the TeraGrid project and by TeraGrid users. It is the basis for announcements and other communications with users and may be a starting point for documentation changes.
- The change plan details the tasks that each team in the TeraGrid project needs to do to complete the change. A checklist provided by the Software Integration team helps the change team identify the key issues that need to be covered by the plan.
When a change is proposed, the change description and plan should be announced on the firstname.lastname@example.org mailing list so that everyone has an opportunity to learn about the change and self-identify that they are impacted by the change. (This is intended to avoid unintended impact.) Additional communication should be initiated with each team that has tasks in the change plan, and that communication should ensure that each team agrees with the plan (or proposes changes) and is able to provide the effort required.
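The bookkeeping implied here (every team with tasks in the plan must agree before the change proceeds) can be sketched as follows. This is a hypothetical Python illustration, not the Software Integration team's templates or checklist; the class, team names, and tasks are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ChangePlan:
    description: str
    tasks: Dict[str, List[str]]  # team name -> tasks assigned to that team
    agreed: Set[str] = field(default_factory=set)

    def record_agreement(self, team: str) -> None:
        """Note that a team accepts its tasks and can provide the effort."""
        if team not in self.tasks:
            raise KeyError(f"{team} has no tasks in this plan")
        self.agreed.add(team)

    def teams_awaiting_agreement(self) -> List[str]:
        """Teams with tasks in the plan that have not yet agreed to it."""
        return sorted(set(self.tasks) - self.agreed)


# Hypothetical change plan with tasks for two teams.
change = ChangePlan(
    description="Update an example component to a newer release",
    tasks={
        "operations": ["update validation tests"],
        "documentation": ["revise user guide"],
    },
)
change.record_agreement("operations")
print(change.teams_awaiting_agreement())  # ['documentation']
```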
The Inca service is typically used to track the progress of changes, and changes to Inca itself are an issue that typically needs to be covered in change plans.
Addendum: A Note on TeraGrid Policies
There are two places where rules about how CTSS is implemented are found: in the award conditions for the projects that fund various pieces of TeraGrid, and in TeraGrid community policies.
In regard to award conditions, one of the responsibilities of the GIG project is to lead the process of defining and implementing a coordinated software strategy. This is performed by the GIG's area directors for software integration, who participate in the software working group and who have led the CTSS process. Several resource provider projects are required by the terms of their awards to implement CTSS on the resources that they provide to TeraGrid. The way this is commonly interpreted is that these resources must implement any CTSS kits that are declared to be mandatory in their definition documents. The only kit that is so defined is the TeraGrid Core Integration kit. Implementing the TeraGrid Core Integration kit is thus sufficient to satisfy the CTSS implementation requirement. The CTSS definition process makes it possible for specific CTSS kits to be mentioned in future project proposals or award conditions, but there is no requirement that this be done.
In regard to TeraGrid community policies, there currently are no community policies that define CTSS, specific CTSS capabilities, or specific requirements in regard to CTSS. A draft policy has been submitted to the TeraGrid Forum for review, but no other action has been taken in that regard at this time.