CTSS 4 Science Workflow Support Capabilities
This capability kit definition was developed in the GIG Software Integration Area based on the contents of CTSS 3 and our current understanding of requirements from applications communities. It defines an optional capability set for TeraGrid resources beginning with CTSS 4. This definition will be consistent for any TeraGrid systems that support this kit.
Abstract *
This document defines the purpose and design of the CTSS 4 Science Workflow Support Kit. TeraGrid resource providers may offer the capabilities defined here in coordination with other resource providers in order to improve ease of use for TeraGrid users who make use of multiple TeraGrid resources.
The Science Workflow Support kit defines capabilities that support workflow-based science activities. In many branches of science, workflow techniques are being used to automate computational tasks in an effort to improve productivity and to explore the potential benefits of more intensive use of computational analysis and simulation.[1] Workflows orchestrate the use of storage systems, processors, and other computational elements in order to efficiently and automatically execute complex sets of tasks defined by scientists.
The recommended implementation of the capabilities defined in this kit is described in the companion document, CTSS 4 Science Workflow Support Implementation.
Purpose
The Science Workflow Support kit provides software capabilities that make it easier to run science workflow applications and tools on a TeraGrid system.
Scientists in diverse fields including high energy physics, structural biology, bioinformatics, seismology, mechanical and aerospace engineering, meteorology, and astronomy are exploring the potential for automating computational tasks to improve productivity and to open new modes of exploration.[1]
Since the first computers, scientists have written programs that modeled and simulated physical processes and that analyzed data collected from physical systems. Modern computational systems can execute simulation and analysis applications so fast that the productivity of scientists is now limited by their ability to supply data to these programs, start them running, monitor their execution and respond to failures, and collect the data generated by them. These tasks were formerly handled manually; scientists are now exploring workflow techniques as a way to automate them.
Workflow combines information management and automation techniques to manage the automated execution of interrelated tasks (processing, data movement and management). Two key domains for workflow tools are knowledge management systems and automation tools.
- Knowledge management: The data produced by workflows must be carefully managed in order for the scientists to be able to make sense of the results. This set of requirements has led to development of cataloging tools for data, metadata, provenance information, and replicas. It has also led to considerable work on metadata schema, ontologies, data dictionaries, and other knowledge representation and management tools.
- Automation: The relationships between the tasks and the use policies for the computational resources must be managed in order to exploit parallelism when possible and make efficient use of the resources. This set of requirements has led to development of resource brokers, metaschedulers, personal workflow managers, and other tools for managing automation.
Scientists can certainly develop their own software to automate their work and manage their data. However, most scientists would prefer to not spend their time (or their students’ time) developing these “housekeeping” capabilities. Further, because the same capabilities can be applied to a wide range of scientific applications, it makes more sense to provide a common, high quality implementation that most scientists can use rather than to have each science group develop their own.
Workflow is a “hot topic” in business as well as in science, leading to a plethora of workflow-related tools from both the computer science community and industry. At the same time, the workflow area is not yet mature enough for TeraGrid to offer a complete workflow management solution for scientists. (Scientists are still experimenting with different combinations of tools to determine which ones are the most useful and fit best with their existing programs and systems.) Instead, TeraGrid’s Science Workflow Support kit is currently focused on supplying a common set of automation and resource management tools that have proven useful in a wide variety of scientific applications. The kit does not currently address the knowledge management aspect of workflow. Applications must supply their own knowledge management capabilities for the data they produce, including relevant metadata and provenance for the data.
To successfully meet the needs of TeraGrid's workflow-based users, we need to coordinate the design, implementation, deployment, and operation of these capabilities.
This kit definition provides a coordination mechanism and also provides the basis for subsequent activities that include:
- documenting the capabilities for users,
- deciding which resources should provide the capabilities,
- implementing and deploying the capabilities,
- verifying deployment of the capabilities via Inca or other mechanisms,
- making subsequent changes to the capabilities.
Current Applications
The following TeraGrid users are currently using the capabilities provided by this kit. These would be good candidates for validation that the kit is working well on specific resources.
| PI | Institution/Project | Science |
|---|---|---|
| Andrew Connelly | UPitt./NVO | Astronomy |
| Jeffrey P. Gardner | PSC/NVO | Astronomy |
| Harvey Newman | Caltech/CMS | Particle Physics |
| Vladimir Litvin | Caltech/CMS | Particle Physics |
| Michael Deem | Rice | Material Science |
| David Earl | Oxford | Material Science |
| Sarang Dalal | UCSF | Biology |
Architectural Considerations
The Science Workflow Support kit adds three layers of support for science applications. Each layer provides a specific service that is independent of the other two layers. Applications can use any of these layers in any combination to obtain appropriate services. The three layers are: workflow orchestration, reliable execution, and resource provisioning.
In addition to these three layers, the kit makes use of a fourth layer that is provided by the Remote Compute and Data Movement kits. The fourth layer provides the ability to submit jobs to a system via a network interface and the ability to move data into and out of a system via a network interface. Figure 1 shows how these capabilities work together.
Figure 1. Science Workflow Support Kit design. Scientific workflow applications can use any combination of the kit’s capabilities to reduce complexity in the application itself. Kit components are shaded in green.
For its reliable execution capability, this kit uses Condor-G. Condor-G accepts task descriptions and makes sure that each task is executed on available resources. It negotiates the job submission process, monitors tasks while they execute and responds to unusual circumstances (such as resources that fail to execute the task reliably).
The kit uses DAGman to provide workflow orchestration capabilities. DAGman accepts a description of the overall plan for the application’s tasks (dependencies, parallelization options, etc.) in the form of a DAG (directed acyclic graph) and orchestrates the execution of the plan using available resources. DAGman can make use of Condor-G’s reliable execution capability to ensure that all tasks are executed reliably.
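The orchestration idea can be illustrated with a small sketch. This is not DAGman's implementation; it is a minimal Python model, assuming a hypothetical `deps` mapping from each task name to the tasks it depends on. It groups tasks into "waves", where every task in a wave has all of its dependencies satisfied and could therefore run in parallel.

```python
from typing import Dict, List, Set


def execution_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves that respect the dependencies in `deps`.

    `deps` maps a task name to the set of tasks that must finish first.
    Task names are hypothetical; a real workflow would use its own.
    """
    remaining = {task: set(parents) for task, parents in deps.items()}
    # Tasks that appear only as dependencies have no prerequisites.
    for parents in deps.values():
        for parent in parents:
            remaining.setdefault(parent, set())

    waves: List[List[str]] = []
    done: Set[str] = set()
    while remaining:
        # A task is ready once all of its prerequisites are done.
        ready = sorted(t for t, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("dependency cycle detected; the plan is not a DAG")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


# Hypothetical plan: two analyses depend on one fetch, then a comparison.
plan = {"align": {"fetch"}, "fold": {"fetch"}, "compare": {"align", "fold"}}
waves = execution_waves(plan)
```

In this example the two middle tasks land in the same wave, which is the parallelism an orchestrator can exploit. The cycle check reflects the "acyclic" requirement: a plan whose tasks depend on each other in a loop can never start.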
For its resource provisioning capability, the kit uses MyCluster. MyCluster builds a Condor pool that can include resources from multiple TeraGrid systems. DAGman, Condor-G, or the user’s application itself can then use this pool to execute application tasks.
MyCluster constructs a personal Condor pool by submitting “job proxies” to a set of target systems selected by the user or application. Whenever a local scheduler activates one of these job proxies, the proxy registers with the user’s Condor pool and becomes available for executing tasks on the user’s behalf. The pool grows and shrinks as job proxies flow through the various scheduling systems.
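The way the pool grows and shrinks can be modeled with a toy sketch. It does not use MyCluster or Condor; the `JobProxy` class, its fields, and the system names are hypothetical, and time is counted in abstract steps. Each proxy contributes its processors to the pool only between the moment the local scheduler activates it and the moment its allocation ends.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobProxy:
    """A hypothetical job proxy waiting in, or running on, one system."""
    system: str   # name of the target system (illustrative only)
    slots: int    # processors the proxy contributes once active
    start: int    # time step at which the local scheduler activates it
    end: int      # time step at which its allocation expires


def pool_size(proxies: List[JobProxy], t: int) -> int:
    """Total slots registered with the personal pool at time step t."""
    return sum(p.slots for p in proxies if p.start <= t < p.end)


# Two proxies on two hypothetical systems with overlapping lifetimes.
proxies = [JobProxy("system-a", 4, 0, 3), JobProxy("system-b", 8, 2, 5)]
sizes = [pool_size(proxies, t) for t in range(6)]
```

Here the pool starts with 4 slots, peaks at 12 while both proxies are active, drops to 8, and is empty at the end. Tasks submitted to the pool would simply use whatever is registered at the time.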
Specific Capabilities (User Scenarios)
The use of workflow techniques in science applications is new enough that different applications have different expectations of what they must do themselves vs. what can be provided by other tools. The Science Workflow Support kit uses a modular approach that allows each capability to be used independently of the others, so that applications can use the capabilities they need without being forced to use capabilities they don’t need.
Run and manage a large number of uncoordinated parallel tasks using a single TeraGrid system
This scenario assumes that the scientist has a large number of tasks that need to be executed where each task can be executed independently of the others. None of the tasks require use of the results of other tasks, so they can be executed in any order. For example, an oceanographer may need to run a program that simulates the temperatures of currents in a particular bay over time, using a series of differing input parameter sets. The goal is to determine which particular combination of input parameters results in the simulation producing output that most closely matches observed behavior.
Without using the Science Workflow Support kit capabilities, the scientist would need to prepare a series of input files containing the input parameters and then manually submit each run of the simulation program to the local scheduling system, specifying a different input file for each run. He would then need to monitor the system queue and check each job’s output when it finished executing to be certain that it ran correctly. If any jobs failed (which can happen for a variety of reasons), he would need to resubmit the job and again wait for it to complete and recheck the output. If he is unable to monitor the queue closely, failed jobs may go minutes, hours, or even days before being restarted.
The reliable execution capability provided by Condor-G would eliminate the tedium of monitoring the queue, checking for valid output, and restarting failed jobs. The scientist produces a list of tasks that need to be executed, a means of automatically testing a task’s output for correctness, and one or more targets for where to submit tasks (jobs) for execution. He gives these inputs to Condor-G, and Condor-G then executes all of the tasks in the list as efficiently as possible using the resources specified by the scientist. He can monitor Condor-G’s queue, but Condor-G will automatically restart any failed jobs without requiring human intervention. This saves considerable human effort in managing the jobs and is likely to result in a shorter total time from the beginning of the first task to the end of the last one.
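The three inputs described above (a task list, an output check, and somewhere to run) can be sketched as a retry loop. This is a simplified illustration of the reliable execution idea, not Condor-G's actual mechanism. The function name `run_reliably`, the `max_attempts` limit, and the `run` and `is_valid` callables are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_reliably(
    tasks: List[str],
    run: Callable[[str], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Tuple[Dict[str, Optional[str]], Dict[str, int]]:
    """Run every task, resubmitting any task whose output fails validation.

    Returns (results, attempts). A task that still fails after
    `max_attempts` tries is recorded with a result of None.
    """
    results: Dict[str, Optional[str]] = {}
    attempts = {task: 0 for task in tasks}
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        attempts[task] += 1
        output = run(task)
        if is_valid(output):
            results[task] = output
        elif attempts[task] < max_attempts:
            pending.append(task)   # resubmit without human intervention
        else:
            results[task] = None   # give up and report the failure
    return results, attempts


# Hypothetical stand-in for a simulation that fails once for one input.
_calls: Dict[str, int] = {}

def flaky_simulation(task: str) -> str:
    _calls[task] = _calls.get(task, 0) + 1
    if task == "params-2" and _calls[task] == 1:
        return "ERROR"
    return "ok:" + task


results, attempts = run_reliably(
    ["params-1", "params-2", "params-3"],
    flaky_simulation,
    lambda output: output.startswith("ok:"),
)
```

The failed run is retried automatically while the other tasks proceed, so no task waits on a person noticing the failure.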
Run and manage a set of coordinated tasks using a single TeraGrid system
This scenario assumes that the scientist has one or more series of tasks that need to be executed, where some tasks are dependent on the results from other tasks. These tasks must be executed in a particular order so that the inputs for each task are available when the task runs. For example, a computational biologist may need to run a series of analysis tasks, each task consisting of many subtasks that all must be completed and compared with each other before the next task is begun. The goal might be to identify the most likely function(s) of a particular gene within a particular cell structure.
As in Scenario 1, the scientist could manage the tasks and subtasks manually, submitting each subtask in sequence, and using the results from appropriate subtasks as the inputs for subsequent tasks. This would, of course, be incredibly tedious and very likely prone to human error. She could use Condor-G and MyCluster to manage parallel subtasks within each task, and that would be an improvement. But recreating the task list and input files for each sequential task would still be tedious.
In this scenario, the scientist can use DAGman to describe the sequence of tasks and subtasks that need to be executed and the dependencies between them. She will describe the tasks in the form of a DAG (directed acyclic graph). She can then use MyCluster to construct a personal Condor pool and submit the DAG to Condor-G, and Condor-G will use DAGman to direct which tasks to execute when and how to use the output of one or more tasks as the input to subsequent tasks. She can monitor the progress of the entire workflow by viewing the Condor-G queue.
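As an illustration of what such a DAG description might look like, the sketch below generates text in DAGman's input file format, which uses `JOB`, `PARENT ... CHILD`, and `RETRY` statements. The helper function, job names, and submit file names are hypothetical, and the per-job submit description files they refer to are not shown.

```python
from typing import Dict, List


def dag_file_text(
    jobs: Dict[str, str],
    parents: Dict[str, List[str]],
    retries: int = 0,
) -> str:
    """Build DAG input file text.

    jobs    maps a job name to its submit description file.
    parents maps a child job name to the jobs it depends on.
    retries, if positive, adds a RETRY line for every job.
    """
    lines = [f"JOB {name} {submit_file}" for name, submit_file in jobs.items()]
    for child, deps in parents.items():
        if deps:
            lines.append(f"PARENT {' '.join(sorted(deps))} CHILD {child}")
    if retries > 0:
        lines.extend(f"RETRY {name} {retries}" for name in jobs)
    return "\n".join(lines) + "\n"


# Hypothetical two-stage analysis: two subtasks feed one comparison step.
dag_text = dag_file_text(
    jobs={"blast": "blast.sub", "fold": "fold.sub", "compare": "compare.sub"},
    parents={"compare": ["blast", "fold"]},
    retries=2,
)
```

The generated text declares three jobs, states that `compare` runs only after `blast` and `fold` finish, and asks for up to two retries per job. Writing the dependencies once in this form replaces the manual step of recreating task lists between stages.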
Execute either of the above workflows using multiple TeraGrid systems
If the scientist is authorized to use multiple systems and her simulation software will run on those systems, the MyCluster tool can add even greater efficiency. The scientist instructs MyCluster to construct a Condor pool using the resources she is allocated to use. She tells MyCluster roughly what unit of processing time is needed for each simulation task and how many tasks to run simultaneously on each resource. She then tells Condor-G to submit jobs to this personalized Condor pool rather than specific systems. MyCluster will submit job proxies to the scheduling systems on the authorized resources, and when each job proxy is activated, the resources allocated to that proxy are added to the Condor pool and used by Condor-G to execute tasks. In this way, the resources allocated to the user at several systems can be harnessed for use in executing the full set of tasks.
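A rough back-of-the-envelope estimate shows why pooling helps. The sketch below is an idealized calculation, not a MyCluster feature: it assumes every task takes the same time, ignores queue wait at each system, and uses hypothetical system names and slot counts.

```python
import math
from typing import Dict


def estimated_hours(num_tasks: int, task_hours: float,
                    slots_per_system: Dict[str, int]) -> float:
    """Idealized wall-clock estimate for equal-length independent tasks.

    Tasks run in rounds; each round fills every available slot once.
    """
    total_slots = sum(slots_per_system.values())
    if total_slots <= 0:
        raise ValueError("at least one slot is required")
    rounds = math.ceil(num_tasks / total_slots)
    return rounds * task_hours


# 1000 two-hour tasks on one hypothetical system vs. a pool of three.
single = estimated_hours(1000, 2.0, {"system-a": 16})
pooled = estimated_hours(1000, 2.0,
                         {"system-a": 16, "system-b": 32, "system-c": 16})
```

Under these assumptions the single system needs 63 rounds (126 hours) while the pooled resources need 16 rounds (32 hours). Real runs will differ because proxies wait in each system's queue for varying amounts of time.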
Enable an existing workflow application (with its own workflow manager) to manage tasks on multiple TeraGrid systems
This scenario assumes that a scientist has already added workflow capabilities to his application so that it can execute complex sets of tasks in an automated fashion. The workflow manager built into the application was developed for a single system, because the scientist did not previously have access to more than one system. Now, the scientist has been granted an allocation to use several TeraGrid systems (e.g., the DTF resources) and he wants the built-in workflow manager to be able to use all of the available resources for each workflow rather than being limited to just one of the available systems.
This integration task could seem daunting at first. It may appear that the scientist needs to decouple his workflow manager functionality from the application and replace it with DAGman or Condor-G in order to run on the TeraGrid. Because the Science Workflow Support kit uses a modular structure, this is not actually necessary.
Instead of removing his existing workflow manager, the scientist adapts it to use Condor to submit individual tasks. (This adaptation is likely to be much easier than entirely removing the workflow manager.) He then uses MyCluster to construct a personal Condor cluster using resources from all of the systems where he has an allocation (see scenario 3) and directs his workflow manager to submit tasks to the personal cluster. When the built-in workflow manager submits a task, MyCluster will ensure that it is executed on one of the available resources, using all of the resources available at any given time.
Future Directions
As noted above, computer scientists and the computer industry are devoting considerable effort to developing new tools that provide workflow capabilities. Also as noted above, this kit currently does not provide knowledge management capabilities, one of the key areas in workflow support.
Future versions of the Science Workflow Support kit will add new tools for managing the data and knowledge produced by workflows. One source that is currently under consideration is the Virtual Data System (VDS) developed by the GriPhyN project. The VDS provides a virtual data catalog for recording information about the data required for and produced by workflows from an application-focused perspective. The VDS also provides a virtual data language that allows workflows to be expressed in terms of the contents of this catalog, simplifying the process of defining workflows. The VDS is already closely integrated with Condor-G and DAGman, as well as the GRAM and GridFTP services provided by TeraGrid’s remote compute and data movement kits, so it will be a good fit with the existing design for this kit. The VDS has been used by a variety of scientific applications (including some on TeraGrid) and has proven useful.
Acknowledgements *
This work was supported by the National Science Foundation Office of Cyberinfrastructure, grant number 0503697 “ETF Grid Infrastructure Group: Providing System Management and Integration for the TeraGrid.”
Condor and Globus software has been supported by the National Science Foundation, the U.S. Department of Energy’s Office of Science, DARPA, NASA, the U.K. e-Science Grid Core Programme, the Swedish Research Council, IBM, Microsoft, and Cisco Systems.
Author Information *
Lee Liming
University of Chicago / Argonne National Laboratory
9700 S. Cass Avenue
Argonne, IL 60439
liming@mcs.anl.gov
+1 630-687-1986
Requirements Analysis Teams
No RATs have been convened to date to review or discuss the requirements in this area.
Working Groups
The Software working group has been responsible for the design of this capability in the past, and will continue to review requirements and designs for this kit. The Software working group will also be responsible resources.
Glossary *
CTSS – Coordinated TeraGrid Software Stack, a collection of software capabilities that TeraGrid resource providers coordinate with each other regarding availability and implementation
DAGman – A software tool for managing the execution of tasks within workflows that can be expressed as Directed Acyclic Graphs (DAGs), included in the Condor software suite
GIG – Grid Infrastructure Group, the part of TeraGrid’s staff that is funded independently of resource provider contracts
GRAM – A service interface for remotely submitting jobs to a computer system, available in a Web services version (WS GRAM) and a non-Web services version (GRAM), both included in the Globus Toolkit
GridFTP – A service interface for transferring data in or out of a system on a computer network, based on the popular FTP protocol with Grid security and performance enhancements
RAT – Requirements Analysis Team, a TeraGrid mechanism for collecting requirements in a specific problem area
RP – Resource Provider, an organization that operates hardware and/or specific services for the TeraGrid community
References
1. I. Foster. “Service-Oriented Science.” Science, vol. 308, May 6, 2005. http://www-fp.mcs.anl.gov/~foster/science-2005.htm.
2. Condor-G. http://www.cs.wisc.edu/condor/condorg/.
3. DAGman. http://www.cs.wisc.edu/condor/dagman/.
4. MyCluster. http://www.tacc.utexas.edu/services/userguides/mycluster/.
5. E. Walker, J. P. Gardner, V. Litvin, and E. L. Turner. “Creating Personal Adaptive Clusters for Managing Scientific Jobs in a Distributed Computing Environment.” 2006 IEEE Challenges of Large Applications in Distributed Environments, June 19, 2006.

