CTSS 4 Data Movement Capability Kit
From TeraGrid Wiki
TeraGrid’s Data Movement capabilities provide infrastructure supporting the movement of data on the TeraGrid. Data movement refers to the transfer of data from resource to resource or from site to site. The following is a repository of information and links of interest and relevance to users and technologists involved with the TeraGrid project.
This document defines the purpose and design of the CTSS 4 TeraGrid Data Movement kit. Systems that provide TeraGrid services (including computational resources) can elect to support the capabilities defined here, in coordination with other resource providers, in order to be meaningfully integrated with other TeraGrid systems. The capabilities are focused on providing a layer of integration to otherwise diverse and specialized systems.
This Data Movement kit is proposed as an optional capability kit for TeraGrid resources beginning with CTSS 4. All of the components described here have been deployed as part of CTSS 3, so no changes to the current requirements would be needed for a resource to support this kit.
The TeraGrid Data Movement kit defines the core capabilities that allow the transfer of data on and between TeraGrid resources, giving TeraGrid users a consistent set of fundamental data operations on TeraGrid systems. The kit is optional for all systems that provide TeraGrid services. Because these systems are diverse, the capabilities defined by this kit must be as minimal as possible and must reflect what we believe are universal requirements for the movement of data in the TeraGrid: the tools and technologies that handle the transfer of data from one location to another.
Sample data movement users of the TeraGrid:
- LEAD Gateway
- Need to finalize additional set of users
This kit depends only on the TeraGrid Core Services kit and the Globus Toolkit, and is currently met by installing a set of capabilities packaged by the GIG packaging team and distributed with CTSS 3. Integrating these capabilities into a single kit will allow system administrators at resource provider sites to take a more modular approach to installing TeraGrid software. Although the TeraGrid Data Movement kit is not required, it is expected that this kit will be installed on most systems.
Analysis of allocations and resource usage shows that users typically have allocations on multiple resources at multiple sites. This is particularly relevant for users who have DAC allocations or roaming allocations. Having allocations for resources at distributed sites creates a need for high-performance file transfers between these resources and for capabilities such as wide-area filesystems. To provide the data movement capabilities, the TeraGrid software stack supports data movement between resources through tools such as:
- The GSI-OpenSSH sshd component in the remote login kit provides a simple, easy-to-use mechanism for data movement (scp, sftp)
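For modest transfers, the GSI-OpenSSH tools can be used directly from a shell once a proxy credential is in place. A minimal sketch; the hostname and paths are placeholders, not actual TeraGrid endpoints:

```shell
# Obtain a short-lived proxy credential for GSI authentication
# (myproxy-logon or grid-proxy-init, depending on local setup).
myproxy-logon -l username

# Copy a file to a remote TeraGrid resource with GSI-enabled scp.
scp results.dat tg-login.example.teragrid.org:/home/username/

# Or transfer interactively with GSI-enabled sftp.
sftp tg-login.example.teragrid.org
```

These commands have the familiar OpenSSH syntax; only the authentication mechanism (GSI proxy instead of password or key) differs.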
- GridFTP capability is distributed globally through CTSS 3 and is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth, wide-area networks. GridFTP is used directly by science gateways, portals, and applications. GridFTP provides GSI security on control and data channels, multiple data channels for parallel transfers, partial file transfers, third-party transfers, authenticated data channels, reusable data channels, and command pipelining. GridFTP is not a command, and there is no client program named "gridftp". Rather, it is a capability that can be used through one of three client mechanisms:
- globus-url-copy is a GridFTP client for transferring files from the command line and is distributed with the Globus Toolkit.
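A typical third-party transfer with globus-url-copy might look like the following sketch; the hostnames, port, and paths are illustrative, while -p (parallel streams) and -vb (progress output) are standard globus-url-copy options:

```shell
# Third-party transfer between two GridFTP servers, using
# 4 parallel data streams (-p 4) and verbose progress (-vb).
globus-url-copy -vb -p 4 \
    gsiftp://gridftp.site-a.example.org:2811/scratch/user/input.dat \
    gsiftp://gridftp.site-b.example.org:2811/home/user/input.dat
```

Because both endpoints are gsiftp:// URLs, the data flows directly between the two servers; the client machine only coordinates the transfer.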
- tgcp is a command-line user tool intended to provide high transfer performance while simplifying the copying of files and directories between and within GridFTP-enabled clusters. tgcp is a wrapper for globus-url-copy and RFT (see below) and invokes third-party transfers between GridFTP servers at the TeraGrid source and destination sites.
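Usage is intended to resemble cp, with tgcp choosing tuned GridFTP parameters for the site pair. A hedged sketch; the exact remote-path syntax and supported site aliases are installation-specific, and the hostname here is a placeholder:

```shell
# Copy a local file to a remote TeraGrid resource; tgcp resolves
# the GridFTP endpoints and selects tuned transfer parameters.
tgcp simulation.out tg-login.example.teragrid.org:/home/username/simulation.out
```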
- UberFTP is an interactive GridFTP file transfer client. Using UberFTP opens a session within which files may be transferred and directories and files may be manipulated. UberFTP on the TeraGrid requires GSI authentication.
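An UberFTP session resembles a classic interactive FTP client. A sketch, assuming a valid GSI proxy already exists; the hostname and paths are placeholders:

```shell
# Open an interactive GridFTP session (authenticates via GSI proxy).
uberftp gridftp.example.teragrid.org
# Within the session, familiar FTP-style commands apply, e.g.:
#   cd /scratch/username
#   ls
#   get output.dat
#   put input.dat
#   quit
```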
- GridFTP SRB provides a GridFTP capability for users with data in the Storage Resource Broker (SRB; see below).
- The Reliable File Transfer (RFT) service is an OGSA-based service that provides interfaces for controlling and monitoring third-party file transfers between GridFTP servers. It periodically stores its state in a database so that transfers can be recovered after failures. RFT uses standard Grid security mechanisms for authorization and authentication of users. RFT is distributed globally for gateways, portals, users, and applications that move large amounts of data.
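The Globus Toolkit ships a simple rft command-line client that submits a transfer request to an RFT service and lets the service manage retries. A heavily hedged sketch; the hosts and paths are placeholders, and the transfer-file format shown is abbreviated (the full format also carries transfer options ahead of the URL pairs):

```shell
# transfers.xfr lists source/destination URL pairs, one URL per line.
cat >> transfers.xfr <<'EOF'
gsiftp://gridftp.site-a.example.org:2811/scratch/user/run01.dat
gsiftp://gridftp.site-b.example.org:2811/home/user/run01.dat
EOF

# Submit the request to an RFT service; RFT persists the request in
# its database and restarts the transfer if a failure occurs.
rft -h rft-host.example.org -f transfers.xfr
```

The key difference from a plain globus-url-copy invocation is that the client can exit after submission; recovery and monitoring happen server-side.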
The Data Movement kit provides the software, services, and documentation required to allow TeraGrid users to do the following things.
NOTES: Each of the use cases below (and sub-cases) needs to be validated by TeraGrid's user services group. For each case & sub-case, we need specific TG users or applications that the use case is relevant to. This will allow us to get more detail about the use cases if it turns out to be necessary, and will give us a set of users to use for feedback on whatever solutions/implementations we design and deploy. Kelly Gaither will collect this information from user services.
Transfer a file between TeraGrid file systems
- Initiated from within the TG environment
- Initiated from outside the TG environment (this is important for the science gateway use model and perhaps also for Grid interoperability initiatives)
- Large files, high performance (e.g., transfer a large (>10 GB) file from one TeraGrid system to another at >2 Gbit/s). There ought to be plenty of examples of users who have to move data from RP to RP for different parts of their workflow (a simulation code produces data, an analysis code analyzes it, a visualization code produces visualizations from it, mass storage stores it, other users access it, etc.). Identifying some specific examples and using them to explain how to do this would be very useful.
- Large number of smaller files, high performance (e.g., transfer a large number of files, such as a directory structure, from one TeraGrid system to another at high (>500 Mbit/s) performance). Some applications produce large numbers of (often small) files rather than a few large files. Most file transfer tools get very poor performance transferring small files. RFT, GridFTP, and TGCP offer features for getting high performance when transferring large numbers of small files, so we should document this capability and how to use it. This use scenario also emphasizes the need for fault tolerance, because there is a greater chance of faults. TeraGrid's mechanisms should not require users to "babysit" these transfer operations and restart them manually whenever they fail.
- Interactive versus batch transfer (coupling transfers with jobs, but not using compute allocation to do the data transfer; integration with workflows; interactive/remote visualization)
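The large-file and many-small-file cases above call for different tuning. A hedged sketch using globus-url-copy; hosts and paths are placeholders, and the -cc and -pp options assume a reasonably recent Globus Toolkit release:

```shell
# One large file: multiple parallel TCP streams (-p) and a larger
# TCP buffer (-tcp-bs, in bytes) to fill a high-bandwidth WAN path.
globus-url-copy -vb -p 8 -tcp-bs 4194304 \
    gsiftp://gridftp.site-a.example.org/scratch/user/big.h5 \
    gsiftp://gridftp.site-b.example.org/scratch/user/big.h5

# Many small files: recursive directory transfer (-r) with several
# concurrent connections (-cc) and command pipelining (-pp), which
# avoids a per-file round-trip penalty.
globus-url-copy -vb -r -cc 8 -pp \
    gsiftp://gridftp.site-a.example.org/scratch/user/frames/ \
    gsiftp://gridftp.site-b.example.org/scratch/user/frames/
```

Parallel streams do little for small files, and concurrency does little for a single file, which is why the two cases are documented separately.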
This is almost too obvious to document, but we do need to document that we provide these capabilities. It might be useful to explain how the Data kit's capability relates to the similar capability in the Remote Login kit (which is implemented using GSI-OpenSSH). A key differentiator is that the Remote Login kit's capability is not intended for high-end (high-volume, high-performance) use. This might also be the place where we document TGCP?
Usability is a non-trivial issue in these use cases. To be "usable", the TeraGrid needs to offer a capability that handles the following issues transparently, or at least with near-100% reliability:
- authentication issues (credential availability, proxies, proxy lifetime & renewal, etc.)
- endpoints (keeping track of the "official" endpoints for users as systems change, identifying the "right" endpoints to use for performance, etc.)
- fault recovery (automatic retries, maybe even reassignment of transfers to other servers, etc.)
- diagnostics (when something fails, a clearer description of what the failure was)
- performance management (optimizing endpoints, tuning parameters)
- integration with applications ("scriptability," coordination with computation jobs, incorporation into workflows)
- monitoring/data collection/instrumentation (collecting data on what's happening in the system, what users are doing, errors, accessibility problems)
Just providing access to a GridFTP client and server (or even scp) is not sufficient. It's too primitive. Science users need automation and mediation between the primitive tools and their applications.
Import/Export of data between TeraGrid and external locations
This section has the same usability requirements as the first use case above, but there may be differences in performance expectations, interfaces, and (maybe) fault tolerance.
Should we be providing client software for data import/export with TeraGrid? Maybe we just tell users where to get it; e.g., VDT?
Store and retrieve files in mass storage
This perhaps should be multiple scenarios. The natural (tech-focused) way to do this is to use whatever client tools the mass storage system provides. In some cases, though, applications need a filesystem interface to the storage system (e.g., GPFS, GPFS-WAN, HPSS). In other cases, they might want the mass storage to look like a GridFTP server (HPSS/GridFTP, SRB/GridFTP). At a minimum, we need to be clear about what methods we support and how each is accomplished.
- High-performance to/from mass storage
- Compute nodes to mass storage through API
See above for usability comments.
Transfer data to/from a remote relational database
Jobs running on TG systems need to be able to access remote databases for configuration and input parameters, and to push data into databases for metadata, object output, etc. We used to have database clients in TG; do we need to put them back? Or do people do things with other mechanisms? How does this relate to metadata use cases and file location registration use cases? What about portability/implementation requirements? ("I need Oracle," or "I need MySQL")
The Data working group has been responsible for the design of this capability in the past, and will continue to review requirements and designs for this kit with respect to the management and movement capabilities. The Software working group will be responsible for coordinating the deployment of this capability kit on the designated TeraGrid resources.
- Kit Leader: Kelly Gaither
- Working Groups: Data Working Group