JWAN: lustre-wan advanced features testing
From TeraGrid Wiki
INTRODUCTION
JWAN (Josephin-WAN) is PSC's initiative to establish a secure, federated, distributed lustre filesystem over the TeraGrid WAN.
BACKGROUND
Lustre-WAN Testing Efforts on the Teragrid
GENERAL OBJECTIVES
- We closely follow the Lustre CVS HEAD development source as advanced features become available.
- We explore the feasibility of a distributed global lustre-wan filesystem through the study, testing and investigation of the following:
- Heimdal/MIT Kerberos authentication implementation on Lustre/Lustre-wan
- UID/GID with kerberos, no root squash
- Any GSI tie up (or not)
- Distributed OSS: OSS's contributed from within a local site and from a remote site (same/different realms)
- Clustered OSS; OST pools
- Clustered Metadata: 2 or more MDS units within a local site
- Multiple failover MDS's and OSS's
- Distributed MDS: 2 or more MDS units across 2 different sites
- Proxy service implementation
- Data Replication
- Benchmarking/Tuning: basic IO and application-based with these features enabled
SPECIFIC GOALS
- Establish one kerberos-secured realm lustre-wan filesystem namespace that can be easily mounted on TG resources with OSS storage servers spread across the TG at RP sites.
- Extend JWAN to multiple kerberos realms towards cross-realm kerberos authentication
- Experiment with Clustered Metadata Servers (CMD) with failovers
LINKS ARCHIVE
- Teragrid 2010 ACM technical paper and presentation
- http://portal.acm.org/citation.cfm?id=1838589&jmp=cit&coll=GUIDE&dl=GUIDE&CFID=97709203&CFTOKEN=43516115#CIT
- Powerpoint presentation insert here
- LUG2010 Presentation
- Kerberized Lustre 2.0 over the WAN:
- TeraGrid 2009 JWAN Poster
- Abstract: 14. JWAN: PSC's Secure, Federated, Distributed Lustre Filesystem on the WAN (TeraGrid)
- Actual Poster:
- TeraGrid ARCH Meeting 14 August 2008
- IU's Data capacitor wan testing
- NCSA's Lustre-wan config
- PSC's internal (NOT FOR PUBLIC USE) Lustre-wan wiki
COLLABORATORS
- PSC: Josephine Palencia, josephin@psc.edu; Robert Budden, rbudden@psc.edu; Kevin Sullivan, ksulliva@psc.edu
- SUN Lustre Group/Developers: Eric Mei
- SDSC: Jeffrey Bennett, jab@sdsc.edu
- NICS: Patricia Kovatch, pkovatch@utk.edu, Zachary Giles zgiles1@utk.edu
- TACC: Chris Jordan, ctjordan@tacc.utexas.edu
- NCSA: Michelle Butler, mbutler@ncsa.uiuc.edu; Chad Kerner, ckerner@ncsa.uiuc.edu
PAST COLLABORATORS:
- SUN: Andreas Dilger, Peter Braam
- PSC: Doug Balog (Rinera), Ben Bennett
- SDSC: Donald Thorp, dthorp@sdsc.edu; Haisong Cai, cai@sdsc.edu
- IU: Steve Simms, ssimms@indiana.edu
- NCAR: Adam Boggs (Morphlix)
- ORNL: Greg Pike (Okinawa Institute of Science and Technology)
HARDWARE
| Site | Machine name | IP address | Function | Storage | Arch/OS | Network |
|---|---|---|---|---|---|---|
| PSC | mgs.jwan.teragrid.org | 128.182.112.251 | MGS | - | (Dual) Xeon/Centos5.3 | 1GigE |
| - | mds00w.psc.jwan.teragrid.org | 128.182.112.60 | MDS | - | Dual core AMD/Centos5.3 | IB/GigE |
| - | oss00w.psc.jwan.teragrid.org | 128.182.112.61 | OSS | 1.4TB | Dual core AMD/Centos5.3 | IB/GigE |
| - | oss01w.psc.jwan.teragrid.org | 128.182.112.62 | OSS | 1.4TB | Dual core AMD/Centos5.3 | IB/GigE |
| - | client1.jwan.teragrid.org | 128.182.112.70 | Client | - | Dual Xeon/Centos5.3 | IB/GigE |
| PSC | attractor.jwan.teragrid.org | 128.182.168.77 | MGS/MDT/OSS/Client | 276GB | (Dual) AMD Opteron/Centos5.3 | 1GigE |
| SDSC | np-login.jwan.teragrid.org | 198.202.115.110 | Client | n/a | Quad Core 2.5 GHz Xeon, Rocks V | GigE |
| - | np-compute-0.jwan.teragrid.org | 198.202.115.111 | Client | n/a | Quad Core 2.5 GHz Xeon, Rocks V | GigE |
| - | np-compute-1.jwan.teragrid.org | 198.202.115.112 | Client | n/a | Quad Core 2.5 GHz Xeon, Rocks V | GigE |
| - | np-compute-2.jwan.teragrid.org | 198.202.115.113 | Client | n/a | Quad Core 2.5 GHz Xeon, Rocks V | GigE |
| - | np-compute-3.jwan.teragrid.org | 198.202.115.114 | Client | n/a | Quad Core 2.5 GHz Xeon, Rocks V | GigE |
| - | np-oss-7.jwan.teragrid.org | 198.202.115.122 | OSS | 2 TB | Dual Opteron 1.8 GHz, Rocks V | GigE |
| NICS | verne.nics.utk.edu | 192.249.7.6 | Client | n/a | - | GigE |
| - | verne1.nics.utk.edu | 192.249.7.7 | Client | n/a | - | GigE |
| - | verne2.nics.utk.edu | 192.249.7.8 | Client | n/a | - | GigE |
| - | verne3.nics.utk.edu | 192.249.7.9 | Client | n/a | - | GigE |
| - | verne4.nics.utk.edu | 192.249.7.10 | Client | n/a | - | GigE |
| TACC | c0-201.tacc.utexas.edu | 129.114.50.33 | Client | n/a | Dual-Core Xeon 2.66Ghz, RedHat | 1GigE |
| NCSA | tg-lustre01.ncsa.uiuc.edu | 141.142.25.101 | MDS | 100G | dual core, intel, RedHat | 10GigE |
| - | tg-lustre02.ncsa.uiuc.edu | 141.142.25.102 | OSS | 3TB | dual core, intel, RedHat | 10GigE |
| - | tg-lustre03.ncsa.uiuc.edu | 141.142.25.103 | OSS | 3TB | dual core, intel, RedHat | 10GigE |
| - | tg-lustre04.ncsa.uiuc.edu | 141.142.25.104 | OSS | 3TB | dual core, intel, RedHat | 10GigE |
| IU | mds03.uits.indiana.edu | 149.165.235.173 | MDS | 500 GB | Dual core 3 GHz Xeon, Redhat | 1GigE |
| - | oss25.uits.indiana.edu | 149.165.235.25 | OSS | 93 TB | Dual core 3 GHz Xeon, Redhat | 10GigE |
| - | oss26.uits.indiana.edu | 149.165.235.26 | OSS | 93 TB | Dual core 3 GHz Xeon, Redhat | 10GigE |
| - | oss27.uits.indiana.edu | 149.165.235.27 | OSS | 93 TB | Dual core 3 GHz Xeon, Redhat | 10GigE |
| - | oss28.uits.indiana.edu | 149.165.235.28 | OSS | 93 TB | Dual core 3 GHz Xeon, Redhat | 10GigE |
| ORNL | - | - | - | - | - | - |
DOCUMENTATION
Instructions for lustre servers (OSS) and clients connecting to JWAN
(Single Kerberos Realm)
A. Decide on the names of your systems.
It should be of the form <your-system-name>.jwan.teragrid.org
Ex. Suggested names:
- client1.<your_site>.jwan.teragrid.org
- oss1.<your_site>.jwan.teragrid.org
After you have decided on the names, please update the hardware section for your site above (machine names, IP addresses, specs). We need this information for the PSC network. (IMPORTANT)
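The naming convention in step A can be sketched as a small shell helper. The function name `jwan_fqdn`, its argument order and the site name `mysite` are illustrative assumptions, not part of any JWAN tooling.

```shell
#!/bin/sh
# Hypothetical helper: build a JWAN hostname from role, index and site,
# following the suggested pattern <role><n>.<site>.jwan.teragrid.org.
jwan_fqdn() {
    role=$1    # e.g. client or oss
    index=$2   # e.g. 1
    site=$3    # e.g. psc
    printf '%s%s.%s.jwan.teragrid.org\n' "$role" "$index" "$site"
}

# Example: names for one client and one OSS at a placeholder site "mysite".
jwan_fqdn client 1 mysite
jwan_fqdn oss 1 mysite
```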
B. Visit the site <http://downloads.lustre.org/public> and obtain the following:
- Kernel 2.6.18_128.7.1 for your specific Architecture (We use CentOS 5.*)
- Lustre-1.9.280 (Lustre 2.0 alpha 5 release)
C. Kernel
You can run either a lustre-patched kernel or a patchless client.
- For patchless clients, go to http://wiki.lustre.org/index.php/Patchless_Client
- For patched kernels, follow instructions under the heading "Building from Source" in http://wiki.lustre.org/index.php/Building_and_Installing_Lustre_from_Source_Code#Introducing_the_Quilt_Utility
- Install your patched kernel and reboot
NOTE: Go to ftp://ftp.psc.edu/pub/jwan/ for the patched kernel and lustre RPMS.
D. Lustre
- Follow instructions under the heading "Building Lustre" from http://wiki.lustre.org/index.php/Building_and_Installing_Lustre_from_Source_Code#Introducing_the_Quilt_Utility
- Make sure you use the option --enable-gss during configure
- Install lustre (rpms or via make install) and test load your lustre modules
NOTE: you can check http://staff.psc.edu/josephin/Lustre-wan/ for generic lustre rpms that have been built with kerberos that could work on your system
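As a rough sketch of step D, the helper below prints the build commands for a kerberos-enabled lustre tree rather than running them. The function name and the kernel source path are placeholders; `--enable-gss` comes from the instructions above, while `--with-linux` and `make rpms` are the usual lustre build options and may need adjusting for your tree.

```shell
#!/bin/sh
# Print (do not run) the build steps for a kerberos-enabled lustre tree.
# The kernel source path is a placeholder supplied by the caller.
lustre_gss_build_cmds() {
    kernel_src=$1
    printf './configure --enable-gss --with-linux=%s\n' "$kernel_src"
    printf 'make rpms\n'
    # After installing the RPMs, test-load the modules:
    printf 'modprobe lustre\n'
}

lustre_gss_build_cmds /path/to/kernel-source
```

Review the printed commands and run them from the top of the lustre source tree.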
E. Contact PSC by sending mail to jwan@psc.edu and request lustre keytabs for your system.
- In your email, indicate the FQDNs, IP addresses, network MAC address(es) and functions that your systems will perform (e.g. as client, OST, or both)
F. Mount jwan with this command
- mount -t lustre mgs.jwan.teragrid.org@tcp0:/jwan /jwan
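One way to confirm that the mount in step F succeeded is to look for a lustre entry in the mounts table. The sketch below assumes a Linux client with `/proc/mounts`; the function name `is_lustre_mounted` is hypothetical.

```shell
#!/bin/sh
# Return success if a lustre filesystem is mounted at the given mount point.
# The second argument (mounts table) defaults to /proc/mounts and exists
# mainly so the check can be exercised against a sample file.
is_lustre_mounted() {
    mountpoint=$1
    mounts=${2:-/proc/mounts}
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "lustre" { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$mounts"
}

# Typical use on a client, run as root (commented out so the sketch is
# safe to paste):
#   mkdir -p /jwan
#   mount -t lustre mgs.jwan.teragrid.org@tcp0:/jwan /jwan
#   is_lustre_mounted /jwan && echo "jwan is mounted"
```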
G. Login to your JWAN client using your TERAGRID portal name
You need to have a TERAGRID account or portal username/account. Install the login script patch for the KRB5CCNAME then
- type kinit <your_TERAGRID_portal_name>@TERAGRID.ORG
- cd /jwan/<your_site>/<your_TERAGRID_portal_name> and start doing IO
NOTE: By default, /jwan/<your_site> will stripe across your contributed OSTs (if applicable) for faster IO
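The login steps in G can be sketched as follows. The helper names, `myportalname` and `mysite` are placeholders; `kinit`, `klist` and `lfs getstripe` are the standard Kerberos and lustre client commands.

```shell
#!/bin/sh
# Hypothetical helpers deriving the kerberos principal and the JWAN
# working directory from a TERAGRID portal name and a site name.
jwan_principal() {
    printf '%s@TERAGRID.ORG\n' "$1"
}

jwan_userdir() {
    site=$1
    user=$2
    printf '/jwan/%s/%s\n' "$site" "$user"
}

# Typical session (commented out; requires a real account and mount):
#   kinit "$(jwan_principal myportalname)"
#   klist                       # confirm a TERAGRID.ORG ticket was issued
#   cd "$(jwan_userdir mysite myportalname)"
#   lfs getstripe .             # show which OSTs the directory stripes over
```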
H. Lustre-2.0 bugs
- See below under BUGS: Phase II
Specific instructions for OST servers
A. First send mail to jwan@psc.edu that you want to add a server to the OST pool before creating/mounting it.
(Important!)
B. Test mount jwan from your OST. (Important!)
C. Go to http://wiki.lustre.org/index.php/Mount_Conf and specify attractor.jwan.teragrid.org as the mgs/mdt server
- mkfs.lustre --fsname=jwan --ost --mgsnode=attractor.jwan.teragrid.org@tcp0 <your_device>
- mkdir -p /mnt/jwan/ost0
- mount -t lustre <your_device> /mnt/jwan/ost0
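Because mkfs.lustre overwrites the target device, it can help to print the three commands above for review before running them. This is a minimal sketch; the function name, the device `/dev/sdX` and the OST number are placeholders.

```shell
#!/bin/sh
# Print the OST setup commands for a given block device and local OST
# number so they can be reviewed before being run by hand.
# mkfs.lustre destroys existing data on the device, hence print-only.
jwan_ost_cmds() {
    device=$1   # placeholder, e.g. /dev/sdX
    n=$2        # used only to name the mount directory
    printf 'mkfs.lustre --fsname=jwan --ost --mgsnode=attractor.jwan.teragrid.org@tcp0 %s\n' "$device"
    printf 'mkdir -p /mnt/jwan/ost%s\n' "$n"
    printf 'mount -t lustre %s /mnt/jwan/ost%s\n' "$device" "$n"
}

jwan_ost_cmds /dev/sdX 0
```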
Lustre Idmap
- Accounts (TERAGRID portal names) are consistent across all JWAN systems. The current JWAN systems exist in just one kerberos realm, TERAGRID.ORG, so no cross-realm lustre id-mapping issues are involved, i.e. idmap.conf is disabled.
Multiple Kerberos Realms/Cross-Realm Authentication/Lustre ID-MAP
- PSC started experimenting with adding systems placed in different kerberos realms (i.e. outside TERAGRID.ORG).
- Instructions to follow.
REFERENCES
- MIT Kerberos http://web.mit.edu/Kerberos/
- Lustre kerberos http://wiki.lustre.org/index.php?title=Kerb_Lustre
- Lustre Mountconf http://wiki.lustre.org/index.php?title=Mount_Conf
KEY CONFIG FILES
Initially use local parameters for these files to get your system operational. These values will be consolidated once cross-site testing begins.
- kdc.conf
- krb5.conf
- lustre_idmap
TEST PLAN
Phase I: Checking the Network
- Check network bandwidth (Mbits/sec)/latency (ms) between RP sites
| Site | System | Bandwidth (Mbits/sec) | Reversed (Mbits/sec) | Latency (ms) |
|---|---|---|---|---|
| PSC-TACC | (mgs, mds00w,oss00w,oss01w) - c0-201 | 9-27 | 64-88 | 39 |
| PSC-NCSA | (mgs, mds00w,oss00w,oss01w)- | - | - | 19 |
| - | (mgs, mds00w,oss00w,oss01w)- | - | - | 19 |
| - | (mgs, mds00w,oss00w,oss01w)- | - | - | 19 |
| - | (mgs, mds00w,oss00w,oss01w)- | - | - | 19 |
| PSC-IU | - | - | - | - |
| PSC-NCAR | - | - | - | - |
| PSC-SDSC | - | - | - | - |
| PSC-ORNL | - | - | - | - |
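Latency figures like those in the table can be collected with ping. The sketch below extracts the average round-trip time from the ping summary line; the function name is hypothetical, and the use of iperf for the bandwidth columns is an assumption, since the page does not say which tool was used.

```shell
#!/bin/sh
# Extract the average round-trip time (ms) from the summary line that
# ping prints, e.g.
#   rtt min/avg/max/mdev = 38.9/39.1/39.5/0.2 ms
# Splitting on "/" puts the average in the fifth field.
ping_avg_ms() {
    awk -F'/' '/^rtt|^round-trip/ { print $5 }'
}

# Typical use between two sites (commented out; needs network access):
#   ping -c 10 c0-201.tacc.utexas.edu | ping_avg_ms
# Bandwidth, assuming iperf is installed at both ends:
#   iperf -s                          # on the remote system
#   iperf -c c0-201.tacc.utexas.edu   # on the local system
```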
Phase II: Defining the Kerberos Realms, Domains Setup for the WAN lustre filesystem
Below are the 3 options we considered for setting up the kerberos framework of the lustre-wan. We listed their plusses/minuses and chose Option C as the most feasible approach.
- Option A: Use the existing local kerberos environment, where each RP site's lustre-wan systems with kerberos are mapped onto their own local kerberos realms
- (--) the filesystem will reside (be local) and be mounted from 1 RP site.
- Option B: Create a new (lustre-wan) domain (called mgs.teragrid.org) on the TERAGRID.ORG realm and roll all participating lustre-wan systems into this single domain.
- (--) set up totally new, separate kerberos realm (more work)
- Option C: Adopt a combination of A and B and place only the MGS (management configuration server) on the TERAGRID.ORG kerberos realm while the rest of the systems use their already established local realms.
- (++) leverages on an already existing TG kerberos infrastructure
- (++) lustre-wan filesystem resides on the TERAGRID.ORG realm and is mounted with the teragrid.org namespace.
- Option D: Use TERAGRID.ORG as the single kerberos realm; create the jwan.teragrid.org as the DOMAIN namespace where all joining systems adopt the <site>.jwan.teragrid.org namespace. (Successful July 2009)
- Then extend from the single kerberos realm to multiple kerberos realms with cross-site/cross-realm kerberos authentication (Sept 2009)
Background
We aim to distribute a filesystem across several organizations, not simply implement a centralized filesystem that will be accessed by different organizations. TeraGrid already has an existing kerberos authentication mechanism as well as a central database of users which could be used to consistently generate mappings.
Leveraging the existing TG kerberos infrastructure, we implement transit cross-realm kerberos connections, maintain the MDS's and OSS's in their respective local realms, distribute the OSS's across several resource providers and export only the MGS on the TERAGRID.ORG realm.
KERBEROS SETUP:
We use transit cross-realm connections between the TERAGRID.ORG realm and the RP sites' respective realms.
- MGS in realm TERAGRID.ORG
- MGS: 1 MGS (machine physically at PSC but resides on the TERAGRID.ORG kerberos realm)
- MDS/OSS in RP respective realms
- MDS's: currently just 1 MDS residing at PSC (mds00w); more will be added later as configuration allows
- OSS's : OSS servers will reside at RP sites and use the local kerberos realm of the site with local site hostnames; we anticipate many more OSS contributions from various RP sites
- Clients in RP and teragrid realms
NOTES:
- The authentication model uses the TG's kerberos infrastructure, so while sites outside the TG would be able to mount the filesystem, authentication might be a problem (though null kerberos authentication is an option for a particular lustre mount).
- Client hosts mount the MGS, i.e. mount -t lustre xdfs.teragrid.org:/lustre-kerb-wafs; having it in the TERAGRID.ORG realm eliminates pair-wise mounts.
- MGS being in the TERAGRID.ORG realm simplifies user authentication as well.
- Users get a TERAGRID.ORG ticket when they log into the portal. With this ticket, they will be able to access their files on any OSS through a lustre ID mapping function, yet to be developed, but similar to the gridmap service available today.
- Work is under way for OST pools which allow users to specify subsets of OSS to store files in and therefore write on local OSS.
Summary Test Table
The table below briefly summarizes the kerberos realms & domains and lustre functions served by system components from various RP sites. For tests on the virtual environment, all systems (MDS, OSS, Client) reside on a single node. All other tests are on real separate systems unless otherwise stated.
| KERBEROS REALMS | DOMAINS | MGS | MDS | OSS | CLIENTS | STATUS |
|---|---|---|---|---|---|---|
| 1a PSC.EDU | psc.teragrid.org | mds00w | mds00w | oss00w, (oss01w) | fred002, operon22.psc.edu | working, lustre_idmap disabled/(enabled) |
| 1b PSC.EDU | psc.teragrid.org, utexas.tacc.edu, ncsa.teragrid.org | mds00w | mds00w | oss00w, oss01w | c0-201.utexas.tacc.edu, tg-pnfs[5-8].ncsa.teragrid.org | Verify if: i) kerb auth works for remote clients, ii) remote clients can mount /lustre-wan fs, iii) lustre_idmap works |
| 1c PSC.EDU | psc.teragrid.org, utexas.tacc.edu, ncsa.teragrid.org | mds00w | mds00w | oss00w, oss01w, wan-css0.tacc.teragrid.org, tg-pnfs5.ncsa.teragrid.org | fred002, c0-201.utexas.tacc.edu, tg-pnfs[5-8].ncsa.teragrid.org | Verify if: i) can add remote OSS |
| 2a PSC.EDU (virtual) | psc.teragrid.org | mds00w | mds00w | oss00w.ncsa.edu, oss01w.tacc.utexas.edu | - | working, lustre_idmap enabled |
| 2b PSC.EDU (virtual) | - | - | - | - | - | - |
| - | - | - | - | - | - | - |
| - | - | - | - | - | - | - |
| - | - | - | - | - | - | - |
| - | - | - | - | - | - | - |
| 3a TERAGRID.ORG | mgs.teragrid.org, psc.edu | mgs.teragrid.org | mds00w.psc.teragrid.org | oss00w.psc.teragrid.org, oss01w.psc.teragrid.org | mds00w.psc.teragrid.org, operon.psc.edu (patched kernels) with kerberos | Works-completed; lustre bug-OST addition should reflect storage (being fixed) |
| 3b TERAGRID.ORG | mgs.teragrid.org, psc.teragrid.org, psc.edu, tacc.teragrid.org, tacc.utexas.edu | mgs.teragrid.org | mds00w.psc.teragrid.org | oss00w.psc.teragrid.org, oss01w.psc.teragrid.org, css0.tacc.teragrid.org | - | - |
| - | - | - | - | - | - | - |
| 3c TERAGRID.ORG | mgs.teragrid.org **virtual** | mgs.teragrid.org **virtual** | mds00w.psc.teragrid.org | oss00w.psc.teragrid.org, oss01w.psc.teragrid.org | - | compare with non-virtual setup |
| 3d TERAGRID.ORG | mgs.teragrid.org **virtual** | mgs.teragrid.org **virtual** | mds00w.psc.teragrid.org | oss00w.psc.teragrid.org, oss01w.psc.teragrid.org, css0.tacc.teragrid.org | - | compare with non-virtual setup |
| 4a TERAGRID.ORG | jwan.teragrid.org | attractor.jwan.teragrid.org | attractor.jwan.teragrid.org | attractor.jwan.teragrid.org | attractor.jwan.teragrid.org | Fully functional, successful cycling among kerb flavors for root and users, fast IO with kerberos, distributed OST, users doing IO with secure lustre server/client connections (July 2009) |
| 4b JWAN.TERAGRID.ORG | jwan.teragrid.org | mgs.jwan.teragrid.org | mds00w.psc.jwan.teragrid.org | oss00w.psc.jwan.teragrid.org, oss01w.psc.jwan.teragrid.org | client1.jwan.teragrid.org | Fully functional, successful cycling among kerb flavors for root and users, fast IO with kerberos, distributed OST, users doing IO with secure lustre server/client connections, quota enabled and working, lustre ACLs enabled and working. Extending this on cross-realm kerb auth with lustre-id mapping(Oct 2009) |
Phase III. Setting up Cross-Realm Authentication
The kerberized lustre-wan is mounted from mgs.jwan.teragrid.org which resides on the TERAGRID.ORG kerberos realm.
A. Checklist
- Knowing the KDC admins to contact
- Turning off your firewall
- Editing your kdc.conf (capaths)
- Adding lustre component principals and creating kerb keys
- Installing your keytabs; verifying if your keys are good
- Logging in with kerberos
B. Debugging your kerberos setup
- testing kerberos login
- reverifying if your keys are good
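The principal and keytab items in the checklist can be sketched as below. The `lustre_mds/<host>@REALM` form appears in the security notes on this page; the `lustre_oss` and `lustre_root` forms follow the same pattern as described in the Lustre kerberos documentation and should be treated as assumptions for your release. The function name, host names and keytab path are placeholders.

```shell
#!/bin/sh
# Build the service principal name for a lustre component.
# Role is one of: mds, oss, root (root is used for client mounts).
lustre_principal() {
    role=$1
    host=$2
    realm=$3
    case $role in
        mds|oss|root)
            printf 'lustre_%s/%s@%s\n' "$role" "$host" "$realm"
            ;;
        *)
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}

# Verifying an installed keytab (commented out; needs a real keytab and KDC):
#   klist -k /etc/krb5.keytab        # list the keys in the keytab
#   kinit -k -t /etc/krb5.keytab \
#       "$(lustre_principal oss oss1.mysite.jwan.teragrid.org TERAGRID.ORG)"
#   klist                            # a ticket here means the key is good
```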
Phase IV: Implementing OST Pools
Users specify subsets of the OSTs on which to store files, and can thus save output files to local OSTs for jobs running at another site. They can still access input files from any OST, albeit at lower speed.
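A pool setup along these lines might look like the following sketch. It assumes the `lctl pool_new`, `lctl pool_add` and `lfs setstripe --pool` commands as they appear in Lustre releases with pool support; the syntax in the 2.0 alpha builds used here may differ. The function name, pool name and OST names are placeholders.

```shell
#!/bin/sh
# Print the commands that create an OST pool and make a directory use it.
# Arguments: filesystem name, pool name, directory, then one or more OSTs.
jwan_pool_cmds() {
    fs=$1
    pool=$2
    dir=$3
    shift 3
    printf 'lctl pool_new %s.%s\n' "$fs" "$pool"      # run on the MGS
    for ost in "$@"; do
        printf 'lctl pool_add %s.%s %s\n' "$fs" "$pool" "$ost"
    done
    printf 'lfs setstripe --pool %s %s\n' "$pool" "$dir"   # run on a client
}

jwan_pool_cmds jwan mysite /jwan/mysite jwan-OST0002 jwan-OST0003
```

New files created under the directory would then be placed on the OSTs in that pool.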
Phase V: Measuring the Kerberos Security Overhead
- Shift among all kerberos security flavors during mounts (null, plain, krb5n, krb5a, krb5i, krb5p).
- Do base IO benchmark for the local setup comparing different flavors to measure the kerb security overhead.
| FLAVOR | AUTH | RPC MESSAGE Protection | BULK DATA Protection | Performance Overhead (R:W) |
|---|---|---|---|---|
| null | - | - | - | Base |
| plain | - | null | checksum (adler32) | - |
| krb5n | GSS/Krb5 | null | checksum (adler32) | - |
| krb5a | GSS/Krb5 | partly integrity | checksum (adler32) | - |
| krb5i | GSS/Krb5 | integrity | integrity (sha1) | - |
| krb5p | GSS/Krb5 | privacy | privacy (sha1/aes128) | - |
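The overhead column can be filled in by comparing each flavor's throughput with the null baseline. The sketch below computes that percentage; the function name is hypothetical, and the `lctl conf_param ... srpc.flavor` form in the comments follows the Lustre 2.x documentation, so it may not match the alpha release used here.

```shell
#!/bin/sh
# Percentage of throughput lost relative to the null-flavor baseline.
# Arguments: baseline throughput, throughput for the flavor under test
# (same units for both, e.g. MB/s).
overhead_pct() {
    awk -v base="$1" -v cur="$2" \
        'BEGIN { printf "%.1f\n", (base - cur) / base * 100 }'
}

# Switching flavors and timing a write (commented out; run on a test system):
#   lctl conf_param jwan.srpc.flavor.default=krb5i     # on the MGS
#   dd if=/dev/zero of=/jwan/mysite/testfile bs=1M count=1024 conv=fsync
#   overhead_pct <null MB/s> <krb5i MB/s>
```

Running it once for reads and once for writes gives the R:W pair for each flavor.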
Phase VI: Benchmark Proper
- Set up 2 local systems with different realms and mount 'remote' local OSS
- Do IO benchmark for the local setup.
BENCHMARK ROLLS
- Basic IO results: fio, ior, iozone, bonnie, collectl
- Application IO results: oocore
BUGS & SECURITY
Phase I:
PSC Patches
- Cross-Realm issue/bug with patch (Category: Bug)
- Symptom:
- Issue: lsvcgssd refuses all remote-realm principals
- Fix: the get_ids() function in lustre/utils/gss/svcgssd_proc.c
- Patch: http://staff.psc.edu/ben/patches/lustre/lsvcgssd-xrealm.patch
- This patch has been incorporated into CVS HEAD so no need to apply.
- Unsafe directory modes in lustre-source RPMs (Category: Bug/Security)
- Issue: automake sets all directories in the distdir tree to mode 777
- Fix/Patch: http://staff.psc.edu/ben/patches/lustre/lustre-source-fix-unsafe-dir-modes.patch
- This patch has been incorporated into lustre-1.6.5.1 and CVS HEAD so no need to apply.
- Prevent compromise of Lustre clients mounting lustre-wan filesystem (Category: Security)
- Resolution: Include nosuid, nodev as mount arguments
- Enable MDT-> OST cross-realm by removing implicit lustre_mds principal authorization, requiring authorized MDS principals be given as '-M lustre_mds/mdshost@REALM' and eliminating the danger of implicit authz of 2 lustre-clusters in a single realm (Category: Security/Enabling feature)
Relevant Lustre Bugs
- 14836 Adding OST pool support
- 15827 MGS to allow/deny OST from joining fs on MDS
- 14951 OST addition not being handled by client
Phase II
Relevant Lustre Bugs: (http://bugzilla.lustre.org)
- 20044 (Simul failed)
- 20119 (Assertion failure in journal_start() at fs/jbd/transaction.c:283: "handle->h_transaction->t_journal == journal")
- 20220 (Variable `DIST_SOURCES' is defined; 8KB stack)
- 20253 (Lustre kerberos credentials not looking at $KRB5CCNAME)
- 20694 (Missing libcfs.a from HEAD?)
- 20695 (configure: error: You have got no 64-bit kernel quota support)
Phase III: Jwan semi-production (kerberos enabled with TG accounts)
- 22026 (ASSERTION(ev->md.start == req->rq_repbuf) failed)
- 22028 (Lustre/kerberos cryptic and undocumented error codes)
- 22131 (ASSERTION(request->rq_repdata == NULL) failed)
- 22314 (lu_ref.c:264:lu_ref_del() ASSERTION(0) failed)
- 23551 Lustre userid mapping in SKR (Single Kerberos realm): pseudo-kerberos
- 23552 Kerb auth by LNET: pursuing null on 1 interface
BLOG
TALKS
- Lustre Users Group @Aptos, CA 2010
- Lustre Users Group @Sausalito, CA 2009
- Lustre Users Group @Sonoma, CA 2008
- http://wiki.lustre.org/index.php?title=Lug_08#AGENDA
- Lustre Roadmap Update, Bryon Neitzel
- Sun Storage Perspective & Lustre Architecture, Peter Braam
- NFS/pNFS export, Ricardo Correia
- Clustered Metadata (CMD), Andreas Dilger
- Linux HPC Software Stack, Makia Minich
- PSC Lustre-wan talks
- Arch meeting 8.14.08
DEMONSTRATION
- SC10:
- SC09: JWAN with fully functional Lustre 2.0 kerberos authentication between lustre components, distributed OST/OST pools, exploring users utilizing lustre idmap with cross-realm implementation, efforts towards JWAN integration into the TERAGRID portal, clustered metadata with failovers
- SC08: Advanced Lustre-wan features with working Kerberos, exploring UID/GID remapping with Kerberos, distributed OSS (1 or more OSS at remote site), standard IO benchmarks as well as application benchmarks on the distributed OSS.
