JWAN: lustre-wan advanced features testing


INTRODUCTION

JWAN (Josephin-WAN) is PSC's initiative to establish a secure, federated, distributed Lustre filesystem over the TeraGrid WAN.

BACKGROUND

Lustre-WAN Testing Efforts on the Teragrid

GENERAL OBJECTIVES

  • We closely track the Lustre CVS development head as advanced features become available.
  • We explore the feasibility of a distributed global lustre-wan filesystem through study, testing, and investigation of the following:
    • Heimdal/MIT Kerberos authentication implementation on Lustre/Lustre-wan
    • UID/GID mapping with kerberos, no root squash
    • Any GSI tie up (or not)
    • Distributed OSS: OSS's contributed from within a local site and from a remote site (same/different realms)
    • Clustered OSS; OST pools
    • Clustered Metadata: 2 or more MDS units within a local site
    • Multiple failover MDS's and OSS's
    • Distributed MDS: 2 or more MDS units across 2 different sites
    • Proxy service implementation
    • Data Replication
    • Benchmarking/Tuning: basic IO and application-based with these features enabled

SPECIFIC GOALS

  • Establish a single-kerberos-realm lustre-wan filesystem namespace that can be easily mounted on TG resources, with OSS storage servers spread across the TG at RP sites.
  • Extend JWAN to multiple kerberos realms towards cross-realm kerberos authentication
  • Experiment with Clustered Metadata Servers (CMD) with failovers


LINKS ARCHIVE

COLLABORATORS

  • PSC: Josephine Palencia, josephin@psc.edu; Robert Budden, rbudden@psc.edu; Kevin Sullivan, ksulliva@psc.edu
  • SUN Lustre Group/Developers: Eric Mei
  • SDSC: Jeffrey Bennett, jab@sdsc.edu
  • NICS: Patricia Kovatch, pkovatch@utk.edu; Zachary Giles, zgiles1@utk.edu
  • TACC: Chris Jordan, ctjordan@tacc.utexas.edu
  • NCSA: Michelle Butler, mbutler@ncsa.uiuc.edu; Chad Kerner, ckerner@ncsa.uiuc.edu


PAST COLLABORATORS:

  • SUN: Andreas Dilger, Peter Braam
  • PSC: Doug Balog (Rinera), Ben Bennett
  • SDSC: Donald Thorp, dthorp@sdsc.edu; Haisong Cai, cai@sdsc.edu
  • IU: Steve Simms, ssimms@indiana.edu
  • NCAR: Adam Boggs (Morphlix)
  • ORNL: Greg Pike (Okinawa Institute of Science and Technology)

HARDWARE

Site | Machine name | IP address | Function | Storage | Arch/OS | Network
PSC | mgs.jwan.teragrid.org | 128.182.112.251 | MGS | - | (Dual) Xeon / CentOS 5.3 | 1GigE
- | mds00w.psc.jwan.teragrid.org | 128.182.112.60 | MDS | - | Dual-core AMD / CentOS 5.3 | IB/GigE
- | oss00w.psc.jwan.teragrid.org | 128.182.112.61 | OSS | 1.4TB | Dual-core AMD / CentOS 5.3 | IB/GigE
- | oss01w.psc.jwan.teragrid.org | 128.182.112.62 | OSS | 1.4TB | Dual-core AMD / CentOS 5.3 | IB/GigE
- | client1.jwan.teragrid.org | 128.182.112.70 | Client | - | Dual Xeon / CentOS 5.3 | IB/GigE
PSC | attractor.jwan.teragrid.org | 128.182.168.77 | MGS/MDT/OSS/Client | 276GB | (Dual) AMD Opteron / CentOS 5.3 | 1GigE
SDSC | np-login.jwan.teragrid.org | 198.202.115.110 | Client | n/a | Quad-core 2.5 GHz Xeon, Rocks V | GigE
- | np-compute-0.jwan.teragrid.org | 198.202.115.111 | Client | n/a | Quad-core 2.5 GHz Xeon, Rocks V | GigE
- | np-compute-1.jwan.teragrid.org | 198.202.115.112 | Client | n/a | Quad-core 2.5 GHz Xeon, Rocks V | GigE
- | np-compute-2.jwan.teragrid.org | 198.202.115.113 | Client | n/a | Quad-core 2.5 GHz Xeon, Rocks V | GigE
- | np-compute-3.jwan.teragrid.org | 198.202.115.114 | Client | n/a | Quad-core 2.5 GHz Xeon, Rocks V | GigE
- | np-oss-7.jwan.teragrid.org | 198.202.115.122 | OSS | 2 TB | Dual Opteron 1.8 GHz, Rocks V | GigE
NICS | verne.nics.utk.edu | 192.249.7.6 | Client | n/a | - | GigE
- | verne1.nics.utk.edu | 192.249.7.7 | Client | n/a | - | GigE
- | verne2.nics.utk.edu | 192.249.7.8 | Client | n/a | - | GigE
- | verne3.nics.utk.edu | 192.249.7.9 | Client | n/a | - | GigE
- | verne4.nics.utk.edu | 192.249.7.10 | Client | n/a | - | GigE
TACC | c0-201.tacc.utexas.edu | 129.114.50.33 | Client | n/a | Dual-core Xeon 2.66 GHz, RedHat | 1GigE
NCSA | tg-lustre01.ncsa.uiuc.edu | 141.142.25.101 | MDS | 100G | Dual-core Intel, RedHat | 10GigE
- | tg-lustre02.ncsa.uiuc.edu | 141.142.25.102 | OSS | 3TB | Dual-core Intel, RedHat | 10GigE
- | tg-lustre03.ncsa.uiuc.edu | 141.142.25.103 | OSS | 3TB | Dual-core Intel, RedHat | 10GigE
- | tg-lustre04.ncsa.uiuc.edu | 141.142.25.104 | OSS | 3TB | Dual-core Intel, RedHat | 10GigE
IU | mds03.uits.indiana.edu | 149.165.235.173 | MDS | 500 GB | Dual-core 3 GHz Xeon, RedHat | 1GigE
- | oss25.uits.indiana.edu | 149.165.235.25 | OSS | 93 TB | Dual-core 3 GHz Xeon, RedHat | 10GigE
- | oss26.uits.indiana.edu | 149.165.235.26 | OSS | 93 TB | Dual-core 3 GHz Xeon, RedHat | 10GigE
- | oss27.uits.indiana.edu | 149.165.235.27 | OSS | 93 TB | Dual-core 3 GHz Xeon, RedHat | 10GigE
- | oss28.uits.indiana.edu | 149.165.235.28 | OSS | 93 TB | Dual-core 3 GHz Xeon, RedHat | 10GigE
ORNL | - | - | - | - | - | -

DOCUMENTATION

Instructions for lustre servers (OSS) and clients connecting to JWAN

(Single Kerberos Realm)


A. Decide on the names of your systems.


It should be of the form <your-system-name>.jwan.teragrid.org

Ex. Suggested names:

  • client1.<your_site>.jwan.teragrid.org
  • oss1.<your_site>.jwan.teragrid.org

After you have decided on the names, please update the hardware section for your site above (machine names, IP addresses, specs). We need this information for the PSC network configuration. (IMPORTANT)

B. Visit the site <http://downloads.lustre.org/public> and obtain the following:


  • Kernel 2.6.18_128.7.1 for your specific Architecture (We use CentOS 5.*)
  • Lustre-1.9.280 (Lustre 2.0 alpha 5 release)


C. Kernel


You can run either a Lustre-patched kernel or a patchless one.

  • Install your patched kernel and reboot

NOTE: Go to ftp://ftp.psc.edu/pub/jwan/ for the patched kernel and lustre RPMS.
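A minimal install sketch, assuming you use the pre-built RPMs from the FTP site above (the exact package names depend on the architecture and build you download):

  # install the Lustre-patched kernel RPM, then boot into it
  rpm -ivh kernel-*2.6.18*128.7.1*.rpm
  reboot
  uname -r        # confirm the new kernel is running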


D. Lustre


  • Make sure you use the option --enable-gss during configure
  • Install lustre (from RPMs or via make install) and test-load your lustre modules

NOTE: You can check http://staff.psc.edu/josephin/Lustre-wan/ for generic lustre RPMs built with kerberos support that may work on your system.
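A minimal build-from-source sketch, assuming the Lustre 1.9.280 tarball and a kernel source tree at the path shown (adjust paths and versions to your system):

  cd lustre-1.9.280
  ./configure --with-linux=/usr/src/kernels/2.6.18-128.7.1 --enable-gss
  make rpms                # or: make && make install
  modprobe lustre          # test-load the modules
  lctl list_nids           # confirm LNET came up on the expected interface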


E. Contact PSC by sending mail to jwan@psc.edu to request lustre keytabs for your systems.


  • In your email, indicate the FQDNs, IP addresses, network MAC address(es), and the functions your systems will perform (e.g. client, OST, or both)


F. Mount jwan with this command


  • mount -t lustre mgs.jwan.teragrid.org@tcp0:/jwan /jwan
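A short sketch of the full mount sequence (the /jwan mountpoint is the convention used above; create it first if it does not exist):

  mkdir -p /jwan
  mount -t lustre mgs.jwan.teragrid.org@tcp0:/jwan /jwan
  df -h /jwan              # confirm the filesystem is mounted and shows the expected capacity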


G. Login to your JWAN client using your TERAGRID portal name


You need a TERAGRID account or portal username. Install the login-script patch for KRB5CCNAME, then:

  • type kinit <your_TERAGRID_portal_name>@TERAGRID.ORG
  • cd /jwan/<your_site>/<your_TERAGRID_portal_name> and start doing IO

NOTE: By default, /jwan/<your_site> will stripe across your contributed OSTs (if applicable) for faster IO
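A minimal session sketch (the dd parameters are arbitrary and just give a quick IO sanity check):

  kinit <your_TERAGRID_portal_name>@TERAGRID.ORG
  klist                                             # verify the ticket landed in the expected KRB5CCNAME cache
  cd /jwan/<your_site>/<your_TERAGRID_portal_name>
  dd if=/dev/zero of=testfile bs=1M count=100       # quick write test over the kerberized mount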


H. Lustre-2.0 bugs


  • See below under BUGS: Phase II

Specific instructions for OST servers



A. First, before creating or mounting an OST, send mail to jwan@psc.edu stating that you want to add a server to the OST pool. (Important!)

B. Test-mount jwan from your OST server. (Important!)

C. Follow http://wiki.lustre.org/index.php/Mount_Conf and specify attractor.jwan.teragrid.org as the mgs/mdt server:

  • mkfs.lustre --fsname=jwan --ost --mgsnode=attractor.jwan.teragrid.org@tcp0 <your_device>
  • mkdir -p /mnt/jwan/ost0
  • mount -t lustre <your_device> /mnt/jwan/ost0
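After the OST mounts, a quick check from any client (a sketch; the OST index shown will depend on the order in which OSTs joined):

  lfs df -h /jwan          # the new OST should appear with its capacity
  lfs osts /jwan           # list the OST indices currently in the jwan filesystem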

Lustre Idmap

  • Accounts (TERAGRID portal names) are consistent across all JWAN systems. The current JWAN systems exist in just one kerberos realm, namely TERAGRID.ORG, hence there are no cross-realm lustre id-mapping issues involved, i.e. idmap.conf is disabled.

Multiple Kerberos Realms/Cross-Realm Authentication/Lustre ID-MAP

  • PSC started experimenting with adding systems placed in different kerberos realms (i.e. outside TERAGRID.ORG).
  • Instructions to follow.

REFERENCES

KEY CONFIG FILES

Initially use local parameters for these files to get your system operational. These values will be consolidated once cross-site testing begins.

  • kdc.conf
  • krb5.conf
  • lustre_idmap
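An illustrative krb5.conf sketch for a host whose local realm differs from TERAGRID.ORG; the KDC hostnames and the YOURSITE.EDU realm are placeholders rather than actual JWAN values, and the [capaths] stanza only matters once cross-realm testing starts:

  [libdefaults]
      default_realm = TERAGRID.ORG

  [realms]
      TERAGRID.ORG = {
          kdc = kdc.teragrid.org
          admin_server = kdc.teragrid.org
      }
      YOURSITE.EDU = {
          kdc = kdc.yoursite.edu
          admin_server = kdc.yoursite.edu
      }

  [domain_realm]
      .jwan.teragrid.org = TERAGRID.ORG
      .yoursite.edu = YOURSITE.EDU

  [capaths]
      YOURSITE.EDU = {
          TERAGRID.ORG = .
      }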

TEST PLAN

Phase I: Checking the Network

  • Check network bandwidth (Mbits/sec)/latency (ms) between RP sites
Site pair | Systems | Bandwidth (Mbits/sec) | Reversed (Mbits/sec) | Latency (ms)
PSC-TACC | (mgs, mds00w, oss00w, oss01w) - c0-201 | 9-27 | 64-88 | 39
PSC-NCSA | (mgs, mds00w, oss00w, oss01w) - | - | - | 19
- | (mgs, mds00w, oss00w, oss01w) - | - | - | 19
- | (mgs, mds00w, oss00w, oss01w) - | - | - | 19
- | (mgs, mds00w, oss00w, oss01w) - | - | - | 19
PSC-IU | - | - | - | -
PSC-NCAR | - | - | - | -
PSC-SDSC | - | - | - | -
PSC-ORNL | - | - | - | -
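A sketch of how these numbers can be gathered (assuming iperf is installed on both ends; the hostname is an example from the hardware table, and "Reversed" is obtained by swapping the client and server roles):

  ping -c 10 oss00w.psc.jwan.teragrid.org        # round-trip latency in ms
  # on the remote host:
  iperf -s
  # on the local host:
  iperf -c oss00w.psc.jwan.teragrid.org -t 30    # sustained bandwidth in Mbits/sec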

Phase II: Defining the Kerberos Realms, Domains Setup for the WAN lustre filesystem

Below are the four options we considered for setting up the kerberos framework of the lustre-wan. We list their pluses and minuses; Option C was chosen as the most feasible approach and later evolved into Option D, the setup that was ultimately deployed.

  • Option A: Use the current local kerberos environments, where each RP site's lustre-wan systems are mapped onto their own local kerberos realm
    • (--) the filesystem will reside at (be local to) one RP site and be mounted from that site.
  • Option B: Create a new (lustre-wan) domain (called mgs.teragrid.org) on the TERAGRID.ORG realm and roll all participating lustre-wan systems into this single domain.
    • (--) requires setting up a totally new, separate kerberos realm (more work)
  • Option C: Adopt a combination of A and B: place only the MGS (management configuration server) on the TERAGRID.ORG kerberos realm while the rest of the systems use their already established local realms.
    • (++) leverages the already existing TG kerberos infrastructure
    • (++) the lustre-wan filesystem resides on the TERAGRID.ORG realm and is mounted with the teragrid.org namespace.
  • Option D: Use TERAGRID.ORG as the single kerberos realm; create jwan.teragrid.org as the DOMAIN namespace, where all joining systems adopt the <site>.jwan.teragrid.org namespace. (Successful July 2009)
    • Then extend from the single kerberos realm to multiple kerberos realms with cross-site/cross-realm kerberos authentication (Sept 2009)

Background

We aim to distribute a filesystem across several organizations, not simply implement a centralized filesystem that will be accessed by different organizations. TeraGrid already has an existing kerberos authentication mechanism as well as a central database of users which could be used to consistently generate mappings.

Leveraging the existing TG kerberos infrastructure, we implement transit cross-realm kerberos connections, maintain the MDSs and OSSs in their respective local realms, distribute the OSSs across several resource providers, and export only the MGS on the TERAGRID.ORG realm.

KERBEROS SETUP:

We use transit cross-realm connections between the TERAGRID.ORG realm and the RP sites' respective realms.

  • MGS in realm TERAGRID.ORG
    • MGS: 1 MGS (machine physically at PSC but residing on the TERAGRID.ORG kerberos realm)
  • MDS/OSS in the RPs' respective realms
    • MDSs: currently just 1 MDS residing at PSC (mds00w); more will be added later as the configuration allows
    • OSSs: OSS servers will reside at RP sites and use the local kerberos realm of the site with local site hostnames; we anticipate many more OSS contributions from various RP sites
  • Clients in the RP and teragrid realms


NOTES:

  • The authentication model uses the TG's, so while sites outside the TeraGrid would be able to mount the filesystem, authentication might be a problem (though null kerberos authentication is an option for a particular lustre mount).
  • Client hosts mount from the MGS, e.g. mount -t lustre xdfs.teragrid.org:/lustre-kerb-wafs; having it in the TERAGRID.ORG realm eliminates pair-wise mounts.
  • Having the MGS in the TERAGRID.ORG realm simplifies user authentication as well.
  • Users get a TERAGRID.ORG ticket when they log into the portal. With this ticket, they will be able to access their files on any OSS through a lustre ID-mapping function, yet to be developed but similar to the gridmap service available today.
  • Work is under way on OST pools, which allow users to specify subsets of OSSs to store files on and therefore write to local OSSs.

Summary Test Table

The table below briefly summarizes the kerberos realms and domains and the lustre functions served by system components from various RP sites. For tests in the virtual environment, all systems (MDS, OSS, Client) reside on a single node. All other tests are on real, separate systems unless otherwise stated.

# | KERBEROS REALMS | DOMAINS | MGS | MDS | OSS | CLIENTS | STATUS
1a | PSC.EDU | psc.teragrid.org | mds00w | mds00w | oss00w, (oss01w) | fred002, operon22.psc.edu | working, lustre_idmap disabled/(enabled)
1b | PSC.EDU | psc.teragrid.org, utexas.tacc.edu, ncsa.teragrid.org | mds00w | mds00w | oss00w, oss01w | c0-201.utexas.tacc.edu, tg-pnfs[5-8].ncsa.teragrid.org | Verify that: i) kerb auth works for remote clients, ii) remote clients can mount the /lustre-wan fs, iii) lustre_idmap works
1c | PSC.EDU | psc.teragrid.org, utexas.tacc.edu, ncsa.teragrid.org | mds00w | mds00w | oss00w, oss01w, wan-css0.tacc.teragrid.org, tg-pnfs5.ncsa.teragrid.org | fred002, c0-201.utexas.tacc.edu, tg-pnfs[5-8].ncsa.teragrid.org | Verify that: i) a remote OSS can be added
2a | PSC.EDU (virtual) | psc.teragrid.org | mds00w | mds00w | oss00w.ncsa.edu, oss01w.tacc.utexas.edu | - | working, lustre_idmap enabled
2b | PSC.EDU (virtual) | - | - | - | - | - | -
3a | TERAGRID.ORG | mgs.teragrid.org, psc.edu | mgs.teragrid.org | mds00w.psc.teragrid.org | oss00w.psc.teragrid.org, oss01w.psc.teragrid.org | mds00w.psc.teragrid.org, operon.psc.edu (patched kernels) with kerberos | Works, completed; lustre bug: OST addition should reflect storage (being fixed)
3b | TERAGRID.ORG | mgs.teragrid.org, psc.teragrid.org, psc.edu, tacc.teragrid.org, tacc.utexas.edu | mgs.teragrid.org | mds00w.psc.teragrid.org | oss00w.psc.teragrid.org, oss01w.psc.teragrid.org, css0.tacc.teragrid.org | - | -
3c | TERAGRID.ORG | mgs.teragrid.org (virtual) | mgs.teragrid.org (virtual) | mds00w.psc.teragrid.org | oss00w.psc.teragrid.org, oss01w.psc.teragrid.org | - | compare with non-virtual setup
3d | TERAGRID.ORG | mgs.teragrid.org (virtual) | mgs.teragrid.org (virtual) | mds00w.psc.teragrid.org | oss00w.psc.teragrid.org, oss01w.psc.teragrid.org, css0.tacc.teragrid.org | - | compare with non-virtual setup
4a | TERAGRID.ORG | jwan.teragrid.org | attractor.jwan.teragrid.org | attractor.jwan.teragrid.org | attractor.jwan.teragrid.org | attractor.jwan.teragrid.org | Fully functional; successful cycling among kerb flavors for root and users, fast IO with kerberos, distributed OST, users doing IO over secure lustre server/client connections (July 2009)
4b | JWAN.TERAGRID.ORG | jwan.teragrid.org | mgs.jwan.teragrid.org | mds00w.psc.jwan.teragrid.org | oss00w.psc.jwan.teragrid.org, oss01w.psc.jwan.teragrid.org | client1.jwan.teragrid.org | Fully functional; successful cycling among kerb flavors for root and users, fast IO with kerberos, distributed OST, users doing IO over secure lustre server/client connections, quota enabled and working, lustre ACLs enabled and working. Extending this to cross-realm kerb auth with lustre-id mapping (Oct 2009)

Phase III. Setting up Cross-Realm Authentication

The kerberized lustre-wan is mounted from mgs.jwan.teragrid.org which resides on the TERAGRID.ORG kerberos realm.

A. Checklist

  • Knowing the KDC admins to contact
  • Turning off your firewall
  • Editing your kdc.conf (capaths)
  • Adding lustre component principals and creating kerb keys
  • Installing your keytabs; verifying if your keys are good
  • Logging in with kerberos
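A sketch of the principal/keytab step for an MIT KDC. The lustre_mds/lustre_oss/lustre_root service-principal convention is Lustre's; the hostnames are examples from the hardware table, and in JWAN's single-realm setup these keys are created by PSC and shipped to you as keytabs (see section E above):

  # on the KDC: create the service principals for each lustre component
  kadmin.local -q "addprinc -randkey lustre_mds/mds00w.psc.jwan.teragrid.org@TERAGRID.ORG"
  kadmin.local -q "addprinc -randkey lustre_oss/oss00w.psc.jwan.teragrid.org@TERAGRID.ORG"
  kadmin.local -q "addprinc -randkey lustre_root/client1.jwan.teragrid.org@TERAGRID.ORG"
  # export each host's key into its default keytab, then verify
  kadmin.local -q "ktadd -k /etc/krb5.keytab lustre_oss/oss00w.psc.jwan.teragrid.org@TERAGRID.ORG"
  klist -k /etc/krb5.keytab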

B. Debugging your kerberos setup

  • testing kerberos login
  • reverifying if your keys are good
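A minimal debugging sketch (kvno confirms that a service ticket can actually be obtained for a lustre principal; the principal shown is an example):

  kinit <your_TERAGRID_portal_name>@TERAGRID.ORG
  klist -e                                                        # check the ticket and its encryption types
  kvno lustre_mds/mds00w.psc.jwan.teragrid.org@TERAGRID.ORG       # fetch a service ticket for a lustre principal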

Phase IV: Implementing OST Pools

Users specify subsets of the OSTs on which to store files, and can thus save output files to local OSTs even for jobs run at another site. They can still access input files on any OST, albeit at slower speed. A sketch of the pool commands is given below.
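A hedged sketch of the OST-pool mechanics (the pool name, OST indices, and directory path are illustrative; pool support depends on Lustre bug 14836 landing, as noted under BUGS):

  # on the MGS: create a pool and add the local OSTs to it
  lctl pool_new jwan.psc_pool
  lctl pool_add jwan.psc_pool jwan-OST0000 jwan-OST0001
  # on a client: direct new files under a directory to that pool
  lfs setstripe --pool psc_pool /jwan/psc/<your_TERAGRID_portal_name>/output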

Phase V: Measuring the Kerberos Security Overhead

  • Shift among all kerberos security flavors during mounts (null, plain, krb5n, krb5a, krb5i, krb5p).

  • Do a base IO benchmark for the local setup, comparing the different flavors to measure the kerberos security overhead.

FLAVOR | AUTH | RPC MESSAGE Prot | BULK DATA Prot | Performance Overhead (R:W)
null | - | - | - | Base
plain | null | - | checksum (adler32) | -
krb5n | GSS/Krb5 | null | checksum (adler32) | -
krb5a | GSS/Krb5 | partial integrity | checksum (adler32) | -
krb5i | GSS/Krb5 | integrity | integrity (sha1) | -
krb5p | GSS/Krb5 | privacy | privacy (sha1/aes128) | -
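A sketch of how the flavor can be switched filesystem-wide from the MGS. The jwan filesystem name comes from the instructions above; the exact lctl parameter names can vary between Lustre releases, so treat this as an assumption to verify against your version:

  # on the MGS: set the default client-server security flavor, e.g. krb5i
  lctl conf_param jwan.srpc.flavor.default=krb5i
  # on a client, after remounting: a simple timed write gives a first overhead number
  dd if=/dev/zero of=/jwan/<your_site>/<user>/flavor_test bs=1M count=1024 oflag=direct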

Phase VI: Benchmark Proper

  • Set up 2 local systems with different realms and mount 'remote' local OSS

  • Do IO benchmark for the local setup.

BENCHMARK ROLLS

  • Basic IO results: fio, ior, iozone, bonnie, collectl
  • Application IO results: oocore
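Illustrative invocations for two of the listed tools (file paths, sizes, and process counts are placeholders; IOR requires an MPI launcher):

  # IOR: parallel POSIX write/read from the client nodes
  mpirun -np 4 ior -a POSIX -w -r -t 1m -b 1g -o /jwan/<your_site>/<user>/ior_testfile
  # iozone: single-client automatic mode up to a 4 GB file
  iozone -a -g 4g -f /jwan/<your_site>/<user>/iozone_testfile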

BUGS & SECURITY

Phase I:


PSC Patches

  • Cross-Realm issue/bug with patch (Category: Bug)
  • Prevent compromise of Lustre clients mounting lustre-wan filesystem (Category: Security)
    • Resolution: Include nosuid, nodev as mount arguments
  • Enable MDT->OST cross-realm by removing implicit lustre_mds principal authorization, requiring that authorized MDS principals be given as '-M lustre_mds/mdshost@REALM', and eliminating the danger of implicit authorization of two lustre clusters in a single realm (Category: Security/Enabling feature)

Relevant Lustre Bugs

  • 14836 Adding OST pool support
  • 15827 MGS to allow/deny OST from joining fs on MDS
  • 14951 OST addition not being handled by client

Phase II


Relevant Lustre Bugs: (http://bugzilla.lustre.org)

  • 20044 (Simul failed)
  • 20119 (Assertion failure in journal_start() at fs/jbd/transaction.c:283: "handle->h_transaction->t_journal == journal")
  • 20220 (Variable `DIST_SOURCES' is defined; 8KB stack)
  • 20253 (Lustre kerberos credentials not looking at $KRB5CCNAME)
  • 20694 (Missing libcfs.a from HEAD?)
  • 20695 (configure: error: You have got no 64-bit kernel quota support)

Phase III: Jwan semi-production (kerberos enabled with TG accounts)


  • 22026 (ASSERTION(ev->md.start == req->rq_repbuf) failed)
  • 22028 (Lustre/kerberos cryptic and undocumented error codes)
  • 22131 (ASSERTION(request->rq_repdata == NULL) failed)
  • 22314 lu_ref.c:264:lu_ref_del()) ASSERTION(0) failed
  • 23551 Lustre userid mapping in SKR (Single Kerberos realm): pseudo-kerberos
  • 23552 Kerb auth by LNET: pursuing null on 1 interface

BLOG

TALKS

  • Lustre Users Group @Sonoma, CA 2008
    • http://wiki.lustre.org/index.php?title=Lug_08#AGENDA
    • Lustre Roadmap Update, Bryon Neitzel
    • Sun Storage Perspective & Lustre Architecture, Peter Braam
    • NFS/pNFS export, Ricardo Correia
    • Clustered Metadata (CMD), Andreas Dilger
    • Linux HPC Software Stack, Makia Minich
  • PSC Lustre-wan talks
    • Arch meeting 8.14.08

DEMONSTRATION

  • SC10:
  • SC09: JWAN with fully functional Lustre 2.0 kerberos authentication between lustre components, distributed OST/OST pools, users utilizing lustre idmap with a cross-realm implementation, efforts towards JWAN integration into the TERAGRID portal, and clustered metadata with failovers

  • SC08: Advanced Lustre-wan features with working Kerberos; exploration of UID/GID remapping with Kerberos; distributed OSS (one or more OSS at a remote site); standard IO benchmarks as well as an application benchmark on the distributed OSS
