Tuesday, April 16, 2013

Building & Using a Highly Available ESOS Disk Array

At our institution, we have been using Enterprise Storage OS (ESOS) in production for the last year. We have (4) 24-slot SSD disk arrays that sit on our Fibre Channel Storage Area Network (SAN), and they have worked great. These particular units are used in our VMware View (VDI) farm, and even though they are “single-headed” (not fully redundant), our VDI environment spreads pools across multiple datastores, so even if one disk array failed, VMs would still be available on another.

In this environment, we were also using another enterprise disk array (no vendor names mentioned) that was aging and that we continued to pay a large amount of money to maintain... for a relatively small amount of storage space. We wanted to replace this expensive, proprietary disk array with something else.

We had been using ESOS for a while, and liked it. Recently, a number of new features were added to ESOS, including DRBD, Pacemaker + Corosync, LVM2, and other software packages and enhancements. This opened up the possibility of creating a highly available disk array using ESOS.


The Setup
At the time of building the new ESOS-based disk array, our disk space need was only 5 TB, so we wanted to make sure our new unit gave us room to grow. Since there don’t appear to be any local RAID controllers that offer a high availability option, support for sharing external JBODs, or anything similar, our only option was to mirror the entire disk array to another unit. With this setup, we configured two servers in a cluster, each containing its own local disks, RAID controller, and Fibre Channel HBA. The data between the two nodes is replicated using DRBD; each ESOS node sits on the SAN, and the initiators use each target node as a path. We make use of implicit Asymmetric Logical Unit Access (ALUA) in SCST to help control, or rather guide/recommend (depending on the initiators), which path to use. We only want the initiators to use a single path/target (unless there is a failover), since there is more to SCSI than just reading/writing blocks of data. Only the blocks are replicated via DRBD in this ESOS disk array cluster; things like SCSI reservations are not replicated between the cluster nodes, so it’s important the initiators don’t round-robin between paths, and for clusters of initiators, they should all use the same target/path.

For this new disk array we designed, performance was not a big factor to consider -- we already had a ton of SSD-backed storage in our environment (SAN) on the other ESOS disk arrays. This new unit would primarily be used for VMware ESXi boot volumes and a couple of large general purpose VMFS volumes (for parent VMs, supporting server VMs, etc.). We decided on the LSI Logic Nytro MegaRAID application acceleration card, a relatively new product in the LSI line-up. This particular card features built-in (on-card) SSD storage that data “hot spots” can be promoted onto, and it also sports 1 GB of cache. LSI Logic MegaRAID cards work well with ESOS since the text-based user interface (TUI) has basic support for configuring logical drives on these controllers.

For the replication link between the two systems, we initially planned on using InfiniBand HCAs with Sockets Direct Protocol (SDP) which utilizes Remote Direct Memory Access (RDMA), but unfortunately, since SDP is now deprecated, it is not supported by ESOS. So, we went ahead with two InfiniBand HCAs with a QDR cable between the two nodes, and used IP over InfiniBand (IPoIB).

Each node has (12) 3.5” SAS disk slots; we used 7,200 RPM 2 TB SAS drives (Seagate). We dedicated (1) drive as a global hot spare, and then created one (6) disk RAID5 volume and one (5) disk RAID5 volume. This gives us approximately 16 TB of usable space (after RAID parity), which is quite a bit more than what we currently have.
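
For reference, the usable-space arithmetic works out like this (a “2 TB” drive is only about 1.8 TiB once you account for the decimal-to-binary conversion):
  • (6) disk RAID5: 5 data disks x 2 TB = 10 TB
  • (5) disk RAID5: 4 data disks x 2 TB = 8 TB
  • Total: 18 TB decimal, or roughly 16 TiB of real usable space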

We now had a plan for our new fully redundant, Fibre Channel disk array, based on Enterprise Storage OS (ESOS). We got down to business, and put in requisitions for all of the new hardware.

Cost breakdown for the new ESOS disk array:
  • ~ $1,100 - (2) Mellanox MHQH19B-XTR ConnectX 2 VPI InfiniBand HCAs
  • ~ $50 - (1) Mellanox Technologies Half M Copper 4x QSFP 30 AWG Cable
  • ~ $3,000 - (2) LSI Logic LSI00350 Nytro MegaRAID 8100-4I SAS RAID Controllers
  • ~ $300 - (2) LSI Nytro MegaRAID SCM01 RAID Controller Cache Data Protection Modules
  • ~ $6,400 - (26) Seagate Constellation ES.2 ST32000645SS 2 TB SAS-2 Hard Drives
  • ~ $2,000 - (2) QLogic 8 Gb Fibre Channel PCI-E Single Port Host Bus Adapters
  • ~ $8,100 - (2) Supermicro SuperStorage Server 6027R-E1R12T Chassis (12 x 3.5” Slots; 32 GB RAM; 2 x Intel Xeon Processors)
  • ~ $100 - (2) Lexar JumpDrive Triton 32 GB USB 3.0 Flash Drives

Total cost for a ~16 TB, fully redundant, Fibre Channel disk array: ~ $21,050... replacing your enterprise disk array for less than a year’s worth of maintenance costs... priceless!




So, we’re using the InfiniBand link for DRBD replication. In our environment, we have a normal management network for servers/devices, plus a special, non-routable private network that we use for out-of-band management interfaces (DRAC, IPMI, etc.). For these ESOS storage server nodes, we put the Supermicro out-of-band management interfaces on our private network, and also connected one of the NICs on each node to that network. This connectivity is important since we use IPMI as our fencing/STONITH method later in the article. The other server NIC goes on our primary management network. The two networks are completely separate/independent, which is important since we use two rings with Corosync, one for each network; if one network/link goes down, the two nodes can still communicate with each other. Each ESOS node is then connected to an independent Fibre Channel (FC) fabric, and each host/initiator is connected to both fabrics, giving us full redundancy in case of a switch/fabric failure.



Installation
We spent a morning installing the servers in a rack and installing all of the components (RAID controller, HCA, HBA, etc.) in each unit. We then cabled everything, powered up each server, and installed the disks in the trays.

We started by configuring the out-of-band management interface on the Supermicro servers. Once we had changed the default password, we opened the virtual console and set a few BIOS (UEFI) settings:
  • We enabled the “mirroring” memory mode, giving us 16 GB of available memory.
  • For the MegaRAID card, we disabled controller BIOS (not booting from any logical drives).
  • We double-checked that the QLogic HBA BIOS option was set to disabled.

Next, we created (2) ESOS USB flash drives. For the USB drives, we decided to go with an above-average device, the Lexar JumpDrive Triton 32 GB USB 3.0 flash drive. Even though our servers aren’t USB 3.0, these devices are much faster than ordinary/standard flash drives even when running at USB 2.0 speeds. This makes a noticeable difference in ESOS when booting, since the entire image is copied into a tmpfs file system on start-up, and even when syncing configuration changes.

We used a RHEL (6) workstation as our system to create the ESOS USB flash drives. We then downloaded and extracted the latest installation package from the ESOS project page: http://code.google.com/p/enterprise-storage-os/

wget http://enterprise-storage-os.googlecode.com/files/esos-0.1-r469.tar.xz
tar xvfJ esos-0.1-r469.tar.xz

After the archive was extracted, we plugged in the first flash drive and found its device node using the lsscsi tool.
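
The flash drive is easy to spot in the lsscsi output, which looks something like this (illustrative; the vendor/model columns and device node will differ on your system):

lsscsi
[0:0:0:0]    disk    ATA      ...               ...   /dev/sda
[6:0:0:0]    disk    Lexar    JumpDrive Triton  ...   /dev/sdb

We then started the ESOS installer script: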

cd esos-0.1-r469
./install.sh

The installer will prompt for the USB flash drive device node, and warn you before writing the image to the disk. After the image was successfully written, the install script then prompted us to install a third-party (proprietary) CLI RAID configuration tool. In our case, we are using LSI Logic MegaRAID cards, so we downloaded MegaCLI from the given URL and placed it into the temporary directory. The installer finished incorporating the MegaCLI tool into the image and then it was ready for use!

We repeated the above ESOS installation steps for our second server (second USB flash drive). We then labeled each flash drive with the corresponding server’s host name and inserted the drives into each server.

Since we didn’t have any other boot devices on these systems, the ESOS USB flash drive defaulted to being the first boot device (we checked via the UEFI setup screen). We booted up each ESOS storage server, and the first thing we did on both was change the default password (root/esos).




System Configuration
Next, we configured our two Ethernet network interfaces and the host name in the TUI. After the interfaces were configured, we SSH’d into the machines and set the time zone, date/time, and an NTP server.



Next, we need to enable IP over InfiniBand (IPoIB) for our IB interfaces on each host. Ideally, Sockets Direct Protocol (SDP) would be the best option for DRBD replication over InfiniBand, but SDP is now deprecated and ESOS does not support it. There have been hints in forums of DRBD adding RDMA support (which IPoIB lacks), but until then, this is probably the best solution. 10 GbE would also be a good option; truthfully, this IPoIB setup is probably only marginally better.

Edit the ‘/etc/infiniband/openib.conf’ IB driver configuration file, and set the following two lines (on both hosts):

IPOIB_LOAD=yes
SET_IPOIB_CM=yes

Next, we restarted the IB stack on each host:

/etc/rc.d/rc.openibd stop && /etc/rc.d/rc.openibd start

Now that IPoIB is loaded, we can configure the IB interfaces using the TUI. We just chose an arbitrary network range that we’re not using anywhere else on campus (even though this isn’t routable). We then started OpenSM on each storage server:

/etc/rc.d/rc.opensm start

The OpenSM InfiniBand subnet manager handles multiple instances and will make one of them enter “standby” mode. After starting the OpenSM service, we edited the ‘/etc/rc.conf’ file and set rc.opensm_enable to “YES” so it starts up on boot. We then tested the IPoIB interface by pinging the other host.
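
For example, from the first node (these are the same 192.168.50.x IPoIB addresses that appear in the DRBD resource configurations later in the article):

ping -c 3 192.168.50.22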

Next, we configured email (SMTP) on each ESOS storage server. ESOS uses email to communicate alerts, warnings, errors, etc. to the administrator, so it’s important to configure.


Initial Cluster Setup
Now that we have the basic system configuration out of the way for each host, we can move on to configuring the cluster. The first step in the cluster setup, will be Corosync. Here is the ‘/etc/corosync/corosync.conf’ file we used on both nodes:

# 20130410 MAS

totem {
        version: 2
        cluster_name: esos
        crypto_cipher: none
        crypto_hash: none
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 10.35.6.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
                ttl: 1
        }
        interface {
                ringnumber: 1
                bindnetaddr: 172.16.106.0
                mcastaddr: 226.94.1.2
                mcastport: 5407
                ttl: 1
        }
}

nodelist {
        node {
                ring0_addr: 10.35.6.21
                nodeid: 1
        }
        node {
                ring0_addr: 10.35.6.22
                nodeid: 2
        }
}

logging {
        fileline: off
        to_stderr: no
        to_syslog: yes
        syslog_facility: local2
        debug: off
        timestamp: off
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}

quorum {
        provider: corosync_votequorum
        two_node: 1
}

In our configuration, we opted to use one ring on our primary Ethernet management interface (10.35.6.0) and one ring on our special non-routable management network (172.16.106.0). Next, we restarted Pacemaker and Corosync on each host, then checked the Corosync configuration:

/etc/rc.d/rc.pacemaker stop
/etc/rc.d/rc.corosync stop
/etc/rc.d/rc.corosync start
/etc/rc.d/rc.pacemaker start
corosync-cfgtool -s
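
The cfgtool output on each node should resemble the following (shown for node ID 1; the ring 1 address on our 172.16.106.0 network is illustrative):

Printing ring status.
Local node ID 1
RING ID 0
        id      = 10.35.6.21
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.106.21
        status  = ring 1 active with no faults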

Everything looks good; we see two rings with no faults. Next, we checked the cluster configuration:

crm configure show
crm_mon -1

In our configuration, on each host we saw an extra node in the config. It was left over from the default cluster stack configuration (e.g., “node $id="16777343" raisin.mcc.edu”), so we just used ‘crm configure edit’ and removed that line.


LVM / SCST ALUA Settings
Next, we made a few LVM configuration changes to prepare for later steps; we want LVM to only discover devices on /dev/drbdX block devices and not the underlying devices. We also disabled the LVM cache, set the default locking type to 3 (built-in cluster-wide locking), and removed the current cache file (on each host).

We edited ‘/etc/lvm/lvm.conf’ and set/ran the following (the snippet after this list shows where the settings live in the file):
  • filter = [ "a|drbd.*|", "r|.*|" ]
  • write_cache_state = 0
  • locking_type = 3
  • rm -f /etc/lvm/cache/.cache
  • mount /mnt/conf && rm -f /mnt/conf/etc/lvm/cache/.cache && umount /mnt/conf
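
For context, here is roughly where those settings sit in the ‘/etc/lvm/lvm.conf’ file (abbreviated; the stock file contains many more entries):

devices {
        filter = [ "a|drbd.*|", "r|.*|" ]
        write_cache_state = 0
}
global {
        locking_type = 3
}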

Since SCST is already running (the default), we went ahead and added our base ALUA settings to each host. We created a device group, which all SCST devices will be added to, and then a “local” and a “remote” target group on each host. The “local” target group on each host contains that host’s single, local Fibre Channel target. The “remote” target group gets the FC target of the other host. This setup is required for the SCST resource agent (Master/Slave -> ALUA).

On host cantaloupe.mcc.edu:

scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=1
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 50:01:43:80:21:df:9b:4c -driver qla2x00t -attributes rel_tgt_id=1
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=2
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=2

On host raisin.mcc.edu:

scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=2
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 50:01:43:80:21:df:c7:f4 -driver qla2x00t -attributes rel_tgt_id=2
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=1
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=1
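
To sanity-check the ALUA layout on either host, you can poke around SCST’s sysfs interface (the paths below assume the standard SCST 2.x sysfs layout):

ls /sys/kernel/scst_tgt/device_groups/esos/target_groups/
# Should list: local  remote
cat /sys/kernel/scst_tgt/device_groups/esos/target_groups/local/group_id
# Should print the group_id set above (1 on cantaloupe, 2 on raisin)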


Additional System Setup / Back-End Storage Configuration
After ALUA was set up on each host, we exited the shell back into the TUI and synced the configuration (System -> Sync. Configuration); this writes the current SCST configuration to a file and syncs everything to the USB flash drive. We could now configure the ESOS system services for our setup; edit the ‘/etc/rc.conf’ file and set the following (on both hosts):

rc.openibd_enable=YES
rc.opensm_enable=YES
rc.sshd_enable=YES
rc.lvm2_enable=NO
rc.drbd_enable=NO
rc.corosync_enable=YES
rc.dlm_enable=YES
rc.clvmd_enable=YES
rc.pacemaker_enable=YES
rc.mhvtl_enable=NO
rc.scst_enable=NO

The primary services/systems we use on these hosts (DRBD, LVM, and SCST) are all managed by the cluster stack, so we disable them from being started by the init/rc scripts. Since we will be using LVM on top of DRBD, we use clvmd, which prevents (via locking) concurrent LVM metadata updates. DLM is a requirement for clvmd, so we enable that as well. We then rebooted both nodes to ensure everything starts up (or doesn't) as expected; check the physical console for start-up errors/messages.

We wanted to be sure the LSI Logic Nytro MegaRAID (8100-4i) cards had the newest firmware available, so we downloaded the firmware image and flashed the controller on each host:

MegaCli64 -adpfwflash -f NytroMrFw.rom -a0

After the firmware flash was complete, we rebooted each node. We were now ready to create our RAID logical drives (virtual drives). Since we are creating an exact replica of all the storage on each host, we configured both the same. We have (12) SAS 2 TB hard drives in each box; we wanted (1) global hot spare drive, and decided on (2) RAID5 volumes (one with six disks, one with five disks). We felt this setup might give us better performance than one large (11) disk RAID5 volume or a RAID6 volume. Since we are using a MegaRAID controller (LSI Logic), we were able to use the TUI to provision our back-end storage.
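
We used the TUI for this step, but for reference, the equivalent MegaCLI commands would look something like the following (the enclosure ID of 18 is taken from our hot spare command below; the slot numbers are assumptions based on our drive layout):

MegaCli64 -cfgldadd -r5[18:0,18:1,18:2,18:3,18:4,18:5] WB RA Cached -a0
MegaCli64 -cfgldadd -r5[18:6,18:7,18:8,18:9,18:10] WB RA Cached -a0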



After we created our two RAID groups on each host, we needed to set up a global hot spare drive. The TUI in ESOS does not support this feature, so we had to use the shell (Interface -> Exit to Shell):

MegaCli64 -pdhsp -set -physdrv[18:11] -a0


Back-End Storage Performance Testing
Before continuing our setup, we thought it would be fun to run a couple of quick performance tests on the back-end storage. For these tests, we used the (6) disk RAID5 volume and the ‘fio’ tool included with ESOS.

In this test, we are doing sequential reads with 4 MB blocks for 60 seconds:


fio --bs=4M --direct=1 --rw=read --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [352.0M/0K/0K /s] [88 /0 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3069: Wed Apr 10 13:49:37 2013
  read : io=20728MB, bw=353222KB/s, iops=86 , runt= 60091msec
    slat (usec): min=199 , max=48229 , avg=11580.02, stdev=12383.05
    clat (msec): min=66 , max=2134 , avg=726.91, stdev=58.13
     lat (msec): min=92 , max=2134 , avg=738.49, stdev=56.86
    clat percentiles (msec):
     |  1.00th=[  635],  5.00th=[  676], 10.00th=[  693], 20.00th=[  709],
     | 30.00th=[  717], 40.00th=[  725], 50.00th=[  734], 60.00th=[  742],
     | 70.00th=[  742], 80.00th=[  750], 90.00th=[  766], 95.00th=[  775],
     | 99.00th=[  791], 99.50th=[  791], 99.90th=[  799], 99.95th=[ 2114],
     | 99.99th=[ 2147]
    bw (KB/s)  : min= 5885, max=414476, per=99.32%, avg=350819.86, stdev=35462.71
    lat (msec) : 100=0.06%, 250=0.25%, 500=0.41%, 750=77.29%, 1000=21.94%
    lat (msec) : >=2000=0.06%
  cpu          : usr=0.02%, sys=2.02%, ctx=2440, majf=0, minf=65561
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.3%, 32=0.6%, >=64=98.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=5182/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=20728MB, aggrb=353222KB/s, minb=353222KB/s, maxb=353222KB/s, mint=60091msec, maxt=60091msec
--snip--

In this test, we are doing sequential writes with 4 MB blocks for 60 seconds:


fio --bs=4M --direct=1 --rw=write --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/740.0M/0K /s] [0 /185 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3072: Wed Apr 10 14:12:14 2013
  write: io=44996MB, bw=767254KB/s, iops=187 , runt= 60053msec
    slat (usec): min=347 , max=40575 , avg=5330.42, stdev=4645.18
    clat (msec): min=51 , max=395 , avg=336.13, stdev=34.02
     lat (msec): min=52 , max=397 , avg=341.46, stdev=34.21
    clat percentiles (msec):
     |  1.00th=[   74],  5.00th=[  318], 10.00th=[  322], 20.00th=[  330],
     | 30.00th=[  334], 40.00th=[  338], 50.00th=[  338], 60.00th=[  343],
     | 70.00th=[  347], 80.00th=[  351], 90.00th=[  359], 95.00th=[  363],
     | 99.00th=[  371], 99.50th=[  379], 99.90th=[  388], 99.95th=[  392],
     | 99.99th=[  396]
    bw (KB/s)  : min=692166, max=1378932, per=99.60%, avg=764193.58, stdev=62862.74
    lat (msec) : 100=1.16%, 250=0.55%, 500=98.28%
  cpu          : usr=17.26%, sys=3.64%, ctx=5367, majf=0, minf=25
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=11249/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=44996MB, aggrb=767253KB/s, minb=767253KB/s, maxb=767253KB/s, mint=60053msec, maxt=60053msec
--snip--

In this test, we are doing random reads with 4 KB blocks for 60 seconds:


fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [3128K/0K/0K /s] [782 /0 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3075: Wed Apr 10 14:14:06 2013
  read : io=191372KB, bw=3181.2KB/s, iops=795 , runt= 60158msec
    slat (usec): min=3 , max=49 , avg= 9.80, stdev= 3.59
    clat (usec): min=90 , max=1504.8K, avg=80370.26, stdev=83326.84
     lat (usec): min=107 , max=1504.8K, avg=80380.45, stdev=83326.84
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    9], 10.00th=[   12], 20.00th=[   20],
     | 30.00th=[   29], 40.00th=[   41], 50.00th=[   55], 60.00th=[   72],
     | 70.00th=[   94], 80.00th=[  126], 90.00th=[  182], 95.00th=[  241],
     | 99.00th=[  396], 99.50th=[  469], 99.90th=[  652], 99.95th=[  750],
     | 99.99th=[  979]
    bw (KB/s)  : min= 2221, max= 3368, per=99.98%, avg=3180.32, stdev=119.15
    lat (usec) : 100=0.01%, 250=0.04%, 500=0.01%, 750=0.01%
    lat (msec) : 2=0.01%, 4=0.11%, 10=6.66%, 20=14.21%, 50=26.38%
    lat (msec) : 100=24.76%, 250=23.31%, 500=4.13%, 750=0.33%, 1000=0.05%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.67%, sys=1.12%, ctx=46578, majf=0, minf=87
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=47843/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=191372KB, aggrb=3181KB/s, minb=3181KB/s, maxb=3181KB/s, mint=60158msec, maxt=60158msec
--snip--

In this test, we are doing random writes with 4 KB blocks for 60 seconds:


fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/1388K/0K /s] [0 /347 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3078: Wed Apr 10 14:15:29 2013
  write: io=118572KB, bw=1969.2KB/s, iops=492 , runt= 60216msec
    slat (usec): min=3 , max=40 , avg= 8.88, stdev= 4.42
    clat (usec): min=392 , max=614423 , avg=129850.00, stdev=95972.42
     lat (usec): min=403 , max=614434 , avg=129859.28, stdev=95974.97
    clat percentiles (usec):
     |  1.00th=[  772],  5.00th=[  828], 10.00th=[  868], 20.00th=[ 1012],
     | 30.00th=[ 1112], 40.00th=[162816], 50.00th=[177152], 60.00th=[185344],
     | 70.00th=[193536], 80.00th=[201728], 90.00th=[214016], 95.00th=[226304],
     | 99.00th=[288768], 99.50th=[350208], 99.90th=[585728], 99.95th=[593920],
     | 99.99th=[610304]
    bw (KB/s)  : min=  691, max=78858, per=100.00%, avg=1976.98, stdev=7108.54
    lat (usec) : 500=0.06%, 750=0.38%, 1000=18.95%
    lat (msec) : 2=13.32%, 4=0.28%, 50=0.07%, 100=0.17%, 250=64.89%
    lat (msec) : 500=1.58%, 750=0.29%
  cpu          : usr=0.38%, sys=0.58%, ctx=19973, majf=0, minf=24
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=29643/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=118572KB, aggrb=1969KB/s, minb=1969KB/s, maxb=1969KB/s, mint=60216msec, maxt=60216msec
--snip--

So, these numbers seem pretty much on par with what we expected for this type of disk: 352 MB/s for sequential reads, 740 MB/s for sequential writes, 782 IOPS (4 KB) for random reads, and 347 IOPS (4 KB) for random writes. During these tests, background logical drive / disk initialization was still taking place, so our numbers may be a bit better once that completes. The sequential write and even the read throughput are quite nice... we’re guessing this is thanks to the controller’s on-board SSD volume (CacheCade) and/or the 1 GB of controller cache.


DRBD Configuration
Now, we move on to configuring DRBD. In our setup, we will have (2) DRBD resources (volumes) in dual-primary mode, with LVM running on top of each (an LVM volume group for each). For the DRBD syncer rate, the rule of thumb we read is that the max rate should be about 30% of the bandwidth of your slowest link (I/O subsystem or replication link); we settled on 75 MB/s to start with. First, we set our global/common DRBD configuration on each host; we modified the ‘/etc/drbd.d/global_common.conf’ file to look like this on both hosts:

# 20130410 MAS

global {
        usage-count no;
}

common {
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        startup {
                degr-wfc-timeout 120;
                outdated-wfc-timeout 2;
        }
        options {
                on-no-data-accessible io-error;
        }
        disk {
                on-io-error detach;
                disk-barrier no;
                disk-flushes no;
                fencing resource-only;
                al-extents 3389;
                c-plan-ahead 0;
                resync-rate 75M;
        }
        net {
                protocol C;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                rr-conflict disconnect;
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 512k;
        }
}

Next, we created our DRBD resource configuration files; instead of using the generic “/dev/sdX” block device nodes for the backing storage, we used the unique SCSI disk identifiers populated in the "/dev/disk-by-id" directory. We created both of these files (exactly the same) on both ESOS storage server nodes.

/etc/drbd.d/r0.res:

# 20130410 MAS

resource r0 {
        net {
                allow-two-primaries;
        }
        on cantaloupe.mcc.edu {
                device     /dev/drbd0;
                disk       /dev/disk-by-id/LUN_NAA-600605b0054a753018f855fa236d6d41;
                address    192.168.50.21:7788;
                meta-disk  internal;
        }
        on raisin.mcc.edu {
                device    /dev/drbd0;
                disk      /dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a;
                address   192.168.50.22:7788;
                meta-disk internal;
        }
}

/etc/drbd.d/r1.res:

# 20130410 MAS

resource r1 {
        net {
                allow-two-primaries;
        }
        on cantaloupe.mcc.edu {
                device     /dev/drbd1;
                disk       /dev/disk-by-id/LUN_NAA-600605b0054a753018f8565a29255421;
                address    192.168.50.21:7789;
                meta-disk  internal;
        }
        on raisin.mcc.edu {
                device    /dev/drbd1;
                disk      /dev/disk-by-id/LUN_NAA-600605b0054a751018f856f319dfd5f7;
                address   192.168.50.22:7789;
                meta-disk internal;
        }
}

Now we are ready to set up the DRBD resources. On both nodes, run the following commands:

drbdadm create-md r0
drbdadm up r0
drbdadm create-md r1
drbdadm up r1

Now, on only one of the hosts (it doesn't really matter which, since this is all fresh), run this:

drbdadm primary --force r0
drbdadm primary --force r1

The above commands make the DRBD resources primary on that host and start the full synchronization to the other host. On the other host (“Secondary”), you can run the following to make the resources primary there as well:

drbdadm primary r0
drbdadm primary r1
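
You can watch the progress of the initial full synchronization on either node via the DRBD proc interface:

watch -n1 cat /proc/drbd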


LVM Configuration
Next, we need to get Logical Volume Manager (LVM) set up. For our configuration, we have (2) DRBD resources, and on top of these we will create (2) LVM physical volumes (PV) and (2) LVM volume groups (VG). We already set up our LVM device filter in the configuration file earlier; this way we don’t get complaints from LVM about finding duplicates, since it will only match “/dev/drbdX” block devices. On just one of the hosts, we ran the following:

pvcreate /dev/drbd0
pvcreate /dev/drbd1
vgcreate -c y r0 /dev/drbd0
vgcreate -c y r1 /dev/drbd1

We can now check that our (2) new LVM volume groups are available (on both hosts):

vgdisplay


More Cluster Configuration
Now we are ready to finish configuring the cluster stack; we have our DRBD resources configured and our LVM volume groups set up. Let’s start by disabling STONITH (we will re-enable it at the end):

crm configure property stonith-enabled="false"

We broke each chunk of the cluster configuration out into a separate step so we can explain each piece as we go. The first chunk we added was for the DRBD resources:

crm
cib new drbd
configure primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
configure primitive p_drbd_r1 ocf:linbit:drbd \
        params drbd_resource="r1" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
configure group g_drbd p_drbd_r0 p_drbd_r1
configure ms ms_drbd g_drbd \
        meta master-max="2" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" interleave="true"
cib commit drbd
quit

In the step above, we have two DRBD resources (r0, r1) that we configured previously, and we are setting two masters (two nodes, dual-primary mode). We used the advised/default resource agent parameters for ocf:linbit:drbd.

Next, we added the resource configuration for LVM2:

crm
cib new lvm
configure primitive p_lvm_r0 ocf:heartbeat:LVM \
        params volgrpname="r0" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
configure primitive p_lvm_r1 ocf:heartbeat:LVM \
        params volgrpname="r1" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
configure group g_lvm p_lvm_r0 p_lvm_r1
configure clone clone_lvm g_lvm \
        meta interleave="true" target-role="Started"
cib commit lvm
quit

For LVM, we have two DRBD resources (r0, r1) that we are running LVM on top of. As mentioned earlier, the clvmd service is used in conjunction with this type of setup. This could have been done other ways, but we felt it was simplest to use LVM on top of a couple of large DRBD resources instead of trying to set up a DRBD resource for each individual volume we wanted to share on our SAN. The cluster configuration for these resources was pretty straightforward: a primitive for each volume group (r0, r1) and then a clone statement so they are started on both of our nodes.

Next we added the SCST configuration. In this setup, only one of the two nodes will be a “Master” for the SCST resource (and one a “Slave”). Again, this is used with the ALUA setup in SCST, which provides our extra state for the resource (SCST is always started/running; only the ALUA information is updated). The parameters for this resource specify the SCST ALUA device group name, the “local” target group name, and the “remote” target group name. These are exactly what they sound like... the local target group contains the targets local to that node, and the remote group contains the other node’s targets. We added the SCST ALUA device group and target groups earlier in the article.

crm
cib new scst
configure primitive p_scst ocf:esos:scst \
        params alua="true" device_group="esos" \
        local_tgt_grp="local" remote_tgt_grp="remote" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="60"
configure ms ms_scst p_scst \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" interleave="true"
cib commit scst
quit

In the step above, the SCST RA is configured with only one master, and we don’t care which one it is since LVM and DRBD are both running active/active on the cluster.

Next, we added the order and colocation rules. At this point, resources had been trying to start, promote, etc. as we added them, and some may have failed since we didn't add the constraints as we went; in our case this didn't matter much, since this is a new cluster (not an existing one) and nothing is connected to it yet. Here are the constraints we used:

crm
cib new constraints
colocation c_r0_r1 inf: ms_scst:Started clone_lvm:Started ms_drbd:Master
order o_r0_r1 inf: ms_drbd:promote clone_lvm:start ms_scst:start
cib commit constraints
quit

Above you can see the colocation and order rules we added... we want DRBD to be promoted to master first, then LVM can start, and then SCST can start. This was the last main cluster configuration step.

When the cluster attempts to start the LVM resources, they will fail since at this point there are no logical volumes (LV) in the volume groups. So, we went ahead and created one in each:

lvcreate -L 4T -n big_vmfs_1 r0
lvcreate -L 4T -n big_vmfs_2 r1

We used the ‘crm resource cleanup’ command to clear all of the failed / timed-out resources, and everything started as expected:

--snip--
Last updated: Thu Apr 11 11:54:41 2013
Last change: Thu Apr 11 11:49:44 2013 via cibadmin on raisin.mcc.edu
Stack: corosync
Current DC: cantaloupe.mcc.edu (1) - partition with quorum
Version: 1.1.8-1f8858c
2 Nodes configured, unknown expected votes
10 Resources configured.


Online: [ cantaloupe.mcc.edu raisin.mcc.edu ]

 Master/Slave Set: ms_scst [p_scst]
     Masters: [ cantaloupe.mcc.edu ]
     Slaves: [ raisin.mcc.edu ]
 Clone Set: clone_lvm [g_lvm]
     Started: [ cantaloupe.mcc.edu raisin.mcc.edu ]
 Master/Slave Set: ms_drbd [g_drbd]
     Masters: [ cantaloupe.mcc.edu raisin.mcc.edu ]
--snip--




Provisioning Storage
Now that the cluster is configured, we moved on to provisioning our storage. First, we zoned all of our initiators with each target (on each switch). After zoning everything on our Fibre Channel switches, we used the ‘fcc.sh’ tool in the ESOS shell to get a list of the visible FC initiators, which made it very easy to copy/paste the initiator names into the TUI. Then, using the TUI, we created a host group for each server (Hosts -> Add Group) and added each server’s initiator to its group (Hosts -> Add Initiator).

Next we created a 50 GB boot volume for each of our (4) ESXi hosts; we used the CLI to do this (LVM logical volumes):

lvcreate -L 50G -n boot_mulberry r0
lvcreate -L 50G -n boot_lime r0
lvcreate -L 50G -n boot_banana r0
lvcreate -L 50G -n boot_keylime r0

Then, after we created the (4) ESXi boot volumes above, on each ESOS storage server we used the TUI to add an SCST device for each (vdisk_blockio) and then mapped each device as LUN 0 to its corresponding host group (Devices -> Map to Group). For each SCST device we created using the vdisk_blockio mode, we made sure to set “Write Through” to Yes/1 and “NV Cache” to No/0, since we are using DRBD in dual-primary mode and would most definitely like to avoid data divergence!
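
We did all of this in the TUI, but for reference, the equivalent shell commands for one of the boot volumes would look something like this (the LV path under the “r0” volume group follows from the lvcreate commands above; the host group name “mulberry” is an assumption, and the target WWN shown is cantaloupe’s from the ALUA setup):

scstadmin -open_dev boot_mulberry -handler vdisk_blockio \
-attributes filename=/dev/r0/boot_mulberry,write_through=1,nv_cache=0
scstadmin -add_lun 0 -driver qla2x00t -target 50:01:43:80:21:df:9b:4c \
-group mulberry -device boot_mulberry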



For each SCST device we create, we need to run commands like the following on both hosts to add the device to our SCST implicit ALUA configuration:

scstadmin -add_dgrp_dev boot_mulberry -dev_group esos
scstadmin -add_dgrp_dev boot_lime -dev_group esos
scstadmin -add_dgrp_dev boot_banana -dev_group esos
scstadmin -add_dgrp_dev boot_keylime -dev_group esos
scstadmin -add_dgrp_dev big_vmfs_1 -dev_group esos
scstadmin -add_dgrp_dev big_vmfs_2 -dev_group esos


Final Cluster Setup
Now that our cluster is set up, some storage is provisioned, and everything is working, we can add fencing mechanisms into our configuration and re-enable STONITH:

crm
cib new stonith
configure primitive fence_cantaloupe stonith:fence_ipmilan \
params pcmk_host_list="cantaloupe.mcc.edu" ipaddr="172.16.6.21" \
login="user" passwd="password" lanplus="true" \
op monitor interval="60"
configure primitive fence_raisin stonith:fence_ipmilan \
params pcmk_host_list="raisin.mcc.edu" ipaddr="172.16.6.22" \
login="user" passwd="password" lanplus="true" \
op monitor interval="60"
cib commit stonith
quit

crm configure property stonith-enabled="true"

Finally, we tested the fencing mechanism on each node (one at a time) to ensure it works:

crm node fence NODE_NAME

After we were sure everything was tested and working as it should be, we enabled a cluster-status-change email mechanism. The crm_mon utility supports an external agent; we used the ocf:pacemaker:ClusterMon resource agent and the crm_mon_email.sh script that ESOS includes to send simple/basic emails when anything about the cluster changes. This is not something you want enabled while testing, as it sends an individual email for each cluster status change, so you can rack up a fair number of emails from something as simple as a node rebooting. We configured our ClusterMon RA like this:

crm
cib new clustermon
configure primitive p_notify ocf:pacemaker:ClusterMon \
params user="root" update="30" \
extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
op monitor on-fail="restart" interval="10"
configure clone clone_notify p_notify \
meta target-role="Started"
cib commit clustermon
quit

There, that’s it! Our ESOS disk array cluster is fully functional and tested. Here is our final cluster configuration (`crm configure show`), just for reference:

node $id="1" cantaloupe.mcc.edu
node $id="2" raisin.mcc.edu
primitive fence_cantaloupe stonith:fence_ipmilan \
        params pcmk_host_list="cantaloupe.mcc.edu" ipaddr="172.16.6.21" login="user" passwd="password" lanplus="true" \
        op monitor interval="60"
primitive fence_raisin stonith:fence_ipmilan \
        params pcmk_host_list="raisin.mcc.edu" ipaddr="172.16.6.22" login="user" passwd="password" lanplus="true" \
        op monitor interval="60"
primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
primitive p_drbd_r1 ocf:linbit:drbd \
        params drbd_resource="r1" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
primitive p_lvm_r0 ocf:heartbeat:LVM \
        params volgrpname="r0" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive p_lvm_r1 ocf:heartbeat:LVM \
        params volgrpname="r1" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive p_notify ocf:pacemaker:ClusterMon \
        params user="root" update="30" extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
        op monitor on-fail="restart" interval="10"
primitive p_scst ocf:esos:scst \
        params alua="true" device_group="esos" local_tgt_grp="local" remote_tgt_grp="remote" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="60"
group g_drbd p_drbd_r0 p_drbd_r1
group g_lvm p_lvm_r0 p_lvm_r1
ms ms_drbd g_drbd \
        meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true"
ms ms_scst p_scst \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true"
clone clone_lvm g_lvm \
        meta interleave="true" target-role="Started"
clone clone_notify p_notify \
        meta target-role="Started"
colocation c_r0_r1 inf: ms_scst:Started clone_lvm:Started ms_drbd:Master
order o_r0_r1 inf: ms_drbd:promote clone_lvm:start ms_scst:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.8-1f8858c" \
        cluster-infrastructure="corosync" \
        stonith-enabled="true" \
        last-lrm-refresh="1365772801"

This disk array is currently only being used in a VMware vSphere (ESXi) environment. VMware ESXi supports implicit ALUA, and you can check the pathing in vSphere Client by going to the Configuration tab for a host, then clicking Storage, then Properties for a datastore, and finally Manage Paths. We used the “Most Recently Used” path selection policy and checked that each datastore selected the correct path for I/O. We also noticed that when using ALUA with SCST in ESOS, the storage array type shows “VMW_SATP_ALUA”; for a non-ALUA SCST/ESOS configuration, it usually shows “VMW_SATP_DEFAULT_AA”.

One other thing we typically do with VMware ESXi initiators when using them with ESOS/SCST is disable support for vStorage APIs for Array Integration (VAAI). It’s not currently supported on the disk arrays, and it just seems to pollute the logs since the VAAI SCSI commands fail (not supported). In vSphere Client, for each host, go to the Configuration tab, then Advanced Settings, and set the following to ‘0’ (or use the esxcli commands shown after this list):
  • /VMFS3/HardwareAcceleratedLocking
  • /DataMover/HardwareAcceleratedMove
  • /DataMover/HardwareAcceleratedInit
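
If you prefer the command line, the same settings can be applied from the ESXi shell with esxcli (ESXi 5.x syntax):

esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0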

This concludes my article on building and using a Fibre Channel disk array based on Enterprise Storage OS (ESOS). This unit has been in production for less than a week now, and I will follow up after some time with our experiences using the disk array. Please leave any comments/questions; I hope others find this useful!