Monday, July 14, 2014

Open Storage: Dual-Controller OSS Disk Array

Introduction
The LSI Syncro CS controllers are here, and they are what the open storage community has been longing for. If you work in enterprise IT and you're not familiar with open storage, you're missing out; here is a nice article by Aaron Newcomb describing open storage: https://blogs.oracle.com/openstorage/entry/fishworks_virtualbox_tutorial

The setup described in this article is a POC for our institution, one that will lead us to replacing all (80+ TB) of our commercial/proprietary disk arrays that sit on our Fibre Channel (FC) Storage Area Network (SAN) with ESOS-based storage arrays.

With this dual-controller ESOS storage setup, we are also testing a new SAN "medium": Fibre Channel over Ethernet (FCoE). A converged network is really where it's at -- and for us, 10 GbE, or even 40 GbE, provides plenty of bandwidth for sharing. We're quite excited about this new SAN technology and hope to replace all of our traditional Fibre Channel switches with it some day.


The Setup
We have (2) 40 GbE switches in this test SAN environment, and we're using 10 GbE CNAs for our targets and initiators. We connected one port on each server (the targets, our ESOS storage servers, and the initiators, our ESXi hosts) to each switch, so each device is connected to both fabrics. Each server has an iDRAC, and we connected those to our management network. Each server also has (2) Ethernet NICs: we connected one on each server to our primary server network, then used a short 1’ cable to connect the two systems directly together, creating a private network between them. We’ll use two Corosync rings on these interfaces.

Both of the servers came with 8 GB of RAM each, and since we use vdisk_blockio mode for the SCST devices, we’re going to use very little physical RAM. We enabled ‘memory mirroring’ mode for redundancy/protection, which gives us 4 GB of usable physical RAM on each server -- more than enough.



Two (2) Dell PowerEdge R420 (1U) servers:
  • (2) x Intel Xeon E5-2430 2.20GHz, 15M Cache, 7.2GT/s QPI, Turbo, 6C, 95W
  • (4) x 2GB RDIMM, 1333 MT/s, Low Volt, Single Rank, x8 Data Width
  • (1) x Emulex OCe14102-UM Dual-channel, 10GBASE-SR SFP+ CNA Adapter
  • (1) x Dual Hot Plug Power Supplies 550W
Syncro shared DAS storage:
  • (1) x LSI Syncro CS 9286-8e (includes two Syncro CS 9271-8i HA controllers with CacheVault)
SAS enclosure (JBOD):
  • (1) x DataON Storage DNS-1600D (4U 24-bay Modular-SBB compliant)





Getting Started
We used the latest and greatest version of Enterprise Storage OS (ESOS) and installed it on both of our USB flash drives using a CentOS Linux system:
wget --no-check-certificate https://6f70a7c9edce5beb14bb23b042763f258934b7b9.googledrive.com/host/0B-MvNl-PpBFPbXplMmhwaElid0U/esos-0.1-r663.zip
unzip esos-0.1-r663.zip
cd esos-0.1-r663
./install.sh

When prompted by the ESOS installer, we added both the MegaCLI tool and the StorCLI tool.

Next, we booted both systems up and set the USB flash drive as the boot device. After each host loaded Enterprise Storage OS, we configured the network interface cards, host/domain names, DNS, date/time settings, set up mail (SMTP), and set the root password.

Then we checked that both Syncro CS controllers were on the latest firmware (they were).


The LSI Syncro Controllers
So, now that we have our two Enterprise Storage OS systems up and running, let’s take a look at the new Syncro CS controller locally. First, a note on MegaCLI, StorCLI, and making volumes using the TUI in ESOS. I haven’t read this first-hand, but it seems like StorCLI might be the new successor to MegaCLI. It appears you can use either MegaCLI or StorCLI interchangeably with the Syncro CS controller; however, it looks like only StorCLI can create a VD that is “exclusive” (not shared between both nodes). When creating VDs with MegaCLI, it’s always a shared VD. The TUI in ESOS makes use of the MegaCLI tool, so that works with this controller; however, it currently only supports basic VD creation/modification (no spanned VDs, no CacheCade stuff, etc.).

We used the StorCLI tool on the CLI to create a test virtual/logical drive on the Syncro:
storcli64 /c0 add vd r10 drives=8:1,8:2,8:3,8:4 wt nora pdperarray=2

The interesting thing to note is that the volume created above is now “owned” by the controller it was created on. Try this command on both nodes (showing that the volume is also visible/usable via MegaCLI):
MegaCli64 -ldinfo -lall -a0

If you run that on the node you created the volume on, the volume will be listed; if you run it on the other node, it’s not. However, the volume is most definitely accessible and usable in the OS by both nodes:
[root@blackberry ~]# sg_inq /dev/sda
standard INQUIRY:
 PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
 [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
 SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
 EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
 [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
 [SPI: Clocking=0x0  QAS=0  IUS=0]
   length=96 (0x60)   Peripheral device type: disk
Vendor identification: LSI     
Product identification: MR9286-8eHA     
Product revision level: 3.33
Unit serial number: 00a239ac8833a7ac1ad04dae06b00506
[root@gooseberry ~]# sg_inq /dev/sda
standard INQUIRY:
 PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
 [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
 SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
 EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
 [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
 [SPI: Clocking=0x0  QAS=0  IUS=0]
   length=96 (0x60)   Peripheral device type: disk
Vendor identification: LSI     
Product identification: MR9286-8eHA     
Product revision level: 3.33
Unit serial number: 00a239ac8833a7ac1ad04dae06b00506
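
As another quick check, you can compare what each controller reports from the StorCLI side (we installed StorCLI alongside MegaCLI earlier); we’d expect the owner/non-owner difference to show up here as well:
storcli64 /c0/vall show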

Let’s check out the performance (locally) of the new controllers using the fio tool. For this volume (created above) we’re using STEC s842 200GB SAS SSDs. First, read performance: 100% random read, 4 KB IO size, 10 GB of data:
[root@gooseberry ~]# fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [435.3M/0K/0K /s] [111K/0 /0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=2753: Fri Mar  7 11:04:19 2014
 read : io=10240MB, bw=576077KB/s, iops=144019 , runt= 18202msec
   slat (usec): min=3 , max=260 , avg= 5.29, stdev= 1.99
   clat (usec): min=31 , max=1491 , avg=437.85, stdev=143.95
    lat (usec): min=35 , max=1495 , avg=443.28, stdev=144.88
   clat percentiles (usec):
    |  1.00th=[  199],  5.00th=[  251], 10.00th=[  282], 20.00th=[  322],
    | 30.00th=[  354], 40.00th=[  382], 50.00th=[  410], 60.00th=[  438],
    | 70.00th=[  474], 80.00th=[  532], 90.00th=[  684], 95.00th=[  756],
    | 99.00th=[  812], 99.50th=[  844], 99.90th=[  908], 99.95th=[  940],
    | 99.99th=[ 1020]
   bw (KB/s)  : min=331552, max=643216, per=100.00%, avg=578813.56, stdev=85260.02
   lat (usec) : 50=0.01%, 100=0.01%, 250=4.86%, 500=70.15%, 750=19.65%
   lat (usec) : 1000=5.32%
   lat (msec) : 2=0.01%
 cpu          : usr=21.20%, sys=78.21%, ctx=9451, majf=0, minf=89
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=2621440/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  READ: io=10240MB, aggrb=576077KB/s, minb=576077KB/s, maxb=576077KB/s, mint=18202msec, maxt=18202msec

Disk stats (read/write):
 sda: ios=2615354/0, merge=0/0, ticks=751429/0, in_queue=755922, util=99.59%

Looks like we’re getting right around 144K IOPS -- not too shabby. Now let’s check out writes with 100% random write, 4 KB IO size, 10 GB of data:
[root@gooseberry ~]# fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/306.6M/0K /s] [0 /78.5K/0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=2759: Fri Mar  7 11:10:30 2014
 write: io=10240MB, bw=309707KB/s, iops=77426 , runt= 33857msec
   slat (usec): min=6 , max=278 , avg= 9.51, stdev= 2.94
   clat (usec): min=44 , max=22870 , avg=814.95, stdev=391.06
    lat (usec): min=60 , max=22881 , avg=824.70, stdev=391.04
   clat percentiles (usec):
    |  1.00th=[  111],  5.00th=[  231], 10.00th=[  370], 20.00th=[  652],
    | 30.00th=[  732], 40.00th=[  740], 50.00th=[  748], 60.00th=[  756],
    | 70.00th=[  772], 80.00th=[ 1004], 90.00th=[ 1384], 95.00th=[ 1624],
    | 99.00th=[ 1976], 99.50th=[ 2096], 99.90th=[ 2288], 99.95th=[ 2320],
    | 99.99th=[ 2480]
   bw (KB/s)  : min=281264, max=328936, per=99.98%, avg=309659.58, stdev=11008.89
   lat (usec) : 50=0.01%, 100=0.70%, 250=4.94%, 500=9.00%, 750=37.26%
   lat (usec) : 1000=28.02%
   lat (msec) : 2=19.21%, 4=0.87%, 50=0.01%
 cpu          : usr=22.62%, sys=73.73%, ctx=24146, majf=0, minf=25
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
 WRITE: io=10240MB, aggrb=309707KB/s, minb=309707KB/s, maxb=309707KB/s, mint=33857msec, maxt=33857msec

Disk stats (read/write):
 sda: ios=0/2604570, merge=0/0, ticks=0/939581, in_queue=941070, util=99.86%

And for writes it looks like we’re getting about 77K IOPS (4 KB). Now let’s see what the performance numbers are like on the other node (the non-owner node/controller); we’ll run the same tests as above:
[root@blackberry ~]# fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [180.8M/0K/0K /s] [46.3K/0 /0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=9445: Fri Mar  7 11:17:45 2014
 read : io=10240MB, bw=173064KB/s, iops=43265 , runt= 60589msec
   slat (usec): min=5 , max=189 , avg= 8.40, stdev= 5.51
   clat (usec): min=829 , max=18076 , avg=1468.25, stdev=90.07
    lat (usec): min=837 , max=18084 , avg=1476.89, stdev=90.20
   clat percentiles (usec):
    |  1.00th=[ 1320],  5.00th=[ 1400], 10.00th=[ 1432], 20.00th=[ 1448],
    | 30.00th=[ 1464], 40.00th=[ 1464], 50.00th=[ 1480], 60.00th=[ 1480],
    | 70.00th=[ 1480], 80.00th=[ 1496], 90.00th=[ 1496], 95.00th=[ 1512],
    | 99.00th=[ 1576], 99.50th=[ 1624], 99.90th=[ 1672], 99.95th=[ 1704],
    | 99.99th=[ 2096]
   bw (KB/s)  : min=168144, max=191424, per=99.99%, avg=173050.38, stdev=2949.87
   lat (usec) : 1000=0.01%
   lat (msec) : 2=99.98%, 4=0.02%, 20=0.01%
 cpu          : usr=12.14%, sys=39.81%, ctx=204963, majf=0, minf=89
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=2621440/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  READ: io=10240MB, aggrb=173063KB/s, minb=173063KB/s, maxb=173063KB/s, mint=60589msec, maxt=60589msec

Disk stats (read/write):
 sda: ios=2613079/0, merge=0/0, ticks=3627011/0, in_queue=3626141, util=99.86%
[root@blackberry ~]# fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/137.8M/0K /s] [0 /35.3K/0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=9449: Fri Mar  7 11:19:35 2014
 write: io=10240MB, bw=132518KB/s, iops=33129 , runt= 79127msec
   slat (usec): min=3 , max=203 , avg=10.97, stdev= 6.88
   clat (usec): min=289 , max=18464 , avg=1917.83, stdev=318.12
    lat (usec): min=299 , max=18472 , avg=1929.07, stdev=317.61
   clat percentiles (usec):
    |  1.00th=[ 1240],  5.00th=[ 1512], 10.00th=[ 1656], 20.00th=[ 1784],
    | 30.00th=[ 1832], 40.00th=[ 1864], 50.00th=[ 1912], 60.00th=[ 1928],
    | 70.00th=[ 1944], 80.00th=[ 1976], 90.00th=[ 2064], 95.00th=[ 2640],
    | 99.00th=[ 3120], 99.50th=[ 3248], 99.90th=[ 3440], 99.95th=[ 3504],
    | 99.99th=[ 3696]
   bw (KB/s)  : min=127176, max=142088, per=100.00%, avg=132523.65, stdev=1832.69
   lat (usec) : 500=0.01%, 750=0.04%, 1000=0.21%
   lat (msec) : 2=87.54%, 4=12.21%, 20=0.01%
 cpu          : usr=10.16%, sys=40.90%, ctx=197327, majf=0, minf=25
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
 WRITE: io=10240MB, aggrb=132518KB/s, minb=132518KB/s, maxb=132518KB/s, mint=79127msec, maxt=79127msec

Disk stats (read/write):
 sda: ios=0/2614355, merge=0/0, ticks=0/4791266, in_queue=4790636, util=99.91%

Whoa, so there is definitely a difference in performance when accessing a volume from the non-owner node. It turns out that I/O sent through the non-owner node goes through a process called I/O shipping, which increases response time: the non-owner node must communicate with the volume’s owner before data can be read or written, thus reducing the IOPS. This will be important to know for our setup below.

So, we learned that accessing the shared virtual drives from each controller is not equal: the owning controller/node has the best performance. Now, if you reboot (turn off, kill, fail, whatever) the owner node, ownership is transferred to the surviving node, and then you see the good performance numbers on that controller. When the down node comes back up, the VD ownership is NOT transferred back (it does not “fail back”). This is important: if you wanted to try an “active-active” setup (not true active-active, but divvying up the VDs across both controllers), this won’t work. You could set it up that way, but after a failover or reboot the ownership will all be lopsided. I was curious if we could manually control volume ownership without a reboot (e.g., via the CLI), but I didn’t see anything in the documentation. I’ve asked LSI support and have not gotten an answer yet; I would think/hope that feature is coming in the future. If we could control VD/LD ownership “live” (inside the OS), then we could script this as part of our cluster setup (described below). Until (if ever) that feature is available, we’ll have to do an active-passive setup where all virtual drives are owned by a single controller and are transferred to the other node when an event occurs (failure, reboot, etc.).


SCSI PRs + LSI Syncro + SCST
Now let’s take a look at SCSI persistent reservations (PRs) and see how they work with the LSI Syncro CS controllers locally. We still have the volume we created above, so we’ll test with that. Let’s check for existing PR keys and register a new one:
[root@blackberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x0, there are NO registered reservation keys
[root@blackberry ~]# sg_persist --no-inquiry --out --register --param-sark=123abc /dev/sda
[root@blackberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

Now let’s take a look from the other node:
[root@gooseberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

So we have confirmed we can read the key on both nodes; now let’s try reserving the device:
[root@blackberry ~]# sg_persist --no-inquiry --out --reserve --param-rk=123abc --prout-type=1 /dev/sda
[root@blackberry ~]# sg_persist --no-inquiry --read-reservation /dev/sda
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

And we can see it on the other node:
[root@gooseberry ~]# sg_persist --no-inquiry --read-reservation /dev/sda
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

Looks like we’re good on SCSI persistent reservations locally, which we expected. Now let’s test SCSI PRs when combined with SCST. First we create a vdisk_blockio SCST device using our SSD volume from above as the back-end storage, then map it to a LUN for each target (each fabric) so it’s visible to our Linux initiator test system.
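
Roughly, the scstadmin commands for this look like the following (just a sketch; the device name and back-end node are from our setup, the LUN number is illustrative, and the target WWN is one of our ports from the ALUA configuration shown later):
scstadmin -open_dev blackberry_test -handler vdisk_blockio -attributes filename=/dev/sda
scstadmin -add_lun 0 -driver ocs_scst -target 10000000C9E667E9 -device blackberry_test

Verify we can see the volumes on the initiator (not using multipath-tools since we want to easily see distinct devices for each target/node with this test):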
raspberry ~ # lsscsi
[1:0:0:0]    cd/dvd  TEAC     DVD-ROM DV28SV   D.0J  /dev/sr0
[2:0:0:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:0:1:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:1:0:0]    disk    Dell     VIRTUAL DISK     1028  /dev/sda
[4:0:0:0]    disk    SCST_BIO blackberry_test   300  /dev/sdb
[5:0:2:0]    disk    SCST_BIO gooseberry_test   300  /dev/sdc

Check there are no existing PR keys visible (on either node):
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
 PR generation=0x0, there are NO registered reservation keys
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
 PR generation=0x0, there are NO registered reservation keys

Lets start by making a SCSI PR key and reservation on one of the systems:
raspberry ~ # sg_persist --no-inquiry --out --register --param-sark=123abc /dev/sdb
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

raspberry ~ # sg_persist --no-inquiry --out --reserve --param-rk=123abc --prout-type=1 /dev/sdb
raspberry ~ # sg_persist --no-inquiry --read-reservation /dev/sdb
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

So, we see the key registered and the reservation active on that path/node (“blackberry”); now let’s see if it’s visible on “gooseberry”:
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
 PR generation=0x0, there are NO registered reservation keys

raspberry ~ # sg_persist --no-inquiry --read-reservation /dev/sdc
 PR generation=0x0, there is NO reservation held

Nope, it’s not! This was expected as well. The vdisk_* device handlers in SCST emulate SCSI commands (e.g., SCSI persistent reservations), so these are handled in a software layer between the initiators and the back-end storage -- the SCSI commands aren’t passed directly to the Syncro CS controllers with these device handlers. Let’s try the same test but with the SCSI disk pass-through (dev_disk) handler.
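
For reference, a pass-through device is registered by its SCSI address (H:C:I:L) rather than a device node; roughly something like this sketch (the address matches the warning message below, and the target WWN and LUN are illustrative):
scstadmin -open_dev 0:2:0:0 -handler dev_disk
scstadmin -add_lun 0 -driver ocs_scst -target 10000000C9E667E9 -device 0:2:0:0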

Check that the SCSI device nodes show up on the initiator side:
raspberry ~ # lsscsi
[1:0:0:0]    cd/dvd  TEAC     DVD-ROM DV28SV   D.0J  /dev/sr0
[2:0:0:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:0:1:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:1:0:0]    disk    Dell     VIRTUAL DISK     1028  /dev/sda
[4:0:0:0]    disk    LSI      MR9286-8eHA      3.33  /dev/sdb
[5:0:2:0]    disk    LSI      MR9286-8eHA      3.33  /dev/sdc

Make sure there are no existing PR keys:
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
PR in: command not supported
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
PR in: command not supported

Snap! Something is wrong… we see this message on the ESOS target side:
[1902219.536846] scst: ***WARNING***: PR commands for pass-through devices not supported (device 0:2:0:0)

Looks like SCSI persistent reservations are not supported with SCST using pass-through devices, period. It would be interesting to find out if there is a technical reason that SCSI PRs aren’t supported with the pass-through handlers, or if it’s something that just hasn’t been implemented yet (in SCST). Either way, for us and our project, it doesn’t really matter -- we don’t need to support SCSI PRs (we use VMware ESXi 5.5). I’m pretty sure the Microsoft cluster stuff relies on persistent reservations, but again, we’re not going to be doing any of that with our setup.


Syncro CS Storage Setup
For this test setup, we’ll be testing with (2) Dell PowerEdge R720’s running ESXi 5.5, running a large pool of Windows 8.1 virtual desktops on these servers. For a VDI setup, we keep the replicas (for linked clones) on fast (SSD) storage, and we’ll use the 15K disks for the linked clone datastores. So with the 24 slots in the DataON JBOD, we’ll split up the storage like this:
  • (1) RAID10 volume consisting of (4) STEC s842 SSD’s
  • (2) CacheCade volumes consisting of (1) STEC s842 SSD each in RAID0 (one for each controller)
  • (1) STEC s842 hot spare
  • (2) RAID10 volumes consisting of (8) Hitachi 15K drives each
  • (1) Hitachi 15K hot spare



First we’ll create our SSD RAID10 volume (disable read and write cache for SSD volumes):
storcli64 /c0 add vd r10 drives=8:1,8:2,8:3,8:4 wt nora pdperarray=2

Now create both of the SAS 15K RAID10 volumes (read/write cache + CacheCade):
storcli64 /c0 add vd r10 drives=8:8,8:9,8:10,8:11,8:12,8:13,8:14,8:15 wb ra pdperarray=4
storcli64 /c0 add vd r10 drives=8:16,8:17,8:18,8:19,8:20,8:21,8:22,8:23 wb ra pdperarray=4

Add the CacheCade volumes (CacheCade VDs are like exclusive VDs, so we created one on each cluster node and don’t need to worry about which one is “active”):
storcli64 /c0 add vd cachecade type=raid0 drives=8:5 wt
storcli64 /c0 add vd cachecade type=raid0 drives=8:6 wt

One interesting point to note: when we attempted to assign VDs to a CacheCade volume during our testing (before writing this document), we got the following error from the Syncro CS:
[root@blackberry ~]# storcli64 /c0/v3 set ssdcaching=on
Controller = 0
Status = Failure
Description = None

Detailed Status :
===============

-----------------------------------------------------------------------------------
VD Property   Value Status ErrCd ErrMsg                                            
-----------------------------------------------------------------------------------
3 SSDCaching On    Failed  1001 Controller doesn't support manual SSC Association
-----------------------------------------------------------------------------------


We opened a support case with LSI, and were told to setup the CacheCade volume this way (associating the VDs at CacheCade VD creation time):
[root@blackberry ~]# storcli64 /c0 add vd cachecade type=raid0 drives=8:5 wt assignvds=2
Controller = 0
Status = Failure
Description = Controller doesn't support manual SSC Association


It fails with the same error; so we asked LSI again, and this time their solution was to use MSM or WebBIOS… we tried WebBIOS, and after a long, confusing journey, we didn’t get any farther configuring CacheCade with this method either. From what I’ve read, there is no “manual” association of VDs with a CacheCade volume… it’s all “automatic” (supposedly). We left this as is for now, with a CacheCade VD on each controller. We’ll revisit it at some point during our testing to confirm it is (or isn’t) working correctly.

The SSD hot spare (global):
storcli64 /c0/e8/s7 add hotsparedrive

Now add the 15K global hot spare drive:
storcli64 /c0/e8/s24 add hotsparedrive

Next we set meaningful names for each of the volumes:
storcli64 /c0/v0 set name=SSD_R10_1
storcli64 /c0/v2 set name=15K_R10_1
storcli64 /c0/v3 set name=15K_R10_2


Cluster Setup
We’ll start this by enabling all of the services needed and disabling anything we’re not going to use. Edit ‘/etc/rc.conf’ and set the following (on both nodes):
rc.openibd_enable=NO
rc.opensm_enable=NO
rc.sshd_enable=YES
rc.mdraid_enable=NO
rc.lvm2_enable=NO
rc.eio_enable=NO
rc.dmcache_enable=NO
rc.btier_enable=NO
rc.drbd_enable=NO
rc.corosync_enable=YES
rc.dlm_enable=NO
rc.clvmd_enable=NO
rc.pacemaker_enable=YES
rc.fsmount_enable=YES
rc.mhvtl_enable=NO
rc.scst_enable=NO
rc.perfagent_enable=NO
rc.nrpe_enable=NO
rc.snmpd_enable=NO
rc.snmptrapd_enable=NO
rc.nut_enable=NO
rc.smartd_enable=NO

The cluster will manage SCST, so we disable it above. We also disable other things we’re not going to use in this setup (md software RAID, LVM, etc.).

Next, generate the Corosync key on one system and scp it to the other (check permissions and make them match if needed):
corosync-keygen
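
By default, corosync-keygen writes the key to /etc/corosync/authkey; copying it to the other node looks something like this (the host name is from our setup):
scp /etc/corosync/authkey root@gooseberry:/etc/corosync/
ssh root@gooseberry "chmod 400 /etc/corosync/authkey"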

Now you can create/edit your corosync.conf file; we won’t go into all of the specifics of our configuration, since there is lots of documentation on Corosync out there. Here it is:
totem {
       version: 2
       cluster_name: esos_syncro
       crypto_cipher: aes256
       crypto_hash: sha1
       rrp_mode: passive
       interface {
               ringnumber: 0
               bindnetaddr: 10.35.6.0
               mcastaddr: 226.94.1.3
               mcastport: 5411
               ttl: 1
       }
       interface {
               ringnumber: 1
               bindnetaddr: 192.168.1.0
               mcastaddr: 226.94.1.4
               mcastport: 5413
               ttl: 1
       }
}

nodelist {
       node {
               ring0_addr: 10.35.6.11
               nodeid: 1
       }
       node {
               ring0_addr: 10.35.6.12
               nodeid: 2
       }
}

logging {
       fileline: off
       to_stderr: no
       to_syslog: yes
       syslog_facility: local2
       debug: off
       timestamp: off
       logger_subsys {
               subsys: QUORUM
               debug: off
       }
}

quorum {
       provider: corosync_votequorum
       two_node: 1
}

Now let’s start Corosync (on both nodes) and check the status of the rings:
/etc/rc.d/rc.corosync start
corosync-cfgtool -s

We can now start Pacemaker on both nodes and check it:
/etc/rc.d/rc.pacemaker start
crm configure show

I was initially planning on doing fencing using SCSI PRs, as the LSI manual describes (for a DAS / local application setup), but this may not be the best option as the SCSI fence would not change controller/volume ownership -- or would it... For this setup, we decided that we’re not going to use any fencing. We’re providing shared storage with this ESOS cluster, not running the application on the cluster itself. This was discussed a bit internally, and we ultimately decided fencing added complexity that we did not want and no apparent benefit for this solution (if anyone can tell us different, we'd happily listen). So, we can go ahead and disable STONITH for this cluster:
crm configure property stonith-enabled="false"

Now let’s set up the SCST ALUA configuration; we need to run both blocks of commands below, one on “host A” and the second on “host B”.

For “host A” (blackberry.mcc.edu):
scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=256
scstadmin -add_tgrp_tgt 10000000C9E667E9 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E667E9 -driver ocs_scst -attributes rel_tgt_id=1

scstadmin -add_tgrp_tgt 10000000C9E667ED -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E667ED -driver ocs_scst -attributes rel_tgt_id=2
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=257
scstadmin -add_tgrp_tgt 10000000C9E66A91 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E66A91 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=3

scstadmin -add_tgrp_tgt 10000000C9E66A95 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E66A95 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=4

For “host B” (gooseberry.mcc.edu):
scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=257
scstadmin -add_tgrp_tgt 10000000C9E66A91 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E66A91 -driver ocs_scst -attributes rel_tgt_id=3

scstadmin -add_tgrp_tgt 10000000C9E66A95 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E66A95 -driver ocs_scst -attributes rel_tgt_id=4
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=256
scstadmin -add_tgrp_tgt 10000000C9E667E9 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E667E9 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=1

scstadmin -add_tgrp_tgt 10000000C9E667ED -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E667ED -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=2

Next we rebooted both systems to ensure what’s enabled/disabled for services is set up correctly. After they both came back up, we worked on the rest of the cluster configuration.

This cluster is going to be extremely simple… no replication using DRBD, no logical volumes via LVM, so there really isn’t much to the configuration -- just SCST.

For SCST we’ll have a master/slave configuration… one host will always be the “master” and the other will be the “slave”. This makes it so that in the ALUA configuration, one target (host) is active/optimized and the other is non-optimized, which will (should) not be used by initiators.

We’ll add it in with this:
crm
cib new scst
configure primitive p_scst ocf:esos:scst \
params alua="true" device_group="esos" \
local_tgt_grp="local" remote_tgt_grp="remote" \
m_alua_state="active" s_alua_state="nonoptimized" \
op monitor interval="10" role="Master" \
op monitor interval="20" role="Slave" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="60"
configure ms ms_scst p_scst \
meta master-max="1" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true" interleave="true"
cib commit scst
quit

Next we moved on to creating our SCST devices; we have 3 block devices (RAID volumes) we’re going to present to the initiators, so we created the 3 SCST devices on both ESOS hosts with the same names and parameters.
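
For reference, creating those devices with scstadmin looks roughly like this (the /dev/sdX back-end nodes here are assumptions; verify which block device maps to which RAID volume on each host):
scstadmin -open_dev ssd_r10_1 -handler vdisk_blockio -attributes filename=/dev/sda
scstadmin -open_dev 15k_r10_1 -handler vdisk_blockio -attributes filename=/dev/sdb
scstadmin -open_dev 15k_r10_2 -handler vdisk_blockio -attributes filename=/dev/sdc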

Now we can add the SCST devices to the device groups; run this on both ESOS hosts:
scstadmin -add_dgrp_dev ssd_r10_1 -dev_group esos
scstadmin -add_dgrp_dev 15k_r10_1 -dev_group esos
scstadmin -add_dgrp_dev 15k_r10_2 -dev_group esos

When using the SCST vdisk_blockio handler, we have noticed the I/O scheduler (e.g., cfq, deadline, noop) makes a huge difference in performance. In ESOS, the default scheduler is “cfq”. This I/O scheduler likely works best when using vdisk_fileio (untested), as you’re then running a local file system on the ESOS box; your LUs point to virtual disk files that reside on that file system.

You can easily change the scheduler “live” with I/O flowing between the targets and initiators to see the difference. We haven’t done a lot of testing, simply flipping the scheduler on the block devices and seeing what the latency numbers look like on the ESXi side (the VMs). There doesn’t appear to be much (if any) difference between “noop” and “deadline” (no official testing by us). We use the “noop” scheduler for all block devices that are used with vdisk_blockio. This is set by creating the “/etc/pre-scst_xtra_conf” file and adding this to it:
for i in /sys/block/sd*; do
    echo "noop" > ${i}/queue/scheduler
done

This sets the I/O scheduler to “noop” for all SCSI disk block devices on our systems (we only use vdisk_blockio). You also need to make the file executable:
chmod +x /etc/pre-scst_xtra_conf
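
You can confirm the active scheduler for a device at any time (the scheduler shown in brackets is the one in use):
cat /sys/block/sda/queue/scheduler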

Next we zoned all of the initiators/targets into our zone sets.

We moved on to creating our security groups -- using the TUI to do this was a snap. Similar to FC zoning, for each target we create a group and add the initiators (ESXi hosts). This setup needs to be done on both of the ESOS cluster nodes.

After creating the groups, the final step for provisioning the storage is to map the SCST devices to LUNs (again, using the TUI). Like the SCST configuration above, we need to create the same LUN mapping setup on both nodes.
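
For those that prefer the CLI over the TUI, the group and LUN mapping steps look something like this sketch (the group name and initiator WWN are placeholders; the target WWN is one of ours):
scstadmin -add_group esxi_hosts -driver ocs_scst -target 10000000C9E667E9
scstadmin -add_init 21000024FF3A0001 -driver ocs_scst -target 10000000C9E667E9 -group esxi_hosts
scstadmin -add_lun 0 -driver ocs_scst -target 10000000C9E667E9 -group esxi_hosts -device ssd_r10_1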


Finished
That’s it! We did a rescan in the vSphere client and created our VMFS5 file systems on each LU. Now we have lots and lots of testing to perform. For us, this is a POC for a bigger project where we replace all of our proprietary SAN disk arrays with new “open storage” (ESOS-based) storage servers like what was described in this article, but bigger (more disks/JBODs). Look for another article in the near future!



18 comments:

  1. Great post Marc. I echoed the very same sentiment on my blog about how the LSI Syncro solution may be just what the open storage community has been longing for. http://www.ha-guru.com/ssd-zfs-bcache-enhanceio-scst/

  2. One question Marc, what target drivers are you using for the CNAs?

    1. The "ocs_fc_scst" driver from Emulex... part of the Emulex OCS SDK.

      --Marc

    2. Man I wish I could play in your lab lol. You get all the cool toys :)

  3. A VD 'owned' by the other controller should be tantamount to a 'foreign VD' as far as the local controller is concerned, which matches your quote:

    "If you run that on the node you created the volume on, it will show it, and if you run it on the other node, its not."


    What does "storcli /c0/fall show" on the non-owning controller?

    The trick is to somehow tell the local controller that a particular VD it inherited after a fail-over event is 'blocked' (accesspolicy=Blocked) and then change its status to Foreign, then have the other rebooted controller re-Import it. Alas I can't find a command to disown a VD. But perhaps it is sufficient to deliberately import the "Foreign" ID on the other controller? Will this then tell the owning controller you've lost it? Should be trivial to test.

    It is also unclear what the following controller setting does: loadbalancemode=
    Setting foreignautoimport=off would interfere with automatic I/O resumption, I expect, but if you could reliably and quickly detect when the other controller failed, then with this setting 'off' you could have some control over LUN migration.

    In sooth the Syncro cards are designed for Active/Passive and I don't think LSI wanted to address the whole A/A complexities.

    As to SCSI reservations, why not have ESOS do the SCSI reservation before it registers the device with SCST? I also am not seeing how you're setting ALUA properties with those SCST commands. I think I would have used the clustering resource script to deliberately mark paths as up or down depending on whether or not the node is the 'active node' (the one holding the matching reservation), and in a take-over situation have the script remove the other node's reservation and establish a new one with the node's ID or some other "unique" value.

  4. Oh and with all things storage, a 3-way cluster is MANDATORY. You do not want to EVER run into a split-brain situation.

  5. Hi, how do you do a manual failback (after a VD was moved to the non-native controller)? I can't find such an option in MSM or StorCLI.

    1. Reboot the host that you want to be slave/secondary. There is no way to do it via the CLI that I know of.

      --Marc

  6. Hi Marc,
    I really enjoy reading your articles and always refer to them when I play with SCST. A couple of questions though in general and about the LSI Syncro. I'm also looking at a similar setup for my work with two nodes (heads) and a DataOn Dual Controller JBOD.

    Q1. Do the LSI Syncros support IT mode or pass-through disk mode? I intend on using ZFSonLinux.
    Q2. Have you ever looked at zfsonlinux or any of the ZFS distributions? Any thoughts?
    Q3. I can see you are a big user of SCST and I have it running in house too but moving forward I will use LIO for the VAAI support. Do you have concerns/comments about SCST's lack of VAAI support? Have you tried or tested LIO?
    Q4. In an Active/Passive head arrangement, are the LSI Syncro cards even required? Can it not be done with Corosync, Heartbeat, STONITH, etc., or am I missing something? Every guide I see online has DRBD as a solution, but I don't want to replicate data... I want to use the shared storage with dual-port SAS.
    Q5. Have you looked into RSF-1 services instead of the LSI Syncro product? From what I gather, the price for their consulting in setting up the cluster is the same as a pair of LSI Syncros.
    Q6. Have you tried LSI Interposers in your systems so you can use cheaper SSDs? I would expect them to be of benefit in an active/passive setup.

    Sorry about all the questions. I am looking forward to more updates from you with some test results and descriptions on how the system is going.

    1. Hi,

      Thank you.

      Q1> I'm not sure what "IT mode" is on the LSI controllers -- is that some type of LSI feature?
      Q2> I haven't personally looked at it, but I've had others ask about supporting it in ESOS. I'm not sure that anyone has proven a benefit of ZFS over LVM and the other tools that are available in ESOS.
      Q3> SCST does have VAAI support (several of the SCSI primitives including recently COMPARE AND WRITE which provides ATS in ESXi). I believe they still don't have EXTENDED COPY support.
      Q4> Probably not, but it makes things much less complicated... you have a magic set of controllers that do all the work (without any configuration). Plus you get 1 GB cache that is mirrored between the controllers providing usable write-back cache. In my experience, simpler is always the better solution.
      Q5> I have not heard of RSF-1 but I'm going to Google them now! It seems it is more like a cluster framework than a RAID controller.
      Q6> No, I have not yet... and I do have a box of LSI interposers somewhere around here, so something to try in the future.


      --Marc

    2. Hi Marc,
      Thanks for the quick replies.

      Q1. IT is firmware that LSI provides for (some of) their controllers that enables attached disks to be passed directly through to the OS (i.e., the RAID card acts as an HBA). This is very popular with the ZFS crowd. Here is a URL with a quick explanation:

      http://linustechtips.com/main/topic/104425-flashing-an-lsi-9211-8i-raid-card-to-it-mode-for-zfssoftware-raid-tutorial/

      Why do this? To enable Btrfs and ZFS to control the disks directly. If using LVM + RAID, then it would not be useful.

      Q2.> The main benefits will be the advantages of using ZFS (or Btrfs) as the filesystem and volume manager. Also, with ZFS, you get checksums to make sure your data is correct, lots of write-caching (with lots of RAM), an intent log that can be placed on faster media to smooth out certain write workloads, a read cache that can be placed on faster media, plus built-in software RAID, unlimited snapshots, and a copy-on-write filesystem. More information can be found here:
      zfsonlinux.org
      https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/

      Q3. Thanks for that update. When I google SCST + VAAI, I normally get a discussion from last year where Bart and/or Vlad say it is not their priority. I'll take another look. Have you used LIO in your testing though? I'm very interested if SCST still has the upper hand in performance.

      Q5. RSF-1 looks to be a commercial version of corosync, heartbeat, etc. or similar. I found them when investigating ZFS clusters.

      Cheers,
      David

      Good to know about the "IT firmware"... although that appears to only be an option for SAS HBA cards... I believe the LSI 9211-8i card described in that article is a SAS HBA that might have some low-end RAID capabilities (e.g., just RAID 1/0). I'm not sure whether the higher-end RAID controllers have that capability or not... something to look into.

      I'll have to take another look at ZFS, but I don't think it would ever be an option due to the incompatible license.

      I just recently (a few days ago) added support for btrfs to ESOS. Still need to play and test with it. It'd be interesting to see if there is any performance difference between something like ZFS/btrfs vs. a high-end RAID controller when used with SCST. Something to test in the future! =)

      I honestly never even looked at LIO (or others) after reading all of the performance and feature reviews of SCST. I was sold!


      --Marc

    4. Hi Marc,

      Thank you for the great article. I am thinking about trying to build a similar array, with LSI Syncro controllers. I am curious, is there any update since the initial build? Any new problems encountered? Did LSI ever come up with a solution to the CacheCade issue? Have you done an upgrade of the ESOS OS on the controllers yet? If so, was it online? Thanks again for such great work.

      Sean

    5. Hi Sean,

      Everything has been great with the Syncro controllers; this was really a POC setup for us, and we're now working on rolling out a large setup utilizing the Syncro controllers. We are replacing all of our proprietary SAN storage with ESOS-based storage arrays. We're implementing two "storage stacks" with each stack consisting of (2) Dell PE R720's, a pair of Syncro CS 8e controllers, and (4) DataON 2U 24-slot 2.5" SAS enclosures. We've been testing this new solution for a while now and are getting ready to move it into production. After that is complete, we will begin working on another ESOS-based array using the Syncro controllers for our DR site.

      Yes, we figured out the CacheCade issue. Something I had always wondered about with MegaRAID cards was the difference between "direct" and "cached" mode. You need to use "cached" mode on VDs, and then CacheCade will be enabled for them. You still can't manually associate, as the error in the article above indicates; the controller decides for you. Each controller needs its own set of CacheCade volumes (exclusive mode). We have a CC VD set dedicated on each controller; this way we don't need to worry about which node/controller is the master, and CacheCade will work on either.

      And yes, we have done many live/online ESOS upgrades... the ALUA cluster setup works quite well for handling this. Update one host, reboot, and IO goes to the other node. It comes back up, update the other, and reboot... done.


      --Marc

  7. dear Marc
    i was looking for a way to move ownership of single VDs between the two controllers in my setup and found your blog entry. i continued searching since i could not find the answer i was looking for in here and in the end found out how to do it.. and i must say you came pretty close .. you wrote "SCSI fence would not change controller/volume ownership -- or would it..." .. well it would :)

    you could install the sg3_utils package on your server and then run the sg_persist utility in order to make a reservation. this automatically moves the ownership of the VD that corresponds to the specified block device to the current server. it's as simple as that.. commands to do so can basically be copy/pasted from the examples in the man page of sg_persist.

    another way to move ownership of all VDs to the other controller would be to reset the controller rather than rebooting the entire server.. you can use storcli /c0 reset for that.

    hope that helps you or someone else who is struggling to read the message between the lines of the sparse LSI and Intel documentation available for these controllers.

    cheers pascal

    1. That's great to hear! Thanks so much for the information Pascal -- this is very useful. I'll give this a shot on a development system when I can find some free time. =)

    2. Pascal, that works great. Thank you. Are you using this in production? Did you write an RA for managing the reservations? Or have you configured fence_scsi? I did not have success with fence_scsi so far; it was always failing when the cluster manager tried to start it. I'm not sure it would help anyway, since I don't see how it decides which node will be the reservation holder; I think another RA would be needed, something like an M/S resource that can move VDs around. That would even help spread load across the Fibre Channel/network ports of both paths in your SAS network, if you like to configure it that way. I'm currently fencing with IPMI.
