Saturday, January 17, 2015

Crazy Performance From Something So Small

So, I did a refresh on my home machine recently, or really just an entirely new machine... I picked up a used Dell Precision T7500 workstation (24 GB memory, 2 x Xeon W5590 processors). I also bought a used Fusion-io ioDrive 160 GB SLC flash memory device. I knew it was going to be fast, but I was surprised at just how fast such a little card could be.

I'm running Fedora 21 "Workstation" on this system. The driver for this card, called "VSL", is available from fusionio.com, but you need to create an account first to access it. It also appears there is a newer version of the driver/firmware if you pay for a support contract. I used version 2.3.11 of the driver, which lists support for Fedora 17. The driver is written for older kernels, so I had to change it a bit to work with 3.x -- let me know if you're interested in the changes needed for newer kernels.

Anyhow, here is a quick peek at the performance numbers on this system using the fio tool...

--snip--
# fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/fioa --size=10G
/dev/fioa: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [750.3MB/0KB/0KB /s] [192K/0/0 iops] [eta 00m:00s]
/dev/fioa: (groupid=0, jobs=1): err= 0: pid=1406: Sat Jan 17 11:00:38 2015
  read : io=10240MB, bw=763767KB/s, iops=190941, runt= 13729msec
    slat (usec): min=1, max=172, avg= 2.85, stdev= 2.61
    clat (usec): min=199, max=3604, avg=331.24, stdev=77.36
     lat (usec): min=201, max=3625, avg=334.22, stdev=77.33
    clat percentiles (usec):
     |  1.00th=[  245],  5.00th=[  253], 10.00th=[  270], 20.00th=[  294],
     | 30.00th=[  318], 40.00th=[  326], 50.00th=[  330], 60.00th=[  330],
     | 70.00th=[  334], 80.00th=[  350], 90.00th=[  402], 95.00th=[  426],
     | 99.00th=[  454], 99.50th=[  462], 99.90th=[  540], 99.95th=[ 2544],
     | 99.99th=[ 2992]
    bw (KB  /s): min=673840, max=768568, per=100.00%, avg=763737.48, stdev=18102.33
    lat (usec) : 250=3.48%, 500=96.39%, 750=0.04%, 1000=0.01%
    lat (msec) : 2=0.03%, 4=0.06%
  cpu          : usr=23.24%, sys=62.81%, ctx=254638, majf=0, minf=664
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=2621440/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=10240MB, aggrb=763767KB/s, minb=763767KB/s, maxb=763767KB/s, mint=13729msec, maxt=13729msec

Disk stats (read/write):
  fioa: ios=2607327/0, merge=31/0, ticks=815401/0, in_queue=815145, util=99.34%
--snip--

--snip--
# fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/fioa --size=10G
/dev/fioa: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0KB/747.3MB/0KB /s] [0/191K/0 iops] [eta 00m:00s]
/dev/fioa: (groupid=0, jobs=1): err= 0: pid=1433: Sat Jan 17 11:01:49 2015
  write: io=10240MB, bw=746955KB/s, iops=186738, runt= 14038msec
    slat (usec): min=1, max=192, avg= 3.33, stdev= 2.83
    clat (usec): min=192, max=3048, avg=338.28, stdev=70.32
     lat (usec): min=194, max=3052, avg=341.74, stdev=70.41
    clat percentiles (usec):
     |  1.00th=[  262],  5.00th=[  282], 10.00th=[  298], 20.00th=[  310],
     | 30.00th=[  318], 40.00th=[  322], 50.00th=[  330], 60.00th=[  334],
     | 70.00th=[  342], 80.00th=[  366], 90.00th=[  398], 95.00th=[  414],
     | 99.00th=[  454], 99.50th=[  478], 99.90th=[ 1144], 99.95th=[ 2024],
     | 99.99th=[ 2800]
    bw (KB  /s): min=660624, max=765872, per=99.99%, avg=746907.14, stdev=25759.49
    lat (usec) : 250=0.32%, 500=99.39%, 750=0.18%, 1000=0.01%
    lat (msec) : 2=0.06%, 4=0.05%
  cpu          : usr=23.67%, sys=68.75%, ctx=110028, majf=0, minf=431
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=10240MB, aggrb=746955KB/s, minb=746955KB/s, maxb=746955KB/s, mint=14038msec, maxt=14038msec

Disk stats (read/write):
  fioa: ios=109/2595463, merge=110/28, ticks=9/814160, in_queue=813744, util=99.39%
--snip--

So, in both of those tests, the first being 100% random, 100% read with 4K IOs, I'm getting 192K (192,000) IOPS! And in the second test, 100% random, 100% write with 4K IOs: 191K (191,000) IOPS! That's pretty fast for such a little package... just a single PCIe flash device.
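The headline number checks out against the raw counters in the fio output above: 10 GiB of 4 KiB reads is 2,621,440 IOs (matching `issued: total=r=2621440`), completed in a 13,729 ms runtime. A quick sanity check:

```shell
# Total IOs divided by runtime (13.729 s) should land right on the
# iops=190941 figure fio reported in the read line above.
awk 'BEGIN { printf "%d IOPS\n", 2621440 / 13.729 }'
```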

And for some sequential IO tests with a much larger IO size...

--snip--
# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=64 --name=/dev/fioa --size=10G
/dev/fioa: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [R] [92.9% done] [800.0MB/0KB/0KB /s] [200/0/0 iops] [eta 00m:01s]
/dev/fioa: (groupid=0, jobs=1): err= 0: pid=1452: Sat Jan 17 11:06:41 2015
  read : io=10240MB, bw=819392KB/s, iops=200, runt= 12797msec
    slat (usec): min=110, max=19743, avg=4959.85, stdev=8374.12
    clat (msec): min=92, max=392, avg=312.69, stdev=20.35
     lat (msec): min=92, max=411, avg=317.65, stdev=18.69
    clat percentiles (msec):
     |  1.00th=[  212],  5.00th=[  302], 10.00th=[  302], 20.00th=[  302],
     | 30.00th=[  322], 40.00th=[  322], 50.00th=[  322], 60.00th=[  322],
     | 70.00th=[  322], 80.00th=[  322], 90.00th=[  322], 95.00th=[  322],
     | 99.00th=[  322], 99.50th=[  334], 99.90th=[  392], 99.95th=[  392],
     | 99.99th=[  392]
    bw (KB  /s): min=442593, max=835584, per=97.73%, avg=800802.08, stdev=80018.74
    lat (msec) : 100=0.12%, 250=1.33%, 500=98.55%
  cpu          : usr=0.07%, sys=3.44%, ctx=1028, majf=0, minf=65543
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=0.6%, 32=1.2%, >=64=97.5%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=2560/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=10240MB, aggrb=819392KB/s, minb=819392KB/s, maxb=819392KB/s, mint=12797msec, maxt=12797msec

Disk stats (read/write):
  fioa: ios=20256/0, merge=0/0, ticks=1799659/0, in_queue=1806118, util=99.29%
--snip--

--snip--
# fio --bs=4m --direct=1 --rw=write --ioengine=libaio --iodepth=64 --name=/dev/fioa --size=10G
/dev/fioa: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0KB/767.3MB/0KB /s] [0/191/0 iops] [eta 00m:00s]
/dev/fioa: (groupid=0, jobs=1): err= 0: pid=1448: Sat Jan 17 11:06:11 2015
  write: io=10240MB, bw=786157KB/s, iops=191, runt= 13338msec
    slat (usec): min=124, max=20466, avg=5167.94, stdev=8529.73
    clat (msec): min=99, max=412, avg=326.06, stdev=21.06
     lat (msec): min=99, max=413, avg=331.23, stdev=19.41
    clat percentiles (msec):
     |  1.00th=[  225],  5.00th=[  314], 10.00th=[  314], 20.00th=[  314],
     | 30.00th=[  334], 40.00th=[  334], 50.00th=[  334], 60.00th=[  334],
     | 70.00th=[  334], 80.00th=[  334], 90.00th=[  334], 95.00th=[  334],
     | 99.00th=[  334], 99.50th=[  351], 99.90th=[  412], 99.95th=[  412],
     | 99.99th=[  412]
    bw (KB  /s): min=407157, max=802816, per=98.28%, avg=772616.08, stdev=74921.31
    lat (msec) : 100=0.12%, 250=1.17%, 500=98.71%
  cpu          : usr=3.31%, sys=2.05%, ctx=1139, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=0.6%, 32=1.2%, >=64=97.5%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=2560/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=10240MB, aggrb=786156KB/s, minb=786156KB/s, maxb=786156KB/s, mint=13338msec, maxt=13338msec

Disk stats (read/write):
  fioa: ios=59/20405, merge=55/0, ticks=7/1888035, in_queue=1893181, util=98.52%
--snip--

So with 100% sequential, 100% read using 4M IOs we see 800 MB/sec; with the same test using writes I'm seeing 767 MB/sec. Pretty fast! I'm not sure where the bottleneck is here... I believe this card is PCIe 2.0 x4, so that bus may be the limiting factor; I'll have to look into it. Either way, the random IO performance is really where it's at, and I am very much impressed.
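Some back-of-the-envelope math on the bus question (my own numbers, not from any Fusion-io documentation): PCIe 2.0 runs 5 GT/s per lane with 8b/10b encoding, roughly 500 MB/s usable per lane per direction, and PCIe 1.x is half that. So a 2.0 x4 link should have plenty of headroom above 800 MB/s, while a 1.x x4 link would be much closer to what we're seeing:

```shell
# Rough usable bandwidth per direction: lanes * ~500 MB/s (PCIe 2.0, 8b/10b),
# or lanes * ~250 MB/s for PCIe 1.x.
awk 'BEGIN { printf "%d MB/s (PCIe 2.0 x4)  %d MB/s (PCIe 1.x x4)\n", 4*500, 4*250 }'
# The actual negotiated link can be checked with: lspci -vv | grep -E 'LnkCap|LnkSta'
```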

Monday, July 14, 2014

Open Storage: Dual-Controller OSS Disk Array

Introduction
The LSI Syncro CS controllers are here, and they are what the open storage community has been longing for. If you work in enterprise IT, and you’re not familiar with open storage, then you’re missing out; here is a nice article by Aaron Newcomb describing open storage: https://blogs.oracle.com/openstorage/entry/fishworks_virtualbox_tutorial

The setup described in this article is a proof of concept (POC) for our institution that will lead us to replacing all (80+ TB) of our commercial/proprietary disk arrays that sit on our Fibre Channel (FC) Storage Area Network (SAN) with ESOS-based storage arrays.

With this dual-controller ESOS storage setup, we are also testing a new SAN "medium": Fibre Channel over Ethernet (FCoE). A converged network is really where it's at -- and for us, 10 GbE or even 40 GbE provides plenty of bandwidth for sharing. We're quite excited about this new SAN technology and hope to replace all of our traditional Fibre Channel switches some day.


The Setup
We have (2) 40 GbE switches in this test SAN environment, and we're using 10 GbE CNAs for our targets and initiators. We connected one port on each server (the targets, the ESOS storage servers, and the initiators, the ESXi hosts) to each switch, so each device is connected to both fabrics. Each server has an iDRAC, and we connected those to our management network. Each server also has (2) Ethernet NICs; we connected one on each server to our primary server network, and used a short 1’ cable to connect the two systems directly together to create a private network between them. We’ll use two Corosync rings on these interfaces.

Both of the servers came with 8 GB of RAM each, and since we use vdisk_blockio mode for the SCST devices, we’re going to use very little physical RAM. We enabled ‘memory mirroring’ mode for redundancy/protection, which gives us 4 GB of usable physical RAM on each server -- more than enough.



Two (2) Dell PowerEdge R420 (1U) servers:
  • (2) x Intel Xeon E5-2430 2.20GHz, 15M Cache, 7.2GT/s QPI, Turbo, 6C, 95W
  • (4) x 2GB RDIMM, 1333 MT/s, Low Volt, Single Rank, x8 Data Width
  • (1) x  Emulex OCe14102-UM Dual-channel, 10GBASE-SR SFP+ CNA Adapter
  • (1) x Dual Hot Plug Power Supplies 550W
Syncro shared DAS storage:
  • (1) x LSI Syncro CS 9286-8e (includes two Syncro CS 9271-8i HA controllers with CacheVault)
SAS enclosure (JBOD):
  • (1) x DataON Storage DNS-1600D (4U 24-bay Modular-SBB compliant)





Getting Started
We used the latest and greatest version of Enterprise Storage OS (ESOS) and installed it on both of our USB flash drives using a CentOS Linux system:
wget --no-check-certificate https://6f70a7c9edce5beb14bb23b042763f258934b7b9.googledrive.com/host/0B-MvNl-PpBFPbXplMmhwaElid0U/esos-0.1-r663.zip
unzip esos-0.1-r663.zip
cd esos-0.1-r663
./install.sh

When prompted during the ESOS installer, I added both the MegaCLI tool and the StorCLI tool.

Next, we booted both systems up and set the USB flash drive as the boot device. After each host loaded Enterprise Storage OS, we then configured the network interface cards, host/domain names, DNS, date/time settings, setup mail (SMTP), and set the root password.

Then we checked that both Syncro CS controllers were on the latest firmware (they were).


The LSI Syncro Controllers
So, now that we have our two Enterprise Storage OS systems up and running, let’s take a look at the new Syncro CS controller locally. First, a note on the MegaCLI tool, the StorCLI tool, and making volumes using the TUI in ESOS. I haven’t read this first-hand, but it seems the StorCLI tool is the successor to MegaCLI. It appears you can use either MegaCLI or StorCLI interchangeably with the Syncro CS controller; however, it looks like you can only use StorCLI to create a VD that is “exclusive” (not shared between both nodes). When creating VDs with MegaCLI, it’s always a shared VD. The TUI in ESOS makes use of the MegaCLI tool, so that works with this controller; however, it currently only supports basic VD creation/modification (no spanned VDs, no CacheCade stuff, etc.).

We used the StorCLI tool on the CLI to create a test virtual/logical drive on the Syncro:
storcli64 /c0 add vd r10 drives=8:1,8:2,8:3,8:4 wt nora pdperarray=2

The interesting thing to note is that the volume created above is now “owned” by the controller it was created on. Try this command on both nodes (showing that the volume is also visible/usable via MegaCLI):
MegaCli64 -ldinfo -lall -a0

If you run that on the node you created the volume on, it will show the volume; if you run it on the other node, it will not. However, the volume is most definitely accessible and usable in the OS by both nodes:
[root@blackberry ~]# sg_inq /dev/sda
standard INQUIRY:
 PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
 [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
 SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
 EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
 [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
 [SPI: Clocking=0x0  QAS=0  IUS=0]
   length=96 (0x60)   Peripheral device type: disk
Vendor identification: LSI     
Product identification: MR9286-8eHA     
Product revision level: 3.33
Unit serial number: 00a239ac8833a7ac1ad04dae06b00506
[root@gooseberry ~]# sg_inq /dev/sda
standard INQUIRY:
 PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
 [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
 SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
 EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
 [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
 [SPI: Clocking=0x0  QAS=0  IUS=0]
   length=96 (0x60)   Peripheral device type: disk
Vendor identification: LSI     
Product identification: MR9286-8eHA     
Product revision level: 3.33
Unit serial number: 00a239ac8833a7ac1ad04dae06b00506

Let’s check out the performance (locally) of the new controllers using the fio tool. For this volume (created above) we’re using STEC s842 200GB SAS SSDs. First, read performance: 100% random read, 4 KB IO size, 10 GB of data:
[root@gooseberry ~]# fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [435.3M/0K/0K /s] [111K/0 /0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=2753: Fri Mar  7 11:04:19 2014
 read : io=10240MB, bw=576077KB/s, iops=144019 , runt= 18202msec
   slat (usec): min=3 , max=260 , avg= 5.29, stdev= 1.99
   clat (usec): min=31 , max=1491 , avg=437.85, stdev=143.95
    lat (usec): min=35 , max=1495 , avg=443.28, stdev=144.88
   clat percentiles (usec):
    |  1.00th=[  199],  5.00th=[  251], 10.00th=[  282], 20.00th=[  322],
    | 30.00th=[  354], 40.00th=[  382], 50.00th=[  410], 60.00th=[  438],
    | 70.00th=[  474], 80.00th=[  532], 90.00th=[  684], 95.00th=[  756],
    | 99.00th=[  812], 99.50th=[  844], 99.90th=[  908], 99.95th=[  940],
    | 99.99th=[ 1020]
   bw (KB/s)  : min=331552, max=643216, per=100.00%, avg=578813.56, stdev=85260.02
   lat (usec) : 50=0.01%, 100=0.01%, 250=4.86%, 500=70.15%, 750=19.65%
   lat (usec) : 1000=5.32%
   lat (msec) : 2=0.01%
 cpu          : usr=21.20%, sys=78.21%, ctx=9451, majf=0, minf=89
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=2621440/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  READ: io=10240MB, aggrb=576077KB/s, minb=576077KB/s, maxb=576077KB/s, mint=18202msec, maxt=18202msec

Disk stats (read/write):
 sda: ios=2615354/0, merge=0/0, ticks=751429/0, in_queue=755922, util=99.59%

Looks like we’re getting right around 144K IOPS -- not too shabby. Now let’s check out writes: 100% random write, 4 KB IO size, 10 GB of data:
[root@gooseberry ~]# fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/306.6M/0K /s] [0 /78.5K/0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=2759: Fri Mar  7 11:10:30 2014
 write: io=10240MB, bw=309707KB/s, iops=77426 , runt= 33857msec
   slat (usec): min=6 , max=278 , avg= 9.51, stdev= 2.94
   clat (usec): min=44 , max=22870 , avg=814.95, stdev=391.06
    lat (usec): min=60 , max=22881 , avg=824.70, stdev=391.04
   clat percentiles (usec):
    |  1.00th=[  111],  5.00th=[  231], 10.00th=[  370], 20.00th=[  652],
    | 30.00th=[  732], 40.00th=[  740], 50.00th=[  748], 60.00th=[  756],
    | 70.00th=[  772], 80.00th=[ 1004], 90.00th=[ 1384], 95.00th=[ 1624],
    | 99.00th=[ 1976], 99.50th=[ 2096], 99.90th=[ 2288], 99.95th=[ 2320],
    | 99.99th=[ 2480]
   bw (KB/s)  : min=281264, max=328936, per=99.98%, avg=309659.58, stdev=11008.89
   lat (usec) : 50=0.01%, 100=0.70%, 250=4.94%, 500=9.00%, 750=37.26%
   lat (usec) : 1000=28.02%
   lat (msec) : 2=19.21%, 4=0.87%, 50=0.01%
 cpu          : usr=22.62%, sys=73.73%, ctx=24146, majf=0, minf=25
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
 WRITE: io=10240MB, aggrb=309707KB/s, minb=309707KB/s, maxb=309707KB/s, mint=33857msec, maxt=33857msec

Disk stats (read/write):
 sda: ios=0/2604570, merge=0/0, ticks=0/939581, in_queue=941070, util=99.86%

And for writes it looks like we’re getting about 77K IOPS (4 KB). Now let’s see what the performance numbers are like on the other node (the non-owner node/controller); we’ll run the same tests as above:
[root@blackberry ~]# fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [180.8M/0K/0K /s] [46.3K/0 /0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=9445: Fri Mar  7 11:17:45 2014
 read : io=10240MB, bw=173064KB/s, iops=43265 , runt= 60589msec
   slat (usec): min=5 , max=189 , avg= 8.40, stdev= 5.51
   clat (usec): min=829 , max=18076 , avg=1468.25, stdev=90.07
    lat (usec): min=837 , max=18084 , avg=1476.89, stdev=90.20
   clat percentiles (usec):
    |  1.00th=[ 1320],  5.00th=[ 1400], 10.00th=[ 1432], 20.00th=[ 1448],
    | 30.00th=[ 1464], 40.00th=[ 1464], 50.00th=[ 1480], 60.00th=[ 1480],
    | 70.00th=[ 1480], 80.00th=[ 1496], 90.00th=[ 1496], 95.00th=[ 1512],
    | 99.00th=[ 1576], 99.50th=[ 1624], 99.90th=[ 1672], 99.95th=[ 1704],
    | 99.99th=[ 2096]
   bw (KB/s)  : min=168144, max=191424, per=99.99%, avg=173050.38, stdev=2949.87
   lat (usec) : 1000=0.01%
   lat (msec) : 2=99.98%, 4=0.02%, 20=0.01%
 cpu          : usr=12.14%, sys=39.81%, ctx=204963, majf=0, minf=89
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=2621440/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  READ: io=10240MB, aggrb=173063KB/s, minb=173063KB/s, maxb=173063KB/s, mint=60589msec, maxt=60589msec

Disk stats (read/write):
 sda: ios=2613079/0, merge=0/0, ticks=3627011/0, in_queue=3626141, util=99.86%
[root@blackberry ~]# fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/137.8M/0K /s] [0 /35.3K/0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=9449: Fri Mar  7 11:19:35 2014
 write: io=10240MB, bw=132518KB/s, iops=33129 , runt= 79127msec
   slat (usec): min=3 , max=203 , avg=10.97, stdev= 6.88
   clat (usec): min=289 , max=18464 , avg=1917.83, stdev=318.12
    lat (usec): min=299 , max=18472 , avg=1929.07, stdev=317.61
   clat percentiles (usec):
    |  1.00th=[ 1240],  5.00th=[ 1512], 10.00th=[ 1656], 20.00th=[ 1784],
    | 30.00th=[ 1832], 40.00th=[ 1864], 50.00th=[ 1912], 60.00th=[ 1928],
    | 70.00th=[ 1944], 80.00th=[ 1976], 90.00th=[ 2064], 95.00th=[ 2640],
    | 99.00th=[ 3120], 99.50th=[ 3248], 99.90th=[ 3440], 99.95th=[ 3504],
    | 99.99th=[ 3696]
   bw (KB/s)  : min=127176, max=142088, per=100.00%, avg=132523.65, stdev=1832.69
   lat (usec) : 500=0.01%, 750=0.04%, 1000=0.21%
   lat (msec) : 2=87.54%, 4=12.21%, 20=0.01%
 cpu          : usr=10.16%, sys=40.90%, ctx=197327, majf=0, minf=25
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
 WRITE: io=10240MB, aggrb=132518KB/s, minb=132518KB/s, maxb=132518KB/s, mint=79127msec, maxt=79127msec

Disk stats (read/write):
 sda: ios=0/2614355, merge=0/0, ticks=0/4791266, in_queue=4790636, util=99.91%

Whoa, so there is definitely a difference in performance when accessing a volume from the non-owner node. It turns out that when IO is sent through the non-owner node, it goes through a process called “IO shipping”: the non-owner node must communicate with the owner of the volume before data can be read or written, which increases response time and thus reduces IOPS. This will be important to know for our setup below.
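Putting numbers on that penalty from the fio runs above (144,019 vs 43,265 random-read IOPS and 77,426 vs 33,129 random-write IOPS, owner vs non-owner):

```shell
# Ratio of owner-node IOPS to non-owner-node IOPS from the runs above.
awk 'BEGIN { printf "reads: %.1fx  writes: %.1fx\n", 144019/43265, 77426/33129 }'
```

So IO shipping costs us roughly 3x on reads and over 2x on writes in this test.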

So, we learned that accessing the shared virtual drives from each controller is not equal: the owning controller/node has the best performance. Now, if you reboot (turn off, kill, fail, whatever) the owner node, ownership is transferred to the surviving node, and then you see the good performance numbers on that controller. When the downed node comes back up, the VD ownership is NOT transferred back (it does not “fail back”). This is important: if you wanted to try an “active-active” setup (not true active-active, but divvying up the VDs across both controllers), this won’t work. You could set it up that way, but after a failover or reboot, the ownership will end up all lopsided. I was curious if we could manually control volume ownership without having to reboot (e.g., via the CLI), but I didn’t see anything in the documentation. I’ve asked LSI support, but have not gotten an answer yet. I would hope that feature is coming in the future; if we could control VD/LD ownership “live” (inside the OS), we could script it as part of our cluster setup (described below). Until (if ever) that feature is available, we’ll have to do an active-passive setup where all virtual drives are owned by a single controller, and when an event occurs (failure, reboot, etc.), they are all transferred to the other node.


SCSI PRs + LSI Syncro + SCST
Now let’s take a look at SCSI persistent reservations (PRs) and see how they work with the LSI Syncro CS controllers locally. We still have the volume we created above, so we’ll test with that. Let’s check for existing PR keys and register a new one:
[root@blackberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x0, there are NO registered reservation keys
[root@blackberry ~]# sg_persist --no-inquiry --out --register --param-sark=123abc /dev/sda
[root@blackberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

Now let’s take a look from the other node:
[root@gooseberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

So we have confirmed we can read the key on both nodes; now let’s try reserving the device:
[root@blackberry ~]# sg_persist --no-inquiry --out --reserve --param-rk=123abc --prout-type=1 /dev/sda
[root@blackberry ~]# sg_persist --no-inquiry --read-reservation /dev/sda
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

And we can see it on the other node:
[root@gooseberry ~]# sg_persist --no-inquiry --read-reservation /dev/sda
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

Looks like we’re good on SCSI persistent reservations locally, which we expected. Now let’s test SCSI PRs when combined with SCST. First we’ll create a vdisk_blockio SCST device using our SSD volume from above as the back-end storage. We’ll then map it to a LUN for each target (each fabric), and these will be visible to our Linux initiator test system. Verify we can see the volumes on the initiator (not using multipath-tools, since we want to easily see distinct devices for each target/node in this test):
raspberry ~ # lsscsi
[1:0:0:0]    cd/dvd  TEAC     DVD-ROM DV28SV   D.0J  /dev/sr0
[2:0:0:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:0:1:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:1:0:0]    disk    Dell     VIRTUAL DISK     1028  /dev/sda
[4:0:0:0]    disk    SCST_BIO blackberry_test   300  /dev/sdb
[5:0:2:0]    disk    SCST_BIO gooseberry_test   300  /dev/sdc
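For reference, the device creation and LUN mapping described above can be done with scstadmin along these lines (a sketch: the target driver and WWN below are placeholders, not our actual configuration; the device name matches the lsscsi output above):

```shell
# Register the shared Syncro VD as a vdisk_blockio SCST device
scstadmin -open_dev blackberry_test -handler vdisk_blockio \
    -attributes filename=/dev/sda

# Map it to LUN 0 on a target (driver name and target WWN are placeholders)
scstadmin -add_lun 0 -driver fcst -target xx:xx:xx:xx:xx:xx \
    -device blackberry_test
```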

Check there are no existing PR keys visible (on either node):
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
 PR generation=0x0, there are NO registered reservation keys
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
 PR generation=0x0, there are NO registered reservation keys

Let’s start by making a SCSI PR key and reservation on one of the systems:
raspberry ~ # sg_persist --no-inquiry --out --register --param-sark=123abc /dev/sdb
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

raspberry ~ # sg_persist --no-inquiry --out --reserve --param-rk=123abc --prout-type=1 /dev/sdb
raspberry ~ # sg_persist --no-inquiry --read-reservation /dev/sdb
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

So, we see the key registered and the reservation active on that path/node (“blackberry”); now let’s see if it’s visible on “gooseberry”:
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
 PR generation=0x0, there are NO registered reservation keys

raspberry ~ # sg_persist --no-inquiry --read-reservation /dev/sdc
 PR generation=0x0, there is NO reservation held

Nope, it’s not! This was expected as well. The vdisk_* device handlers in SCST emulate SCSI commands (e.g., the SCSI persistent reservations), so these are stored in a software layer between the initiators and the back-end storage -- the SCSI commands aren’t passed directly to the Syncro CS controllers with these device handlers. Let’s try the same test with the SCSI disk pass-through (dev_disk) handler.

Check that the SCSI device nodes show up on the initiator side:
raspberry ~ # lsscsi
[1:0:0:0]    cd/dvd  TEAC     DVD-ROM DV28SV   D.0J  /dev/sr0
[2:0:0:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:0:1:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:1:0:0]    disk    Dell     VIRTUAL DISK     1028  /dev/sda
[4:0:0:0]    disk    LSI      MR9286-8eHA      3.33  /dev/sdb
[5:0:2:0]    disk    LSI      MR9286-8eHA      3.33  /dev/sdc

Make sure there are no existing PR keys:
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
PR in: command not supported
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
PR in: command not supported

Snap! Something is wrong… we see this message on the ESOS target side:
[1902219.536846] scst: ***WARNING***: PR commands for pass-through devices not supported (device 0:2:0:0)

Looks like SCSI persistent reservations are not supported with SCST using pass-through devices, period. It would be interesting to find out if there is a technical reason that SCSI PRs aren’t supported with the pass-through handlers, or if it’s something that just hasn’t been implemented yet (in SCST). Either way, for us and our project, it doesn’t really matter -- we don’t need to support SCSI PRs (we use VMware ESXi 5.5). I’m pretty sure the Microsoft cluster stuff relies on persistent reservations, but again, we’re not going to be doing any of that with our setup.


Syncro CS Storage Setup
For this test setup, we’ll be testing with (2) Dell PowerEdge R720s running ESXi 5.5, hosting a large pool of Windows 8.1 virtual desktops. For a VDI setup, we keep the replicas (for linked clones) on fast (SSD) storage, and we’ll use the 15K disks for the linked-clone datastores. So, with the 24 slots in the DataON JBOD, we’ll split up the storage like this:
  • (1) RAID10 volume consisting of (4) STEC s842 SSD’s
  • (2) CacheCade volumes consisting of (1) STEC s842 SSD each in RAID0 (one for each controller)
  • (1) STEC s842 hot spare
  • (2) RAID10 volumes consisting of (8) Hitachi 15K drives each
  • (1) Hitachi 15K hot spare
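As a quick sanity check, that layout accounts for every slot in the 24-bay enclosure:

```shell
# 4 SSDs (RAID10) + 2 CacheCade SSDs + 1 SSD spare
# + 16 15K drives (2 x RAID10) + 1 15K spare = 24 slots
awk 'BEGIN { print 4 + 2 + 1 + 16 + 1 }'
```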



First we’ll create our SSD RAID10 volume (disable read and write cache for SSD volumes):
storcli64 /c0 add vd r10 drives=8:1,8:2,8:3,8:4 wt nora pdperarray=2

Now create both of the SAS 15K RAID10 volumes (read/write cache + CacheCade):
storcli64 /c0 add vd r10 drives=8:8,8:9,8:10,8:11,8:12,8:13,8:14,8:15 wb ra pdperarray=4
storcli64 /c0 add vd r10 drives=8:16,8:17,8:18,8:19,8:20,8:21,8:22,8:23 wb ra pdperarray=4

Add the CacheCade volumes (CacheCade VDs are like exclusive VDs, so we created one on each cluster node and don’t need to worry about which one is “active”):
storcli64 /c0 add vd cachecade type=raid0 drives=8:5 wt
storcli64 /c0 add vd cachecade type=raid0 drives=8:6 wt

One interesting point to note: when we attempted to assign VDs to a CacheCade volume before writing this document (testing), we got the following error with the Syncro CS:
[root@blackberry ~]# storcli64 /c0/v3 set ssdcaching=on
Controller = 0
Status = Failure
Description = None

Detailed Status :
===============

-----------------------------------------------------------------------------------
VD Property   Value Status ErrCd ErrMsg                                            
-----------------------------------------------------------------------------------
3 SSDCaching On    Failed  1001 Controller doesn't support manual SSC Association
-----------------------------------------------------------------------------------


We opened a support case with LSI, and were told to set up the CacheCade volume this way (associating the VDs at CacheCade VD creation time):
[root@blackberry ~]# storcli64 /c0 add vd cachecade type=raid0 drives=8:5 wt assignvds=2
Controller = 0
Status = Failure
Description = Controller doesn't support manual SSC Association


It fails with the same error, so we asked LSI again; this time their solution was to use MSM or WebBIOS. We tried WebBIOS, and after a long, confusing journey, we didn’t get any farther configuring CacheCade with that method either. From what we’ve read, there is no “manual” association of VDs with a CacheCade volume; it’s all “automatic” (supposedly). We left this as is for now, with a CacheCade VD on each controller, and we’ll revisit it during our testing to confirm it is (or isn’t) working correctly.
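To sanity-check whether CacheCade is actually engaged, the virtual drive properties can be dumped with StorCLI. A sketch only; the controller number matches our example setup and may differ on your system:

```shell
# Show all virtual drives and their properties on controller 0;
# look at the cache column and any CacheCade association details
storcli64 /c0/vall show all

# Narrow the output down to CacheCade-related lines
storcli64 /c0/vall show all | grep -i cachecade
```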

The SSD hot spare (global):
storcli64 /c0/e8/s7 add hotsparedrive

Now add the 15K global hot spare drive:
storcli64 /c0/e8/s24 add hotsparedrive

Next we set meaningful names for each of the volumes:
storcli64 /c0/v0 set name=SSD_R10_1
storcli64 /c0/v2 set name=15K_R10_1
storcli64 /c0/v3 set name=15K_R10_2


Cluster Setup
We’ll start this by enabling all of the services needed and disabling anything we’re not going to use. Edit ‘/etc/rc.conf’ and set the following (on both nodes):
rc.openibd_enable=NO
rc.opensm_enable=NO
rc.sshd_enable=YES
rc.mdraid_enable=NO
rc.lvm2_enable=NO
rc.eio_enable=NO
rc.dmcache_enable=NO
rc.btier_enable=NO
rc.drbd_enable=NO
rc.corosync_enable=YES
rc.dlm_enable=NO
rc.clvmd_enable=NO
rc.pacemaker_enable=YES
rc.fsmount_enable=YES
rc.mhvtl_enable=NO
rc.scst_enable=NO
rc.perfagent_enable=NO
rc.nrpe_enable=NO
rc.snmpd_enable=NO
rc.snmptrapd_enable=NO
rc.nut_enable=NO
rc.smartd_enable=NO

The cluster will manage SCST, so we disable it above. We also disable other things we’re not going to use in this setup (md software RAID, LVM, etc.).

Next, generate the Corosync key on one system and scp it to the other (check permissions and make them match if needed):
corosync-keygen
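Copying the key to the second node might look something like this; the path and permissions follow the Corosync defaults, and the host name is the second node from our setup (adjust for your environment):

```shell
# Corosync expects the authkey at /etc/corosync/authkey, readable by root only
scp /etc/corosync/authkey root@gooseberry:/etc/corosync/authkey
ssh root@gooseberry "chown root:root /etc/corosync/authkey && chmod 400 /etc/corosync/authkey"
```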

Now you can create/edit your corosync.conf file; we won’t go into all of the specifics of our configuration, as there is plenty of documentation on Corosync out there. Here it is:
totem {
       version: 2
       cluster_name: esos_syncro
       crypto_cipher: aes256
       crypto_hash: sha1
       rrp_mode: passive
       interface {
               ringnumber: 0
               bindnetaddr: 10.35.6.0
               mcastaddr: 226.94.1.3
               mcastport: 5411
               ttl: 1
       }
       interface {
               ringnumber: 1
               bindnetaddr: 192.168.1.0
               mcastaddr: 226.94.1.4
               mcastport: 5413
               ttl: 1
       }
}

nodelist {
       node {
               ring0_addr: 10.35.6.11
               nodeid: 1
       }
       node {
               ring0_addr: 10.35.6.12
               nodeid: 2
       }
}

logging {
       fileline: off
       to_stderr: no
       to_syslog: yes
       syslog_facility: local2
       debug: off
       timestamp: off
       logger_subsys {
               subsys: QUORUM
               debug: off
       }
}

quorum {
       provider: corosync_votequorum
       two_node: 1
}

Now let’s start Corosync (on both nodes) and check the status of the rings:
/etc/rc.d/rc.corosync start
corosync-cfgtool -s

We can now start Pacemaker on both nodes and check it:
/etc/rc.d/rc.pacemaker start
crm configure show

I was initially planning on doing fencing using SCSI persistent reservations (PRs), as the LSI manual describes for a DAS / local application setup, but this may not be the best option since a SCSI fence would not change controller/volume ownership -- or would it? For this setup, we decided not to use any fencing. We’re providing shared storage with this ESOS cluster, not running the application on the cluster itself. After some internal discussion, we concluded that fencing added complexity we did not want, with no apparent benefit for this solution (if anyone can tell us different, we'd happily listen). So, we can go ahead and disable STONITH for this cluster:
crm configure property stonith-enabled="false"
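To confirm the property took effect, the cluster configuration can be checked again:

```shell
# Look for stonith-enabled="false" among the cluster properties
crm configure show | grep -i stonith
```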

Now let’s set up the SCST ALUA configuration; we need to run both blocks of commands below, the first on “host A” and the second on “host B”. For “host A” (blackberry.mcc.edu):
scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=256
scstadmin -add_tgrp_tgt 10000000C9E667E9 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E667E9 -driver ocs_scst -attributes rel_tgt_id=1

scstadmin -add_tgrp_tgt 10000000C9E667ED -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E667ED -driver ocs_scst -attributes rel_tgt_id=2
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=257
scstadmin -add_tgrp_tgt 10000000C9E66A91 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E66A91 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=3

scstadmin -add_tgrp_tgt 10000000C9E66A95 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E66A95 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=4
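Once host A’s groups are created, the result can be spot-checked through SCST’s sysfs interface. A sketch only; the paths assume SCST’s standard sysfs layout and the group names used above:

```shell
# List the target groups defined in our "esos" device group
ls /sys/kernel/scst_tgt/device_groups/esos/target_groups/

# Check the group_id assigned to the "local" target group
cat /sys/kernel/scst_tgt/device_groups/esos/target_groups/local/group_id
```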

For “host B” (gooseberry.mcc.edu):
scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=257
scstadmin -add_tgrp_tgt 10000000C9E66A91 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E66A91 -driver ocs_scst -attributes rel_tgt_id=3

scstadmin -add_tgrp_tgt 10000000C9E66A95 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E66A95 -driver ocs_scst -attributes rel_tgt_id=4
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=256
scstadmin -add_tgrp_tgt 10000000C9E667E9 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E667E9 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=1

scstadmin -add_tgrp_tgt 10000000C9E667ED -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E667ED -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=2

Next we rebooted both systems to ensure the enabled/disabled services are set up correctly. After they both came back up, we worked on the rest of the cluster configuration.

This cluster is going to be extremely simple… no replication using DRBD, no logical volumes via LVM, so there really isn’t much to the configuration -- just SCST.

For SCST we’ll have a master/slave configuration: one host will always be the “master” and the other the “slave”. In the ALUA configuration, this makes one target (host) active/optimized and the other non-optimized, which will (should) not be used by initiators.

We’ll add it in with this:
crm
cib new scst
configure primitive p_scst ocf:esos:scst \
params alua="true" device_group="esos" \
local_tgt_grp="local" remote_tgt_grp="remote" \
m_alua_state="active" s_alua_state="nonoptimized" \
op monitor interval="10" role="Master" \
op monitor interval="20" role="Slave" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="60"
configure ms ms_scst p_scst \
meta master-max="1" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true" interleave="true"
cib commit scst
quit

Next we moved on to creating our SCST devices; we have 3 block devices (RAID volumes) we’re going to present to the initiators, so we created the 3 SCST devices on both ESOS hosts with the same names and parameters.
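Creating the devices with scstadmin follows this pattern; a sketch only, with the back-end block device nodes being hypothetical placeholders (use the device nodes of your actual RAID volumes):

```shell
# Create a vdisk_blockio device for each RAID volume
# (the /dev/sdX nodes below are examples, not our actual devices)
scstadmin -open_dev ssd_r10_1 -handler vdisk_blockio -attributes filename=/dev/sdb
scstadmin -open_dev 15k_r10_1 -handler vdisk_blockio -attributes filename=/dev/sdc
scstadmin -open_dev 15k_r10_2 -handler vdisk_blockio -attributes filename=/dev/sdd
```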

Now we can add the SCST devices to the device groups; run this on both ESOS hosts:
scstadmin -add_dgrp_dev ssd_r10_1 -dev_group esos
scstadmin -add_dgrp_dev 15k_r10_1 -dev_group esos
scstadmin -add_dgrp_dev 15k_r10_2 -dev_group esos

When using the SCST vdisk_blockio handler, we have noticed the I/O scheduler (eg, cfq, deadline, noop) makes a huge difference in performance. In ESOS, the default scheduler is “cfq”. That scheduler likely works best when using vdisk_fileio (untested), since you’re then running a local file system on the ESOS box and your LUs point to virtual disk files that reside on that file system.

You can easily change the scheduler “live”, with I/O flowing between the targets and initiators, to see the difference. We haven’t done a lot of testing; we simply flipped the scheduler on the block devices and watched the latency numbers on the ESXi side (the VMs). There doesn’t appear to be much (if any) difference between “noop” and “deadline” (no official testing by us). We use the “noop” scheduler for all block devices that are used with vdisk_blockio. This is set by creating the “/etc/pre-scst_xtra_conf” file and adding this to it:
for i in /sys/block/sd*; do
    echo "noop" > ${i}/queue/scheduler
done

This sets the I/O scheduler to “noop” for all SCSI disk block devices on our systems (we only use vdisk_blockio). You also need to make the file executable:
chmod +x /etc/pre-scst_xtra_conf
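You can verify the scheduler took effect on a given device; the kernel shows the active scheduler in square brackets (the device node below is just an example):

```shell
# The active scheduler appears in brackets, eg: noop [deadline] cfq
cat /sys/block/sda/queue/scheduler
```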

Next we zoned all of the initiators/targets into our zone sets.

We moved on to creating our security groups -- using the TUI to do this was a snap. Similar to FC zoning, for each target we create a group and add the initiators (ESXi hosts). This setup needs to be done on both of the ESOS cluster nodes.

After creating the groups, the final step for provisioning the storage is to map the SCST devices to LUNs (again, using the TUI). Like the SCST configuration above, we need to create the same LUN mapping setup on both nodes.


Finished
That’s it! We did a rescan in the vSphere client and created our VMFS5 file systems on each LU. Now we have lots and lots of testing to perform. For us, this is a POC for a bigger project where we replace all of our proprietary SAN disk arrays with new “open storage” (ESOS-based) storage servers like what was described in this article, but bigger (more disks/JBODs). Look for another article in the near future!



Saturday, February 22, 2014

eClinicalWorks (ECW) FTP File Data

So, we recently ran into a problem on our eClinicalWorks (ECW) server with a failing disk and/or a corrupt file system; specifically, the disk used to store all of the FTP data for patient documents. The disk (and/or file system) was only partially available: the data seemed to be accessible, but you could not get a complete directory listing (we assume due to file system corruption). We could copy data off if we knew the file name that we needed.

All of these patient documents are stored on the server and accessible via FTP (service running on server). The client application then looks up the file names associated with a patient, and that's what you see listed in the client application window. We figured this data (the file names) must be stored in the ECW MySQL database. We approached eClinicalWorks tech. support about getting a full listing (query) for the database, however, they were unsure of how to do this.

I didn't know the credentials for the ECW MySQL database, but they were pretty easy to find on our system. I looked in "Scheduled Tasks" applet in the Control Panel and found a job called "mysql_optimize". I found the batch script the job used and the username/password for the DB was listed.

I then connected to the DB with the mysql application and poked around a bit. I ended up doing a complete dump of the database (SQL) to a flat file, then looked for a file name that we already knew (some were able to be listed on the file system). This led me to the correct table and column needed: document (table name), fileName (column containing what we need)

Then you can simply extract the data to a flat file:
mysql -uecwDbUser -pPASSWORD -P4928
use mobiledoc;
select fileName into outfile 'c:/file_names.txt' from document;

That gave me a flat text file listing all the patient files that belong in the FTP root; I could then loop over it and run robocopy for each file using a batch script. I hope this tip helps someone else with eClinicalWorks!
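The copy loop might look something like this; a sketch only, where the source and destination paths are hypothetical (substitute the failing volume and your recovery target):

```batch
@echo off
rem Loop over the extracted file names and copy each one off the failing volume
for /f "usebackq delims=" %%f in ("c:\file_names.txt") do (
    robocopy d:\ftproot e:\recovered_ftproot "%%f"
)
```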

Tuesday, April 16, 2013

Building & Using a Highly Available ESOS Disk Array

At our institution, we have been using Enterprise Storage OS (ESOS) in production for the last year. We have (4) 24-slot SSD disk arrays that sit on our Fibre Channel Storage Area Network (SAN), and they have worked great. These particular units are used in our VMware View (VDI) farm, and even though they are “single-headed” (not fully redundant), in our VDI environment with pools across multiple datastores, even if one disk array failed, VMs would still be available on another.

In this environment, we also used another enterprise disk array (no vendor names mentioned) that was aging, and we were continuing to pay large amounts of money for maintenance... for a relatively small amount of storage space. We wanted to replace this expensive, proprietary disk array with something else.

We had been using ESOS for a while, and liked it. Recently, a number of new features were added to ESOS, including DRBD, Pacemaker + Corosync, LVM2, and other software packages and enhancements. This opened up the possibility of creating a highly available disk array using ESOS.


The Setup
At the time of building the new ESOS-based disk array, our disk space need was only 5 TB, so we wanted to make sure our new unit gave us room to grow. Since there don’t appear to be any local RAID controllers with a high availability option, support for sharing external JBODs, or anything similar, our only option was to mirror the entire disk array to another unit. With this setup, we configured two servers in a cluster, each containing its own local disks, RAID controller, and Fibre Channel HBA. The data between the two nodes is replicated using DRBD; each ESOS node sits on the SAN, and the initiators use each target node as a path. We make use of implicit Asymmetric Logical Unit Access (ALUA) in SCST to help guide/recommend (depending on the initiators) which path to use. We only want the initiators to use a single path/target (unless there is a failover), since there is more to SCSI than just reading/writing blocks of data: only the blocks are replicated via DRBD in this ESOS disk array cluster, and things like SCSI reservations are not replicated between the cluster nodes. So it’s important the initiators don’t round-robin between paths, and for clusters of initiators, they should all use the same target/path.

For this new disk array we designed, performance was not a big factor to consider -- we already had a ton of SSD-backed storage in our environment (SAN) on the other ESOS disk arrays. This new unit would primarily be used for VMware ESXi boot volumes, and a couple large general purpose VMFS volumes (for parent VMs, supporting server VMs, etc.). We decided on the LSI Logic Nytro MegaRAID application acceleration card, a relatively new product to the LSI line-up. This particular card features built-in (on card) SSD storage which allows data “hot spots” to be promoted onto the SSD storage, and it also sports 1 GB of cache. LSI Logic MegaRAID cards work well with ESOS since the text-based user interface (TUI) has basic support for configuring logical drives on these controllers.

For the replication link between the two systems, we initially planned on using InfiniBand HCAs with Sockets Direct Protocol (SDP) which utilizes Remote Direct Memory Access (RDMA), but unfortunately, since SDP is now deprecated, it is not supported by ESOS. So, we went ahead with two InfiniBand HCAs with a QDR cable between the two nodes, and used IP over InfiniBand (IPoIB).

Each node has (12) 3.5” SAS disk slots; we used 7,200 RPM 2 TB SAS drives (Seagate). We dedicate (1) drive as a global hot spare, then create one (6) disk RAID5 volume and one (5) disk RAID5 volume. This gives us approximately 16 TB of usable space (after RAID parity), which is quite a bit more than what we currently have.

We now had a plan for our new fully redundant, Fibre Channel disk array, based on Enterprise Storage OS (ESOS). We got down to business, and put in requisitions for all of the new hardware.

Cost breakdown for the new ESOS disk array:
  • ~ $1,100 - (2) Mellanox MHQH19B-XTR ConnectX 2 VPI InfiniBand HCAs
  • ~ $50 - (1) Mellanox Technologies Half M Copper 4x QSFP 30 AWG Cable
  • ~ $3,000 - (2) LSI Logic LSI00350 Nytro MegaRAID 8100-4I SAS RAID Controllers
  • ~ $300 - (2) LSI Nytro MegaRAID SCM01 RAID Controller Cache Data Protection Modules
  • ~ $6,400 - (26) Seagate Constellation ES.2 ST32000645SS 2 TB SAS-2 Hard Drives
  • ~ $2,000 - (2) QLogic 8 Gb Fibre Channel PCI-E Single Port Host Bus Adapters
  • ~ $8,100 - (2) Supermicro SuperStorage Server 6027R-E1R12T Chassis (12 x 3.5” Slots; 32 GB RAM; 2 x Intel Xeon Processors)
  • ~ $100 - (2) Lexar JumpDrive Triton 32 GB USB 3.0 Flash Drives

Total cost for a ~16 TB, fully redundant, Fibre Channel disk array: ~ $21,050... replacing your enterprise disk array for less than a year’s worth of maintenance costs... priceless!




So, we’re using the InfiniBand link for DRBD replication. In our environment, we have a normal management network for servers/devices, plus a special, non-routable private network that we use for out-of-band management interfaces (DRAC, IPMI, etc.). For these ESOS storage server nodes, we put the Supermicro out-of-band management interfaces on the private network, and also connected one of the NICs on each node to it. This connectivity is important since we use IPMI as our fencing/STONITH method later in the article. The other server NIC goes on our primary management network. The two networks are completely separate/independent, which matters since we use two Corosync rings, one per network; if one network/link goes down, the two nodes can still communicate with each other. Then each ESOS node is connected to an independent Fibre Channel (FC) fabric, and each host/initiator is connected to both fabrics, giving us full redundancy in case of a switch/fabric failure.



Installation
We spent a morning installing the servers in a rack and installing all of the components (RAID controller, HCA, HBA, etc.) in each unit. We then cabled everything, powered up each server, and installed the disks in the trays.

We started by configuring the out-of-band management interface on the SuperMicro servers. Once we got the default password changed, we opened the virtual console and set a few BIOS (UEFI) settings:
  • We enabled the “mirroring” memory mode, giving us 16 GB of available memory.
  • For the MegaRAID card, we disabled controller BIOS (not booting from any logical drives).
  • We double-checked that the QLogic HBA BIOS option was set to disabled.

Next, we created (2) ESOS USB flash drives. For the USB drives, we decided to go with an above-average device, the Lexar JumpDrive Triton 32 GB USB 3.0 flash drive. Even though our servers aren’t USB 3.0, these devices are much faster than ordinary/standard flash drives even when running at USB 2.0 speeds. This makes a noticeable difference in ESOS when booting, since the entire image is copied into a tmpfs file system on start-up, and also when sync’ing configuration changes.

We used a RHEL (6) workstation as our system to create the ESOS USB flash drives. We then downloaded and extracted the latest installation package from the ESOS project page: http://code.google.com/p/enterprise-storage-os/

wget http://enterprise-storage-os.googlecode.com/files/esos-0.1-r469.tar.xz
tar xvfJ esos-0.1-r469.tar.xz

After the archive was extracted, we plugged in the first flash drive and found the device node using the lsscsi tool. We then started the ESOS installer script:

cd esos-0.1-r469
./install.sh

The installer will prompt for the USB flash drive device node, and warn you before writing the image to the disk. After the image was successfully written, the install script then prompted us to install a third-party (proprietary) CLI RAID configuration tool. In our case, we are using LSI Logic MegaRAID cards, so we downloaded MegaCLI from the given URL and placed it into the temporary directory. The installer finished incorporating the MegaCLI tool into the image and then it was ready for use!

We repeated the above ESOS installation steps for our second server (second USB flash drive). We then labeled each flash drive with the corresponding server’s host name and inserted the drives into each server.

Since we didn’t have any other boot devices on these systems, the ESOS USB flash drive defaulted to being the first boot device (we checked via the UEFI setup screen). We booted up each ESOS storage server, and the first thing we did on both was change the default password (root/esos).




System Configuration
Next, we configured our two Ethernet network interfaces and host name in the TUI. After the interfaces were configured, we SSH’d into the machines and set the timezone, date/time, and an NTP server.



Next we need to enable IP over InfiniBand (IPoIB) for our IB interfaces on each host. Ideally, Sockets Direct Protocol (SDP) would be best for DRBD replication over InfiniBand, but SDP is now deprecated and ESOS does not support it. There have been hints in forums of DRBD adding RDMA support (which IPoIB lacks), but until then, this is probably the best solution. 10 GbE would also be a good option; truthfully, this IPoIB setup is probably only marginally better.

Edit the ‘/etc/infiniband/openib.conf’ IB driver configuration file, and set the following two lines (on both hosts):

IPOIB_LOAD=yes
SET_IPOIB_CM=yes

Next, we restarted the IB stack on each host:

/etc/rc.d/rc.openibd stop && /etc/rc.d/rc.openibd start

Now that IPoIB is loaded, we can configure the IB interfaces using the TUI. We just chose an arbitrary network range that we’re not using anywhere else on campus (even though this isn’t routable). We then started OpenSM on each storage server:

/etc/rc.d/rc.opensm start

The OpenSM InfiniBand subnet manager handles multiple instances and will make one of them enter “standby” mode. After starting the OpenSM service, we edited the ‘/etc/rc.conf’ file and set rc.opensm_enable to “YES” so it starts up on boot. We then tested the IPoIB interface by pinging the other host.

Next, we configured email (SMTP) on each ESOS storage server. ESOS uses email to communicate alerts, warnings, errors, etc. to the administrator, so it’s important to configure.


Initial Cluster Setup
Now that we have the basic system configuration out of the way for each host, we can move on to configuring the cluster. The first step in the cluster setup, will be Corosync. Here is the ‘/etc/corosync/corosync.conf’ file we used on both nodes:

# 20130410 MAS

totem {
        version: 2
        cluster_name: esos
        crypto_cipher: none
        crypto_hash: none
        rrp_mode: passive
       interface {
                ringnumber: 0
                bindnetaddr: 10.35.6.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
                ttl: 1
        }
       interface {
                ringnumber: 1
                bindnetaddr: 172.16.106.0
                mcastaddr: 226.94.1.2
                mcastport: 5407
                ttl: 1
        }
}

nodelist {
        node {
                ring0_addr: 10.35.6.21
                nodeid: 1
        }
        node {
                ring0_addr: 10.35.6.22
                nodeid: 2
        }
}

logging {
        fileline: off
        to_stderr: no
        to_syslog: yes
        syslog_facility: local2
        debug: off
        timestamp: off
        logger_subsys {
                subsys: QUORUM
                debug: off
       }
}

quorum {
        provider: corosync_votequorum
        two_node: 1
}

In our configuration, we opted to use one ring on our primary Ethernet management interface (10.35.6.0) and one ring on our special non-routable management network (172.16.106.0). Next, we restarted Pacemaker and Corosync on each host, then checked the Corosync configuration:

/etc/rc.d/rc.corosync stop
/etc/rc.d/rc.pacemaker stop
/etc/rc.d/rc.corosync start
/etc/rc.d/rc.pacemaker start
corosync-cfgtool -s

Everything looks good; we see two rings with no faults. Next, we checked the cluster configuration:

crm configure show
crm_mon -1

In our configuration, on each host we see an extra node in the config, left over from the default cluster stack configuration (eg, “node $id="16777343" raisin.mcc.edu”), so we just used ‘crm configure edit’ and removed that line.


LVM / SCST ALUA Settings
Next, we made a few LVM configuration changes to prepare for later steps; we want LVM to only discover /dev/drbdX block devices and not the underlying devices. We also set it so LVM doesn’t cache, set the default locking type to 3 (built-in cluster-wide locking), and removed the current cache file (on each host):

Edit the ‘/etc/lvm/lvm.conf’ file and set the following:
  • filter = [ "a|drbd.*|", "r|.*|" ]
  • write_cache_state = 0
  • locking_type = 3

Then remove the cache file, both the live copy and the copy on the USB configuration file system:
rm -f /etc/lvm/cache/.cache
mount /mnt/conf && rm -f /mnt/conf/etc/lvm/cache/.cache && umount /mnt/conf

Since SCST is already running (default) we went ahead and added our base ALUA settings to each host. We create a device group, which all SCST devices will be added to, and then a “local” and “remote” target group on each host. The “local” target group on each host contains the single, local Fibre Channel target. Then on the “remote” target group, we add the FC target of the other host. This setup is required for the SCST resource agent (Master/Slave -> ALUA).

On host cantaloupe.mcc.edu:

scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=1
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 50:01:43:80:21:df:9b:4c -driver qla2x00t -attributes rel_tgt_id=1
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=2
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=2

On host raisin.mcc.edu:

scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=2
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 50:01:43:80:21:df:c7:f4 -driver qla2x00t -attributes rel_tgt_id=2
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=1
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=1


Additional System Setup / Back-End Storage Configuration
After ALUA was setup on each host, we exited the shell into the TUI and sync’d the configuration (System -> Sync. Configuration); this writes the current SCST configuration to a file and syncs everything with the USB flash drive. We could now configure the ESOS system services for our setup; edit the ‘/etc/rc.conf’ file and set the following (on both hosts):

rc.openibd_enable=YES
rc.opensm_enable=YES
rc.sshd_enable=YES
rc.lvm2_enable=NO
rc.drbd_enable=NO
rc.corosync_enable=YES
rc.dlm_enable=YES
rc.clvmd_enable=YES
rc.pacemaker_enable=YES
rc.mhvtl_enable=NO
rc.scst_enable=NO

The primary services/systems we use on these hosts (DRBD, LVM, and SCST) are all managed by the cluster stack, so we disable them from starting via the init/rc scripts. Since we will be using LVM on top of DRBD, we use clvmd, which prevents (via locking) concurrent LVM metadata updates. DLM is a requirement for clvmd, so we enable that as well. Now we reboot both nodes to ensure everything starts up (or doesn’t) as expected. Check the physical console for start-up errors/messages.

We wanted to be sure the LSI Logic Nytro MegaRAID (8100-4i) cards have the newest firmware available, so we downloaded the firmware image and flashed the controller on each host:

MegaCli64 -adpfwflash -f NytroMrFw.rom -a0
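After flashing, the running firmware version can be checked with MegaCLI; a quick sketch (the exact field names in the output may vary by firmware release):

```shell
# Display adapter information and filter for the firmware package version
MegaCli64 -AdpAllInfo -a0 | grep -i "FW Package"
```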

After the firmware flash was complete, we rebooted each node. We are now ready to create our RAID logical drives (virtual drives). Since we are creating an exact replica of all the storage on each host, we’ll configure them identically. We have (12) 2 TB SAS hard drives in each box; we want (1) global hot spare drive, and then we decided on (2) RAID5 volumes (one with six disks, one with five disks). We felt this setup might give us more performance than one large RAID5 volume with (11) disks, or a RAID6 volume. Since we are using a MegaRAID controller (LSI Logic), we were able to use the TUI to provision our back-end storage.



After we created our two RAID groups on each host, we needed to set up a global hot spare drive. The TUI in ESOS does not support this feature, so we had to use the shell (Interface -> Exit to Shell):

MegaCli64 -pdhsp -set -physdrv[18:11] -a0
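To confirm the hot spare assignment, the physical drive states can be listed; the assigned drive should report a “Hotspare” firmware state:

```shell
# List physical drives and show the firmware state of each
MegaCli64 -PDList -a0 | grep -i "Firmware state"
```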


Back-End Storage Performance Testing
Before continuing our setup, we thought it would be fun to do a couple quick performance tests on the back-end storage. For these tests, we used the (6) disk RAID5 volume and used the included ‘fio’ tool in ESOS.

In this test, we are doing sequential reads with 4 MB blocks for 60 seconds:


fio --bs=4M --direct=1 --rw=read --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [352.0M/0K/0K /s] [88 /0 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3069: Wed Apr 10 13:49:37 2013
  read : io=20728MB, bw=353222KB/s, iops=86 , runt= 60091msec
    slat (usec): min=199 , max=48229 , avg=11580.02, stdev=12383.05
    clat (msec): min=66 , max=2134 , avg=726.91, stdev=58.13
     lat (msec): min=92 , max=2134 , avg=738.49, stdev=56.86
    clat percentiles (msec):
     |  1.00th=[  635],  5.00th=[  676], 10.00th=[  693], 20.00th=[  709],
     | 30.00th=[  717], 40.00th=[  725], 50.00th=[  734], 60.00th=[  742],
     | 70.00th=[  742], 80.00th=[  750], 90.00th=[  766], 95.00th=[  775],
     | 99.00th=[  791], 99.50th=[  791], 99.90th=[  799], 99.95th=[ 2114],
     | 99.99th=[ 2147]
    bw (KB/s)  : min= 5885, max=414476, per=99.32%, avg=350819.86, stdev=35462.71
    lat (msec) : 100=0.06%, 250=0.25%, 500=0.41%, 750=77.29%, 1000=21.94%
    lat (msec) : >=2000=0.06%
  cpu          : usr=0.02%, sys=2.02%, ctx=2440, majf=0, minf=65561
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.3%, 32=0.6%, >=64=98.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=5182/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=20728MB, aggrb=353222KB/s, minb=353222KB/s, maxb=353222KB/s, mint=60091msec, maxt=60091msec
--snip--

In this test, we are doing sequential writes with 4 MB blocks for 60 seconds:


fio --bs=4M --direct=1 --rw=write --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/740.0M/0K /s] [0 /185 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3072: Wed Apr 10 14:12:14 2013
  write: io=44996MB, bw=767254KB/s, iops=187 , runt= 60053msec
    slat (usec): min=347 , max=40575 , avg=5330.42, stdev=4645.18
    clat (msec): min=51 , max=395 , avg=336.13, stdev=34.02
     lat (msec): min=52 , max=397 , avg=341.46, stdev=34.21
    clat percentiles (msec):
     |  1.00th=[   74],  5.00th=[  318], 10.00th=[  322], 20.00th=[  330],
     | 30.00th=[  334], 40.00th=[  338], 50.00th=[  338], 60.00th=[  343],
     | 70.00th=[  347], 80.00th=[  351], 90.00th=[  359], 95.00th=[  363],
     | 99.00th=[  371], 99.50th=[  379], 99.90th=[  388], 99.95th=[  392],
     | 99.99th=[  396]
    bw (KB/s)  : min=692166, max=1378932, per=99.60%, avg=764193.58, stdev=62862.74
    lat (msec) : 100=1.16%, 250=0.55%, 500=98.28%
  cpu          : usr=17.26%, sys=3.64%, ctx=5367, majf=0, minf=25
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=11249/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=44996MB, aggrb=767253KB/s, minb=767253KB/s, maxb=767253KB/s, mint=60053msec, maxt=60053msec
--snip--

In this test, we are doing random reads with 4 KB blocks for 60 seconds:


fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [3128K/0K/0K /s] [782 /0 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3075: Wed Apr 10 14:14:06 2013
  read : io=191372KB, bw=3181.2KB/s, iops=795 , runt= 60158msec
    slat (usec): min=3 , max=49 , avg= 9.80, stdev= 3.59
    clat (usec): min=90 , max=1504.8K, avg=80370.26, stdev=83326.84
     lat (usec): min=107 , max=1504.8K, avg=80380.45, stdev=83326.84
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    9], 10.00th=[   12], 20.00th=[   20],
     | 30.00th=[   29], 40.00th=[   41], 50.00th=[   55], 60.00th=[   72],
     | 70.00th=[   94], 80.00th=[  126], 90.00th=[  182], 95.00th=[  241],
     | 99.00th=[  396], 99.50th=[  469], 99.90th=[  652], 99.95th=[  750],
     | 99.99th=[  979]
    bw (KB/s)  : min= 2221, max= 3368, per=99.98%, avg=3180.32, stdev=119.15
    lat (usec) : 100=0.01%, 250=0.04%, 500=0.01%, 750=0.01%
    lat (msec) : 2=0.01%, 4=0.11%, 10=6.66%, 20=14.21%, 50=26.38%
    lat (msec) : 100=24.76%, 250=23.31%, 500=4.13%, 750=0.33%, 1000=0.05%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.67%, sys=1.12%, ctx=46578, majf=0, minf=87
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=47843/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=191372KB, aggrb=3181KB/s, minb=3181KB/s, maxb=3181KB/s, mint=60158msec, maxt=60158msec
--snip--

In this test, we are doing random writes with 4 KB blocks for 60 seconds:


fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/1388K/0K /s] [0 /347 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3078: Wed Apr 10 14:15:29 2013
  write: io=118572KB, bw=1969.2KB/s, iops=492 , runt= 60216msec
    slat (usec): min=3 , max=40 , avg= 8.88, stdev= 4.42
    clat (usec): min=392 , max=614423 , avg=129850.00, stdev=95972.42
     lat (usec): min=403 , max=614434 , avg=129859.28, stdev=95974.97
    clat percentiles (usec):
     |  1.00th=[  772],  5.00th=[  828], 10.00th=[  868], 20.00th=[ 1012],
     | 30.00th=[ 1112], 40.00th=[162816], 50.00th=[177152], 60.00th=[185344],
     | 70.00th=[193536], 80.00th=[201728], 90.00th=[214016], 95.00th=[226304],
     | 99.00th=[288768], 99.50th=[350208], 99.90th=[585728], 99.95th=[593920],
     | 99.99th=[610304]
    bw (KB/s)  : min=  691, max=78858, per=100.00%, avg=1976.98, stdev=7108.54
    lat (usec) : 500=0.06%, 750=0.38%, 1000=18.95%
    lat (msec) : 2=13.32%, 4=0.28%, 50=0.07%, 100=0.17%, 250=64.89%
    lat (msec) : 500=1.58%, 750=0.29%
  cpu          : usr=0.38%, sys=0.58%, ctx=19973, majf=0, minf=24
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=29643/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=118572KB, aggrb=1969KB/s, minb=1969KB/s, maxb=1969KB/s, mint=60216msec, maxt=60216msec
--snip--

So, these numbers seem pretty much on par with what we expected for this type of disk: 352 MB/s for sequential reads, 740 MB/s for sequential writes, 782 IOPS (4 KB) for random reads, and 347 IOPS (4 KB) for random writes. During these tests, background logical drive / disk initialization was still taking place, so our numbers would likely have been a bit better once it completed. The sequential write and even the read throughput is quite nice... we're guessing this is thanks to the controller's on-board SSD volume (CacheCade) and/or the 1 GB of controller cache.


DRBD Configuration
Now, we move on to configuring DRBD. In our setup, we will have (2) DRBD resources (volumes) in dual-primary mode, with LVM running on top of each (an LVM volume group per resource). For the DRBD syncer rate, the rule of thumb we read is to cap the maximum rate at 30% of your slowest link (I/O subsystem or replication link); we settled on 75 MB/s to start with. First, we set our global/common DRBD configuration on each host; we modified the ‘/etc/drbd.d/global_common.conf’ file to look like this on both hosts:

# 20130410 MAS

global {
        usage-count no;
}

common {
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        startup {
                degr-wfc-timeout 120;
                outdated-wfc-timeout 2;
        }
        options {
                on-no-data-accessible io-error;
        }
        disk {
                on-io-error detach;
                disk-barrier no;
                disk-flushes no;
                fencing resource-only;
                al-extents 3389;
                c-plan-ahead 0;
                resync-rate 75M;
        }
        net {
                protocol C;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                rr-conflict disconnect;
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 512k;
        }
}
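The 30% rule of thumb for the resync rate mentioned above can be sketched as a quick calculation. Note the 250 MB/s figure below is just an assumed bandwidth for the slowest component, not a measurement from our setup:

```shell
# Rule-of-thumb sketch: cap DRBD's resync-rate at ~30% of the slowest
# component (replication link or backing I/O subsystem bandwidth).
slowest_mb_s=250   # assumed bandwidth of the slowest component, in MB/s
rate=$((slowest_mb_s * 30 / 100))
echo "resync-rate ${rate}M"   # -> resync-rate 75M
```

With a slowest component around 250 MB/s, that lands on the 75M value used in the configuration above.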

Next we created our DRBD resource configuration files; instead of using the generic “/dev/sdX” block device nodes for the backing storage, we used the unique SCSI disk identifiers populated in the "/dev/disk-by-id" directory. We created these two files (identical on each node) on both ESOS storage server nodes.

/etc/drbd.d/r0.res:

# 20130410 MAS

resource r0 {
        net {
                allow-two-primaries;
        }
        on cantaloupe.mcc.edu {
                device     /dev/drbd0;
                disk       /dev/disk-by-id/LUN_NAA-600605b0054a753018f855fa236d6d41;
                address    192.168.50.21:7788;
                meta-disk  internal;
        }
        on raisin.mcc.edu {
                device    /dev/drbd0;
                disk      /dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a;
                address   192.168.50.22:7788;
                meta-disk internal;
        }
}

/etc/drbd.d/r1.res:

# 20130410 MAS

resource r1 {
        net {
                allow-two-primaries;
        }
        on cantaloupe.mcc.edu {
                device     /dev/drbd1;
                disk       /dev/disk-by-id/LUN_NAA-600605b0054a753018f8565a29255421;
                address    192.168.50.21:7789;
                meta-disk  internal;
        }
        on raisin.mcc.edu {
                device    /dev/drbd1;
                disk      /dev/disk-by-id/LUN_NAA-600605b0054a751018f856f319dfd5f7;
                address   192.168.50.22:7789;
                meta-disk internal;
        }
}

Now we are ready to set up the DRBD resources. On both nodes, run the following commands:

drbdadm create-md r0
drbdadm up r0
drbdadm create-md r1
drbdadm up r1

Now, on only one of the hosts (it doesn't really matter which, since this is all fresh), run this:

drbdadm primary --force r0
drbdadm primary --force r1

The above commands make the DRBD resources primary on that host and start the full synchronization to the other host. On the non-primary host (“Secondary”) you can run the following to make the resources primary there as well:

drbdadm primary r0
drbdadm primary r1
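While the initial full sync runs, ‘cat /proc/drbd’ shows a progress bar for each resource. A quick way to pull out just the completion percentage from that output; the progress line below is a hard-coded sample in the DRBD 8.x format, for illustration only, not captured from our array:

```shell
# Sample sync-progress line in the /proc/drbd (DRBD 8.x) format
line="    [====>...............] sync'ed: 21.4% (61540/78204)M"
# Extract just the completion percentage
echo "$line" | grep -o "[0-9.]*%"   # -> 21.4%
```

Against the real file, the equivalent would be something like `grep -o "[0-9.]*%" /proc/drbd`.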


LVM Configuration
Next, we need to get the Logical Volume Manager (LVM) set up. For our configuration, we have (2) DRBD resources, and on these we will create (2) LVM physical volumes (PV) and (2) LVM volume groups (VG). We already set up our LVM device filter in the configuration file a few pages back; this way we don't get complaints from LVM about finding duplicates, since it will only match “/dev/drbdX” block devices. On just one of the hosts, we ran the following:

pvcreate /dev/drbd0
pvcreate /dev/drbd1
vgcreate -c y r0 /dev/drbd0
vgcreate -c y r1 /dev/drbd1
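For reference, the LVM device filter mentioned above (set earlier in the article) looks along these lines in ‘/etc/lvm/lvm.conf’; the exact pattern here is illustrative and may differ slightly in your setup:

```
# /etc/lvm/lvm.conf -- accept only DRBD devices when scanning for PVs,
# reject everything else (avoids duplicate-PV complaints on the backing disks)
filter = [ "a|/dev/drbd.*|", "r|.*|" ]
```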

We can now check that our (2) new LVM volume groups are available (on both hosts):

vgdisplay


More Cluster Configuration
Now we are ready to finish configuring the cluster stack; we have our DRBD resources configured and our LVM volume groups set up. Let's start by disabling STONITH (we will re-enable it at the end):

crm configure property stonith-enabled="false"

We broke each chunk of the cluster configuration out into a separate step so we can explain each piece as we go. The first chunk we added was for the DRBD resources:

crm
cib new drbd
configure primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
configure primitive p_drbd_r1 ocf:linbit:drbd \
        params drbd_resource="r1" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
configure group g_drbd p_drbd_r0 p_drbd_r1
configure ms ms_drbd g_drbd \
        meta master-max="2" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" interleave="true"
cib commit drbd
quit

In the step above, we have two DRBD resources (r0, r1) that we configured previously, and we are setting two masters (two nodes, dual-primary mode). We used the advised/default resource agent parameters for ocf:linbit:drbd.

Next, we added the resource configuration for LVM2:

crm
cib new lvm
configure primitive p_lvm_r0 ocf:heartbeat:LVM \
        params volgrpname="r0" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
configure primitive p_lvm_r1 ocf:heartbeat:LVM \
        params volgrpname="r1" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
configure group g_lvm p_lvm_r0 p_lvm_r1
configure clone clone_lvm g_lvm \
        meta interleave="true" target-role="Started"
cib commit lvm
quit

For LVM, we have two DRBD resources (r0, r1) that we are running LVM on top of. As mentioned earlier, the clvmd service is used in conjunction with this type of setup. This could have been done other ways, but we felt it was simplest to run LVM on top of a couple of large DRBD resources, instead of trying to set up a DRBD resource for each individual volume we wanted to share on our SAN. The cluster configuration for these resources was straightforward: a primitive for each volume group (r0, r1), and then a clone statement so they are started on both of our nodes.

Next we added the SCST configuration. In this setup, only one of the two nodes will be “Master” for the SCST resource (and the other “Slave”). Again, this is used with the ALUA setup in SCST, which is our extra state for the resource (SCST itself is always started/running; only the ALUA information is updated). The parameters for this resource specify the SCST ALUA device group name, the “local” target group name, and the “remote” target group name. It is exactly what it sounds like: the local target group contains the targets local to that node, and the remote group contains the other node's targets. We added the SCST ALUA device group and target groups earlier in the article.

crm
cib new scst
configure primitive p_scst ocf:esos:scst \
        params alua="true" device_group="esos" \
        local_tgt_grp="local" remote_tgt_grp="remote" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="60"
configure ms ms_scst p_scst \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" interleave="true"
cib commit scst
quit

In the step above, the SCST RA is configured with only one master, and we don’t care which one it is since LVM and DRBD are both running active/active on the cluster.

Next, we added the order and colocation rules. At this point, resources had been trying to start, promote, etc. as we added them, and some may have failed since we didn't add the constraints as we went; in our case it didn't matter much, since this is a new cluster with nothing connected to it yet. Here are the constraints we used:

crm
cib new constraints
colocation c_r0_r1 inf: ms_scst:Started clone_lvm:Started ms_drbd:Master
order o_r0_r1 inf: ms_drbd:promote clone_lvm:start ms_scst:start
cib commit constraints
quit

Above you can see the colocation and order rules we added... we want DRBD to be promoted to master first, then LVM can start, and then SCST can start. This was the last main cluster configuration step.

When the cluster attempted to start the LVM resources, they failed, since at that point there were no logical volumes (LV) in the volume groups. So, we went ahead and created one on each:

lvcreate -L 4T -n big_vmfs_1 r0
lvcreate -L 4T -n big_vmfs_2 r1

We used the ‘crm resource cleanup’ command to clear the failed / timed-out resources, and everything started as expected:

--snip--
Last updated: Thu Apr 11 11:54:41 2013
Last change: Thu Apr 11 11:49:44 2013 via cibadmin on raisin.mcc.edu
Stack: corosync
Current DC: cantaloupe.mcc.edu (1) - partition with quorum
Version: 1.1.8-1f8858c
2 Nodes configured, unknown expected votes
10 Resources configured.


Online: [ cantaloupe.mcc.edu raisin.mcc.edu ]

 Master/Slave Set: ms_scst [p_scst]
     Masters: [ cantaloupe.mcc.edu ]
     Slaves: [ raisin.mcc.edu ]
 Clone Set: clone_lvm [g_lvm]
     Started: [ cantaloupe.mcc.edu raisin.mcc.edu ]
 Master/Slave Set: ms_drbd [g_drbd]
     Masters: [ cantaloupe.mcc.edu raisin.mcc.edu ]
--snip--




Provisioning Storage
Now that the cluster is configured, we moved on to provisioning our storage. First, we zoned each of our initiators with each target (on each switch). Then, using the TUI, we created a host group (Hosts -> Add Group) for each server and added each server's initiator to its group (Hosts -> Add Initiator). After zoning everything on our Fibre Channel switches, we used the ‘fcc.sh’ tool in the ESOS shell to get a list of the visible FC initiators, which made it very easy to copy/paste the initiator names into the TUI.

Next we created a 50 GB boot volume for each of our (4) ESXi hosts; we used the CLI to do this (LVM logical volumes):

lvcreate -L 50G -n boot_mulberry r0
lvcreate -L 50G -n boot_lime r0
lvcreate -L 50G -n boot_banana r0
lvcreate -L 50G -n boot_keylime r0

Then, after we created the (4) ESXi boot volumes above, on each ESOS storage server, using the TUI, we added the SCST device for each (vdisk_blockio), and then mapped each device as LUN 0 to each corresponding host group (Devices -> Map to Group). For each SCST device we created using the vdisk_blockio mode, we made sure to set “Write Through” to Yes/1 and “NV Cache” to No/0 since we are using DRBD in dual-primary mode and would most definitely like to avoid data divergence!
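The TUI device-creation step above can also be done from the CLI with scstadmin; a sketch for one of the boot volumes follows. We believe the attribute syntax below matches scstadmin of that era, and the LV path assumes the device sits in the r0 volume group, so double-check both against your version:

```
# Sketch: create a vdisk_blockio SCST device backed by an LVM LV, with
# write-through on and NV cache off (important for dual-primary DRBD)
scstadmin -open_dev boot_mulberry -handler vdisk_blockio \
    -attributes filename=/dev/r0/boot_mulberry,write_through=1,nv_cache=0
```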



For each SCST device we created, we ran the following command on both hosts to add the device into our SCST implicit ALUA configuration:

scstadmin -add_dgrp_dev boot_mulberry -dev_group esos
scstadmin -add_dgrp_dev boot_lime -dev_group esos
scstadmin -add_dgrp_dev boot_banana -dev_group esos
scstadmin -add_dgrp_dev boot_keylime -dev_group esos
scstadmin -add_dgrp_dev big_vmfs_1 -dev_group esos
scstadmin -add_dgrp_dev big_vmfs_2 -dev_group esos


Final Cluster Setup
Now that our cluster is set up, some storage is provisioned, and everything is working, we can add fencing mechanisms into our configuration and re-enable STONITH:

crm
cib new stonith
configure primitive fence_cantaloupe stonith:fence_ipmilan \
params pcmk_host_list="cantaloupe.mcc.edu" ipaddr="172.16.6.21" \
login="user" passwd="password" lanplus="true" \
op monitor interval="60"
configure primitive fence_raisin stonith:fence_ipmilan \
params pcmk_host_list="raisin.mcc.edu" ipaddr="172.16.6.22" \
login="user" passwd="password" lanplus="true" \
op monitor interval="60"
cib commit stonith
quit

crm configure property stonith-enabled="true"

Finally, we tested our fencing mechanism (one node at a time) to make sure it works:

crm node fence NODE_NAME

After we were sure everything was tested and working as it should be, we enabled a cluster-status-change email mechanism. The crm_mon utility supports an external agent; we used the ocf:pacemaker:ClusterMon resource agent and the crm_mon_email.sh script that ESOS includes to send simple/basic emails when anything in the cluster changes. This is not something you want enabled while testing, as it sends an individual email for each cluster status change, so you can rack up a fair number of emails from something as simple as a node rebooting. We configured our ClusterMon RA like this:

crm
cib new clustermon
configure primitive p_notify ocf:pacemaker:ClusterMon \
params user="root" update="30" \
extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
op monitor on-fail="restart" interval="10"
configure clone clone_notify p_notify \
meta target-role="Started"
cib commit clustermon
quit

There, that's it! Our ESOS disk array cluster is fully functional and tested. Here is our final cluster configuration (`crm configure show`), just for reference:

node $id="1" cantaloupe.mcc.edu
node $id="2" raisin.mcc.edu
primitive fence_cantaloupe stonith:fence_ipmilan \
        params pcmk_host_list="cantaloupe.mcc.edu" ipaddr="172.16.6.21" login="user" passwd="password" lanplus="true" \
        op monitor interval="60"
primitive fence_raisin stonith:fence_ipmilan \
        params pcmk_host_list="raisin.mcc.edu" ipaddr="172.16.6.22" login="user" passwd="password" lanplus="true" \
        op monitor interval="60"
primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
primitive p_drbd_r1 ocf:linbit:drbd \
        params drbd_resource="r1" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
primitive p_lvm_r0 ocf:heartbeat:LVM \
        params volgrpname="r0" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive p_lvm_r1 ocf:heartbeat:LVM \
        params volgrpname="r1" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive p_notify ocf:pacemaker:ClusterMon \
        params user="root" update="30" extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
        op monitor on-fail="restart" interval="10"
primitive p_scst ocf:esos:scst \
        params alua="true" device_group="esos" local_tgt_grp="local" remote_tgt_grp="remote" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="60"
group g_drbd p_drbd_r0 p_drbd_r1
group g_lvm p_lvm_r0 p_lvm_r1
ms ms_drbd g_drbd \
        meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true"
ms ms_scst p_scst \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true"
clone clone_lvm g_lvm \
        meta interleave="true" target-role="Started"
clone clone_notify p_notify \
        meta target-role="Started"
colocation c_r0_r1 inf: ms_scst:Started clone_lvm:Started ms_drbd:Master
order o_r0_r1 inf: ms_drbd:promote clone_lvm:start ms_scst:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.8-1f8858c" \
        cluster-infrastructure="corosync" \
        stonith-enabled="true" \
        last-lrm-refresh="1365772801"

This disk array is currently only being used in a VMware vSphere (ESXi) environment. VMware ESXi supports implicit ALUA, and you can check the pathing in the vSphere Client by going to the Configuration tab for a host, clicking Storage, clicking Properties for a datastore, and finally clicking Manage Paths. We used the “Most Recently Used” path selection policy and checked that each datastore selected the correct path for I/O. We also noticed that when using ALUA with SCST in ESOS, the storage array type shows “VMW_SATP_ALUA”; for a non-ALUA SCST/ESOS configuration, it usually shows “VMW_SATP_DEFAULT_AA”.

One other thing we typically do with VMware ESXi initiators when using them with ESOS/SCST is disable support for the vStorage APIs for Array Integration (VAAI). It's not currently supported on these disk arrays, and it just pollutes the logs since the VAAI SCSI commands fail (not supported). In the vSphere Client, for each host, go to the Configuration tab, then Advanced Settings, and set the following to ‘0’:
  • /VMFS3/HardwareAcceleratedLocking
  • /DataMover/HardwareAcceleratedMove
  • /DataMover/HardwareAcceleratedInit
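On ESXi 5.x, the same settings can also be changed from the host's CLI, and the in-use SATP checked there as well; the commands below are standard esxcli, though the option paths should be verified against your ESXi version:

```
# Disable the three VAAI primitives from the ESXi shell (run per host)
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0

# Confirm which SATP claimed each device (expect VMW_SATP_ALUA with this setup)
esxcli storage nmp device list
```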

This concludes my article on building and using a Fibre Channel disk array based on Enterprise Storage OS (ESOS). This unit has been in production for less than a week now, and I will follow up on this article after some time with our experiences using the disk array. Please leave any comments/questions; I hope others find this useful!