Friday, June 3, 2011

SCST & SSD Arrays

So, we were so pleased with our first stab at an SSD-based Fibre Channel disk array (used with VMware View) that we decided to create a couple more and use SSD storage for all of our VDI needs.

We purchased (2) additional "storage servers" that look something like this:
  • SuperMicro 2U 24x2.5in Bay Chassis w/900W Redundant
  • 5520 Chipset Motherboard 2xLAN IPMI 2.0
  • (2) Intel Xeon® Processor E5645 (12M Cache, 2.40 GHz, 5.86 GT/s Intel® QPI)
  • 12 GB 1333MHz DDR3 ECC Memory (8GB usable with sparing mode)
  • (1) LSI MegaRAID 9280-24i4e SATA/SAS 6Gb/s PCIe 2.0 w/ 512MB (with FastPath)
  • (2) 40GB X25-V 34NM 2.5IN SATA 9.5MM SSDs (for RAID1 system/OS volume)
  • (22) Crucial RealSSD C300 256GB CTFDDAC256MAG (for data RAID5 volumes)
  • (2) QLogic 8GB SINGLE PORT FC HBA PCIE8 LC MULTIMODE OPTIC
On each disk array, I used the same OS image from our first SCST disk array (see previous article). The two smaller disks are used for the system volume, and then we have (3) SSD volumes for the VMFS datastores, each a 7-disk RAID5, plus a global hot spare (2 system + 21 data + 1 HS = 24 slots).
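For reference, carving up the data volumes with MegaCLI looks something like this (a sketch only: the enclosure ID and slot numbers are illustrative for the 24-bay backplane, and WT/NORA matches the "Direct, WT, NORA" policy used in the tests below):

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[245:3,245:4,245:5,245:6,245:7,245:8,245:9] WT NORA -a0
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[245:10,245:11,245:12,245:13,245:14,245:15,245:16] WT NORA -a0
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[245:17,245:18,245:19,245:20,245:21,245:22,245:23] WT NORA -a0
# last data SSD as a global hot spare
/opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -PhysDrv[245:24] -a0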

So, we now have a total of 6 separate (3 on each array) VMFS volumes that are ~1.5TB each. I did some initial testing on one of the arrays using fio to see what kind of "raw" performance numbers we would get (local to the disk array system, not through Fibre Channel / SCST).

"Local" Array Performance (4K random reads on a single 7 disk RAID5 SSD volume, Direct, WT, NORA):
fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [312M/0K /s] [78K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=11648
read : io=18,041MB, bw=301MB/s, iops=76,972, runt= 60001msec
slat (usec): min=3, max=156, avg= 5.31, stdev= 7.66
clat (usec): min=321, max=8,185, avg=824.74, stdev=169.80
bw (KB/s) : min=268120, max=315648, per=99.99%, avg=307869.58, stdev=8951.39
cpu : usr=15.01%, sys=49.67%, ctx=169252, majf=0, minf=12851
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=4618408/0, short=0/0
lat (usec): 500=4.13%, 750=27.28%, 1000=55.97%
lat (msec): 2=12.60%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=18,041MB, aggrb=301MB/s, minb=308MB/s, maxb=308MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=4608276/0, merge=0/0, ticks=3004407/0, in_queue=3003212, util=99.28%

"Local" Array Performance (4K random writes on a single 7 disk RAID5 SSD volume, Direct, WT, NORA):
fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/78M /s] [0/20K iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=11653
write: io=4,549MB, bw=77,632KB/s, iops=19,407, runt= 60004msec
slat (usec): min=3, max=59, avg= 6.34, stdev= 3.20
clat (usec): min=243, max=13,807, avg=3289.11, stdev=1249.05
bw (KB/s) : min=75032, max=79000, per=100.03%, avg=77652.57, stdev=626.48
cpu : usr=6.28%, sys=15.93%, ctx=249186, majf=0, minf=11782
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/1164553, short=0/0
lat (usec): 250=0.01%, 500=0.10%, 750=0.51%, 1000=1.38%
lat (msec): 2=16.30%, 4=53.07%, 10=28.63%, 20=0.01%

Run status group 0 (all jobs):
WRITE: io=4,549MB, aggrb=77,631KB/s, minb=79,494KB/s, maxb=79,494KB/s, mint=60004msec, maxt=60004msec

Disk stats (read/write):
sdb: ios=1/1161934, merge=0/0, ticks=1/3769371, in_queue=3769089, util=99.77%

So, with those "local" tests I'm seeing ~78,000 4K IOPS on reads, and ~20,000 IOPS on writes. I did some tests using fio with all three RAID5 volumes and the numbers stay about the same, so I assume those numbers (78K/20K) are limits of the MegaRAID controller, and not the SSD disks themselves. I also tried a RAID0 volume (with 7 SSDs) just to see what the numbers were like and it significantly improved the write IOPS: ~80K 4K IOPS for read and ~80K 4K IOPS for write.

Again, it seems like that ~80K 4K IOPS ceiling (read or write) is the RAID controller. I suppose an even higher-performing solution would be to use a separate MegaRAID controller for each volume (one of those fancy new 2nd generation SAS models), but I think the performance we can achieve with our current solution will be satisfactory for us.

Now let's look at 4K IOPS performance over SCST and our Fibre Channel SAN. Our SAN for VDI consists of two QLogic SANbox 5800 FC switches (8 Gbps Fibre Channel). Both of these new SSD disk arrays have (2) 8 Gbps FC HBAs each; however, the test box I used as the initiator only has (2) 4 Gbps QLogic HBAs. Everything is set up as you would expect: each disk array has a target HBA connected to each fabric (switch), and the Linux initiator has an HBA connected to each fabric.

On the initiator side: a vanilla Linux 2.6.37.6 kernel, QLogic QLE2460 HBAs with firmware version 5.03.16, and multipath-tools with round-robin pathing.
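For anyone curious, the relevant pieces of my multipath.conf look roughly like the sketch below (not my exact file). The "SCST_BIO" vendor string is what SCST's vdisk_blockio devices report by default (verify with /sys/block/sdX/device/vendor on the initiator), and the wwid is a placeholder for whatever scsi_id returns for each LUN:

devices {
        device {
                vendor                  "SCST_BIO"
                product                 "*"
                path_grouping_policy    multibus
                path_selector           "round-robin 0"
                rr_min_io               100
                failback                immediate
        }
}

multipaths {
        multipath {
                wwid    <wwid of the LUN as reported by scsi_id>
                alias   tangerine_ssd_1
        }
        # ...one multipath {} stanza per volume (tangerine_ssd_2, grapefruit_ssd_1, and so on)
}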

Across SCST Performance (same RAID volume as above; 4K random reads):
fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/mapper/tangerine_ssd_1
fio 1.50-rc4
/dev/mapper/tangerine_ssd_1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [250.9M/0K /s] [62.8K/0 iops] [eta 00m:00s]
/dev/mapper/tangerine_ssd_1: (groupid=0, jobs=1): err= 0: pid=9277
read : io=14632MB, bw=249721KB/s, iops=62430 , runt= 60001msec
slat (usec): min=3 , max=634 , avg=13.57, stdev=10.98
clat (usec): min=351 , max=8697 , avg=1009.51, stdev=81.83
lat (usec): min=403 , max=8703 , avg=1023.30, stdev=81.99
bw (KB/s) : min=236488, max=252488, per=100.00%, avg=249722.82, stdev=2990.20
cpu : usr=12.06%, sys=84.85%, ctx=86338, majf=0, minf=11513
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=3745882/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.30%, 1000=42.88%
lat (msec): 2=56.82%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=14632MB, aggrb=249721KB/s, minb=255714KB/s, maxb=255714KB/s, mint=60001msec, maxt=60001msec

Across SCST Performance (same RAID volume as above; 4K random writes):
fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/mapper/tangerine_ssd_1
fio 1.50-rc4
/dev/mapper/tangerine_ssd_1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/59027K /s] [0 /14.5K iops] [eta 00m:00s]
/dev/mapper/tangerine_ssd_1: (groupid=0, jobs=1): err= 0: pid=9280
write: io=3383.6MB, bw=57742KB/s, iops=14435 , runt= 60004msec
slat (usec): min=4 , max=271 , avg= 8.82, stdev= 5.54
clat (usec): min=345 , max=13390 , avg=4421.59, stdev=1390.88
lat (usec): min=358 , max=13397 , avg=4430.69, stdev=1390.62
bw (KB/s) : min=54768, max=60160, per=100.03%, avg=57759.66, stdev=955.39
cpu : usr=6.01%, sys=19.92%, ctx=315079, majf=0, minf=11447
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/866195/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.05%, 1000=0.10%
lat (msec): 2=1.64%, 4=39.88%, 10=58.25%, 20=0.08%

Run status group 0 (all jobs):
WRITE: io=3383.6MB, aggrb=57742KB/s, minb=59128KB/s, maxb=59128KB/s, mint=60004msec, maxt=60004msec

Looks like we're getting ~63,000 4K random read IOPS and ~15,000 4K random write IOPS across SCST and the FC fabric. Some drop is expected; I doubted we would get the same local/raw performance numbers as above, since there are a few extra layers in between now, but this still seems quite good. By the way, on the SCST target FC/SSD disk arrays, I'm using a vanilla 2.6.36.2 kernel and SCST 2.0.0.2-rc1.

For my final test, I wanted to drive all six SSD/RAID5 volumes across the fabric from a single Linux initiator, with one fio process per volume.

All 6 SSD volumes -- 4K random read:
fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/mapper/tangerine_ssd_1 --name=/dev/mapper/tangerine_ssd_2 --name=/dev/mapper/tangerine_ssd_3 --name=/dev/mapper/grapefruit_ssd_1 --name=/dev/mapper/grapefruit_ssd_2 --name=/dev/mapper/grapefruit_ssd_3
fio 1.50-rc4
/dev/mapper/tangerine_ssd_1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/tangerine_ssd_2: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/tangerine_ssd_3: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_2: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_3: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 6 processes
Jobs: 6 (f=6): [rrrrrr] [100.0% done] [547.2M/0K /s] [137K/0 iops] [eta 00m:00s]
/dev/mapper/tangerine_ssd_1: (groupid=0, jobs=1): err= 0: pid=9298
read : io=5369.5MB, bw=91638KB/s, iops=22909 , runt= 60001msec
slat (usec): min=3 , max=11269 , avg=20.81, stdev=72.91
clat (usec): min=299 , max=445984 , avg=2767.23, stdev=4533.97
lat (usec): min=341 , max=445991 , avg=2788.41, stdev=4534.80
bw (KB/s) : min=15610, max=151344, per=16.80%, avg=91660.17, stdev=13761.63
cpu : usr=5.19%, sys=43.27%, ctx=374857, majf=0, minf=51459
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1374589/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.07%, 1000=0.49%
lat (msec): 2=16.75%, 4=77.94%, 10=4.50%, 20=0.18%, 50=0.01%
lat (msec): 100=0.02%, 250=0.02%, 500=0.01%
/dev/mapper/tangerine_ssd_2: (groupid=0, jobs=1): err= 0: pid=9299
read : io=5343.9MB, bw=91197KB/s, iops=22799 , runt= 60003msec
slat (usec): min=3 , max=14037 , avg=21.40, stdev=76.86
clat (usec): min=251 , max=446537 , avg=2780.51, stdev=4427.36
lat (usec): min=280 , max=446543 , avg=2802.28, stdev=4428.62
bw (KB/s) : min=31528, max=159192, per=16.76%, avg=91477.19, stdev=15133.74
cpu : usr=5.04%, sys=43.70%, ctx=372558, majf=0, minf=51302
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1368022/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.03%, 1000=0.54%
lat (msec): 2=16.75%, 4=77.62%, 10=4.77%, 20=0.26%, 50=0.01%
lat (msec): 250=0.01%, 500=0.01%
/dev/mapper/tangerine_ssd_3: (groupid=0, jobs=1): err= 0: pid=9300
read : io=5268.4MB, bw=89912KB/s, iops=22478 , runt= 60001msec
slat (usec): min=3 , max=19369 , avg=21.89, stdev=78.76
clat (usec): min=178 , max=445746 , avg=2819.41, stdev=4855.91
lat (usec): min=316 , max=445753 , avg=2841.68, stdev=4857.13
bw (KB/s) : min=10968, max=123024, per=16.36%, avg=89298.15, stdev=14663.35
cpu : usr=4.82%, sys=43.83%, ctx=351152, majf=0, minf=52151
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1348707/0/0, short=0/0/0
lat (usec): 250=0.01%, 500=0.01%, 750=0.03%, 1000=0.45%
lat (msec): 2=16.25%, 4=78.00%, 10=4.91%, 20=0.29%, 50=0.01%
lat (msec): 100=0.01%, 250=0.03%, 500=0.01%
/dev/mapper/grapefruit_ssd_1: (groupid=0, jobs=1): err= 0: pid=9301
read : io=5343.6MB, bw=91187KB/s, iops=22796 , runt= 60001msec
slat (usec): min=3 , max=6865 , avg=21.04, stdev=74.24
clat (usec): min=84 , max=446529 , avg=2781.13, stdev=3999.71
lat (usec): min=253 , max=446535 , avg=2802.53, stdev=4001.31
bw (KB/s) : min=27016, max=111320, per=16.71%, avg=91186.03, stdev=12750.97
cpu : usr=5.01%, sys=42.90%, ctx=377644, majf=0, minf=51968
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1367822/0/0, short=0/0/0
lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.04%, 1000=0.25%
lat (msec): 2=15.46%, 4=79.31%, 10=4.61%, 20=0.30%, 50=0.01%
lat (msec): 250=0.01%, 500=0.01%
/dev/mapper/grapefruit_ssd_2: (groupid=0, jobs=1): err= 0: pid=9302
read : io=5407.4MB, bw=92281KB/s, iops=23070 , runt= 60003msec
slat (usec): min=3 , max=12871 , avg=20.83, stdev=72.89
clat (usec): min=247 , max=446736 , avg=2748.12, stdev=4044.39
lat (usec): min=287 , max=446754 , avg=2769.32, stdev=4045.63
bw (KB/s) : min=12690, max=185760, per=16.91%, avg=92259.09, stdev=16308.76
cpu : usr=5.20%, sys=43.25%, ctx=382840, majf=0, minf=51607
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1384289/0/0, short=0/0/0
lat (usec): 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.47%
lat (msec): 2=16.74%, 4=77.95%, 10=4.57%, 20=0.20%, 50=0.01%
lat (msec): 100=0.01%, 250=0.01%, 500=0.01%
/dev/mapper/grapefruit_ssd_3: (groupid=0, jobs=1): err= 0: pid=9303
read : io=5246.9MB, bw=89543KB/s, iops=22385 , runt= 60002msec
slat (usec): min=3 , max=19791 , avg=21.80, stdev=79.92
clat (usec): min=313 , max=445617 , avg=2831.70, stdev=4393.71
lat (usec): min=331 , max=445622 , avg=2853.87, stdev=4395.41
bw (KB/s) : min=29872, max=136048, per=16.46%, avg=89841.24, stdev=13524.75
cpu : usr=4.95%, sys=42.92%, ctx=360204, majf=0, minf=51719
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1343184/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.03%, 1000=0.24%
lat (msec): 2=14.45%, 4=80.09%, 10=4.79%, 20=0.35%, 50=0.01%
lat (msec): 100=0.01%, 250=0.02%, 500=0.01%

Run status group 0 (all jobs):
READ: io=31979MB, aggrb=545746KB/s, minb=91691KB/s, maxb=94496KB/s, mint=60001msec, maxt=60003msec

All 6 SSD volumes -- 4K random write:
fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/mapper/tangerine_ssd_1 --name=/dev/mapper/tangerine_ssd_2 --name=/dev/mapper/tangerine_ssd_3 --name=/dev/mapper/grapefruit_ssd_1 --name=/dev/mapper/grapefruit_ssd_2 --name=/dev/mapper/grapefruit_ssd_3
fio 1.50-rc4
/dev/mapper/tangerine_ssd_1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/tangerine_ssd_2: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/tangerine_ssd_3: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_2: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_3: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 6 processes
Jobs: 6 (f=6): [wwwwww] [100.0% done] [0K/188.4M /s] [0 /47.9K iops] [eta 00m:00s]
/dev/mapper/tangerine_ssd_1: (groupid=0, jobs=1): err= 0: pid=9306
write: io=1859.5MB, bw=31730KB/s, iops=7932 , runt= 60009msec
slat (usec): min=4 , max=547 , avg=13.38, stdev=16.73
clat (usec): min=643 , max=123142 , avg=8049.29, stdev=2221.02
lat (usec): min=654 , max=123150 , avg=8062.96, stdev=2220.61
bw (KB/s) : min=24761, max=49048, per=16.90%, avg=31748.24, stdev=1998.66
cpu : usr=3.21%, sys=15.46%, ctx=157631, majf=0, minf=38077
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/476019/0, short=0/0/0
lat (usec): 750=0.01%, 1000=0.01%
lat (msec): 2=0.03%, 4=0.60%, 10=96.35%, 20=2.98%, 50=0.01%
lat (msec): 100=0.01%, 250=0.03%
/dev/mapper/tangerine_ssd_2: (groupid=0, jobs=1): err= 0: pid=9307
write: io=1825.4MB, bw=31149KB/s, iops=7787 , runt= 60007msec
slat (usec): min=4 , max=545 , avg=13.52, stdev=16.77
clat (msec): min=1 , max=134 , avg= 8.20, stdev= 2.72
lat (msec): min=1 , max=134 , avg= 8.21, stdev= 2.72
bw (KB/s) : min=24024, max=39912, per=16.57%, avg=31139.16, stdev=1433.01
cpu : usr=3.05%, sys=15.58%, ctx=158030, majf=0, minf=37489
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/467289/0, short=0/0/0

lat (msec): 2=0.01%, 4=0.18%, 10=95.90%, 20=3.86%, 50=0.01%
lat (msec): 100=0.01%, 250=0.04%
/dev/mapper/tangerine_ssd_3: (groupid=0, jobs=1): err= 0: pid=9308
write: io=1833.9MB, bw=31296KB/s, iops=7824 , runt= 60004msec
slat (usec): min=4 , max=449 , avg=13.45, stdev=16.35
clat (usec): min=612 , max=129471 , avg=8160.29, stdev=2552.85
lat (usec): min=627 , max=129483 , avg=8174.02, stdev=2552.47
bw (KB/s) : min=23936, max=33312, per=16.56%, avg=31117.90, stdev=1168.57
cpu : usr=3.07%, sys=15.66%, ctx=158910, majf=0, minf=38423
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/469473/0, short=0/0/0
lat (usec): 750=0.01%, 1000=0.01%
lat (msec): 2=0.02%, 4=0.60%, 10=95.77%, 20=3.57%, 250=0.04%

/dev/mapper/grapefruit_ssd_1: (groupid=0, jobs=1): err= 0: pid=9309
write: io=1828.4MB, bw=31200KB/s, iops=7800 , runt= 60007msec
slat (usec): min=4 , max=1127 , avg=13.50, stdev=16.84
clat (msec): min=1 , max=132 , avg= 8.19, stdev= 2.54
lat (msec): min=1 , max=132 , avg= 8.20, stdev= 2.54
bw (KB/s) : min=24312, max=37792, per=16.60%, avg=31193.08, stdev=1208.43
cpu : usr=3.00%, sys=15.76%, ctx=158728, majf=0, minf=37903
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/468062/0, short=0/0/0

lat (msec): 2=0.01%, 4=0.19%, 10=96.04%, 20=3.73%, 250=0.04%
/dev/mapper/grapefruit_ssd_2: (groupid=0, jobs=1): err= 0: pid=9310
write: io=1830.9MB, bw=31237KB/s, iops=7809 , runt= 60017msec
slat (usec): min=4 , max=620 , avg=13.51, stdev=16.53
clat (usec): min=837 , max=133782 , avg=8175.49, stdev=2812.91
lat (usec): min=846 , max=133788 , avg=8189.30, stdev=2812.59
bw (KB/s) : min=20421, max=45008, per=16.64%, avg=31271.39, stdev=1898.01
cpu : usr=3.10%, sys=15.59%, ctx=158858, majf=0, minf=37621
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/468693/0, short=0/0/0
lat (usec): 1000=0.01%
lat (msec): 2=0.02%, 4=0.34%, 10=96.18%, 20=3.38%, 50=0.01%
lat (msec): 100=0.04%, 250=0.04%
/dev/mapper/grapefruit_ssd_3: (groupid=0, jobs=1): err= 0: pid=9311
write: io=1834.6MB, bw=31306KB/s, iops=7826 , runt= 60008msec
slat (usec): min=4 , max=683 , avg=13.43, stdev=16.62
clat (usec): min=747 , max=130950 , avg=8157.66, stdev=2858.53
lat (usec): min=833 , max=130977 , avg=8171.38, stdev=2858.18
bw (KB/s) : min=24296, max=33104, per=16.59%, avg=31172.13, stdev=1321.27
cpu : usr=3.06%, sys=15.56%, ctx=158550, majf=0, minf=38435
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/469650/0, short=0/0/0
lat (usec): 750=0.01%, 1000=0.01%
lat (msec): 2=0.01%, 4=0.45%, 10=96.14%, 20=3.34%, 250=0.05%

Run status group 0 (all jobs):
WRITE: io=11012MB, aggrb=187892KB/s, minb=31896KB/s, maxb=32491KB/s, mint=60004msec, maxt=60017msec

So, across all six volumes, I was able to obtain ~137,000 4K random read IOPS and ~50,000 4K random write IOPS from a single Linux host!

Here is my current SCST configuration (I haven't added the ESX initiators yet) if anyone is interested:
# Automatically generated by SCST Configurator v2.0.0.

# Non-key attributes
setup_id 0x0
max_tasklet_cmd 20
threads 24

HANDLER vdisk_blockio {
    DEVICE tangerine_ssd_1 {
        t10_dev_id "90cc7637 tangerine_ssd_1"
        threads_num 6
        usn 90cc7637

        filename /dev/disk/by-path/pci-0000:08:00.0-scsi-0:2:1:0

        # Non-key attributes
        threads_pool_type per_initiator
    }

    DEVICE tangerine_ssd_2 {
        t10_dev_id "c048d915 tangerine_ssd_2"
        threads_num 6
        usn c048d915

        filename /dev/disk/by-path/pci-0000:08:00.0-scsi-0:2:2:0

        # Non-key attributes
        threads_pool_type per_initiator
    }

    DEVICE tangerine_ssd_3 {
        t10_dev_id "d7e6d5db tangerine_ssd_3"
        threads_num 6
        usn d7e6d5db

        filename /dev/disk/by-path/pci-0000:08:00.0-scsi-0:2:3:0

        # Non-key attributes
        threads_pool_type per_initiator
    }
}

TARGET_DRIVER qla2x00t {
    TARGET 21:00:00:24:ff:00:bf:58 {
        rel_tgt_id 1
        enabled 1

        # Non-key attributes
        addr_method PERIPHERAL
        explicit_confirmation 0
        io_grouping_type auto

        GROUP peach {
            LUN 101 tangerine_ssd_1 {
                read_only 0
            }
            LUN 102 tangerine_ssd_2 {
                read_only 0
            }
            LUN 103 tangerine_ssd_3 {
                read_only 0
            }

            io_grouping_type 2

            # Non-key attributes
            addr_method PERIPHERAL
        }

        GROUP pineapple {
            LUN 101 tangerine_ssd_1 {
                read_only 0
            }
            LUN 102 tangerine_ssd_2 {
                read_only 0
            }
            LUN 103 tangerine_ssd_3 {
                read_only 0
            }

            io_grouping_type 3

            # Non-key attributes
            addr_method PERIPHERAL
        }

        GROUP raspberry {
            LUN 0 tangerine_ssd_1 {
                read_only 0
            }
            LUN 1 tangerine_ssd_2 {
                read_only 0
            }
            LUN 2 tangerine_ssd_3 {
                read_only 0
            }

            INITIATOR 21:00:00:1b:32:87:cf:00

            io_grouping_type 1

            # Non-key attributes
            addr_method PERIPHERAL
        }
    }

    TARGET 21:00:00:24:ff:01:1c:08 {
        rel_tgt_id 2
        enabled 1

        # Non-key attributes
        addr_method PERIPHERAL
        explicit_confirmation 0
        io_grouping_type auto

        GROUP peach {
            LUN 101 tangerine_ssd_1 {
                read_only 0
            }
            LUN 102 tangerine_ssd_2 {
                read_only 0
            }
            LUN 103 tangerine_ssd_3 {
                read_only 0
            }

            io_grouping_type 2

            # Non-key attributes
            addr_method PERIPHERAL
        }

        GROUP pineapple {
            LUN 101 tangerine_ssd_1 {
                read_only 0
            }
            LUN 102 tangerine_ssd_2 {
                read_only 0
            }
            LUN 103 tangerine_ssd_3 {
                read_only 0
            }

            io_grouping_type 3

            # Non-key attributes
            addr_method PERIPHERAL
        }

        GROUP raspberry {
            LUN 0 tangerine_ssd_1 {
                read_only 0
            }
            LUN 1 tangerine_ssd_2 {
                read_only 0
            }
            LUN 2 tangerine_ssd_3 {
                read_only 0
            }

            INITIATOR 21:00:00:1b:32:87:f8:00

            io_grouping_type 1

            # Non-key attributes
            addr_method PERIPHERAL
        }
    }
}
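The file above was generated with scstadmin (typically kept at /etc/scst.conf), so it can be saved and re-applied with a single command. Roughly the same thing can be done by hand through SCST 2.0's sysfs interface; the sketch below shows just one device, one target, and the raspberry group, and the exact mgmt syntax may differ slightly between SCST releases:

# save the running config / re-apply it later
scstadmin -write_config /etc/scst.conf
scstadmin -config /etc/scst.conf

# or, manually via sysfs (illustrative)
echo "add_device tangerine_ssd_1 filename=/dev/disk/by-path/pci-0000:08:00.0-scsi-0:2:1:0; threads_num=6" > /sys/kernel/scst_tgt/handlers/vdisk_blockio/mgmt
echo "create raspberry" > /sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:00:bf:58/ini_groups/mgmt
echo "add 21:00:00:1b:32:87:cf:00" > /sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:00:bf:58/ini_groups/raspberry/initiators/mgmt
echo "add tangerine_ssd_1 0" > /sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:00:bf:58/ini_groups/raspberry/luns/mgmt
echo 1 > /sys/kernel/scst_tgt/targets/qla2x00t/21:00:00:24:ff:00:bf:58/enabled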

Wednesday, May 25, 2011

LSI MegaRAID & SATA SSDs

So, continuing from my last post: we had such great success with our first stab at an SSD/FC disk array that we wanted more. This time we plan on using these arrays not just for replica datastores, but for OS / persistent data volumes as well.

We ordered three new 2U SuperMicro systems (configured/built by New Tech Solutions); one of these is for development, and the other two are for production.

I will detail the specs on these machines in my next article, but for this post, our development system looks something like this:
8 GB RAM (12 GB installed, with sparing mode); 24 logical CPUs (2 x Intel E5645, 6 cores each, with Hyper-Threading); vanilla 2.6.36.2 kernel; LSI MegaRAID SAS 9280-24i4e (FW: 2.120.43-1223); (3) CTFDDAC256MAG -> RAID5

I wanted to look at "raw" performance numbers using the MegaRAID adapter with the SSDs, and the different attributes for a RAID5 volume (strip size, read cache, write cache, etc.). We also purchased the FastPath license for these systems which supposedly promises better IOPS performance. I tested using the FIO tool; 4K IO size and either random-read or random-write.


Initially RAID5, 64KB stripe size, no read cache, no write cache:
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[245:1,245:2,245:3] WT NORA -a0

Adapter 0: Created VD 1

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

Random read test:
apricot ~ # fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [340M/0K /s] [85K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=14573
read : io=19,539MB, bw=326MB/s, iops=83,364, runt= 60001msec
slat (usec): min=3, max=140, avg= 5.04, stdev= 5.95
clat (usec): min=301, max=8,230, avg=761.32, stdev=147.70
bw (KB/s) : min=298136, max=346048, per=100.00%, avg=333468.71, stdev=10766.04
cpu : usr=16.11%, sys=51.23%, ctx=198046, majf=0, minf=4326
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=5001929/0, short=0/0
lat (usec): 500=5.22%, 750=37.49%, 1000=54.79%
lat (msec): 2=2.49%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=19,539MB, aggrb=326MB/s, minb=333MB/s, maxb=333MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=4993426/0, merge=0/0, ticks=3016385/0, in_queue=3015085, util=99.50%

Random write test:
apricot ~ # fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/55M /s] [0/14K iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=14578
write: io=3,213MB, bw=54,826KB/s, iops=13,706, runt= 60003msec
slat (usec): min=3, max=54, avg= 6.23, stdev= 3.77
clat (msec): min=1, max=14, avg= 4.66, stdev= 1.18
bw (KB/s) : min=53664, max=55848, per=100.04%, avg=54846.86, stdev=443.66
cpu : usr=4.26%, sys=10.83%, ctx=143019, majf=0, minf=3893
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/822432, short=0/0

lat (msec): 2=0.14%, 4=32.40%, 10=67.46%, 20=0.01%

Run status group 0 (all jobs):
WRITE: io=3,213MB, aggrb=54,826KB/s, minb=56,141KB/s, maxb=56,141KB/s, mint=60003msec, maxt=60003msec

Disk stats (read/write):
sdb: ios=2/820949, merge=0/0, ticks=0/3788757, in_queue=3788666, util=99.83%


Turn read-ahead on:
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp RA -L1 -a0

Set Read Policy to ReadAhead on Adapter 0, VD 1 (target id: 1) success

Exit Code: 0x00

Random read test:
apricot ~ # fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [274M/0K /s] [68K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=14598
read : io=15,889MB, bw=265MB/s, iops=67,792, runt= 60001msec
slat (usec): min=3, max=472, avg= 4.99, stdev= 7.50
clat (usec): min=305, max=8,823, avg=937.73, stdev=166.98
bw (KB/s) : min=252448, max=275832, per=100.01%, avg=271188.24, stdev=3913.90
cpu : usr=12.16%, sys=39.26%, ctx=163051, majf=0, minf=4288
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=4067634/0, short=0/0
lat (usec): 500=2.68%, 750=11.75%, 1000=42.42%
lat (msec): 2=43.14%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=15,889MB, aggrb=265MB/s, minb=271MB/s, maxb=271MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=4060446/0, merge=0/0, ticks=2946651/0, in_queue=2945570, util=96.54%


Turn adaptive read-ahead on:
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp ADRA -L1 -a0

Set Read Policy to Adaptive ReadAhead on Adapter 0, VD 1 (target id: 1) success

Exit Code: 0x00

Random read test:
apricot ~ # fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [274M/0K /s] [69K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=14601
read : io=15,938MB, bw=266MB/s, iops=67,999, runt= 60001msec
slat (usec): min=3, max=155, avg= 4.95, stdev= 7.43
clat (usec): min=197, max=9,330, avg=934.95, stdev=166.96
bw (KB/s) : min=254872, max=275872, per=100.01%, avg=272026.69, stdev=3426.24
cpu : usr=11.89%, sys=39.05%, ctx=165357, majf=0, minf=4268
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=4080027/0, short=0/0
lat (usec): 250=0.01%, 500=2.73%, 750=11.65%, 1000=43.68%
lat (msec): 2=41.93%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=15,938MB, aggrb=266MB/s, minb=272MB/s, maxb=272MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=4072842/0, merge=0/0, ticks=2951169/0, in_queue=2950138, util=96.58%


Turn write-back cache on:
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -L1 -a0

Set Write Policy to WriteBack on Adapter 0, VD 1 (target id: 1) success

Exit Code: 0x00

Random write test:
apricot ~ # fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/46M /s] [0/12K iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=14612
write: io=2,722MB, bw=46,451KB/s, iops=11,612, runt= 60005msec
slat (usec): min=3, max=97, avg= 6.94, stdev= 4.81
clat (usec): min=319, max=121K, avg=5502.21, stdev=2138.39
bw (KB/s) : min=34994, max=74472, per=100.06%, avg=46477.89, stdev=2958.57
cpu : usr=2.84%, sys=9.90%, ctx=94985, majf=0, minf=3830
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/696820, short=0/0
lat (usec): 500=0.01%, 750=0.04%, 1000=0.23%
lat (msec): 2=2.27%, 4=20.39%, 10=76.54%, 20=0.52%, 250=0.01%

Run status group 0 (all jobs):
WRITE: io=2,722MB, aggrb=46,450KB/s, minb=47,565KB/s, maxb=47,565KB/s, mint=60005msec, maxt=60005msec

Disk stats (read/write):
sdb: ios=6/695544, merge=0/0, ticks=1/3793599, in_queue=3793665, util=99.83%


Now enabling FastPath:
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -ELF -Applykey key XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX -a0

Successfully applied the Activation key. Please restart the system for the changes to take effect.

FW error description:
To complete the requested operation, please reboot the system.

Exit Code: 0x59

Reboot... FastPath enabled, NORA, WT:
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -ELF -ControllerFeatures -a0

Activated Advanced Software Options
---------------------------

Advanced Software Option : MegaRAID FastPath
Time Remaining : Unlimited

Advanced Software Option : MegaRAID RAID6
Time Remaining : Unlimited

Advanced Software Option : MegaRAID RAID5
Time Remaining : Unlimited


Re-host Information
--------------------

Needs Re-hosting : No

Exit Code: 0x00

Random read test:
apricot ~ # fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [344M/0K /s] [86K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=4001
read : io=19,689MB, bw=328MB/s, iops=84,005, runt= 60001msec
slat (usec): min=3, max=147, avg= 4.99, stdev= 5.82
clat (usec): min=306, max=7,969, avg=755.55, stdev=144.88
bw (KB/s) : min=292720, max=350320, per=100.00%, avg=336026.69, stdev=11191.43
cpu : usr=16.16%, sys=50.87%, ctx=204438, majf=0, minf=4366
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=5040440/0, short=0/0
lat (usec): 500=5.11%, 750=39.35%, 1000=53.38%
lat (msec): 2=2.16%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=19,689MB, aggrb=328MB/s, minb=336MB/s, maxb=336MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=5031387/0, merge=0/0, ticks=3048876/0, in_queue=3047559, util=99.53%

Random write test:
apricot ~ # fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/62M /s] [0/16K iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=4007
write: io=3,657MB, bw=62,412KB/s, iops=15,602, runt= 60005msec
slat (usec): min=3, max=59, avg= 6.34, stdev= 3.15
clat (usec): min=758, max=13,638, avg=4093.62, stdev=1405.59
bw (KB/s) : min=58856, max=64280, per=100.03%, avg=62427.56, stdev=832.47
cpu : usr=5.45%, sys=12.53%, ctx=205523, majf=0, minf=3926
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/936258, short=0/0
lat (usec): 1000=0.01%
lat (msec): 2=3.66%, 4=47.02%, 10=49.30%, 20=0.01%

Run status group 0 (all jobs):
WRITE: io=3,657MB, aggrb=62,411KB/s, minb=63,909KB/s, maxb=63,909KB/s, mint=60005msec, maxt=60005msec

Disk stats (read/write):
sdb: ios=5/934504, merge=0/0, ticks=0/3788023, in_queue=3787869, util=99.82%


Now with a RAID5 8KB stripe, NORA, WT, FastPath:
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -a0

Adapter 0: Deleted Virtual Drive-1(target id-1)

Exit Code: 0x00

apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[245:1,245:2,245:3] WT NORA -strpsz8 -a0

Adapter 0: Created VD 1

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

Random read test:
apricot ~ # fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [340M/0K /s] [85K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=4083
read : io=19,528MB, bw=325MB/s, iops=83,318, runt= 60001msec
slat (usec): min=3, max=176, avg= 5.06, stdev= 5.98
clat (usec): min=304, max=8,223, avg=761.79, stdev=149.12
bw (KB/s) : min=282992, max=345560, per=100.00%, avg=333283.97, stdev=11634.94
cpu : usr=15.29%, sys=51.53%, ctx=199083, majf=0, minf=4357
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=4999217/0, short=0/0
lat (usec): 500=5.26%, 750=37.42%, 1000=54.64%
lat (msec): 2=2.67%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=19,528MB, aggrb=325MB/s, minb=333MB/s, maxb=333MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=4990350/0, merge=0/0, ticks=3019107/0, in_queue=3017794, util=99.43%

Random write test:
apricot ~ # fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/63M /s] [0/16K iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=4088
write: io=3,664MB, bw=62,521KB/s, iops=15,630, runt= 60005msec
slat (usec): min=3, max=55, avg= 6.41, stdev= 3.16
clat (usec): min=680, max=12,283, avg=4086.43, stdev=1412.30
bw (KB/s) : min=60384, max=64584, per=100.04%, avg=62548.03, stdev=787.59
cpu : usr=5.25%, sys=12.94%, ctx=207933, majf=0, minf=3921
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/937890, short=0/0
lat (usec): 750=0.01%, 1000=0.01%
lat (msec): 2=4.11%, 4=46.66%, 10=49.22%, 20=0.01%

Run status group 0 (all jobs):
WRITE: io=3,664MB, aggrb=62,520KB/s, minb=64,021KB/s, maxb=64,021KB/s, mint=60005msec, maxt=60005msec

Disk stats (read/write):
sdb: ios=6/936182, merge=0/0, ticks=1/3788157, in_queue=3787968, util=99.83%


Now with a RAID5 512KB stripe, NORA, WT, FastPath:
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -a0

Adapter 0: Deleted Virtual Drive-1(target id-1)

Exit Code: 0x00

apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[245:1,245:2,245:3] WT NORA -strpsz512 -a0

Adapter 0: Created VD 1

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

Random read test:
apricot ~ # fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [342M/0K /s] [86K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=4128
read : io=19,622MB, bw=327MB/s, iops=83,720, runt= 60001msec
slat (usec): min=3, max=144, avg= 5.11, stdev= 5.90
clat (usec): min=309, max=8,598, avg=758.01, stdev=144.97
bw (KB/s) : min=305024, max=344864, per=99.99%, avg=334864.07, stdev=11241.95
cpu : usr=16.00%, sys=52.34%, ctx=199592, majf=0, minf=4302
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=5023310/0, short=0/0
lat (usec): 500=4.85%, 750=39.31%, 1000=53.38%
lat (msec): 2=2.45%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=19,622MB, aggrb=327MB/s, minb=335MB/s, maxb=335MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=5014359/0, merge=0/0, ticks=3055952/0, in_queue=3054696, util=99.57%

Random write test:
apricot ~ # fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/62M /s] [0/15K iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=4121
write: io=3,630MB, bw=61,948KB/s, iops=15,486, runt= 60004msec
slat (usec): min=3, max=41, avg= 6.24, stdev= 2.96
clat (usec): min=755, max=13,862, avg=4124.50, stdev=1337.97
bw (KB/s) : min=60056, max=64792, per=100.05%, avg=61978.19, stdev=876.92
cpu : usr=4.81%, sys=12.06%, ctx=195535, majf=0, minf=3910
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/929279, short=0/0
lat (usec): 1000=0.01%
lat (msec): 2=2.60%, 4=47.48%, 10=49.91%, 20=0.01%

Run status group 0 (all jobs):
WRITE: io=3,630MB, aggrb=61,947KB/s, minb=63,434KB/s, maxb=63,434KB/s, mint=60004msec, maxt=60004msec

Disk stats (read/write):
sdb: ios=6/927586, merge=0/0, ticks=0/3792792, in_queue=3792684, util=99.83%


Now with a RAID5 64KB stripe, NORA, WT, FastPath, and setting the "Cached" option (instead of default "Direct" mode -- not sure exactly what this means?):
apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -a0

Adapter 0: Deleted Virtual Drive-1(target id-1)

Exit Code: 0x00

apricot ~ # /opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[245:1,245:2,245:3] WT NORA Cached -a0

Adapter 0: Created VD 1

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

Random read test:
apricot ~ # fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [234M/0K /s] [59K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=4153
read : io=13,625MB, bw=227MB/s, iops=58,131, runt= 60001msec
slat (usec): min=3, max=133, avg= 4.78, stdev= 4.93
clat (usec): min=323, max=8,419, avg=1094.91, stdev=160.15
bw (KB/s) : min=216768, max=237352, per=100.00%, avg=232534.52, stdev=3063.09
cpu : usr=12.05%, sys=35.69%, ctx=200778, majf=0, minf=4351
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=3487946/0, short=0/0
lat (usec): 500=0.01%, 750=2.97%, 1000=21.33%
lat (msec): 2=75.68%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=13,625MB, aggrb=227MB/s, minb=233MB/s, maxb=233MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=3481717/0, merge=0/0, ticks=3351413/0, in_queue=3350535, util=99.55%

Random write test:
apricot ~ # fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/63M /s] [0/16K iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=4158
write: io=3,650MB, bw=62,295KB/s, iops=15,573, runt= 60004msec
slat (usec): min=4, max=56, avg= 6.43, stdev= 3.07
clat (usec): min=691, max=12,874, avg=4101.26, stdev=1420.81
bw (KB/s) : min=60544, max=64520, per=100.03%, avg=62313.63, stdev=734.21
cpu : usr=5.27%, sys=12.74%, ctx=206551, majf=0, minf=3936
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/934486, short=0/0
lat (usec): 750=0.01%, 1000=0.01%
lat (msec): 2=3.90%, 4=46.66%, 10=49.42%, 20=0.01%

Run status group 0 (all jobs):
WRITE: io=3,650MB, aggrb=62,294KB/s, minb=63,789KB/s, maxb=63,789KB/s, mint=60004msec, maxt=60004msec

Disk stats (read/write):
sdb: ios=8/932805, merge=0/0, ticks=2/3786854, in_queue=3786689, util=99.83%


Conclusion
Here is a table with the results summarized (all configurations are (3) SATA SSDs + RAID5):

Setup                                   | Random Read (4K IOPS) | Random Write (4K IOPS)
WT, NORA, 64K Strip, Direct             | 85K                   | 14K
WT, RA, 64K Strip, Direct               | 68K                   | 14K
WT, ADRA, 64K Strip, Direct             | 69K                   | 14K
WB, NORA, 64K Strip, Direct             | 85K                   | 12K
WT, NORA, 64K Strip, Direct, FastPath   | 86K                   | 16K
WT, NORA, 8K Strip, Direct, FastPath    | 85K                   | 16K
WT, NORA, 512K Strip, Direct, FastPath  | 86K                   | 15K
WT, NORA, 64K Strip, Cached, FastPath   | 59K                   | 16K


So, from the numbers above we can definitely see that disabling both the read cache (read-ahead or adaptive read-ahead) and the write cache (write-back) gives better results. FastPath didn't seem to make much of a difference -- maybe 1K or 2K IOPS? If I tested multiple times and averaged the results, it would probably come out about the same anyway.

Strip size doesn't seem to have much effect on performance either -- possibly it would with an IO size other than 4K.

Look for another article soon on using these new (24) slot systems with SCST...

Tuesday, March 1, 2011

Accelerating VDI Using SCST and SSDs

So, a few weeks ago we attended a VMware-sponsored conference that had a number of sessions on desktop virtualization (VDI) using VMware View. One of the speakers (maybe a couple) mentioned that with View 4.5 you have the option of specifying a separate datastore for your linked-clone replicas (created from the parent VMs). They recommended using SSDs for this datastore to increase performance (with linked clones, the majority of the reads will still come from these parent VM snaps).

We already knew that the disk array (SAN) vendor we used for our VDI infrastructure supported SSDs, so we figured we'd get a quote for some of these bad boys... well, it came back a lot higher than we anticipated (about $50K for 4 drives + 1 enclosure).

Our solution: Build an SSD disk array (for our Fibre Channel SAN) using SCST (open source SCSI target subsystem for Linux). The SCST project seemed pretty solid with a good community, but I wanted to try it out for myself before ordering SSDs and other hardware.

I set up an old Dell PowerEdge 6950, which had some QLogic 4Gb FC HBAs and 15K SAS disks in it, with Gentoo Linux. The SCST project is very well documented with lots of examples, so the whole setup was a breeze. I played around with the different device handlers a bit, but for our planned setup (a VMware ESX VMFS volume), the BLOCKIO mode seemed to be what we wanted. I played/tested quite a bit over the next couple of weeks with a volume in BLOCKIO mode (15K SAS RAID5 on a PERC as the back storage) and different Fibre Channel initiators. I was sold -- now I had to figure out our "production" solution.


Our Solution
We decided to re-purpose an existing server that still had a good warranty left and a decent number of hot-swappable drive slots:

A Dell PowerEdge R710 (pulled from an ESX cluster):
  • (2) Intel Xeon X5570 @ 2.93 GHz (4 cores each)
  • 24 GB Memory
  • Intel Gigabit Quad-Port NIC
  • (2) QLogic QLE2460 HBAs (4Gbit Fibre Channel)
  • Basic Dell PERC RAID Controller
  • (8) 2.5” Bays

Next we wanted to update the RAID controller in the unit and get some SSDs; the majority of the servers we buy are hooked to a Fibre Channel SAN (for boot & data volumes), so the existing PERC controller left a little to be desired. We decided on the PERC H700 w/ 1GB NV Cache.

We then had to decide on some SSDs. There are a few different options -- expensive "enterprise" SAS 6Gbps SLC drives, consumer-grade SATA 3/6Gbps MLC drives, and some other stuff in between. We actually didn't find any vendors that sold the enterprise SAS SSDs individually (they only seem to be available re-branded via Dell, HP, etc.); we looked at Dell and they were in the $2K - $3K range for ~140GB (can't remember the exact size) each.

After reading some different reports, reviews, etc. we decided on the RealSSD (Micron) line of drives -- specifically the Crucial (Micron’s consumer division) RealSSD C300 CTFDDAC256MAG-1G1 2.5" 256GB SATA III MLC SSDs.

Great -- we put a requisition in and a few weeks later (the SSDs had to be bid out) we had some new toys.


[Photos from "Marc's Adventures in IT Land"]

Dell wouldn’t sell us the hot-swap drive trays by themselves, so we had to buy some cheap SATA disks so we could get the carriers. We purchased eight drive carriers (with disks) and eight SSDs -- we were only going to use 6 in our array, but wanted to have a couple spares ready.

Once we had the new RAID controller installed, I went through and updated the BIOS and PERC firmware, tweaked the BIOS settings and HBA settings (namely just disabling the adapter BIOS as we won’t be booting from the SAN at all). The R710 has (8) 2.5” drive bays; we decided to use (2) of these bays for a RAID1 array (for the boot/system Linux volume) with 73GB 10K SAS disks.

I hooked up each of the SSDs to a stand-alone Windows workstation and updated the firmware to the latest and greatest.


[Photo from "Marc's Adventures in IT Land"]



Linux Install/Setup
For the system Linux OS, I decided to use Gentoo Linux. We are a RHEL shop, but SCST appears to benefit greatly from a newer kernel. I've used Gentoo Linux in a production environment before, and my feeling on the whole stability argument for the "enterprise" Linux distributions is that it's tossed out the window as soon as you patch one of their kernels or use a newer vanilla kernel -- sure, you still get the user-land stuff, but the main function of this server is going to live in the kernel anyway.

I did use the Hardened Gentoo (amd64) profile and the "vanilla-sources" kernel -- not necessarily for the security features, but for the (supposed) stability of it. Most people use "hardened-sources" with Hardened Gentoo, but I figured having "clean" kernel sources (instead of vendor patches) would make integrating SCST easier. I got the OS installed and completely updated with emerge. When installing Gentoo, I obviously chose a custom kernel (not genkernel), set with some standard options that I like plus what the branches/2.0.0.x/scst/README document from the SCST project recommends (a rough .config sketch follows this list); namely:
  • Disable all “kernel hacking” features.
  • Use the CFQ IO scheduler.
  • Turn the kernel preemption off (server).
  • Enable the MCE features.
  • I didn’t configure my HBA driver at this point as I knew that would need to be patched when setting up SCST.

I also installed a few useful utilities:
  • sys-apps/hdparm
  • sys-apps/pciutils
  • sys-fs/lsscsi
  • app-admin/mcelog
  • app-admin/sysstat
  • sys-block/fio
  • dev-vcs/subversion

Plus the MegaCLI tool from LSI for managing the RAID controller (a condensed install sketch follows this list):
  • Grab the latest Linux package from LSI’s website.
  • Need the RPM tool in Gentoo: emerge app-arch/rpm
  • Extract the MegaCLI package -- install the “MegaCli” and “Lib_Utils” RPMs; don’t forget a ‘--nodeps’.
  • I didn’t need any other dependencies (check yours using ldd).
  • No more BIOS RAID management: /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL
  • A very useful cheat sheet: http://tools.rapidsoft.de/perc/perc-cheat-sheet.html
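Condensed, the MegaCLI install looks something like this (the RPM file names vary by MegaCLI version, hence the globs):

emerge app-arch/rpm
cd /path/to/the/extracted/LSI/download    # wherever you unpacked the MegaCLI package
rpm -ivh --nodeps Lib_Utils-*.rpm MegaCli-*.rpm
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL    # sanity check that the controller is visible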


Back Storage Performance
Before setting up SCST, I wanted to do some quick and dirty performance/throughput tests on the back-end storage (the SSD array). This Dell PERC H700 controller has a feature called "Cut-through IO" (CTIO) that supposedly increases throughput for SSD arrays. Per the documentation, it's enabled on an LD (logical drive) by disabling read-ahead and enabling write-through cache (WT + NORA). I went ahead and created a RAID5 array with my six SSD drives:

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[:1,:2,:3,:5,:6,:7] WT NORA -a0

Adapter 0: Created VD 1

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

Presto! My new virtual disk is available:

[344694.250054] sd 0:2:1:0: [sdb] 2494300160 512-byte logical blocks: (1.27 TB/1.16 TiB)
[344694.250063] sd 0:2:1:0: Attached scsi generic sg3 type 0
[344694.250101] sd 0:2:1:0: [sdb] Write Protect is off
[344694.250104] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 10 08
[344694.250137] sd 0:2:1:0: [sdb] Write cache: disabled, read cache: disabled, supports DPO and FUA
[344694.250643] sdb: unknown partition table
[344694.250813] sd 0:2:1:0: [sdb] Attached SCSI disk

I waited for the RAID initialization process to finish (took about 23 minutes) before trying out a few tests; I’m not sure how much that affects performance. For the first test, I wanted to check the random access time of the array. I used a utility called “seeker” from this page: http://www.linuxinsight.com/how_fast_is_your_disk.html

./seeker /dev/sdb
Seeker v2.0, 2007-01-15, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sdb [1217920MB], wait 30 seconds..............................
Results: 6817 seeks/second, 0.15 ms random access time

So, we can definitely see one of the SSD perks -- very low random access times. Compared to our system volume (RAID1 / 10K) below, we can see that not having mechanical parts makes a big difference.

./seeker /dev/sda
Seeker v2.0, 2007-01-15, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [69376MB], wait 30 seconds..............................
Results: 157 seeks/second, 6.34 ms random access time

I see lots of people also using the ‘hdparm’ utility, so I figured I’d throw that in too:

hdparm -Tt /dev/sdb

/dev/sdb:
Timing cached reads: 20002 MB in 2.00 seconds = 10011.99 MB/sec
Timing buffered disk reads: 1990 MB in 3.00 seconds = 663.24 MB/sec

I wanted to test the sequential IO throughput of the volume using the 'dd' tool. I read about this a little bit on the 'net, and everyone seems to agree that the Linux buffer/page cache can warp the numbers a bit. I haven't educated myself enough on that topic, but the general consensus seems to be to push a lot more data than you have RAM (24 GB in this machine) to get around it, so I did 60 GB:

dd of=/dev/null if=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 88.797 s, 676 MB/s

676 megabytes per second seems pretty nice. Lately I've been thinking of numbers in "Gbps" (gigabits per second), so that number is 5.28125 gigabits per second (Gbps). Let's check out the write speed:

dd if=/dev/zero of=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 353.116 s, 170 MB/s

Ouch, that seems a little slower than I had anticipated. I understand that with the MLC SSD drives, the write speed is generally slower, but that seems a lot slower. Now, on the PERC, I had disabled the read/write cache for this volume (per Dell’s recommendation for Cut-through IO mode / SSDs), but this is a RAID5 volume and these are SATA SSDs, not SAS (“enterprise”) SSDs, so I turned on the write cache (write back) to see what happens:

/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -L1 -a0

Set Write Policy to WriteBack on Adapter 0, VD 1 (target id: 1) success

Exit Code: 0x00

dd if=/dev/zero of=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 92.6398 s, 648 MB/s

Well, that number is quite a bit peppier; now this is going to bug me -- what's up with the drastic slow-down on writes (without cache)? I wondered if it was related to the RAID5 parity calculation, the stripe size, or something else about the writes. I was curious, so I destroyed that RAID5 logical disk, created a new volume using the six SSDs in RAID0, and tested again.
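A rough sketch of that step (not the exact commands; same drive slots and cache policy as the original CfgLDAdd, just RAID0 instead of RAID5):

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -a0
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R0[:1,:2,:3,:5,:6,:7] WT NORA -a0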

dd if=/dev/zero of=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 84.3099 s, 712 MB/s

Wow (and that's with the write cache off). So it definitely seems to be a RAID5 thing; just to be thorough, I checked the read speed again:

dd of=/dev/null if=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 77.2353 s, 777 MB/s

Ugh. I’m not going to spend too much time trying to figure out why my write speed with RAID5 is so much slower -- we’re really just interested in reads since this array is only going to be used with VMware View as a read-only replica datastore, however, I did enable write back on the controller as I don’t want to be waiting forever on the parent VM -> replica clone operations.

We had originally read a Tom’s Hardware article about high-end SSD performance (http://www.tomshardware.com/reviews/x25-e-ssd-performance,2365.html) which gave us some inspiration for this project. They used (16) SSDs and (2) Adaptec RAID controllers with eight drives on each controller in RAID0 arrays and then used software (OS) RAID0 to stripe the two volumes as one logical disk; they were able to obtain 2.2 GB/sec (gigabytes). I was curious as to what our back-end storage was capable of (with six SSDs).

I created a RAID0 array with the six SSDs, a 1MB (max) stripe size, write-through, and no read-ahead. I used the 'fio' tool to push a bunch of data through to see what our max throughput would be (similar to Tom's Hardware: http://www.tomshardware.com/reviews/x25-e-ssd-performance,2365-8.html):

fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [1,663M/0K /s] [406/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=22439
read : io=97,016MB, bw=1,616MB/s, iops=403, runt= 60044msec
slat (usec): min=144, max=5,282, avg=2471.42, stdev=2058.81
clat (msec): min=42, max=197, avg=155.79, stdev=12.62
bw (KB/s) : min=1185469, max=1671168, per=99.91%, avg=1653112.44, stdev=43452.90
cpu : usr=0.13%, sys=7.54%, ctx=13547, majf=0, minf=131099
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=24254/0, short=0/0

lat (msec): 50=0.01%, 100=0.09%, 250=99.90%

Run status group 0 (all jobs):
READ: io=97,016MB, aggrb=1,616MB/s, minb=1,655MB/s, maxb=1,655MB/s, mint=60044msec, maxt=60044msec

Disk stats (read/write):
sdb: ios=435762/0, merge=0/0, ticks=8518093/0, in_queue=8520113, util=99.85%

fio --bs=4m --direct=1 --rw=write --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/1,323M /s] [0/323 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=22445
write: io=77,328MB, bw=1,287MB/s, iops=321, runt= 60105msec
slat (usec): min=102, max=9,777, avg=3101.45, stdev=2663.09
clat (msec): min=98, max=323, avg=195.62, stdev=42.52
bw (KB/s) : min=901120, max=1351761, per=99.93%, avg=1316562.66, stdev=39517.45
cpu : usr=0.11%, sys=4.26%, ctx=10872, majf=0, minf=27
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/19332, short=0/0

lat (msec): 100=0.01%, 250=99.91%, 500=0.09%

Run status group 0 (all jobs):
WRITE: io=77,328MB, aggrb=1,287MB/s, minb=1,317MB/s, maxb=1,317MB/s, mint=60105msec, maxt=60105msec

Disk stats (read/write):
sdb: ios=6/347909, merge=0/0, ticks=0/8566762, in_queue=8571115, util=98.44%

I think those numbers look pretty nice for six (6) SSDs and one RAID controller: reads @ 1,655MB/s & writes @ 1,317MB/s (1.62 GB / sec, 1.29 GB / sec -- bytes, not bits).

Alright, for the "real" setup, RAID0 is obviously not an option. RAID10 generally offers the best performance, but we didn't want to give up that much capacity, so RAID5 is our BFF. I went with RAID5, a 64KB stripe size (the adapter default), write back, and no read ahead. I looked for guidance on the optimal stripe size for VMware VMFS, but opinions didn't appear to be one-sided (bigger vs. smaller), so I stuck with the default. I ran the fio read/write throughput tests one more time with the final array setup:

fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [1,659M/0K /s] [405/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=22532
read : io=96,752MB, bw=1,612MB/s, iops=403, runt= 60019msec
slat (usec): min=144, max=115K, avg=2478.14, stdev=2177.20
clat (msec): min=16, max=277, avg=156.15, stdev= 7.52
bw (KB/s) : min=1171456, max=1690412, per=99.87%, avg=1648576.39, stdev=57546.44
cpu : usr=0.09%, sys=7.72%, ctx=13584, majf=0, minf=131100
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=24188/0, short=0/0

lat (msec): 20=0.01%, 50=0.05%, 100=0.08%, 250=99.60%, 500=0.26%

Run status group 0 (all jobs):
READ: io=96,752MB, aggrb=1,612MB/s, minb=1,651MB/s, maxb=1,651MB/s, mint=60019msec, maxt=60019msec

Disk stats (read/write):
sdb: ios=434574/0, merge=0/0, ticks=8512402/0, in_queue=8513463, util=99.85%

fio --bs=4m --direct=1 --rw=write --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/1,090M /s] [0/266 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=22583
write: io=64,368MB, bw=1,072MB/s, iops=268, runt= 60029msec
slat (usec): min=101, max=106K, avg=3726.06, stdev=3360.91
clat (msec): min=27, max=341, avg=234.77, stdev=13.74
bw (KB/s) : min=1062834, max=1338135, per=99.75%, avg=1095312.87, stdev=24085.25
cpu : usr=0.14%, sys=3.45%, ctx=9081, majf=0, minf=26
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/16092, short=0/0

lat (msec): 50=0.04%, 100=0.08%, 250=99.26%, 500=0.62%

Run status group 0 (all jobs):
WRITE: io=64,368MB, aggrb=1,072MB/s, minb=1,098MB/s, maxb=1,098MB/s, mint=60029msec, maxt=60029msec

Disk stats (read/write):
sdb: ios=11/288540, merge=0/0, ticks=1/8542345, in_queue=8544238, util=98.44%

So the writes are a bit slower than on the RAID0 array -- no big deal for us. The read rate stayed nice and juicy (~1.6 GB/sec). I'm satisfied, so now it's time to configure SCST.


SCST Setup
I started by grabbing the whole SCST project:

cd /usr/src
svn co https://scst.svn.sourceforge.net/svnroot/scst

By default, the SVN version is set up for debugging/development, not performance, so I used the following make target to switch it to performance mode:

cd /usr/src/scst/branches/2.0.0.x
make debug2perf

Next, we need to apply some of the kernel patches that are included with the SCST project. We're running a 2.6.36 kernel on this Gentoo install, so we really only need one patch file -- the other "enhancements" are already included in kernels this new.

cd /usr/src
ln -s linux-2.6.36.2 linux-2.6.36
patch -p0 < /usr/src/scst/branches/2.0.0.x/scst/kernel/scst_exec_req_fifo-2.6.36.patch

Replace the kernel-bundled QLogic FC driver with the SCST-modified QLogic FC driver, which adds target mode support:

mv /usr/src/linux-2.6.36.2/drivers/scsi/qla2xxx /usr/src/linux-2.6.36.2/drivers/scsi/qla2xxx.orig
ln -s /usr/src/scst/branches/2.0.0.x/qla2x00t /usr/src/linux-2.6.36.2/drivers/scsi/qla2xxx

I then built a new kernel with the QLA2XXX driver (as a module) and selected the “QLogic 2xxx target mode support” option, installed it, and rebooted. After the system came back up, I installed the latest QLogic firmware image (ftp://ftp.qlogic.com/outgoing/linux/firmware/) for my adapters:

mkdir /lib/firmware
cd /lib/firmware
wget ftp://ftp.qlogic.com/outgoing/linux/firmware/ql2400_fw.bin
modprobe -r qla2xxx
modprobe qla2xxx
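
To confirm the driver actually loaded the new firmware image, and to grab the port WWPNs (needed later for zoning and the SCST config), something like this does the trick:

dmesg | grep -i qla2xxx
cat /sys/class/fc_host/host*/port_name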

Build and install the SCST core; debugging (which carries a performance hit) is enabled by default -- you might want to leave it on while testing -- but we already disabled it above with 'make debug2perf':

cd /usr/src/scst/branches/2.0.0.x/scst/src
make all
make install

Build and install the QLogic target driver:

cd /usr/src/scst/branches/2.0.0.x/qla2x00t/qla2x00-target
make
make install

Build and install the scstadmin utility and start-up scripts (the ‘make install’ puts non-Gentoo init.d scripts in place by default):

cd /usr/src/scst/branches/2.0.0.x/scstadmin
make
make install
rm /etc/init.d/qla2x00t
rm /etc/init.d/scst
install -m 755 init.d/qla2x00t.gentoo /etc/init.d/qla2x00t
install -m 755 init.d/scst.gentoo /etc/init.d/scst
rc-update add qla2x00t default
rc-update add scst default
scstadmin -write_config /etc/scst.conf
/etc/init.d/qla2x00t start
/etc/init.d/scst start

Now it's time to configure SCST -- the project is very well documented (see branches/2.0.0.x/scst/README), so I won't go into all of the different configuration options, only what we decided on for our setup. First, we created a new virtual disk using BLOCKIO mode (vdisk_blockio):

scstadmin -open_dev vdi_ssd_vmfs_1 -handler vdisk_blockio -attributes filename=/dev/sdb,blocksize=512,nv_cache=0,read_only=0,removable=0
scstadmin -nonkey -write_config /etc/scst.conf
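
As a quick sanity check that the device registered, SCST exposes its state under sysfs; the attribute names below are from memory of the 2.0 sysfs layout, so treat them as illustrative:

ls /sys/kernel/scst_tgt/devices/
cat /sys/kernel/scst_tgt/devices/vdi_ssd_vmfs_1/filename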

A little more tweaking; I set threads_num to 4 initially:

scstadmin -set_dev_attr vdi_ssd_vmfs_1 -attributes threads_pool_type=per_initiator,threads_num=4
scstadmin -nonkey -write_config /etc/scst.conf

Now for the security groups and target LUN setup. In our setup, each ESX host has (2) Fibre Channel HBAs and we have (2) fabrics (non-stacked, independent switches). Our disk array box has (2) HBAs, one going to each fabric, so each HBA on the disk array (SCST target) will "see" (1) initiator for each ESX host. The SCST documentation states that the "io_grouping_type" attribute can affect performance greatly, so I decided to put each initiator in its own security group initially; that way I can control the I/O grouping with explicit group numbers and experiment a bit.

scstadmin -add_group vdiesxtemp1 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6
scstadmin -add_group vdiesxtemp2 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6
scstadmin -add_group vdiesxtemp3 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6
scstadmin -add_group vdiesxtemp1 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de
scstadmin -add_group vdiesxtemp2 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de
scstadmin -add_group vdiesxtemp3 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de
scstadmin -add_init 21:00:00:1b:32:17:00:f6 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp1
scstadmin -add_init 21:00:00:1b:32:17:d6:f7 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp2
scstadmin -add_init 21:00:00:1b:32:06:0f:a1 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp3
scstadmin -add_init 21:01:00:1b:32:37:00:f6 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp1
scstadmin -add_init 21:01:00:1b:32:37:d6:f7 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp2
scstadmin -add_init 21:01:00:1b:32:26:0f:a1 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp3
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp1 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp2 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp3 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp1 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp2 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp3 -device vdi_ssd_vmfs_1
scstadmin -set_grp_attr vdiesxtemp1 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -attributes io_grouping_type=1
scstadmin -set_grp_attr vdiesxtemp2 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -attributes io_grouping_type=2
scstadmin -set_grp_attr vdiesxtemp3 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -attributes io_grouping_type=3
scstadmin -set_grp_attr vdiesxtemp1 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -attributes io_grouping_type=1
scstadmin -set_grp_attr vdiesxtemp2 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -attributes io_grouping_type=2
scstadmin -set_grp_attr vdiesxtemp3 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -attributes io_grouping_type=3
scstadmin -nonkey -write_config /etc/scst.conf

The SCST documentation says to always start the LUN numbering at 0 for a target to be recognized; however, our ESX hosts also see another disk array on the same SAN with VMFS datastores. That other array contains our boot disk, which is LUN 0 -- by default the QLogic HBA BIOS looks for LUN 0 as the boot volume (you can set specific target(s) in the HBA BIOS settings). VMware ESX has "sparse LUN support" enabled by default, so the LUN numbers shouldn't have to be sequential (it scans 0 to 255). We used LUN number 201 for the SSD volume and didn't have any issues -- maybe other initiator types (Linux / Windows) need to start at 0?

In the scstadmin commands above, I used the '-set_grp_attr' argument; it works, but it is not documented in the scstadmin help output. Presumably that will be fixed in a future version.

A few more tweaks related to read ahead and kernel settings. I set the read-ahead value to 1024 KB (2048 x 512-byte sectors = 1,048,576 bytes = 1024 KB; the default was 128 KB), and I seem to recall others saying max_sectors_kb set to 64 was nice (maybe not for us). I added the following to /etc/conf.d/local.start (SCST needs to be restarted after modifying the read-ahead value):

/etc/init.d/qla2x00t stop > /dev/null 2>&1
/etc/init.d/scst stop > /dev/null 2>&1
echo 64 > /sys/block/sdb/queue/max_sectors_kb
blockdev --setra 2048 /dev/sdb
/etc/init.d/qla2x00t start > /dev/null 2>&1
/etc/init.d/scst start > /dev/null 2>&1
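
A quick way to verify the values stuck (remember that blockdev reports read-ahead in 512-byte sectors, so 2048 here means 1024 KB):

blockdev --getra /dev/sdb
cat /sys/block/sdb/queue/max_sectors_kb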

Next, per the scst/README, I modified some settings for CPU / IRQ affinity. Our machine has 8 physical cores, which Linux sees as 16 logical CPUs; we started with the affinity mask fffc, which keeps the IRQs and the SCST threads off CPUs 0-1 and spreads them across CPUs 2-15. I added these to /etc/conf.d/local.start:

for i in /proc/irq/*; do if [ "$i" == "/proc/irq/default_smp_affinity" ]; then echo fffc > $i; else echo fffc > $i/smp_affinity; fi; done > /dev/null 2>&1
for i in scst_uid scstd{0..15} scst_initd scsi_tm scst_mgmtd; do taskset -p fffc `pidof $i`; done > /dev/null 2>&1
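
To see which IRQs belong to the QLogic HBAs and to confirm the masks and thread bindings took effect (IRQ numbers will obviously differ per system):

grep -i qla /proc/interrupts
cat /proc/irq/*/smp_affinity | sort | uniq -c
taskset -p `pidof scst_uid`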

Finally, we enable the targets and configure the zoning on the Fibre Channel switches:

scstadmin -enable_target 21:00:00:1b:32:82:91:f6 -driver qla2x00t
scstadmin -enable_target 21:00:00:1b:32:8a:50:de -driver qla2x00t
scstadmin -nonkey -write_config /etc/scst.conf
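
After that final -write_config, /etc/scst.conf should end up looking roughly like the snippet below (abbreviated to a single target and group, and reconstructed from memory of the SCST 2.0 config format rather than copied from our box, so treat it as illustrative):

HANDLER vdisk_blockio {
        DEVICE vdi_ssd_vmfs_1 {
                filename /dev/sdb
                threads_num 4
                threads_pool_type per_initiator
        }
}

TARGET_DRIVER qla2x00t {
        TARGET 21:00:00:1b:32:82:91:f6 {
                enabled 1

                GROUP vdiesxtemp1 {
                        io_grouping_type 1
                        LUN 201 vdi_ssd_vmfs_1
                        INITIATOR 21:00:00:1b:32:17:00:f6
                }
        }
}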


VMware ESX Performance
Now for a little SCST / VMware ESX 4.1 performance evaluation. I didn't want to go into a full-bore setup for testing max IOPS / throughput like other articles (e.g., http://blogs.vmware.com/performance/2008/05/100000-io-opera.html), but I did want to run a couple of simple tests just to see what we're working with. For a quick-and-dirty test setup, I used a single ESX 4.1 host, created the VMFS file system on our SSD volume (1 MB block size), and then created a new VM: 2 CPUs, 4 GB memory, Windows Server 2008 R2, and a second 50 GB virtual disk. For the 50 GB virtual disk, be sure to check the "fault tolerance" option -- this provisions it thick / eager-zeroed. Without doing this (i.e., with lazy zeroing), Iometer will report some crazy numbers (like ~2 GB/sec reads, with no IO visible on the SCST disk array); that makes sense, I suppose, since ESX knows no blocks have been written yet and is smart enough to not even read from the block device.

Anyway, I presented the 50 GB test virtual disk to Iometer as a physical drive (no partition / no NTFS file system). On the ESX host, I used "Round Robin (VMware)" as the path selection policy. Once the Windows guest OS was installed and updated, I installed Iometer 2006.07.27. The constants I used for the Iometer tests were: (2) workers, (64) outstanding IOs, and our 50 GB "physical drive" selected.

In this Iometer test, I did a 4 MB transfer request size, 100% read, and 100% sequential: ~ 760 MB / sec

I confirmed this number on the SCST disk array server using iostat:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.00 0.00 0.23 0.00 0.00 99.77

Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 4353.00 760.54 0.00 760 0
sda 0.00 0.00 0.00 0 0

I also checked on the ESX service console using esxtop:

4:28:56pm up 28 days  3:16, 142 worlds; CPU load average: 0.03, 0.03, 0.01

ADAPTR PATH NPTH CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cm
vmhba0 - 4 374.41 366.78 7.63 366.79 0.05 18.68 40.62 59.3
vmhba1 - 4 366.97 366.78 0.19 366.78 0.00 62.27 44.86 107.1
vmhba2 - 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba3 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba32 - 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba34 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba35 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0

Above we can see IO flowing across both of our QLogic Fibre Channel HBAs (round-robin path policy); what happens if we cut ESX down to just one HBA (fixed path policy)? Here is another esxtop:

4:31:52pm up 28 days  3:19, 142 worlds; CPU load average: 0.04, 0.04, 0.02

ADAPTR PATH NPTH CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cm
vmhba0 - 4 390.43 388.15 2.29 375.18 0.00 70.95 89.59 160.5
vmhba1 - 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba2 - 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba3 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba32 - 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba34 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba35 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0

So IO drops by about half with only one HBA -- iostat on the SCST array confirms it, and so does Iometer in the VM (~380 MB/sec). The round-robin path policy clearly pays off, and this single VM appears to be maxing out both of our 4 Gbps Fibre Channel HBAs.

I realize that the tests I ran here, and in the back-end storage section above, were focused on throughput rather than maximum small-block IOPS. I need to educate myself a little more on disk IO performance testing and will run the tests again with a focus on high IOPS.


Backup & Recovery
For the actual disk array "server" backup, I just used a simple tar + ssh + public key authentication combo to copy a tarball of the local system files over to another server. This way I have a little more control over when the backup occurs -- we all know our backup admins never purposely let those nightly backup jobs run long, but let's face it, it happens. The system configuration on our SSD disk array is unlikely to change much, so I just have a cron job that runs weekly:

tar cvzfp - /bin /boot /dev /etc /home /lib /lib32 /lib64 /mnt /opt /root /sbin /tmp /usr /var | ssh user@host "cat > hostname_`date +%Y%m%d`.tar.gz"
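
The weekly schedule is just an ordinary crontab entry; something along these lines, where the script path is hypothetical and simply wraps the tar/ssh pipeline above:

0 3 * * 0 /root/bin/backup-system.sh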

I also created a simple shell script that checks our logical drives (system RAID volume and SSD RAID volume) to see if a disk failed using the MegaCli utility (runs hourly via cron).
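
That check is nothing fancy; a minimal sketch of the idea (same MegaCli64 path as earlier, alerting on any logical drive that isn't in the Optimal state -- the mail recipient is obviously a placeholder) would be something like:

#!/bin/bash
# Flag any logical drive whose state is not "Optimal".
MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64
BAD=$($MEGACLI -LDInfo -Lall -aALL | grep "^State" | grep -v "Optimal")
if [ -n "$BAD" ]; then
        echo "RAID problem on $(hostname): $BAD" | mail -s "RAID alert" admin@example.com
fi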

What happens if our cool, new SSD disk array dies? This isn't a dual-controller, HA-capable storage device -- it is a higher-end server with dual power supplies, multiple HBAs, RAID, memory sparing, etc., but what if the kernel panics? I will say that VMware ESX behaves surprisingly well in this situation. Go ahead and try it: reboot the disk array. When I tried this, the VMFS datastore showed up as inaccessible in vCenter, and everything seemed to "pause" quite nicely. When the disk array (VMFS volume) came back, everything started working again. That said, I did notice that if you leave the volume down too long, things start acting a bit strange (VMs hanging, etc.), but I imagine that's the guest OS or ESX hitting its own timeouts.

Anyhow, we wanted to have a backup of the volume, just in case. This doesn’t really mean much if only replicas are stored on this volume, but we also keep some parent VMs on it. We wanted to map a volume from a different disk array to our new SSD disk array so we could “clone” it.

To run the QLogic HBAs in both initiator mode and target mode (by default, initiator mode is disabled when target mode is enabled):

echo "options qla2xxx qlini_mode=enabled" > /etc/modprobe.d/qla2xxx.conf

You can check it with “cat /sys/class/fc_host/hostX/device/scsi_host/hostX/active_mode” (should read ‘Initiator, Target’).

However, using our QLogic HBAs in target and initiator mode simultaneously with our SANbox switches did NOT work for us (there is a note in the Kconfig file for the qla2xxx module mentioning that some switches don't handle ports like this well, so I assume we were affected by that). VMware ESX would no longer recognize the volume when we tried this, so we did the sensible thing and ordered an Emulex HBA to keep in initiator mode.

I then mapped a new volume from our primary disk array to the SSD-disk-array-server. I have a cron that runs nightly which “clones” (via dd) the SSD volume. This “cloned” volume is mapped to our ESX hosts at all times. VMware ESX recognizes this volume as a “snapshot” -- it sees that the VMFS label / UUID are the same, but the disk serial numbers are not. It doesn’t mount it by default. It looks like this from the service console:

esxcfg-volume -l
VMFS3 UUID/label: 4d29289d-624abc58-f7b1-001d091c121f/vdi_ssd_vmfs_1
Can mount: No (the original volume is still online)
Can resignature: Yes
Extent name: naa.6000d3100011e5000000000000000079:1 range: 0 - 1217791 (MB)
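
The nightly "clone" job itself is just a raw block copy of the SSD volume onto the LUN mapped from the primary array; a minimal sketch (the device names here are hypothetical -- /dev/sdb is the SSD volume, /dev/sdc the mapped volume):

dd if=/dev/sdb of=/dev/sdc bs=4M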

Using dd to clone the SSD volume probably isn't a realistic option with a "normal" VMFS volume (one used for reads and writes). In our situation, the SSD volume is used only to store the replicas, so until we recompose using a new snapshot from a parent VM, the data on the volume is likely to stay the same. I haven't explored the possibility of block-level snapshots / clones with SCST, but it would be interesting to look at -- I believe device-mapper has some type of "snapshot" support, so maybe that could be used in combination with SCST? Something to think about...

Anyhow, so we have this cloned volume out there that our ESX servers can now see, but haven’t mounted. Using the esxcfg-volume utility in the service console we can “resignature” our cloned volume. This will write a new signature and allow ESX to mount it as a “new” datastore. It shows up as something like this: snap-084f0837-vdi_ssd_vmfs_1
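
For reference, the resignature itself is a one-liner from the service console, using either the VMFS UUID or the label from the listing above:

esxcfg-volume -r vdi_ssd_vmfs_1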

So, this really doesn't help us tremendously if our SSD volume dies, becomes unavailable, etc., but it would give us access to the data, say, if the parent VMs were stored on this datastore. There are probably some fancy things you could do with the View SQL database / View LDAP directory, like changing the datastore name in the records used for the linked clones. I found an article that is remotely similar to doing something like this: http://virtualgeek.typepad.com/virtual_geek/2009/10/howto-use-site-recovery-manager-and-linked-clones-together.html


Results / Conclusion
I've talked a lot about our setup, using SCST, a few performance numbers, etc., but our end goal was to improve VDI performance and the end-user experience. I could run a bunch more numbers, but I think seeing is believing -- what the real end-user experience is like is all that matters. So, for this demonstration, I wanted to see what the speed difference was with a "real" application between linked-clone VMs on our enterprise disk array vs. our new SSD disk array handling the reads (replica datastore).

One of our floating, linked-clone pools had a “big” application (QuickBooks 2010) on it which was notoriously slow on our VDI implementation. I created a new pool using the same specifications (Windows 7 32-bit, (2) vCPUs, and (2) GB memory), the same parent VM / snapshot, and used the new View 4.5 feature of specifying a different datastore for replicas. I then used the View Client on a workstation and logged into a new session on each pool with each session (screen) side-by-side.

I used Camtasia Studio to capture the video; for the first clip, I opened QuickBooks 2010. The VM from the current enterprise disk array is on the right, and the VM from the new SSD disk array pool is on the left. Both are fresh VMs and the application hasn’t been opened at all:

[Embedded video: QuickBooks 2010 launched side by side -- SSD-backed pool on the left, existing enterprise array on the right]
Notice that I give the "slow" VM the advantage by clicking its QuickBooks 2010 shortcut first. At the time this video was taken, the ESX cluster had several hundred VMs powered on, with only about half of those having active PCoIP sessions. With large applications, the speed difference is extremely noticeable; with smaller applications such as the Office 2010 products, the difference is still noticeable, but not nearly as dramatic.

It will be interesting to see how well this solution scales as our VDI user base grows. As a school, we also have the need for many different pools / parent VMs (lab software licensing restrictions, etc.), so the number of replicas will grow as well as the number of linked-clones that are associated with each replica.

We have been quite impressed with the SCST project and the performance of the SSDs. We are already looking at building new, bigger arrays that will be used for linked-clone datastores (not just read-only replicas) in our View deployment. Currently considering a 24-slot setup from Silicon Mechanics...