Friday, June 3, 2011

SCST & SSD Arrays

We were so pleased with our first stab at an SSD-based Fibre Channel disk array (used with VMware View) that we decided to build a couple more and use SSD storage for all of our VDI needs.

We purchased (2) additional "storage servers" that look something like this:
  • SuperMicro 2U 24x2.5in Bay Chassis w/900W Redundant
  • 5520 Chipset Motherboard 2xLAN IPMI 2.0
  • (2) Intel Xeon® Processor E5645 (12M Cache, 2.40 GHz, 5.86 GT/s Intel® QPI)
  • 12 GB 1333MHz DDR3 ECC Memory (8GB usable with sparing mode)
  • (1) LSI MegaRAID 9280-24i4e SATA/SAS 6Gb/s PCIe 2.0 w/ 512MB (with FastPath)
  • (2) 40GB X25-V 34NM 2.5IN SATA 9.5MM SSDs (for RAID1 system/OS volume)
  • (22) Crucial RealSSD C300 256GB CTFDDAC256MAG (for data RAID5 volumes)
  • (2) QLogic 8GB SINGLE PORT FC HBA PCIE8 LC MULTIMODE OPTIC

On each disk array, I used the same OS image as on our first SCST disk array (see previous article). The two smaller disks are used for the system volume, and then we have (3) SSD volumes for the VMFS datastores, each a 7-disk RAID5, plus a global hot spare (2 system + 21 data + 1 HS = 24 slots).
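
The slot count and per-volume capacity can be double-checked with a few lines of Python. This is only a sketch of the arithmetic; the constant and function names are made up for illustration, and the capacity is raw RAID5 space before any VMFS overhead.

```python
# Slot and capacity arithmetic for one array, using figures from the parts list.
TOTAL_BAYS = 24
SYSTEM_DISKS = 2        # RAID1 system/OS volume
DATA_VOLUMES = 3        # RAID5 volumes per array
DISKS_PER_VOLUME = 7
HOT_SPARES = 1
SSD_SIZE_GB = 256       # Crucial RealSSD C300, decimal gigabytes


def raid5_usable_gb(disks, disk_size_gb):
    """RAID5 spends one disk's worth of space on parity: usable = (n - 1) disks."""
    if disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks - 1) * disk_size_gb


slots_used = SYSTEM_DISKS + DATA_VOLUMES * DISKS_PER_VOLUME + HOT_SPARES
per_volume_gb = raid5_usable_gb(DISKS_PER_VOLUME, SSD_SIZE_GB)

print(f"slots used: {slots_used} of {TOTAL_BAYS}")
print(f"usable per RAID5 volume: {per_volume_gb} GB (~{per_volume_gb / 1000:.1f} TB)")
print(f"usable per array: {DATA_VOLUMES * per_volume_gb} GB")
```

Six data disks' worth of space per volume is 1536 GB, which is where the roughly 1.5TB per VMFS volume comes from.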

So, we now have a total of 6 separate VMFS volumes (3 on each array) that are ~1.5TB each. I did some initial testing on one of the arrays using fio to see what kind of "raw" performance numbers we would get (local to the disk array system, not through Fibre Channel / SCST).

"Local" Array Performance (4K random reads on a single 7-disk RAID5 SSD volume; MegaRAID policies: Direct IO, write-through (WT), no read-ahead (NORA)):
fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [312M/0K /s] [78K/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=11648
read : io=18,041MB, bw=301MB/s, iops=76,972, runt= 60001msec
slat (usec): min=3, max=156, avg= 5.31, stdev= 7.66
clat (usec): min=321, max=8,185, avg=824.74, stdev=169.80
bw (KB/s) : min=268120, max=315648, per=99.99%, avg=307869.58, stdev=8951.39
cpu : usr=15.01%, sys=49.67%, ctx=169252, majf=0, minf=12851
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=4618408/0, short=0/0
lat (usec): 500=4.13%, 750=27.28%, 1000=55.97%
lat (msec): 2=12.60%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=18,041MB, aggrb=301MB/s, minb=308MB/s, maxb=308MB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=4608276/0, merge=0/0, ticks=3004407/0, in_queue=3003212, util=99.28%

"Local" Array Performance (4K random writes on a single 7-disk RAID5 SSD volume; MegaRAID policies: Direct IO, write-through (WT), no read-ahead (NORA)):
fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/78M /s] [0/20K iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=11653
write: io=4,549MB, bw=77,632KB/s, iops=19,407, runt= 60004msec
slat (usec): min=3, max=59, avg= 6.34, stdev= 3.20
clat (usec): min=243, max=13,807, avg=3289.11, stdev=1249.05
bw (KB/s) : min=75032, max=79000, per=100.03%, avg=77652.57, stdev=626.48
cpu : usr=6.28%, sys=15.93%, ctx=249186, majf=0, minf=11782
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/1164553, short=0/0
lat (usec): 250=0.01%, 500=0.10%, 750=0.51%, 1000=1.38%
lat (msec): 2=16.30%, 4=53.07%, 10=28.63%, 20=0.01%

Run status group 0 (all jobs):
WRITE: io=4,549MB, aggrb=77,631KB/s, minb=79,494KB/s, maxb=79,494KB/s, mint=60004msec, maxt=60004msec

Disk stats (read/write):
sdb: ios=1/1161934, merge=0/0, ticks=1/3769371, in_queue=3769089, util=99.77%

So, with those "local" tests I'm seeing ~78,000 4K IOPS on reads and ~20,000 4K IOPS on writes. I also ran fio against all three RAID5 volumes and the numbers stayed about the same, so I assume those numbers (78K/20K) are limits of the MegaRAID controller and not of the SSDs themselves. I also tried a RAID0 volume (with 7 SSDs) just to see what the numbers were like, and it significantly improved the write IOPS: ~80K 4K IOPS for reads and ~80K 4K IOPS for writes.

Again, it seems like that ~80K 4K IOPS ceiling (read or write) is the RAID controller. I assume an even higher-performing solution would be to use a separate MegaRAID controller for each volume (one of those fancy new 2nd-generation SAS models). For us, I think the performance we can achieve with our current solution will be satisfactory.
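
One way to see why a controller ceiling of around 80K would show up as ~20K on RAID5 writes but ~80K on RAID0 writes is the textbook small-write penalty: a 4K write to RAID5 typically costs four back-end I/Os (read old data, read old parity, write new data, write new parity), while RAID0 costs one. The sketch below is only that back-of-the-envelope model, not a measurement; treating the ceiling as a flat budget of back-end I/Os per second is an assumption, and the names are illustrative.

```python
# Textbook write penalties: back-end I/Os generated per small host write.
WRITE_PENALTY = {"raid0": 1, "raid5": 4}


def predicted_write_iops(controller_iops_ceiling, raid_level):
    """Host write IOPS if the controller can issue at most
    `controller_iops_ceiling` back-end I/Os per second (ASSUMED model)."""
    return controller_iops_ceiling / WRITE_PENALTY[raid_level]


ceiling = 80_000  # approximate ceiling seen in the local tests above
for level in ("raid0", "raid5"):
    print(f"{level}: ~{predicted_write_iops(ceiling, level):,.0f} write IOPS")
```

Under that assumption the model lands on about 80K for RAID0 and about 20K for RAID5, which lines up with what fio reported.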

Now let's look at 4K IOPS performance over SCST and our Fibre Channel SAN. Our SAN for VDI consists of two QLogic 5800 SANbox FC switches (8 Gbps Fibre Channel). Each of these new SSD disk arrays has (2) 8 Gbps FC HBAs; however, the test box I used as the initiator only has (2) 4 Gbps QLogic HBAs. This is set up like you would assume: on each disk array, one target HBA goes to each fabric (switch), and on the Linux initiator, one HBA goes to each fabric.

On the initiator side: vanilla Linux 2.6.37.6 kernel, QLogic QLE2460 HBAs with firmware version 5.03.16, and multipath-tools with round-robin pathing.

Across SCST Performance (same RAID volume as above; 4K random reads):
fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/mapper/tangerine_ssd_1
fio 1.50-rc4
/dev/mapper/tangerine_ssd_1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [250.9M/0K /s] [62.8K/0 iops] [eta 00m:00s]
/dev/mapper/tangerine_ssd_1: (groupid=0, jobs=1): err= 0: pid=9277
read : io=14632MB, bw=249721KB/s, iops=62430 , runt= 60001msec
slat (usec): min=3 , max=634 , avg=13.57, stdev=10.98
clat (usec): min=351 , max=8697 , avg=1009.51, stdev=81.83
lat (usec): min=403 , max=8703 , avg=1023.30, stdev=81.99
bw (KB/s) : min=236488, max=252488, per=100.00%, avg=249722.82, stdev=2990.20
cpu : usr=12.06%, sys=84.85%, ctx=86338, majf=0, minf=11513
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=3745882/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.30%, 1000=42.88%
lat (msec): 2=56.82%, 4=0.01%, 10=0.01%

Run status group 0 (all jobs):
READ: io=14632MB, aggrb=249721KB/s, minb=255714KB/s, maxb=255714KB/s, mint=60001msec, maxt=60001msec

Across SCST Performance (same RAID volume as above; 4K random writes):
fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/mapper/tangerine_ssd_1
fio 1.50-rc4
/dev/mapper/tangerine_ssd_1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/59027K /s] [0 /14.5K iops] [eta 00m:00s]
/dev/mapper/tangerine_ssd_1: (groupid=0, jobs=1): err= 0: pid=9280
write: io=3383.6MB, bw=57742KB/s, iops=14435 , runt= 60004msec
slat (usec): min=4 , max=271 , avg= 8.82, stdev= 5.54
clat (usec): min=345 , max=13390 , avg=4421.59, stdev=1390.88
lat (usec): min=358 , max=13397 , avg=4430.69, stdev=1390.62
bw (KB/s) : min=54768, max=60160, per=100.03%, avg=57759.66, stdev=955.39
cpu : usr=6.01%, sys=19.92%, ctx=315079, majf=0, minf=11447
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/866195/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.05%, 1000=0.10%
lat (msec): 2=1.64%, 4=39.88%, 10=58.25%, 20=0.08%

Run status group 0 (all jobs):
WRITE: io=3383.6MB, aggrb=57742KB/s, minb=59128KB/s, maxb=59128KB/s, mint=60004msec, maxt=60004msec

Looks like we're getting ~63,000 4K random read IOPS and ~15,000 4K random write IOPS across SCST and the FC fabric. This is expected: I doubted we would match the local/raw numbers above, since we have a few other layers in between now, but this still seems quite good. By the way, on the SCST target FC/SSD disk arrays, I'm using vanilla 2.6.36.2 and SCST 2.0.0.2-rc1.
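
The fio numbers are self-consistent with Little's law: with a fixed queue depth, IOPS is roughly the number of outstanding I/Os divided by the average time each one takes. The sketch below plugs in the averages from the four single-volume runs above; the function name is illustrative, and for the local runs (older fio output with no combined "lat" line) the total latency is approximated as slat + clat.

```python
def iops_from_latency(iodepth, avg_latency_usec):
    """Little's law: throughput = outstanding I/Os / average time per I/O."""
    return iodepth / (avg_latency_usec / 1_000_000)


# (label, average total latency in usec, IOPS reported by fio)
runs = [
    ("local read", 5.31 + 824.74, 76_972),    # slat + clat
    ("local write", 6.34 + 3289.11, 19_407),  # slat + clat
    ("SCST read", 1023.30, 62_430),           # fio's own "lat" average
    ("SCST write", 4430.69, 14_435),
]

for label, lat_usec, reported in runs:
    estimate = iops_from_latency(64, lat_usec)
    print(f"{label:12s} estimated {estimate:8.0f}  reported {reported:6d}")

# Share of the local IOPS that survives the trip across SCST and the fabric.
print(f"read  IOPS retained: {62_430 / 76_972:.0%}")
print(f"write IOPS retained: {14_435 / 19_407:.0%}")
```

Each estimate comes out within about 1% of what fio reported, and the averages work out to roughly 81% of local read IOPS and 74% of local write IOPS being retained over SCST and Fibre Channel. Put differently, the extra layers add about 0.2 ms per read and about 1.1 ms per write at this queue depth.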

For my final test, I wanted to drive all 6 volumes at once from a single Linux initiator across the fabric, with one fio job per SSD/RAID5 volume.
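
Typing six --name arguments by hand gets tedious, so here is a small Python sketch that assembles the same style of fio command line used in these tests (shared options first, then one --name per multipath device). The helper name is hypothetical; the device names are the ones from this setup. It only prints the command. Keep in mind that a randwrite run against a raw volume overwrites whatever is on it.

```python
import shlex

# Multipath device names as they appear on the test initiator.
VOLUMES = [
    f"/dev/mapper/{array}_ssd_{n}"
    for array in ("tangerine", "grapefruit")
    for n in (1, 2, 3)
]


def build_fio_command(rw, volumes):
    """Return the fio argument list: shared options, then one job per volume."""
    if rw not in ("randread", "randwrite"):
        raise ValueError(f"unexpected rw mode: {rw}")
    cmd = [
        "fio", "--bs=4k", "--direct=1", f"--rw={rw}",
        "--ioengine=libaio", "--iodepth=64", "--runtime=60",
    ]
    # The job name doubles as the device path, as in the commands in this post.
    cmd += [f"--name={vol}" for vol in volumes]
    return cmd


# Print only; pass the list to subprocess.run() to actually execute it.
print(shlex.join(build_fio_command("randread", VOLUMES)))
```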

All 6 SSD volumes -- 4K random read:
fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/mapper/tangerine_ssd_1 --name=/dev/mapper/tangerine_ssd_2 --name=/dev/mapper/tangerine_ssd_3 --name=/dev/mapper/grapefruit_ssd_1 --name=/dev/mapper/grapefruit_ssd_2 --name=/dev/mapper/grapefruit_ssd_3
fio 1.50-rc4
/dev/mapper/tangerine_ssd_1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/tangerine_ssd_2: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/tangerine_ssd_3: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_2: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_3: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 6 processes
Jobs: 6 (f=6): [rrrrrr] [100.0% done] [547.2M/0K /s] [137K/0 iops] [eta 00m:00s]
/dev/mapper/tangerine_ssd_1: (groupid=0, jobs=1): err= 0: pid=9298
read : io=5369.5MB, bw=91638KB/s, iops=22909 , runt= 60001msec
slat (usec): min=3 , max=11269 , avg=20.81, stdev=72.91
clat (usec): min=299 , max=445984 , avg=2767.23, stdev=4533.97
lat (usec): min=341 , max=445991 , avg=2788.41, stdev=4534.80
bw (KB/s) : min=15610, max=151344, per=16.80%, avg=91660.17, stdev=13761.63
cpu : usr=5.19%, sys=43.27%, ctx=374857, majf=0, minf=51459
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1374589/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.07%, 1000=0.49%
lat (msec): 2=16.75%, 4=77.94%, 10=4.50%, 20=0.18%, 50=0.01%
lat (msec): 100=0.02%, 250=0.02%, 500=0.01%
/dev/mapper/tangerine_ssd_2: (groupid=0, jobs=1): err= 0: pid=9299
read : io=5343.9MB, bw=91197KB/s, iops=22799 , runt= 60003msec
slat (usec): min=3 , max=14037 , avg=21.40, stdev=76.86
clat (usec): min=251 , max=446537 , avg=2780.51, stdev=4427.36
lat (usec): min=280 , max=446543 , avg=2802.28, stdev=4428.62
bw (KB/s) : min=31528, max=159192, per=16.76%, avg=91477.19, stdev=15133.74
cpu : usr=5.04%, sys=43.70%, ctx=372558, majf=0, minf=51302
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1368022/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.03%, 1000=0.54%
lat (msec): 2=16.75%, 4=77.62%, 10=4.77%, 20=0.26%, 50=0.01%
lat (msec): 250=0.01%, 500=0.01%
/dev/mapper/tangerine_ssd_3: (groupid=0, jobs=1): err= 0: pid=9300
read : io=5268.4MB, bw=89912KB/s, iops=22478 , runt= 60001msec
slat (usec): min=3 , max=19369 , avg=21.89, stdev=78.76
clat (usec): min=178 , max=445746 , avg=2819.41, stdev=4855.91
lat (usec): min=316 , max=445753 , avg=2841.68, stdev=4857.13
bw (KB/s) : min=10968, max=123024, per=16.36%, avg=89298.15, stdev=14663.35
cpu : usr=4.82%, sys=43.83%, ctx=351152, majf=0, minf=52151
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1348707/0/0, short=0/0/0
lat (usec): 250=0.01%, 500=0.01%, 750=0.03%, 1000=0.45%
lat (msec): 2=16.25%, 4=78.00%, 10=4.91%, 20=0.29%, 50=0.01%
lat (msec): 100=0.01%, 250=0.03%, 500=0.01%
/dev/mapper/grapefruit_ssd_1: (groupid=0, jobs=1): err= 0: pid=9301
read : io=5343.6MB, bw=91187KB/s, iops=22796 , runt= 60001msec
slat (usec): min=3 , max=6865 , avg=21.04, stdev=74.24
clat (usec): min=84 , max=446529 , avg=2781.13, stdev=3999.71
lat (usec): min=253 , max=446535 , avg=2802.53, stdev=4001.31
bw (KB/s) : min=27016, max=111320, per=16.71%, avg=91186.03, stdev=12750.97
cpu : usr=5.01%, sys=42.90%, ctx=377644, majf=0, minf=51968
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1367822/0/0, short=0/0/0
lat (usec): 100=0.01%, 250=0.01%, 500=0.01%, 750=0.04%, 1000=0.25%
lat (msec): 2=15.46%, 4=79.31%, 10=4.61%, 20=0.30%, 50=0.01%
lat (msec): 250=0.01%, 500=0.01%
/dev/mapper/grapefruit_ssd_2: (groupid=0, jobs=1): err= 0: pid=9302
read : io=5407.4MB, bw=92281KB/s, iops=23070 , runt= 60003msec
slat (usec): min=3 , max=12871 , avg=20.83, stdev=72.89
clat (usec): min=247 , max=446736 , avg=2748.12, stdev=4044.39
lat (usec): min=287 , max=446754 , avg=2769.32, stdev=4045.63
bw (KB/s) : min=12690, max=185760, per=16.91%, avg=92259.09, stdev=16308.76
cpu : usr=5.20%, sys=43.25%, ctx=382840, majf=0, minf=51607
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1384289/0/0, short=0/0/0
lat (usec): 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.47%
lat (msec): 2=16.74%, 4=77.95%, 10=4.57%, 20=0.20%, 50=0.01%
lat (msec): 100=0.01%, 250=0.01%, 500=0.01%
/dev/mapper/grapefruit_ssd_3: (groupid=0, jobs=1): err= 0: pid=9303
read : io=5246.9MB, bw=89543KB/s, iops=22385 , runt= 60002msec
slat (usec): min=3 , max=19791 , avg=21.80, stdev=79.92
clat (usec): min=313 , max=445617 , avg=2831.70, stdev=4393.71
lat (usec): min=331 , max=445622 , avg=2853.87, stdev=4395.41
bw (KB/s) : min=29872, max=136048, per=16.46%, avg=89841.24, stdev=13524.75
cpu : usr=4.95%, sys=42.92%, ctx=360204, majf=0, minf=51719
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=1343184/0/0, short=0/0/0
lat (usec): 500=0.01%, 750=0.03%, 1000=0.24%
lat (msec): 2=14.45%, 4=80.09%, 10=4.79%, 20=0.35%, 50=0.01%
lat (msec): 100=0.01%, 250=0.02%, 500=0.01%

Run status group 0 (all jobs):
READ: io=31979MB, aggrb=545746KB/s, minb=91691KB/s, maxb=94496KB/s, mint=60001msec, maxt=60003msec

All 6 SSD volumes -- 4K random write:
fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/mapper/tangerine_ssd_1 --name=/dev/mapper/tangerine_ssd_2 --name=/dev/mapper/tangerine_ssd_3 --name=/dev/mapper/grapefruit_ssd_1 --name=/dev/mapper/grapefruit_ssd_2 --name=/dev/mapper/grapefruit_ssd_3
fio 1.50-rc4
/dev/mapper/tangerine_ssd_1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/tangerine_ssd_2: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/tangerine_ssd_3: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_1: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_2: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
/dev/mapper/grapefruit_ssd_3: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 6 processes
Jobs: 6 (f=6): [wwwwww] [100.0% done] [0K/188.4M /s] [0 /47.9K iops] [eta 00m:00s]
/dev/mapper/tangerine_ssd_1: (groupid=0, jobs=1): err= 0: pid=9306
write: io=1859.5MB, bw=31730KB/s, iops=7932 , runt= 60009msec
slat (usec): min=4 , max=547 , avg=13.38, stdev=16.73
clat (usec): min=643 , max=123142 , avg=8049.29, stdev=2221.02
lat (usec): min=654 , max=123150 , avg=8062.96, stdev=2220.61
bw (KB/s) : min=24761, max=49048, per=16.90%, avg=31748.24, stdev=1998.66
cpu : usr=3.21%, sys=15.46%, ctx=157631, majf=0, minf=38077
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/476019/0, short=0/0/0
lat (usec): 750=0.01%, 1000=0.01%
lat (msec): 2=0.03%, 4=0.60%, 10=96.35%, 20=2.98%, 50=0.01%
lat (msec): 100=0.01%, 250=0.03%
/dev/mapper/tangerine_ssd_2: (groupid=0, jobs=1): err= 0: pid=9307
write: io=1825.4MB, bw=31149KB/s, iops=7787 , runt= 60007msec
slat (usec): min=4 , max=545 , avg=13.52, stdev=16.77
clat (msec): min=1 , max=134 , avg= 8.20, stdev= 2.72
lat (msec): min=1 , max=134 , avg= 8.21, stdev= 2.72
bw (KB/s) : min=24024, max=39912, per=16.57%, avg=31139.16, stdev=1433.01
cpu : usr=3.05%, sys=15.58%, ctx=158030, majf=0, minf=37489
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/467289/0, short=0/0/0
lat (msec): 2=0.01%, 4=0.18%, 10=95.90%, 20=3.86%, 50=0.01%
lat (msec): 100=0.01%, 250=0.04%
/dev/mapper/tangerine_ssd_3: (groupid=0, jobs=1): err= 0: pid=9308
write: io=1833.9MB, bw=31296KB/s, iops=7824 , runt= 60004msec
slat (usec): min=4 , max=449 , avg=13.45, stdev=16.35
clat (usec): min=612 , max=129471 , avg=8160.29, stdev=2552.85
lat (usec): min=627 , max=129483 , avg=8174.02, stdev=2552.47
bw (KB/s) : min=23936, max=33312, per=16.56%, avg=31117.90, stdev=1168.57
cpu : usr=3.07%, sys=15.66%, ctx=158910, majf=0, minf=38423
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/469473/0, short=0/0/0
lat (usec): 750=0.01%, 1000=0.01%
lat (msec): 2=0.02%, 4=0.60%, 10=95.77%, 20=3.57%, 250=0.04%
/dev/mapper/grapefruit_ssd_1: (groupid=0, jobs=1): err= 0: pid=9309
write: io=1828.4MB, bw=31200KB/s, iops=7800 , runt= 60007msec
slat (usec): min=4 , max=1127 , avg=13.50, stdev=16.84
clat (msec): min=1 , max=132 , avg= 8.19, stdev= 2.54
lat (msec): min=1 , max=132 , avg= 8.20, stdev= 2.54
bw (KB/s) : min=24312, max=37792, per=16.60%, avg=31193.08, stdev=1208.43
cpu : usr=3.00%, sys=15.76%, ctx=158728, majf=0, minf=37903
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/468062/0, short=0/0/0
lat (msec): 2=0.01%, 4=0.19%, 10=96.04%, 20=3.73%, 250=0.04%
/dev/mapper/grapefruit_ssd_2: (groupid=0, jobs=1): err= 0: pid=9310
write: io=1830.9MB, bw=31237KB/s, iops=7809 , runt= 60017msec
slat (usec): min=4 , max=620 , avg=13.51, stdev=16.53
clat (usec): min=837 , max=133782 , avg=8175.49, stdev=2812.91
lat (usec): min=846 , max=133788 , avg=8189.30, stdev=2812.59
bw (KB/s) : min=20421, max=45008, per=16.64%, avg=31271.39, stdev=1898.01
cpu : usr=3.10%, sys=15.59%, ctx=158858, majf=0, minf=37621
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/468693/0, short=0/0/0
lat (usec): 1000=0.01%
lat (msec): 2=0.02%, 4=0.34%, 10=96.18%, 20=3.38%, 50=0.01%
lat (msec): 100=0.04%, 250=0.04%
/dev/mapper/grapefruit_ssd_3: (groupid=0, jobs=1): err= 0: pid=9311
write: io=1834.6MB, bw=31306KB/s, iops=7826 , runt= 60008msec
slat (usec): min=4 , max=683 , avg=13.43, stdev=16.62
clat (usec): min=747 , max=130950 , avg=8157.66, stdev=2858.53
lat (usec): min=833 , max=130977 , avg=8171.38, stdev=2858.18
bw (KB/s) : min=24296, max=33104, per=16.59%, avg=31172.13, stdev=1321.27
cpu : usr=3.06%, sys=15.56%, ctx=158550, majf=0, minf=38435
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w/d: total=0/469650/0, short=0/0/0
lat (usec): 750=0.01%, 1000=0.01%
lat (msec): 2=0.01%, 4=0.45%, 10=96.14%, 20=3.34%, 250=0.05%

Run status group 0 (all jobs):
WRITE: io=11012MB, aggrb=187892KB/s, minb=31896KB/s, maxb=32491KB/s, mint=60004msec, maxt=60017msec

So, across all (6) volumes, I was able to obtain ~137,000 IOPS for 4K random reads and ~47,000 IOPS for 4K random writes from a single Linux host!
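
Summing the per-job iops= values reported by fio gives the aggregate directly. The sketch below does that, and also makes a rough estimate of how busy the initiator's two 4 Gbps HBAs were. The ~400 MB/s of payload per 4 Gbps port is a nominal figure I'm assuming here, and fio's KB/s is taken to mean 1024-byte units; neither was measured.

```python
# Per-job averages copied from the "iops=" fields in the fio output above.
READ_IOPS = {
    "tangerine_ssd_1": 22909, "tangerine_ssd_2": 22799, "tangerine_ssd_3": 22478,
    "grapefruit_ssd_1": 22796, "grapefruit_ssd_2": 23070, "grapefruit_ssd_3": 22385,
}
WRITE_IOPS = {
    "tangerine_ssd_1": 7932, "tangerine_ssd_2": 7787, "tangerine_ssd_3": 7824,
    "grapefruit_ssd_1": 7800, "grapefruit_ssd_2": 7809, "grapefruit_ssd_3": 7826,
}

# ASSUMPTION: nominal payload rate of one 4 Gbps FC port, per direction.
FC_4G_MB_PER_SEC = 400
INITIATOR_PORTS = 2


def total_iops(per_job):
    """Add up the per-job IOPS averages."""
    return sum(per_job.values())


def link_utilization(aggr_kib_per_sec, ports=INITIATOR_PORTS):
    """Fraction of the (assumed) initiator FC bandwidth used by fio's aggrb."""
    mb_per_sec = aggr_kib_per_sec * 1024 / 1_000_000
    return mb_per_sec / (ports * FC_4G_MB_PER_SEC)


print(f"total read IOPS:  {total_iops(READ_IOPS):,}")
print(f"total write IOPS: {total_iops(WRITE_IOPS):,}")
print(f"read link utilization:  {link_utilization(545_746):.0%}")
print(f"write link utilization: {link_utilization(187_892):.0%}")
```

The per-job averages add up to 136,437 read IOPS and 46,978 write IOPS. Under the bandwidth assumption above, the read test uses roughly 70% of what the two 4 Gbps initiator HBAs could carry, so a host with 8 Gbps HBAs might have more headroom.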

Here is my current (haven't added ESX initiators yet) SCST configuration if anyone is interested:
# Automatically generated by SCST Configurator v2.0.0.

# Non-key attributes
setup_id 0x0
max_tasklet_cmd 20
threads 24

HANDLER vdisk_blockio {
    DEVICE tangerine_ssd_1 {
        t10_dev_id "90cc7637 tangerine_ssd_1"
        threads_num 6
        usn 90cc7637

        filename /dev/disk/by-path/pci-0000:08:00.0-scsi-0:2:1:0

        # Non-key attributes
        threads_pool_type per_initiator
    }

    DEVICE tangerine_ssd_2 {
        t10_dev_id "c048d915 tangerine_ssd_2"
        threads_num 6
        usn c048d915

        filename /dev/disk/by-path/pci-0000:08:00.0-scsi-0:2:2:0

        # Non-key attributes
        threads_pool_type per_initiator
    }

    DEVICE tangerine_ssd_3 {
        t10_dev_id "d7e6d5db tangerine_ssd_3"
        threads_num 6
        usn d7e6d5db

        filename /dev/disk/by-path/pci-0000:08:00.0-scsi-0:2:3:0

        # Non-key attributes
        threads_pool_type per_initiator
    }
}

TARGET_DRIVER qla2x00t {
    TARGET 21:00:00:24:ff:00:bf:58 {
        rel_tgt_id 1
        enabled 1

        # Non-key attributes
        addr_method PERIPHERAL
        explicit_confirmation 0
        io_grouping_type auto

        GROUP peach {
            LUN 101 tangerine_ssd_1 {
                read_only 0
            }
            LUN 102 tangerine_ssd_2 {
                read_only 0
            }
            LUN 103 tangerine_ssd_3 {
                read_only 0
            }

            io_grouping_type 2

            # Non-key attributes
            addr_method PERIPHERAL
        }

        GROUP pineapple {
            LUN 101 tangerine_ssd_1 {
                read_only 0
            }
            LUN 102 tangerine_ssd_2 {
                read_only 0
            }
            LUN 103 tangerine_ssd_3 {
                read_only 0
            }

            io_grouping_type 3

            # Non-key attributes
            addr_method PERIPHERAL
        }

        GROUP raspberry {
            LUN 0 tangerine_ssd_1 {
                read_only 0
            }
            LUN 1 tangerine_ssd_2 {
                read_only 0
            }
            LUN 2 tangerine_ssd_3 {
                read_only 0
            }

            INITIATOR 21:00:00:1b:32:87:cf:00

            io_grouping_type 1

            # Non-key attributes
            addr_method PERIPHERAL
        }
    }

    TARGET 21:00:00:24:ff:01:1c:08 {
        rel_tgt_id 2
        enabled 1

        # Non-key attributes
        addr_method PERIPHERAL
        explicit_confirmation 0
        io_grouping_type auto

        GROUP peach {
            LUN 101 tangerine_ssd_1 {
                read_only 0
            }
            LUN 102 tangerine_ssd_2 {
                read_only 0
            }
            LUN 103 tangerine_ssd_3 {
                read_only 0
            }

            io_grouping_type 2

            # Non-key attributes
            addr_method PERIPHERAL
        }

        GROUP pineapple {
            LUN 101 tangerine_ssd_1 {
                read_only 0
            }
            LUN 102 tangerine_ssd_2 {
                read_only 0
            }
            LUN 103 tangerine_ssd_3 {
                read_only 0
            }

            io_grouping_type 3

            # Non-key attributes
            addr_method PERIPHERAL
        }

        GROUP raspberry {
            LUN 0 tangerine_ssd_1 {
                read_only 0
            }
            LUN 1 tangerine_ssd_2 {
                read_only 0
            }
            LUN 2 tangerine_ssd_3 {
                read_only 0
            }

            INITIATOR 21:00:00:1b:32:87:f8:00

            io_grouping_type 1

            # Non-key attributes
            addr_method PERIPHERAL
        }
    }
}
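
With this many nested blocks it is easy to lose track of which LUN is exported to whom. Below is a minimal Python sketch that walks a config shaped like the one above and lists each target / group / LUN / device along with the group's INITIATOR entries. This is not part of SCST or scstadmin; the function names are made up, and it assumes every block opens with a line ending in "{" and closes with a lone "}", as in this file (a LUN written without braces would be missed).

```python
SAMPLE = """
TARGET_DRIVER qla2x00t {
TARGET 21:00:00:24:ff:00:bf:58 {
enabled 1
GROUP raspberry {
LUN 0 tangerine_ssd_1 {
read_only 0
}
INITIATOR 21:00:00:1b:32:87:cf:00
}
}
}
"""


def parse_scst_config(text):
    """Parse brace-delimited blocks into nested dicts (header, attrs, children)."""
    root = {"header": "ROOT", "attrs": [], "children": []}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith("{"):
            node = {"header": line[:-1].strip(), "attrs": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()
        else:
            stack[-1]["attrs"].append(line)
    if len(stack) != 1:
        raise ValueError("missing '}'")
    return root


def lun_map(root):
    """Yield (target WWN, group, LUN number, device, initiators) per LUN."""
    for driver in root["children"]:
        if not driver["header"].startswith("TARGET_DRIVER"):
            continue
        for target in driver["children"]:
            wwn = target["header"].split()[1]
            for group in target["children"]:
                if not group["header"].startswith("GROUP"):
                    continue
                name = group["header"].split()[1]
                initiators = [a.split()[1] for a in group["attrs"]
                              if a.startswith("INITIATOR")]
                for lun in group["children"]:
                    _, number, device = lun["header"].split()
                    yield wwn, name, int(number), device, initiators


for row in lun_map(parse_scst_config(SAMPLE)):
    print(row)
```

Run against the full configuration, a listing like this would show the peach and pineapple groups with an empty initiator list, which matches the ESX initiators not having been added yet.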

2 comments:

  1. Marc-
    So after 5-6 months of production service, how are the SSDs holding up? Any burnouts yet?
    Thanks!

  2. Hi Greg,

    No, none yet! I hope I'm not jinxing it now! =)

    That being said, one of the SSD arrays (storage server?) did have a kernel panic last month, and unfortunately, I didn't have 'kernel.panic' set in sysctl.conf to "auto-reboot" on panic. So, it hung for a while, but once we realized what happened, we power cycled it and everything came back up nicely.

    That happened on our original Dell R710 SCST box (six-drive RAID5 volume); I have since taken that machine out of production in case it is a hardware issue -- it's just running on the development side now.

    Up until last month, we hadn't heard a peep from the three SCST machines we have set up -- they are running perfectly. All of our virtual desktops are extremely fast! Currently running ~750 VMs across (3) ESX hosts and (6) of the SSD VMFS volumes (two storage arrays/servers with three RAID5 volumes each).


    --Marc
