We already knew that our disk array (SAN) vendor that we used with our VDI infrastructure supported SSDs, so we figured we’d get a quote for some of these bad boys... well it came back a lot higher than we anticipated (about $50K for 4 drives + 1 enclosure).
Our solution: Build an SSD disk array (for our Fibre Channel SAN) using SCST (open source SCSI target subsystem for Linux). The SCST project seemed pretty solid with a good community, but I wanted to try it out for myself before ordering SSDs and other hardware.
I set up an old Dell PowerEdge 6950 that had some QLogic 4Gb FC HBAs and 15K SAS disks on it with Gentoo Linux. The SCST project is very well documented with lots of examples, so the whole setup was a breeze. I played around with the different device handlers a bit, but for our planned setup (VMware ESX VMFS volume), using the BLOCKIO mode seemed to be what we wanted. I played/tested quite a bit over the next couple weeks with a volume in BLOCKIO mode (SAS 15K - RAID5 PERC on the back storage) with different Fibre Channel initiators. I was sold -- now I had to figure out our “production” solution.
Our Solution
We decided to re-purpose an existing server that still had a good warranty left and a decent number of hot-swappable drive slots:
A Dell PowerEdge R710 (pulled from an ESX cluster):
- (2) Intel Xeon X5570 @ 2.93 GHz (4 cores each)
- 24 GB Memory
- Intel Gigabit Quad-Port NIC
- (2) QLogic QLE2460 HBAs (4Gbit Fibre Channel)
- Basic Dell PERC RAID Controller
- (8) 2.5” Bays
Next we wanted to update the RAID controller in the unit and get some SSDs; the majority of the servers we buy are hooked to a Fibre Channel SAN (for boot & data volumes), so the existing PERC controller left a little to be desired: We decided on the PERC H700 w/ 1GB NV Cache.
We then had to decide on some SSDs. There are some different options on the SSDs -- expensive “enterprise” SAS 6Gbps SLC drives and consumer grade SATA 3/6Gbps MLCs (and some other stuff in-between). We actually didn’t find any vendors that sold the enterprise SAS SSDs individually (eg, only via Dell, HP, etc. - all re-branded); we looked at Dell and they were in the $2K - $3K range for ~140GB (can’t remember exact size) each.
After reading some different reports, reviews, etc. we decided on the RealSSD (Micron) line of drives -- specifically the Crucial (Micron’s consumer division) RealSSD C300 CTFDDAC256MAG-1G1 2.5" 256GB SATA III MLC SSDs.
Great -- we put a requisition in and a few weeks later (the SSDs had to be bid out) we had some new toys.
From Marc's Adventures in IT Land |
Dell wouldn’t sell us the hot-swap drive trays by themselves, so we had to buy some cheap SATA disks so we could get the carriers. We purchased eight drive carriers (with disks) and eight SSDs -- we were only going to use 6 in our array, but wanted to have a couple spares ready.
Once we had the new RAID controller installed, I went through and updated the BIOS and PERC firmware, tweaked the BIOS settings and HBA settings (namely just disabling the adapter BIOS as we won’t be booting from the SAN at all). The R710 has (8) 2.5” drive bays; we decided to use (2) of these bays for a RAID1 array (for the boot/system Linux volume) with 73GB 10K SAS disks.
I hooked up each of the SSDs to a stand-alone Windows workstation and updated the firmware to the latest and greatest.
Linux Install/Setup
For the system Linux OS, I decided to use Gentoo Linux. We are a RHEL shop, but SCST appears to benefit greatly from a newer kernel. I’ve used Gentoo Linux in a production environment before, and my feeling on the whole stability thing with the “enterprise” Linux distributions is that it’s tossed out the window when you patch one of their kernels or use a newer vanilla kernel -- sure, you’ve got the user-land stuff, but the main function of this server is going to be based in the kernel anyway.
I did use the Hardened Gentoo (amd64) profile and “vanilla-sources” kernel -- not necessarily for the security features, but for the “stability” (supposedly) of it. Most generally use the “hardened-sources” with Hardened Gentoo, but I figured having “clean” kernel source (instead of vendor patches) would be easier for integrating SCST. I got the OS installed and completely updated the whole thing with emerge. When installing Gentoo, I obviously chose a custom kernel (not genkernel) set with some standard options that I like and also what the branches/2.0.0.x/scst/README document from the SCST project recommended; namely:
- Disable all “kernel hacking” features.
- Use the CFQ IO scheduler.
- Turn the kernel preemption off (server).
- Enable the MCE features.
- I didn’t configure my HBA driver at this point as I knew that would need to be patched when setting up SCST.
I also installed a few useful utilities:
- sys-apps/hdparm
- sys-apps/pciutils
- sys-fs/lsscsi
- app-admin/mcelog
- app-admin/sysstat
- sys-block/fio
- dev-vcs/subversion
Plus the MegaCLI tool from LSI (for managing the RAID controller):
- Grab the latest Linux package from LSI’s website.
- Need the RPM tool in Gentoo: emerge app-arch/rpm
- Extract the MegaCLI package -- install the “MegaCli” and “Lib_Utils” RPMs; don’t forget a ‘--nodeps’.
- I didn’t need any other dependencies (check yours using ldd).
- No more BIOS RAID management: /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL
- A very useful cheat sheet: http://tools.rapidsoft.de/perc/perc-cheat-sheet.html
Back Storage Performance
Before setting up SCST, I wanted to do some quick and dirty performance/throughput tests on the back-end storage (the SSD array). This Dell PERC H700 controller has a feature called “Cut-through IO (CTIO)” that should supposedly increase the throughput for SSD drive arrays. Per the documentation, it’s enabled on an LD (logical drive) by disabling read ahead and enabling write through cache (WT + NORA). I went ahead and created a RAID 5 array with my six SSD drives:
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLDAdd -R5[:1,:2,:3,:5,:6,:7] WT NORA -a0
Adapter 0: Created VD 1
Adapter 0: Configured the Adapter!!
Exit Code: 0x00
Presto! My new virtual disk is available:
[344694.250054] sd 0:2:1:0: [sdb] 2494300160 512-byte logical blocks: (1.27 TB/1.16 TiB)
[344694.250063] sd 0:2:1:0: Attached scsi generic sg3 type 0
[344694.250101] sd 0:2:1:0: [sdb] Write Protect is off
[344694.250104] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 10 08
[344694.250137] sd 0:2:1:0: [sdb] Write cache: disabled, read cache: disabled, supports DPO and FUA
[344694.250643] sdb: unknown partition table
[344694.250813] sd 0:2:1:0: [sdb] Attached SCSI disk
I waited for the RAID initialization process to finish (took about 23 minutes) before trying out a few tests; I’m not sure how much that affects performance. For the first test, I wanted to check the random access time of the array. I used a utility called “seeker” from this page: http://www.linuxinsight.com/how_fast_is_your_disk.html
./seeker /dev/sdb
Seeker v2.0, 2007-01-15, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sdb [1217920MB], wait 30 seconds..............................
Results: 6817 seeks/second, 0.15 ms random access time
So, we can definitely see one of the SSD perks -- very low random access times. Compare that to our system volume (RAID1 / 10K) below; not having mechanical parts makes a big difference.
./seeker /dev/sda
Seeker v2.0, 2007-01-15, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [69376MB], wait 30 seconds..............................
Results: 157 seeks/second, 6.34 ms random access time
I see lots of people also using the ‘hdparm’ utility, so I figured I’d throw that in too:
hdparm -Tt /dev/sdb
/dev/sdb:
Timing cached reads: 20002 MB in 2.00 seconds = 10011.99 MB/sec
Timing buffered disk reads: 1990 MB in 3.00 seconds = 663.24 MB/sec
I wanted to try out sequential IO throughput of the volume using the ‘dd’ tool. I read about this a little bit on the ‘net and everyone seems to agree that the Linux buffer/page cache can warp performance numbers a bit. I haven’t educated myself enough on that topic, but the general consensus seems to recommend pushing a lot more data than you have RAM (24 GB RAM in this machine) to get around this, so I did 60 GB:
dd of=/dev/null if=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 88.797 s, 676 MB/s
676 megabytes per second seems pretty nice. Lately I’ve been thinking of numbers in “Gbps” (gigabits per second), so that number is about 5.4 Gbps (676 MB/s x 8 bits = 5,408 Mbps; dd reports decimal megabytes, so the divisor is 1000, not 1024). Let’s check out the write speed:
dd if=/dev/zero of=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 353.116 s, 170 MB/s
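A side note on units for the dd figures above: dd prints decimal megabytes per second, so converting to gigabits per second is a multiply by 8 and a divide by 1000. A minimal sketch, where mb_to_gbps is a made-up helper name and not part of any tool used here:

```sh
#!/bin/sh
# Convert a throughput in decimal MB/s (as printed by dd) to Gbps.
# mb_to_gbps is a hypothetical helper name used only for illustration.
mb_to_gbps() {
    awk -v mb="$1" 'BEGIN { printf "%.3f\n", mb * 8 / 1000 }'
}

mb_to_gbps 676   # sequential read result from dd
mb_to_gbps 170   # RAID5 write result with write-through
```

The same helper works for any of the later dd results in this section.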
Ouch, that seems a little slower than I had anticipated. I understand that with the MLC SSD drives, the write speed is generally slower, but that seems a lot slower. Now, on the PERC, I had disabled the read/write cache for this volume (per Dell’s recommendation for Cut-through IO mode / SSDs), but this is a RAID5 volume and these are SATA SSDs, not SAS (“enterprise”) SSDs, so I turned on the write cache (write back) to see what happens:
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -L1 -a0
Set Write Policy to WriteBack on Adapter 0, VD 1 (target id: 1) success
Exit Code: 0x00
dd if=/dev/zero of=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 92.6398 s, 648 MB/s
Well, that number is quite a bit more peppy; so, now this is going to bug me -- what’s up with the drastic slow-downs on writes (without cache)? I wondered if it was possibly related to the RAID5 parity calculation / stripe size / something else with the writes. I was curious, so I destroyed that RAID5 logical disk and created a new volume using the six SSDs / RAID0 and tested again.
dd if=/dev/zero of=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 84.3099 s, 712 MB/s
Wow (write cache off). So it definitely seems to be a RAID5 thing; just to check the read speed again to be thorough:
dd of=/dev/null if=/dev/sdb bs=4MB count=15000
15000+0 records in
15000+0 records out
60000000000 bytes (60 GB) copied, 77.2353 s, 777 MB/s
Ugh. I’m not going to spend too much time trying to figure out why my write speed with RAID5 is so much slower -- we’re really just interested in reads since this array is only going to be used with VMware View as a read-only replica datastore. However, I did enable write back on the controller as I don’t want to be waiting forever on the parent VM -> replica clone operations.
We had originally read a Tom’s Hardware article about high-end SSD performance (http://www.tomshardware.com/reviews/x25-e-ssd-performance,2365.html) which gave us some inspiration for this project. They used (16) SSDs and (2) Adaptec RAID controllers with eight drives on each controller in RAID0 arrays and then used software (OS) RAID0 to stripe the two volumes as one logical disk; they were able to obtain 2.2 GB/sec (gigabytes). I was curious as to what our back-end storage was capable of (with six SSDs).
I created a RAID0 array with the six SSDs, 1MB (max) stripe size, write through and no read ahead. I used the ‘fio’ tool to push a bunch of data through to see what our max throughput is (similar to Tom’s Hardware: http://www.tomshardware.com/reviews/x25-e-ssd-performance,2365-8.html):
fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [1,663M/0K /s] [406/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=22439
read : io=97,016MB, bw=1,616MB/s, iops=403, runt= 60044msec
slat (usec): min=144, max=5,282, avg=2471.42, stdev=2058.81
clat (msec): min=42, max=197, avg=155.79, stdev=12.62
bw (KB/s) : min=1185469, max=1671168, per=99.91%, avg=1653112.44, stdev=43452.90
cpu : usr=0.13%, sys=7.54%, ctx=13547, majf=0, minf=131099
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=24254/0, short=0/0
lat (msec): 50=0.01%, 100=0.09%, 250=99.90%
Run status group 0 (all jobs):
READ: io=97,016MB, aggrb=1,616MB/s, minb=1,655MB/s, maxb=1,655MB/s, mint=60044msec, maxt=60044msec
Disk stats (read/write):
sdb: ios=435762/0, merge=0/0, ticks=8518093/0, in_queue=8520113, util=99.85%
fio --bs=4m --direct=1 --rw=write --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/1,323M /s] [0/323 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=22445
write: io=77,328MB, bw=1,287MB/s, iops=321, runt= 60105msec
slat (usec): min=102, max=9,777, avg=3101.45, stdev=2663.09
clat (msec): min=98, max=323, avg=195.62, stdev=42.52
bw (KB/s) : min=901120, max=1351761, per=99.93%, avg=1316562.66, stdev=39517.45
cpu : usr=0.11%, sys=4.26%, ctx=10872, majf=0, minf=27
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/19332, short=0/0
lat (msec): 100=0.01%, 250=99.91%, 500=0.09%
Run status group 0 (all jobs):
WRITE: io=77,328MB, aggrb=1,287MB/s, minb=1,317MB/s, maxb=1,317MB/s, mint=60105msec, maxt=60105msec
Disk stats (read/write):
sdb: ios=6/347909, merge=0/0, ticks=0/8566762, in_queue=8571115, util=98.44%
I think those numbers look pretty nice for six (6) SSDs and one RAID controller: reads @ 1,655MB/s & writes @ 1,317MB/s (1.62 GB / sec, 1.29 GB / sec -- bytes, not bits).
Alright, for the “real” setup that we used, RAID0 is obviously not an option. We know RAID10 generally offers the best performance, but we didn’t want to lose that much space, so it looks like RAID5 is our BFF. I went with RAID5, 64KB stripe size (adapter default), write back, and no read ahead. I looked for information on optimal stripe size for use with VMware VMFS, but the opinions didn’t appear to be one-sided (bigger vs. smaller), so I stuck with the default. I ran our fio read/write throughput tests once more with the final array setup:
fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [1,659M/0K /s] [405/0 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=22532
read : io=96,752MB, bw=1,612MB/s, iops=403, runt= 60019msec
slat (usec): min=144, max=115K, avg=2478.14, stdev=2177.20
clat (msec): min=16, max=277, avg=156.15, stdev= 7.52
bw (KB/s) : min=1171456, max=1690412, per=99.87%, avg=1648576.39, stdev=57546.44
cpu : usr=0.09%, sys=7.72%, ctx=13584, majf=0, minf=131100
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=24188/0, short=0/0
lat (msec): 20=0.01%, 50=0.05%, 100=0.08%, 250=99.60%, 500=0.26%
Run status group 0 (all jobs):
READ: io=96,752MB, aggrb=1,612MB/s, minb=1,651MB/s, maxb=1,651MB/s, mint=60019msec, maxt=60019msec
Disk stats (read/write):
sdb: ios=434574/0, merge=0/0, ticks=8512402/0, in_queue=8513463, util=99.85%
fio --bs=4m --direct=1 --rw=write --ioengine=libaio --iodepth=64 --runtime=60 --name=/dev/sdb
/dev/sdb: (g=0): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/1,090M /s] [0/266 iops] [eta 00m:00s]
/dev/sdb: (groupid=0, jobs=1): err= 0: pid=22583
write: io=64,368MB, bw=1,072MB/s, iops=268, runt= 60029msec
slat (usec): min=101, max=106K, avg=3726.06, stdev=3360.91
clat (msec): min=27, max=341, avg=234.77, stdev=13.74
bw (KB/s) : min=1062834, max=1338135, per=99.75%, avg=1095312.87, stdev=24085.25
cpu : usr=0.14%, sys=3.45%, ctx=9081, majf=0, minf=26
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued r/w: total=0/16092, short=0/0
lat (msec): 50=0.04%, 100=0.08%, 250=99.26%, 500=0.62%
Run status group 0 (all jobs):
WRITE: io=64,368MB, aggrb=1,072MB/s, minb=1,098MB/s, maxb=1,098MB/s, mint=60029msec, maxt=60029msec
Disk stats (read/write):
sdb: ios=11/288540, merge=0/0, ticks=1/8542345, in_queue=8544238, util=98.44%
So, we can see the writes are a bit slower than our RAID0 array -- no big deal for us. The read rate stayed nice and juicy (~ 1.6 GB / sec). I’m satisfied, so now it’s time to configure SCST.
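To put numbers on the RAID10 vs. RAID5 space trade-off mentioned earlier, here is a rough sketch of nominal usable capacity for six 256GB drives. The raid_usable_gb helper is hypothetical, and it ignores controller metadata and GB vs. GiB differences:

```sh
#!/bin/sh
# Nominal usable capacity in GB for n identical drives of size_gb each.
# raid_usable_gb is a hypothetical helper; real arrays come out slightly smaller.
raid_usable_gb() {
    level=$1; n=$2; size_gb=$3
    case "$level" in
        0)  echo $(( n * size_gb )) ;;          # striping only, no redundancy
        5)  echo $(( (n - 1) * size_gb )) ;;    # one drive's worth of parity
        10) echo $(( n / 2 * size_gb )) ;;      # mirrored pairs
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

for level in 0 5 10; do
    echo "RAID$level: $(raid_usable_gb "$level" 6 256) GB"
done
```

The RAID5 result is roughly in line with the 1.27 TB the kernel reported for the logical drive, while RAID10 would have given up 512 GB more.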
SCST Setup
I started by grabbing the whole SCST project:
cd /usr/src
svn co https://scst.svn.sourceforge.net/svnroot/scst
By default, the SVN version is set for debugging/development, not performance, so I used the following make command to setup for performance:
cd /usr/src/scst/branches/2.0.0.x
make debug2perf
Next, we need to apply some of the kernel patches that are included with the SCST project. We’re using 2.6.36 kernel with this Gentoo install, so we really only need to use one patch file -- the other “enhancements” are already included in these newer kernels.
cd /usr/src
ln -s linux-2.6.36.2 linux-2.6.36
patch -p0 < /usr/src/scst/branches/2.0.0.x/scst/kernel/scst_exec_req_fifo-2.6.36.patch
Replace the kernel bundled QLogic FC driver with the SCST modified QLogic FC driver that enables target mode support:
mv /usr/src/linux-2.6.36.2/drivers/scsi/qla2xxx /usr/src/linux-2.6.36.2/drivers/scsi/qla2xxx.orig
ln -s /usr/src/scst/branches/2.0.0.x/qla2x00t /usr/src/linux-2.6.36.2/drivers/scsi/qla2xxx
I then built a new kernel with the QLA2XXX driver (as a module) and selected the “QLogic 2xxx target mode support” option, installed it, and rebooted. After the system came back up, I installed the latest QLogic firmware image (ftp://ftp.qlogic.com/outgoing/linux/firmware/) for my adapters:
mkdir -p /lib/firmware
cd /lib/firmware
wget ftp://ftp.qlogic.com/outgoing/linux/firmware/ql2400_fw.bin
modprobe -r qla2xxx
modprobe qla2xxx
Build and install the SCST core; debugging (performance hit) is enabled by default (so you might want to leave it on for testing), but we disabled it above with the ‘make debug2perf’:
cd /usr/src/scst/branches/2.0.0.x/scst/src
make all
make install
Build and install the QLogic target driver:
cd /usr/src/scst/branches/2.0.0.x/qla2x00t/qla2x00-target
make
make install
Build and install the scstadmin utility and start-up scripts (the ‘make install’ puts non-Gentoo init.d scripts in place by default):
cd /usr/src/scst/branches/2.0.0.x/scstadmin
make
make install
rm /etc/init.d/qla2x00t
rm /etc/init.d/scst
install -m 755 init.d/qla2x00t.gentoo /etc/init.d/qla2x00t
install -m 755 init.d/scst.gentoo /etc/init.d/scst
rc-update add qla2x00t default
rc-update add scst default
scstadmin -write_config /etc/scst.conf
/etc/init.d/qla2x00t start
/etc/init.d/scst start
Now it’s time to configure SCST -- the project is very well documented (see branches/2.0.0.x/scst/README), so I won’t go into all of the different configuration options, only what we decided on for our setup. First, we created a new virtual disk using BLOCKIO mode (vdisk_blockio):
scstadmin -open_dev vdi_ssd_vmfs_1 -handler vdisk_blockio -attributes filename=/dev/sdb,blocksize=512,nv_cache=0,read_only=0,removable=0
scstadmin -nonkey -write_config /etc/scst.conf
A little more tweaking; I set threads_num to 4 initially:
scstadmin -set_dev_attr vdi_ssd_vmfs_1 -attributes threads_pool_type=per_initiator,threads_num=4
scstadmin -nonkey -write_config /etc/scst.conf
Now for the security groups and target LUN setup; in our setup, each ESX host has (2) Fibre Channel HBAs and we have (2) fabrics (non-stacked, independent switches). Our disk array box has (2) HBAs, one going to each fabric, so each HBA on the disk array (SCST target) will “see” (1) initiator for each ESX host. The SCST documentation states that the “io_grouping_type” attribute can affect performance greatly -- I decided to initially put each initiator in its own security group and this way I could control the I/O grouping using explicit I/O group numbers and experiment a bit with this.
scstadmin -add_group vdiesxtemp1 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6
scstadmin -add_group vdiesxtemp2 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6
scstadmin -add_group vdiesxtemp3 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6
scstadmin -add_group vdiesxtemp1 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de
scstadmin -add_group vdiesxtemp2 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de
scstadmin -add_group vdiesxtemp3 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de
scstadmin -add_init 21:00:00:1b:32:17:00:f6 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp1
scstadmin -add_init 21:00:00:1b:32:17:d6:f7 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp2
scstadmin -add_init 21:00:00:1b:32:06:0f:a1 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp3
scstadmin -add_init 21:01:00:1b:32:37:00:f6 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp1
scstadmin -add_init 21:01:00:1b:32:37:d6:f7 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp2
scstadmin -add_init 21:01:00:1b:32:26:0f:a1 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp3
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp1 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp2 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -group vdiesxtemp3 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp1 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp2 -device vdi_ssd_vmfs_1
scstadmin -add_lun 201 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -group vdiesxtemp3 -device vdi_ssd_vmfs_1
scstadmin -set_grp_attr vdiesxtemp1 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -attributes io_grouping_type=1
scstadmin -set_grp_attr vdiesxtemp2 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -attributes io_grouping_type=2
scstadmin -set_grp_attr vdiesxtemp3 -driver qla2x00t -target 21:00:00:1b:32:82:91:f6 -attributes io_grouping_type=3
scstadmin -set_grp_attr vdiesxtemp1 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -attributes io_grouping_type=1
scstadmin -set_grp_attr vdiesxtemp2 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -attributes io_grouping_type=2
scstadmin -set_grp_attr vdiesxtemp3 -driver qla2x00t -target 21:00:00:1b:32:8a:50:de -attributes io_grouping_type=3
scstadmin -nonkey -write_config /etc/scst.conf
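The group and LUN commands above repeat for every target/group pair, so they could be generated with a loop. This is only a sketch: it prints the commands rather than running them, it reuses the exact scstadmin arguments shown above, and the function and variable names are made up. The -add_init lines are left out because each initiator WWN is unique.

```sh
#!/bin/sh
# Print (do not run) the -add_group and -add_lun commands for each
# target/group pair. Review the output, then pipe it to sh if it looks right.
TARGETS="21:00:00:1b:32:82:91:f6 21:00:00:1b:32:8a:50:de"
GROUP_NAMES="vdiesxtemp1 vdiesxtemp2 vdiesxtemp3"

gen_group_cmds() {
    for target in $TARGETS; do
        for group in $GROUP_NAMES; do
            echo "scstadmin -add_group $group -driver qla2x00t -target $target"
            echo "scstadmin -add_lun 201 -driver qla2x00t -target $target -group $group -device vdi_ssd_vmfs_1"
        done
    done
}

gen_group_cmds
```

Printing first keeps this a dry run; nothing touches the SCST configuration until the output is executed.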
The SCST documentation says to always start the LUN numbering at 0 for a target to be recognized; however, with our ESX hosts, we have another disk array on the same SAN with VMFS datastores. The other disk array also contains our boot disk which is LUN 0 -- by default the QLogic HBA BIOS looks for LUN 0 as the boot volume (you can set specific target(s) in the HBA BIOS settings). VMware ESX has “sparse LUN support” enabled by default, so the LUN numbers shouldn’t have to be sequential (it scans 0 to 255). We used a LUN number of 201 for the SSD volume and didn’t have any issues -- maybe other initiator types (Linux / Windows) need to start at 0?
In the above scstadmin commands, I used the ‘-set_grp_attr’ argument; it works, but it is not documented in the help output for the scstadmin command. It should be fixed in future versions.
A few more tweaks related to read ahead and kernel settings; I set the RA value to 1024 KB (512-byte sectors * 2048 = 1048576 / 1024 = 1024 KB -- default was 128 KB) and I seem to recall others saying max_sectors_kb set to 64 was nice (maybe not for us). I added the following to /etc/conf.d/local.start (SCST needs to be restarted after modifying the read ahead value):
/etc/init.d/qla2x00t stop > /dev/null 2>&1
/etc/init.d/scst stop > /dev/null 2>&1
echo 64 > /sys/block/sdb/queue/max_sectors_kb
blockdev --setra 2048 /dev/sdb
/etc/init.d/qla2x00t start > /dev/null 2>&1
/etc/init.d/scst start > /dev/null 2>&1
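The read-ahead arithmetic above is easy to get wrong because blockdev --setra takes a count of 512-byte sectors, not kilobytes. A small sketch of the conversion (ra_sectors_to_kb is a hypothetical helper name):

```sh
#!/bin/sh
# Convert a read-ahead value given in 512-byte sectors to KB.
# ra_sectors_to_kb is a hypothetical helper used only for illustration.
ra_sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

ra_sectors_to_kb 2048   # the value set above
ra_sectors_to_kb 256    # the 128 KB default expressed in sectors
```

The current setting can be read back with blockdev --getra /dev/sdb, which also reports sectors.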
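Next, per the scst/README I modified some settings for CPU / IRQ affinity. Our machine has 8 physical cores, which with Hyper-Threading show up in Linux as 16 logical CPUs. Note that the mask used below, fffc, has bits 2-15 set, so it selects CPUs 2-15 and leaves CPUs 0-1 out (a mask of 3 would select CPUs 0-1 only). I added these to /etc/conf.d/local.start: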
for i in /proc/irq/*; do if [ "$i" == "/proc/irq/default_smp_affinity" ]; then echo fffc > $i; else echo fffc > $i/smp_affinity; fi; done > /dev/null 2>&1
for i in scst_uid scstd{0..15} scst_initd scsi_tm scst_mgmtd; do taskset -p fffc `pidof $i`; done > /dev/null 2>&1
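Affinity masks are hex bitmaps where bit N stands for logical CPU N, which is easy to misread. A minimal sketch for building one from a list of CPU numbers (cpu_mask is a hypothetical helper, not part of SCST or taskset):

```sh
#!/bin/sh
# Build a hex affinity mask from a list of logical CPU numbers.
# cpu_mask is a hypothetical helper used only for illustration.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 1                                  # CPUs 0-1 only
cpu_mask 2 3 4 5 6 7 8 9 10 11 12 13 14 15    # CPUs 2-15
```

The second call produces the fffc value used in the commands above; the first shows what a mask limited to CPUs 0-1 would look like.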
Finally, we enable the targets and configure the zoning on the Fibre Channel switches:
scstadmin -enable_target 21:00:00:1b:32:82:91:f6 -driver qla2x00t
scstadmin -enable_target 21:00:00:1b:32:8a:50:de -driver qla2x00t
scstadmin -nonkey -write_config /etc/scst.conf
VMware ESX Performance
Now a little SCST / VMware ESX 4.1 performance evaluation. I didn’t want to go into the full-bore setup for testing max IOPS / throughput like other articles (eg, http://blogs.vmware.com/performance/2008/05/100000-io-opera.html), but I did want to do a couple simple tests just to see what we’re working with. For a quick n’ dirty test setup, I used a single ESX 4.1 host, created the VMFS file system on our SSD volume (1 MB block size), and then created a new VM: 2 CPUs, 4 GB memory, Windows Server 2008 R2, and added a second 50 GB virtual disk. For the 50 GB virtual disk, be sure to check the “fault tolerance” option -- this will use the thick / eager-zeroed option. Without doing this (eg, using lazy zero), Iometer will give you some crazy numbers (like ~ 2 GB / sec reads) and you won’t see any IO on the SCST disk array; this makes sense I guess since ESX knows that there haven’t been any blocks written, so it’s smart enough to not even try reading from the block device?
Anyways, for our 50 GB test virtual disk, I used it as a physical drive in Iometer (no partition / no NTFS file system). On the ESX host, I used “Round Robin (VMware)” as the path selection policy. Once the Windows guest OS was installed, I did the updates and installed Iometer 2006.07.27. The constants I used in the Iometer test were: (2) workers, number of outstanding IOs set to (64), and checked our 50 GB “physical drive”.
In this Iometer test, I did a 4 MB transfer request size, 100% read, and 100% sequential: ~ 760 MB / sec
I confirmed this number on the SCST disk array server using iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.23 0.00 0.00 99.77
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 4353.00 760.54 0.00 760 0
sda 0.00 0.00 0.00 0 0
I also checked on the ESX service console using esxtop:
4:28:56pm up 28 days 3:16, 142 worlds; CPU load average: 0.03, 0.03, 0.01
ADAPTR PATH NPTH CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cm
vmhba0 - 4 374.41 366.78 7.63 366.79 0.05 18.68 40.62 59.3
vmhba1 - 4 366.97 366.78 0.19 366.78 0.00 62.27 44.86 107.1
vmhba2 - 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba3 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba32 - 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba34 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba35 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
We can see above IO flowing across both of our QLogic Fibre Channel HBAs (round robin path policy); what happens if we cut ESX down to using just one HBA (fixed path policy) -- here is another esxtop:
4:31:52pm up 28 days 3:19, 142 worlds; CPU load average: 0.04, 0.04, 0.02
ADAPTR PATH NPTH CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cm
vmhba0 - 4 390.43 388.15 2.29 375.18 0.00 70.95 89.59 160.5
vmhba1 - 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba2 - 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba3 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba32 - 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba34 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
vmhba35 - 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
So we can see IO goes down by about half with only one HBA -- iostat on the SCST array confirms and so does Iometer in the VM (~ 380 MB / sec). So, it seems the round-robin path policy can be quite beneficial. We can also see that this one VM seems to be maxing out both of our 4 Gbps Fibre Channel HBAs.
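As a rough sanity check on the “maxing out” claim: 4 Gbps Fibre Channel signals at 4.25 Gbaud with 8b/10b encoding, so the per-link ceiling works out to about 425 MB/s before frame headers and protocol overhead. A sketch of that arithmetic (fc_max_mbs is a made-up helper name, and the overhead is deliberately ignored):

```sh
#!/bin/sh
# Upper bound on one-direction payload rate in MB/s for an 8b/10b-encoded
# Fibre Channel link, given its line rate in Gbaud. Ignores frame overhead.
# fc_max_mbs is a hypothetical helper used only for illustration.
fc_max_mbs() {
    awk -v gbaud="$1" 'BEGIN { printf "%.0f\n", gbaud * 1000 * 8 / 10 / 8 }'
}

one_link=$(fc_max_mbs 4.25)
echo "one 4Gb link:  $one_link MB/s"
echo "two 4Gb links: $(( one_link * 2 )) MB/s"
```

Against those ceilings, the ~ 380 MB / sec (one HBA) and ~ 760 MB / sec (two HBAs) results are close to what the links can carry.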
I realize that the tests I ran here and above in the back storage performance section were focused on throughput and not max, small IOPS. I need to educate myself a little more on disk IO performance testing and will run tests again focusing on the high IOPS.
Backup & Recovery
For the actual disk array “server” backup, I just used a simple tar + ssh + public key authentication combo to copy a tarball of the local system files over to another server. This way I have a little more control of when the backup occurs -- we all know our backup admins never purposely have those nightly backup jobs run long, but let’s face it, it happens. The system configuration on our SSD disk array is probably not going to change much, so I just have a cron that runs weekly:
tar cvzfp - /bin /boot /dev /etc /home /lib /lib32 /lib64 /mnt /opt /root /sbin /tmp /usr /var | ssh user@host "cat > hostname_`date +%Y%m%d`.tar.gz"
I also created a simple shell script that checks our logical drives (system RAID volume and SSD RAID volume) to see if a disk failed using the MegaCli utility (runs hourly via cron).
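A check like that could look roughly like the sketch below. It is not the actual script: it assumes that MegaCli64 -LDInfo -Lall -aALL prints one “State : Optimal” style line per logical drive (the exact spacing varies between MegaCli versions), and the function name and the mail command in the comment are placeholders.

```sh
#!/bin/sh
# Reads MegaCli logical-drive info on stdin and prints how many logical
# drives report a State other than "Optimal".
# Assumption: each logical drive has a line of the form "State : Optimal".
count_degraded_lds() {
    awk -F: '/^State[ \t]*:/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim whitespace around the value
        if ($2 != "Optimal") bad++
    } END { print bad + 0 }'
}

# Hypothetical hourly cron usage (recipient and subject are placeholders):
# bad=$(/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | count_degraded_lds)
# [ "$bad" -eq 0 ] || echo "$bad logical drive(s) not Optimal" | mail -s "RAID alert" root
```

Keeping the parsing in a function that reads stdin makes it easy to test against saved MegaCli output before trusting it in cron.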
What happens if our cool, new SSD disk array dies? This isn’t a dual-controller, HA capable storage device -- it is a higher end server that has dual power supplies, multiple HBAs, RAID, memory sparing, etc., but what if the kernel panics? I will say that VMware ESX behaves unexpectedly well in this situation. Go ahead and try it: reboot the disk array. I tried this and the VMFS datastore shows up as inaccessible in vCenter, and everything seems to “pause” quite nicely. When the disk array (VMFS volume) comes back, everything starts working again. That being said, I did notice that if you leave the volume down too long, things do start acting a bit strange (VMs hanging, etc.), but I imagine this is due to the guest OS hitting a timeout or ESX hitting something.
Anyhow, we wanted to have a backup of the volume, just in case. This doesn’t really mean much if only replicas are stored on this volume, but we also keep some parent VMs on it. We wanted to map a volume from a different disk array to our new SSD disk array so we could “clone” it.
To use QLogic HBAs in initiator mode and target mode (default is to disable initiator mode, when target mode is enabled):
echo "options qla2xxx qlini_mode=enabled" > /etc/modprobe.d/qla2xxx.conf
You can check it with “cat /sys/class/fc_host/hostX/device/scsi_host/hostX/active_mode” (should read ‘Initiator, Target’).
However, using our QLogic HBAs in target and initiator mode simultaneously with our SANbox switches did NOT work for us (there is a note in the Kconfig file for the qla2xxx module that mentions some switches don’t handle ports like this well, so I assume we were affected by that). VMware ESX would no longer recognize the volume when we tried this, so we did the sensible thing and ordered an Emulex HBA to keep in initiator mode.
I then mapped a new volume from our primary disk array to the SSD-disk-array-server. I have a cron that runs nightly which “clones” (via dd) the SSD volume. This “cloned” volume is mapped to our ESX hosts at all times. VMware ESX recognizes this volume as a “snapshot” -- it sees that the VMFS label / UUID are the same, but the disk serial numbers are not. It doesn’t mount it by default. It looks like this from the service console:
esxcfg-volume -l
VMFS3 UUID/label: 4d29289d-624abc58-f7b1-001d091c121f/vdi_ssd_vmfs_1
Can mount: No (the original volume is still online)
Can resignature: Yes
Extent name: naa.6000d3100011e5000000000000000079:1 range: 0 - 1217791 (MB)
Using dd to clone the SSD volume probably isn’t a realistic option with a “normal” VMFS volume (one used for reads/writes). In our situation, the SSD volume is used just to store the replicas, so until we recompose using a new snapshot from a parent VM, the data on the volume is likely to stay the same. I haven’t explored the possibility of block-level type snapshots / clones using SCST, but it would be interesting to look at -- I believe device-mapper has some type of “snapshot” support, so maybe that could be used in association with SCST? Something to think about...
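For a mostly static replica volume like ours, the nightly clone can be sketched roughly as below. This is an illustration and not our exact script: the function name is made up, and the source and destination paths are placeholders that you would replace with your own block devices (I’m deliberately not listing ours).

```shell
#!/bin/sh
# Copy one block device (or plain file) onto another with dd.
# Both paths are supplied by the caller; nothing here is specific to SCST.
clone_volume() {
    src="$1"
    dst="$2"
    [ -r "$src" ] || { echo "cannot read source: $src" >&2; return 1; }
    [ -w "$dst" ] || { echo "cannot write destination: $dst" >&2; return 1; }
    # 1 MiB blocks, spelled out in bytes so it does not depend on dd's
    # size suffixes; sync afterwards so buffered writes reach the disk.
    dd if="$src" of="$dst" bs=1048576 && sync
}
```

A crontab entry along the lines of “0 2 * * * /usr/local/sbin/clone_ssd_volume.sh” (a hypothetical path and time) would then call clone_volume with the two device names. Keep in mind that dd copies whatever is on the source at that moment, so this only makes sense while nothing is writing to the volume.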
Anyhow, so we have this cloned volume out there that our ESX servers can now see, but haven’t mounted. Using the esxcfg-volume utility in the service console we can “resignature” our cloned volume. This will write a new signature and allow ESX to mount it as a “new” datastore. It shows up as something like this: snap-084f0837-vdi_ssd_vmfs_1
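If you want to script this, here is a small sketch that picks the labels of resignaturable volumes out of the “esxcfg-volume -l” listing. The function name is hypothetical, it assumes awk is available in the service console, and it assumes the listing looks like the output shown earlier (in particular, labels without spaces).

```shell
#!/bin/sh
# Read "esxcfg-volume -l" output on stdin and print the label of each
# volume that reports "Can resignature: Yes".
resignaturable_labels() {
    awk '
        # Last field is "<UUID>/<label>"; remember the label part.
        /UUID\/label:/ { split($NF, parts, "/"); label = parts[2] }
        /Can resignature: Yes/ { if (label != "") print label; label = "" }
    '
}
```

Usage would be something like “esxcfg-volume -l | resignaturable_labels” to get the label, followed by “esxcfg-volume -r vdi_ssd_vmfs_1” to actually resignature the clone. I would run the second step by hand rather than from cron, since it changes what the ESX hosts see.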
So, this really doesn’t help us tremendously if our SSD volume dies, becomes unavailable, etc., but it would allow us to have access to the data, say if the parent VMs were stored on this datastore. There are probably some fancy things you could do with the View SQL database / View LDAP directory, like changing the datastore name in the records used for the linked-clones. I found an article that is remotely similar to doing something like this: http://virtualgeek.typepad.com/virtual_geek/2009/10/howto-use-site-recovery-manager-and-linked-clones-together.html
Results / Conclusion
I’ve talked a lot about our setup, using SCST, a few performance numbers, etc., but our end goal was to improve VDI performance and our end-user experience. I could run a bunch more numbers, but I think seeing is believing, and the real end-user experience is all that matters. So, for this demonstration, I wanted to see what the speed difference was with a “real” application between linked-clone VMs on our enterprise disk array vs. using our new SSD disk array for reads (replica datastore).
One of our floating, linked-clone pools had a “big” application (QuickBooks 2010) on it which was notoriously slow on our VDI implementation. I created a new pool using the same specifications (Windows 7 32-bit, (2) vCPUs, and (2) GB memory), the same parent VM / snapshot, and used the new View 4.5 feature of specifying a different datastore for replicas. I then used the View Client on a workstation and logged into a new session on each pool with each session (screen) side-by-side.
I used Camtasia Studio to capture the video; for the first clip, I opened QuickBooks 2010. The VM from the current enterprise disk array is on the right, and the VM from the new SSD disk array pool is on the left. Both are fresh VMs and the application hasn’t been opened at all:
Notice that I give the “slow” VM the advantage by clicking the QuickBooks 2010 shortcut there first. At the time this video was taken, the ESX cluster had several hundred VMs powered on, only about half of which had active PCoIP sessions. With large applications, the speed difference is extremely noticeable; with smaller applications such as the Office 2010 products, the difference is still noticeable, but not nearly as dramatic as above.
It will be interesting to see how well this solution scales as our VDI user base grows. As a school, we also have the need for many different pools / parent VMs (lab software licensing restrictions, etc.), so the number of replicas will grow as well as the number of linked-clones that are associated with each replica.
We have been quite impressed with the SCST project and the performance of the SSDs. We are already looking at building new, bigger arrays that will be used for linked-clone datastores (not just read-only replicas) in our View deployment. We are currently considering a 24-slot setup from Silicon Mechanics...