Saturday, January 17, 2015

Crazy Performance From Something So Small

So, I did a refresh on my home machine recently, or really just an entirely new machine... I picked up a used Dell Precision T7500 workstation (24 GB memory, 2 x Xeon W5590 processors). I also bought a used Fusion-io ioDrive 160 GB SLC flash memory device. I knew it was going to be fast, but I was surprised at just how fast such a little card could be.

I'm running Fedora 21 "Workstation" on this system. The driver for this card, called "VSL", is available from fusionio.com, but you need to create an account first to access it. It also appears there is a newer version of the driver/firmware if you pay for a support contract. I used version 2.3.11 of the driver, which lists support for Fedora 17. The driver is written for older kernels, so I had to change it a bit to work with 3.x -- let me know if you're interested in the changes needed for newer kernels.

Anyhow, here is a quick peek at the performance numbers on this system using the fio tool...

--snip--
# fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/fioa --size=10G
/dev/fioa: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [750.3MB/0KB/0KB /s] [192K/0/0 iops] [eta 00m:00s]
/dev/fioa: (groupid=0, jobs=1): err= 0: pid=1406: Sat Jan 17 11:00:38 2015
  read : io=10240MB, bw=763767KB/s, iops=190941, runt= 13729msec
    slat (usec): min=1, max=172, avg= 2.85, stdev= 2.61
    clat (usec): min=199, max=3604, avg=331.24, stdev=77.36
     lat (usec): min=201, max=3625, avg=334.22, stdev=77.33
    clat percentiles (usec):
     |  1.00th=[  245],  5.00th=[  253], 10.00th=[  270], 20.00th=[  294],
     | 30.00th=[  318], 40.00th=[  326], 50.00th=[  330], 60.00th=[  330],
     | 70.00th=[  334], 80.00th=[  350], 90.00th=[  402], 95.00th=[  426],
     | 99.00th=[  454], 99.50th=[  462], 99.90th=[  540], 99.95th=[ 2544],
     | 99.99th=[ 2992]
    bw (KB  /s): min=673840, max=768568, per=100.00%, avg=763737.48, stdev=18102.33
    lat (usec) : 250=3.48%, 500=96.39%, 750=0.04%, 1000=0.01%
    lat (msec) : 2=0.03%, 4=0.06%
  cpu          : usr=23.24%, sys=62.81%, ctx=254638, majf=0, minf=664
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=2621440/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=10240MB, aggrb=763767KB/s, minb=763767KB/s, maxb=763767KB/s, mint=13729msec, maxt=13729msec

Disk stats (read/write):
  fioa: ios=2607327/0, merge=31/0, ticks=815401/0, in_queue=815145, util=99.34%
--snip--

--snip--
# fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/fioa --size=10G
/dev/fioa: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0KB/747.3MB/0KB /s] [0/191K/0 iops] [eta 00m:00s]
/dev/fioa: (groupid=0, jobs=1): err= 0: pid=1433: Sat Jan 17 11:01:49 2015
  write: io=10240MB, bw=746955KB/s, iops=186738, runt= 14038msec
    slat (usec): min=1, max=192, avg= 3.33, stdev= 2.83
    clat (usec): min=192, max=3048, avg=338.28, stdev=70.32
     lat (usec): min=194, max=3052, avg=341.74, stdev=70.41
    clat percentiles (usec):
     |  1.00th=[  262],  5.00th=[  282], 10.00th=[  298], 20.00th=[  310],
     | 30.00th=[  318], 40.00th=[  322], 50.00th=[  330], 60.00th=[  334],
     | 70.00th=[  342], 80.00th=[  366], 90.00th=[  398], 95.00th=[  414],
     | 99.00th=[  454], 99.50th=[  478], 99.90th=[ 1144], 99.95th=[ 2024],
     | 99.99th=[ 2800]
    bw (KB  /s): min=660624, max=765872, per=99.99%, avg=746907.14, stdev=25759.49
    lat (usec) : 250=0.32%, 500=99.39%, 750=0.18%, 1000=0.01%
    lat (msec) : 2=0.06%, 4=0.05%
  cpu          : usr=23.67%, sys=68.75%, ctx=110028, majf=0, minf=431
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=10240MB, aggrb=746955KB/s, minb=746955KB/s, maxb=746955KB/s, mint=14038msec, maxt=14038msec

Disk stats (read/write):
  fioa: ios=109/2595463, merge=110/28, ticks=9/814160, in_queue=813744, util=99.39%
--snip--

So, in both of those tests, the first being 100% random, 100% read with 4K IOs, I'm getting 192K (192,000) IOPS! And in the second test, 100% random, 100% write with 4K IOs: 191K (191,000) IOPS! That's pretty fast for such a little package... just a single PCIe flash device.
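The headline number checks out against the raw counters in the fio output above: 10 GiB of 4 KiB reads is 2,621,440 IOs (matching `issued: total=r=2621440`), completed in a 13,729 ms runtime. A quick sanity check:

```shell
# Total IOs divided by runtime (13.729 s) should land right on the
# iops=190941 figure fio reported in the read line above.
awk 'BEGIN { printf "%d IOPS\n", 2621440 / 13.729 }'
```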

And for some sequential IO tests with a much larger IO size...

--snip--
# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=64 --name=/dev/fioa --size=10G
/dev/fioa: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [R] [92.9% done] [800.0MB/0KB/0KB /s] [200/0/0 iops] [eta 00m:01s]
/dev/fioa: (groupid=0, jobs=1): err= 0: pid=1452: Sat Jan 17 11:06:41 2015
  read : io=10240MB, bw=819392KB/s, iops=200, runt= 12797msec
    slat (usec): min=110, max=19743, avg=4959.85, stdev=8374.12
    clat (msec): min=92, max=392, avg=312.69, stdev=20.35
     lat (msec): min=92, max=411, avg=317.65, stdev=18.69
    clat percentiles (msec):
     |  1.00th=[  212],  5.00th=[  302], 10.00th=[  302], 20.00th=[  302],
     | 30.00th=[  322], 40.00th=[  322], 50.00th=[  322], 60.00th=[  322],
     | 70.00th=[  322], 80.00th=[  322], 90.00th=[  322], 95.00th=[  322],
     | 99.00th=[  322], 99.50th=[  334], 99.90th=[  392], 99.95th=[  392],
     | 99.99th=[  392]
    bw (KB  /s): min=442593, max=835584, per=97.73%, avg=800802.08, stdev=80018.74
    lat (msec) : 100=0.12%, 250=1.33%, 500=98.55%
  cpu          : usr=0.07%, sys=3.44%, ctx=1028, majf=0, minf=65543
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=0.6%, 32=1.2%, >=64=97.5%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=2560/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=10240MB, aggrb=819392KB/s, minb=819392KB/s, maxb=819392KB/s, mint=12797msec, maxt=12797msec

Disk stats (read/write):
  fioa: ios=20256/0, merge=0/0, ticks=1799659/0, in_queue=1806118, util=99.29%
--snip--

--snip--
# fio --bs=4m --direct=1 --rw=write --ioengine=libaio --iodepth=64 --name=/dev/fioa --size=10G
/dev/fioa: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0KB/767.3MB/0KB /s] [0/191/0 iops] [eta 00m:00s]
/dev/fioa: (groupid=0, jobs=1): err= 0: pid=1448: Sat Jan 17 11:06:11 2015
  write: io=10240MB, bw=786157KB/s, iops=191, runt= 13338msec
    slat (usec): min=124, max=20466, avg=5167.94, stdev=8529.73
    clat (msec): min=99, max=412, avg=326.06, stdev=21.06
     lat (msec): min=99, max=413, avg=331.23, stdev=19.41
    clat percentiles (msec):
     |  1.00th=[  225],  5.00th=[  314], 10.00th=[  314], 20.00th=[  314],
     | 30.00th=[  334], 40.00th=[  334], 50.00th=[  334], 60.00th=[  334],
     | 70.00th=[  334], 80.00th=[  334], 90.00th=[  334], 95.00th=[  334],
     | 99.00th=[  334], 99.50th=[  351], 99.90th=[  412], 99.95th=[  412],
     | 99.99th=[  412]
    bw (KB  /s): min=407157, max=802816, per=98.28%, avg=772616.08, stdev=74921.31
    lat (msec) : 100=0.12%, 250=1.17%, 500=98.71%
  cpu          : usr=3.31%, sys=2.05%, ctx=1139, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=0.6%, 32=1.2%, >=64=97.5%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=2560/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=10240MB, aggrb=786156KB/s, minb=786156KB/s, maxb=786156KB/s, mint=13338msec, maxt=13338msec

Disk stats (read/write):
  fioa: ios=59/20405, merge=55/0, ticks=7/1888035, in_queue=1893181, util=98.52%
--snip--

So with 100% sequential, 100% read using 4M IOs we see 800 MB/sec; with the same test using writes I'm seeing 767 MB/sec. Pretty fast! I'm not sure where the bottleneck is here... I believe this card is PCIe 2.0 x4, so that bus may be the limiting factor; I'll have to look into it. Either way, the random IO performance is really where it's at, and I am very much impressed.
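Some back-of-the-envelope math on the bus question (my own numbers, not from any Fusion-io documentation): PCIe 2.0 runs 5 GT/s per lane with 8b/10b encoding, roughly 500 MB/s usable per lane per direction, and PCIe 1.x is half that. So a 2.0 x4 link should have plenty of headroom above 800 MB/s, while a 1.x x4 link would be much closer to what we're seeing:

```shell
# Rough usable bandwidth per direction: lanes * ~500 MB/s (PCIe 2.0, 8b/10b),
# or lanes * ~250 MB/s for PCIe 1.x.
awk 'BEGIN { printf "%d MB/s (PCIe 2.0 x4)  %d MB/s (PCIe 1.x x4)\n", 4*500, 4*250 }'
# The actual negotiated link can be checked with: lspci -vv | grep -E 'LnkCap|LnkSta'
```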

Monday, July 14, 2014

Open Storage: Dual-Controller OSS Disk Array

Introduction
The LSI Syncro CS controllers are here, and they are what the open storage community has been longing for. If you work in enterprise IT, and you’re not familiar with open storage, then you’re missing out; here is a nice article by Aaron Newcomb describing open storage: https://blogs.oracle.com/openstorage/entry/fishworks_virtualbox_tutorial

The setup described in this article is a proof of concept (POC) for our institution that will lead us to replacing all (80+ TB) of our commercial/proprietary disk arrays that sit on our Fibre Channel (FC) Storage Area Network (SAN) with ESOS-based storage arrays.

With this dual-controller ESOS storage setup, we are also testing a new SAN "medium": Fibre Channel over Ethernet (FCoE). A converged network is really where it's at -- and for us, 10 GbE or even 40 GbE provides plenty of bandwidth for sharing. We're quite excited about this new SAN technology and hope to replace all of our traditional Fibre Channel switches some day.


The Setup
We have (2) 40 GbE switches in this test SAN environment, and we're using 10 GbE CNAs for our targets and initiators. We connected one port on each server (the targets, the ESOS storage servers, and the initiators, the ESXi hosts) to each switch, so each device is connected to both fabrics. Each server has an iDRAC, and we connected those to our management network. Each server also has (2) Ethernet NICs; we connected one on each server to our primary server network, and used a short 1’ cable to connect the two systems directly together to create a private network between them. We’ll use two Corosync rings on these interfaces.

Both of the servers came with 8 GB of RAM each, and since we use vdisk_blockio mode for the SCST devices, we’re going to use very little physical RAM. We enabled ‘memory mirroring’ mode for redundancy/protection, which gives us 4 GB of usable physical RAM on each server -- more than enough.



Two (2) Dell PowerEdge R420 (1U) servers:
  • (2) x Intel Xeon E5-2430 2.20GHz, 15M Cache, 7.2GT/s QPI, Turbo, 6C, 95W
  • (4) x 2GB RDIMM, 1333 MT/s, Low Volt, Single Rank, x8 Data Width
  • (1) x  Emulex OCe14102-UM Dual-channel, 10GBASE-SR SFP+ CNA Adapter
  • (1) x Dual Hot Plug Power Supplies 550W
Syncro shared DAS storage:
  • (1) x LSI Syncro CS 9286-8e (includes two Syncro CS 9271-8i HA controllers with CacheVault)
SAS enclosure (JBOD):
  • (1) x DataON Storage DNS-1600D (4U 24-bay Modular-SBB compliant)





Getting Started
We used the latest and greatest version of Enterprise Storage OS (ESOS) and installed it on both of our USB flash drives using a CentOS Linux system:
wget --no-check-certificate https://6f70a7c9edce5beb14bb23b042763f258934b7b9.googledrive.com/host/0B-MvNl-PpBFPbXplMmhwaElid0U/esos-0.1-r663.zip
unzip esos-0.1-r663.zip
cd esos-0.1-r663
./install.sh

When prompted during the ESOS installer, I added both the MegaCLI tool and the StorCLI tool.

Next, we booted both systems up and set the USB flash drive as the boot device. After each host loaded Enterprise Storage OS, we then configured the network interface cards, host/domain names, DNS, date/time settings, setup mail (SMTP), and set the root password.

Then we checked that both Syncro CS controllers were on the latest firmware (they were).


The LSI Syncro Controllers
So, now that we have our two Enterprise Storage OS systems up and running, let’s take a look at the new Syncro CS controller locally. First, a note on the MegaCLI tool, the StorCLI tool, and making volumes using the TUI in ESOS. I haven’t read this first-hand, but it seems the StorCLI tool is the successor to MegaCLI. It appears you can use either MegaCLI or StorCLI interchangeably with the Syncro CS controller; however, it looks like you can only use StorCLI to create a VD that is “exclusive” (not shared between both nodes). When creating VDs with MegaCLI, it’s always a shared VD. The TUI in ESOS makes use of the MegaCLI tool, so that works with this controller; however, it currently only supports basic VD creation/modification (no spanned VDs, no CacheCade stuff, etc.).

We used the StorCLI tool on the CLI to create a test virtual/logical drive on the Syncro:
storcli64 /c0 add vd r10 drives=8:1,8:2,8:3,8:4 wt nora pdperarray=2

The interesting thing to note is that the volume created above is now “owned” by the controller it was created on. Try this command on both nodes (showing that the volume is also visible/usable via MegaCLI):
MegaCli64 -ldinfo -lall -a0

If you run that on the node you created the volume on, it will show the volume; if you run it on the other node, it will not. However, the volume is most definitely accessible and usable in the OS by both nodes:
[root@blackberry ~]# sg_inq /dev/sda
standard INQUIRY:
 PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
 [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
 SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
 EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
 [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
 [SPI: Clocking=0x0  QAS=0  IUS=0]
   length=96 (0x60)   Peripheral device type: disk
Vendor identification: LSI     
Product identification: MR9286-8eHA     
Product revision level: 3.33
Unit serial number: 00a239ac8833a7ac1ad04dae06b00506
[root@gooseberry ~]# sg_inq /dev/sda
standard INQUIRY:
 PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
 [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
 SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
 EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
 [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
 [SPI: Clocking=0x0  QAS=0  IUS=0]
   length=96 (0x60)   Peripheral device type: disk
Vendor identification: LSI     
Product identification: MR9286-8eHA     
Product revision level: 3.33
Unit serial number: 00a239ac8833a7ac1ad04dae06b00506

Let’s check out the performance (locally) of the new controllers using the fio tool. For this volume (created above) we’re using STEC s842 200GB SAS SSDs. First, read performance: 100% random read, 4 KB IO size, 10 GB of data:
[root@gooseberry ~]# fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [435.3M/0K/0K /s] [111K/0 /0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=2753: Fri Mar  7 11:04:19 2014
 read : io=10240MB, bw=576077KB/s, iops=144019 , runt= 18202msec
   slat (usec): min=3 , max=260 , avg= 5.29, stdev= 1.99
   clat (usec): min=31 , max=1491 , avg=437.85, stdev=143.95
    lat (usec): min=35 , max=1495 , avg=443.28, stdev=144.88
   clat percentiles (usec):
    |  1.00th=[  199],  5.00th=[  251], 10.00th=[  282], 20.00th=[  322],
    | 30.00th=[  354], 40.00th=[  382], 50.00th=[  410], 60.00th=[  438],
    | 70.00th=[  474], 80.00th=[  532], 90.00th=[  684], 95.00th=[  756],
    | 99.00th=[  812], 99.50th=[  844], 99.90th=[  908], 99.95th=[  940],
    | 99.99th=[ 1020]
   bw (KB/s)  : min=331552, max=643216, per=100.00%, avg=578813.56, stdev=85260.02
   lat (usec) : 50=0.01%, 100=0.01%, 250=4.86%, 500=70.15%, 750=19.65%
   lat (usec) : 1000=5.32%
   lat (msec) : 2=0.01%
 cpu          : usr=21.20%, sys=78.21%, ctx=9451, majf=0, minf=89
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=2621440/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  READ: io=10240MB, aggrb=576077KB/s, minb=576077KB/s, maxb=576077KB/s, mint=18202msec, maxt=18202msec

Disk stats (read/write):
 sda: ios=2615354/0, merge=0/0, ticks=751429/0, in_queue=755922, util=99.59%

Looks like we’re getting right around 144K IOPS -- not too shabby. Now let’s check out writes: 100% random write, 4 KB IO size, 10 GB of data:
[root@gooseberry ~]# fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/306.6M/0K /s] [0 /78.5K/0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=2759: Fri Mar  7 11:10:30 2014
 write: io=10240MB, bw=309707KB/s, iops=77426 , runt= 33857msec
   slat (usec): min=6 , max=278 , avg= 9.51, stdev= 2.94
   clat (usec): min=44 , max=22870 , avg=814.95, stdev=391.06
    lat (usec): min=60 , max=22881 , avg=824.70, stdev=391.04
   clat percentiles (usec):
    |  1.00th=[  111],  5.00th=[  231], 10.00th=[  370], 20.00th=[  652],
    | 30.00th=[  732], 40.00th=[  740], 50.00th=[  748], 60.00th=[  756],
    | 70.00th=[  772], 80.00th=[ 1004], 90.00th=[ 1384], 95.00th=[ 1624],
    | 99.00th=[ 1976], 99.50th=[ 2096], 99.90th=[ 2288], 99.95th=[ 2320],
    | 99.99th=[ 2480]
   bw (KB/s)  : min=281264, max=328936, per=99.98%, avg=309659.58, stdev=11008.89
   lat (usec) : 50=0.01%, 100=0.70%, 250=4.94%, 500=9.00%, 750=37.26%
   lat (usec) : 1000=28.02%
   lat (msec) : 2=19.21%, 4=0.87%, 50=0.01%
 cpu          : usr=22.62%, sys=73.73%, ctx=24146, majf=0, minf=25
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
 WRITE: io=10240MB, aggrb=309707KB/s, minb=309707KB/s, maxb=309707KB/s, mint=33857msec, maxt=33857msec

Disk stats (read/write):
 sda: ios=0/2604570, merge=0/0, ticks=0/939581, in_queue=941070, util=99.86%

And for writes it looks like we’re getting about 77K IOPS (4 KB). Now let’s see what the performance numbers are like on the other node (the non-owner node/controller); we’ll run the same tests as above:
[root@blackberry ~]# fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [180.8M/0K/0K /s] [46.3K/0 /0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=9445: Fri Mar  7 11:17:45 2014
 read : io=10240MB, bw=173064KB/s, iops=43265 , runt= 60589msec
   slat (usec): min=5 , max=189 , avg= 8.40, stdev= 5.51
   clat (usec): min=829 , max=18076 , avg=1468.25, stdev=90.07
    lat (usec): min=837 , max=18084 , avg=1476.89, stdev=90.20
   clat percentiles (usec):
    |  1.00th=[ 1320],  5.00th=[ 1400], 10.00th=[ 1432], 20.00th=[ 1448],
    | 30.00th=[ 1464], 40.00th=[ 1464], 50.00th=[ 1480], 60.00th=[ 1480],
    | 70.00th=[ 1480], 80.00th=[ 1496], 90.00th=[ 1496], 95.00th=[ 1512],
    | 99.00th=[ 1576], 99.50th=[ 1624], 99.90th=[ 1672], 99.95th=[ 1704],
    | 99.99th=[ 2096]
   bw (KB/s)  : min=168144, max=191424, per=99.99%, avg=173050.38, stdev=2949.87
   lat (usec) : 1000=0.01%
   lat (msec) : 2=99.98%, 4=0.02%, 20=0.01%
 cpu          : usr=12.14%, sys=39.81%, ctx=204963, majf=0, minf=89
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=2621440/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  READ: io=10240MB, aggrb=173063KB/s, minb=173063KB/s, maxb=173063KB/s, mint=60589msec, maxt=60589msec

Disk stats (read/write):
 sda: ios=2613079/0, merge=0/0, ticks=3627011/0, in_queue=3626141, util=99.86%
[root@blackberry ~]# fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/sda --size=10G
/dev/sda: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/137.8M/0K /s] [0 /35.3K/0  iops] [eta 00m:00s]
/dev/sda: (groupid=0, jobs=1): err= 0: pid=9449: Fri Mar  7 11:19:35 2014
 write: io=10240MB, bw=132518KB/s, iops=33129 , runt= 79127msec
   slat (usec): min=3 , max=203 , avg=10.97, stdev= 6.88
   clat (usec): min=289 , max=18464 , avg=1917.83, stdev=318.12
    lat (usec): min=299 , max=18472 , avg=1929.07, stdev=317.61
   clat percentiles (usec):
    |  1.00th=[ 1240],  5.00th=[ 1512], 10.00th=[ 1656], 20.00th=[ 1784],
    | 30.00th=[ 1832], 40.00th=[ 1864], 50.00th=[ 1912], 60.00th=[ 1928],
    | 70.00th=[ 1944], 80.00th=[ 1976], 90.00th=[ 2064], 95.00th=[ 2640],
    | 99.00th=[ 3120], 99.50th=[ 3248], 99.90th=[ 3440], 99.95th=[ 3504],
    | 99.99th=[ 3696]
   bw (KB/s)  : min=127176, max=142088, per=100.00%, avg=132523.65, stdev=1832.69
   lat (usec) : 500=0.01%, 750=0.04%, 1000=0.21%
   lat (msec) : 2=87.54%, 4=12.21%, 20=0.01%
 cpu          : usr=10.16%, sys=40.90%, ctx=197327, majf=0, minf=25
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
 WRITE: io=10240MB, aggrb=132518KB/s, minb=132518KB/s, maxb=132518KB/s, mint=79127msec, maxt=79127msec

Disk stats (read/write):
 sda: ios=0/2614355, merge=0/0, ticks=0/4791266, in_queue=4790636, util=99.91%

Whoa, so there is definitely a difference in performance when accessing a volume from the non-owner node. It turns out that when IO is sent through the non-owner node, it goes through a process called “IO shipping”: the non-owner node must communicate with the owner of the volume before data can be read or written, which increases response time and thus reduces IOPS. This will be important to know for our setup below.
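Putting numbers on that penalty from the fio runs above (144,019 vs 43,265 random-read IOPS and 77,426 vs 33,129 random-write IOPS, owner vs non-owner):

```shell
# Ratio of owner-node IOPS to non-owner-node IOPS from the runs above.
awk 'BEGIN { printf "reads: %.1fx  writes: %.1fx\n", 144019/43265, 77426/33129 }'
```

So IO shipping costs us roughly 3x on reads and over 2x on writes in this test.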

So, we learned that accessing the shared virtual drives from each controller is not equal: the owning controller/node has the best performance. Now, if you reboot (turn off, kill, fail, whatever) the owner node, ownership is transferred to the surviving node, and then you see the good performance numbers on that controller. When the downed node comes back up, the VD ownership is NOT transferred back (it does not “fail back”). This is important: if you wanted to try an “active-active” setup (not true active-active, but divvying up the VDs across both controllers), this won’t work. You could set it up that way, but after a failover or reboot, the ownership will end up all lopsided. I was curious if we could manually control volume ownership without having to reboot (e.g., via the CLI), but I didn’t see anything in the documentation. I’ve asked LSI support, but have not gotten an answer yet. I would hope that feature is coming in the future; if we could control VD/LD ownership “live” (inside the OS), we could script it as part of our cluster setup (described below). Until (if ever) that feature is available, we’ll have to do an active-passive setup where all virtual drives are owned by a single controller, and when an event occurs (failure, reboot, etc.), they are all transferred to the other node.


SCSI PRs + LSI Syncro + SCST
Now let’s take a look at SCSI persistent reservations (PRs) and see how they work with the LSI Syncro CS controllers locally. We still have the volume we created above, so we’ll test with that. Let’s check for existing PR keys and register a new one:
[root@blackberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x0, there are NO registered reservation keys
[root@blackberry ~]# sg_persist --no-inquiry --out --register --param-sark=123abc /dev/sda
[root@blackberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

Now let’s take a look from the other node:
[root@gooseberry ~]# sg_persist --no-inquiry --read-keys /dev/sda
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

So we have confirmed we can read the key on both nodes; now let’s try reserving the device:
[root@blackberry ~]# sg_persist --no-inquiry --out --reserve --param-rk=123abc --prout-type=1 /dev/sda
[root@blackberry ~]# sg_persist --no-inquiry --read-reservation /dev/sda
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

And we can see it on the other node:
[root@gooseberry ~]# sg_persist --no-inquiry --read-reservation /dev/sda
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

Looks like we’re good on SCSI persistent reservations locally, which we expected. Now let’s test SCSI PRs when combined with SCST. First we’ll create a vdisk_blockio SCST device using our SSD volume from above as the back-end storage. We’ll then map it to a LUN for each target (each fabric), and these will be visible to our Linux initiator test system. Verify we can see the volumes on the initiator (not using multipath-tools, since we want to easily see distinct devices for each target/node in this test):
raspberry ~ # lsscsi
[1:0:0:0]    cd/dvd  TEAC     DVD-ROM DV28SV   D.0J  /dev/sr0
[2:0:0:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:0:1:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:1:0:0]    disk    Dell     VIRTUAL DISK     1028  /dev/sda
[4:0:0:0]    disk    SCST_BIO blackberry_test   300  /dev/sdb
[5:0:2:0]    disk    SCST_BIO gooseberry_test   300  /dev/sdc
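For reference, the device creation and LUN mapping described above can be done with scstadmin along these lines (a sketch: the target driver and WWN below are placeholders, not our actual configuration; the device name matches the lsscsi output above):

```shell
# Register the shared Syncro VD as a vdisk_blockio SCST device
scstadmin -open_dev blackberry_test -handler vdisk_blockio \
    -attributes filename=/dev/sda

# Map it to LUN 0 on a target (driver name and target WWN are placeholders)
scstadmin -add_lun 0 -driver fcst -target xx:xx:xx:xx:xx:xx \
    -device blackberry_test
```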

Check there are no existing PR keys visible (on either node):
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
 PR generation=0x0, there are NO registered reservation keys
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
 PR generation=0x0, there are NO registered reservation keys

Let’s start by making a SCSI PR key and reservation on one of the systems:
raspberry ~ # sg_persist --no-inquiry --out --register --param-sark=123abc /dev/sdb
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
 PR generation=0x1, 1 registered reservation key follows:
   0x123abc

raspberry ~ # sg_persist --no-inquiry --out --reserve --param-rk=123abc --prout-type=1 /dev/sdb
raspberry ~ # sg_persist --no-inquiry --read-reservation /dev/sdb
 PR generation=0x1, Reservation follows:
   Key=0x123abc
   scope: LU_SCOPE,  type: Write Exclusive

So, we see the key registered and the reservation active on that path/node (“blackberry”); now let’s see if it’s visible on “gooseberry”:
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
 PR generation=0x0, there are NO registered reservation keys

raspberry ~ # sg_persist --no-inquiry --read-reservation /dev/sdc
 PR generation=0x0, there is NO reservation held

Nope, it’s not! This was expected as well. The vdisk_* device handlers in SCST emulate SCSI commands (e.g., the SCSI persistent reservations), so these are stored in a software layer between the initiators and the back-end storage -- the SCSI commands aren’t passed directly to the Syncro CS controllers with these device handlers. Let’s try the same test with the SCSI disk pass-through (dev_disk) handler.

Check that the SCSI device nodes show up on the initiator side:
raspberry ~ # lsscsi
[1:0:0:0]    cd/dvd  TEAC     DVD-ROM DV28SV   D.0J  /dev/sr0
[2:0:0:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:0:1:0]    disk    FUJITSU  MAX3073RC        D206  -       
[2:1:0:0]    disk    Dell     VIRTUAL DISK     1028  /dev/sda
[4:0:0:0]    disk    LSI      MR9286-8eHA      3.33  /dev/sdb
[5:0:2:0]    disk    LSI      MR9286-8eHA      3.33  /dev/sdc

Make sure there are no existing PR keys:
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdb
PR in: command not supported
raspberry ~ # sg_persist --no-inquiry --read-keys /dev/sdc
PR in: command not supported

Snap! Something is wrong… we see this message on the ESOS target side:
[1902219.536846] scst: ***WARNING***: PR commands for pass-through devices not supported (device 0:2:0:0)

Looks like SCSI persistent reservations are not supported with SCST using pass-through devices, period. It would be interesting to find out if there is a technical reason that SCSI PRs aren’t supported with the pass-through handlers, or if it’s something that just hasn’t been implemented yet (in SCST). Either way, for us and our project, it doesn’t really matter -- we don’t need to support SCSI PRs (we use VMware ESXi 5.5). I’m pretty sure the Microsoft cluster stuff relies on persistent reservations, but again, we’re not going to be doing any of that with our setup.


Syncro CS Storage Setup
For this test setup, we’ll be testing with (2) Dell PowerEdge R720s running ESXi 5.5, hosting a large pool of Windows 8.1 virtual desktops. For a VDI setup, we keep the replicas (for linked clones) on fast (SSD) storage, and we’ll use the 15K disks for the linked-clone datastores. So, with the 24 slots in the DataON JBOD, we’ll split up the storage like this:
  • (1) RAID10 volume consisting of (4) STEC s842 SSD’s
  • (2) CacheCade volumes consisting of (1) STEC s842 SSD each in RAID0 (one for each controller)
  • (1) STEC s842 hot spare
  • (2) RAID10 volumes consisting of (8) Hitachi 15K drives each
  • (1) Hitachi 15K hot spare
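As a quick sanity check, that layout accounts for every slot in the 24-bay enclosure:

```shell
# 4 SSDs (RAID10) + 2 CacheCade SSDs + 1 SSD spare
# + 16 15K drives (2 x RAID10) + 1 15K spare = 24 slots
awk 'BEGIN { print 4 + 2 + 1 + 16 + 1 }'
```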



First we’ll create our SSD RAID10 volume (disable read and write cache for SSD volumes):
storcli64 /c0 add vd r10 drives=8:1,8:2,8:3,8:4 wt nora pdperarray=2

Now create both of the SAS 15K RAID10 volumes (read/write cache + CacheCade):
storcli64 /c0 add vd r10 drives=8:8,8:9,8:10,8:11,8:12,8:13,8:14,8:15 wb ra pdperarray=4
storcli64 /c0 add vd r10 drives=8:16,8:17,8:18,8:19,8:20,8:21,8:22,8:23 wb ra pdperarray=4

Add the CacheCade volumes (CacheCade VDs are like exclusive VDs, so we created one on each cluster node and don’t need to worry about which one is “active”):
storcli64 /c0 add vd cachecade type=raid0 drives=8:5 wt
storcli64 /c0 add vd cachecade type=raid0 drives=8:6 wt

One interesting point to note: when we attempted to assign VDs to a CacheCade volume before writing this document (testing), we got the following error with the Syncro CS:
[root@blackberry ~]# storcli64 /c0/v3 set ssdcaching=on
Controller = 0
Status = Failure
Description = None

Detailed Status :
===============

-----------------------------------------------------------------------------------
VD Property   Value Status ErrCd ErrMsg                                            
-----------------------------------------------------------------------------------
3 SSDCaching On    Failed  1001 Controller doesn't support manual SSC Association
-----------------------------------------------------------------------------------


We opened a support case with LSI, and were told to set up the CacheCade volume this way (associating the VDs at CacheCade VD creation time):
[root@blackberry ~]# storcli64 /c0 add vd cachecade type=raid0 drives=8:5 wt assignvds=2
Controller = 0
Status = Failure
Description = Controller doesn't support manual SSC Association


It fails with the same error, so we asked LSI again; this time their solution was to use MSM or WebBIOS. We tried WebBIOS, and after a long, confusing journey, we didn’t get any farther configuring CacheCade with that method either. From what we’ve read, there is no “manual” association of VDs with a CacheCade volume; it’s all “automatic” (supposedly). We left this as is for now, with a CacheCade VD on each controller, and we’ll revisit it during our testing to confirm it is (or isn’t) working correctly.
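To sanity-check whether CacheCade is actually engaged, the virtual drive properties can be dumped with StorCLI. A sketch only; the controller number matches our example setup and may differ on your system:

```shell
# Show all virtual drives and their properties on controller 0;
# look at the cache column and any CacheCade association details
storcli64 /c0/vall show all

# Narrow the output down to CacheCade-related lines
storcli64 /c0/vall show all | grep -i cachecade
```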

The SSD hot spare (global):
storcli64 /c0/e8/s7 add hotsparedrive

Now add the 15K global hot spare drive:
storcli64 /c0/e8/s24 add hotsparedrive

Next we set meaningful names for each of the volumes:
storcli64 /c0/v0 set name=SSD_R10_1
storcli64 /c0/v2 set name=15K_R10_1
storcli64 /c0/v3 set name=15K_R10_2


Cluster Setup
We’ll start this by enabling all of the services needed and disabling anything we’re not going to use. Edit ‘/etc/rc.conf’ and set the following (on both nodes):
rc.openibd_enable=NO
rc.opensm_enable=NO
rc.sshd_enable=YES
rc.mdraid_enable=NO
rc.lvm2_enable=NO
rc.eio_enable=NO
rc.dmcache_enable=NO
rc.btier_enable=NO
rc.drbd_enable=NO
rc.corosync_enable=YES
rc.dlm_enable=NO
rc.clvmd_enable=NO
rc.pacemaker_enable=YES
rc.fsmount_enable=YES
rc.mhvtl_enable=NO
rc.scst_enable=NO
rc.perfagent_enable=NO
rc.nrpe_enable=NO
rc.snmpd_enable=NO
rc.snmptrapd_enable=NO
rc.nut_enable=NO
rc.smartd_enable=NO

The cluster will manage SCST, so we disable it above. We also disable other things we’re not going to use in this setup (md software RAID, LVM, etc.).

Next, generate the Corosync key on one system and scp it to the other (check permissions and make them match if needed):
corosync-keygen
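Copying the key to the second node might look something like this; the path and permissions follow the Corosync defaults, and the host name is the second node from our setup (adjust for your environment):

```shell
# Corosync expects the authkey at /etc/corosync/authkey, readable by root only
scp /etc/corosync/authkey root@gooseberry:/etc/corosync/authkey
ssh root@gooseberry "chown root:root /etc/corosync/authkey && chmod 400 /etc/corosync/authkey"
```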

Now you can create/edit your corosync.conf file; we won’t go into all of the specifics of our configuration, as there is plenty of documentation on Corosync out there. Here it is:
totem {
       version: 2
       cluster_name: esos_syncro
       crypto_cipher: aes256
       crypto_hash: sha1
       rrp_mode: passive
       interface {
               ringnumber: 0
               bindnetaddr: 10.35.6.0
               mcastaddr: 226.94.1.3
               mcastport: 5411
               ttl: 1
       }
       interface {
               ringnumber: 1
               bindnetaddr: 192.168.1.0
               mcastaddr: 226.94.1.4
               mcastport: 5413
               ttl: 1
       }
}

nodelist {
       node {
               ring0_addr: 10.35.6.11
               nodeid: 1
       }
       node {
               ring0_addr: 10.35.6.12
               nodeid: 2
       }
}

logging {
       fileline: off
       to_stderr: no
       to_syslog: yes
       syslog_facility: local2
       debug: off
       timestamp: off
       logger_subsys {
               subsys: QUORUM
               debug: off
       }
}

quorum {
       provider: corosync_votequorum
       two_node: 1
}

Now let’s start Corosync (on both nodes) and check the status of the rings:
/etc/rc.d/rc.corosync start
corosync-cfgtool -s

We can now start Pacemaker on both nodes and check it:
/etc/rc.d/rc.pacemaker start
crm configure show

I was initially planning on doing fencing using SCSI persistent reservations (PRs), as the LSI manual describes for a DAS / local application setup, but this may not be the best option since a SCSI fence would not change controller/volume ownership -- or would it? For this setup, we decided not to use any fencing. We’re providing shared storage with this ESOS cluster, not running the application on the cluster itself. After some internal discussion, we concluded that fencing added complexity we did not want, with no apparent benefit for this solution (if anyone can tell us different, we'd happily listen). So, we can go ahead and disable STONITH for this cluster:
crm configure property stonith-enabled="false"
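To confirm the property took effect, the cluster configuration can be checked again:

```shell
# Look for stonith-enabled="false" among the cluster properties
crm configure show | grep -i stonith
```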

Now let’s set up the SCST ALUA configuration; we need to run both blocks of commands below, the first on “host A” and the second on “host B”. For “host A” (blackberry.mcc.edu):
scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=256
scstadmin -add_tgrp_tgt 10000000C9E667E9 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E667E9 -driver ocs_scst -attributes rel_tgt_id=1

scstadmin -add_tgrp_tgt 10000000C9E667ED -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E667ED -driver ocs_scst -attributes rel_tgt_id=2
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=257
scstadmin -add_tgrp_tgt 10000000C9E66A91 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E66A91 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=3

scstadmin -add_tgrp_tgt 10000000C9E66A95 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E66A95 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=4
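Once host A’s groups are created, the result can be spot-checked through SCST’s sysfs interface. A sketch only; the paths assume SCST’s standard sysfs layout and the group names used above:

```shell
# List the target groups defined in our "esos" device group
ls /sys/kernel/scst_tgt/device_groups/esos/target_groups/

# Check the group_id assigned to the "local" target group
cat /sys/kernel/scst_tgt/device_groups/esos/target_groups/local/group_id
```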

For “host B” (gooseberry.mcc.edu):
scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=257
scstadmin -add_tgrp_tgt 10000000C9E66A91 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E66A91 -driver ocs_scst -attributes rel_tgt_id=3

scstadmin -add_tgrp_tgt 10000000C9E66A95 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 10000000C9E66A95 -driver ocs_scst -attributes rel_tgt_id=4
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=256
scstadmin -add_tgrp_tgt 10000000C9E667E9 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E667E9 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=1

scstadmin -add_tgrp_tgt 10000000C9E667ED -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 10000000C9E667ED -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=2

Next we rebooted both systems to ensure the enabled/disabled services are set up correctly. After they both came back up, we worked on the rest of the cluster configuration.

This cluster is going to be extremely simple… no replication using DRBD, no logical volumes via LVM, so there really isn’t much to the configuration -- just SCST.

For SCST we’ll have a master/slave configuration: one host will always be the “master” and the other the “slave”. In the ALUA configuration, this makes one target (host) active/optimized and the other non-optimized, which will (should) not be used by initiators.

We’ll add it in with this:
crm
cib new scst
configure primitive p_scst ocf:esos:scst \
params alua="true" device_group="esos" \
local_tgt_grp="local" remote_tgt_grp="remote" \
m_alua_state="active" s_alua_state="nonoptimized" \
op monitor interval="10" role="Master" \
op monitor interval="20" role="Slave" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="60"
configure ms ms_scst p_scst \
meta master-max="1" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true" interleave="true"
cib commit scst
quit

Next we moved on to creating our SCST devices; we have 3 block devices (RAID volumes) we’re going to present to the initiators, so we created the 3 SCST devices on both ESOS hosts with the same names and parameters.
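Creating the devices with scstadmin follows this pattern; a sketch only, with the back-end block device nodes being hypothetical placeholders (use the device nodes of your actual RAID volumes):

```shell
# Create a vdisk_blockio device for each RAID volume
# (the /dev/sdX nodes below are examples, not our actual devices)
scstadmin -open_dev ssd_r10_1 -handler vdisk_blockio -attributes filename=/dev/sdb
scstadmin -open_dev 15k_r10_1 -handler vdisk_blockio -attributes filename=/dev/sdc
scstadmin -open_dev 15k_r10_2 -handler vdisk_blockio -attributes filename=/dev/sdd
```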

Now we can add the SCST devices to the device groups; run this on both ESOS hosts:
scstadmin -add_dgrp_dev ssd_r10_1 -dev_group esos
scstadmin -add_dgrp_dev 15k_r10_1 -dev_group esos
scstadmin -add_dgrp_dev 15k_r10_2 -dev_group esos

When using the SCST vdisk_blockio handler, we have noticed the I/O scheduler (eg, cfq, deadline, noop) makes a huge difference in performance. In ESOS, the default scheduler is “cfq”. That scheduler likely works best when using vdisk_fileio (untested), since you’re then running a local file system on the ESOS box and your LUs point to virtual disk files that reside on that file system.

You can easily change the scheduler “live”, with I/O flowing between the targets and initiators, to see the difference. We haven’t done a lot of testing; we simply flipped the scheduler on the block devices and watched the latency numbers on the ESXi side (the VMs). There doesn’t appear to be much (if any) difference between “noop” and “deadline” (no official testing by us). We use the “noop” scheduler for all block devices that are used with vdisk_blockio. This is set by creating the “/etc/pre-scst_xtra_conf” file and adding this to it:
for i in /sys/block/sd*; do
    echo "noop" > ${i}/queue/scheduler
done

This sets the I/O scheduler to “noop” for all SCSI disk block devices on our systems (we only use vdisk_blockio). You also need to make the file executable:
chmod +x /etc/pre-scst_xtra_conf
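You can verify the scheduler took effect on a given device; the kernel shows the active scheduler in square brackets (the device node below is just an example):

```shell
# The active scheduler appears in brackets, eg: noop [deadline] cfq
cat /sys/block/sda/queue/scheduler
```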

Next we zoned all of the initiators/targets into our zone sets.

We moved on to creating our security groups -- using the TUI to do this was a snap. Similar to FC zoning, for each target we create a group and add the initiators (ESXi hosts). This setup needs to be done on both of the ESOS cluster nodes.

After creating the groups, the final step for provisioning the storage is to map the SCST devices to LUNs (again, using the TUI). Like the SCST configuration above, we need to create the same LUN mapping setup on both nodes.


Finished
That’s it! We did a rescan in the vSphere client and created our VMFS5 file systems on each LU. Now we have lots and lots of testing to perform. For us, this is a POC for a bigger project where we replace all of our proprietary SAN disk arrays with new “open storage” (ESOS-based) storage servers like what was described in this article, but bigger (more disks/JBODs). Look for another article in the near future!



Saturday, February 22, 2014

eClinicalWorks (ECW) FTP File Data

So, we recently ran into a problem on our eClinicalWorks (ECW) server with a failing disk and/or a corrupt file system; specifically, the disk used to store all of the FTP data for patient documents. The disk (and/or file system) was only partially available: the data seemed to be accessible, but you could not get a complete directory listing (we assume due to file system corruption). We could copy data off if we knew the file name that we needed.

All of these patient documents are stored on the server and accessible via FTP (service running on server). The client application then looks up the file names associated with a patient, and that's what you see listed in the client application window. We figured this data (the file names) must be stored in the ECW MySQL database. We approached eClinicalWorks tech. support about getting a full listing (query) for the database, however, they were unsure of how to do this.

I didn't know the credentials for the ECW MySQL database, but they were pretty easy to find on our system. I looked in "Scheduled Tasks" applet in the Control Panel and found a job called "mysql_optimize". I found the batch script the job used and the username/password for the DB was listed.

I then connected to the DB with the mysql application and poked around a bit. I ended up doing a complete dump of the database (SQL) to a flat file, then looked for a file name that we already knew (some were able to be listed on the file system). This led me to the correct table and column needed: document (table name), fileName (column containing what we need)

Then you can simply extract the data to a flat file:
mysql -uecwDbUser -pPASSWORD -P4928
use mobiledoc;
select fileName into outfile 'c:/file_names.txt' from document;

That gave me a flat text file listing all the patient files that belong in the FTP root; I could then loop over it and run robocopy for each file using a batch script. I hope this tip helps someone else with eClinicalWorks!
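The copy loop might look something like this; a sketch only, where the source and destination paths are hypothetical (substitute the failing volume and your recovery target):

```batch
@echo off
rem Loop over the extracted file names and copy each one off the failing volume
for /f "usebackq delims=" %%f in ("c:\file_names.txt") do (
    robocopy d:\ftproot e:\recovered_ftproot "%%f"
)
```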

Tuesday, April 16, 2013

Building & Using a Highly Available ESOS Disk Array

At our institution, we have been using Enterprise Storage OS (ESOS) in production for the last year. We have (4) 24-slot SSD disk arrays that sit on our Fibre Channel Storage Area Network (SAN), and they have worked great. These particular units are used in our VMware View (VDI) farm, and even though they are “single-headed” (not fully redundant), in our VDI environment with pools across multiple datastores, even if one disk array failed, VMs would still be available on another.

In this environment, we also used another enterprise disk array (no vendor names mentioned) that was aging, and we were continuing to pay large amounts of money for maintenance... for a relatively small amount of storage space. We wanted to replace this expensive, proprietary disk array with something else.

We had been using ESOS for a while, and liked it. Recently, a number of new features were added to ESOS, including DRBD, Pacemaker + Corosync, LVM2, and other software packages and enhancements. This opened up the possibility of creating a highly available disk array using ESOS.


The Setup
At the time of building the new ESOS-based disk array, our disk space need was only 5 TB, so we wanted to make sure our new unit gave us room to grow. Since there don’t appear to be any local RAID controllers with a high availability option, support for sharing external JBODs, or anything similar, our only option was to mirror the entire disk array to another unit. With this setup, we configured two servers in a cluster, each containing its own local disks, RAID controller, and Fibre Channel HBA. The data between the two nodes is replicated using DRBD; each ESOS node sits on the SAN, and the initiators use each target node as a path. We make use of implicit Asymmetric Logical Unit Access (ALUA) in SCST to help guide/recommend (depending on the initiators) which path to use. We only want the initiators to use a single path/target (unless there is a failover), since there is more to SCSI than just reading/writing blocks of data: only the blocks are replicated via DRBD in this ESOS disk array cluster, and things like SCSI reservations are not replicated between the cluster nodes. So it’s important the initiators don’t round-robin between paths, and for clusters of initiators, they should all use the same target/path.

For this new disk array we designed, performance was not a big factor to consider -- we already had a ton of SSD-backed storage in our environment (SAN) on the other ESOS disk arrays. This new unit would primarily be used for VMware ESXi boot volumes, and a couple large general purpose VMFS volumes (for parent VMs, supporting server VMs, etc.). We decided on the LSI Logic Nytro MegaRAID application acceleration card, a relatively new product to the LSI line-up. This particular card features built-in (on card) SSD storage which allows data “hot spots” to be promoted onto the SSD storage, and it also sports 1 GB of cache. LSI Logic MegaRAID cards work well with ESOS since the text-based user interface (TUI) has basic support for configuring logical drives on these controllers.

For the replication link between the two systems, we initially planned on using InfiniBand HCAs with Sockets Direct Protocol (SDP) which utilizes Remote Direct Memory Access (RDMA), but unfortunately, since SDP is now deprecated, it is not supported by ESOS. So, we went ahead with two InfiniBand HCAs with a QDR cable between the two nodes, and used IP over InfiniBand (IPoIB).

Each node has (12) 3.5” SAS disk slots; we used 7,200 RPM 2 TB SAS drives (Seagate). We dedicate (1) drive as a global hot spare, then create one (6) disk RAID5 volume and one (5) disk RAID5 volume. This gives us approximately 16 TB of usable space (after RAID parity), which is quite a bit more than what we currently have.

We now had a plan for our new fully redundant, Fibre Channel disk array, based on Enterprise Storage OS (ESOS). We got down to business, and put in requisitions for all of the new hardware.

Cost breakdown for the new ESOS disk array:
  • ~ $1,100 - (2) Mellanox MHQH19B-XTR ConnectX 2 VPI InfiniBand HCAs
  • ~ $50 - (1) Mellanox Technologies Half M Copper 4x QSFP 30 AWG Cable
  • ~ $3,000 - (2) LSI Logic LSI00350 Nytro MegaRAID 8100-4I SAS RAID Controllers
  • ~ $300 - (2) LSI Nytro MegaRAID SCM01 RAID Controller Cache Data Protection Modules
  • ~ $6,400 - (26) Seagate Constellation ES.2 ST32000645SS 2 TB SAS-2 Hard Drives
  • ~ $2,000 - (2) QLogic 8 Gb Fibre Channel PCI-E Single Port Host Bus Adapters
  • ~ $8,100 - (2) Supermicro SuperStorage Server 6027R-E1R12T Chassis (12 x 3.5” Slots; 32 GB RAM; 2 x Intel Xeon Processors)
  • ~ $100 - (2) Lexar JumpDrive Triton 32 GB USB 3.0 Flash Drives

Total cost for a ~16 TB, fully redundant, Fibre Channel disk array: ~ $21,050... replacing your enterprise disk array for less than a year’s worth of maintenance costs... priceless!




So, we’re using the InfiniBand link for DRBD replication. In our environment, we have a normal management network for servers/devices, plus a special, non-routable private network that we use for out-of-band management interfaces (DRAC, IPMI, etc.). For these ESOS storage server nodes, we put the Supermicro out-of-band management interfaces on the private network, and also connected one of the NICs on each node to it. This connectivity is important since we use IPMI as our fencing/STONITH method later in the article. The other server NIC goes on our primary management network. The two networks are completely separate/independent, which matters since we use two Corosync rings, one per network; if one network/link goes down, the two nodes can still communicate with each other. Then each ESOS node is connected to an independent Fibre Channel (FC) fabric, and each host/initiator is connected to both fabrics, giving us full redundancy in case of a switch/fabric failure.



Installation
We spent a morning installing the servers in a rack and installing all of the components (RAID controller, HCA, HBA, etc.) in each unit. We then cabled everything, powered up each server, and installed the disks in the trays.

We started by configuring the out-of-band management interface on the SuperMicro servers. Once we got the default password changed, we opened the virtual console and set a few BIOS (UEFI) settings:
  • We enabled the “mirroring” memory mode, giving us 16 GB of available memory.
  • For the MegaRAID card, we disabled controller BIOS (not booting from any logical drives).
  • We double-checked that the QLogic HBA BIOS option was set to disabled.

Next, we created (2) ESOS USB flash drives. For the USB drives, we decided to go with an above-average device, the Lexar JumpDrive Triton 32 GB USB 3.0 flash drive. Even though our servers aren’t USB 3.0, these devices are much faster than ordinary/standard flash drives even when running at USB 2.0 speeds. This makes a noticeable difference in ESOS when booting, since the entire image is copied into a tmpfs file system on start-up, and also when sync’ing configuration changes.

We used a RHEL (6) workstation as our system to create the ESOS USB flash drives. We then downloaded and extracted the latest installation package from the ESOS project page: http://code.google.com/p/enterprise-storage-os/

wget http://enterprise-storage-os.googlecode.com/files/esos-0.1-r469.tar.xz
tar xvfJ esos-0.1-r469.tar.xz

After the archive was extracted, we plugged in the first flash drive and found the device node using the lsscsi tool. We then started the ESOS installer script:

cd esos-0.1-r469
./install.sh

The installer will prompt for the USB flash drive device node, and warn you before writing the image to the disk. After the image was successfully written, the install script then prompted us to install a third-party (proprietary) CLI RAID configuration tool. In our case, we are using LSI Logic MegaRAID cards, so we downloaded MegaCLI from the given URL and placed it into the temporary directory. The installer finished incorporating the MegaCLI tool into the image and then it was ready for use!

We repeated the above ESOS installation steps for our second server (second USB flash drive). We then labeled each flash drive with the corresponding server’s host name and inserted the drives into each server.

Since we didn’t have any other boot devices on these systems, the ESOS USB flash drive defaulted to being the first boot device (we checked via the UEFI setup screen). We booted up each ESOS storage server, and the first thing we did on both was change the default password (root/esos).




System Configuration
Next, we configured our two Ethernet network interfaces and host name in the TUI. After the interfaces were configured, we SSH’d into the machines and set the timezone, date/time, and an NTP server.



Next we need to enable IP over InfiniBand (IPoIB) for our IB interfaces on each host. Ideally, Sockets Direct Protocol (SDP) would be best for DRBD replication over InfiniBand, but SDP is now deprecated and ESOS does not support it. There have been hints in forums of DRBD adding RDMA support (which IPoIB lacks), but until then, this is probably the best solution. 10 GbE would also be a good option; truthfully, this IPoIB setup is probably only marginally better.

Edit the ‘/etc/infiniband/openib.conf’ IB driver configuration file, and set the following two lines (on both hosts):

IPOIB_LOAD=yes
SET_IPOIB_CM=yes

Next, we restarted the IB stack on each host:

/etc/rc.d/rc.openibd stop && /etc/rc.d/rc.openibd start

Now that IPoIB is loaded, we can configure the IB interfaces using the TUI. We just chose an arbitrary network range that we’re not using anywhere else on campus (even though this isn’t routable). We then started OpenSM on each storage server:

/etc/rc.d/rc.opensm start

The OpenSM InfiniBand subnet manager handles multiple instances and will make one of them enter “standby” mode. After starting the OpenSM service, we edited the ‘/etc/rc.conf’ file and set rc.opensm_enable to “YES” so it starts up on boot. We then tested the IPoIB interface by pinging the other host.

Next, we configured email (SMTP) on each ESOS storage server. ESOS uses email to communicate alerts, warnings, errors, etc. to the administrator, so it’s important to configure.


Initial Cluster Setup
Now that we have the basic system configuration out of the way for each host, we can move on to configuring the cluster. The first step in the cluster setup, will be Corosync. Here is the ‘/etc/corosync/corosync.conf’ file we used on both nodes:

# 20130410 MAS

totem {
        version: 2
        cluster_name: esos
        crypto_cipher: none
        crypto_hash: none
        rrp_mode: passive
       interface {
                ringnumber: 0
                bindnetaddr: 10.35.6.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
                ttl: 1
        }
       interface {
                ringnumber: 1
                bindnetaddr: 172.16.106.0
                mcastaddr: 226.94.1.2
                mcastport: 5407
                ttl: 1
        }
}

nodelist {
        node {
                ring0_addr: 10.35.6.21
                nodeid: 1
        }
        node {
                ring0_addr: 10.35.6.22
                nodeid: 2
        }
}

logging {
        fileline: off
        to_stderr: no
        to_syslog: yes
        syslog_facility: local2
        debug: off
        timestamp: off
        logger_subsys {
                subsys: QUORUM
                debug: off
       }
}

quorum {
        provider: corosync_votequorum
        two_node: 1
}

In our configuration, we opted to use one ring on our primary Ethernet management interface (10.35.6.0) and one ring on our special non-routable management network (172.16.106.0). Next, we restarted Pacemaker and Corosync on each host, then checked the Corosync configuration:

/etc/rc.d/rc.corosync stop
/etc/rc.d/rc.pacemaker stop
/etc/rc.d/rc.corosync start
/etc/rc.d/rc.pacemaker start
corosync-cfgtool -s

Everything looks good; we see two rings with no faults. Next, we checked the cluster configuration:

crm configure show
crm_mon -1

In our configuration, on each host we see an extra node in the config, left over from the default cluster stack configuration (eg, “node $id="16777343" raisin.mcc.edu”), so we just used ‘crm configure edit’ and removed that line.


LVM / SCST ALUA Settings
Next, we made a few LVM configuration changes to prepare for later steps; we want LVM to only discover /dev/drbdX block devices and not the underlying devices. We also set it so LVM doesn’t cache, set the default locking type to 3 (built-in cluster-wide locking), and removed the current cache file (on each host):

Edit the ‘/etc/lvm/lvm.conf’ file and set the following:
  • filter = [ "a|drbd.*|", "r|.*|" ]
  • write_cache_state = 0
  • locking_type = 3

Then remove the cache file, both the live copy and the copy on the USB configuration file system:
rm -f /etc/lvm/cache/.cache
mount /mnt/conf && rm -f /mnt/conf/etc/lvm/cache/.cache && umount /mnt/conf

Since SCST is already running (default) we went ahead and added our base ALUA settings to each host. We create a device group, which all SCST devices will be added to, and then a “local” and “remote” target group on each host. The “local” target group on each host contains the single, local Fibre Channel target. Then on the “remote” target group, we add the FC target of the other host. This setup is required for the SCST resource agent (Master/Slave -> ALUA).

On host cantaloupe.mcc.edu:

scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=1
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 50:01:43:80:21:df:9b:4c -driver qla2x00t -attributes rel_tgt_id=1
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=2
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=2

On host raisin.mcc.edu:

scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=2
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:c7:f4 -dev_group esos -tgt_group local
scstadmin -set_tgt_attr 50:01:43:80:21:df:c7:f4 -driver qla2x00t -attributes rel_tgt_id=2
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=1
scstadmin -add_tgrp_tgt 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 50:01:43:80:21:df:9b:4c -dev_group esos -tgt_group remote \
-attributes rel_tgt_id=1


Additional System Setup / Back-End Storage Configuration
After ALUA was setup on each host, we exited the shell into the TUI and sync’d the configuration (System -> Sync. Configuration); this writes the current SCST configuration to a file and syncs everything with the USB flash drive. We could now configure the ESOS system services for our setup; edit the ‘/etc/rc.conf’ file and set the following (on both hosts):

rc.openibd_enable=YES
rc.opensm_enable=YES
rc.sshd_enable=YES
rc.lvm2_enable=NO
rc.drbd_enable=NO
rc.corosync_enable=YES
rc.dlm_enable=YES
rc.clvmd_enable=YES
rc.pacemaker_enable=YES
rc.mhvtl_enable=NO
rc.scst_enable=NO

The primary services/systems we use on these hosts (DRBD, LVM, and SCST) are all managed by the cluster stack, so we disable them from starting via the init/rc scripts. Since we will be using LVM on top of DRBD, we use clvmd, which prevents (via locking) concurrent LVM metadata updates. DLM is a requirement for clvmd, so we enable that as well. Now we reboot both nodes to ensure everything starts up (or doesn’t) as expected. Check the physical console for start-up errors/messages.

We wanted to be sure the LSI Logic Nytro MegaRAID (8100-4i) cards have the newest firmware available, so we downloaded the firmware image and flashed the controller on each host:

MegaCli64 -adpfwflash -f NytroMrFw.rom -a0
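After flashing, the running firmware version can be checked with MegaCLI; a quick sketch (the exact field names in the output may vary by firmware release):

```shell
# Display adapter information and filter for the firmware package version
MegaCli64 -AdpAllInfo -a0 | grep -i "FW Package"
```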

After the firmware flash was complete, we rebooted each node. We are now ready to create our RAID logical drives (virtual drives). Since we are creating an exact replica of all the storage on each host, we’ll configure them identically. We have (12) 2 TB SAS hard drives in each box; we want (1) global hot spare drive, and then we decided on (2) RAID5 volumes (one with six disks, one with five disks). We felt this setup might give us more performance than one large RAID5 volume with (11) disks, or a RAID6 volume. Since we are using a MegaRAID controller (LSI Logic), we were able to use the TUI to provision our back-end storage.



After we created our two RAID groups on each host, we needed to set up a global hot spare drive. The TUI in ESOS does not support this feature, so we had to use the shell (Interface -> Exit to Shell):

MegaCli64 -pdhsp -set -physdrv[18:11] -a0
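To confirm the hot spare assignment, the physical drive states can be listed; the assigned drive should report a “Hotspare” firmware state:

```shell
# List physical drives and show the firmware state of each
MegaCli64 -PDList -a0 | grep -i "Firmware state"
```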


Back-End Storage Performance Testing
Before continuing our setup, we thought it would be fun to do a couple quick performance tests on the back-end storage. For these tests, we used the (6) disk RAID5 volume and used the included ‘fio’ tool in ESOS.

In this test, we are doing sequential reads with 4 MB blocks for 60 seconds:


fio --bs=4M --direct=1 --rw=read --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [352.0M/0K/0K /s] [88 /0 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3069: Wed Apr 10 13:49:37 2013
  read : io=20728MB, bw=353222KB/s, iops=86 , runt= 60091msec
    slat (usec): min=199 , max=48229 , avg=11580.02, stdev=12383.05
    clat (msec): min=66 , max=2134 , avg=726.91, stdev=58.13
     lat (msec): min=92 , max=2134 , avg=738.49, stdev=56.86
    clat percentiles (msec):
     |  1.00th=[  635],  5.00th=[  676], 10.00th=[  693], 20.00th=[  709],
     | 30.00th=[  717], 40.00th=[  725], 50.00th=[  734], 60.00th=[  742],
     | 70.00th=[  742], 80.00th=[  750], 90.00th=[  766], 95.00th=[  775],
     | 99.00th=[  791], 99.50th=[  791], 99.90th=[  799], 99.95th=[ 2114],
     | 99.99th=[ 2147]
    bw (KB/s)  : min= 5885, max=414476, per=99.32%, avg=350819.86, stdev=35462.71
    lat (msec) : 100=0.06%, 250=0.25%, 500=0.41%, 750=77.29%, 1000=21.94%
    lat (msec) : >=2000=0.06%
  cpu          : usr=0.02%, sys=2.02%, ctx=2440, majf=0, minf=65561
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.3%, 32=0.6%, >=64=98.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=5182/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=20728MB, aggrb=353222KB/s, minb=353222KB/s, maxb=353222KB/s, mint=60091msec, maxt=60091msec
--snip--

In this test, we are doing sequential writes with 4 MB blocks for 60 seconds:


fio --bs=4M --direct=1 --rw=write --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/740.0M/0K /s] [0 /185 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3072: Wed Apr 10 14:12:14 2013
  write: io=44996MB, bw=767254KB/s, iops=187 , runt= 60053msec
    slat (usec): min=347 , max=40575 , avg=5330.42, stdev=4645.18
    clat (msec): min=51 , max=395 , avg=336.13, stdev=34.02
     lat (msec): min=52 , max=397 , avg=341.46, stdev=34.21
    clat percentiles (msec):
     |  1.00th=[   74],  5.00th=[  318], 10.00th=[  322], 20.00th=[  330],
     | 30.00th=[  334], 40.00th=[  338], 50.00th=[  338], 60.00th=[  343],
     | 70.00th=[  347], 80.00th=[  351], 90.00th=[  359], 95.00th=[  363],
     | 99.00th=[  371], 99.50th=[  379], 99.90th=[  388], 99.95th=[  392],
     | 99.99th=[  396]
    bw (KB/s)  : min=692166, max=1378932, per=99.60%, avg=764193.58, stdev=62862.74
    lat (msec) : 100=1.16%, 250=0.55%, 500=98.28%
  cpu          : usr=17.26%, sys=3.64%, ctx=5367, majf=0, minf=25
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=11249/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=44996MB, aggrb=767253KB/s, minb=767253KB/s, maxb=767253KB/s, mint=60053msec, maxt=60053msec
--snip--

In this test, we are doing random reads with 4 KB blocks for 60 seconds:


fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [3128K/0K/0K /s] [782 /0 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3075: Wed Apr 10 14:14:06 2013
  read : io=191372KB, bw=3181.2KB/s, iops=795 , runt= 60158msec
    slat (usec): min=3 , max=49 , avg= 9.80, stdev= 3.59
    clat (usec): min=90 , max=1504.8K, avg=80370.26, stdev=83326.84
     lat (usec): min=107 , max=1504.8K, avg=80380.45, stdev=83326.84
    clat percentiles (msec):
     |  1.00th=[    6],  5.00th=[    9], 10.00th=[   12], 20.00th=[   20],
     | 30.00th=[   29], 40.00th=[   41], 50.00th=[   55], 60.00th=[   72],
     | 70.00th=[   94], 80.00th=[  126], 90.00th=[  182], 95.00th=[  241],
     | 99.00th=[  396], 99.50th=[  469], 99.90th=[  652], 99.95th=[  750],
     | 99.99th=[  979]
    bw (KB/s)  : min= 2221, max= 3368, per=99.98%, avg=3180.32, stdev=119.15
    lat (usec) : 100=0.01%, 250=0.04%, 500=0.01%, 750=0.01%
    lat (msec) : 2=0.01%, 4=0.11%, 10=6.66%, 20=14.21%, 50=26.38%
    lat (msec) : 100=24.76%, 250=23.31%, 500=4.13%, 750=0.33%, 1000=0.05%
    lat (msec) : 2000=0.01%
  cpu          : usr=0.67%, sys=1.12%, ctx=46578, majf=0, minf=87
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=47843/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=191372KB, aggrb=3181KB/s, minb=3181KB/s, maxb=3181KB/s, mint=60158msec, maxt=60158msec
--snip--

In this test, we are doing random writes with 4 KB blocks for 60 seconds:


fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a --runtime=60

--snip--
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/1388K/0K /s] [0 /347 /0  iops] [eta 00m:00s]
/dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a: (groupid=0, jobs=1): err= 0: pid=3078: Wed Apr 10 14:15:29 2013
  write: io=118572KB, bw=1969.2KB/s, iops=492 , runt= 60216msec
    slat (usec): min=3 , max=40 , avg= 8.88, stdev= 4.42
    clat (usec): min=392 , max=614423 , avg=129850.00, stdev=95972.42
     lat (usec): min=403 , max=614434 , avg=129859.28, stdev=95974.97
    clat percentiles (usec):
     |  1.00th=[  772],  5.00th=[  828], 10.00th=[  868], 20.00th=[ 1012],
     | 30.00th=[ 1112], 40.00th=[162816], 50.00th=[177152], 60.00th=[185344],
     | 70.00th=[193536], 80.00th=[201728], 90.00th=[214016], 95.00th=[226304],
     | 99.00th=[288768], 99.50th=[350208], 99.90th=[585728], 99.95th=[593920],
     | 99.99th=[610304]
    bw (KB/s)  : min=  691, max=78858, per=100.00%, avg=1976.98, stdev=7108.54
    lat (usec) : 500=0.06%, 750=0.38%, 1000=18.95%
    lat (msec) : 2=13.32%, 4=0.28%, 50=0.07%, 100=0.17%, 250=64.89%
    lat (msec) : 500=1.58%, 750=0.29%
  cpu          : usr=0.38%, sys=0.58%, ctx=19973, majf=0, minf=24
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=29643/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=118572KB, aggrb=1969KB/s, minb=1969KB/s, maxb=1969KB/s, mint=60216msec, maxt=60216msec
--snip--

So, these numbers seem pretty much on par with what we expected for this type of disk: 352 MB/s for sequential reads, 740 MB/s for sequential writes, 782 IOPS (4 KB) for random reads, and 347 IOPS (4 KB) for random writes. During these tests, background logical drive / disk initialization was still taking place, so our numbers would likely have been a bit better once it completed. The sequential write and even the read throughput is quite nice... we're guessing this is thanks to the controller's on-board SSD volume (CacheCade) and/or the 1 GB of controller cache.


DRBD Configuration
Now, we move on to configuring DRBD. In our setup, we will have (2) DRBD resources (volumes) in dual-primary mode, with LVM running on top of each (an LVM volume group per resource). For the DRBD syncer rate, the rule of thumb we read is to cap the maximum rate at 30% of your slowest link (I/O subsystem or replication link); we settled on 75 MB/s to start with. First, we set our global/common DRBD configuration on each host; we modified the ‘/etc/drbd.d/global_common.conf’ file to look like this on both hosts:

# 20130410 MAS

global {
        usage-count no;
}

common {
        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        startup {
                degr-wfc-timeout 120;
                outdated-wfc-timeout 2;
        }
        options {
                on-no-data-accessible io-error;
        }
        disk {
                on-io-error detach;
                disk-barrier no;
                disk-flushes no;
                fencing resource-only;
                al-extents 3389;
                c-plan-ahead 0;
                resync-rate 75M;
        }
        net {
                protocol C;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                rr-conflict disconnect;
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 512k;
        }
}
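The 30% rule of thumb for the resync rate mentioned above can be sketched as a quick calculation. Note the 250 MB/s figure below is just an assumed bandwidth for the slowest component, not a measurement from our setup:

```shell
# Rule-of-thumb sketch: cap DRBD's resync-rate at ~30% of the slowest
# component (replication link or backing I/O subsystem bandwidth).
slowest_mb_s=250   # assumed bandwidth of the slowest component, in MB/s
rate=$((slowest_mb_s * 30 / 100))
echo "resync-rate ${rate}M"   # -> resync-rate 75M
```

With a slowest component around 250 MB/s, that lands on the 75M value used in the configuration above.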

Next we created our DRBD resource configuration files; instead of using the generic “/dev/sdX” block device nodes for the backing storage, we used the unique SCSI disk identifiers populated in the "/dev/disk-by-id" directory. We created these two files (identical on each node) on both ESOS storage server nodes.

/etc/drbd.d/r0.res:

# 20130410 MAS

resource r0 {
        net {
                allow-two-primaries;
        }
        on cantaloupe.mcc.edu {
                device     /dev/drbd0;
                disk       /dev/disk-by-id/LUN_NAA-600605b0054a753018f855fa236d6d41;
                address    192.168.50.21:7788;
                meta-disk  internal;
        }
        on raisin.mcc.edu {
                device    /dev/drbd0;
                disk      /dev/disk-by-id/LUN_NAA-600605b0054a751018f856b51625577a;
                address   192.168.50.22:7788;
                meta-disk internal;
        }
}

/etc/drbd.d/r1.res:

# 20130410 MAS

resource r1 {
        net {
                allow-two-primaries;
        }
        on cantaloupe.mcc.edu {
                device     /dev/drbd1;
                disk       /dev/disk-by-id/LUN_NAA-600605b0054a753018f8565a29255421;
                address    192.168.50.21:7789;
                meta-disk  internal;
        }
        on raisin.mcc.edu {
                device    /dev/drbd1;
                disk      /dev/disk-by-id/LUN_NAA-600605b0054a751018f856f319dfd5f7;
                address   192.168.50.22:7789;
                meta-disk internal;
        }
}

Now we are ready to set up the DRBD resources. On both nodes, run the following commands:

drbdadm create-md r0
drbdadm up r0
drbdadm create-md r1
drbdadm up r1

Now, on only one of the hosts (it doesn't really matter which, since this is all fresh), run this:

drbdadm primary --force r0
drbdadm primary --force r1

The above commands make the DRBD resources primary on that host and start the full synchronization to the other host. On the non-primary host (“Secondary”) you can run the following to make the resources primary there as well:

drbdadm primary r0
drbdadm primary r1
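While the initial full sync runs, ‘cat /proc/drbd’ shows a progress bar for each resource. A quick way to pull out just the completion percentage from that output; the progress line below is a hard-coded sample in the DRBD 8.x format, for illustration only, not captured from our array:

```shell
# Sample sync-progress line in the /proc/drbd (DRBD 8.x) format
line="    [====>...............] sync'ed: 21.4% (61540/78204)M"
# Extract just the completion percentage
echo "$line" | grep -o "[0-9.]*%"   # -> 21.4%
```

Against the real file, the equivalent would be something like `grep -o "[0-9.]*%" /proc/drbd`.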


LVM Configuration
Next, we need to get the Logical Volume Manager (LVM) set up. For our configuration, we have (2) DRBD resources, and on these we will create (2) LVM physical volumes (PV) and (2) LVM volume groups (VG). We already set up our LVM device filter in the configuration file a few pages back; this way we don't get complaints from LVM about finding duplicates, since it will only match “/dev/drbdX” block devices. On just one of the hosts, we ran the following:

pvcreate /dev/drbd0
pvcreate /dev/drbd1
vgcreate -c y r0 /dev/drbd0
vgcreate -c y r1 /dev/drbd1
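For reference, the LVM device filter mentioned above (set earlier in the article) looks along these lines in ‘/etc/lvm/lvm.conf’; the exact pattern here is illustrative and may differ slightly in your setup:

```
# /etc/lvm/lvm.conf -- accept only DRBD devices when scanning for PVs,
# reject everything else (avoids duplicate-PV complaints on the backing disks)
filter = [ "a|/dev/drbd.*|", "r|.*|" ]
```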

We can now check that our (2) new LVM volume groups are available (on both hosts):

vgdisplay


More Cluster Configuration
Now we are ready to finish configuring the cluster stack; we have our DRBD resources configured and our LVM volume groups set up. Let's start by disabling STONITH (we will re-enable it at the end):

crm configure property stonith-enabled="false"

We broke each chunk of the cluster configuration out into a separate step so we can explain each piece as we go. The first chunk we added was for the DRBD resources:

crm
cib new drbd
configure primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
configure primitive p_drbd_r1 ocf:linbit:drbd \
        params drbd_resource="r1" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
configure group g_drbd p_drbd_r0 p_drbd_r1
configure ms ms_drbd g_drbd \
        meta master-max="2" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" interleave="true"
cib commit drbd
quit

In the step above, we have two DRBD resources (r0, r1) that we configured previously, and we are setting two masters (two nodes, dual-primary mode). We used the advised/default resource agent parameters for ocf:linbit:drbd.

Next, we added the resource configuration for LVM2:

crm
cib new lvm
configure primitive p_lvm_r0 ocf:heartbeat:LVM \
        params volgrpname="r0" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
configure primitive p_lvm_r1 ocf:heartbeat:LVM \
        params volgrpname="r1" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
configure group g_lvm p_lvm_r0 p_lvm_r1
configure clone clone_lvm g_lvm \
        meta interleave="true" target-role="Started"
cib commit lvm
quit

For LVM, we have two DRBD resources (r0, r1) that we are running LVM on top of. As mentioned earlier, the clvmd service is used in conjunction with this type of setup. This could have been done other ways, but we felt it was simplest to run LVM on top of a couple of large DRBD resources, instead of trying to set up a DRBD resource for each individual volume we wanted to share on our SAN. The cluster configuration for these resources was straightforward: a primitive for each volume group (r0, r1), and then a clone statement so they are started on both of our nodes.

Next we added the SCST configuration. In this setup, only one of the two nodes will be “Master” for the SCST resource (and the other “Slave”). Again, this is used with the ALUA setup in SCST, which is our extra state for the resource (SCST itself is always started/running; only the ALUA information is updated). The parameters for this resource specify the SCST ALUA device group name, the “local” target group name, and the “remote” target group name. It is exactly what it sounds like: the local target group contains the targets local to that node, and the remote group contains the other node's targets. We added the SCST ALUA device group and target groups earlier in the article.

crm
cib new scst
configure primitive p_scst ocf:esos:scst \
        params alua="true" device_group="esos" \
        local_tgt_grp="local" remote_tgt_grp="remote" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="60"
configure ms ms_scst p_scst \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" interleave="true"
cib commit scst
quit

In the step above, the SCST RA is configured with only one master, and we don’t care which one it is since LVM and DRBD are both running active/active on the cluster.

Next, we added the order and colocation rules. At this point, resources had been trying to start, promote, etc. as we added them, and some may have failed since we didn't add the constraints as we went; in our case it didn't matter much, since this is a new cluster with nothing connected to it yet. Here are the constraints we used:

crm
cib new constraints
colocation c_r0_r1 inf: ms_scst:Started clone_lvm:Started ms_drbd:Master
order o_r0_r1 inf: ms_drbd:promote clone_lvm:start ms_scst:start
cib commit constraints
quit

Above you can see the colocation and order rules we added... we want DRBD to be promoted to master first, then LVM can start, and then SCST can start. This was the last main cluster configuration step.

When the cluster attempted to start the LVM resources, they failed, since at that point there were no logical volumes (LV) in the volume groups. So, we went ahead and created one on each:

lvcreate -L 4T -n big_vmfs_1 r0
lvcreate -L 4T -n big_vmfs_2 r1

We used the ‘crm resource cleanup’ command to clear the failed / timed-out resources, and everything started as expected:

--snip--
Last updated: Thu Apr 11 11:54:41 2013
Last change: Thu Apr 11 11:49:44 2013 via cibadmin on raisin.mcc.edu
Stack: corosync
Current DC: cantaloupe.mcc.edu (1) - partition with quorum
Version: 1.1.8-1f8858c
2 Nodes configured, unknown expected votes
10 Resources configured.


Online: [ cantaloupe.mcc.edu raisin.mcc.edu ]

 Master/Slave Set: ms_scst [p_scst]
     Masters: [ cantaloupe.mcc.edu ]
     Slaves: [ raisin.mcc.edu ]
 Clone Set: clone_lvm [g_lvm]
     Started: [ cantaloupe.mcc.edu raisin.mcc.edu ]
 Master/Slave Set: ms_drbd [g_drbd]
     Masters: [ cantaloupe.mcc.edu raisin.mcc.edu ]
--snip--




Provisioning Storage
Now that the cluster is configured, we moved on to provisioning our storage. First, we zoned each of our initiators with each target (on each switch). Then, using the TUI, we created a host group (Hosts -> Add Group) for each server and added each server's initiator to its group (Hosts -> Add Initiator). After zoning everything on our Fibre Channel switches, we used the ‘fcc.sh’ tool in the ESOS shell to get a list of the visible FC initiators, which made it very easy to copy/paste the initiator names into the TUI.

Next we created a 50 GB boot volume for each of our (4) ESXi hosts; we used the CLI to do this (LVM logical volumes):

lvcreate -L 50G -n boot_mulberry r0
lvcreate -L 50G -n boot_lime r0
lvcreate -L 50G -n boot_banana r0
lvcreate -L 50G -n boot_keylime r0

Then, after we created the (4) ESXi boot volumes above, on each ESOS storage server, using the TUI, we added the SCST device for each (vdisk_blockio), and then mapped each device as LUN 0 to each corresponding host group (Devices -> Map to Group). For each SCST device we created using the vdisk_blockio mode, we made sure to set “Write Through” to Yes/1 and “NV Cache” to No/0 since we are using DRBD in dual-primary mode and would most definitely like to avoid data divergence!
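The TUI device-creation step above can also be done from the CLI with scstadmin; a sketch for one of the boot volumes follows. We believe the attribute syntax below matches scstadmin of that era, and the LV path assumes the device sits in the r0 volume group, so double-check both against your version:

```
# Sketch: create a vdisk_blockio SCST device backed by an LVM LV, with
# write-through on and NV cache off (important for dual-primary DRBD)
scstadmin -open_dev boot_mulberry -handler vdisk_blockio \
    -attributes filename=/dev/r0/boot_mulberry,write_through=1,nv_cache=0
```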



For each SCST device we created, we ran the following command on both hosts to add the device into our SCST implicit ALUA configuration:

scstadmin -add_dgrp_dev boot_mulberry -dev_group esos
scstadmin -add_dgrp_dev boot_lime -dev_group esos
scstadmin -add_dgrp_dev boot_banana -dev_group esos
scstadmin -add_dgrp_dev boot_keylime -dev_group esos
scstadmin -add_dgrp_dev big_vmfs_1 -dev_group esos
scstadmin -add_dgrp_dev big_vmfs_2 -dev_group esos


Final Cluster Setup
Now that our cluster is set up, some storage is provisioned, and everything is working, we can add fencing mechanisms into our configuration and re-enable STONITH:

crm
cib new stonith
configure primitive fence_cantaloupe stonith:fence_ipmilan \
params pcmk_host_list="cantaloupe.mcc.edu" ipaddr="172.16.6.21" \
login="user" passwd="password" lanplus="true" \
op monitor interval="60"
configure primitive fence_raisin stonith:fence_ipmilan \
params pcmk_host_list="raisin.mcc.edu" ipaddr="172.16.6.22" \
login="user" passwd="password" lanplus="true" \
op monitor interval="60"
cib commit stonith
quit

crm configure property stonith-enabled="true"

Finally, we tested our fencing mechanism (one node at a time) to make sure it works:

crm node fence NODE_NAME

After we were sure everything was tested and working as it should be, we enabled a cluster-status-change email mechanism. The crm_mon utility supports an external agent; we used the ocf:pacemaker:ClusterMon resource agent and the crm_mon_email.sh script that ESOS includes to send simple/basic emails when anything in the cluster changes. This is not something you want enabled while testing, as it sends an individual email for each cluster status change, so you can rack up a fair number of emails from something as simple as a node rebooting. We configured our ClusterMon RA like this:

crm
cib new clustermon
configure primitive p_notify ocf:pacemaker:ClusterMon \
params user="root" update="30" \
extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
op monitor on-fail="restart" interval="10"
configure clone clone_notify p_notify \
meta target-role="Started"
cib commit clustermon
quit

There, that's it! Our ESOS disk array cluster is fully functional and tested. Here is our final cluster configuration (`crm configure show`), just for reference:

node $id="1" cantaloupe.mcc.edu
node $id="2" raisin.mcc.edu
primitive fence_cantaloupe stonith:fence_ipmilan \
        params pcmk_host_list="cantaloupe.mcc.edu" ipaddr="172.16.6.21" login="user" passwd="password" lanplus="true" \
        op monitor interval="60"
primitive fence_raisin stonith:fence_ipmilan \
        params pcmk_host_list="raisin.mcc.edu" ipaddr="172.16.6.22" login="user" passwd="password" lanplus="true" \
        op monitor interval="60"
primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
primitive p_drbd_r1 ocf:linbit:drbd \
        params drbd_resource="r1" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100"
primitive p_lvm_r0 ocf:heartbeat:LVM \
        params volgrpname="r0" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive p_lvm_r1 ocf:heartbeat:LVM \
        params volgrpname="r1" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive p_notify ocf:pacemaker:ClusterMon \
        params user="root" update="30" extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
        op monitor on-fail="restart" interval="10"
primitive p_scst ocf:esos:scst \
        params alua="true" device_group="esos" local_tgt_grp="local" remote_tgt_grp="remote" \
        op monitor interval="10" role="Master" \
        op monitor interval="20" role="Slave" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="60"
group g_drbd p_drbd_r0 p_drbd_r1
group g_lvm p_lvm_r0 p_lvm_r1
ms ms_drbd g_drbd \
        meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true"
ms ms_scst p_scst \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true"
clone clone_lvm g_lvm \
        meta interleave="true" target-role="Started"
clone clone_notify p_notify \
        meta target-role="Started"
colocation c_r0_r1 inf: ms_scst:Started clone_lvm:Started ms_drbd:Master
order o_r0_r1 inf: ms_drbd:promote clone_lvm:start ms_scst:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.8-1f8858c" \
        cluster-infrastructure="corosync" \
        stonith-enabled="true" \
        last-lrm-refresh="1365772801"

This disk array is currently only being used in a VMware vSphere (ESXi) environment. VMware ESXi supports implicit ALUA, and you can check the pathing in the vSphere Client by going to the Configuration tab for a host, clicking Storage, clicking Properties for a datastore, and finally clicking Manage Paths. We used the “Most Recently Used” path selection policy and checked that each datastore selected the correct path for I/O. We also noticed that when using ALUA with SCST in ESOS, the storage array type shows “VMW_SATP_ALUA”; for a non-ALUA SCST/ESOS configuration, it usually shows “VMW_SATP_DEFAULT_AA”.

One other thing we typically do with VMware ESXi initiators when using them with ESOS/SCST is disable support for the vStorage APIs for Array Integration (VAAI). It's not currently supported on these disk arrays, and it just pollutes the logs since the VAAI SCSI commands fail (not supported). In the vSphere Client, for each host, go to the Configuration tab, then Advanced Settings, and set the following to ‘0’:
  • /VMFS3/HardwareAcceleratedLocking
  • /DataMover/HardwareAcceleratedMove
  • /DataMover/HardwareAcceleratedInit
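On ESXi 5.x, the same settings can also be changed from the host's CLI, and the in-use SATP checked there as well; the commands below are standard esxcli, though the option paths should be verified against your ESXi version:

```
# Disable the three VAAI primitives from the ESXi shell (run per host)
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0

# Confirm which SATP claimed each device (expect VMW_SATP_ALUA with this setup)
esxcli storage nmp device list
```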

This concludes my article on building and using a Fibre Channel disk array based on Enterprise Storage OS (ESOS). This unit has been in production for less than a week now, and I will follow up on this article after some time with our experiences using the disk array. Please leave any comments/questions; I hope others find this useful!