Wednesday, August 16, 2017

Linux MD RAID0: The Weakest Link

I made an interesting discovery recently while testing performance of Linux MD RAID0 (software RAID) with some NVMe devices in an older chassis. The chassis had (4) Samsung NVMe drives connected to the system via PCIe (using PCIe/M.2 adapters).

I started by testing the performance of the drives individually using the 'fio' tool, and then finally across all (4) drives at once... for the simplicity of this post, I'll just mention that sequential read throughput was tested. Here is the command that was used and the performance across all four of the drives:
# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=16 --runtime=30 --name=/dev/nvme0n1 --name=/dev/nvme1n1 --name=/dev/nvme2n1 --name=/dev/nvme3n1
READ: bw=5383MiB/s (5645MB/s), 25.3MiB/s-353MiB/s (26.5MB/s-370MB/s), io=161GiB (173GB), run=30140-30575msec

So we're getting about 5.6 gigabytes per second, which is pretty fast. I figured I should get about the same performance from an MD RAID0 (striped) array using all (4) of these NVMe drives. I created the MD RAID0 array using a 64K chunk size. Here is the fio command string used and the performance:
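For reference, a striped array like this is created with mdadm; a sketch of the creation command (the device names and the "4stripe" array name are assumptions based on this system) looks like:

```shell
# Create a 4-drive RAID0 (striped) array with a 64K chunk size.
# Array name "4stripe" and device names are assumed; --chunk is in KiB.
mdadm --create /dev/md/4stripe --level=0 --raid-devices=4 --chunk=64 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
```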
# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=16 --runtime=30 --name=/dev/md/localhost\:4stripe
READ: bw=3168MiB/s (3322MB/s), 132MiB/s-393MiB/s (139MB/s-412MB/s), io=93.6GiB (100GB), run=30112-30240msec

Whoa... 3.3 gigabytes per second is significantly lower than what we got using fio directly against all four NVMe drives above. Some overhead was expected when using MD RAID, but not this much.

Next I looked at the performance of each NVMe device individually:

# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=16 --runtime=30 --name=/dev/nvme0n1
READ: bw=793MiB/s (832MB/s), 26.1MiB/s-198MiB/s (28.3MB/s-208MB/s), io=23.8GiB (25.5GB), run=30241-30593msec

# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=16 --runtime=30 --name=/dev/nvme1n1
READ: bw=1576MiB/s (1652MB/s), 73.8MiB/s-131MiB/s (77.4MB/s-138MB/s), io=46.6GiB (50.8GB), run=30247-30264msec

# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=16 --runtime=30 --name=/dev/nvme2n1
READ: bw=1572MiB/s (1648MB/s), 90.3MiB/s-113MiB/s (94.7MB/s-118MB/s), io=46.4GiB (49.8GB), run=30162-30180msec

# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=16 --runtime=30 --name=/dev/nvme3n1
READ: bw=1580MiB/s (1657MB/s), 92.7MiB/s-108MiB/s (97.2MB/s-113MB/s), io=46.6GiB (50.8GB), run=30163-30178msec

So it's clear one of the devices is slower than the rest... about half the performance of the other three (~800 MB/s vs. ~1600 MB/s). I was under the assumption that all the PCIe slots in this system were PCIe 2.0 (Gen 2), but it turns out one of them is PCIe 1.0 (Gen 1). The system's slot listing shows it:
PCI1: x4 PCI Express 2 x4
PCI2: x8 PCI Express 2 x8
PCI3: x8 PCI Express 2 x8
PCI4: x8 PCI Express 2 x8
PCI5: x4 PCI Express
PCI6: x8 PCI Express 2 x8
PCI7: x4 PCI Express 2 x4
PCI8: x4 PCI Express 2 x4
PCI9: x4 PCI Express 2 x4
PCI10: x4 PCI Express 2 x4

It must be that the first NVMe device with the slow performance is in the PCIe 1.0 slot! A quick check of the "lspci -vv" output confirms it:
 LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

Versus the output from the other faster NVMe devices:
 LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
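As a shortcut, the negotiated link speed and width for every NVMe controller can also be read straight from sysfs (a sketch, assuming the nvme driver is loaded and standard sysfs paths):

```shell
# Print the negotiated PCIe link speed/width for each NVMe controller.
# current_link_speed and current_link_width are standard PCI sysfs attributes.
for n in /sys/class/nvme/nvme*; do
    dev=$(readlink -f "$n/device")
    echo "$(basename "$n"): $(cat "$dev/current_link_speed"), width x$(cat "$dev/current_link_width")"
done
```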

Yes, half the speed to be precise (the difference between PCIe 1.0 and 2.0)... 2.5GT/s vs. 5GT/s. So, back to the read throughput of our NVMe-based MD array (3322MB/s)... it turns out that for MD RAID0 (and probably any type/level of RAID, not just MD RAID) the maximum performance is bounded by the slowest device in the array: 832MB/s * 4 = 3328MB/s, which is almost exactly the value we got with the MD array.
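The per-device numbers line up with the raw PCIe math, too. Gen 1 and Gen 2 both use 8b/10b encoding, so theoretical per-direction bandwidth is transfer rate * lanes * 8/10; a quick sanity check (working in MT/s to stay in integer arithmetic):

```shell
# Theoretical per-direction PCIe bandwidth in MB/s:
#   (MT/s per lane) * lanes * 8/10 (8b/10b encoding) / 8 (bits per byte)
gen1_x4=$(( 2500 * 4 * 8 / 10 / 8 ))   # PCIe 1.0 x4
gen2_x4=$(( 5000 * 4 * 8 / 10 / 8 ))   # PCIe 2.0 x4
echo "Gen1 x4: ${gen1_x4} MB/s, Gen2 x4: ${gen2_x4} MB/s"
# -> Gen1 x4: 1000 MB/s, Gen2 x4: 2000 MB/s
```

The observed ~832 MB/s and ~1650 MB/s are right about where a Gen 1 and Gen 2 x4 link top out once protocol overhead is taken off those raw ceilings.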

This information may be commonplace and well known to others, and when you think about it, it makes sense that it works this way... but I always had it in my head that the performance of a stripe set was the sum of each device's individual throughput... IT'S NOT! Maximum performance is the slowest device * the number of devices. Again, this is probably RAID "101" level information, but perhaps others will find it helpful.
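That rule of thumb is easy to express in shell: take the slowest member's throughput and multiply by the member count (the per-device figures below are the MB/s numbers measured above):

```shell
# RAID0 read ceiling = slowest member * number of members.
# Per-device read bandwidth (MB/s) measured above; nvme0n1 sits in the Gen 1 slot.
bw=(832 1652 1648 1657)
min=${bw[0]}
for b in "${bw[@]}"; do
    if (( b < min )); then min=$b; fi
done
echo "RAID0 ceiling: $(( min * ${#bw[@]} )) MB/s"
# -> RAID0 ceiling: 3328 MB/s
```

Which is within a rounding error of the 3322 MB/s the array actually delivered.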

Okay, I moved the "slow" NVMe device to a PCIe 2.0 slot and ran the test again using fio with all (4) NVMe drives together (which should show a performance increase since one of the drives was previously slower than the others):
# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=16 --runtime=30 --name=/dev/nvme0n1 --name=/dev/nvme1n1 --name=/dev/nvme2n1 --name=/dev/nvme3n1
READ: bw=6059MiB/s (6353MB/s), 42.4MiB/s-368MiB/s (44.5MB/s-386MB/s), io=180GiB (193GB), run=30144-30391msec

Excellent, we're getting about 6.3 gigabytes per second, which is up about 700 MB/s from our very first test now that all (4) NVMe drives are in PCIe 2.0 slots. And finally, run our fio test against the MD RAID0 array that contains the (4) NVMe drives:
# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=16 --runtime=30 --name=/dev/md/localhost\:4stripe
READ: bw=4758MiB/s (4989MB/s), 159MiB/s-952MiB/s (167MB/s-998MB/s), io=140GiB (150GB), run=30037-30122msec

There we go, about 4.9 gigabytes per second with the MD RAID0 array... that's still costing us over a gigabyte per second of overhead, but we could probably play with the chunk size or the alignment and get much closer to the 6.3 GB/s number. Or it may just be a matter of tuning our fio command string to get more performance from a single MD array block device. But this post isn't about tuning performance; it was just to illustrate the point above about the slowest device in a RAID array.

Okay, wait... it bugs me not knowing why something happens (the performance overhead above), so I adjusted the fio command string as follows to seek better performance from our MD array (increased the number of jobs to 64):
# fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 --numjobs=64 --runtime=30 --name=/dev/md/localhost\:4stripe
READ: bw=5600MiB/s (5872MB/s), 79.0MiB/s-150MiB/s (82.9MB/s-157MB/s), io=167GiB (179GB), run=30227-30489msec
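Rather than guessing at a job count, one approach is a quick sweep (a sketch; device path and fio parameters as used above):

```shell
# Sweep numjobs to find where the MD array's read throughput plateaus.
for jobs in 16 32 64 128; do
    echo "numjobs=$jobs:"
    fio --bs=4m --direct=1 --rw=read --ioengine=libaio --iodepth=12 \
        --numjobs=$jobs --runtime=30 --name=/dev/md/localhost\:4stripe \
        | grep READ:
done
```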

There, I'm satisfied with 5.8 gigabytes per second; that is within a few hundred megabytes of our non-RAID number. Again, tuning could help more, but I'm good with leaving it at that -- nothing is free.