Why isn't my ZFS pool the size I expect it to be?

I've seen this question on several forums, and I've worked through several examples myself. I thought I understood why my pools were the sizes they were.
That was until I rebuilt a server. The main change was from 512-byte sectors to 4K sectors, and suddenly about 6% of the disk space disappeared.
After some research, I can now explain how much disk space you will get from a ZFS volume.

512 byte sectors and 4K byte sectors

For a long time the default sector size for disks has been 512 bytes. This is the basic size of a read or write transaction over the disk cable.
One of the ways to increase hard drive storage density is to increase the block size (as this reduces the overhead of the non-storage data).
A bigger block size also helps with the more complex disk encoding and error recovery mechanisms.

Internally, drives have been moving to bigger blocks for a while, but externally they have still presented a 512-byte interface:
For reading - The drive reads the bigger block into cache RAM, and returns the requested 512-byte block.
For writing - The drive reads the bigger block into cache RAM, modifies the 512 bytes within it, then writes the bigger block back out.

So for reading, it's not really an issue (in fact the drive is likely to read a track at a time, while it has the chance).
But for writing, it can lead to performance issues as the drive has to do this read/modify/write sequence.

As the typical filing system has grown, so has the typical block size.
The FAT12/16/32 systems use clusters of sectors as their minimum block size; this is the smallest data block that the file system will read or write.
As FAT32 only has 28 bits for the cluster index, it can only use 1 cluster = 1 sector for partitions up to (about) 128GB.
For (about) 1TB, the minimum cluster size is 4K. So with modern drives, FAT32 will read/write multiples of (at least) 4K.
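
The FAT32 limits above are just the 28-bit cluster index multiplied by the cluster size. Here is a minimal sketch of that arithmetic; the function name is my own, and it ignores the few reserved cluster numbers, so treat the results as "about" figures.

```python
# Rough FAT32 limit: number of addressable clusters times the cluster size.
CLUSTER_INDEX_BITS = 28  # figure taken from the text above


def max_fat32_partition_bytes(cluster_bytes: int) -> int:
    """Approximate largest partition for a given cluster size (hypothetical helper)."""
    return (2 ** CLUSTER_INDEX_BITS) * cluster_bytes


for cluster in (512, 4096):
    gib = max_fat32_partition_bytes(cluster) / 2 ** 30
    print(f"{cluster}-byte clusters: about {gib:.0f} GiB")
```

With 512-byte clusters this gives about 128 GiB, and with 4K clusters about 1024 GiB (1 TiB).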

So what are the problems?

Alignment - While the file system is reading and writing multiples of 4K, traditionally the partitions haven't been aligned to 4K. They have started on track (63 sector) or smaller boundaries (34 sectors is common).
Timing - The drive has to wait to see if there is more data coming, to know if it can avoid the read/modify/write.
If the CPU doesn't write fast enough, then the drive will decide to write the data anyway, and incur the performance hit.

So, even if the file system is writing 4K blocks, there can still be performance problems.

On FreeBSD you can check the alignment with gpart:
gpart show {device or diskid/name}
If the data partition starts on a multiple of 8 sectors (usually 40), then it's 4K aligned.
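
The "multiple of 8 sectors" rule is easy to check by hand, but here is a small sketch of it anyway. It assumes the start sector reported by gpart is counted in 512-byte sectors and that the drive uses 4096-byte sectors internally; the function name is hypothetical.

```python
LOGICAL_SECTOR_BYTES = 512    # assumed: the unit gpart reports start sectors in
PHYSICAL_SECTOR_BYTES = 4096  # assumed: the drive's internal sector size


def is_4k_aligned(start_sector: int) -> bool:
    """True if the partition's first byte sits on a 4K boundary."""
    return (start_sector * LOGICAL_SECTOR_BYTES) % PHYSICAL_SECTOR_BYTES == 0


# The start sectors mentioned in this article: 34, 40 and 63.
for start in (34, 40, 63):
    print(start, "aligned" if is_4k_aligned(start) else "NOT aligned")
```

Of the three, only the partition starting at sector 40 is 4K aligned.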

Raid stripe size

ZFS reads and writes in fixed records. The record size has to be from 512 bytes to 128 kilobytes, in power-of-2 steps.
It can be changed with the zfs set recordsize=X command, but it defaults to 128K.
The block can be sub-allocated (so multiple 1K files will fit in a single block), but this is the minimum data read/write size for ZFS.
ZFS has to split this block across the data drives in your set. If it doesn't fit exactly, then you waste some disk space.
The amount of wasted disk space varies with the sector size (512 or 4096 bytes) and the number of data drives (drives-1 for raidz1). Using a record size of 128K minimizes the waste, but even then it can be significant.
For example, take 3 data drives (maybe 4 drives in raidz1) and 512-byte sectors. The stripe is 3 x 512 bytes = 1.5K bytes. To store 128K you need 86 stripes (128/1.5 = 85.33, rounded up), but 86 x 1.5K = 129K. So you store 128K of data in 129K of data drive space (plus the expected overhead of mirror, raidz1, raidz2, or raidz3).
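
The same rounding can be written as a few lines of code. This is only a sketch of the model described in this article (whole stripes per 128K record), not of what ZFS does internally, and the function name is my own.

```python
RECORD_BYTES = 128 * 1024  # default ZFS recordsize


def record_footprint(data_drives: int, sector_bytes: int):
    """Return (stripes, K actually used, overhead % relative to 128K)."""
    stripe_bytes = data_drives * sector_bytes
    stripes = -(-RECORD_BYTES // stripe_bytes)  # ceiling division
    used_bytes = stripes * stripe_bytes
    overhead_pct = (used_bytes - RECORD_BYTES) / RECORD_BYTES * 100
    return stripes, used_bytes / 1024, overhead_pct


stripes, used_k, overhead = record_footprint(data_drives=3, sector_bytes=512)
print(f"{stripes} stripes, {used_k:g}K used, {overhead:.2f}% overhead")
```

For 3 data drives and 512-byte sectors this reports 86 stripes and 129K used. Calling it with sector_bytes=4096 gives the figures for 4K-sector drives.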
Here is a table, showing the number of data drives vs the stripe size, and the number of stripes to store the 128K ZFS block.

512-byte sectors

Data drives	Stripe size (K)	Stripes per 128K	Actual used (K)	Overhead (% of 128K)
1	0.5	256	128	0.00%
2	1	128	128	0.00%
3	1.5	86	129	0.78%
4	2	64	128	0.00%
5	2.5	52	130	1.56%
6	3	43	129	0.78%
7	3.5	37	129.5	1.17%
8	4	32	128	0.00%
9	4.5	29	130.5	1.95%
10	5	26	130	1.56%
11	5.5	24	132	3.13%
12	6	22	132	3.13%
13	6.5	20	130	1.56%
14	7	19	133	3.91%
15	7.5	18	135	5.47%
16	8	16	128	0.00%
17	8.5	16	136	6.25%
18	9	15	135	5.47%
19	9.5	14	133	3.91%
20	10	13	130	1.56%
21	10.5	13	136.5	6.64%
22	11	12	132	3.13%
23	11.5	12	138	7.81%
24	12	11	132	3.13%
25	12.5	11	137.5	7.42%
26	13	10	130	1.56%
27	13.5	10	135	5.47%
28	14	10	140	9.38%
29	14.5	9	130.5	1.95%
30	15	9	135	5.47%
31	15.5	9	139.5	8.98%
32	16	8	128	0.00%

So it's not too bad. Up to 10 data drives, the worst case is only 1.95% overhead. Quite acceptable.
When you get to the bigger arrays things get worse: 28 data drives has 9.38% overhead.

4K-byte sectors

Data drives	Stripe size (K)	Stripes per 128K	Actual used (K)	Overhead (% of 128K)
1	4	32	128	0.00%
2	8	16	128	0.00%
3	12	11	132	3.13%
4	16	8	128	0.00%
5	20	7	140	9.38%
6	24	6	144	12.50%
7	28	5	140	9.38%
8	32	4	128	0.00%
9	36	4	144	12.50%
10	40	4	160	25.00%
11	44	3	132	3.13%
12	48	3	144	12.50%
13	52	3	156	21.88%
14	56	3	168	31.25%
15	60	3	180	40.63%
16	64	2	128	0.00%
17	68	2	136	6.25%
18	72	2	144	12.50%
19	76	2	152	18.75%
20	80	2	160	25.00%
21	84	2	168	31.25%
22	88	2	176	37.50%
23	92	2	184	43.75%
24	96	2	192	50.00%
25	100	2	200	56.25%
26	104	2	208	62.50%
27	108	2	216	68.75%
28	112	2	224	75.00%
29	116	2	232	81.25%
30	120	2	240	87.50%
31	124	2	248	93.75%
32	128	1	128	0.00%

This is where the problems become more obvious... Dividing N*4 into 128K, gives some quite big rounding errors!
I was using 7 x 4TB drives in a raidz2 array. With 5 data drives the 128K blocks are actually stored in 140K of space.
Instead of the expected 5x4TB = 20TB, I actually got 5x4TB*128/140 = 18.29TB. That is 9.375% overhead on every record, which left 8.57% of my data space unavailable.

Creating arrays with more data drives can result in some strange situations.
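If you have 8 data drives, adding a drive (or 2) doesn't actually gain you any data space, and the same goes for adding up to 4 drives to 11 data drives. You are better off increasing the raidz level or adding hot-spares.
If you already have 16 data drives, you need to be very careful: with 4K sectors you gain nothing more until you reach 32 data drives.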
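
To see where these plateaus fall, here is a sketch that walks up the data drive count and flags every count that stores no more than a smaller array already does. It uses the same whole-stripes-per-128K model as the tables above, with 4K sectors assumed by default; both function names are hypothetical.

```python
RECORD_BYTES = 128 * 1024


def effective_data_drives(data_drives: int, sector_bytes: int = 4096) -> float:
    """Data drives' worth of space left after rounding records up to whole stripes."""
    stripe_bytes = data_drives * sector_bytes
    stripes = -(-RECORD_BYTES // stripe_bytes)  # ceiling division
    return data_drives * RECORD_BYTES / (stripes * stripe_bytes)


def wasted_additions(max_drives: int = 32, sector_bytes: int = 4096) -> list:
    """Data drive counts that give no more space than some smaller count."""
    best = 0.0
    wasted = []
    for n in range(1, max_drives + 1):
        space = effective_data_drives(n, sector_bytes)
        if space <= best:
            wasted.append(n)
        else:
            best = space
    return wasted


print(wasted_additions())
```

With 4K sectors the flagged counts are 9, 10, 12 to 15, and 17 to 31 data drives.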

So.... How big will my array be?

Starting with my previous example, 7 drives, each of 4TB, in a raidz2 array.

What's a Mega-byte?

To start with, the drive manufacturers decided to go decimal, so that they could hoodwink us into thinking we're getting more for our money.
So that 4TB drive actually has 7,814,037,135 (512-byte) sectors, so 4,000,787,013,120 bytes.
While the SI units are 10³=1000 based, the computer versions are 2¹⁰=1024 based.
So in computer terms, the 4,000,787,013,120 bytes is 3,907,018,568 K-bytes, 3,815,448 M-bytes, 3,726 G-bytes, or 3.639 T-Bytes.
So a single drive will often show as 3.639TBytes, already nearly 10% lower than we expected, thanks to the HDD vendor's marketing team.
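
The decimal-versus-binary conversion can be reproduced in a few lines. This sketch uses the sector count quoted above and assumes 512-byte sectors; the variable names are my own.

```python
SECTORS = 7_814_037_135  # sector count quoted above for a "4TB" drive
SECTOR_BYTES = 512       # assumed logical sector size

total_bytes = SECTORS * SECTOR_BYTES
print(f"{total_bytes:,} bytes")
print(f"{total_bytes / 10**12:.3f} TB (decimal, 10^12 bytes)")

# Binary units: divide by 1024 once per step up.
for name, power in (("K", 1), ("M", 2), ("G", 3), ("T", 4)):
    print(f"{total_bytes / 1024**power:,.3f} {name}-bytes (binary)")
```

The last line printed is the 3.639 T-bytes figure that the operating system reports.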

Partitioned space

Partitioning uses a few sectors to store information about what type of data is where on the drive.
The amount used is tiny compared to modern drive sizes, so it's best to use it.
We use -a 4K to align the partition to a 4K boundary, as the drive will actually have 4K sectors internally.
Here are the commands, and the resulting partition info, for one drive:

gpart create -s gpt /dev/ada{n}

gpart add -a 4K -t freebsd-zfs /dev/ada{n}

gpart show /dev/ada{n}

=>	34	7814037101	ada{n}	GPT (3.6T)
34		6		- free - (3.0K)
40		7814037088	1	freebsd-zfs (3.6T)
7814037128		7		- free - (3.5K)

ZFS space

So with 7 drives, in raidz2, we have 5 data drives. For 4TB drives, we would expect to have 5 x 7,814,037,088 sectors of usable storage.
So we would expect 39,070,185,440 (512 byte) sectors = 19,535,092,720 KBytes of data space.
Which is 20.004 TB (Where 1TB=10¹² bytes), or 18.193 TB (Where 1TB=2⁴⁰ bytes).

But the ZFS 128K record size reduces this:
With 4K-byte sectors = 20K stripes = 139,536,376 x 128K records = 17,860,656,128 KBytes = 18.289TB (16.634TB binary), so 8.57% is unavailable (9.375% overhead per record).
With 512-byte sectors = 2.5K stripes = 150,269,944 x 128K records = 19,234,552,832 KBytes = 19.696TB (17.914TB binary), so 1.54% is unavailable (1.5625% overhead per record).
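
Here is a sketch of how those record counts fall out of the partition size. It assumes the 7,814,037,088-sector partition from the gpart output and 5 data drives, and it uses this article's model of whole stripes per 128K record; the function name is hypothetical.

```python
RECORD_K = 128
DATA_DRIVES = 5                      # 7 drives in raidz2
PARTITION_SECTORS = 7_814_037_088    # 512-byte sectors, from gpart show

data_bytes = DATA_DRIVES * PARTITION_SECTORS * 512


def usable(sector_bytes: int):
    """Return (number of 128K records that fit, usable KBytes)."""
    stripe_bytes = DATA_DRIVES * sector_bytes
    stripes = -(-RECORD_K * 1024 // stripe_bytes)  # ceiling division
    footprint_bytes = stripes * stripe_bytes       # space one record occupies
    records = data_bytes // footprint_bytes
    return records, records * RECORD_K


for sector_bytes in (4096, 512):
    records, k = usable(sector_bytes)
    tb = k * 1024 / 10**12
    tib = k * 1024 / 2**40
    print(f"{sector_bytes}: {records:,} records, {k:,}K, {tb:.3f}TB ({tib:.3f}TB binary)")
```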

How to avoid it

Use a number of data drives that is a power of 2. So 1, 2, 4, 8, 16 or 32 data drives.
Even with 4K sectors, there will be no unused data space.
3 or 11 data drives only have 3.13% overhead. 17 data drives have 6.25%. 5 and 7 have 9.38%. The other (non-power-of-2) combinations end up with over 10% overhead.
Use 512-byte sectors.
This reduces the 128K/stripe rounding error. For data sets that are mainly read, the performance impact will be minimal.

How many drives should you use?

If you use 4K-byte sectors with 1TB drives, here is the storage space (in TB) you can expect for the various raidz modes:

Drives	raidz1	raidz2	raidz3
2	1.00
3	2.00	1.00
4	2.91	2.00	1.00
5	4.00	2.91	2.00
6	4.57	4.00	2.91
7	5.33	4.57	4.00
8	6.40	5.33	4.57
9	8.00	6.40	5.33
10	8.00	8.00	6.40
11	8.00	8.00	8.00
12	10.67	8.00	8.00
13	10.67	10.67	8.00
14	10.67	10.67	10.67
Avoid 15 or 16 drives
17	16.00	10.67	10.67
18	16.00	16.00	10.67
19	16.00	16.00	16.00
Avoid 20..32 drives
33	32.00	16.00	16.00
34	32.00	32.00	16.00
35	32.00	32.00	32.00
36	32.00	32.00	32.00

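Try to use the ones marked in Green (a power-of-2 number of data drives: the first 1.00, 2.00, 4.00, 8.00, 16.00 or 32.00 in each column), as all of the data drive space will be used.
Avoid the ones marked in Red (a value that repeats the row above in the same column), as you can get the same storage space with fewer drives.
Reconsider using the ones marked in Yellow, as they waste more than 5% of the storage space.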
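
If you want the same figures for other drive counts or drive sizes, the table can be regenerated with a short script. This is a sketch under this article's model (4K sectors, 128K records, whole stripes per record); the function and parameter names are my own.

```python
RECORD_BYTES = 128 * 1024
SECTOR_BYTES = 4096


def usable_tb(drives: int, parity: int, drive_tb: float = 1.0):
    """Expected data space in TB, or None if there are no data drives left."""
    data_drives = drives - parity
    if data_drives < 1:
        return None
    stripe_bytes = data_drives * SECTOR_BYTES
    stripes = -(-RECORD_BYTES // stripe_bytes)  # ceiling division
    return drive_tb * data_drives * RECORD_BYTES / (stripes * stripe_bytes)


print("Drives", "raidz1", "raidz2", "raidz3", sep="\t")
for drives in range(2, 15):
    cells = []
    for parity in (1, 2, 3):
        tb = usable_tb(drives, parity)
        cells.append("" if tb is None else f"{tb:.2f}")
    print(drives, *cells, sep="\t")
```

Pass drive_tb=4.0 (for example) to scale the result for bigger drives.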

This page was last updated on Thursday, 17-Aug-2023 13:03:14 BST