Why isn't my ZFS pool the size I expect it to be?

I've seen this question on several forums, and I've worked through several examples myself. I thought I understood why my pools were the sizes they were.
That was until I rebuilt a server. The main change was moving from 512-byte sectors to 4K sectors, and suddenly about 6% of the disk space disappeared.
After some research, I can now explain how much disk space you will get from a ZFS volume.

512 byte sectors and 4K byte sectors

For a long time the default sector size for disks has been 512 bytes. This is the basic size of a read or write transaction over the disk cable.
One of the ways to increase hard drive storage density is to increase the block size (as this reduces the overhead of the non-storage data).
A bigger block size has other advantages too, as it works better with the more complex disk encoding and error recovery mechanisms.

Internally, drives have been moving to bigger blocks for a while, but externally they have still presented a 512-byte interface:
For reading - the drive reads the bigger block into cache RAM, and returns the requested 512-byte block.
For writing - the drive reads the bigger block into cache RAM, modifies the 512 bytes within it, then writes the bigger block back out.

So for reading, it's not really an issue (in fact the drive is likely to read a track at a time, while it has the chance).
But for writing, it can lead to performance issues as the drive has to do this read/modify/write sequence.
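
The read/modify/write sequence can be sketched in a few lines of Python. This is only a toy model of what happens inside the drive: the 4096-byte internal block size, the bytearray standing in for the platter, and the function name are all assumptions made for illustration.

```python
PHYSICAL = 4096  # assumed internal block size in bytes
LOGICAL = 512    # sector size presented to the host


def write_logical_sector(media: bytearray, lba: int, data: bytes) -> int:
    """Emulate one 512-byte write on a drive with 4K internal blocks.

    Returns the number of bytes the drive had to move to do it.
    """
    if len(data) != LOGICAL:
        raise ValueError("data must be exactly one logical sector")
    # Find the internal block that contains this logical sector.
    start = (lba * LOGICAL // PHYSICAL) * PHYSICAL
    # Read: pull the whole internal block into cache.
    block = bytearray(media[start:start + PHYSICAL])
    # Modify: overwrite just the 512 bytes that were requested.
    offset = lba * LOGICAL - start
    block[offset:offset + LOGICAL] = data
    # Write: put the whole internal block back.
    media[start:start + PHYSICAL] = block
    return 2 * PHYSICAL


media = bytearray(2 * PHYSICAL)  # two internal blocks, all zeroes
moved = write_logical_sector(media, 9, b"\xaa" * LOGICAL)
print(f"wrote {LOGICAL} bytes, drive moved {moved} bytes")
```

A 512-byte write costs a 4K read plus a 4K write in this model, which is where the write penalty comes from.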

As the typical file system has grown, so has the typical block size.
The FAT12/16/32 system uses clusters of sectors as its minimum block size; this is the smallest data block that the file system will read or write.
As FAT32 only has 28 bits for the cluster index, it can only use 1 cluster = 1 sector for partitions up to (about) 128GB.
For (about) 1TByte, the minimum cluster size is 4K. So with modern drives, FAT32 will read/write multiples of (at least) 4K.
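
The FAT32 limits follow directly from the 28-bit cluster index. As a rough check (ignoring the handful of reserved cluster values, which is why the figures are "about"), the following Python sketch computes the largest partition for each cluster size. The function name is made up for this example.

```python
CLUSTER_INDEX_BITS = 28  # usable bits in a FAT32 cluster index
SECTOR = 512             # bytes per sector


def max_partition_bytes(sectors_per_cluster: int) -> int:
    """Approximate largest FAT32 partition for a given cluster size."""
    return (2 ** CLUSTER_INDEX_BITS) * sectors_per_cluster * SECTOR


for spc in (1, 2, 4, 8):
    size = max_partition_bytes(spc)
    print(f"{spc * SECTOR:5d}-byte clusters: up to {size // 2**30} GB")
```

With 1 sector per cluster the limit is 128GB, and with 8 sectors (4K) per cluster it is 1024GB, where 1GB is 2^30 bytes.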

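So what is the problem? Even if the file system is writing 4K blocks, there can still be performance problems if the partition does not start on a 4K boundary, because each 4K write then straddles two of the drive's internal blocks.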

On FreeBSD you can check the alignment with gpart:
gpart show {device or diskid/name}
If the data partition starts on a multiple of 8 (512-byte) sectors, then it is 4K aligned (the start is usually sector 40).
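
The "multiple of 8 sectors" rule is easy to check by hand, but here is a small Python sketch that also counts how many internal 4K blocks a single 4K write touches for a given start sector. The function names are invented for this example, and start sector 63 is not from this article; it is included only as a typical unaligned start from older MBR-style layouts.

```python
LOGICAL = 512    # bytes per sector as seen by the host
PHYSICAL = 4096  # assumed internal block size of the drive


def is_4k_aligned(start_sector: int) -> bool:
    """True if a partition starting at this sector sits on a 4K boundary."""
    return start_sector % (PHYSICAL // LOGICAL) == 0


def internal_blocks_touched(start_sector: int, length: int = 4096) -> int:
    """Number of internal blocks covered by one write of `length` bytes."""
    first = start_sector * LOGICAL // PHYSICAL
    last = (start_sector * LOGICAL + length - 1) // PHYSICAL
    return last - first + 1


for start in (34, 40, 63):
    print(start, is_4k_aligned(start), internal_blocks_touched(start))
```

Only the partition starting at sector 40 is aligned. The other two turn every 4K write into two read/modify/write operations.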

Raid stripe size

ZFS reads and writes in fixed records. The record size has to be from 512 bytes to 128 kilobytes, in power-of-2 steps.
It can be changed with the zfs set recordsize=X command, but it defaults to 128K.
The block can be sub-allocated (so multiple 1K files will fit in a single block), but this is the minimum data read/write size for ZFS.
ZFS has to split this block across the data drives in your set. If it doesn't fit exactly, then you waste some disk space.
The amount of wasted disk space varies with the sector size (512 or 4096 bytes) and the number of data drives (drives-1 for raidz1). Using a record size of 128K minimizes the waste, but even then it can be significant.
For example: with 3 data drives (maybe 4 drives in raidz1) and 512-byte sectors, the stripe is 3 x 512 bytes = 1.5K bytes. To store 128K you would use 86 stripes (128/1.5 = 85.33, rounded up), but 86 x 1.5K = 129K. So you store 128K of data in 129K of data drive space (plus the expected overhead of mirror, raidz1, raidz2, or raidz3).
Here is a table showing the number of data drives vs the stripe size, and the number of stripes needed to store the 128K ZFS block.

512-byte sectors

Data drives   Stripe size (K)   Stripes per 128K   Actual used (K)   % Unused
1             0.5               256                128               0.00%
2             1                 128                128               0.00%
3             1.5               86                 129               0.78%
4             2                 64                 128               0.00%
5             2.5               52                 130               1.56%
6             3                 43                 129               0.78%
7             3.5               37                 129.5             1.17%
8             4                 32                 128               0.00%
9             4.5               29                 130.5             1.95%
10            5                 26                 130               1.56%
11            5.5               24                 132               3.13%
12            6                 22                 132               3.13%
13            6.5               20                 130               1.56%
14            7                 19                 133               3.91%
15            7.5               18                 135               5.47%
16            8                 16                 128               0.00%
17            8.5               16                 136               6.25%
18            9                 15                 135               5.47%
19            9.5               14                 133               3.91%
20            10                13                 130               1.56%
21            10.5              13                 136.5             6.64%
22            11                12                 132               3.13%
23            11.5              12                 138               7.81%
24            12                11                 132               3.13%
25            12.5              11                 137.5             7.42%
26            13                10                 130               1.56%
27            13.5              10                 135               5.47%
28            14                10                 140               9.38%
29            14.5              9                  130.5             1.95%
30            15                9                  135               5.47%
31            15.5              9                  139.5             8.98%
32            16                8                  128               0.00%

So it's not too bad. Up to 10 data drives, the worst case is only 1.95% overhead. Quite acceptable.
When you get to the bigger arrays things get worse: 28 data drives has 9.38% unavailable.
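
The rows of the table above can be reproduced with a few lines of Python. This is a sketch of the article's simplified model only (a 128K record always occupies a whole number of full stripes); the function name is invented, and the percentage is measured against the 128K stored, as in the table.

```python
RECORD_BYTES = 128 * 1024  # default ZFS record size used in this article


def record_footprint(data_drives: int, sector_bytes: int):
    """Return (stripes per 128K, K actually used, % unused) for one record."""
    stripe = data_drives * sector_bytes
    stripes = -(-RECORD_BYTES // stripe)  # ceiling division
    used = stripes * stripe
    unused_pct = 100 * (used - RECORD_BYTES) / RECORD_BYTES
    return stripes, used / 1024, unused_pct


for drives in (3, 5, 10, 28):
    stripes, used_k, pct = record_footprint(drives, 512)
    print(f"{drives:2d} data drives: {stripes:3d} stripes, {used_k:g}K used, {pct:.2f}%")
```

Passing 4096 instead of 512 as the sector size gives the figures for 4K-sector drives.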

4K-byte sectors

Data drives   Stripe size (K)   Stripes per 128K   Actual used (K)   % Unused
1             4                 32                 128               0.00%
2             8                 16                 128               0.00%
3             12                11                 132               3.13%
4             16                8                  128               0.00%
5             20                7                  140               9.38%
6             24                6                  144               12.50%
7             28                5                  140               9.38%
8             32                4                  128               0.00%
9             36                4                  144               12.50%
10            40                4                  160               25.00%
11            44                3                  132               3.13%
12            48                3                  144               12.50%
13            52                3                  156               21.88%
14            56                3                  168               31.25%
15            60                3                  180               40.63%
16            64                2                  128               0.00%
17            68                2                  136               6.25%
18            72                2                  144               12.50%
19            76                2                  152               18.75%
20            80                2                  160               25.00%
21            84                2                  168               31.25%
22            88                2                  176               37.50%
23            92                2                  184               43.75%
24            96                2                  192               50.00%
25            100               2                  200               56.25%
26            104               2                  208               62.50%
27            108               2                  216               68.75%
28            112               2                  224               75.00%
29            116               2                  232               81.25%
30            120               2                  240               87.50%
31            124               2                  248               93.75%
32            128               1                  128               0.00%

This is where the problems become more obvious... Dividing N*4K into 128K gives some quite big rounding errors!
I was using 7 x 4TB drives in a raidz2 array. With 5 data drives, each 128K block is actually stored in 140K of space.
Instead of the expected 5x4TB = 20TB, I actually got 5x4TB*128/140 = 18.29TB. That is 9.375% overhead on every record stored (12K extra per 128K), which made about 8.6% of my raw data space unavailable.

Creating arrays with more data drives can result in some strange situations.
If you have 8 data drives, adding a drive (or 2) to make 9 or 10 doesn't actually gain you any data space. You are better off increasing the raidz level or adding hot spares.
If you already have 16 data drives, you need to be very careful: anything from 17 to 31 data drives still only stores 16 drives' worth of data.

So.... How big will my array be?

Starting with my previous example, 7 drives, each of 4TB, in a raidz2 array.

What's a Mega-byte?

To start with, the drive manufacturers decided to go decimal, so that they could hoodwink us into thinking we're getting more for our money.
So that 4TB drive actually has 7,814,037,135 sectors, so 4,000,787,013,120 bytes.
While the SI units are 10^3=1000 based, the computer versions are 2^10=1024 based.
So in computer terms, the 4,000,787,013,120 bytes is 3,907,018,568 K-bytes, 3,815,448 M-bytes, 3,726 G-bytes, or 3.639 T-bytes.
So a single drive will often show as 3.639TBytes, already nearly 10% lower than we expected, thanks to the HDD vendor's marketing team.
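
The conversion is just repeated division by 1000 or by 1024. A short Python sketch, using the sector count quoted above (the helper name is made up for this example):

```python
SECTORS = 7_814_037_135     # sector count quoted above for a 4TB drive
drive_bytes = SECTORS * 512


def in_units(n_bytes: int, base: int) -> dict:
    """Express a byte count in K, M, G and T using the given base."""
    names = ["K", "M", "G", "T"]
    return {name: n_bytes / base ** (i + 1) for i, name in enumerate(names)}


decimal = in_units(drive_bytes, 1000)  # what the label on the box uses
binary = in_units(drive_bytes, 1024)   # what the operating system often shows
print(f"{decimal['T']:.3f} TB decimal, {binary['T']:.3f} TB binary")
```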

Partitioned space

Partitioning uses a few sectors to store information about what type of data is where on the drive.
The amount used is tiny compared to modern drive sizes, so it's best to partition the drive.
We use -a 4K to align the partition to a 4K boundary, as the drive will actually have 4K sectors internally.
Here are the commands and the resulting partition info for one drive:
gpart create -s gpt /dev/ada{n}
gpart add -a 4K -t freebsd-zfs /dev/ada{n}
gpart show /dev/ada{n}

=>        34  7814037101  ada{n}  GPT  (3.6T)
          34           6          - free -  (3.0K)
          40  7814037088       1  freebsd-zfs  (3.6T)
  7814037128           7          - free -  (3.5K)

ZFS space

So with 7 drives in raidz2, we have 5 data drives. For 4TB drives, we would expect to have 5 x 7,814,037,088 sectors of usable storage.
So we would expect 39,070,185,440 (512-byte) sectors = 19,535,092,720 KBytes of data space.
Which is 20.004 TB (where 1TB=10^12 bytes), or 18.193 TB (where 1TB=2^40 bytes).

But the ZFS 128K record size reduces this (the sizes in brackets use 1TB=2^40 bytes):
With 4K-byte sectors: 20K stripes and 140K per record, so 139,536,376 x 128K records = 17,860,656,128 KBytes = 18.289TB (16.634TB). Every record carries 9.375% overhead.
With 512-byte sectors: 2.5K stripes and 130K per record, so 150,269,944 x 128K records = 19,234,552,832 KBytes = 19.696TB (17.914TB). Every record carries 1.5625% overhead.
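
The record counts can be checked with the Python sketch below. It follows the same simplified model as the rest of this article (each 128K record occupies a whole number of full stripes) and ignores any other ZFS metadata, so treat the result as an estimate. The partition size is the one reported by gpart for my drives; the function name is invented.

```python
SECTOR = 512                       # sector size reported by gpart
PARTITION_SECTORS = 7_814_037_088  # size of each freebsd-zfs partition
DATA_DRIVES = 5                    # 7 drives in raidz2
RECORD = 128 * 1024                # ZFS record size in bytes


def usable(drive_sector_bytes: int):
    """Return (number of 128K records, usable bytes) for the pool."""
    raw = DATA_DRIVES * PARTITION_SECTORS * SECTOR
    stripe = DATA_DRIVES * drive_sector_bytes
    footprint = -(-RECORD // stripe) * stripe  # space one record occupies
    records = raw // footprint
    return records, records * RECORD


for size in (4096, 512):
    records, n_bytes = usable(size)
    print(f"{size:4d}-byte sectors: {records:,} records, "
          f"{n_bytes / 10**12:.3f}TB ({n_bytes / 2**40:.3f}TB)")
```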

How to avoid it

How many drives should you use?

If you use 4K-byte sectors with 1TB drives, here is the storage space you can expect for the various raidz modes:
Drives   raidz1   raidz2   raidz3
2        1.00
3        2.00     1.00
4        2.91     2.00     1.00
5        4.00     2.91     2.00
6        4.57     4.00     2.91
7        5.33     4.57     4.00
8        6.40     5.33     4.57
9        8.00     6.40     5.33
10       8.00     8.00     6.40
11       8.00     8.00     8.00
12       10.67    8.00     8.00
13       10.67    10.67    8.00
14       10.67    10.67    10.67
Avoid 15 or 16 drives
17       16.00    10.67    10.67
18       16.00    16.00    10.67
19       16.00    16.00    16.00
Avoid 20..32 drives
33       32.00    16.00    16.00
34       32.00    32.00    16.00
35       32.00    32.00    32.00
36       32.00    32.00    32.00
Try to use the ones marked in Green (where the storage space equals the number of data drives), as all of the data drive space will be used.
Avoid the ones marked in Red, as you can get the same storage space with fewer drives.
Reconsider using the ones marked in Yellow, as they waste more than 5% of the storage space.
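
If you want to check a drive count that is not in the table, the figures can be regenerated with this Python sketch. It assumes 1TB drives, 4K sectors, and 128K records, and uses the same simplified full-stripe model as the rest of this article; the function names are invented.

```python
RECORD = 128 * 1024  # ZFS record size in bytes
SECTOR = 4096        # 4K-byte sectors


def effective_drives(data_drives: int) -> float:
    """Drives' worth of data actually stored on this many data drives."""
    stripe = data_drives * SECTOR
    footprint = -(-RECORD // stripe) * stripe  # space one record occupies
    return data_drives * RECORD / footprint


def raidz_row(total_drives: int) -> list:
    """Storage (in TB, for 1TB drives) for raidz1, raidz2 and raidz3."""
    row = []
    for parity in (1, 2, 3):
        data = total_drives - parity
        row.append(f"{effective_drives(data):.2f}" if data >= 1 else "")
    return row


for drives in range(2, 15):
    print(drives, *raidz_row(drives))
```

For other drive sizes, multiply each figure by the drive capacity.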





This page was last updated on Tuesday, 28-Jul-2015 16:50:21 BST
