How to Install and Configure ZFS on Ubuntu 16.04 LTS

Prerequisites for installing ZFS on Ubuntu 14.04 LTS or 16.04 LTS

  1. 64-bit capable CPU
  2. ECC RAM
  3. 64-Bit Ubuntu 14.04 LTS or 16.04 LTS installation


Install ZFS packages on Ubuntu 14.04 LTS (do not use on Ubuntu 16.04)

The commands below first update our package lists and install python-software-properties, which gives us access to the apt-add-repository command and makes it much simpler to safely add PPAs to our repository list. We then add the PPA, update our source list again to reflect that, and install the package itself.

sudo apt-get update
sudo apt-get install python-software-properties
sudo apt-add-repository ppa:zfs-native/stable
sudo apt-get update
sudo apt-get install ubuntu-zfs

Load the ZFS module

sudo modprobe zfs


Install ZFS packages on Ubuntu 16.04 LTS (do not use on Ubuntu 14.04)

Ubuntu 16.04 LTS ships with built-in support for ZFS, so it's just a matter of installing the packages and loading the module:

sudo apt-get install zfsutils-linux zfs-initramfs
sudo modprobe zfs
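
As a quick sanity check, you can verify that the module actually loaded and see which ZFS version you are running:

lsmod | grep zfs
modinfo zfs | grep -iw version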


Create Zpool

First, get a listing of all the disk devices you will be using with this command:

sudo fdisk -l | more

You should get a listing like below:

Disk /dev/sda: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sdb: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sdc: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sdd: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sde: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sdf: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sdg: 32.0 GB, 32017047552 bytes
255 heads, 63 sectors/track, 3892 cylinders, total 62533296 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0005ae32

In my particular case, I will be using all of the 4000.8 GB drives in my zpool, so I will be using the following devices:

/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf

I will not be using the 32GB /dev/sdg since it's my boot drive. Now that we have the device names, let's get a listing of all the drives in the system by WWN ID. This is the preferred way of adding drives to your zpool, since the /dev/sdX assignments can change between boots. Additionally, the WWN ID is usually printed on the drive label itself, so if you ever have to replace a drive you will know exactly which one it is:

ls -l /dev/disk/by-id

You should get a listing like below:

lrwxrwxrwx 1 root root 9 Jun 8 10:48 ata-TOSHIBA_MD04ACA400_15O8K1NGFSBA -> ../../sda
lrwxrwxrwx 1 root root 9 Jun 8 10:48 ata-TOSHIBA_MD04ACA400_15PDKBNIFSAA -> ../../sde
lrwxrwxrwx 1 root root 9 Jun 8 10:48 ata-TOSHIBA_MD04ACA400_15Q1KCFMFSAA -> ../../sdf
lrwxrwxrwx 1 root root 9 Jun 8 10:48 ata-TOSHIBA_MD04ACA400_15Q2KETKFSAA -> ../../sdc
lrwxrwxrwx 1 root root 9 Jun 8 10:48 ata-TOSHIBA_MD04ACA400_15Q2KETLFSAA -> ../../sdd
lrwxrwxrwx 1 root root 9 Jun 8 10:48 ata-TOSHIBA_MD04ACA400_15Q3KFGKFSAA -> ../../sdb
lrwxrwxrwx 1 root root 9 Jun 8 10:48 ata-TSSTcorp_DVD+_-RW_TS-H653B -> ../../sr0
lrwxrwxrwx 1 root root 9 Jun 8 10:48 ata-V4-CT032V4SSD2_200118513 -> ../../sdg
lrwxrwxrwx 1 root root 10 Jun 8 10:48 ata-V4-CT032V4SSD2_200118513-part1 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Jun 8 10:48 ata-V4-CT032V4SSD2_200118513-part2 -> ../../sdg2
lrwxrwxrwx 1 root root 10 Jun 8 10:48 ata-V4-CT032V4SSD2_200118513-part5 -> ../../sdg5
lrwxrwxrwx 1 root root 9 Jun 8 10:48 wwn-0x500003960b704511 -> ../../sdf
lrwxrwxrwx 1 root root 9 Jun 8 10:48 wwn-0x500003960b784775 -> ../../sdc
lrwxrwxrwx 1 root root 9 Jun 8 10:48 wwn-0x500003960b784776 -> ../../sdd
lrwxrwxrwx 1 root root 9 Jun 8 10:48 wwn-0x500003960b804868 -> ../../sdb
lrwxrwxrwx 1 root root 9 Jun 8 10:48 wwn-0x500003960ba809f1 -> ../../sda
lrwxrwxrwx 1 root root 9 Jun 8 10:48 wwn-0x500003960bd03569 -> ../../sde
lrwxrwxrwx 1 root root 9 Jun 8 10:48 wwn-0x500a07560bed90f1 -> ../../sdg
lrwxrwxrwx 1 root root 10 Jun 8 10:48 wwn-0x500a07560bed90f1-part1 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Jun 8 10:48 wwn-0x500a07560bed90f1-part2 -> ../../sdg2
lrwxrwxrwx 1 root root 10 Jun 8 10:48 wwn-0x500a07560bed90f1-part5 -> ../../sdg5

Now we match each device name to its corresponding WWN ID (an lsblk shortcut for this mapping is shown after the list). In my particular case, I will be using the following WWN IDs:

sda --> wwn-0x500003960ba809f1
sdb --> wwn-0x500003960b804868
sdc --> wwn-0x500003960b784775
sdd --> wwn-0x500003960b784776
sde --> wwn-0x500003960bd03569
sdf --> wwn-0x500003960b704511
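
If your util-linux version is recent enough, lsblk can show the same name-to-WWN mapping in a single view; this is just a convenience, and the available output columns depend on your lsblk version:

lsblk -o NAME,SIZE,MODEL,WWN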

We will be creating a RAID6 ZFS pool. I prefer RAID6 over RAID5 since it is more resilient: it can withstand two drive failures before the array goes down. Just for reference, the following RAID levels can be created (example creation commands for each are sketched after the list):

  • RAID0
  • RAID1 (mirror)
  • RAID5 (raidz)
  • RAID6 (raidz2)
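
For reference, here is roughly how each of these levels would be created. This is just a sketch; mypool and the diskA/diskB/... paths are placeholders for your own pool name and /dev/disk/by-id device paths:

sudo zpool create -o ashift=12 mypool diskA diskB                     # RAID0 (stripe, no redundancy)
sudo zpool create -o ashift=12 mypool mirror diskA diskB              # RAID1 (mirror)
sudo zpool create -o ashift=12 mypool raidz diskA diskB diskC         # RAID5 (raidz, single parity)
sudo zpool create -o ashift=12 mypool raidz2 diskA diskB diskC diskD  # RAID6 (raidz2, double parity)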

Let's create the RAID6 ZFS pool named array1 using 4K sectors (-o ashift=12) instead of the default 512 bytes, which matches the 4K physical sector size of these drives:

sudo zpool create -o ashift=12 -f array1 raidz2 /dev/disk/by-id/wwn-0x500003960ba809f1 /dev/disk/by-id/wwn-0x500003960b804868 /dev/disk/by-id/wwn-0x500003960b784775 /dev/disk/by-id/wwn-0x500003960b784776 /dev/disk/by-id/wwn-0x500003960bd03569 /dev/disk/by-id/wwn-0x500003960b704511

Check the newly created zpool:

sudo zpool status

should output the following:

  pool: array1
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        array1                      ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x500003960ba809f1  ONLINE       0     0     0
            wwn-0x500003960b804868  ONLINE       0     0     0
            wwn-0x500003960b784775  ONLINE       0     0     0
            wwn-0x500003960b784776  ONLINE       0     0     0
            wwn-0x500003960bd03569  ONLINE       0     0     0
            wwn-0x500003960b704511  ONLINE       0     0     0

Show the zpool listing:

sudo zpool list

The command above will output the raw, not the usable, capacity of the zpool, since two of our drives are taken up by parity:

NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
array1 21.8T 756K 21.7T - 0% 0% 1.00x ONLINE -

Running the zfs list command

sudo zfs list

will output the usable capacity of the zpool (roughly 4/6 of the raw capacity here, since two of the six drives hold parity, minus some overhead):

NAME USED AVAIL REFER MOUNTPOINT
array1 480K 14.3T 192K /array1


Disable Sync Writes (the ZIL / ZFS Intent Log)

This cannot be stressed enough: if you intend to turn off sync writes, you absolutely must have a UPS battery backup that will gracefully shut down your server before the battery runs out. If you don't, you risk losing or corrupting data on a power failure.

If you intend to use your ZFS pool to store virtual machines or databases, you should not turn off sync writes; instead, use an SSD as a SLOG device to boost performance (explained below).

If you intend to use your ZFS pool for NFS, which issues sync writes by default, then this is the case where turning sync writes off makes sense. What if you want to store virtual machines on NFS then? Then simply export your NFS share with the async option instead.

Turn off synchronous writes for the pool with the following command:

sudo zfs set sync=disabled array1
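
The sync setting is just another ZFS property, so you can verify it, or go back to the default behavior later, like this:

sudo zfs get sync array1
sudo zfs set sync=standard array1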

 

Create ZFS Filesystem

Let's create some filesystems on the newly created zfs pool. In ZFS, filesystems look like folders under the zfs pool. We could simply create folders, but then we would lose the ability to create snapshots or set properties such as compression, deduplication, quotas etc.

In my particular case, I need some of the ZFS pool for an iSCSI target, so I'm going to create an iscsi filesystem:

sudo zfs create array1/iscsi


Running df -h will output the following. Notice the array1/iscsi filesystem that was created:

Filesystem Size Used Avail Use% Mounted on
/dev/sdg1 14G 2.3G 11G 18% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 7.9G 4.0K 7.9G 1% /dev
tmpfs 1.6G 684K 1.6G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 7.9G 0 7.9G 0% /run/shm
none 100M 0 100M 0% /run/user
array1 15T 128K 15T 1% /array1
array1/iscsi 15T 128K 15T 1% /array1/iscsi


You can create as many filesystems as you need in the ZFS pool and set properties.
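
For example, snapshots (one of the reasons for using ZFS filesystems instead of plain folders, as mentioned above) are taken per filesystem. A quick sketch, using a hypothetical snapshot named before-changes:

sudo zfs snapshot array1/iscsi@before-changes
sudo zfs list -t snapshot
sudo zfs rollback array1/iscsi@before-changes

The rollback command reverts the filesystem to the state it was in when the snapshot was taken.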

If I want to enable compression on the newly created filesystem, I would issue the following command:

sudo zfs set compression=on array1/iscsi


To turn off compression use the following command:

sudo zfs set compression=off array1/iscsi


Important note: simply setting compression=on defaults the compression algorithm to lzjb. It's recommended to use the lz4 algorithm instead, which is generally both faster and compresses better.

That is easily set by issuing the following command:

sudo zfs set compression=lz4 array1/iscsi
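
To confirm which algorithm is active and how effective it is, check the compression and compressratio properties:

sudo zfs get compression,compressratio array1/iscsi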


If I wanted to set a quota, I would issue the following command:

sudo zfs set quota=200G array1/iscsi


To remove the quota, use the following command:

sudo zfs set quota=none array1/iscsi
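
At any point you can check the quota along with current usage:

sudo zfs get quota,used,available array1/iscsi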


If for some reason you ever want to destroy your zpool, issue the following command, which will force it (and permanently destroy all data in the pool):

sudo zpool destroy -f array1


Add Cache Drive to Zpool

If you happen to have an SSD drive, you can use it as a cache drive (L2ARC) for your Zpool. The idea is that data read from the SSD cache has significantly faster access times than data read from traditional spinning disks. So, for instance, if you add a 250GB SSD drive, up to 250GB of the most frequently accessed data will be kept in the cache. Note that the L2ARC is a read cache: it only holds copies of data that already lives on the spinning disks, so a power failure or a failed cache device does not cause data loss; the cache is simply repopulated over time.

Identify your SSD drive's WWN ID as described above. Assuming the WWN ID for your SSD drive is wwn-0x50025388500f8522, we'll add it to our previously created Zpool like below:

sudo zpool add -f array1 cache /dev/disk/by-id/wwn-0x50025388500f8522


Check the Zpool status:

sudo zpool status

 

  pool: array1
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        array1                      ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x500003960ba809f1  ONLINE       0     0     0
            wwn-0x500003960b804868  ONLINE       0     0     0
            wwn-0x500003960b784775  ONLINE       0     0     0
            wwn-0x500003960b784776  ONLINE       0     0     0
            wwn-0x500003960bd03569  ONLINE       0     0     0
            wwn-0x500003960b704511  ONLINE       0     0     0
        cache
          wwn-0x50025388500f8522    ONLINE       0     0     0

As you can see the drive has been added as cache.
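
If you ever want to remove the cache device again (for example, to repurpose the SSD), cache devices can be removed from a live pool without any data loss; substitute your own WWN ID:

sudo zpool remove array1 wwn-0x50025388500f8522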

Add Log Drives (ZIL) to Zpool

Log (ZIL/SLOG) drives can be added to a ZFS pool to speed up synchronous writes at any ZFS RAID level. The ZFS Intent Log records synchronous writes on a very fast SSD so they can be acknowledged immediately; when the physical spindles have a moment, that data is flushed to the spinning media and the process starts over. We have observed significant performance increases by adding log drives to our ZFS configuration. One thing to keep in mind is that the log device should be mirrored. If it is not mirrored and the drive being used as the log device fails, the system will revert to writing the intent log to the pool disks, severely hampering performance. Alternatively, you can always remove the bad drive and add another one as a log drive.

If the log drive fails (in combination with a crash or power loss), you can lose the last few seconds of synchronous writes. If that's acceptable to you, then a mirror is not necessary. If you are going to be storing mission-critical data, where even a few seconds of lost data will cost significant sums of money, adding log drives in a mirror configuration is a MUST.

If you are going to be using two SSD drives in mirror mode, identify the SSD drives by wwn ID as described above and then add them to your array in mirror mode like below:

sudo zpool add -f array1 log mirror /dev/disk/by-id/wwn-0x50025388500f8668 /dev/disk/by-id/wwn-0x50025388500ffg12


If you are going to be using a single SSD drive, identify it by WWN ID as described above and then add it to your array like below:

sudo zpool add -f array1 log /dev/disk/by-id/wwn-0x50025388500f5af8


Check the Zpool status:

sudo zpool status

 

  pool: array1
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        array1                      ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x500003960ba809f1  ONLINE       0     0     0
            wwn-0x500003960b804868  ONLINE       0     0     0
            wwn-0x500003960b784775  ONLINE       0     0     0
            wwn-0x500003960b784776  ONLINE       0     0     0
            wwn-0x500003960bd03569  ONLINE       0     0     0
            wwn-0x500003960b704511  ONLINE       0     0     0
        logs
          wwn-0x50025388500f5af8    ONLINE       0     0     0
        cache
          wwn-0x50025388500f8522    ONLINE       0     0     0

As you can see, it has been added as a log (ZIL) drive.
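
A log device can likewise be removed from a live pool if needed. For a single log drive use its WWN ID; for a mirrored log, remove the mirror vdev by the name zpool status shows (e.g. mirror-1):

sudo zpool remove array1 wwn-0x50025388500f5af8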

Issues with zpool status on Ubuntu 14.04 after reboot

An issue I've run into: after a reboot, zpool status may show the device names instead of the device IDs, like below:

sudo zpool status

 

  pool: array1
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        array1      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdi     ONLINE       0     0     0
        logs
          sdh       ONLINE       0     0     0
        cache
          sdf       ONLINE       0     0     0

errors: No known data errors


This only seems to be a cosmetic issue because issuing the zdb command shows the device IDs like it's supposed to:

sudo zdb

 

array1:
    version: 5000
    name: 'array1'
    state: 0
    txg: 200
    pool_guid: 12136950353410592998
    errata: 0
    hostid: 2831217162
    hostname: 'nas3'
    vdev_children: 4
    vdev_tree:
        type: 'root'
        id: 0
        guid: 12136950353410592998
        children[0]:
            type: 'mirror'
            id: 0
            guid: 7548278309220334221
            metaslab_array: 39
            metaslab_shift: 35
            ashift: 12
            asize: 4000771997696
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 2562845451665823060
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5N8KZ7N-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 291777340882840666
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EDYVYU2J-part1'
                whole_disk: 1
                create_txg: 4
        children[1]:
            type: 'mirror'
            id: 1
            guid: 8578547322301695916
            metaslab_array: 37
            metaslab_shift: 35
            ashift: 12
            asize: 4000771997696
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 2041375668167635066
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4ENPN3V47-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 15162795176142751617
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E6YRLCHH-part1'
                whole_disk: 1
                create_txg: 4
        children[2]:
            type: 'mirror'
            id: 2
            guid: 302043060234775242
            metaslab_array: 35
            metaslab_shift: 35
            ashift: 12
            asize: 4000771997696
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 5285723468079384932
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EDYVY6HT-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 5203540854438335529
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1VUAYCJ-part1'
                whole_disk: 1
                create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 1510858814325079212
            path: '/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DDNEAF407950E-part1'
            whole_disk: 1
            metaslab_array: 49
            metaslab_shift: 31
            ashift: 13
            asize: 250045005824
            is_log: 1
            create_txg: 62
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data


You MAY be able to fix the issue by issuing the following commands:

sudo zpool export array1
sudo zpool import -d /dev/disk/by-id/ array1
sudo zpool set cachefile= array1
sudo update-initramfs -k all -u


Reboot the machine and do a zpool status. Again, this is only a cosmetic issue and it shouldn't affect anything.

Zpool Status Failure Notifications

The following script will run every hour to check every Zpool's status and it will notify you in case a Zpool encounters a problem such as a failed drive.

First of all, install the mailutils package if it's not already installed:

sudo apt-get install mailutils
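
Before relying on the script, it's worth confirming that outbound mail actually works from this server. A quick test, adjusting the address to your own:

echo "ZFS notification test from $(hostname)" | mail -s "ZFS notification test" someone@domain.tld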


Next, create a script in /etc/cron.hourly/ named zpoolstatus

sudo vi /etc/cron.hourly/zpoolstatus


Paste the following, change someone@domain.tld to the email address you want the notifications sent to, and save the file:

#!/bin/bash
# Email the output of 'zpool status -x' whenever any pool is not healthy.
EMAIL_ADD=someone@domain.tld
# 'zpool status -x' prints 'all pools are healthy' when everything is fine;
# suppress the grep output so the hourly cron job stays quiet in that case.
zpool status -x | grep 'all pools are healthy' > /dev/null
if [ $? -ne 0 ]; then
    # Build the report: timestamp, hostname and the unhealthy pool status.
    /bin/date > /tmp/zfs.stat
    echo >> /tmp/zfs.stat
    /bin/hostname >> /tmp/zfs.stat
    echo >> /tmp/zfs.stat
    /sbin/zpool status -x >> /tmp/zfs.stat
    cat /tmp/zfs.stat | /usr/bin/mail -s "Disk failure in server : `hostname`" "$EMAIL_ADD"
fi


Make the file executable:

sudo chmod +x /etc/cron.hourly/zpoolstatus


Verify that the file will run every hour by running the following command:

sudo run-parts --report --test /etc/cron.hourly


This should give you the following output; if it's blank, check your script again. Ensure the script does not have a .sh extension, or run-parts will skip it:

/etc/cron.hourly/zpoolstatus

That's it!