ZFS (zettabyte file system) is a file system developed by Sun Microsystems. To be released in update 2 of Solaris 10, it is the first 128-bit file system. Supported on both SPARC and PC (x86) platforms, it is endian-neutral. Rather than relying on a conventional journal, it keeps on-disk data consistent with a transactional, copy-on-write design.

The storage capacity provided should outlast Moore’s Law for some time. To quote a Sun developer, "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans."

ZFS is, as noted above, a rather natty new filesystem, which Sun are touting as the "World's most advanced filesystem", not entirely without justification. It has all the features you'd expect of a modern filesystem: it's transactional, so the data on disk stays consistent if Very Bad Things should happen to the computer, and its 128-bit storage addressing scheme means that you'll run out of metal to build disk platters from long before you run out of the capacity to address it. We can more or less take these things for granted these days: the leading edge of even desktop operating systems has kept filesystem limitations far ahead of most hardware requirements, with the brief exception of the FAT fiasco circa 1996.

But what about the novel features? The features that distinguish ZFS from every other 'Enterprise' class filesystem out there? Well, ZFS has a bundle of them.

Storage Pools

Firstly, and most visibly (not to mention most confusingly for UNIX admins), ZFS subsumes the functionality of a volume manager into the filesystem layer. In the traditional UNIX scheme of things, an element of storage (a partition, a disk or a slice, typically) is represented as a block device. A volume manager interfaces one or more block devices and presents them, unified, as one device, on which the filesystem resides. As far as the filesystem is concerned, it has one large extent of space on which to lay out its files.

It's a nice abstraction because it makes the filesystem simple, confining troublesome hardware issues such as disk layout and redundancy through RAID in the Volume Management layer, but it's not entirely abstract: timing behaviour depends on the underlying disks and their arrangements, so file systems are still forced to take some account of the disk behaviour. Also, because all the space associated with a volume is allocated to a single filesystem, if you find you've suddenly run out of space on one filesystem but have plenty of space on another filesystem, you can't reallocate the free space to the full filesystem to ease the pressure on it.

ZFS dispenses with these issues by managing the storage itself, so the filesystem code knows explicitly about the layout of its disks and their performance characteristics. So instead of having one filesystem mounted on one device (volume), ZFS groups together disks into storage 'pools', consisting of 'virtual devices' (or 'vdevs'), which can be simple disks, 'mirror'ed sets of disks (a la RAID1) or partially redundant sets (as in RAID5 and its ilk).

In a storage pool, we can create ZFS 'filesystems', which just like a conventional UNIX filesystem can be mounted anywhere in the VFS hierarchy. Any number of ZFS filesystems can be created in a pool, and the storage is shared between them dynamically.

When the storage in a pool is exhausted, more storage can be added to the pool dynamically by adding extra vdevs to the pool, and all the filesystems in the pool will be able to use the newly available space. We don't have to worry about which filesystem gets the space: they'll use it as and when they need it.
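
To make the shared-space idea concrete, here's a toy model in Python. It's only a sketch of the bookkeeping described above, not anything resembling ZFS's real implementation: the Pool class, its method names and the block counts are all made up for illustration.

```python
class Pool:
    """Toy model of a storage pool: vdevs contribute space, filesystems share it."""

    def __init__(self, *vdev_sizes):
        self.vdev_sizes = list(vdev_sizes)   # capacity of each vdev, in blocks
        self.used = {}                       # filesystem name -> blocks used

    def capacity(self):
        return sum(self.vdev_sizes)

    def free(self):
        return self.capacity() - sum(self.used.values())

    def create_fs(self, name):
        self.used[name] = 0                  # no space is reserved up front

    def write(self, name, blocks):
        if blocks > self.free():
            raise OSError("pool is full")
        self.used[name] += blocks

    def add_vdev(self, size):
        self.vdev_sizes.append(size)         # every filesystem sees the new space


pool = Pool(100)
pool.create_fs("home")
pool.create_fs("var")
pool.write("home", 90)
pool.write("var", 10)    # the pool is now exhausted
pool.add_vdev(100)       # grow the pool with a second vdev
pool.write("var", 50)    # either filesystem may use the new space
print(pool.free())
```

Note that neither filesystem has a size of its own: the only limit is what's left in the pool.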

Of course, when you add more physical devices to a storage pool, you also increase the potentially available read and write bandwidth of the pool, because two disks can transfer twice as much data as one disk (the fundamental principle that underpins RAID0 disk striping, for example), so ZFS will dynamically balance the allocation of blocks between all available storage devices to maximise the use of the available bandwidth.
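
As a rough illustration of balancing writes across devices, the sketch below places each new block on whichever device currently has the most free space. That greedy policy is purely an assumption for the sake of the example; the real ZFS allocator is considerably more involved, and the function name and numbers are hypothetical.

```python
def allocate(free_blocks, n):
    """Place n blocks one at a time on whichever device has the most free space.

    free_blocks: list of free block counts per device (modified in place).
    Returns a list with the number of blocks placed on each device.
    """
    placed = [0] * len(free_blocks)
    for _ in range(n):
        target = max(range(len(free_blocks)), key=lambda i: free_blocks[i])
        if free_blocks[target] == 0:
            raise OSError("pool is full")
        free_blocks[target] -= 1
        placed[target] += 1
    return placed


# An older, nearly full device (20 free) and a freshly added one (100 free).
free = [20, 100]
placed = allocate(free, 100)
print(placed, free)
```

The new device soaks up most of the writes until both have the same amount of free space, after which the writes alternate between them.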

These design features owe a lot to ZFS's conceptual predecessor, WAFL, the 'Write Anywhere File Layout' filesystem developed by NetApp for their network file server appliances. Just as the name implies, WAFL can write anywhere: plug a new disk into the NetApp, and it will start extending its existing filesystems onto it.

Simplified administration

This is so confusing for UNIX admins because it breaks the UNIX model and, more to the point, the standard tools, such as mount(1), which make the fundamental underlying assumption that there will be at most one filesystem per device, and at most one device per filesystem.

Evidently, new tools are required to work with this new paradigm. But rather than add yet more tools to the vocabulary an admin needs to manage their storage (one might imagine, for instance, that they'd invent a command to create a virtual device on a pool, which the admin could then mount(1) in the traditional fashion), the ZFS command line interface, like the filesystem itself, takes on all the roles. As a result, to administer a ZFS-based system you only need to know two commands in total.

The zpool command manages storage pools. You can create a new pool with 'zpool create ...'. You can add new storage elements to it with 'zpool add ...'. There are a bundle more options, for managing mirror sets and RAID sets, but fundamentally, that's about all there is to it.

The zfs command manages filesystems within pools. You can use 'zfs create ...' to create a new filesystem, and when you do, that filesystem is automatically mounted at an appropriate mount point (which you can of course change later... using zfs) and remounted there the next time the system is restarted. No messing around with mount, no editing /etc/fstab.
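
For a feel of what those two commands look like in practice, here's a small Python sketch that just builds the command lines and prints them without running anything. The subcommands (zpool create, zpool add, zfs create, zfs set mountpoint=...) are the ones described above; the pool name "tank", the device names, the mount path and the helper function names are all invented for the example.

```python
def zpool_create(pool, *vdev_spec):
    """Command line to create a pool from a vdev specification."""
    return ["zpool", "create", pool, *vdev_spec]

def zpool_add(pool, *vdev_spec):
    """Command line to add another vdev to an existing pool."""
    return ["zpool", "add", pool, *vdev_spec]

def zfs_create(dataset):
    """Command line to create (and automatically mount) a filesystem."""
    return ["zfs", "create", dataset]

def zfs_set_mountpoint(dataset, path):
    """Command line to move a filesystem to a different mount point."""
    return ["zfs", "set", f"mountpoint={path}", dataset]


commands = [
    zpool_create("tank", "mirror", "c0t0d0", "c1t0d0"),   # hypothetical disks
    zpool_add("tank", "mirror", "c2t0d0", "c3t0d0"),
    zfs_create("tank/home"),
    zfs_set_mountpoint("tank/home", "/export/home"),
]
for argv in commands:
    print(" ".join(argv))
```

Each list could be handed to subprocess.run on a machine that actually has ZFS, but that step is deliberately left out here.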

Snapshots

Another feature, and again one also present in WAFL, is the ability to create instantaneous snapshots of a filesystem.

A snapshot is, simply put, an exact image of the filesystem at a given moment in time, which appears like a normal directory and can be browsed and read (although not written to). It's a bit like a complete backup of the filesystem, except that it doesn't take any time to create (at least, not much), and it doesn't take any extra storage space to begin with: space is only consumed as the live filesystem diverges from it. Instead of overwriting data when you modify the filesystem, ZFS simply writes the updated data somewhere else on the disk. The snapshot still refers to the old version, and the real filesystem refers to the new version.
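
The write-somewhere-else trick is easy to sketch in Python. This is a toy with made-up names, not ZFS's actual on-disk structures: a real implementation would hang on to the old root of a block tree rather than copying a table of pointers, but the principle is the same.

```python
class Filesystem:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}        # block id -> data; written once, never modified
        self.live = {}          # file name -> block id (the live filesystem)
        self.snapshots = {}     # snapshot name -> frozen {file name: block id}
        self.next_id = 0

    def write(self, name, data):
        self.blocks[self.next_id] = data   # new data goes to a fresh block
        self.live[name] = self.next_id     # only the live pointer moves
        self.next_id += 1

    def snapshot(self, snap):
        # Copies pointers only, never file data.
        self.snapshots[snap] = dict(self.live)

    def read(self, name, snap=None):
        table = self.live if snap is None else self.snapshots[snap]
        return self.blocks[table[name]]


fs = Filesystem()
fs.write("thesis.txt", "draft 1")
fs.snapshot("monday")
fs.write("thesis.txt", "draft 2")
print(fs.read("thesis.txt"))                  # the live version
print(fs.read("thesis.txt", snap="monday"))   # the version in the snapshot
```

The snapshot costs nothing until the file changes, and after that it costs exactly one old block.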

If you don't immediately see how useful this could be, then I can only assume you've been lucky (or, possibly, 'clever') enough never to have hit 'delete' by mistake, or overwritten the 'current' version of your thesis/novel/masterpiece with the one that you sent to your supervisor/editor/publisher last month.

This is the feature that made me fall in love with NetApp. It's the feature that saved my ass more times than I'd care to count, and the feature which for a time had me scouring eBay for second hand NetApp filers. (Alas, you can get the hardware for reasonably cheap, but not the software licenses...)

Not only can snapshots save a user's hide, they are also useful for system admin purposes such as backups: creating a full backup of a filesystem takes time, and in the time it takes to back it up, live data will be constantly changing, resulting in a backup that does not represent a single coherent state of the system. With snapshots, you can create a snapshot and back up the snapshot rather than the live filesystem, safe in the knowledge that you're capturing a single, consistent state of the filesystem.

Where WAFL creates snapshots automatically at set times, ZFS snapshots are created and removed explicitly; but that's nothing that judicious application of cron can't fix.
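
One way the cron approach might look: a small Python script, run on whatever schedule you like, that names each snapshot after the time it was taken. The zfs snapshot subcommand and its filesystem@name syntax are real; the dataset name "tank/home", the timestamp format and the function name are just choices made for this sketch, and the command is only printed, not executed.

```python
from datetime import datetime

def snapshot_command(dataset, when):
    """Build a 'zfs snapshot' command line with a timestamped snapshot name."""
    name = when.strftime("%Y-%m-%d-%H%M")
    return ["zfs", "snapshot", f"{dataset}@{name}"]


# A fixed time is used here so the result is predictable; a cron job
# would pass datetime.now() instead.
argv = snapshot_command("tank/home", datetime(2006, 1, 2, 3, 0))
print(" ".join(argv))
```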

A closely related feature is the ability to create clone filesystems, turning a snapshot into a writable filesystem in its own right. Once a filesystem has been snapshotted and cloned, existing data is shared until either version is written, at which point the two copies diverge. Great for experimental changes to your filesystem, configuration or source trees (if you're not already using Subversion, that is).
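
Here's a minimal Python sketch of how a clone and its origin can share data until one of them is written. The block store, the pointer tables and the file name are all hypothetical and only meant to show the sharing-then-diverging behaviour.

```python
blocks = {0: "config v1"}      # shared block store: block id -> data

snapshot = {"app.conf": 0}     # read-only pointer table
original = dict(snapshot)      # live filesystem the snapshot was taken from
clone = dict(snapshot)         # writable clone created from the snapshot

def write(table, name, data):
    """Copy-on-write: store data in a new block and repoint only this table."""
    new_id = max(blocks) + 1
    blocks[new_id] = data
    table[name] = new_id


# Before any write, both filesystems point at the very same block.
shared_before = original["app.conf"] == clone["app.conf"]
write(clone, "app.conf", "config v2 (experimental)")
print(shared_before)
print(blocks[original["app.conf"]])
print(blocks[clone["app.conf"]])
```

After the write, the clone has its own copy of that one file and still shares everything else with the original.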

Miscellany

There's a handful of other features, too. For instance, ZFS is 'self-healing' for redundant RAID devices: if it detects any sort of data corruption on a device, ZFS will silently restore the correct data to the device.
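
The self-healing idea can be sketched in a few lines of Python, assuming a two-way mirror and a checksum recorded when the block was written. The function names are made up, and SHA-256 is used here simply because it's in the standard library, not as a claim about what ZFS does by default.

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def read_mirrored(copies, expected):
    """Return good data from a mirror and rewrite any copy that fails its checksum.

    copies: list of bytes objects, one per side of the mirror (repaired in place).
    expected: checksum recorded when the block was written.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies are corrupt")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good       # silently repair the bad side
    return good


data = b"important block"
expected = checksum(data)
mirror = [b"importanT block", data]    # the first copy has silently rotted
result = read_mirrored(mirror, expected)
print(result == data, mirror[0] == mirror[1])
```

Without redundancy the corruption can still be detected this way, but there's no good copy to repair it from.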

There's also online compression, something which the Windows folks have been lording over us UNIX types for quite some time now.

The endian-ness of ZFS is, apparently, adaptive, in order to allow disks to be moved between SPARC (big endian) and Intel/AMD systems (little endian). In practical terms, I'd imagine this means that the endian-ness of filesystem data is explicitly flagged in the storage format, and checked/decoded when read, with new data being written in the host's native format.
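
That guess is simple enough to sketch in Python. The record layout below (one flag byte followed by a 64-bit integer) is entirely hypothetical and has nothing to do with the real ZFS on-disk format; it just shows the write-native, flag-it, decode-on-read scheme imagined above.

```python
import struct
import sys

def encode(value, byteorder=sys.byteorder):
    """Write a 64-bit integer in the writer's byte order, prefixed by a flag byte."""
    flag = b"\x01" if byteorder == "little" else b"\x00"
    fmt = "<Q" if byteorder == "little" else ">Q"
    return flag + struct.pack(fmt, value)

def decode(record):
    """Read the flag, then interpret the payload in the order it was written."""
    fmt = "<Q" if record[0] == 1 else ">Q"
    return struct.unpack(fmt, record[1:])[0]


from_sparc = encode(1234, "big")      # as a big-endian host would write it
from_x86 = encode(1234, "little")     # as a little-endian host would write it
print(decode(from_sparc), decode(from_x86))
```

The bytes on disk differ between the two records, but any host can read either one back correctly, and nobody pays a byte-swapping penalty for data written on the same kind of machine.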

Now, if you'll excuse me, I have to go and try and figure out how to make Solaris 11 do all the things that my Debian server currently does for me... in the meantime, you can find out a hell of a lot more over at:

http://opensolaris.org/os/community/zfs/
