The previous post introduced readers to the concept of VDEVs with ZFS. This post continues the topic, discussing the RAIDZ VDEVs in great detail.

To understand RAIDZ, you first need to understand parity-based RAID levels, such as RAID-5 and RAID-6. Let's discuss the standard RAID-5 layout. You need a minimum of 3 disks for a proper RAID-5 array. A parity bit is then calculated such that the XOR of all three stripes in the set comes to zero. This allows you to suffer one disk failure and recalculate the data. Further, in RAID-5, no single disk in the array is dedicated to the parity data. Instead, the parity is distributed throughout all of the disks. Thus, any disk can fail, and the data can still be restored.
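Just to make the XOR math concrete, here's a quick sketch in Python (my own illustration, not anything from ZFS itself) that builds a parity block for a 3-disk stripe and then rebuilds a lost data block from the survivors. The block contents are made up for the example.

```python
# Minimal sketch of RAID-5 style XOR parity (illustrative only, not ZFS code).

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR byte-for-byte across equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Two data blocks in a 3-disk stripe; the third disk holds the parity.
data0 = b"hello world, this is disk 0 data"
data1 = b"and this is the data on disk one"

parity = xor_blocks(data0, data1)

# XOR of all three members of the stripe is zero, the RAID-5 invariant.
assert xor_blocks(data0, data1, parity) == bytes(len(parity))

# Lose any one disk, and the survivors reconstruct it.
rebuilt_disk0 = xor_blocks(data1, parity)
assert rebuilt_disk0 == data0
```

Any single missing member, data or parity, can be recomputed this way, which is exactly the property RAID-5 relies on.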
Suppose that you write the data out in the RAID-5 stripe, but a power outage occurs before you can write the parity. Jeff Bonwick, the creator of ZFS, refers to this as the "RAID-5 write hole". In reality, it's a problem, no matter how small, for all parity-based RAID arrays. If there exists any possibility that you can write the data blocks without writing the parity bit, then we have the "write hole". What sucks is that software-based RAID is not aware that a problem exists. Now, there are software work-arounds to identify that the parity is inconsistent with the data, but they're slow and not reliable. As a result, software-based RAID has fallen out of favor with storage administrators. Rather, expensive (and failure-prone) hardware cards, with battery backups on the card, have become commonplace.

There is also a big performance problem to deal with. If the data being written to the stripe is smaller than the stripe size, then the data on the rest of the stripe must be read, and the parity recalculated. This causes you to read and write data that is not pertinent to the application. Rather than reading only live, running data, you spend a great deal of time reading "dead" or old data. So, as a result, expensive battery-backed NVRAM hardware RAID cards can hide this latency from the user: the NVRAM buffer holds the writes while the card works on the stripe, until it's been flushed to disk.

In both cases, the RAID-5 write hole and writing data to disk that is smaller than the stripe size, the atomic transactional nature of ZFS does not like the hardware solutions, as they're incompatible with it, and does not like the existing software solutions, as they open up the possibility of corrupted data.

So, we need to rethink parity-based RAID. Rather than the stripe width being statically set at creation, the stripe width in RAIDZ is dynamic. Every block transactionally flushed to disk is its own stripe width. Every RAIDZ write is a full stripe write. Further, the parity bit is flushed with the stripe simultaneously, completely eliminating the RAID-5 write hole. So, in the event of a power failure, you either have the latest flush of data, or you don't. But, your disks will not be inconsistent.

![Demonstrating the dynamic stripe size of RAIDZ]()

With standardized parity-based RAID, the logic is as simple as "every disk XORs to zero". With dynamic variable stripe width, such as RAIDZ, this doesn't work. Instead, we must pull up the ZFS metadata to determine the RAIDZ geometry on every read. If you're paying attention, you'll notice the impossibility of such a task if the filesystem and the RAID are separate products: your RAID card knows nothing of your filesystem, and vice-versa. This is what makes ZFS win.

Further, because ZFS knows about the underlying RAID, performance isn't an issue unless the disks are full. Reading filesystem metadata to construct the RAID stripe means only reading live, running data. There is no worry about reading "dead" data, or unallocated space. So, metadata traversal of the filesystem can actually be faster in many respects. You don't need expensive NVRAM to buffer your writes, nor do you need battery backup in the event of the RAID write hole. So, ZFS comes back to the old promise of a "Redundant Array of Inexpensive Disks". In fact, it's highly recommended that you use cheap SATA disks, rather than expensive Fibre Channel or SAS disks, for ZFS.

This brings us to the single largest reason why I've become such a ZFS fan: ZFS can detect silent errors and fix them on the fly. Suppose for a moment that there is bad data on a disk in the array, for whatever reason. When the application requests the data, ZFS constructs the stripe as we just learned, and compares each block against the checksum stored in the metadata (the default checksum is currently fletcher4). If a block does not match its checksum, ZFS reconstructs the correct data from the parity and repairs the bad block on disk.
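To illustrate the idea of that read path, here's a small Python sketch (again, my own toy example, not ZFS source, and with SHA-256 standing in for fletcher4) of a checksum-verified, self-healing stripe read: verify each block against its stored checksum, and rebuild any failing block from the remaining blocks plus parity before returning it.

```python
# Illustrative sketch of checksum-verified, self-healing reads.
# Simplified pseudo-RAIDZ logic, not actual ZFS code; a real fletcher4
# implementation is replaced here by SHA-256 for brevity.
import hashlib

def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def read_stripe(disks: list[bytearray], checksums: list[bytes]) -> list[bytes]:
    """Return the data blocks of a stripe, healing any block whose
    checksum does not match by rebuilding it from the other members."""
    *data, parity = disks
    healed = []
    for i, block in enumerate(data):
        if checksum(bytes(block)) != checksums[i]:
            # Silent corruption detected: rebuild from the surviving
            # data blocks plus parity, then rewrite the bad block in place.
            others = [bytes(b) for j, b in enumerate(data) if j != i]
            good = xor_blocks(*others, bytes(parity))
            disks[i][:] = good          # "self-heal" the bad copy
            block = bytearray(good)
        healed.append(bytes(block))
    return healed

# Tiny demo: two data disks plus one parity disk.
d0 = bytearray(b"block zero data!")
d1 = bytearray(b"block one data!!")
parity = bytearray(xor_blocks(bytes(d0), bytes(d1)))
sums = [checksum(bytes(d0)), checksum(bytes(d1))]

d0[0] ^= 0xFF                            # flip a bit: silent corruption
assert read_stripe([d0, d1, parity], sums) == [b"block zero data!", b"block one data!!"]
assert bytes(d0) == b"block zero data!"  # the bad block was repaired on "disk"
```

The important part is the last assert: the corrupted block wasn't just detected, it was rewritten with the correct data, which is the "fix them on the fly" behavior described above.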