RAID5 DeepDive and Full-Stripe Nerdvana
Ask any user of a SAN if cache matters. Cache DOES matter. Cache is King! But apart from being “just” something that can handle your bursty workloads, there is another advantage some vendors offer when you have plenty of cache. It is all in the implementation, but the smarter vendors out there will save you significant overhead when you use RAID5 or RAID6, especially in a write-intensive environment.
A recap on RAID
Flashback to a post from a while back: Throughput part 2: RAID types and segment sizes. There you can read all about RAID types and their pros and cons. For now we focus on RAID5 and RAID6: these RAID types are the most space efficient ones, but they have a rather big impact on small random writes. The write penalty, as it is called, is high on these RAID types, effectively putting a lot of pressure on the back end of a storage array.
The write penalty is something most people know about, but only a few actually understand where it comes from. So a short recap of what this is all about, using a RAID5 (4+1) setup as an example:
See Figure 1. There are five spindles (yes, I often call disks “spindles”) drawn here, across which data is striped. The green bits represent data on the RAID5 set, while the grey bits contain parity data of each “stripe”. A stripe is all the segments in a single “row” touching all the disks in the RAID set. Note that the grey parity segment shifts from stripe to stripe. This is what makes RAID5 RAID5; the older RAID3 is basically the same thing, except that in RAID3 one dedicated spindle is used for parity alone (which has its own downsides).
So in this case there are four segments carrying data plus one parity segment per stripe. Many stripes together form the RAID5 group, out of which one or more volumes (LUNs) may be carved (in this case we use stripe “n” through stripe “n+y” for a single volume).
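To make the layout a bit more concrete, here is a minimal Python sketch of a (4+1) layout with rotating parity. The disk numbering and the direction of rotation are my own assumptions for illustration; real arrays use their own exact layouts, but the principle is the same.

```python
# Toy model of a RAID5 (4+1) layout: every stripe has 4 data segments and
# 1 parity segment, and the parity column rotates from stripe to stripe.
DISKS = 5  # 4 data segments + 1 parity segment per stripe

def parity_disk(stripe):
    """Spindle holding the parity segment of this stripe (rotating)."""
    return (DISKS - 1 - stripe) % DISKS

def data_disks(stripe):
    """The four spindles holding the data segments of this stripe."""
    p = parity_disk(stripe)
    return [d for d in range(DISKS) if d != p]

if __name__ == "__main__":
    for stripe in range(6):
        print(f"stripe {stripe}: parity on disk {parity_disk(stripe)}, "
              f"data on disks {data_disks(stripe)}")
```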
Reading data from the RAID5 group
So what happens if we need to read a small piece of data from the RAID5 set? This is very simple. As long as the data is smaller than a single segment, AND the data has been aligned (for more details on alignment see my posts Throughput part 3: Data alignment and The Elusive Miss Alignment), only a single segment will be read from a single spindle. In a random I/O environment (and today’s virtual environments are almost always very random in nature) that means only one spindle needs to seek out the data. For those who can picture this in their minds: a misaligned small piece of data may fall “in between” two spindles, which means two spindles need to seek for the data, and that will impact performance.
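As a small illustration of the alignment point, here is a hypothetical helper (assuming a 64 KB segment size, which is just an example value) that counts how many segments, and thus spindles, a small read touches. A misaligned read that crosses a segment boundary hits two spindles instead of one.

```python
SEGMENT_KB = 64  # assumed segment size, for illustration only

def segments_touched(offset_kb, size_kb):
    """Number of segments (and thus spindles) a small I/O touches."""
    first = offset_kb // SEGMENT_KB
    last = (offset_kb + size_kb - 1) // SEGMENT_KB
    return last - first + 1

print(segments_touched(0, 4))   # aligned 4 KB read    -> 1 segment, 1 spindle
print(segments_touched(62, 4))  # misaligned 4 KB read -> 2 segments, 2 spindles
```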
Writing data to the RAID5 group
Writing is more of a hassle. Just like reading, it would be very easy indeed to access a single spindle and write the data. But there is one problem: on each stripe there is also something called PARITY. The segment containing the parity will need modification as well. Even worse, how are we going to recalculate the parity when we only know the one segment we are about to write? The answer is simple: we cannot. We need to do more to get this done:
- Read the segment about to be modified;
- Read the parity segment;
- Recalculate the parity: because we know the “old” and “new” segment plus the old parity, we can compute the new parity;
- Write the new segment to disk;
- Write the new parity to disk.
As you can see, we need a total of FOUR disk I/O’s and some CPU power just to modify one single segment on the RAID5 group. This is what we call a “Write Penalty of FOUR”, because we need four I/O’s for every single segment write.
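For XOR-based parity, the recalculation step boils down to new parity = old parity XOR old data XOR new data. A minimal sketch on toy byte strings (the function names here are mine, not any array’s API):

```python
def xor(a, b):
    """Bytewise XOR of two equally sized segments."""
    return bytes(x ^ y for x, y in zip(a, b))

def new_parity_small_write(old_data, old_parity, new_data):
    """Read-modify-write parity update for one modified segment.
    I/O cost: read old_data + read old_parity + write new_data +
    write new_parity = 4 disk I/Os for 1 modified segment."""
    return xor(xor(old_parity, old_data), new_data)

# Example: a (4+1) stripe with four toy data segments and its parity
stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = b"\x00" * 4
for seg in stripe:
    parity = xor(parity, seg)

# Modify the second segment and update the parity the read-modify-write way
new_seg = b"ZZZZ"
parity = new_parity_small_write(stripe[1], parity, new_seg)
stripe[1] = new_seg

# Sanity check: recomputing parity over the whole stripe gives the same answer
check = b"\x00" * 4
for seg in stripe:
    check = xor(check, seg)
assert parity == check
```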
Writing multiple segments to a RAID5 group
If you think writing a single segment is real fun, what about the case where you need to modify two segments on disk? Well, here goes:
- Read both segments from disk;
- Read the parity segment;
- Recalculate the new parity;
- Write both segments to disk;
- Write the parity segment to disk.
As you can see, we now need SIX disk I/O’s for modifying TWO segments. So in this case, the write penalty is THREE!
So what would happen if we need to modify three segments on a stripe? Here we go again:
- Read three segments;
- Read the parity;
- Recalculate the parity;
- Write three segments;
- Write the parity.
Here we are performing EIGHT disk I/O’s for modifying THREE segments. So the write penalty in this case is 8/3rd, or 2.66 🙂
Now if you’re really smart (and some Storage Processors out there ARE), you could actually use a smarter way to do this. Compare the previous action to this one:
- Read the ONE segment that will not be modified;
- Calculate the parity;
- Write out three new segments;
- Write out the parity.
As you can see, here we use the non-changing data together with the new segments to calculate the parity, because in this case we know ALL the stripe members just by reading the one we will not modify! Now the write penalty looks like this: first we read ONE segment, then we write out THREE segments and the parity segment. This results in FIVE disk I/O’s for THREE segments, giving us a very good write penalty of only 5/3rd, or 1.6666, which is actually BETTER than RAID10 (which always has a write penalty of TWO).
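In XOR terms this “reconstruct write” simply builds the parity from the one untouched segment plus the three new ones. A sketch under the same toy assumptions as before:

```python
def xor(a, b):
    """Bytewise XOR of two equally sized segments."""
    return bytes(x ^ y for x, y in zip(a, b))

# Reconstruct-write parity for a (4+1) stripe where three segments change:
# read the ONE untouched segment and XOR it with the three new segments.
# I/O cost: 1 read + 3 data writes + 1 parity write = 5 I/Os for 3 segments.
untouched = b"AAAA"
new_segments = [b"XXXX", b"YYYY", b"ZZZZ"]

parity = untouched
for seg in new_segments:
    parity = xor(parity, seg)

print(parity)  # the new parity to write alongside the three new segments
```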
The power of a FULL-STRIPE write
As we are modifying more and more segments in the stripe, we only have one situation left, and this one really rocks: What if we need to modify ALL segments in the stripe?
- Calculate the parity;
- Write FOUR segments to disk;
- Write the parity segment to disk.
Amazingly effective: here we perform FIVE disk I/O’s for modifying FOUR segments, giving us a whopping 5/4th, or 1.25, write penalty. This is BY FAR the best we have seen so far, and much more efficient than even a RAID10 group, which is generally considered to be effective at performing writes.
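To pull all of these cases together, here is a rough sketch that computes both the read-modify-write cost and the “read the untouched members” cost for any (n+1) RAID5 stripe. Note that the sketch always picks the cheaper of the two paths, so for two modified segments it reports 2.50 instead of the 3 we got earlier by taking the read-modify-write route; for a full stripe it needs no reads at all.

```python
def rmw_ios(k_modified):
    """Read-modify-write: read k old segments + old parity,
    write k new segments + new parity."""
    return 2 * k_modified + 2

def reconstruct_ios(n_data, k_modified):
    """Reconstruct write: read the (n - k) untouched segments,
    write k new segments + new parity. For k == n this is a
    full-stripe write with zero reads."""
    return (n_data - k_modified) + k_modified + 1

if __name__ == "__main__":
    n = 4  # a (4+1) RAID5 set
    for k in range(1, n + 1):
        ios = min(rmw_ios(k), reconstruct_ios(n, k))
        print(f"{k} of {n} segments modified: "
              f"{ios} I/Os, write penalty {ios / k:.2f}")
```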
The Big “IF”
So as we determined, the RAID5 type of RAID is actually the most effective when it comes to writing out to disk. The bigger the number of members in the stripe, the more effective it becomes (think about an (8+1) RAID5 set, where the write penalty is only 9/8th on full-stripe writes). Only RAID0 is more effective (it has a write penalty of ONE), but as this RAID type does not deliver any redundancy, we do not consider it a “true” RAID type (not saying it has no use cases!)
So now for the “Big IF”. A RAID5 is the most efficient RAID type for writing IF you perform full-stripe writes only. And there is the big IF: given the fact that most writes in modern virtualized environments are both small and random, we almost never get to perform a full-stripe write. So how could we improve the chances of being able to perform a full-stripe write?
Optimizing writes to RAID5: Cache is King
The more efficient Storage Processors will cache their writes. So far there is no issue writing out to any RAID type, because the writes are kept safe and sound in the write cache (assuming you have a serious dual-controller setup that mirrors the writes between its caches). But at some point the write cache fills up, and parts of the cache need to be flushed out to disk.
If you implement a write cache algorithm, you may consider this: why not keep all writes in cache, but preferentially flush any pending writes that will result in a full-stripe write? This actually helps a great deal; a smart caching algorithm will promote the use of full-stripe writes out to disk, specifically for RAID5 and RAID6 setups (and all other types that use parity in their protection model, like RAID3 etc.).
So the bottom line: the more cache you have, the more writes can be performed as full-stripe writes, optimizing performance.
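A minimal sketch of such a flush policy, assuming a write cache that tracks dirty segments per stripe (all the structures and names here are hypothetical, not any vendor’s implementation): flush stripes that are completely dirty first, and only fall back to partial stripes when forced to.

```python
from collections import defaultdict

STRIPE_WIDTH = 4  # data segments per stripe in a (4+1) RAID5 set

class WriteCache:
    """Toy write cache that prefers flushing full stripes."""

    def __init__(self):
        # stripe number -> {segment index -> data}
        self.dirty = defaultdict(dict)

    def write(self, stripe, segment, data):
        self.dirty[stripe][segment] = data

    def full_stripes(self):
        return [s for s, segs in self.dirty.items()
                if len(segs) == STRIPE_WIDTH]

    def flush_some(self):
        """Flush full stripes if any exist; otherwise flush one partial
        stripe (which costs extra parity I/Os on the back end)."""
        candidates = self.full_stripes() or list(self.dirty)[:1]
        for stripe in candidates:
            segs = self.dirty.pop(stripe)
            kind = "full-stripe" if len(segs) == STRIPE_WIDTH else "partial"
            print(f"flushing stripe {stripe} ({kind}, {len(segs)} segments)")

# Usage: fill one stripe completely and another only partially
cache = WriteCache()
for seg in range(4):
    cache.write(10, seg, b"data")
cache.write(11, 0, b"data")
cache.flush_some()   # flushes stripe 10 as a full-stripe write
cache.flush_some()   # forced to flush partial stripe 11
```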
Caching in SSD: Nerdvana?
Some vendors have begun to “extend” their cache memory using SSDs. This can greatly enhance the effective amount of cache memory. As you grow the amount of cache, the percentage of writes that end up as full-stripe writes will go way up, possibly approaching 100%…
You need to beware though: some vendors add SSDs (or flash cards) to their arrays that only perform READ caching. As you can imagine, caching reads alone will not help here!
Growing the cache like this is not complete Nerdvana just yet though: with everything else getting faster and faster, CPU performance is actually being caught up with at this moment. Storage networks are getting faster (10Gb Ethernet, 8Gb Fibre Channel, back-end SAS loops). Disks are finally getting faster too (SSD, flash). So right now the CPU is once again becoming the bottleneck inside storage arrays.
Optimizing for full-stripe writes can be done relatively easily with only a little CPU overhead. But a trick like “find me the biggest run of sequential full-stripe writes in cache” is a more complex one. It would be a very effective way to flush your cache: all spindles in the RAID set would need to seek, but after that (slow) seek many full-stripe writes are flushed from cache in one swift stroke (no need to seek for every full-stripe write, only track-to-track seeks, which are way faster). Plus the amount of data flushed out to disk would generally be a pretty big chunk too.
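A rough sketch of that “biggest run of sequential full-stripe writes” idea, assuming we simply track which stripe numbers are completely dirty in cache (again purely hypothetical, ignoring the real bookkeeping a controller has to do):

```python
def longest_full_stripe_run(full_stripes):
    """Given the set of stripe numbers that are completely dirty in cache,
    return the longest run of consecutive stripes: one seek, then many
    full-stripe writes flushed back to back with only track-to-track seeks."""
    best_start, best_len = None, 0
    for s in full_stripes:
        if s - 1 in full_stripes:        # not the start of a run
            continue
        length = 1
        while s + length in full_stripes:
            length += 1
        if length > best_len:
            best_start, best_len = s, length
    return best_start, best_len

# Example: stripes 7, 8, 9, 10 form the biggest sequential run
print(longest_full_stripe_run({3, 7, 8, 9, 10, 42}))  # -> (7, 4)
```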
As CPUs have seen their biggest growth (at least for now), we might consider separate processing units just for handling these big and blazingly fast caches. So what is it going to be: faster CPUs, bigger CPUs, or maybe separate caching engines? Needless to say, the near future will bring interesting times for sure.
As usual very informative blog post!
Are the SP’s in my EMC CX4-240c smart enough to do RAID5 write optimization?
If not, can FAST Cache on EFD be an option to overcome overall performance problems, especially to cache writes?
Thx,
Didier
Thanks! In answer to your question: Yes, an EMC CX4-240 is certainly smart enough to try and promote full-stripe writes as data is flushed from its cache.
Adding FAST Cache will certainly help a lot. The effectiveness of a write cache is all about its size; the more it can buffer, the more optimal the flushing can be. From the outside, EMC’s FAST Cache looks like a straight expansion of the DRAM cache. Looking deeper, it actually is not: in reality it sits under the DRAM cache. So writes will hit the DRAM cache, and after a forced flush they can be buffered in FAST Cache. But again, “from the outside” it looks like one big cache (unless you give it really specific workloads, in which case you may be able to see how it is truly positioned in the flow).
Thanks for the excellent post!
Out of curiosity, what performance gains could be squeezed out of splitting the RAID-5 parity segments away from the group of disks that comprise the RAID-5 array and instead storing those parity segments on a RAID-1 array comprised of two SSD drives? The concept would be similar to RAID-3, only the parity data would end up on redundant, super fast storage.
Hi, thanks for the reply.
Interesting to see that you’re thinking about mixing drive types within one RAID group. If you offloaded the parity to a RAID1 set of SSDs, you could possibly reduce the overhead of writing to a RAID5/6 set.
But: you’d need the SSDs to be just as big as any other spindle in the RAID5/6 set. That would mean you’d need a RAID10 set of four 200GB SSDs just to cover the parity of a RAID5 set built from 300GB SAS drives. For RAID6 that amount would double.
For SATA / NL-SAS RAID sets using 2TB drives you’d need 20 SSDs (!) for a RAID5 set and 40 SSDs (!!!!) for a RAID6 set. As you can see that would not really help; you’d be way better off just using the SSDs in their own RAID10/RAID5 sets, used in a smart way (auto-tiering or cache extension) to make the best of them.
If you do this properly, you can get immense gains by just adding 3-5 SSDs to an existing array (if the workload has a nice skew – meaning the hot data is localized to a limited set of stripes to match the footprint of the SSDs)
Of course, I hadn’t thought of the obvious storage capacity differential. Good point, as always 🙂
Isn’t your example diagram that of a 3+1 R5? That would be common in a Symmetrix, but Clariion has traditionally favored 4+1 or 8+1 to have a higher chance of performing full-stripe writes. Just curious… I think you forgot a spindle 🙂
Ha, you are right! Never caught that one… The post is written around (4+1); the diagram comes from one of my presentations where I used (3+1) just because a (4+1) did not fit on the screen… I’ll try to get a (4+1) in the picture 🙂
No matter how big the stripe is, the idea is always the same. The reason for favoring a certain number of spindles on certain hardware is sometimes tied to the number of spindles per DAE, sometimes to optimizations in the SAN’s code.
🙂
Yep we do 4+1 often due to our 15 drive DAEs and nice even stripe size with 64k elements.
Take care,
Adam (EMC)