VMdamentals.com

VMware Technical Deepdive by Erik Zandboer

RAID5 write performance – Revisited

November 4th, 2010

In the post “Throughput part 2: RAID types and segment sizes” I wrote that a RAID5 setup can potentially perform better than RAID10 in a heavy-write environment, if tuned right. While this might be true in theory, I have some important addenda to that statement that argue against RAID5, which I’d like to share.

The original idea

The original idea was hidden inside the “write penalty” of a RAID array. RAID10 is relatively simple: the write penalty is always 2. This is because of the way a RAID10 is constructed: it is basically a striped set of mirrors (a RAID0 of several RAID1s, if you like). So for every segment you write, TWO disks need to seek and write (both disks in a single RAID1 mirror). This makes RAID10 a simple and high-performing setup, especially for heavy writes.

Now we look at RAID5. RAID5 is basically a RAID0 striped set of disks. But since RAID0 will lose ALL of its data if a single member fails, RAID5 adds a parity segment to every stripe. This impacts writing to a RAID5 array though: if a single segment has to be written (like in the RAID10 example above), the array first needs to read that segment, read the parity segment, recalculate the parity, and then write both the new segment and the new parity back to the two disks. This is commonly called “read-modify-write”, and it has a write penalty of 4 (two reads plus two writes), which is obviously worse than the RAID10 variant.
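To make the four I/Os of a read-modify-write visible, here is a minimal Python sketch. It relies on the fact that RAID5 parity is the XOR of the data segments in a stripe. The function names and the tiny 4-byte segments are my own illustration, not how any particular array implements it.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized segments byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def full_parity(segments):
    """Parity of a whole stripe: XOR of all data segments."""
    return reduce(xor_bytes, segments)

def read_modify_write(stripe, parity, index, new_data):
    """Update a single data segment (hypothetical helper).

    Returns (new_stripe, new_parity, io_count).
    """
    old_data = stripe[index]          # I/O 1: read the old data segment
    old_parity = parity               # I/O 2: read the old parity segment
    # Remove the old data from the parity, then add the new data.
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    new_stripe = list(stripe)
    new_stripe[index] = new_data      # I/O 3: write the new data segment
    io_count = 4                      # I/O 4: write the new parity segment
    return new_stripe, new_parity, io_count

# A (4+1) stripe with four tiny data segments.
stripe = [bytes([i] * 4) for i in (1, 2, 3, 4)]
parity = full_parity(stripe)
new_stripe, new_parity, ios = read_modify_write(stripe, parity, 2, bytes([9] * 4))
```

The parity produced this way is identical to the parity you would get by recalculating over the whole updated stripe, which is why only two disks have to be touched.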

But potentially a RAID5 set can perform very well when looking at write penalty: if you have an entire stripe to write, you simply calculate the parity and then write the ENTIRE stripe in one swift stroke, without any reads. SANs like EMC’s CLARiiON will actually try to hold a RAID5 write in write cache, hoping to collect all segments of a particular stripe so that a full stripe write can be performed. Very smart indeed 🙂

Knowing all the things above, I figured you could optimize a RAID5 setup with heavy writes by tuning the segment size: If you were writing mostly 64K blocks on a (4+1) RAID5 set, you could optimize these writes by selecting a segment size of 16KB. This way every 64K write is split up over 4 data segments and a parity segment, effectively performing a full-stripe write on the array for each 64KB write.
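The arithmetic behind this tuning can be sketched in a few lines of Python. The helper below is hypothetical and assumes a simple layout in which data segments are numbered consecutively across the data disks. It reports whether a write covers whole stripes and how many data segments it touches.

```python
def segments_touched(offset_kb, length_kb, segment_kb, data_disks):
    """Return (is_full_stripe, data_segments_touched) for one write.

    Assumes length_kb > 0 and a plain striped layout (illustrative only).
    """
    stripe_kb = segment_kb * data_disks
    first = offset_kb // segment_kb
    last = (offset_kb + length_kb - 1) // segment_kb
    touched = last - first + 1
    # A full stripe write must start on a stripe boundary and cover whole stripes.
    full = offset_kb % stripe_kb == 0 and length_kb % stripe_kb == 0
    return full, touched

# 64KB write on a (4+1) set with 16KB segments: full stripe write
tuned = segments_touched(0, 64, 16, 4)
# Same write with 64KB segments: a single segment, so read-modify-write
untuned = segments_touched(0, 64, 64, 4)
# Same "tuned" set, but the write starts 16KB into the stripe
misaligned = segments_touched(16, 64, 16, 4)
```

Note the third case: the tuning only delivers full stripe writes when the 64KB writes are also aligned to the stripe boundary.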

Where it all falls down

This tuning sounds very nice. But I figured out that it almost never helps! The explanation is simple: in a random-write environment (which virtually every VMware environment is), any full-stripe write needs ALL disks in the RAID5 set to seek to a certain segment. The write that follows is very efficient, but every spindle in the set has been tied up for that one write.

Now look at the alternative: let’s say any random write fits in a single segment (for example, use a 64KB segment size; VMware hardly ever writes blocks larger than this). Now the array will perform a “read-modify-write”, which causes the array to first read one segment and its parity (only two spindles seek). After recalculating the parity, two writes occur (the same two spindles seek to the same cylinder; with a little luck the heads are still around!).

This alternative gets more and more effective as you add more members in the RAID5 set, actually BOOSTING write performance over the “tuned” version. The more members in the RAID5 set, the more disks need to seek for each write in the tuned version.
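A rough way to see this effect is to count how many spindles are busy per random write. The Python sketch below is a deliberately simplified model of my own: it ignores caching, and for the read-modify-write case it assumes that concurrent writes happen to land on disjoint disk pairs.

```python
def disks_seeking(data_disks: int, full_stripe: bool) -> int:
    """Spindles that must seek for one random write on an (N+1) RAID5 set."""
    return data_disks + 1 if full_stripe else 2

def concurrent_writes(data_disks: int, full_stripe: bool) -> int:
    """Idealized number of random writes the set can service at the same time."""
    total_disks = data_disks + 1
    return total_disks // disks_seeking(data_disks, full_stripe)

for n in (4, 8, 14):
    print(
        f"({n}+1): tuned full-stripe seeks {disks_seeking(n, True)} disks, "
        f"{concurrent_writes(n, True)} write at a time; "
        f"read-modify-write seeks {disks_seeking(n, False)} disks, "
        f"up to {concurrent_writes(n, False)} writes at a time"
    )
```

In this model the tuned set always handles one random write at a time no matter how many members you add, while the read-modify-write variant scales with the number of disks.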

Is this RAID5 segment-size tuning never a good idea?

In some isolated cases, it might still be effective to tune the segment size of a RAID5 array. Writing gets a real boost if the workload writes sequentially and the array for some reason is not able to combine these writes into full stripe writes by itself (like EMC’s CLARiiON does, see above). Tuning the segment size could “force” the array to perform full stripe writes in this case, boosting performance considerably.

With most modern arrays, the chances of RAID5 tuning being effective are slim at best. Most arrays will not even allow you to adjust the segment size, and even if they do, they are usually able to combine their writes into full stripe writes internally.

While discussing RAID5 anyway: do not forget that apart from the write performance, the performance impact of rebuilding a RAID5 array is usually very high. Some arrays (yes, the EMC CLARiiON for example) can detect disks which are about to fail, and copy their contents off to a hot spare disk before the actual failure occurs. The impact of this action is much smaller, since only one disk in the array performs a full-capacity read, and all other disks can still handle IOPS. Rebuild impact is very much like that of a RAID10 set in this case. But these “smart” rebuilds are not always possible; disks fail, sometimes without notice.

So if you scale a RAID5 set to deliver no more IOPS than the workload requires, a rebuild of that RAID5 set should be counted as downtime: during a rebuild, all surviving disks in the set perform a full read of their ENTIRE capacity, and the disk being rebuilt performs a write of its ENTIRE capacity, heavily impacting performance for the regular workload. And a full rebuild can go on for hours or even days!
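To see how little headroom such a tightly sized set has, it helps to translate the workload into back-end disk I/Os using the write penalties from earlier in this post (2 for RAID10, 4 for RAID5 read-modify-write). The Python sketch below uses the common sizing rule of thumb; the 1000 IOPS and 50% write figures are made-up example values, and the sketch does not model the rebuild itself.

```python
def backend_iops(frontend_iops: float, write_fraction: float, write_penalty: int) -> float:
    """Disk I/Os per second the spindles must deliver for a given workload."""
    reads = frontend_iops * (1 - write_fraction)
    writes = frontend_iops * write_fraction
    return reads + writes * write_penalty

# Hypothetical workload: 1000 front-end IOPS, half of them writes.
raid10_load = backend_iops(1000, 0.5, write_penalty=2)
raid5_load = backend_iops(1000, 0.5, write_penalty=4)
print(f"RAID10 back-end IOPS: {raid10_load:.0f}")
print(f"RAID5 back-end IOPS:  {raid5_load:.0f}")
```

If the spindles can only just deliver the RAID5 figure, every I/O the rebuild adds on top of it comes straight out of the regular workload.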

So when using RAID5, you should consider two important things. First, make sure no heavy writing is going on, or in specific situations, tune the segment size. Second, think about whether or not you need to size for RAID5 rebuild impact.

Posted in VMware

14 Responses to “RAID5 write performance – Revisited”

PiroNet says:

November 5, 2010 at 21:24

It becomes so complex, there are so many variables to consider, and at the same time enterprise-class SANs are so powerful that nowadays people, I mean IT admins/managers, want two things from the storage vendors: that much throughput (IOPS) and that much bandwidth (GB/s), guaranteed in any situation. The rest, what happens behind the scenes, what’s under the hood: not interested!

Rgds,
Didier
- Erik Zandboer says:
  
  November 5, 2010 at 21:55
  
  @PiroNet: Surely a lot of people are not interested at all in the technical details. They just want something that works, at the best price. That of course makes sense from a manager’s point of view, but these blogs are written for the technical people who need to turn those customers’ requirements into an actual design. You have to know the details if you are to scale it right.
  
  From what I mostly see, people tend to aim for capacity, not for IOPS at all. More IOPS in their eyes means more cost for the same amount of storage. They buy Terabytes, that is all that matters right until you explain the possible performance issues…
  
  When virtualizing servers, you might even get away with designing for capacity. But with virtual desktops capacity is much less of an issue, especially with linked cloning technology. If you do not scale it right to get the right performance for the number of desktops you need to run, you’ll hang at some point. If you size for capacity in a virtual desktop environment, I guess you’ll do OK for 150 to perhaps 200 virtual desktops in most scenarios. Above that point, it all slows down. That is why we go to such great lengths to know how storage “ticks” inside…!!
PiroNet says:

November 6, 2010 at 10:28

Erik, don’t get me wrong, I like your blog just because it’s a valuable source of very technical and detailed information, especially for storage… Keep it that way!

Cheers,
Didier
- Erik Zandboer says:
  
  November 6, 2010 at 21:25
  
  PiroNet, thanks for the kind words. A lot of people don’t care about the technical details. My blog is for those who do 🙂
Anthony says:

November 19, 2010 at 05:06

Erik, excellent blog and very helpful in sizing. A question: generally segment sizes are set on a per-LUN basis, and you’re setting the segment size based on the write characteristics even though the majority of the transactions are reads. How do you factor in segment size for reads? Thanks!
- Erik Zandboer says:
  
  November 19, 2010 at 10:14
  
  Hi,
  
  In most environments most actions are reads, although not in all. Think about VDI environments, where I see up to 80% writes on linked clones, and of course some other “odd” loads which can yield high write percentages.
  
  But basically, reads or writes do not really matter. You select segment sizes per LUN or per set of physical disks depending on the array type, but the idea is still the same: the major delaying factor in physical disks is the seek time across the platter. And for both reads and writes your disks need to perform seeks. The fewer disks that have to seek for a read OR a write, the more effective your setup will be (talking about RANDOM loads of course!). So for read optimization you’d need to know/measure read operation sizes as well and design accordingly.
  
  RAID5 (and RAID6 even worse for that matter) in writes has a particular “problem” in recalculating the parity which leads to the read-modify-write behavior. It is very hard to tell what a RAID5 will do when for example two segments are changed in a raid5 of (8+1). Will the array read two segments + parity and write them back, or will it decide to perform a read of all 6 “missing” segments and then perform a full stripe write? What about 4 or 6 segments that need to be “changed”? Very hard to get answers on these questions.
Anthony says:

November 19, 2010 at 17:57

Thanks Erik, I agree and understand your points. I have an Oracle DB that is primarily reads (75% R / 25% W). On writes, the data is staged into cache and flushed based on a fill-percentage algorithm. When it destages or flushes the cache, it will write out to disk based on my segment size. (This is based on R10, not R5; most DBs are running R1 or R10.) However, on my reads of 8K, if the LBA is not residing in cache, and hence there is no cache hit, it must go out to disk and grab that data. My real question is: does it grab the 8K? Or the segment size of 128K? Or even the cache block size setting of 16K? Which one?
- Erik Zandboer says:
  
  November 19, 2010 at 19:19
  
  Hi Anthony,
  
  Normally the array will read at its segment size. So if you configured 128KB as your segment size, chances are the array will read 128K from a single stripe member. That appears to be a problem, since you only require 8K. But do not forget how a disk works. Once the head has moved to the correct cylinder (which takes the most time and thus causes the latency), all sectors are read sequentially after that, with a little luck all from the same cylinder or the one just after it. So there is little time lost.
  
  On the other hand, there is no reason why an array could not read only the sectors it needs; normally a disk is divided into 512-byte sectors anyway (except the newest SATA disks, which use 4KB, and EMC, which uses 520 bytes/sector for some extra checks). It is very hard to get answers on those tiny little details. I do not think the difference could ever really be measured, so that makes these details a little less important.
  
  It is important though to make sure your reads fit into a single segment. If the read size approaches the segment size, be aware of alignment since that becomes more critical then.
Anthony says:

November 19, 2010 at 21:58

Erik, you are very good and yes these are the smallest of details. However, it matters because if I truly have to read segment size into my cache every time I have to fetch 8k, and my Segment size is too large it would fill my cache and cause unnecessary reallocations and flushing. I believe the answer is in the storage vendor code. It will at a minimum grab the 8k obviously, and it may grab the delta between the cache block size and the requested block size. It may grab segment size, but it may be adaptive depending on how much it has to grab over the original requested block.
Thanks for your help and I appreciate you keeping me thinking.
- Erik Zandboer says:
  
  November 20, 2010 at 08:37
  
  Ahh those dirty details! I think an EMC CLARiiON for example uses 64K segment sizes which are fixed (and is almost always the way to go btw), and the entries in cache are also fixed at 64KB. It is a real programming challenge to allow a cache to have different blocksizes inside I guess…. So I think it is not adaptive. Not sure how a raid array that allows different segment sizes per RAID group would solve this… Like you state, the answer is in the code (and the minds of those who wrote that code 🙂 )
Patty says:

October 12, 2012 at 22:07

During a *WRITE*… why can’t a RAID 5 just do a “write data” “write data” “write parity”?

No “read” at all.
- Erik Zandboer says:
  
  October 14, 2012 at 10:51
  
  Actually, RAID5 *CAN* just do writes without any reads. But that requires that all of the segments in the stripe are modified. We call this a “full stripe write”. You can read more about it here: http://www.vmdamentals.com/?p=3200
  
  This does not work when you just need to write to a subset of the segments in a stripe: because one of the stripe members carries parity information, you will have to recalculate (and rewrite!) the parity information after such a write. But in order to recalculate the parity, you need to know the old data as well as the old parity. That in turn means you need to read both before you overwrite them with the new data.
rehan says:

December 28, 2012 at 07:02

Greetings Erik!
for a RAID-5 write to say, a single segment, why would I need to read old parity information?
I can simply read the old data segments, XOR them with the new data segment, write the new data segment, and write the new parity.
would that not work?
I mean, to recalculate parity, why do i need the old parity information? Should not the old data suffice?
thanks.
- Erik Zandboer says:
  
  December 28, 2012 at 09:53
  
  Hi Rehan,
  
  What you describe here would work from a PARITY point of view: The parity would be correct in this case for the stripe. But look at what you just did: You XORed and wrote your new data. You just destroyed your data segment by XORing it against the parity 🙂
  
  Because you need to write the data segment in an unmodified way, you need to modify the parity, not the data segment. That is why you need 4 disk I/O’s (RAID5) or even 6 (RAID6).