Throughput part 2: RAID types and segment sizes

January 18th, 2010

In part one I covered the delays and latencies you encounter on physical disk drives and solid state drives. Now it is time to see how we can string together multiple drives in order to get the performance and storage space we actually require. I’ll discuss RAID types, the number of disks in a RAID set, segment sizes to optimize your storage for particular needs, and so on.

–> For those of you who haven’t read part 1 yet: Throughput Part1: The Basics

A short intro to RAID types

Now, finally, on to stringing disks together. More disks means more space and more performance, right? Yes, right. Sometimes. I will not zoom in too deeply on the RAID types; I assume you have some knowledge of the different types, mainly RAID1, RAID10 and RAID5. All that I’ll say about them:

RAID1 = Putting two disks together, writing the same data to both, and reading half of the data from each. This is an effective way to cope with disk failure, while boosting read performance;
RAID10 = Taking two or more RAID1 sets, and striping them together;
RAID5 = Here it suffices to look at RAID5 like the almost forgotten RAID3: simply striping two or more disks together, and adding another disk to hold the parity of all other members in the stripe. (In reality, RAID5 balances the parity data across all disks in the set instead of having one dedicated parity disk as in RAID3.)

So what about read and write performance? Two disks have double the performance and double the size? Well actually, no. Each RAID type has its advantages, and also its disadvantages. There are three main advantages for all these RAID types though:

They increase the amount of storage;
They increase the amount of IOPS your disk group can perform;
They protect you from a single failing disk (for RAID10 even more than one failing disk in some cases).

So when to choose what? Now it comes down to the (dis)advantages of each RAID type, where basically it is always between RAID10 and RAID5 (with RAID1 being a sort of 2 disk version of a RAID10 group):

RAID10

In RAID10, you basically lose 50% of your storage space, because each disk is mirrored. On the other hand you gain a lot of performance (each disk is used for performing reads; each mirror-set inside the RAID10 is used for performing writes). Especially for disk groups which need to perform a lot of writes, RAID10 is an absolute winner over RAID5 (see RAID5 below). Also, in case of a disk failure, RAID10 will rebuild quickly and, perhaps even more important, with very little impact on overall performance during the rebuild. This is because all that needs to be done is to copy the data from a single disk to its (replaced or hot-standby) mirror.

RAID5

RAID5 is another beast. RAID5 gets more efficient in storage size as you use more disks in the RAID group. This is because in every stripe, all disks but one hold a data segment and only one disk holds a parity segment. In RAID5, these parity segments are put on a different disk in the RAID group for each stripe. This is because the parity has to be updated on every write, and this way the load of the parity segments is divided over all members of the RAID5 group (in RAID3, one physical disk is used for all parity, which can limit write performance).
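To put some numbers on that storage efficiency, here is a minimal Python sketch. The function names are made up for this illustration, and it simply assumes equally sized disks, one parity segment per RAID5 stripe and a full mirror for RAID10.

```python
def usable_fraction_raid10() -> float:
    """Every disk is mirrored, so half of the raw space is usable."""
    return 0.5


def usable_fraction_raid5(total_disks: int) -> float:
    """One segment per stripe holds parity, so (N-1)/N of the raw space is usable."""
    if total_disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (total_disks - 1) / total_disks


for disks in (3, 5, 9, 15):
    print(f"RAID5 with {disks} disks: {usable_fraction_raid5(disks):.0%} usable, "
          f"RAID10: {usable_fraction_raid10():.0%} usable")
```

The RAID5 fraction creeps towards 100% as the group grows, while RAID10 stays at 50% no matter how many disks you add.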

Rebuilding a RAID5 array is somewhat difficult: if a disk fails and is replaced, the rebuild has to read data from all other members in each full stripe. Depending on which segment the failed disk held in a given stripe, this returns either all data segments (the parity is missing), or all data segments but one, plus the parity. In both cases, the missing segment can be calculated, and is then written back to the replaced disk in the RAID group. This is very disk intensive; all disks in the stripe have to read ALL their data in order to perform a rebuild. And worse, the rebuild time gets larger as the number of members in the RAID5 group grows.
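The "calculating the remaining segment" part is easier to see in code. RAID5 parity is an XOR over the segments in a stripe, so XOR-ing everything that survived gives you back the missing segment. Below is a small sketch for a single stripe; the helper name xor_blocks and the tiny 4-byte segments are just assumptions for the example, not how any particular controller does it.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of a (4+1) RAID5: four data segments and their parity.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# The disk holding data[2] fails. Read every surviving member of the stripe...
survivors = data[:2] + data[3:] + [parity]

# ...and XOR them together to recalculate the lost segment.
rebuilt = xor_blocks(survivors)
print(rebuilt == data[2])
```

Note that every surviving disk had to be read to rebuild just this one segment, which is exactly why a rebuild hits all members of the group.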

Write penalties

For most RAID types (in fact all data-protecting RAID types), read performance is not equal to write performance. In RAID10 things are relatively simple: when reading, all disks in the group are used; when writing, in effect only half of the disks are used, because each write has to go to both disks that make up a mirror set. So a write penalty of 1:2, you might say (for 1 write you need 2 IOPS).

RAID5 write penalties are more complicated. It is basically a stripe of disks (like RAID0), but the added parity data is kind of a deal breaker. Depending on what you need to write, there is different behaviour. Let’s say you want to write to only a single segment. In that case you could perform a read of that segment and a read of the parity of that particular stripe, then recalculate the parity data, then write both the segment and the parity. Overall this hits you for 4 IOPS for a single segment write. You can imagine this being a problem in a random IO pattern environment (where total seek time is the “killer” for your performance). The write penalty is 1:4 in that case (for one write you need 4 IOPS).
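The single segment write described above can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, the stripe is an in-memory list rather than real disks, and the I/O counter just tallies the four steps from the text.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized segments."""
    return bytes(x ^ y for x, y in zip(a, b))


def single_segment_write(stripe, parity, index, new_data):
    """Overwrite one data segment and update the parity, counting disk I/Os."""
    ios = {"reads": 0, "writes": 0}

    old_data = stripe[index]      # read 1: the segment being replaced
    ios["reads"] += 1
    old_parity = parity           # read 2: the parity segment of this stripe
    ios["reads"] += 1

    # Take the old data out of the parity, then add the new data.
    new_parity = xor(xor(old_parity, old_data), new_data)

    stripe[index] = new_data      # write 1: the new data segment
    ios["writes"] += 1
    ios["writes"] += 1            # write 2: the new parity segment
    return new_parity, ios


stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = b"\x00" * 4
for segment in stripe:
    parity = xor(parity, segment)

parity, ios = single_segment_write(stripe, parity, 1, b"ZZZZ")
print(ios, "total:", ios["reads"] + ios["writes"])
```

The other members of the stripe are never touched, so the cost stays at 2 reads plus 2 writes regardless of how wide the RAID5 group is.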

It gets even stranger if you write a “partial” stripe. In this case you basically have to read the entire stripe minus the segments to write, perform a calculation of the parity, then write the changed segments and the parity. You could calculate the write penalty for this, but it is kind of complicated:

If n = number of data members in the stripe (not counting parity) and m = number of segments to modify,

then

Number of IOPS needed = ( (n-m) + 1 parity ) ROPS + ( m + 1 parity ) WOPS.

Example 1: If you have a RAID5 of (8+1) disks (8 in the stripe and one parity), and you have 2 segments to modify, you’d need 8-2+1 = 7 ROPS followed by 2+1=3 WOPS. So for 2 writes you’d need 7+3=10 IOPS, or a write penalty of 1:5.

Example 2: If you have a RAID5 of (8+1) disks again, but now you have 6 segments to modify, you’d need 8-6+1 = 3 ROPS followed by 6+1=7 WOPS. So for 6 writes you now need 10 IOPS, or a write penalty of 6:10, or 1:1.67.

In the same (8+1) setup, a full stripe write would require only 9 IOPS (all of them WOPS) for 8 segments, so a penalty of only 8:9, or 1:1.125!
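If you want to play with these numbers yourself, the calculation fits in a small Python sketch. It follows the model used in the two examples: (n-m)+1 reads and m+1 writes for a partial stripe. The full stripe is treated as a special case without any reads, as described above. The function names are made up for this illustration.

```python
def partial_stripe_ios(n: int, m: int):
    """Return (reads, writes) for modifying m of the n data segments in a stripe."""
    if not 1 <= m <= n:
        raise ValueError("m must be between 1 and n")
    if m == n:
        reads = 0                 # full stripe write: nothing to read first
    else:
        reads = (n - m) + 1       # untouched segments plus the parity
    writes = m + 1                # modified segments plus the new parity
    return reads, writes


def write_penalty(n: int, m: int) -> float:
    """Disk I/Os needed per segment the application actually wanted to write."""
    reads, writes = partial_stripe_ios(n, m)
    return (reads + writes) / m


for m in (2, 6, 8):
    reads, writes = partial_stripe_ios(8, m)
    print(f"(8+1) RAID5, {m} segments: {reads} ROPS + {writes} WOPS, "
          f"penalty 1:{write_penalty(8, m):.3f}")
```

The single segment case is left out on purpose: there the cheaper read-modify-write of 4 IOPS described earlier applies.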

So the most effective way of writing to a RAID5 array is to write one entire stripe at once. In this case nothing needs to be read first: every data disk simply performs one write, and one disk writes the (calculated) parity segment. A lot of SAN vendors (EMC, for example) try to hold written segments in their write cache, and flush only when they have a full stripe to write out to disk.

Write penalties: what it comes down to

Looking at the previous calculations, you could extract the following:

For random writes, RAID10 gives a static write penalty of only 1:2;
For random writes on RAID5, where the cache of the SAN cannot fill up to a full stripe write, the write penalty lies somewhere around 1:4 or slightly better;
For sequential writes, RAID10 still gives you a write penalty of 1:2, but RAID5 (when using write cache to build full stripes) gives you a write penalty of only 1:1.125 in the (8+1) example!

So the statement that RAID10 gives you better write performance is not always true…!

Sequential, Random, segment size??

So now we have got a clear view on segments to write, to read, RAID types, and latencies on physical disks. If we put all of this stuff together, we can actually optimize storage using RAID types and segment sizes together!

From here on there are so many parameters, that it becomes kind of hard to write a “one for all” best practice. I think it is better to put some examples to the test, and find out what would work best:

Example one: Database doing heavy writes

Let’s assume we have a very busy database, and we want to put this database on a number of disks we reserve especially for this database. Let’s say the blocksize the database writes in is 8 KBytes, and the pattern is random IO (which is almost always the case in a database environment).

In random IO patterns, the main delay is the head seek time and the rotational latency. So in this case you’d want to use as few disks as possible for a write (because each disk that is touched by the write needs to perform the seek). So a segment size of 8Kbytes would probably be optimal in this scenario.
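To see how the segment size decides how many disks have to seek for one write, here is a rough Python sketch. It assumes a plain striped layout where consecutive segments land on consecutive data disks, and it ignores parity; the function name and the KB-based units are choices made for this example only.

```python
def disks_touched(offset_kb: int, size_kb: int, segment_kb: int, data_disks: int) -> int:
    """Number of data disks that a single I/O of size_kb at offset_kb lands on."""
    first_segment = offset_kb // segment_kb
    last_segment = (offset_kb + size_kb - 1) // segment_kb
    segments = last_segment - first_segment + 1
    # An I/O larger than a full stripe wraps around; it cannot touch more disks than exist.
    return min(segments, data_disks)


print(disks_touched(0, 8, 8, 4))   # aligned 8KB write, 8KB segments
print(disks_touched(0, 8, 4, 4))   # same write, 4KB segments
print(disks_touched(4, 8, 8, 4))   # misaligned 8KB write, 8KB segments
```

With segments smaller than the database block, every write makes two or more disks seek. The last call shows that alignment matters as well: an 8KB write that starts halfway into a segment still straddles two disks.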

When looking at the write penalty, the fully random IO pattern would mean that a RAID5 group would probably encounter a write penalty of 1:4, where a RAID10 group would only encounter a penalty of 1:2. This means you would require fewer disks with RAID10 than with RAID5 to obtain the performance you need.

Finally, disk size comes to mind. If the database is relatively small, RAID10 would probably be cheaper (!!) than RAID5. Why? Well, let’s assume you use 15K SAS disks, which perform about 200 IOPS each, and your database requires 600 ROPS and 600 WOPS simultaneously, and is only 500Mbytes in size. The requirements then work out as follows:

RAID10 would need:

For performance: 600 ROPS / 200 = 3 spindles for reading, and 600 WOPS / 200 * 2 = 6 spindles for writing. This sums up to 3+6=9 spindles. RAID10 needs an even number of disks, so 10 disks in RAID10 would suffice. Using 146GB disks, you would get 5*146 GB = 730 GB of net space, which is more than enough for this database.

RAID5 would need:

For performance, 600 ROPS /200 = 3 spindles for reading, and 600 WOPS / 200 * 4 = 12 spindles for writing. This sums up to 3+12 = 15 spindles in total, so you’d require a (14+1) RAID5 set, which personally I would never do (rebuild time and impact!).

So in this example, RAID10 would actually save you 5 disks (and gain a much better rebuild time with much less impact). RAID10 would be THE way to go here.
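The spindle count from this example is easy to reproduce. The sketch below uses the numbers from the text (200 IOPS per disk, 600 ROPS, 600 WOPS, penalties of 2 and 4); the function name is invented for the illustration, and rounding RAID10 up to an even number of disks is an assumption based on the disks being used in mirrored pairs.

```python
import math


def spindles_needed(rops: int, wops: int, iops_per_disk: int, write_penalty: int) -> int:
    """Spindles needed to serve the reads plus the penalty-weighted writes."""
    backend_iops = rops + wops * write_penalty
    return math.ceil(backend_iops / iops_per_disk)


raid10_spindles = spindles_needed(600, 600, 200, write_penalty=2)
raid10_disks = raid10_spindles + raid10_spindles % 2   # mirrors come in pairs
raid5_disks = spindles_needed(600, 600, 200, write_penalty=4)

print("RAID10:", raid10_disks, "disks")
print("RAID5: ", raid5_disks, "disks")
print("Saved: ", raid5_disks - raid10_disks, "disks")
```

Change the read/write ratio and you will see the gap between the two shrink quickly as the workload becomes more read-heavy.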

Example two: sequential writing video application

In example two, we have a video application which needs to stream a lot of data to a RAID group. Again we reserve a set of disks for this application (so the stream will remain purely sequential). Let’s say we need to save a stream of 100MB/sec to 7200rpm SATA spindles, 100% writes in blocks of 1Mbyte (larger blocks help of course in a sequential write scenario).

As we have seen, RAID5 can be very effective when writing full stripes. So if we create a RAID5 array out of (4+1) disks, and we make the segment size 1Mbyte/4 = 256Kbytes, each WOP from the video app would exactly fill up one entire stripe. At 100MB/sec, each data disk would have to write at 25MB/sec, and the disk holding the parity segment would write at 25MB/sec as well. Because we need to write 100MB every second, the application issues 100 WOPS (given that the blocksize is 1MB). Each of these turns into 4 data segment writes plus 1 parity segment write, so the write penalty of a (4+1) RAID5 is 1:(5/4), or 1:1.25: the 400 data segment writes per second become 500 segment writes per second on disk.

So each disk would have to perform 500 / 5 = 100 writes of 256Kbytes per second, which is 25MB/sec. This is something a 7200rpm SATA drive should be able to support for a sequential stream. Because we are using RAID5 in this example, we would end up with 5 drives. If we used 1TB SATA drives, we would have 4TB of net storage.
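The sizing of this video example can be captured in a small Python sketch. It only covers the parts that follow directly from the full-stripe idea: the segment size, the throughput per data disk and the full-stripe write penalty. The function name and its parameters are invented for the illustration.

```python
def full_stripe_layout(data_disks: int, block_kb: int, stream_mb_s: float):
    """Size a RAID5 set so that one application write fills exactly one stripe."""
    segment_kb = block_kb / data_disks            # each data disk gets one segment per write
    per_disk_mb_s = stream_mb_s / data_disks      # the stream is spread over the data disks
    penalty = (data_disks + 1) / data_disks       # one extra parity segment per stripe
    return segment_kb, per_disk_mb_s, penalty


segment_kb, per_disk_mb_s, penalty = full_stripe_layout(4, 1024, 100)
print(f"segment size: {segment_kb:.0f} KB")
print(f"throughput per data disk: {per_disk_mb_s:.0f} MB/s")
print(f"full-stripe write penalty: 1:{penalty}")
```

Trying other widths, such as (8+1) with 128KB segments, shows the trade-off: a lower penalty and less throughput per disk, against a larger group with longer rebuilds.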

Conclusion

From the previous part and this part it becomes clear, that there is never a “best” solution given all environments. Sometimes RAID5 is better because you end up with more storage space than when you’d be using RAID10, sometimes RAID10 is better for its simple and relatively low write penalty for random IO loads.

Some things to consider in ANY case are the segment size and the RAID type you choose for your storage. Also, never forget the impact of a rebuild in case of a disk failure. Especially the larger stripes in RAID5 can have enormous rebuild times (several days is no exception). During these rebuilds there is a significant impact on performance. Most SAN vendors allow you to set the priority of the rebuild relative to your regular data flow, but this still leaves you with a dilemma: a short rebuild time with heavy user impact, or a very low priority rebuild that does not impact users too much but leaves you with a rebuild going on for days. Do not forget that within a RAID5 group the disks are often the same make and model, and they are all just as old. Combined with the heavier load during a rebuild, this creates a serious threat: chances are you lose a second disk during the rebuild, which will basically destroy all your data on the RAID group.

Also very important is getting to know your data. The IO pattern behaviour, as well as the read/write ratios must be taken into account when designing your storage.

Finally, for everyone using their storage in a VMware environment: Mostly your IO pattern will be random (unless you reserve disk groups for special functions). In this case, at least consider RAID10. It is a better overall performer for writes, and impact of a rebuild is much smaller compared to the popular RAID5.


Jeff Goldschrafe says:

August 23, 2011 at 21:45

I noticed that your calculation of write IOPS is wrong for parity writes. When calculating the new parity for a stripe, a RAID-5 implementation can do one of two things:

1. Take the parity of all blocks and XOR them together. This is fastest if you are overwriting a majority of blocks in the stripe.
2. Take the existing parity block, the values of the old blocks, and the values of the new blocks, and XOR them together. This is fastest if you are overwriting a minority of blocks in the stripe.

This is confirmed at least by a Xyratex document:
http://www.xyratex.com/pdfs/whitepapers/Xyratex_White_Paper_RAID_Chunk_Size_1-0.pdf

“Read old data from target disk for new data: Reading only the data in the location that is about to be written to eliminates the need to read all the other disks. The number of steps involved in the read-modify-write operation is the same regardless of the number of disks in the array.”

Daniel Myers says:

January 27, 2010 at 14:38

Hi,
Thanks for taking the time to write such a good series of articles. I feel like my brain has been expanded.
Not sure if it’s being late in the evening or not, but I think there was a slight error “or a write penalty of 6:10 or 1:2,5”.

Thanks again
Dan

Erik Zandboer says:

January 27, 2010 at 18:26

Oops, brain implosion 😉 Thanks for the feedback, I’ll change it quickly so no one will notice 😉

Kerry Cage says:

February 16, 2011 at 02:44

Great stuff!
One question; In example 2, it looks like each of the 5 drives in the 4+1 would see 100IOPs. 25MB/Sec, 256K segments = 100IOPs. What am I missing?
Thanks,

Erik Zandboer says:

February 16, 2011 at 11:22

Hi,

Each drive would see 100 writes per second, 256KB per write. So 100*256KB = 25600KB/s, or 25MB/s. Because there are 4 spindles recording data (not 5; the 5th is for parity only), the total write throughput is 4*25MB/s = 100MB/s.

Is this what you mean (or missed?)

VMdamentals.com

14 Responses to “Throughput part 2: RAID types and segment sizes”
