In part one I covered the delays and latencies you encounter on physical disk drives and solid state drives. Now it is time to see how we can string together multiple drives in order to get the performance and storage space we actually require. I'll discuss RAID types, the number of disks in such a RAID set, segment sizes to optimize your storage for particular needs, and so on.
–> For those of you who haven’t read part 1 yet: Throughput Part 1: The Basics
A short intro to RAID types
Now finally it is on to the stringing together of disks. More disks means more space and more performance, right? Yes – sometimes. I am not zooming in too deep on the RAID types; I assume you have some knowledge of the different types, mainly RAID1, RAID10 and RAID5. All that I’ll say about them:
- RAID1 = putting two disks together, write the same data to both, read half of the data from each. This is an effective way to cope with disk failure, while boosting performance;
- RAID10 = Taking two or more RAID1 sets, and striping them together;
- RAID5 = Here it suffices to look at RAID5 like the almost forgotten RAID3: Simply striping two or more disks together, and add another disk to hold the parity of all other members in the stripe. (In reality in RAID5, the parity data is balanced across all disks in the RAID5 set instead of having one dedicated parity disk as in RAID3).
So what about read and write performance? Two disks have double the performance and double the size? Well actually, no. Each RAID type has its advantages, and also its disadvantages. There are three main advantages for all these RAID types though:
- They increase the amount of storage;
- They increase the amount of IOPS your disk group can perform;
- They protect you from a single failing disk (for RAID10 even more than one failing disk in some cases)
So when to choose what? Now it comes down to the (dis)advantages of each RAID type, where basically it is always between RAID10 and RAID5 (with RAID1 being a sort of 2 disk version of a RAID10 group):
In RAID10, you basically lose 50% of your storage space, because each disk is mirrored. On the other hand you gain a lot of performance (each disk is used for performing reads; each mirror-set inside the RAID10 is used for performing writes). Especially for disk groups that need to perform a lot of writes, RAID10 is an absolute winner over RAID5 (see RAID5 below). Also, in case of a disk failure, RAID10 will rebuild quickly and, even more important perhaps, with very little impact on overall performance during the rebuild. This is because all that needs to be done is to copy the data from a single disk to its (replaced or hot-standby) mirror.
RAID5 is another beast. RAID5 gets more efficient in storage size as you use more disks in the RAID group. This is because when you put data on a RAID5 group, all disks are used for data storage (in a stripe), and only one disk is used to put a parity segment on. In RAID5, these parity segments are put on a different disk in the RAID-group for each stripe. This is because you need the parity data more often when writing, and this way the load of the parity segment is divided over all members of the RAID5 group (in RAID3, one physical disk is used for all parity which can cause writing performance limitations).
Rebuilding a RAID5 array is somewhat more difficult: if a disk fails and is replaced, the rebuild has to read data from all other members in each full stripe. This returns either all data minus parity, or all data minus one segment plus parity. In both cases, the remaining segment can be calculated, and is then written back to the replaced disk in the RAID group. This is very disk intensive; all disks in the stripe have to read ALL their data in order to perform a rebuild. And worse, the rebuild time grows as the number of members in the RAID5 group grows.
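The parity trick behind this rebuild can be illustrated with a few lines of Python. This is a toy sketch (hypothetical 4-byte segments, not real disk IO) showing that the parity segment is simply the byte-wise XOR of all data segments, and that any single lost segment can be rebuilt from the survivors plus the parity:

```python
def xor_parity(segments):
    """Byte-wise XOR of a list of equally sized segments."""
    parity = bytearray(len(segments[0]))
    for seg in segments:
        for i, byte in enumerate(seg):
            parity[i] ^= byte
    return bytes(parity)

# One stripe of four toy data segments (a real segment would be e.g. 64KB).
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40",
          b"\xaa\xbb\xcc\xdd", b"\x0f\x0e\x0d\x0c"]
parity = xor_parity(stripe)

# "Disk 2" fails: rebuild its segment by XOR-ing the survivors with the parity.
survivors = [seg for i, seg in enumerate(stripe) if i != 2]
rebuilt = xor_parity(survivors + [parity])
assert rebuilt == stripe[2]  # the lost segment is fully recovered
```

Note how the rebuild has to touch every surviving member of the stripe – which is exactly why a RAID5 rebuild is so IO-intensive.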
For most RAID types (in fact all data-protecting RAID types), read performance is not equal to write performance. In RAID10 things are relatively simple: when reading, all disks in the group are used; when writing, in effect only half of the disks are used (this is because each write has to go to both disks that make up a mirror set). So a write penalty of 1:2 you might say (for 1 write you need 2 IOPS).
RAID5 write penalties are more complicated. It is basically a stripe of disks (like RAID0), but the added parity data is kind of a deal breaker. Depending on what you need to write, there is different behaviour. Let’s say you want to write only a single segment. In that case you could perform a read of that segment and a read of the parity of that particular stripe, then recalculate the parity data, then write both the segment and the parity. Overall this hits you for 4 IOPS for a single segment write. You can imagine this being a problem in a random IO pattern environment (where total seek time is the “killer” for your performance). The write penalty is 1:4 in that case (for one write you need 4 IOPS).
It gets even stranger if you write a “partial” stripe. In this case you basically have to read the entire stripe minus the segments to write, perform a calculation of the parity, then write the changed segments and the parity. You could calculate the write penalty for this, but it is kind of complicated:
if n = number of members in the stripe, m = number of segments to modify,
Number of IOPS needed = ( (n-m) + 1 parity ROPS) + ( (m+1) WOPS).
Example 1: If you have a RAID5 of (8+1) disks (8 in the stripe and one parity), and you have 2 segments to modify, you’d need 8-2+1 = 7 ROPS followed by 2+1=3 WOPS. So for 2 writes you’d need 7+3=10 IOPS, or a write penalty of 1:5.
Example 2: If you have a RAID5 of (8+1) disks again, but now you have 6 segments to modify, you’d need 8-6+1 = 3 ROPS followed by 6+1=7 WOPS. So for 6 writes you now need 10 IOPS, or a write penalty of 6:10 or 1:1,66
In the previous case, a full stripe write would require only 9 IOPS (in this case WOPS) for 8 segments, so a penalty of only 8:9, or 1:1,125 !
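The calculation above can be wrapped in a small Python helper. This is a sketch of the formula as stated in the text (n data members per stripe, m segments modified; the full-stripe case needs no reads at all); the function name is my own:

```python
def raid5_write_iops(n, m):
    """Disk operations needed to modify m of the n data segments in one stripe."""
    if m == n:                 # full-stripe write: no reads needed,
        return (0, n + 1)      # just write n data segments plus the new parity
    rops = (n - m) + 1         # read the untouched segments plus the parity
    wops = m + 1               # write the modified segments plus the new parity
    return (rops, wops)

# Example 1 from the text: (8+1) RAID5, 2 segments modified
print(raid5_write_iops(8, 2))   # (7, 3): 10 IOPS total, penalty 1:5
# Example 2: 6 segments modified, also 10 IOPS total
print(raid5_write_iops(8, 6))   # (3, 7): penalty 1:1,66
# Full-stripe write: 9 WOPS for 8 segments, penalty only 1:1,125
print(raid5_write_iops(8, 8))   # (0, 9)
```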
So the most effective way of writing to a RAID5 array, is to write one entire stripe at once. In this case, all disks can be used to simply perform a write operation, and one (calculated) parity segment. A lot of SAN vendors (like EMC for example), try to maintain written segments inside their write cache, and flush only when they have a full stripe to write out to disk.
Write penalties: what it comes down to
Looking at the previous calculations, you could extract the following:
- For random writes, RAID10 gives a static write penalty of only 1:2;
- For random writes where the cache of the SAN cannot fill up to a full stripe write, the write penalty lies somewhere at 1:4 or slightly better;
- For sequential writes, RAID10 still gives you a write penalty of 1:2, but RAID5 (when using write cache) gives you only a write penalty of 1:1,125!
So the statement that RAID10 gives you better write performance is not always true…!
Sequential, Random, segment size??
So now we have got a clear view on segments to write, to read, RAID types, and latencies on physical disks. If we put all of this stuff together, we can actually optimize storage using RAID types and segment sizes together!
From here on there are so many parameters, that it becomes kind of hard to write a “one for all” best practice. I think it is better to put some examples to the test, and find out what would work best:
Example one: Database doing heavy writes
Let’s assume we have a very busy database, and we want to put this database on a number of disks we reserve especially for this database. Let’s say the blocksize the database writes in is 8 KBytes, and the pattern is random IO (which is almost always the case in a database environment).
In random IO patterns, the main delay is the head seek time and the rotational latency. So in this case you’d want to use as few disks as possible per write (because each disk that is touched by the write needs to perform a seek). So a segment size of 8Kbytes would probably be optimal in this scenario.
When looking at the write penalty, the fully random IO pattern would mean that a RAID5 group would probably encounter a write penalty of 1:4, where a RAID10 group would only encounter a penalty of 1:2. This means you would require fewer disks to obtain the performance you need than with RAID5.
Finally, disk size comes to mind. If the database is relatively small, RAID10 would probably be cheaper (!!) than RAID5. Why? Well, let’s assume you use 15K SAS disks, which perform about 200 IOPS each, and your database requires 600 ROPS and 600 WOPS simultaneously, and is only 500Mbytes in size. You then would need:
RAID10 would need:
For performance: 600 ROPS / 200 = 3 spindles for reading, and 600 WOPS / 200 * 2 = 6 spindles for writing. This sums up to 3+6=9 spindles; since RAID10 needs an even number of disks, 10 disks would suffice. Using 146GB disks, you would get 5*146 GB = 730 GB of net storage, which is plenty.
RAID5 would need:
For performance, 600 ROPS /200 = 3 spindles for reading, and 600 WOPS / 200 * 4 = 12 spindles for writing. This sums up to 3+12 = 15 spindles in total, so you’d require a (14+1) RAID5 set, which personally I would never do (rebuild time and impact!).
So in this example, RAID10 would actually save you 5 disks (and gain a much better rebuild time with much less impact). RAID10 would be THE way to go here.
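The spindle count above comes down to a quick back-of-the-envelope calculation, sketched here in Python (the 200 IOPS per 15K SAS spindle and the helper name are assumptions taken from the example):

```python
import math

DISK_IOPS = 200  # rough figure for a 15K SAS spindle (from the example)

def spindles_needed(rops, wops, write_penalty):
    """Spindles required to serve a given read/write load at a given write penalty."""
    read_disks = math.ceil(rops / DISK_IOPS)
    write_disks = math.ceil(wops * write_penalty / DISK_IOPS)
    return read_disks + write_disks

print(spindles_needed(600, 600, 2))  # RAID10: 3 + 6 = 9, rounded up to 10 disks
print(spindles_needed(600, 600, 4))  # RAID5:  3 + 12 = 15, a (14+1) set
```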
Example two: sequential writing video application
In example two, we have a video application which needs to stream a lot of data to a RAID group. Again we reserve a set of disks for this application (so the stream will remain purely sequential). Let’s say we need to save a stream of 100MB/sec to 7200rpm SATA spindles, 100% writes in blocks of 1Mbyte (larger blocks help of course in a sequential write scenario).
As we have seen, RAID5 can be very effective when writing full stripes. So if we create a RAID5 array out of (4+1) disks, and we make the segment size 1Mbyte/4 = 256Kbytes, each WOP from the video app would exactly fill up one entire stripe. The write penalty of a (4+1) RAID5 doing full-stripe writes is 1:(5/4), or 1:1.25: to store 100MB/sec of data, the array actually has to write 100 * 1.25 = 125MB/sec (data plus parity).
Spread over the five disks, each disk would have to write 125 / 5 = 25MB/sec, or 100 segment writes of 256Kbytes per second. This is something a 7200rpm SATA drive should be able to sustain for sequential IO. Because we are using RAID5 in this example, we would end up with 5 drives. If we used 1TB SATA drives, we would have 4TB of net storage.
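The arithmetic for this sequential scenario can be summed up in a few lines of Python (just a sketch of the numbers above, nothing more):

```python
# (4+1) RAID5, 1MB host writes, full-stripe writes only.
stream_mb_s = 100
data_disks, parity_disks = 4, 1
segment_kb = 1024 // data_disks               # 1MB block / 4 data disks = 256KB

# Including parity, the array writes 5/4 of the incoming data rate.
total_mb_s = stream_mb_s * (data_disks + parity_disks) / data_disks  # 125 MB/sec
per_disk_mb_s = total_mb_s / (data_disks + parity_disks)             # 25 MB/sec per disk
per_disk_wops = per_disk_mb_s * 1024 / segment_kb                    # segment writes/sec

print(segment_kb, per_disk_mb_s, per_disk_wops)  # 256 25.0 100.0
```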
From the previous part and this part it becomes clear that there is never one “best” solution for all environments. Sometimes RAID5 is better because you end up with more storage space than when using RAID10, sometimes RAID10 is better for its simple and relatively low write penalty for random IO loads.
Some things to consider in ANY case are the segment size and the RAID type you choose for your storage. Also, never forget the impact of a rebuild in case of a disk failure. Especially the larger stripes in RAID5 can have enormous rebuild times (several days is no exception), and during these rebuilds there is a significant impact on performance. Most SAN vendors allow you to prioritize your regular data flow against the rebuild, but this still leaves you with a dilemma: a short rebuild time with heavy user impact, or a very low priority rebuild that does not impact users too much but leaves you with a rebuild going on for days. Do not forget that within a RAID5 group the disks are often the same make and model, and they are all just as old. Combined with the heavier load during a rebuild this creates a serious threat: chances are you lose a second disk during the rebuild, which will basically destroy all data on the RAID group.
Also very important is getting to know your data. The IO pattern behaviour, as well as the read/write ratios must be taken into account when designing your storage.
Finally, for everyone using their storage in a VMware environment: Mostly your IO pattern will be random (unless you reserve disk groups for special functions). In this case, at least consider RAID10. It is a better overall performer for writes, and impact of a rebuild is much smaller compared to the popular RAID5.