Archive for March, 2009

VMware HA, slotsizes and constraints – by Erik Zandboer

There is always a lot of talk about VMware HA, and how it uses slotsizes to determine host failover capacity. But nobody seems to know exactly how VMware does this internally. By running a few tests (on VMware ESX 3.5U3 build 143128), I hoped to unravel how it works.

 

Why do we need slotsizes

In order to estimate how many VMs can be run on how many hosts, VMware HA does a series of simple calculations to “guesstimate” the possible workload given a number of ESX hosts in a VMware cluster. Using these slotsizes, HA determines two things: whether more VMs can be started on a host in case one (or more) other hosts in the cluster fail (only applicable if HA runs in the “Prevent VMs from being powered on if they violate constraints” mode), and how big the failover capacity is (basically, how many ESX hosts in a cluster can fail while maintaining performance on all running and restarted VMs).

 

Slotsizes and how they are calculated

The slotsize is basically the size of a default VM, in terms of used memory and CPU. You could think of a thousand smart algorithms in order to determine a worst-case, best-case or somewhere-in-between slot size for any given environment. VMware HA however does some pretty basic calculations on this. As far as I have figured it out, here it comes:

Looking through all running (!!) VMs, find the VM with the highest CPU reservation, and find the VM with the highest memory reservation (actually the reservation plus the memory overhead of the VM). These worst-case numbers are used to determine the HA slotsize.

A special case in this calculation is when reservations are set to zero for some or all VMs. In this case, HA uses its default settings for such a VM: 256MB of memory and 256MHz of processing power. You can actually change these defaults, by specifying these variables in the HA advanced settings:

          das.vmCpuMinMHz
          das.vmMemoryMinMB
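
The slotsize rule described above can be sketched in a few lines of Python. This is only a model of the behaviour observed in these tests, not VMware's implementation: the `VM` class, its field names and the `slot_size` function are made up for illustration, and the sketch assumes that only a reservation of exactly zero falls back to the default value.

```python
from dataclasses import dataclass

# Defaults reported in this post for VMs without reservations.
DEFAULT_CPU_MHZ = 256   # can be overridden with das.vmCpuMinMHz
DEFAULT_MEM_MB = 256    # can be overridden with das.vmMemoryMinMB


@dataclass
class VM:
    """Illustrative VM record; the field names are hypothetical."""
    name: str
    cpu_reservation_mhz: int = 0
    mem_reservation_mb: int = 0
    mem_overhead_mb: int = 0
    powered_on: bool = True


def slot_size(vms, cpu_min_mhz=DEFAULT_CPU_MHZ, mem_min_mb=DEFAULT_MEM_MB):
    """Return (cpu_mhz, mem_mb) for one slot: the worst case over running VMs."""
    running = [vm for vm in vms if vm.powered_on]
    # Assumption: a reservation of zero is replaced by the configured default.
    cpu = max(
        (vm.cpu_reservation_mhz or cpu_min_mhz for vm in running),
        default=cpu_min_mhz,
    )
    # Memory counts as reservation (or default) plus the VM's memory overhead.
    mem = max(
        ((vm.mem_reservation_mb or mem_min_mb) + vm.mem_overhead_mb for vm in running),
        default=mem_min_mb,
    )
    return cpu, mem


# Twelve running VMs without reservations, each with 96 MB of overhead.
vms = [VM(f"vm{i}", mem_overhead_mb=96) for i in range(12)]
print(slot_size(vms))            # (256, 352)
print(slot_size(vms, 300, 155))  # (300, 251)
```

Because the slot is a worst case, a single VM with a large reservation raises the slotsize for the whole cluster.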

In my testing environment, I had no reservations set, had not specified these variables, and did not have any host failover capacity (failover capacity = 0). As soon as I introduced these variables, both set at 128, my failover level instantly increased to 1. When I started to add reservations, I was done quite quickly: adding a reservation of 200MB to a single running VM changed my failover level back to 0. So yes, my environment proves to be a little “well filled” 😉

 

Failover capacity

Now that we have determined the slotsize, the next question is: how does VMware calculate the current failover capacity? This calculation is also pretty basic (in contrast to the thousand interesting calculations you could think of): basically VMware HA takes the ESX host with the least resources, calculates the number of slots that would fit into that particular host, and uses that as the number of slots per ESX host (which is also projected onto any larger hosts in the cluster!). What?? Exactly: using ESX hosts with different amounts of memory and/or processing power in an HA-enabled cluster impacts the failover level!

In effect, these calculations are done for both memory and CPU. Again, the worst-case value is used as the number of slots you can put on any ESX host in the cluster.

After both values are known (slotsize and number of slots per ESX host), it is a simple task to calculate the failover level: multiply the number of slots per host by the number of hosts, and this will give you the number of slots available to the environment. Subtract the number of running VMs from the available slots, and presto, you have the number of slots left. Now divide this number by the number of slots per host (rounding down), and you end up with the current failover level. Simple as that!

 

Example

Let's say we have a testing environment (just like mine 😉 ), with two ESX nodes in an HA-enabled cluster, configured with 512MB of SC memory. Each has a 2GHz dualcore CPU and 4GB of memory. On this cluster, we have 12 VMs running, with no reservations set anywhere. All VMs are Windows 2003 std 32 bit, which gives a worst-case memory overhead of (in this case) 96MB.

At first, we have no reservations set, and no variables set. So the slotsize is calculated as 256MHz / 256MByte. As both hosts are equal, I can use either host for the slots-per-host calculation:

CPU –> 2000MHz x 2 (dualcore) = 4000 MHz; 4000 MHz / 256 MHz = 15.6, rounded down to 15 slots per host
MEM –> (4000-512) Mbytes / (256+96) Mbytes = 9.9, rounded down to 9 slots per host

So in this scenario 9 slots are available per host, which in my case gives 9 slots x 2 hosts = 18 slots for the total environment. I am running 12 VMs, so 18 – 12 = 6 slots left. 6/9 = 0.67 hosts left for handling failovers. Shame: as you can see, I need another 0.33 hosts' worth of slots to have any failover capacity.

Now in order to change the stakes, I put in the two variables, specifying CPU at 300MHz, and memory at 155Mbytes (of course I just “happened” to use exactly these numbers in order to get both CPU and memory “just pass” the HA-test):

          das.vmCpuMinMHz = 300
          das.vmMemoryMinMB = 155


Since I have no reservations set on any VMs, these are also the highest values to use for slotsizes. Now we get another situation:

CPU –> 2000MHz x 2 (dualcore) = 4000 MHz; 4000 MHz / 300 MHz = 13.3, rounded down to 13 slots per host
MEM –> (4000-512) Mbytes / (155+96) Mbytes = 13.9, rounded down to 13 slots per host

So now 13 slots are available per host. You can imagine where this is going when using 12 VMs… In my case 13 slots x 2 hosts = 26 slots for the total environment. I am running 12 VMs, so 26 – 12 = 14 slots left. 14/13 = 1.08 hosts left for handling failovers. Yes! I just upgraded my environment to a current failover level of 1!
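
Both scenarios can be re-computed with a short Python sketch. It follows the steps as described in this post, not VMware's actual code: the `Host` class and the function names are made up for illustration, and rounding down at every step is an assumption based on the numbers above.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Illustrative host record; the field names are hypothetical."""
    cpu_mhz: int  # total CPU capacity (cores x clock speed)
    mem_mb: int   # memory left for VMs, after the Service Console share


def slots_per_host(hosts, slot_cpu_mhz, slot_mem_mb):
    """Slots that fit on the smallest host: the lower of the CPU and memory counts."""
    cpu_slots = min(h.cpu_mhz for h in hosts) // slot_cpu_mhz
    mem_slots = min(h.mem_mb for h in hosts) // slot_mem_mb
    return min(cpu_slots, mem_slots)


def failover_level(hosts, slot_cpu_mhz, slot_mem_mb, running_vms):
    """Whole hosts' worth of slots left over after placing all running VMs."""
    per_host = slots_per_host(hosts, slot_cpu_mhz, slot_mem_mb)
    if per_host == 0:
        return 0  # not even one slot fits on the smallest host
    spare_slots = per_host * len(hosts) - running_vms
    return max(spare_slots // per_host, 0)


# Two identical hosts: 2 x 2000 MHz CPU, 4000 - 512 MB of memory for VMs.
hosts = [Host(4000, 3488), Host(4000, 3488)]

# Default slotsize: 256 MHz and 256 + 96 MB.
print(slots_per_host(hosts, 256, 352), failover_level(hosts, 256, 352, 12))  # 9 0

# Tuned slotsize: 300 MHz and 155 + 96 MB.
print(slots_per_host(hosts, 300, 251), failover_level(hosts, 300, 251, 12))  # 13 1
```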

Finally, let's look at a situation where I would upgrade one of the hosts to 8GB. Yep, you guessed right: the smaller host will still force its values into the calculations, so basically nothing would change. This is where the calculations go wrong, or even seriously wrong: assume you have a cluster of 10 ESX nodes, all big and strong, but you add a single ESX host having only a single dualcore CPU and 4GB of memory. Indeed, this would impose a very small number of slots per ESX host on the whole cluster. So there you have it: yet another reason to always keep all ESX hosts in a cluster equal in sizing!

Looking at these calculations, I actually was expecting the tipping point to be at 12 slots per host (because I have 12 VMs active), not 13. I might have left out some values of smaller influence somewhere, like used system memory on the host (host configuration… Memory view). Also, the Service Console might count as a VM?!? Or maybe VMware just likes to keep “one free” before stating that yet another host may fail… This is how far I got; maybe I’ll be able to add more detail as I test more. Therefore the calculations shown here may not be dead-on, but they are at least precise enough for any “real life situation” estimates.
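
To get a feeling for how badly one small host can hurt, here is a minimal Python sketch of the "smallest host wins" rule. The big host's size (16000 MHz, 32768 - 512 MB) is a made-up example, the function names are illustrative, and the slotsize of 300 MHz / 251 MB is the one from the second scenario above.

```python
def host_slots(cpu_mhz, mem_mb, slot_cpu_mhz, slot_mem_mb):
    """Slots that fit on one host: the lower of the CPU and memory counts."""
    return min(cpu_mhz // slot_cpu_mhz, mem_mb // slot_mem_mb)


def cluster_slots(hosts, slot_cpu_mhz, slot_mem_mb):
    """Total slots when the smallest host's slot count is projected on every host."""
    smallest = min(host_slots(cpu, mem, slot_cpu_mhz, slot_mem_mb) for cpu, mem in hosts)
    return smallest * len(hosts)


big = (16000, 32768 - 512)   # hypothetical "big and strong" host
small = (4000, 4000 - 512)   # the dualcore 4GB host from the example

print(cluster_slots([big] * 10, 300, 251))            # 530
print(cluster_slots([big] * 10 + [small], 300, 251))  # 143
```

Under these assumptions, adding the eleventh (small) host lowers the slot count of the whole cluster from 530 to 143, even though real capacity went up.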

 

So what is “Configured failover”?

This setting is related to what we have just calculated. You must specify this number, but what does it do?

As seen above, VMware HA calculates the current host failover capacity, which is a calculation based on resources in the ESX hosts, the number of running VMs and the resource settings on those VMs.

Now, the configured failover capacity determines how many host failures you are willing to tolerate. You can imagine that if you have a cluster of 10 ESX hosts, and a failover capacity of one, the environment will basically come to a complete halt if five ESX nodes fail at the same time (assuming all downed VMs have to be restarted on the remaining five). In order to make sure this does not happen, you have to configure a maximum number of ESX hosts that may fail. If more hosts fail than the specified number, HA will not kick in. So, it is basically a sort of self-preservation of the ESX cluster.

esXpress as low-cost yet effective DR – by Erik Zandboer

You want to have some form of fast and easy Disaster Recovery, but you do not want to spend a lot of money in order to get it. What can you do? You might consider buying two SANs, and leaving out SRM. That will work; it will make your recovery and testing more complex, but it will work. But even then, you still have to buy two SANs, the expensive WAN etc. What if you want to do these things on a budget?

DR – What does that actually mean?
More and more people are starting to implement some form of what they call Disaster Recovery. I too am guilty of misusing that name (who isn’t?). My point is that the tape backups we have been making for ages are also part of Disaster Recovery. Your datacenter explodes, you buy new servers, you restore the backups. There you go: Disaster Recovery in action. What comes within reach now, for the larger part because of virtualization, is what is called Disaster Restart. This is when no complex actions are required: you “press a button” and basically you’re done. I conveniently kept the title at “DR”, which kind of covers both 🙂

Products like VMware SRM make the restart after a disaster quite easy and, more importantly, for the larger part you can actually test the failover without interrupting your production environment. This is a very impressive way of doing Disaster Restarting, but still quite a lot of money is involved. You need extra servers, and you need an extra (SRM supported!) SAN in order to get this into action.

Enter esXpress
Recovering or Restarting from a disaster is all about RPO and RTO: the point in time to recover to, and the time required to get your server up and running (from that point in time). The smaller the numbers, the more expensive the solution. Now let's put things in reverse. Why not build a DR solution with esXpress, and see how far we get!

 
DR setup using esXpress
The setup is quite simple. EsXpress is primarily a backup product, and that is just what we are going to set up first. Let's assume we have two sites. One is production with four ESX nodes, and the other site with two nodes is the recovery site (oops, restarting site). For the sake of evading these terms, we’ll use Site-A and Site-B 🙂

At Site-A, we have four nodes running esXpress. At Site-B, we have one or more FTP servers running (why not as a VM!) which receive the backups over the WAN. Now Disaster Recovery is in place, since all backups go off-site. All we have to do now is try to get as near to Disaster Restart as we can.

For the WAN link, we basically need the bandwidth to perform the backups (and perhaps to use for regular networking in case of failover). The WAN could be upgraded as needed, and you can balance between backup frequency versus available bandwidth. EsXpress can even limit its bandwidth if required…
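
The trade-off between backup frequency and bandwidth is simple arithmetic. As a rough Python sketch: the 50 GB backup size, the 100 Mbit/s link and the 80% efficiency factor below are made-up example values, and the function names are illustrative, not anything esXpress provides.

```python
import math


def transfer_hours(backup_gb, link_mbit_s, efficiency=0.8):
    """Hours needed to push backup_gb gigabytes over a link_mbit_s Mbit/s link.

    efficiency is an assumed fraction of the link that is really usable
    for the backup traffic (protocol overhead, other users of the WAN).
    """
    bits = backup_gb * 8 * 1000**3
    usable_bits_per_s = link_mbit_s * 1000**2 * efficiency
    return bits / usable_bits_per_s / 3600


def max_backups_per_day(backup_gb, link_mbit_s, efficiency=0.8):
    """How many such backup runs fit into 24 hours, back to back."""
    return math.floor(24 / transfer_hours(backup_gb, link_mbit_s, efficiency))


# Example: 50 GB of FULLs and DELTAs per backup run over a 100 Mbit/s WAN.
print(round(transfer_hours(50, 100), 2))  # 1.39
print(max_backups_per_day(50, 100))       # 17
```

If a backup run takes longer to offload than the interval between backups, either the WAN or the backup frequency has to give.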

Performing mass-restores
All backups now reside on the FTP server(s) on Site-B. If we were to install esXpress on the ESX nodes at Site-B as well, all we need to do is use esXpress to restore the backups there. And it just so happens that esXpress has a feature for this: Mass Restores.

When you configure mass-restores, the ESX nodes at Site-B are “constantly” checking for new backups on the FTP servers. As soon as a backup finishes, esXpress at Site-B will discover this backup, and start a restore automatically. Where does it restore to? Simple! It restores to a powered-off VM at Site-B.

What this accomplishes is that at Site-B you have the backups of your VMs (with their history captured in FULL and DELTA backups), and the ability to put them to tape if you like. You also have each VM (or just the most important ones if you choose) in the state of the last successful backup standing there, just waiting for a power-on. As a bonus on top of this bonus, you have also just found a way to test your backups on the most regular basis you can think of: every single backup is tested by actually performing a restore!

What does this DR setup cost?
There is no such thing as a free lunch. You have to consider these costs:

  1. Extra ESX servers (standby at the recover/restart site) plus licenses; ESXi is not supported by esXpress (yet);
  2. esXpress licenses for each ESX server (on both sites);
  3. A speedy WAN link (fast enough to offload backups);
  4. Double or even triple the amount of storage on the recover/restart site (space for backups+standby VMs. This is only a rough rule-of-thumb).

Still, way below the costs of any list that holds two SANs and SRM licenses…

So what do you get in the end?
The final question, of course, is what you get from a setup such as this. In short:

  1. Full-image Backups of your VMs (FULLs and DELTAs), which are instantaneously offloaded to the recover/restart site;
  2. The ability to make backups more than one time per 24 hours, tunable on a “per VM” basis;
  3. Have standby VMs that match the latest successful backup of the originating VMs;
  4. Failover to the DR site is as simple as a click… shiftclick… “power on VMs” !;
  5. Ability to put all VM backups to tape with ease;
  6. All backups created are tested by performing automated full restores;
  7. Ability to test your Disaster Restart (only manual reconnection to a “dummy” network is needed in order not to disturb production);
  8. RTO is short. Very short. Keep in mind, that the RTO for one or two VMs can be longer if a restore is running at the DR site: The VM being restored has to finish the restore before it can be started again;
  9. Finally (and this one is important!), if the primary site “breaks” during a replication action (backup action in this case), the destination VM is still functional (in the state of the latest successful backup made).

Using a setup like this is dirt-cheap when compared to SRM-like setups; you can even get away with using local storage only! The RPO is quite long (in the range of several hours to 24 hours), but the RTO is short: in a smaller environment (like 30-50 VMs) the RTO can easily be shorter than 30 minutes.

If this fits your needs, then there is no need to spend more – I would advise you to look at a solution like this using esXpress! You can actually build a fully automated DR environment without complex scripting or having to sell your organs 😉 . You even get backup as a bonus (never confuse backup with DR!)
