esXpress uses vStorage API for detecting changed blocks
Today at VMworld 2009 I joined a breakout session presented by PHD Virtual about the latest version of esXpress (3.6). Great stuff once again! Apart from the fact that esXpress is now fully functional on vSphere (still no ESXi support though), they also managed to use the vStorage API for “changed block reporting”. Basically, what this means is that when you are using vSphere and doing delta or deduped backups, you no longer need to read all the blocks of a VM and then decide whether each block was changed or not. PHD managed to get esXpress so far that it reads only the changed blocks directly, by using this “cheat sheet” that VMware was so nice to make available through the vStorage API.
What this means is that backup speeds will be much higher when you do delta or deduped backups.
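For the curious, the sketch below shows roughly what querying changed blocks through the vSphere API looks like. This is a hypothetical Python/pyVmomi illustration, not esXpress’s actual code; it assumes Changed Block Tracking is enabled on the VM and that a backup snapshot already exists.

# Hypothetical sketch: walking the changed-block map that vSphere's
# Changed Block Tracking (CBT) exposes for one virtual disk.
def changed_areas(vm, snapshot, device_key, capacity_bytes, change_id="*"):
    """Return (start, length) tuples of the areas changed on one virtual disk.

    change_id="*" means "everything changed since CBT was enabled"; passing
    the changeId recorded at the previous backup yields an incremental set.
    """
    offset = 0
    areas = []
    while offset < capacity_bytes:
        result = vm.QueryChangedDiskAreas(snapshot=snapshot,
                                          deviceKey=device_key,
                                          startOffset=offset,
                                          changeId=change_id)
        areas.extend((a.start, a.length) for a in result.changedArea)
        if result.length == 0:
            break  # nothing more to examine
        offset = result.startOffset + result.length
    return areas

A backup engine then only has to read and ship exactly those areas, instead of scanning the whole VMDK.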
When you also use their dedup target, with the dedup action happening at the SOURCE, you get tremendous backup speeds, and as an added bonus you can use smaller WAN links when you send these backups offsite. Wonderful guys, you did it again!
VMware ThinApp becoming automagic!
Yesterday at VMworld 2009 I went to a breakout session on VMware ThinApp. To my surprise we saw a demo of a new ThinApp feature: you can automagically rebuild your ThinApps! In the demo five Windows XP VMs were used, and all “setup.exe” files resided on a share. When the repackager was kicked off, the VMs were snapshotted. Then ThinApp was kicked off inside each VM. After the ThinApps were regenerated, they were automagically copied off the VM, after which the snapshot was reverted. This process repeated itself on all available VMs until all ThinApps were rebuilt.
Magical!
VCP4 certified
I am not (yet?) the kind of blogger to throw everything I see around me onto my blog just because it is “new”; I think blogging should be more about things you have tested or measured.
Yet today at VMworld 2009 I am making an exception: I just got my VMware VCP4 certification. Yeah!
The new esXpress 3.5
For a long time now I have been a fan of PHD’s esXpress. It is still the only VMware backup solution I know that scales, has no single point of failure and works reliably with VMware snapshots. The solution has always been “other than others”: at first it appears to be a really weird piece of software that creates its own appliances to perform its backups. Once you get to know it, esXpress’s way of working is great. So great, in fact, that VMware themselves are now adopting this very way of working with their Disaster Recovery feature in vSphere 4, maybe even stepping away from their beloved VCB (VMware Consolidated Backup).
VCB in my opinion has never been that great, apart from some special uses in special environments. esXpress fits all, from single ESX hosts to large clusters. In contrast to VMware’s Disaster Recovery, which is still buggy at the time of this blog post, esXpress has been on this train for years now, and definitely knows the drill. esXpress 3.1 is not the holy grail though. Some features were just not easy to use, there was no global GUI to manage all nodes easily, and there was no data deduplication available (not that I am that big a fan of data dedup for backup, but hey, everybody does it!).
Enter esXpress 3.5
To make up for most of these shortcomings, esXpress version 3.5 has been introduced. The engine itself is still pretty much the same, and exactly there lies the power of esXpress: it still WORKS. It just works; it always works. Extra features have been added in such a smart and incredibly simple way that the product remains rock stable. No “waiting for the point 1 release” needed here!
I was over at a client who suffered a SAN failure (while upgrading firmware). They were in the process of failing over to their recovery site when the administrator got an email from one of the production ESX hosts: esXpress had successfully completed its backups. What? All LUNs appeared unavailable at the production site. This host had not had its storage devices rescanned; it just kept on ticking. I think things like this are major pluses for both VMware ESX and esXpress, showing their enterprise readiness.
Finally: A working global GUI
With the initial esXpress 3.5 release, PHD also released a GUI to manage all esXpress instances from one central portal. In the old 3.1 (and earlier) days, you ended up copying config files between hosts; that worked, but it was not very user friendly. You might think that adding a central GUI took a lot of deep digging in the code of esXpress. But they surprised once again: the GUI just holds the config files and, could it be any simpler, the GUI appliance introduces a small NFS store. The NFS store is automagically mounted to the ESX servers, and presto! That is where the config files can be found. esXpress itself just has to check the share for a new config, something that already (partly) existed in the previous version.
Even better: the GUI does a great job. I had some trouble with the first versions; some manual labor was needed to get it going (like having to change the time zone manually and not being able to add a second DNS server). All these issues are fixed now, but even those early versions were already very effective. And things have only become better since then!
Because “everybody has it”: Deduplication
What would we do without deduplication nowadays? It is a major hype in storage and backup. If you don’t have it, you’re out of business, it seems. But who ever thinks about the risks and limitations involved (see: The Dedup Dilemma)?
The idea of deduplication is brilliant, but the implementation has to be right. I must admit, I am not a big fan of deduplication. It is still your vital data you are talking about! Admission number two: esXpress 3.5 managed to change my opinion on dedup a little.
The deduplication implementation of esXpress is in style with PHD’s way of working: both effective and simple. A separate appliance is installed (which is in fact the same one as the GUI appliance; at first boot you choose what the appliance will become. Smart!). The dedup appliance (called PHDD, for PHD Data Dedup) can mount a datastore or an NFS store for storing its deduped data. It performs quite well, saving disk space as you back up more of the same (or similar) data. It is now much “cheaper” to keep more backups of your VMs.
Only a few changes appear to have been made to esXpress itself to allow PHDD as a backup target, so once again, stability guaranteed.
So now all your data lives inside the PHDD appliance. How do I get this data out the way I want it? PHD did something clever: they added a CIFS/Samba interface to the appliance, allowing you to browse, copy and back up your VMs as if they weren’t deduped at all! This last feature makes the mix of backup and dedup more acceptable, even effectively usable 🙂
When will the fun EVER stop? File level restore!
The best feature of the PHDD dedup target, in my opinion, next to dedup itself, is the ability to perform file level restores. At last you can get that one single file out of a full VM backup without having to restore the whole thing. This option is so cool: you simply browse to the appliance, select your files, and save the collection you marked as a single zip file! Couldn’t be easier; another bull’s eye for PHD, even in the first release of this piece of software.
Scaling esXpress 3.5 with dedup
Not all is bright and shiny with dedup. I found it hard to scale the solution: if there is only one PHDD target, scaling ends somewhere, and a SPOF (single point of failure) is introduced. Not good (although PHD is working on a way to link the dedup appliance to a secondary one). Still, one may consider using two or more PHDD appliances in parallel. This will work, but the dedup effectiveness will drop sharply, especially when you use DRS and all VM backups end up on all PHDD targets over time (this happens with the often used strategy where one assigns a backup target to each ESX server individually, with failovers to the others). You can make it somewhat more effective by specifying a backup target for each VM (in the local config), a best practice that also stands when using multiple FTP targets btw. This will ensure that a backup of a particular VM always ends up on the same backup target, making things clearer and making dedup more effective (although far from ideal – every PHDD target has its own library of data, meaning that identical blocks still get stored on EACH PHDD target instead of just one).
The limitations mentioned above are not limits of esXpress though, but more limitations of dedup in itself. PHD chose to use online dedup (basically you dedup while you write), which uses CPU power during backups and restores. CPU power might even be the limiting factor in your backup speed. Luckily, CPU power is usually available in abundance nowadays. I will dive deeper into performance and scaling of deduped installations in the next blog post, which will hopefully prove that dedup really performs (like the setup using multiple FTP targets simultaneously described in my blog post Scaling VMware hot-backups using esXpress).
Conclusion
In terms of speed and reliability, the new esXpress 3.5 is on par with its predecessor, version 3.1. It is still the only backup solution I know that has no single point of failure, scales (REALLY scales) up to whatever size you want without any issues, and best of all: once it works it KEEPS working, with hardly any of the problems around VM snapshotting that some other backup solutions have.
On top of all the good things that already were, a global GUI has been added which manages all esXpress installs at the same time, and there is a data deduplication appliance which features a very well working single file restore option. I would like to have seen a file restore option for non-dedup targets as well. From what I’ve seen, online deduping costs a lot of CPU power, and backup speeds go down because of this. Once the database is built though, things do get better (less data to back up, because more and more blocks are already stored in the dedup appliance). Still, calculations have to be done.
In a smaller environment, the dedup appliance is no match for a set of non-deduping FTP targets. This is a drawback from which any dedup system suffers… it is just the way the “thingy” works. Still, I see a solid future for esXpress’s PHDD dedup targets where speed is not of the utmost importance.
Make no mistake on backup speeds: IF esXpress and its backup targets are designed and configured properly, it is by far the fastest full-VM backup solution I’ve seen. It does not mess with taking backups through the service console network; it creates virtual appliances at runtime that perform the backups – and many in parallel. If you want to see real backup speed from esXpress, do not test it on a single VM like some people tend to do when comparing. If you do, speeds are about on par with other third-party vendors. But when scaled up to make eight or more backups in parallel to several backup targets with matched bandwidth, esXpress will start to shine and leave the competition far behind.
The Dedup Dilemma
Everybody does it – and if you don’t, you can’t play along. What am I talking about? Data deduplication. It’s the best thing since sliced bread I hear people say. Sure it saves you a lot of disk space. But is it really all that brilliant in all scenarios?
The theory behind Data Deduplication
The idea is truly brilliant – you store blocks of data in a storage solution, and you create a hash which uniquely identifies the data inside each block. Every time you need to back up a block, you check (using the hash) whether you already have the block in storage. If you do, you just write a pointer to the data. Only if you do not have the block yet do you copy it and add it to the dedup database. The advantage is clear: the more identical data you store, the more disk space you save. Especially in VMware environments, where many equal VMs are deployed from the same templates, this yields very big savings in disk space.
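As a toy illustration of the principle (a minimal sketch, not any vendor’s actual implementation): identical blocks are stored once, and every occurrence only keeps the hash as a pointer.

# Toy hash-based dedup store: identical blocks are stored exactly once.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}                        # hash -> block data (the "library")

    def write(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(key, data)       # store only if never seen before
        return key                              # caller keeps this pointer, not the data

    def read(self, key: str) -> bytes:
        return self.blocks[key]

store = DedupStore()
template_block = b"\x00" * 4096
pointers = [store.write(template_block) for _ in range(100)]  # 100 identical VM blocks
assert len(store.blocks) == 1                   # stored exactly once

The flip side of this picture is exactly the dilemma described below: every one of those 100 pointers depends on that single stored block.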
The actual dilemma
A certainly nice thing about deduplication, next to the large amount of storage (and associated cost) you save, is that when you deduplicate at the source, you end up sending only new blocks across the line, which can dramatically reduce the bandwidth you need between remote offices and central backup locations. Deduplication at the source also means you generally spread the CPU load across your remote servers instead of concentrating it in the storage solution.
Since there is a downside to every upside, data deduplication certainly has its downsides. For example, if I had 100 VMs, all from the same template, there surely are blocks that occur in each and every one of them. If that particular block gets corrupted… indeed! You lose ALL your data. Continuing to scare you: if the hash algorithm you use is insufficient, two different data blocks might be identified as being equal, resulting in corrupted data. Make no mistake, the only way you can be 100% sure two blocks are equal is to use a hash as big as the block itself (rendering the solution kind of useless). All dedup vendors use shorter hashes (I wonder why 😉 ), and live with the risk (which is VERY small in practice, but never zero). The third major drawback is the speed at which the storage device is able to deliver your data (un-deduplicated) back to you (which especially hurts on backup targets that have to perform massive restore operations). Final drawback: you need your ENTIRE database in order to perform any restore (at least, you cannot be sure which blocks are going to be required to restore a particular set of data).
So – should I use it?
The reasons stated above always kept me a skeptic when it came to data deduplication, especially for backup purposes. Because at the end of the day, you want your backups to be functional, and not to require the ENTIRE dataset in order to perform a restore. Speed can also be a factor, especially when you rely on restores from the dedup solution in a disaster recovery scenario.
Still, there are definitely uses for deduplication. Most vendors have solved most of the issues with success, for example by making un-deduplicated data directly accessible from the storage solution (enabling separate backups to tape etc). I have been looking at the new version of esXpress with their PHDD dedup targets, and I must say it is a very elegant solution (on which I will write a blog post shortly 🙂
Surviving total SAN failure
Almost every enterprise setup for ESX features multiple ESX nodes, multiple failover paths, multiple IP and/or fiber switches… But having multiple SANs is hardly ever done, except in Disaster Recovery environments. So what if your SAN decides to fail altogether? And even more importantly, how can you prevent impact if it happens to your production environment?
Using a DR setup to cope with SAN failure
One option to counter the problem of total SAN failure would of course be to use your DR site’s SAN and perform a failover (either manually or via SRM). This is kind of a hard call to make: using SRM will probably not get your environment up within the hour, and if you have a proper underlying contract with the SAN vendor, you might be able to fix the issue on the primary SAN within the hour. No matter how you look at it, you will always have downtime in this scenario. But in these modern times of HA and even Fault Tolerance (a vSphere 4 feature), why live with downtime at all?
Using vendor-specific solutions
A lot of vendors have thought about this problem, and especially in the IP-storage corner one sees an increase in “highly available” solutions. Most of the time relatively simple building blocks are simply stacked, which can then survive a SAN (component) failure. This is one way to cope, but it generally has a lot of restrictions – such as vendor lock-in and an impact on performance.
Why not do it the simple way?
I have found that simple solutions are generally the best solutions, so I tried to approach this problem from a very simple angle: from within the VM. The idea is simple: you use two storage boxes which your ESX cluster can use, you put a VM’s disk on a LUN on the first storage box, and you simply add a software mirror on a LUN on the second storage box. It is almost too easy to be true. I used a Windows 2003 server VM, converted the boot drive to a dynamic disk, added a second disk (placed on the second storage box) to the VM, and chose “add mirror” on the boot disk, targeting the second disk.
Unfortunately, it did not work right away. As soon as one of the storage boxes fails, VMware ESX reports “SCSI BUSY” to the VM, which causes the VM to freeze forever. After adding the following to the *.vmx file of the VM, things got a lot better:
scsi0.returnBusyOnNoConnectStatus = "FALSE"
Now, as soon as one of the LUNs fails, the VM has a slight “hiccup” before it decides that the mirror is broken, and it continues to run without issues or even lost sessions! After the problem with the SAN is fixed, you simply perform an “add mirror” within the VM again, and after syncing you are ready for the next SAN failure. Of course you need to remember that if you have 100+ VMs to protect this way, there is a lot of work involved…
This has proven to be a simple yet very effective way to protect your VMs from a total (or partial) SAN failure. A lot of people do not like the idea of using software RAID within the VMs, but hey, in the early days, who gave ESX a thought for production workloads? And just to keep the rumors going: to my understanding vSphere is going to be doing exactly this from an ESX point of view in the near future…
To my knowledge, at this time there are no alternatives besides the two described above to survive a SAN failure with “no” downtime (unless you go down the software clustering path of course).
Resistance is ViewTile!
Nowadays, more and more companies realize that virtual desktops are the way to go. It seems inevitable. Resistance is Futile. But how do you scale up to, for example, 1000 users per building block? How much storage do you need, how many spindles do you need? Especially with the availability of VMware View 3, the answers to these questions become more and more complex.
Spindle counts
Many people still design their storage requirements based on the amount (in GBytes) of storage needed. For smaller environments, you can actually get away with this. It seems to “fix itself” given the current spindle sizes (just don’t go and fill up 1TB SATA spindles with VMs). The larger spindle sizes of today and the near future, however, make it harder and harder to maintain proper performance if you are ignorant about spindle counts. Do not forget, those 50 physical servers you had before actually had at least 100 spindles to run from. After virtualization, you cannot expect them all to “fit” on a (4+1) RAID5. The resulting storage might be large enough, but will it be fast enough?
Then VMware introduced the VMmark Tiles. This was a great move; a Tile is a simulated common load for server VMs. The result: The more VMmark Tiles you can run on a box, the faster the box is from a VMware ESX point of view.
In the world of View, there really is no difference. A thousand physical desktops have a thousand CPUs and a thousand (mostly SATA) spindles. Just as in the server virtualization world, one cannot expect to be able to run a thousand users off of ten 1TB SATA drives. Although the storage might be sufficient in the number of resulting terabytes, the number of spindles in this example would obviously not be sufficient: a hundred users would all have to share a single SATA spindle!
So basically, we need more spindles, and we might even have to keep expensive gigabytes or even terabytes unused. The choice of spindle type is going to be the key here – using 1TB SATA drives, you’d probably end up using 10TB, leaving about 40TB empty. Unless you have a master plan for putting your disk based backups there (if no vDesktops are used at night), you might consider going for faster, smaller spindles. Put finance in the mix and you have some hard design choices to make.
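A rough sizing sketch makes the trade-off visible. All numbers below are assumptions for illustration (about 4 IOPS and 10GB per desktop, typical per-spindle IOPS figures), not measurements:

# Back-of-the-envelope VDI sizing: spindle count driven by IOPS versus capacity.
desktops         = 1000
iops_per_desktop = 4         # assumed average steady-state load per vDesktop
gb_per_desktop   = 10        # assumed storage footprint per vDesktop

spindle_types = [("15k FC, 300GB", 180, 300),    # (name, assumed IOPS, GB per spindle)
                 ("7.2k SATA, 1TB", 80, 1000)]

for name, spindle_iops, spindle_gb in spindle_types:
    for_iops     = -(-desktops * iops_per_desktop // spindle_iops)   # ceiling division
    for_capacity = -(-desktops * gb_per_desktop  // spindle_gb)
    needed = max(for_iops, for_capacity)
    print(f"{name}: {needed} spindles (IOPS demands {for_iops}, capacity demands {for_capacity})")

With these (assumed) numbers, the 1TB SATA option is dictated purely by spindle count – around 50 drives for only 10TB of actual data, which is where the roughly 40TB of empty space mentioned above comes from – while the smaller, faster spindles leave far less capacity unused.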
Linked cloning
Just when you thought the equation was relatively simple, like “a desktop has a 10GB virtual drive, period”, linked cloning came about. Now you have master images, replicas of these masters, and linked clones from the replicas. Figuring out how much storage you need, and how many spindles, just got even harder!
Let’s assume we have one master image which is 10GB in size. Per ±64 clones, you are going to need a replica. You can add up to about 4 replicas per master image. All this is not an exact science though; just recommendations found here and there. But how big are these linked clones going to be? This again depends heavily on things like:
- Do you design separate D: drives for the linked clones where they can put their local data and page files;
- What operating system are you running for the vDesktops;
- Do you allow vDesktops to “live” beyond one working day (e.g. do you revert to the master image every working day or not).
Luckily, the amount of disk IOPS per VM is not affected by the underlying technology. Or is it? SAN caching is about to add yet another layer of complexity to the world of View…
Cache is King
Let’s add another layer of complexity: SAN caching. From the example above, if you would like to scale up that environment to 1000 users, you would end up with 1000/64 = 16 LUNs, each having its own replica on it, together with its linked clones. If, in a worst-case scenario, all VMs boot up in parallel, you would have an enormous amount of disk reads on the replicas (since booting requires mostly read actions). Although all replicas are identical, the SAN has no knowledge of this. The result is that, in a perfect world, the blocks used for booting the VMs of all 16 replicas should be in the read cache. Let’s say our XP image uses 2GB of blocks for booting: you would optimally require a read cache in the SAN of 16*2 = 32GB. Performance will degrade the less cache you have. Avoiding these worst-case scenarios is of course another option to manage with less cache. Still, I guess in a View 3 environment: “Cache is King”!
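The same worst-case estimate as plain arithmetic (64 clones per replica and a 2GB boot working set are the assumptions taken from the example above):

# Worst-case read-cache estimate for a boot storm in a linked-clone design.
desktops           = 1000
clones_per_replica = 64     # assumed maximum clones per replica
boot_set_gb        = 2      # assumed blocks an XP image reads while booting

replicas = -(-desktops // clones_per_replica)   # ceiling division -> 16 replicas/LUNs
cache_gb = replicas * boot_set_gb               # -> 32 GB of read cache to hold every boot set
print(replicas, cache_gb)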
While I’m at it, I might just express my utmost interest in the developments at Sun, their Amber Road product line to be more exact. On the inside of these storage boxes, Sun uses the ZFS file system. One of the things that really could make a huge difference here is the ability of ZFS to move content to different tiers (faster storage versus slower storage) depending on how heavily that content is being used. Add high-performance SSD disks to the mix, and you just might have an absolute winner, even if the slowest-tier storage is “only” SATA. I cannot wait for performance results of VDI-like usage on these boxes! My expectations are high, if you can get a decent load balance on the networking side of things (even a static load balance per LUN would work in VDI-like environments).
Resistance is ViewTile!
As I laid out in this blog post, there are many layers of complexity involved when attempting to design a VDI environment (especially the storage side of things). It is becoming almost too complex to use “theory only” on these design challenges. It would really help to have a View-Tile (just like the server-side VMmark Tiles we have now). The server tiles are mostly used to prove the effectiveness of a physical server running ESX: the CPU, the bus structure etc. A View-Tile would potentially not only prove server effectiveness, but also very much the storage solution used (and the FC / IP-storage network design in between). So VMware: a View-Tile is definitely on my wish list for Christmas (or should I consider getting a life after all? 😉 )
The VCDX "not quite Design" exam
Last week I was in London to complete the VCDX Design beta exam. This long awaited exam consists of a load of questions to be completed in four hours. In this blog post I will give my opinion on this exam. Because the contents of the exam should not be shared with others, I will not be giving any hints and tips on how to maximize your score if you participate, but I will address the kind of questions and my expectations in this blog post.
First off, there was way too little time to complete the exam. I tried to type comments on questions with obvious errors, or questions where I suspected something wasn’t quite right. All questions require some reading, so I could nicely time whether I needed to speed up or slow down. Unfortunately, somewhere near question 100, I stumbled upon a question with pages worth of reading! So I had to “hurry” that one, and as a result, the rest as well. Shame. Also, VMware misses out on vital information this way, because people just don’t have the time to comment.
As I have noticed with other VMware exams, the scenarios are never anywhere near realistic (at least not by European standards). Also, having questions about, for example, the bandwidth of a T1 line is not very bright, given the fact that the exam is to be held worldwide. In Europe, we have no clue what a T1 line is.
But the REAL problem with this design exam is that, in my opinion, this is NO design exam at all. Sorry VMware, I was very disappointed. If this were a real design exam, I would actually encourage people to bring all the PDFs and books they can find; it would (should!) not help you. Questions like “how to change a defective HBA inside an ESX node without downtime”? Sorry, nice for the Enterprise exam, but it has absolutely nothing to do with designing. Offload that kind of stuff to the Enterprise exam please! And if you have to ask about things like this, then ask how to go about rezoning the fiber switches. That would at least prove some understanding of how to design an FC network. But that question was missing.
There are numerous other examples of this, all about just knowing that tiny little detail to get you a passing score. That is not designing! I had been hoping for questions like how many spindles to design under VMware and why. When to use SATA and when not to. Customers having blades with only two uplinks. Things that actually happen in reality, things that bring out the designer in you (or should bring out the designer in you)! Designing is not knowing which action will force you to shut down a VM and which action will not. <Sigh>.
I know, the answers could not be A, B, C or D. This exam should have open questions. More work for VMware, but hey, that is life. Have people pick up their pen and write it down! Give them space for creativity, avoiding the pitfalls that were sneaked into the scenario. These are the qualities a designer should have anyway. That’s the way to get them tested.
I’ll just keep hoping the final part of the VCDX certification (defending a design before a panel) will finally bring that out. If it doesn’t, we’ll end up with just another “VCP++” exam, for which anyone can get a passing score after studying for a day or two. I hope VCDX will not become “that kind of a certification”!
I hope VMware will look at comments like these in a positive manner, and create an exam which can actually be called a DESIGN exam. VMware, please PLEASE put all the little knowledge tidbits into the Enterprise exam, and create a design exam that actually forces people to DESIGN! Until that time, I’ll keep hoping the final stage of VCDX will restore my hope that this certification really makes a difference.
VMware HA, slotsizes and constraints – by Erik Zandboer
There is always a lot of talk about VMware HA, how it works with slotsizes, and how it uses those slotsizes to determine host failover capacity. But nobody seems to know exactly how this is done inside VMware. By running a few tests (on VMware ESX 3.5U3 build 143128), I was hoping to unravel how this works.
Why do we need slotsizes
In order to be able to guess how many VMs can be run on how many hosts, VMware HA does a series of simple calculations to “guesstimate” the possible workload given a number of ESX hosts in a VMware cluster. Using these slotsizes, HA determines whether more VMs can be started on a host in case of failure of one (or more) other hosts in the cluster (only applicable if HA runs in the “Prevent VMs from being powered on if they violate constraints” mode), and how big the failover capacity is (failover capacity is basically how many ESX hosts in a cluster can fail while maintaining performance for all running and restarted VMs).
Slotsizes and how they are calculated
The slotsize is basically the size of a default VM, in terms of used memory and CPU. You could think of a thousand smart algorithms in order to determine a worst-case, best-case or somewhere-in-between slot size for any given environment. VMware HA however does some pretty basic calculations on this. As far as I have figured it out, here it comes:
Looking through all running (!!) VMs, find the VM with the highest CPU reservation, and find the VM with the highest memory reservation (actually reservation + memory overhead of the VM). These worst-case numbers are used to determine the HA slotsize.
A special case in this calculation is when reservations are set to zero for some or all VMs. In this case, HA uses its default settings for such a VM: 256MB of memory and 256MHz of processing power. You can actually change these defaults, by specifying these variables in the HA advanced settings:
das.vmCpuMinMHz
das.vmMemoryMinMB
In my testing environment, I had no reservations set, had not specified these variables, and had no host failover capacity (failover capacity = 0). As soon as I introduced these variables, both set at 128, my failover level instantly increased to 1. When I started to add reservations, I was done quite quickly: adding a reservation of 200MB to a single running VM changed my failover level back to 0. So yes, my environment proved to be a little “well filled” 😉
Failover capacity
Now that we have determined the slotsize, the next question is: how does VMware calculate the current failover capacity? This calculation is also pretty basic (in contrast to the thousand interesting calculations you could think of): basically, VMware HA takes the ESX host with the least resources, calculates the number of slots that would fit into that particular host, and uses that as the number of slots per ESX host (which is also projected onto any larger hosts in the ESX cluster!). What?? Exactly: using ESX hosts with different sizes for memory and/or processing power in an HA-enabled cluster impacts the failover level!
In effect, these calculations are done for both memory and CPU. Again, the worst-case value is used as the number of slots you can put on any ESX host in the cluster.
After both values are known (the slotsize and the number of slots per ESX host), it is a simple task to calculate the failover level: multiply the number of slots per host by the number of hosts, and you have the number of slots available to the environment. Subtract the number of running VMs from the available slots, and presto, you have the number of slots left. Now divide this number by the number of slots per host, and you end up with the current failover level. Simple as that!
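Put into code, my reading of this arithmetic looks like the sketch below (a guess at the ESX 3.5 behaviour based on my tests, not VMware’s actual code):

# HA slot arithmetic as described above. Reservations in MHz/MB; "overhead_mb"
# is the per-VM memory overhead; "sc_mem_mb" is the Service Console memory.
def current_failover_level(hosts, vms,
                           das_vm_cpu_min_mhz=256, das_vm_memory_min_mb=256):
    # Slot size: worst case over all *running* VMs, with the das.* values as a floor.
    slot_cpu = max(max(vm["cpu_res_mhz"], das_vm_cpu_min_mhz) for vm in vms)
    slot_mem = max(max(vm["mem_res_mb"], das_vm_memory_min_mb) + vm["overhead_mb"]
                   for vm in vms)
    # Slots per host: taken from the smallest host, worst case of CPU versus memory.
    slots_per_host = min(min(h["cpu_mhz"] // slot_cpu,
                             (h["mem_mb"] - h["sc_mem_mb"]) // slot_mem)
                         for h in hosts)
    slots_left = slots_per_host * len(hosts) - len(vms)
    return slots_left // slots_per_host

# The two-node test cluster from the example below: 2GHz dualcore, 4GB RAM and
# 512MB SC memory per host, 12 VMs without reservations, 96MB overhead each.
hosts = [{"cpu_mhz": 4000, "mem_mb": 4000, "sc_mem_mb": 512}] * 2
vms = [{"cpu_res_mhz": 0, "mem_res_mb": 0, "overhead_mb": 96}] * 12
print(current_failover_level(hosts, vms))            # -> 0
print(current_failover_level(hosts, vms, 300, 155))  # -> 1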
Example
Let’s say we have a testing environment (just like mine 😉 ), with two ESX nodes in an HA-enabled cluster, each configured with 512MB of SC memory. Each has a 2GHz dualcore CPU and 4GB of memory. On this cluster, we have 12 VMs running, with no reservations set anywhere. All VMs are Windows 2003 Std 32-bit, which gives a worst-case memory overhead of (in this case) 96MB.
At first, we have no reservations set, and no variables set. So the slotsize is calculated as 256MHz / 256MByte. As both hosts are equal, I can use either host for the slots-per-host calculation:
CPU –> 2000MHz x 2 (dualcore) = 4000 MHz / 256 MHz = 15.6 = 15 slots per host
MEM –> (4000-512) MBytes / (256+96) MBytes = 9.9 = 9 slots per host
So in this scenario 9 slots are available per host, which in my case means 9 slots x 2 hosts = 18 slots for the total environment. I am running 12 VMs, so 18 – 12 = 6 slots left. 6/9 = 0.6 hosts left for handling failovers. Shame, as you can see I need 0.4 hosts extra to have any failover capacity.
Now, in order to change the stakes, I put in the two variables, specifying CPU at 300MHz and memory at 155MBytes (of course I just “happened” to use exactly these numbers in order to get both CPU and memory to just pass the HA test):
das.vmCpuMinMHz = 300
das.vmMemoryMinMB = 155
Since I have no reservations set on any VMs, these are also the highest values to use for slotsizes. Now we get another situation:
CPU –> 2000MHz x 2 (dualcore) = 4000 MHz / 300 MHz = 13.3 = 13 slots per host
MEM –> (4000-512) Mbytes / (155+96) Mbytes = 13.9 = 13 slots per host
So now 13 slots are available per host. You can imagine where this is going with 12 VMs… In my case 13 slots x 2 hosts = 26 slots for the total environment. I am running 12 VMs, so 26 – 12 = 14 slots left. 14/13 = 1.07 hosts left for handling failovers. Yes! I just upgraded my environment to a current failover level of 1!
Finally, let’s look at a situation where I would upgrade one of the hosts to 8GB. Yep, you guessed right: the smaller host still forces its values into the calculations, so basically nothing would change. This is where the calculations go wrong, or even seriously wrong: assume you have a cluster of 10 ESX nodes, all big and strong, but you add a single ESX host having only a single dualcore CPU and 4GB of memory. Indeed, this would impose a very small number of slots per ESX host on the cluster. So there you have it: yet another reason to always keep all ESX hosts in a cluster equal in sizing!
Looking at these calculations, I was actually expecting the tilt point to be at 12 slots per host (because I have 12 VMs active), not 13. I might have left out some values of smaller influence somewhere, like used system memory on the host (host configuration… Memory view). Also, the Service Console might count as a VM?!? Or maybe VMware just likes to keep “one free” before stating that yet another host may fail… This is how far I got; maybe I’ll be able to add more detail as I test more. Therefore the calculations shown here may not be dead-on, but they are at least precise enough for any “real life” estimates.
So what is “Configured failover”?
This setting is related to what we have just calculated. You must specify this number, but what does it do?
As seen above, VMware HA calculates the current host failover capacity, which is a calculation based on resources in the ESX hosts, the number of running VMs and the resource settings on those VMs.
Now, the configured failover capacity determines how many host failures you are willing to tolerate. You can imagine that if you have a cluster of 10 ESX hosts, and a failover capacity of one, the environment will basically come to a complete halt if five ESX nodes fail at the same time (assuming all downed VMs have to be restarted on the remaining five). In order to make sure this does not happen, you have to configure a maximum number of ESX hosts that may fail. If more hosts fail than the specified number, HA will not kick in. So, it is basically a sort of self-preservation of the ESX cluster.
esXpress as low-cost yet effective DR – by Erik Zandboer
You want to have some form of fast and easy Disaster Recovery, but you do not want to spend a lot of money to get it. What can you do? You might consider buying two SANs and leaving out SRM. That will work, it will make your recovery and testing more complex, but it will work. But even then, you still have to buy two SANs, the expensive WAN link, etc. What if you want to do these things – on a budget?
DR – What does that actually mean?
More and more people are starting to implement some form of what they call Disaster Recovery. I too am guilty of misusing that name (who isn’t?). My point is, the tape backups we have been making for ages are also part of Disaster Recovery. Your datacenter explodes, you buy new servers, you restore the backups. There you go: Disaster Recovery in action. What comes within reach now, for the larger part because of virtualization, is what is called Disaster Restart. This is when no complex actions are required; you “press a button” and basically – you’re done. I conveniently kept the title to “DR”, which kind of favors both 🙂
Products like VMware SRM make the restart after a disaster quite easy, and more important, for the larger part you can actually test the failover without interrupting your production environment. This is a very impressive way of doing Disaster Restarting, but still quite a lot of money is involved. You need extra servers, you need an extra (SRM supported!) SAN in order to get this into action.
Enter esXpress
Recovering or restarting from a disaster is all about RPO and RTO – the point in time to recover to, and the time required to get your servers up and running (from that point in time). The smaller the numbers, the more expensive the solution. Now let’s put things in reverse: why not build a DR solution with esXpress, and see how far we get!
DR setup using esXpress
The setup is quite simple. esXpress is primarily a backup product, and that is just what we are going to set up first. Let’s assume we have two sites. One is production with four ESX nodes, and the other site, with two nodes, is the recovery site (oops, restarting site). To avoid these terms altogether, we’ll use Site-A and Site-B 🙂
At Site-A, we have four nodes running esXpress. At Site-B, we have one or more FTP servers running (why not as a VM!) which receive the backups over the WAN. Now Disaster Recovery is in place, since all backups go off-site. All we have to do next is try to get as close to Disaster Restart as we can.
For the WAN link, we basically need enough bandwidth to perform the backups (and perhaps to carry regular networking in case of failover). The WAN could be upgraded as needed, and you can balance backup frequency against available bandwidth. esXpress can even limit its bandwidth usage if required…
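A quick back-of-the-envelope check helps with that balancing act. The numbers below are purely illustrative assumptions (60GB of changed data per day, a 100Mbit/s link), not a recommendation:

# How long does one backup cycle take over the WAN, given a daily delta size?
daily_delta_gb = 60      # assumed total changed data per day across all VMs
wan_mbit       = 100     # assumed usable WAN bandwidth in Mbit/s

hours = daily_delta_gb * 8 * 1024 / wan_mbit / 3600
print(f"{hours:.1f} hours per backup cycle")   # about 1.4 hours with these numbers

If the outcome does not fit your backup window (or the RPO you are after), either the link or the backup frequency has to change.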
Performing mass-restores
All backups now reside on the FTP server(s) at Site-B. If we were to install esXpress on the ESX nodes at Site-B as well, all we need to do is use esXpress to restore the backups there. And it just so happens that esXpress has a feature for this: Mass Restores.
When you configure mass-restores, the ESX nodes at Site-B are “constantly” checking for new backups on the FTP servers. As soon as a backup finishes, esXpress at Site-B will discover this backup, and start a restore automatically. Where does it restore to? Simple! It restores to a powered-off VM at Site-B.
What this accomplishes is that at Site-B you have the backups of your VMs (with their history captured in FULL and DELTA backups), and the ability to put them to tape if you like. You also have each VM (or just the most important ones, if you choose) standing there in the state of its last successful backup, just waiting for a power-on. As a bonus on top of this bonus, you have also just found a way to test your backups on the most regular basis you can think of – every single backup is tested by actually performing a restore!
What does this DR setup cost?
There is no such thing as a free lunch. You have to consider these costs:
- Extra ESX servers (standby at the recover/restart site) plus licenses; ESXi is not supported by esXpress (yet);
- esXpress licenses for each ESX server (on both sites);
- A speedy WAN link (fast enough to offload backups);
- Double or even triple the amount of storage on the recover/restart site (space for backups+standby VMs. This is only a rough rule-of-thumb).
Still, way below the costs of any list that holds two SANs and SRM licenses…
So what do you get in the end?
Final question of course, is what do you get from a setup such as this? In short:
- Full-image Backups of your VMs (FULLs and DELTAs), which are instantaneously offloaded to the recover/restart site;
- The ability to make backups more than once per 24 hours, tunable on a “per VM” basis;
- Have standby VMs that match the latest successful backup of the originating VMs;
- Failover to the DR site is as simple as a click… shift-click… “power on VMs”!;
- Ability to put all VM backups to tape with ease;
- All backups created are tested by performing automated full restores;
- Ability to test your Disaster Restart (only manual reconnection to a “dummy” network is needed in order not to disturb production);
- RTO is short. Very short. Keep in mind that the RTO for one or two VMs can be longer if a restore is running at the DR site: the VM being restored has to finish the restore before it can be started again;
- Finally (and this one is important!), if the primary site “breaks” during a replication action (backup action in this case), the destination VM is still functional (in the state of the latest successful backup made).
Using a setup like this is dirt cheap compared to SRM-like setups; you can even get away with using local storage only! The RPO is quite long (in the range of several hours to 24 hours), but the RTO is short – in a smaller environment (say 30-50 VMs) the RTO can easily be shorter than 30 minutes.
If this fits your needs, then there is no need to spend more – I would advise you to look at a solution like this using esXpress! You can actually build a fully automated DR environment without complex scripting or having to sell your organs 😉 . You even get backup as a bonus (never confuse backup with DR!)
