The temptation of "Quantum-Entangling" Virtual Machines – by Erik Zandboer

More and more SAN and NAS vendors are adding synchronous replication to their storage devices – some can even present the same data locally at different sites using NFS. This sounds great, but more and more people combine it with a VMware cluster stretched across sites – and that is where it goes wrong: VMs run here, using storage there. It all becomes “quantum entangled”, leaving you nowhere when disaster strikes.

These storage offerings tempt people into building a single VMware HA cluster across sites. And really, I cannot blame them. It all sounds too good to be true: “If an ESX node at site A fails, the VMs are automagically started on an ESX server at the other site. Better yet, you can actually VMotion VMs from site A to site B and vice versa.” Who would not want this?

VMware thinks differently – and with reason. They state that a VMware cluster is meant for failover and load balancing between LOCAL ESX nodes; failover between sites is a whole other ballgame (which is where Site Recovery Manager, or SRM, comes in). This was not decided without reason, as I will try to explain.

 

How you should not do DR

If you have one big storage array stretched across sites, you could run VMs on either side, using whatever storage is local to each VM. That way, the disk access from VM to storage does not travel over the WAN. But when DRS kicks in, the VMs will start to migrate between ESX nodes – and between sites! And that is where it goes wrong: the VMs and their respective storage get “entangled”. I like to call that “quantum-entanglement of VMs”, because it is somewhat alike, and of course, because I can 🙂

Even without DRS, but with manual VMotions, in time you will definitely lose track of which VM runs where, and more importantly, which VM uses storage from where. In the end 50% of your VMs might be using storage at the other site, loading the WAN with disk I/O and adding the WAN’s latency to the disk I/O of the VMs that have become “stretched”.
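To get a feel for how far the entanglement has already spread, you can simply inventory which VMs run at one site while using a datastore at the other. Below is a minimal sketch using pyVmomi (the vSphere Python SDK); the vCenter address, credentials and the “siteA”/“siteB” naming convention are assumptions, purely for illustration:

```python
# Minimal sketch: report VMs that run at one site but use storage at the other.
# Assumes hosts and datastores carry a site prefix in their names ("siteA...",
# "siteB...") -- adjust site_of() to whatever convention your environment uses.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def site_of(name):
    # Hypothetical helper: derive the site from an object name prefix.
    return "A" if name.lower().startswith("sitea") else "B"

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.runtime.host is None:          # skip unplaced/orphaned VMs
            continue
        host_site = site_of(vm.runtime.host.name)
        for ds in vm.datastore:
            if site_of(ds.name) != host_site:
                print(f"{vm.name}: runs at site {host_site}, but uses datastore {ds.name}")
    view.DestroyView()
finally:
    Disconnect(si)
```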

All this is pretty bad, but let’s say something really bad happens: your datacenter at one location is flooded, and management decides you have to perform a failover to the other site. Now panic strikes: there is probably no disaster recovery plan, and even if there is, it is probably far from usable. VMs have been VMotioned to the other site, storage has been added from either side, and VMs have been created somewhere, using storage somewhere and possibly everywhere. In other words: you have no idea where to begin, let alone be able to automate or test a failover.

 

VMware’s way of doing DR

To overcome the problems of this “entanglement”, VMware defines a few clear design rules for how you should set up DR failover, with SRM helping out if you choose to use it. But even without SRM, it is still a very good way of designing for DR.

VMware states that you should keep a VMware cluster within a single site. DRS and HA will then take care of the “smaller disasters” such as NICs going down or ESX nodes failing – basically every event that is not a total disaster. These failovers are automatic; they correct the situation without any human intervention.
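In practice that means building one HA/DRS cluster per site rather than a single stretched cluster. As a minimal sketch, again using pyVmomi, this is roughly what that layout looks like (the datacenter object and the cluster names are assumptions):

```python
# Minimal sketch: one HA/DRS cluster per site instead of a stretched cluster.
from pyVmomi import vim

def create_site_cluster(datacenter, name):
    # Enable HA and DRS within the cluster; both stay confined to one site.
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(enabled=True),
        drsConfig=vim.cluster.DrsConfigInfo(enabled=True),
    )
    return datacenter.hostFolder.CreateClusterEx(name=name, spec=spec)

# Assuming 'dc' is a vim.Datacenter obtained from a connected ServiceInstance:
# create_site_cluster(dc, "Cluster-SiteA")
# create_site_cluster(dc, "Cluster-SiteB")
```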

The other site should be totally separated from a storage point of view. The only connection between the storage on both sites should be a replication link, so that each site stands completely on its own as far as storage is concerned. Out of scope for this blog, yet VERY important: when you decide on using asynchronous replication, make sure your storage devices can guarantee data integrity across both sites! A lot of vendors “just copy blocks” from one site to the other; failure of one site during such a block copy can (and will) lead to data corruption. For example, EMC storage creates a snapshot just before an asynchronous replication starts, and can revert to that snapshot in case of real problems. Once again, make sure your SAN supports this (or use synchronous replication).

Now let’s say disaster strikes. One site is flooded. HA and DRS are not able to keep up, servers go down. This is beyond what the environment should be allowed to “fix” by itself – so management decides to go for a failover. Using SRM, it should take only the press of a button and some patience (and coffee); but even without SRM you will know exactly what to do: make the replicated data visible (read/write) at the other site, browse the datastores for VMs, register them, and start them. Even without any DR plan in place, it is still doable!
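Those last steps can even be scripted in advance. Purely as a hedged illustration, here is a minimal pyVmomi sketch of the register-and-start part, once the storage team has promoted the replicated LUN to read/write and the datastore is mounted at the DR site. The vCenter address, credentials, object indexes and the .vmx paths are assumptions – in a real script you would browse the datastore for *.vmx files rather than hard-code them:

```python
# Minimal sketch: register VMs found on the promoted datastore and power them on.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter-dr.example.local", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    dc = content.rootFolder.childEntity[0]      # assumed: the (only) datacenter
    cluster = dc.hostFolder.childEntity[0]      # assumed: the DR cluster
    pool = cluster.resourcePool

    # Example .vmx paths on the replicated datastore (normally discovered by
    # browsing the datastore rather than hard-coded).
    vmx_paths = ["[replica-ds01] web01/web01.vmx",
                 "[replica-ds01] db01/db01.vmx"]

    for vmx in vmx_paths:
        task = dc.vmFolder.RegisterVM_Task(path=vmx, asTemplate=False, pool=pool)
        WaitForTask(task)
        vm = task.info.result                   # the newly registered VM
        WaitForTask(vm.PowerOnVM_Task())
        print(f"Started {vm.name}")
finally:
    Disconnect(si)
```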

 

Where to leave your DR capacity: 50-50 or 100-0?

So let’s assume you went for the “right” solution. The next thing to decide is what you are going to run where. Having a DR site, it would make sense to run all VMs (or at least almost all VMs) at the primary site and leave the DR site dormant. Even better, if your company structure allows it, run test and development at the DR site. In case of a major disaster you can fail over production to the DR site, losing only test and development (if that is acceptable).

The problem is often your manager: he paid a lot of money for the second SAN and the DR ESX nodes, and now you have to explain that these will do absolutely nothing as long as no disaster takes place. Technically there is no difference: you either run both sites at 50%, or one at 100% and the other dormant at 0%. Politically it is much more difficult to sell.

If you use SRM, there is a clear business case: if you run at 50-50, SRM needs double the licenses, and SRM is not cheap. Without SRM it takes more explaining, but in my opinion running at 100-0 is still the way to go. As an added bonus, you might need fewer ESX nodes at the DR site if you do not have to fail over the full production environment (which reduces cost even without SRM).

 

Conclusion

–> Don’t ever be tempted to quantum-entangle your VMs and their storage!
