disaster restart

esXpress as low-cost yet effective DR – by Erik Zandboer

March 3rd, 2009 |

You want to have some form of fast and easy Disaster Recovery, but you do not want to spend a lot of money in order to get it. What can you do? You might consider buying two SANs, and leaving out SRM. That will work, it will make your recovery and testing more complex, but it will work. But even then, you still have to buy two SANs, the expensive WAN etc. What if you want to do these things – on a budget

DR – What does that actually mean?
More and more people start to implement some form of what they call Disaster Recovery. I too am guilty of misusing that name (who isn’t), Disaster Recovery. My point is, tape backups made for ages now are also part of Disaster Recovery. Your datacenter explodes, you buy new servers, you restore the backups. There you go: Disaster Recovery in action. What comes in reach now, for the larger part because of virtualization, is what is called Disaster Restart. This is when no complex actions are required, you “press a button” and basically – you’re done. I conveniently kept the title to “DR”, which kind of favors both 🙂

Products like VMware SRM make the restart after a disaster quite easy, and more important, for the larger part you can actually test the failover without interrupting your production environment. This is a very impressive way of doing Disaster Restarting, but still quite a lot of money is involved. You need extra servers, you need an extra (SRM supported!) SAN in order to get this into action.

Enter esXpress
Recovering or Restarting from a disaster is all about RPO and RTO – The point in time to recover to, and the time required to get your server up and running (from that point in time). The smaller the numbers, the more expensive the solution. Now lets put things in reverse. Why not build a DR solution with esXpress, and see how far we get!

DR setup using esXpress
The setup is quite simple. EsXpress is primarily a backup product, and that is just what we are going to setup first. Lets assume we have two sites. One is production with four ESX nodes, and the other site with two nodes is the recovery site (oops restarting site). For the sake of evading these terms, we’ll use Site-A and Site-B 🙂

At Site-A, we have four nodes running esXpress. At site-B, we have one or more FTP servers running (why not as a VM !) which receive the backups over the WAN. Now, Disaster Recovery is in place, since all backups go off-site. Now all we have to do, is try and get as near to Disaster Restart as we can get.

For the WAN link, we basically need the bandwidth to perform the backups (and perhaps to use for regular networking in case of failover). The WAN could be upgraded as needed, and you can balance between backup frequency versus available bandwidth. EsXpress can even limit its bandwidth if required…

Performing mass-restores
All backups now reside on the FTP server(s) on Site-B. If we were to install esXpress on the ESX nodes at Site-B as well, all we need to do is use esXpress to restore the backups there. And it just so happens that esXpress has a feature for this: Mass Restores.

When you configure mass-restores, the ESX nodes at Site-B are “constantly” checking for new backups on the FTP servers. As soon as a backup finishes, esXpress at Site-B will discover this backup, and start a restore automatically. Where does it restore to? Simple! It restores to a powered-off VM at Site-B.

What this accomplishes is, that at Site-B you have your backups of your VMs (with their history captured in FULL and DELTA backups), and the ability to put that to tape if you like. You also have each VM (or just the most important if you choose) in the state of the last successful backup standing there, just waiting for a power-on. As a bonus on this bonus, you also have just found a way to test your backups on the most regular basis you can think of – every single backup is tested by actually performing a restore!

What does this DR setup cost?
There is no such thing as a free lunch. You have to consider these costs:

Extra ESX servers (standby at the recover/restart site) plus licenses; ESXi is not supported by esXpress (yet);
esXpress licenses for each ESX server (on both sites);
A speedy WAN link (fast enough to offload backups);
Double or even triple the amount of storage on the recover/restart site (space for backups+standby VMs. This is only a rough rule-of-thumb).

Still, way below the costs of any list that holds two SANs and SRM licenses…

So what do you get in the end?
Final question of course, is what do you get from a setup such as this? In short:

Full-image Backups of your VMs (FULLs and DELTAs), which are instantaneously offloaded to the recover/restart site;
The ability to make backups more than one time per 24 hours, tunable on a “per VM” basis;
Have standby VMs that match the latest successful backup of the originating VMs;
Failover to the DR site is as simple as a click… shiftclick… “power on VMs” !;
Ability to put all VM backups to tape with ease;
All backups created are tested by performing automated full restores;
Ability to test your Disaster Restart (only manual reconnection to a “dummy” network is needed in order not to disturb production);
RTO is short. Very short. Keep in mind, that the RTO for one or two VMs can be longer if a restore is running at the DR site: The VM being restored has to finish the restore before it can be started again;
Finally (and this one is important!), if the primary site “breaks” during a replication action (backup action in this case), the destination VM is still functional (in the state of the latest successful backup made).

Using a setup like this is dirt-cheap when compared to SRM-like setups, you can even get away with using local storage only! The RPO is quite long (in the range of several hours to 24 hours), but RTO is short- In a smaller environment (like 30-50 VMs) RTO can easily be shorter than 30 minutes.

If this fits your needs, then there is no need to spend more – I would advise you to look at a solution like this using esXpress! You can actually build a fully automated DR environment without complex scripting or having to sell your organs 😉 . You even get backup as a bonus (never confuse backup with DR!)