VMFS

Ye Olde Snapshot – by Erik Zandboer

February 16th, 2009 |

A lot of people have had more or less unpleasant experiences with forgotten snapshots. You login in the morning, and a VM is down. “Strange” you think. After some investigation, you find out the VMFS volume on which the VM was running is full. Completely full. And to your horror you find out why – A forgotten snapshot is in place which has now grown beyond the size of the VMFS volume.

What exactly does a snapshot do

First thing to understand, is how a snapshot exactly works. When you add a snapshot, the original virtual disk is no longer written to. Each block that should be written into this file, is redirected to a snapshot file. So basically this snapshot file holds all changes made to the virtual disk after the snapshot was made. The more changes you make to blocks not changed before, the larger the snapshot file will grow (in steps of 16MB). Each changed block is stored inside the snapshot file only once. This means that a snapshot file can reach a sometimes staggering size equal or almost equal to the size of the original virtual disk (defragmentation inside a VM is my personal favorite 😉 ).

Monstrous snapshot – now what?

If you “forget” about a snapshot, changes are you will never notice this, right until it might be too late. Especially if you snapshotted a very large virtual disk, and have plenty of room left on the VMFS, snapshots can grow to immense sizes. Cleaning them up can be very time consuming indeed.

If you have found a very old snapshot file which has grown very large (eg. 10-40GB), you can actually delete the snapshot without problems, thereby committing all changes recorded in the snapshot file back to the original disk. So you end up with only the virtual disk as it appeared when the snapshot was in place, only without the snapshot there. But beware – If you delete the snapshot from vCenter (got to get used to that name instead of VirtualCenter), you might very well get a timeout. This has given some people some really sweaty fingers. Don’t panic, login to the ESX node itself, and you’ll probably see that the snapshot is still being removed. It might take an hour, it might take four hours, but in time the snapshot should remove itself.

VMFS full – How to get the VM running again

If a forgotten snapshot fills up the entire VMFS, you might run out of VMFS space. chances are that your snapshotted VM stops. This is because the VM is trying to write to its disk, and the snapshot needs to grow but it can’t. There are two ways to resolve this: 1) make room on the VMFS, or 2) delete the snapshot while the VM remains off. In a production environment, option 2) might not work for you. Deletion of large snapshots might take hours. So we’re back to making room on the VMFS. Maybe you can or move another VM from the VMFS. Maybe you have some ISOs laying about the VMFS you can delete. Then you can start your troubled VM again, and remove the snapshot while the VM is running again. A last resort might even be to give the VM less memory, or put its swapfile in another location (possible in ESX 3.5u3). Then start to delete the snapshot right away, before it manages to fill up the VMFS again.

I have even heard of people who put a 2GB dummy file on each VMFS volume, so that when it comes to these issues they just delete the file – and gain 2 Gbytes of space. If forgetting snapshots is your habit, you might consider this as a “best practice” for your environment…

50GB+ snapshot – Delete or…?

What if you have a really big snapshot (and I mean 50+ GB), or you might even have multiple huge snapshots in place? Or even have snapshots that appear to be garbled in their linkage (horrors like “cannot delete snapshot because the base disk was modified after the snapshot was taken”). You might not want to risk deletion of these snapshot(s). There is another way to recover safely, especially if you run Windows 2003 or later, which should be much more advertised: VMware Converter! It is really a magical tool. Not only for P2V, but also in cases exactly like this. While you keep your VM running, just point Converter to the VM while telling Converter it is a physical machine. Converter will install its agent inside the VM, and start to duplicate your VM to another LUN. After the conversion, the target VM will be free of any snapshots!

This option also works great if you have issues with your SAN. I have seen environments that had LUNs you could not even browse through any more (not from the datastore browser nor via ssh) – but VMs placed there were still running OK. It shows stability and enterprise-readyness of ESX for sure, but how to recover? Even restarting the VM or scanning LUNs is risky here. The simple answer was: Use Converter. Simply use Converter! To make a short story even shorter: converter saved the day 🙂

So I guess as a final word I should say: For VM recovery from even the weirdest disk-related issues, consider to use VMware Converter !