Recently I had another one of those great little problems. A VM refused to have its snapshot removed. Not because the snapshot was too big, it just failed.
What was found
When I looked at the VM, the first thing that struck me was that a (VMware Data Recovery) snapshot was indeed still present. Looking closer, there were a LOT of snapshot files in the VM's folder (somewhere up to VMname-0000033.vmdk!), together adding up to somewhere in the 70GB range. The VI Client stated that only one snapshot existed though, and the *.vmsd file confirmed this. But wait: the *.vmsd file stated that three disks were snapshotted, while only two virtual disks existed in the VM's configuration. What the … ?? I had surely found the reason the snapshot did not want to go away: how do you commit three snapshots onto two virtual disks?
Where things went wrong
So what had gone wrong with this VM? The VM was being backed up by VDR, and it used to have three normal virtual disks. And here comes the fun part: at some point, the third disk had to be removed from the VM. To do this, the customer had decided that disabling LUN access was the way to go (don't try this at home though). As a result, the third virtual disk vanished from the face of the earth. But apparently a VDR snapshot was present at that time (still following?). This left the VM with three virtual disks (pointing to their three snapshots), except that the last disk no longer had a base disk to support its snapshot (the snapshot of this disk still existed because it lives where the VM config is stored, not on the disabled LUN).
Up next: the customer removed the third disk from the configuration of the VM, and all seemed well. Right up to the point where VDR tried to perform another backup. Another snapshot was created, but could not be removed properly. VDR kept trying to back up the VM, and kept failing to remove the last snapshot.
This is what the *.vmsd file looks like:
.encoding = "UTF-8"
snapshot.lastUID = "654"
snapshot.numSnapshots = "1"
snapshot.current = "631"
snapshot0.uid = "631"
snapshot0.filename = "VMNAME-Snapshot631.vmsn"
snapshot0.displayName = "_datarecovery_"
snapshot0.description = "Automatically created by VMware Data Recovery 11/19/2010 6:00:17 PM"
snapshot0.createTimeHigh = "300394"
snapshot0.createTimeLow = "-344966923"
snapshot0.numDisks = "3"
snapshot0.disk0.fileName = "VMNAME.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot0.disk1.fileName = "VMNAME_1.vmdk"
snapshot0.disk1.node = "scsi0:1"
snapshot.needConsolidate = "TRUE"
snapshot0.disk2.fileName = "VMNAME.vmdk" <-- NON-existent base disk (LUN was disabled)
snapshot0.disk2.node = "scsi0:2"
snapshot1.uid = "654"
snapshot1.filename = "VMNAME-Snapshot654.vmsn"
snapshot1.parent = "631"
snapshot1.displayName = "_datarecovery_"
snapshot1.description = "Automatically created by VMware Data Recovery 12/6/2010 6:37:05 PM"
snapshot1.createTimeHigh = "300737"
snapshot1.createTimeLow = "1390438140"
snapshot1.numDisks = "2"
snapshot1.disk0.fileName = "VMNAME-000035.vmdk"
snapshot1.disk0.node = "scsi0:0"
snapshot1.disk1.fileName = "VMNAME_1-000023.vmdk"
snapshot1.disk1.node = "scsi0:1"
snapshot1.disk2.fileName = "VMNAME-000022.vmdk" <-- A snapshot still exists!
snapshot1.disk2.node = "scsi0:2"
Below is part of the many snapshot files visible in the VM's folder:
As you can see, a lot of snapshots are from December 6th. This is when VDR was disabled and no further snapshots were taken. A lot of other snapshots are from earlier dates. Somehow VMware managed to commit the snapshots, but failed to remove the snapshot files themselves. A lot of "empty" snapshots are present as well (look for the "8,192.00 KB" size). These are almost certainly the snapshots made for disk2, and since disk2 was no longer present inside the VM, no changes were ever written to those snapshots.
How to resolve
So how to go about cleaning out this snapshot mess? Basically, I only looked at the VM config file (*.vmx) and the snapshot file (*.vmsd). These indicate that the VM has two disks, SCSI0:0 and SCSI0:1. The snapshot file indicates that both of these disks have one active snapshot (see above). The only strange thing is that the *.vmsd file thinks a third disk (SCSI0:2) is present and also has a snapshot, when in fact this entire disk is long gone.
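The comparison between the two files can also be scripted. Below is a minimal sketch, not a VMware tool: it assumes both files are plain `key = "value"` text as in the listing above, and that the *.vmx names its disks with keys like `scsi0:0.fileName`. The function names and the sample *.vmx lines are made up for illustration.

```python
import re

def parse_kv(text):
    """Parse VMware-style `key = "value"` lines into a dict (keys lower-cased)."""
    pairs = {}
    for line in text.splitlines():
        m = re.match(r'\s*([^\s=]+)\s*=\s*"([^"]*)"', line)
        if m:
            pairs[m.group(1).lower()] = m.group(2)
    return pairs

def vmsd_nodes(vmsd):
    """Device nodes (e.g. 'scsi0:2') referenced by any snapshot entry."""
    return {v.lower() for k, v in vmsd.items()
            if re.fullmatch(r'snapshot\d+\.disk\d+\.node', k)}

def vmx_disk_nodes(vmx):
    """Device nodes that point at a .vmdk file in the VM configuration."""
    nodes = set()
    for k, v in vmx.items():
        m = re.fullmatch(r'((?:scsi|ide|sata)\d+:\d+)\.filename', k)
        if m and v.lower().endswith('.vmdk'):
            nodes.add(m.group(1))
    return nodes

def orphaned_nodes(vmsd_text, vmx_text):
    """Nodes the snapshot file knows about but the VM config does not."""
    return sorted(vmsd_nodes(parse_kv(vmsd_text)) - vmx_disk_nodes(parse_kv(vmx_text)))

# Hypothetical sample data modelled on the situation described above.
vmsd_sample = '''
snapshot0.disk0.node = "scsi0:0"
snapshot0.disk1.node = "scsi0:1"
snapshot0.disk2.node = "scsi0:2"
'''
vmx_sample = '''
scsi0:0.fileName = "VMNAME-000035.vmdk"
scsi0:1.fileName = "VMNAME_1-000023.vmdk"
'''
print(orphaned_nodes(vmsd_sample, vmx_sample))  # ['scsi0:2']
```

Any node this reports is one the snapshot file still tracks while the VM itself no longer has a disk there.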
In the end, cleaning is easy. First, stop the VM. Then, edit the *.vmsd file and remove any references to SCSI0:2 (the snapshot0.disk2 and snapshot1.disk2 lines in the file above).
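If you would rather not hand-edit the file, the same removal can be sketched in a few lines. This is an illustration under assumptions, not an official procedure: `strip_node` is a made-up helper, it works on the file's text rather than the file itself, and it leaves every other line (including `numDisks`) untouched, because the step above only talks about removing the SCSI0:2 references. Keep a copy of the original *.vmsd before writing anything back.

```python
import re

# Matches lines such as: snapshot0.disk2.node = "scsi0:2"
DISK_KEY = re.compile(r'\s*(snapshot\d+\.disk\d+)\.(\w+)\s*=\s*"([^"]*)"', re.IGNORECASE)

def strip_node(vmsd_text, node):
    """Return vmsd_text without the snapshotN.diskM.* lines that belong to `node`."""
    lines = vmsd_text.splitlines()
    # First pass: find the snapshotN.diskM prefixes whose .node equals `node`.
    doomed = set()
    for line in lines:
        m = DISK_KEY.match(line)
        if m and m.group(2).lower() == 'node' and m.group(3).lower() == node.lower():
            doomed.add(m.group(1).lower())
    # Second pass: drop every line that uses one of those prefixes.
    kept = []
    for line in lines:
        m = DISK_KEY.match(line)
        if m and m.group(1).lower() in doomed:
            continue
        kept.append(line)
    return '\n'.join(kept) + '\n'

sample = '''snapshot0.numDisks = "3"
snapshot0.disk1.fileName = "VMNAME_1.vmdk"
snapshot0.disk1.node = "scsi0:1"
snapshot0.disk2.fileName = "VMNAME.vmdk"
snapshot0.disk2.node = "scsi0:2"
'''
cleaned = strip_node(sample, "scsi0:2")
print(cleaned)
```

The two passes are needed because the `fileName` line of a disk comes before the `node` line that tells you which device it belongs to.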
Now it is time to rid yourself of the snapshot. Be sure to check the size of the current snapshots! Look into the *.vmsd file and find out which files are the current snapshots (the files called "VMNAME-nnnn.vmdk"). Find the sizes of these vmdk files. If these files are really big (think >40GB), I'd consider a V2V or cloning the VM at this stage instead of cleaning the snapshot (see "Ye olde Snapshot" for more information on that). If the snapshot is not that big, proceed with normal removal by choosing "delete all" in the snapshot manager. Remember that in most cases removing a snapshot (especially a bigger one) will be faster if your VM is not powered on.
Finally, check the *.vmsd file again. It should now contain something like snapshot.numSnapshots = "0", meaning no snapshots are around anymore. Then restart the VM and all should be good.
Don't forget to remove all old snapshot files (any vmdk files with a trailing six-digit number) and, if CBT was enabled, also remove the corresponding "VMNAME-NNNNNN-ctk.vmdk" files; they are just taking up space and no longer have any value to your VM.
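The size check can be sketched as a small script run against a folder that holds the VM's files. Several things here are my assumptions rather than part of the procedure above: the helper names are invented, the data of a snapshot disk is taken to live in a sibling "-delta.vmdk" file next to the small descriptor, and the 40GB rule of thumb is applied to the combined size of the current snapshot disks.

```python
import os

GIB = 1024 ** 3

def snapshot_size(folder, descriptor):
    """Bytes used by one snapshot disk: the descriptor plus its -delta file, if present."""
    stem = descriptor[:-len('.vmdk')]
    total = 0
    for name in (descriptor, stem + '-delta.vmdk'):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            total += os.path.getsize(path)
    return total

def suggest(folder, descriptors, limit_bytes=40 * GIB):
    """Return (suggestion, total_bytes) for the given current snapshot disks."""
    total = sum(snapshot_size(folder, d) for d in descriptors)
    choice = 'clone or V2V' if total > limit_bytes else 'delete all'
    return choice, total
```

A call such as `suggest(folder, ["VMNAME-000035.vmdk", "VMNAME_1-000023.vmdk"])` would use the two current snapshot disks from the *.vmsd listing above; files that are missing simply count as zero bytes.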
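For completeness, this final check is easy to script as well. A minimal sketch with a made-up helper name; it returns None when the key is not in the text at all, since I am not assuming what a fully cleaned *.vmsd looks like.

```python
import re

def snapshot_count(vmsd_text):
    """Return the value of snapshot.numSnapshots as an int, or None if the key is absent."""
    m = re.search(r'^\s*snapshot\.numSnapshots\s*=\s*"(\d+)"',
                  vmsd_text, re.MULTILINE | re.IGNORECASE)
    return int(m.group(1)) if m else None

print(snapshot_count('.encoding = "UTF-8"\nsnapshot.numSnapshots = "0"\n'))  # 0
```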
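To get a list of cleanup candidates before deleting anything, a filename filter is enough. This is a rough sketch: the helper names are invented, the optional "-delta" suffix is my addition to the two patterns named above, and a base disk whose own name happens to end in a dash and six digits would be matched too, so review the list by hand.

```python
import re

# Snapshot-style names: VMNAME-000022.vmdk, VMNAME-000022-delta.vmdk, VMNAME-000022-ctk.vmdk
LEFTOVER = re.compile(r'-\d{6}(-delta|-ctk)?\.vmdk$', re.IGNORECASE)

def stem(name):
    """Strip the .vmdk extension and any -delta/-ctk suffix, lower-cased."""
    return re.sub(r'(-delta|-ctk)?\.vmdk$', '', name, flags=re.IGNORECASE).lower()

def leftover_candidates(filenames, in_use=()):
    """Snapshot-style files that are not related to any disk listed in `in_use`."""
    protected = {stem(n) for n in in_use}
    return sorted(n for n in filenames
                  if LEFTOVER.search(n) and stem(n) not in protected)

files = [
    "VMNAME.vmdk", "VMNAME-flat.vmdk", "VMNAME_1.vmdk",
    "VMNAME-000022.vmdk", "VMNAME-000022-delta.vmdk", "VMNAME-000035-ctk.vmdk",
]
for name in leftover_candidates(files):
    print(name)
```

Pass the disk names the *.vmx still references as `in_use` so that anything the VM actually depends on stays off the list.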