Posts Tagged ‘ESX’

Quick dive: ESX and maximum snapshot sizes

Even today I still encounter discussions about snapshots and their maximum size. This test is somewhat too simple for my taste, but I’m posting it anyway, so hopefully I don’t have to repeat this “yes/no” discussion every time 🙂

The steps to take are easy:

  1. Take any running VM;
  2. Add an additional disk (not in independent mode);
  3. Fill this disk with data;
  4. Check out the snapshot size;
  5. Delete all data from the disk;
  6. Fill the disk once again with different data just to be sure;
  7. Check the snapshot size again.

So here we go:

Create an additional disk of 1GB, and we see this:

-rw------- 1 root root 65K Oct 18 09:58 Testdisk-ctk.vmdk
-rw------- 1 root root 1.0G Oct 18 09:28 Testdisk-flat.vmdk
-rw------- 1 root root 527 Oct 18 09:56 Testdisk.vmdk

As you can see, I created a Testdisk of 1GB. The Testdisk-ctk.vmdk file comes from Changed Block Tracking, something I have enabled in my testlab for my PHD Virtual Backup (formerly esXpress) testing.

Now we take a snapshot:

-rw------- 1 root root 65K Oct 18 09:59 Testdisk-000001-ctk.vmdk
-rw------- 1 root root 4.0K Oct 18 09:59 Testdisk-000001-delta.vmdk
-rw------- 1 root root 330 Oct 18 09:59 Testdisk-000001.vmdk
-rw------- 1 root root 65K Oct 18 09:59 Testdisk-ctk.vmdk
-rw------- 1 root root 1.0G Oct 18 09:28 Testdisk-flat.vmdk
-rw------- 1 root root 527 Oct 18 09:56 Testdisk.vmdk

Above you see that the Testdisk now has an additional file: Testdisk-000001-delta.vmdk. This is the actual snapshot file, in which VMware will keep all changes (writes) to the snapped virtual disk. From this stage on the base disk (Testdisk-flat.vmdk) is no longer modified; all changes go into the snapshot (you can see this in the next sections, where the change date of the base disk stays at 09:28).

Now I log into the VM to which the disk is added, and I perform a quick format on the disk:

-rw------- 1 root root 65K Oct 18 09:59 Testdisk-000001-ctk.vmdk
-rw------- 1 root root 33M Oct 18 09:59 Testdisk-000001-delta.vmdk
-rw------- 1 root root 385 Oct 18 09:59 Testdisk-000001.vmdk
-rw------- 1 root root 65K Oct 18 09:59 Testdisk-ctk.vmdk
-rw------- 1 root root 1.0G Oct 18 09:28 Testdisk-flat.vmdk
-rw------- 1 root root 527 Oct 18 09:56 Testdisk.vmdk

Interestingly, the snapshot file has grown a bit, to 33MB. But that is nowhere near the 1GB size of the disk. It makes sense though: a quick format does not touch data blocks, only the few needed to get the volume up and running. Because snapshot files grow in steps of 16MB, I guess the quick format changed somewhere between 16MB and 32MB worth of blocks.
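The 16MB grow step makes the observed sizes easy to predict. As a rough sketch (my own illustration, not anything VMware ships), rounding the amount of changed data up to the next 16MB grain gives the ballpark delta-file size; the extra ~1MB in the 33M listing would be file overhead:

```python
# Illustrative only: estimate a snapshot delta file's size from the amount
# of changed data, assuming the 16 MB growth increment observed above.
GRAIN = 16 * 1024 * 1024  # snapshot grow step

def snapshot_size(changed_bytes: int) -> int:
    """Round the changed-data size up to the next 16 MB grain."""
    grains = -(-changed_bytes // GRAIN)  # ceiling division
    return grains * GRAIN

# A quick format touching ~20 MB of metadata lands in the 16-32 MB bracket:
print(snapshot_size(20 * 1024 * 1024) // (1024 * 1024))  # 32
```

So any change between 16MB and 32MB of unique blocks produces the same 32MB grain count, which matches the listing above (plus the small header overhead).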

Next I perform a full format on the disk from within the VM (just because I can):

-rw------- 1 root root 65K Oct 18 09:59 Testdisk-000001-ctk.vmdk
-rw------- 1 root root 1.1G Oct 18 10:19 Testdisk-000001-delta.vmdk
-rw------- 1 root root 385 Oct 18 09:59 Testdisk-000001.vmdk
-rw------- 1 root root 65K Oct 18 09:59 Testdisk-ctk.vmdk
-rw------- 1 root root 1.0G Oct 18 09:28 Testdisk-flat.vmdk
-rw------- 1 root root 527 Oct 18 09:56 Testdisk.vmdk

Not surprisingly, the format command touched all blocks within the virtual disk, growing the snapshot to the size of the base disk (plus 0.1GB of overhead).

Let’s try to rewrite the same blocks by copying a file of 800MB onto the disk:

-rw------- 1 root root 65K Oct 18 09:59 Testdisk-000001-ctk.vmdk
-rw------- 1 root root 1.1G Oct 18 10:19 Testdisk-000001-delta.vmdk
-rw------- 1 root root 385 Oct 18 09:59 Testdisk-000001.vmdk
-rw------- 1 root root 65K Oct 18 09:59 Testdisk-ctk.vmdk
-rw------- 1 root root 1.0G Oct 18 09:28 Testdisk-flat.vmdk
-rw------- 1 root root 527 Oct 18 09:56 Testdisk.vmdk

Things get really boring from here on. The snapshot disk remains at the size of the base disk.

While I’m at it, I delete the 800MB file and copy another file on the disk, this time 912MB:

-rw------- 1 root root 65K Oct 18 09:59 Testdisk-000001-ctk.vmdk
-rw------- 1 root root 1.1G Oct 18 10:21 Testdisk-000001-delta.vmdk
-rw------- 1 root root 385 Oct 18 09:59 Testdisk-000001.vmdk
-rw------- 1 root root 65K Oct 18 09:59 Testdisk-ctk.vmdk
-rw------- 1 root root 1.0G Oct 18 09:28 Testdisk-flat.vmdk
-rw------- 1 root root 527 Oct 18 09:56 Testdisk.vmdk

Still boring. There is no way I manage to get the snapshot file to grow beyond the size of its base disk.


No matter what data I throw onto a snapped virtual disk, the snapshot never grows beyond the size of the base disk (except just a little overhead). I have written the same blocks inside the virtual disk several times. That must mean that snapshotting nowadays (vSphere 4.1) works like this:

For every block that is written to a snapshotted base disk, the block is added to its snapshot file, except when that logical block was already written to the snapshot before. In that case the block already present in the snapshot is OVERWRITTEN, not appended.
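That behavior can be sketched in a few lines of Python (a toy model of my own, not how ESX actually stores its grains): keying written blocks by their logical offset makes a rewrite replace the existing entry instead of appending a new one, so the snapshot can never outgrow the base disk.

```python
# Toy model of the overwrite-not-append snapshot behavior described above.
# Blocks are keyed by their logical block number, so rewriting the same
# block replaces the existing entry instead of growing the file.
class Snapshot:
    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}  # logical block number -> data

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data  # overwrites if already present

    def size(self) -> int:
        return len(self.blocks) * self.block_size

snap = Snapshot()
for pass_no in range(3):          # write the same 256 blocks three times over
    for b in range(256):
        snap.write(b, bytes([pass_no]) * snap.block_size)
print(snap.size())                # 1048576 - still just 256 blocks' worth
```

Three full rewrites of the same region leave the snapshot at exactly one region's size, which is precisely what the unchanging 1.1G delta file above shows.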

So where did the misconception come from that snapshot files can grow beyond the size of their base disk? Without wanting to test every ESX flavour around, I know that in the old ESX 2.5 days a snapshot landed in a REDO log (not a snapshot file). These redo logs were simply a growing list of written blocks. In those days snapshots (redo files) could just grow and grow forever, until your VMFS filled up (those happy days 😉 ). Not verified, but I believe this changed in ESX 3.0 to the behavior we see today.

Throughput part 3: Data alignment

A lot of people have discovered yet another excuse for why their environment is not quite performing as it should: misalignment. Ever since a VMware document stated that misalignment could potentially cost you up to 60% in performance, it has become a catch-all excuse. When you look closer, the impact is often nearly negligible, but sometimes substantial. Why is this?


It comes up more and more in VMware environments today: “You should have aligned the partition. No wonder performance is bad.” But what exactly is misalignment, and is it really that devastating in a normal environment? The basic idea is rather simple. A RAID array uses a certain segment size (see Throughput part 2: RAID types and segment sizes), meaning data is striped across all members of a RAID volume (a set of disks strung together to perform as one big unit). Especially when performing random I/O (and most VMware environments do), you want only a single disk to have to perform a track seek in order to get a block of data. So if your segment size on disk is 64KB, and you read a block of 64KB, only one disk has to seek for the data. That is, IF your data is aligned. If somewhere in between the data is not aligned with the segments on disk, you may have to read two segments, because each segment carries part of the block to be read (or written, for that matter). Exactly that is called misalignment.

In most VMware environments, there are two “layers” between your VM data and the segments on disk: the VMFS, and the file system inside your virtual disk. Since ESX 3.x, VMware delivers 64KB alignment of the VMFS, so the start of every 64KB VMFS block is aligned to a 64KB segment on the disks lying underneath. As soon as the blocks vSphere accesses get bigger than 64KB, you could call it sequential access, where alignment no longer helps. For those who might wonder: VMFS block sizes (1MB … 8MB) are not related to the I/O sizes used on disk; VMFS is able to perform I/O on subsets of these blocks.

The second “layer” is more problematic: the guest file system. NTFS under Windows Server 2003 (or earlier), and under desktop releases prior to Windows 7, will misalign by default. I have never understood why, but a default NTFS partition starts at 32256 bytes, or 63 sectors; after that the actual data starts. Getting NTFS aligned is simple: just leave a gap after sector 63 and start the partition at sector 128 (or any power of two above that). This is easily done for new virtual disks, but not so easy for existing ones (especially system disks).

Misalignment shown graphically

A lot of people find misalignment hard to understand. A picture says a thousand words, so in order to keep this blog post somewhat shorter: pictures!

Both NTFS and VMFS aligned

Figure 1:    Aligned VMFS and NTFS (how you should want it)

In figure 1, both VMFS and NTFS have been properly aligned, including some alignment space. In effect, for every block read from or written to the NTFS file system, only one block on the underlying storage is touched. Thumbs up!

Both NTFS and VMFS misaligned

Figure 2:    Both VMFS and NTFS misaligned (you should never want this)

A misalignment of both VMFS and NTFS is depicted in figure 2. This is a really undesirable situation. As you can see, accessing an NTFS block requires one VMFS block to be read, sometimes even two (due to NTFS misalignment). But since VMFS is misaligned relative to the disk segments, every 64KB VMFS block in this example requires access to two segments on disk. This can and will hurt performance. Luckily, VMware spotted this problem relatively early, and from ESX 3.0 and up VMFS is aligned automagically if you format it from the VI client.

NTFS misaligned, VMFS aligned

Figure 3:    Aligned VMFS, but misaligned NTFS (most common situation)

Figure 3 shows the situation I mostly see in the field. VMFS is aligned (because VMFSses formatted in the VI client align automagically to a 64KB boundary). NTFS is misaligned in this example. I see this all the time in Windows 2003 / Windows XP VMs. As you can see in this example, most blocks touch only a single segment on the physical disk. Some NTFS blocks “fall over the edge” of a 64K segment on disk. Any action performed on those NTFS blocks will result in the reading or writing of TWO segments on the underlying disks. This is the performance impact right there.

You can probably see where this is going: if the segment size on your storage is much bigger than the block size of your VM file system, the impact is not too much of a problem. In the example in figure 3, two NTFS blocks out of every 64 are impacted, and only for random access (for sequential access your storage cache hides the problem, since both segments on disk will be read anyway). That is an impact of 1/32th, or 3.1%. You could probably live with that…

Now let’s up the stakes. What if your storage array used a really small segment size on physical disk, let’s say 4KB? Take a look at figure 4:

VMFS alligned, NTFS misaligned, small segment size

Figure 4:    Aligned VMFS, misaligned NTFS and a small segment size

vSphere will generate I/O blocks sized for the highest effectiveness. For example, if you have a database which uses 4KB blocks and performs 100% random I/O, you get a situation like in figure 4. Every time you access a 4KB block, VMFS translates this into a 4KB I/O action to your array. Because the NTFS / database blocks are misaligned, EACH access to a 4KB block ends up on TWO disk segments. This impacts performance dramatically (up to 50% if all I/O sizes are 4KB). A similar situation occurs when your database application uses 8KB blocks; in that case every I/O touches three segments on disk instead of two, impacting performance of the disk set by 33% (if all I/O sizes are 8KB).
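A tiny helper (hypothetical, just to make the arithmetic above concrete) shows how the partition’s starting offset decides how many disk segments a single I/O touches:

```python
# Illustrative helper: count how many disk segments one I/O touches,
# given the partition's starting offset and the array's segment size.
def segments_touched(io_offset: int, io_size: int,
                     part_offset: int, segment_size: int) -> int:
    start = part_offset + io_offset          # absolute byte on disk
    end = start + io_size - 1
    return end // segment_size - start // segment_size + 1

MISALIGNED = 32256   # default NTFS offset: 63 sectors * 512 bytes
ALIGNED = 65536      # 64 KB boundary

# 4 KB random I/O on a 4 KB-segment array: misaligned hits TWO segments.
print(segments_touched(0, 4096, MISALIGNED, 4096))   # 2
print(segments_touched(0, 4096, ALIGNED, 4096))      # 1
# 8 KB I/O: three segments instead of two, the 33% impact mentioned above.
print(segments_touched(0, 8192, MISALIGNED, 4096))   # 3
```

With a 64KB segment size the same misaligned 4KB I/O usually stays within one segment, which is exactly why the small-segment array suffers so much more.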

Why ever use a small segment size?

When you look at an EMC SAN (Clariion), the segment size is fixed at 64KB. When you look at a NetApp, the segment size is fixed at 4KB. It is pretty safe to say that the impact of misalignment hits harder on a NetApp than on an EMC box. That is probably why NetApp hammers so hard on alignment; in a NetApp environment it really does matter, in an EMC environment a little less.

Looking at it the other way round: why would anyone ever use such a small segment size? Why not use a segment size of, for example, 256KB, and feast on having only 1/128th, or 0.78%, impact when not aligning? Using a large segment size appears to be the solution to misalignment, and in a way it is. But do not forget: every time you need to access 4KB of data, 256KB is accessed on disk. So yes AND no: a large segment size makes alignment almost a waste of time, but it introduces other problems.

Somewhere, the “perfect segment size” should exist. Best of both worlds… The problem is that this perfect segment size varies with the type of load you feed to your SAN. EMC is sure about their 64KB (since it cannot be altered); NetApp seems sure about 4KB, for the very same reason. The el-cheapo parallel-SCSI array I use for my home lab (yes, parallel SCSI indeed, and VMotion works, but that is another story) does a more generic job: for each RAID volume, I am allowed to choose my segment size (called a stripe size there). Now THAT gives room for tuning! And room for failure in tuning it, at the same time…

Dedup and misalignment

Now that deduplication is the new hype, misalignment is said to impact dedup effectiveness. The answer, as usual, is… it depends. If you take two misaligned Windows 2003 servers from the same template, you can deduplicate them very effectively, since they are very alike. If you were to align one of them (leaving the second one misaligned), dedup would possibly not find a single block in common. Makes sense, right? The alignment shifted all data within the VMDK, in effect differentiating all blocks. If you then align the second VM as well (using the same alignment boundary), dedup once again works effectively.

So the final answer should be: If dedup is to be effective, either align ALL VMs, or align NONE.

How to get rid of misalignment

Let’s say you’ve found that your VMs are misaligned. If things are really bad they are situated on a RAID volume with a very small segment size. Alignment could save the day. So how do you go about it? Several solutions I’ve come across:

  1. Manually;
  2. GParted utility;
  3. Use Vizioncore’s vOptimizer;
  4. If you’re a NetApp customer and use ESX (not ESXi), use their alignment tool mbrscan/mbralign;
  5. V2V your VMs using Platespin PowerConvert and align them on the way.

Manual alignment is perfect for data drives. The idea is that you add a second data drive and create an aligned partition on it using diskpart:

  • Open a command prompt and run diskpart;
  • list disk, then select disk x;
  • list volume, then select volume x;
  • create partition primary align=64 (or any power of 2 above).

After that, stop whatever service is using your data drive, copy all data, change the drive letters so your new aligned disk matches the old data drive, restart your services, and remove the original data disk from the VM. This works great for SQL, Exchange, file servers etc. The big downside: you cannot align system disks using diskpart (not even from another VM; diskpart’s create partition is destructive).

GParted is a utility that is said to align your partition if you resize it using the tool. I never looked into it, but it is worth checking out.

Vizioncore’s vOptimizer is a very nice tool that performs the alignment for you. Basically it shuts down the VM in question and starts to move every block inside your VMDK(s). You end up with all disks aligned. The VM is then restarted and an NTFS disk check is forced. After that you’re good to go. It has served me well on several occasions! You even get two alignments for free if you decide to give their product a spin.

NetApp customers get an alignment tool for free: mbrscan/mbralign. I never used this tool, but apparently it does about the same job as vOptimizer: it shuts down your VM, aligns the disks, and reboots the VM. It only works on ESX though (it installs software in the Service Console).

If you cannot live with the downtime but need to align anyway, consider looking at Platespin products. They can perform a “hot” V2V and align in the process. When the data moving is complete, they fail over from the original VM to the newly V2Ved VM, syncing the final changes on the destination disk(s). You end up with an aligned copy of your VM with minimal downtime.

How to prevent misalignment in the first place

Misalignment is often seen, but it is not necessary at all if you think about it before you start. A lot of people create templates; not too many align their templates… But you could! If you have a (misaligned or not) VM laying around, you can attach the empty system disk of the template-to-be to it, and format the partition aligned from that “helper” VM (see the diskpart description above). Then detach the system disk from the helper VM again, and proceed to install Windows on the (now aligned) disk. Choose not to change anything about the partitioning and you are good to go. Bootable XP CDs can do the same trick here.

Now your template is aligned. The upshot: Any VM deployed from this template is too!

There is an easy way to check under Windows whether your disks are aligned. Simply run msinfo32.exe, expand Components, Storage, Disks, and find the item “Partition Starting Offset”. If it reads 32,256, you’re out of luck: your partition is misaligned. If it reads 65,536, you have a 64K-aligned partition. If the value reads 1,048,576, the partition is aligned on a 1MB boundary (the Windows 2008 / Windows 7 default).
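The check itself is trivial; in Python it boils down to a single modulo (offset values taken from the msinfo32 examples above):

```python
# Is the partition's starting offset a multiple of the alignment boundary?
# Same check msinfo32 lets you do by hand via "Partition Starting Offset".
def is_aligned(starting_offset: int, boundary: int = 65536) -> bool:
    return starting_offset % boundary == 0

print(is_aligned(32256))     # False - the default NTFS 63-sector offset
print(is_aligned(65536))     # True  - 64K aligned
print(is_aligned(1048576))   # True  - 1MB (Windows 2008 / Windows 7 default)
```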


Is alignment important? Well, it depends. It particularly depends on the segment size used within your storage array: the smaller the segment size, the bigger the impact. Bottom line though: alignment always helps! Get off to a good start and align right from the beginning, and you’ll profit ever after. If you didn’t get a perfect start, consider aligning your VMs afterwards. Start with the heavy random-I/O data disks for sure, but I would recommend having the system disks aligned as well, using one of the tools described.

Performance impact when using VMware snapshots

It is certainly not unheard of: “When I delete a snapshot from a VM, the thing totally freezes!“ The strange thing is, some customers have these issues, others don’t (or are not aware of them). So what really DOES happen when you clean out a snapshot? Time to investigate!

Test Setup

So how do we test performance impact on storage while ruling out external factors? The setup I chose uses a VM with the following specs:

Read the rest of this entry »

Breaking VMware Views sound barrier with Sun Open Storage (part 1)

A hugely underestimated requirement in larger VDI environments is disk IOPS. A lot of the larger VDI implementations have failed using SATA spindles; with 15K SAS or FC disks you get away with it most of the time (as long as you do not scale up too much). I have been looking at ways to get more done using less (especially in current times, who isn’t!). Dataman, the Dutch company I work for, teamed up with Sun Netherlands and their testing facility in Linlithgow, Scotland. I got the honours of performing the tests, and I almost literally broke the sound barrier using Sun’s newest line of Unified Storage: the 7000 series. Why can you break the sound barrier with this type of storage? Watch the story unfold! For now, part one… the intro.

What VMware View offers… And needs

Before a performance test even came to mind, I started to figure out what VMware View offers, and what it needs. It is obvious: View gives you linked-cloning technology. This means that only a few full clones (called replicas) are read by a lot of virtual desktops (or vDesktops, as I will call them from now on) in parallel. So what would really help push the limits of your storage? Exactly: a very large cache, or solid-state disks. Read the rest of this entry »

esxtop advanced features

No rocket science here. esxtop has always been there, yet a lot of people miss out on some of its great features. Hopefully this blog post will get you interested in looking at esxtop (again?) in detail!

Yesterday I attended a very interesting breakout session about esxtop and its advanced features in vSphere. Old news, you might say, but there is SO much you can do with esxtop. For example, you can export data from esxtop and import it into Windows perfmon. And if you knew that already, did you know you can now actually see which physical NIC is being used by a certain VM?

Other neat little features were shown, the best one being that the “swcur” field is actually NOT about the current swapping activity of a VM, but about swapping that occurred in the past (yes, I too would have named it differently…). How many of you knew that one? Finally, there is a very interesting field in the storage screen (yes, for those who did not know, esxtop is not just about CPU, but also memory, storage, and, new in vSphere, interrupts). This field is called “DAVG”, and it shows the actual latency seen by ESX to your storage (there is also KAVG for kernel latency and GAVG for the total latency the guest sees).

There were also a few examples of misbehaving VMs, which were very interesting to see. Numbers which seemed impossible, yet were explained perfectly. I would vote this very last presentation at VMworld 2009 the best technical presentation I witnessed there!

I hope I got you (re)interested in esxtop. I am more of a graphical guy, so I like the performance monitor embedded within the VI client. But some things just aren’t there, so esxtop is definitely worth a(nother) look. If you’re using ESXi, make sure to download the vMA appliance (here), which includes resxtop (which looks a lot like esxtop on ESX).

The Dedup Dilemma

Everybody does it – and if you don’t, you can’t play along. What am I talking about? Data deduplication. It’s the best thing since sliced bread I hear people say. Sure it saves you a lot of disk space. But is it really all that brilliant in all scenarios?

The theory behind Data Deduplication

The idea is truly brilliant: you store blocks of data in a storage solution, and you create a hash which uniquely identifies the data inside each block. Every time you need to back up a block, you check (using the hash) whether you already have the block in storage. If you do, you just write a pointer to the data. Only if you do not have the block yet do you copy it and include it in the dedup database. The advantage is clear: the more identical data you store, the more disk space you save. Especially in VMware environments, where many equal VMs are deployed from templates, this is a very big saving in disk space.
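A toy version of such a dedup store (illustrative only, using SHA-256 as the hash) makes the mechanism concrete: identical blocks are stored once, and every subsequent write of the same block only adds a pointer.

```python
import hashlib

# Toy block store illustrating the dedup idea described above:
# identical blocks are stored once and referenced by their hash.
class DedupStore:
    def __init__(self):
        self.blocks = {}   # hash -> block data (stored once)
        self.refs = []     # the "backup": a list of hashes (pointers)

    def write(self, block: bytes) -> None:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:   # only store blocks we haven't seen
            self.blocks[digest] = block
        self.refs.append(digest)        # always record the pointer

store = DedupStore()
for _ in range(100):                    # 100 identical "template" blocks
    store.write(b"windows-template-block" * 100)
print(len(store.refs), len(store.blocks))   # 100 references, 1 stored block
```

One hundred identical template blocks cost the space of one, which is exactly the template-VM saving described above.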

The actual dilemma

A certainly nice thing about deduplication, next to the large amount of storage (and associated costs) you save, is that when you deduplicate at the source, you end up sending only new blocks across the line, which can dramatically reduce the bandwidth you need between remote offices and central backup locations. Deduplication at the source also means you generally spread CPU load better, across your remote servers instead of locally in the storage solution.

Since there is a downside to every upside, data deduplication certainly has its downsides too. For example, if I had 100 VMs, all from the same template, there surely are blocks that occur in each and every one of them. If that particular block gets corrupted… Indeed! You lose ALL your data. Continuing to scare you: if the hash algorithm you use is insufficient, two different data blocks might be identified as equal, resulting in corrupted data. Make no mistake: the only way to be 100% sure the blocks are equal is to use a hash number as big as the block itself (rendering the solution kind of useless). All dedup vendors use shorter hashes (I wonder why 😉 ), and live with the risk (which is VERY small in practice, but never zero). The third major drawback is the speed at which the storage device is able to deliver your data (un-deduplicated) back to you (which especially hurts on backup targets that have to perform massive restore operations). Final drawback: you need your ENTIRE database in order to perform any restore (at least, you cannot be sure which blocks are going to be required to restore a particular set of data).
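How small is “VERY small”? A rough birthday-bound estimate (my own back-of-the-envelope formula, p ≈ n²/2^(b+1) for n stored blocks and a b-bit hash, not a vendor figure) gives an idea:

```python
# Rough birthday-bound estimate of the chance that two *different* blocks
# collide on the same hash, for n stored blocks and a b-bit digest.
def collision_probability(n_blocks: int, hash_bits: int) -> float:
    return n_blocks ** 2 / 2 ** (hash_bits + 1)

# A petabyte of unique 4 KB blocks against a 256-bit hash (e.g. SHA-256):
n = (10 ** 15) // 4096
print(collision_probability(n, 256) < 1e-50)   # True - small, but never zero
```

So with a modern 256-bit hash the risk really is astronomically small; the point stands, though, that it is never mathematically zero unless the hash is as big as the block.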

So – should I use it?

The reasons stated above have always kept me a skeptic when it comes to data deduplication, especially for backup purposes. Because at the end of the day, you want your backups to be functional, and not to require the ENTIRE dataset in order to perform a restore. Speed can also be a factor, especially when you rely on restores from the dedup solution in a disaster recovery scenario.

Still, there are definitely uses for deduplication. Most vendors have solved most issues with success, for example by making un-deduplicated data directly accessible from the storage solution (enabling separate backups to tape, etc). I have been looking at the new version of esXpress with their PHDD dedup targets, and I must say it is a very elegant solution (on which I will write a blog post shortly 🙂 ).

Surviving total SAN failure

Almost every enterprise ESX setup features multiple ESX nodes, multiple failover paths, multiple IP and/or fiber switches… But having multiple SANs is hardly ever done, except in disaster recovery environments. But what if your SAN decides to fail altogether? And more importantly, how can you prevent impact if it happens to your production environment?



Using a DR setup to cope with SAN failure

One option to counter total SAN failure would of course be to use your DR site’s SAN and perform a failover (either manually or via SRM). This is kind of a hard call to make: using SRM will probably not get your environment up within the hour, and if you have a proper underlying contract with the SAN vendor, you might be able to fix the issue on the primary SAN within the hour. No matter how you look at it, you will always have downtime in this scenario. But in these modern times of HA and even Fault Tolerance (a vSphere 4 feature), why live with downtime at all?


Using vendor-specific solutions

A lot of vendors have thought about this problem, and especially in the IP-storage corner one sees an increase in “highly available” solutions. Most of the time relatively simple building blocks are simply stacked, which can then survive a SAN (component) failure. This is one way to cope, but it generally comes with a lot of restrictions, such as vendor lock-in and an impact on performance.

Why not do it the simple way?

I have found that simple solutions are generally the best solutions, so I tried to approach this problem from a very simple angle: from within the VM. The idea is simple: you use two storage boxes which your ESX cluster can use, you put a VM’s disk on a LUN on the first storage box, and you simply add a software mirror on a LUN on the second box. It is almost too easy to be true. I used a Windows 2003 server VM, converted the boot drive to a dynamic disk, added the second disk to the VM, and chose “add mirror” from the boot disk, which I placed on the second disk.

Unfortunately, it did not work right away. As soon as one of the storage boxes fails, VMware ESX reports “SCSI BUSY” to the VM, which causes the VM to freeze forever. After adding the following to the *.vmx file of the VM, things got a lot better:

scsi0.returnBusyOnNoConnectStatus = "FALSE"

Now, as soon as one of the LUNs fails, the VM has a slight “hiccup” before it decides the mirror is broken, and then it continues to run without issues or even lost sessions! After the problem with the SAN is fixed, you simply perform an “add mirror” within the VM again, and after syncing you are ready for the next SAN failure. Of course, if you have 100+ VMs to protect this way, there is a lot of work involved…

This has proven to be a simple yet very effective way to protect your VMs from a total (or partial) SAN failure. A lot of people do not like the idea of using software RAID within the VMs, but hey, in the early days, who gave ESX a thought for production workloads? And just to keep the rumors going: to my understanding, vSphere is going to do exactly this from an ESX point of view in the near future…

To my knowledge, at this time there are no alternatives besides the two described above for surviving a SAN failure with “no” downtime (unless you go down the software-clustering path, of course).

Resistance is ViewTile!

Nowadays, more and more companies realize that virtual desktops are the way to go. It seems inevitable. Resistance is futile. But how do you scale up to, for example, 1000 users per building block? How much storage do you need? How many spindles? Especially with the availability of VMware View 3, the answers to these questions become more and more complex.


Spindle counts

Many people still base their storage design on the amount (in GBytes) of storage needed. For smaller environments, you can actually get away with this; it seems to “fix itself” given current spindle sizes (just don’t go and fill up 1TB SATA spindles with VMs). The larger spindle sizes of today and the near future, however, make it harder and harder to maintain proper performance if you are ignorant about spindle counts. Do not forget: those 50 physical servers you had before actually had at least 100 spindles to run from. After virtualization, you cannot expect them all to “fit” on a (4+1) RAID5. The resulting storage might be large enough, but will it be fast enough?

Then VMware introduced the VMmark Tile. This was a great move; a Tile is a simulated, common load for server VMs. The result: the more VMmark Tiles you can run on a box, the faster the box is from a VMware ESX point of view.

In the world of View, there really is no difference. A thousand physical desktops have a thousand CPUs, and a thousand (mostly SATA) spindles. Just as in the server virtualization world, one cannot expect to run a thousand users off of ten 1TB SATA drives. Although the resulting number of terabytes might be sufficient, the number of spindles in this example would obviously not be. A hundred users would all have to share a single SATA spindle!

So basically we need more spindles, and we might even have to keep expensive gigabytes or even terabytes unused. The choice of spindle type is going to be key here: using 1TB SATA drives, you’d probably end up using 10TB, leaving about 40TB empty. Unless you have a master plan for putting your disk-based backups there (if no vDesktops are used at night), you might consider going for faster, smaller spindles. Put finance in the mix and you have some hard design choices to make.


Linked cloning

Just when you thought the equation was relatively simple, like “a desktop has a 10GB virtual drive, period”, linked cloning came about. Now you have master images, replicas of these masters, and linked clones from the replicas. Figuring out how much storage you need, and how many spindles, just got even harder!

Let’s assume we have one master image which is 10GB in size. Per roughly 64 clones, you are going to need a replica, and you can add up to about four replicas per master image. None of this is exact science, just recommendations found here and there. But how big are these linked clones going to be? That depends heavily on things like:

  • Do you design separate D: drives for the linked clones where they can put their local data and page files;
  • What operating system are you running for the vDesktops;
  • Do you allow vDesktops to “live” beyond one working day (e.g. do you revert to the master image every working day or not).

Luckily, the number of disk IOPS per VM is not affected by the underlying technology. Or is it? SAN caching is about to add yet another layer of complexity to the world of View…

Cache is King

Let’s add another layer of complexity: SAN caching. Scaling the example above up to 1000 users, you would end up with 1000/64 = 16 LUNs, each holding its own replica together with its linked clones. In a worst-case scenario where all VMs boot up in parallel, you would see an enormous number of disk reads on the replicas (since booting requires mostly read actions). Although all replicas are identical, the SAN has no knowledge of this. The result is that, in a perfect world, the blocks used for booting the VMs of all 16 replicas should be in the read cache. Let’s say our XP image uses 2GB of blocks for booting: you would optimally require a read cache in the SAN of 16*2 = 32GB. The less cache you have, the more performance will degrade. Avoiding these worst-case scenarios is of course another way to manage with less cache. Still, I guess in a View 3 environment: “Cache is King“!
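The sizing arithmetic above is easy to wrap in a small sketch (the 64-clones-per-replica figure and the 2GB boot working set are the rule-of-thumb numbers from the text, not hard limits):

```python
import math

# Back-of-the-envelope View sizing, using the rule-of-thumb numbers from
# the text: ~64 linked clones per replica, ~2 GB of boot blocks per replica.
def view_sizing(users: int, clones_per_replica: int = 64,
                boot_set_gb: int = 2) -> tuple:
    replicas = math.ceil(users / clones_per_replica)      # one LUN per replica
    worst_case_cache_gb = replicas * boot_set_gb          # parallel boot storm
    return replicas, worst_case_cache_gb

print(view_sizing(1000))   # (16, 32): 16 LUNs/replicas, 32 GB of read cache
```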

While I’m at it, I might as well express my utmost interest in the developments at Sun, their Amber Road product line to be exact. Inside these storage boxes, Sun uses the ZFS file system. One thing that could really make a huge difference here is the ability of ZFS to move content between tiers (faster versus slower storage) depending on how heavily that content is used. Add high-performance SSDs into the mix, and you just might have an absolute winner, even if the slowest tier is "only" SATA. I cannot wait for performance results of VDI-like usage on these boxes! My expectations are high, provided you can get a decent load balance on the networking side of things (even a static load balance per LUN would work in VDI-like environments).


Resistance is ViewTile!

As I laid out in this blog post, there are many layers of complexity involved in designing a VDI environment (especially the storage side of things). It is becoming almost too complex to rely on "theory only" for these design challenges. It would really help to have a View-Tile (just like the server-side VMmark tiles we have now). The server tiles are mostly used to prove the effectiveness of a physical server running ESX: the CPU, the bus structure, etc. A View-Tile would prove not only server effectiveness, but very much also the storage solution used (and the FC/IP storage network design in between). So VMware: a View-Tile is definitely on my wish list for Christmas (or should I consider getting a life after all? 😉 )

The VCDX "not quite Design" exam

Last week I was in London to take the VCDX Design beta exam. This long-awaited exam consists of a load of questions to be completed in four hours. In this blog post I will give my opinion on the exam. Because its contents may not be shared with others, I will not give any hints or tips on how to maximize your score, but I will address the kinds of questions asked and my expectations.

First off, there was way too little time to complete the exam. I tried to type comments on questions with obvious errors, or questions where I suspected something wasn’t quite right. All questions require some reading, so I could time nicely whether I needed to speed up or slow down. Unfortunately, somewhere near question 100, I stumbled upon a question with pages’ worth of reading! So I had to hurry through that one, and as a result, through the rest as well. Shame. VMware also misses out on vital feedback this way, because people simply don’t have the time to comment.

As I have noticed with other VMware exams, the scenarios are never anywhere near realistic (at least not by European standards). Also, asking questions about, for example, the bandwidth of a T1 line is not very bright, given that the exam is held worldwide. In Europe, we have no clue what a T1 line is.

But the REAL problem with this design exam is that, in my opinion, it is NO design exam at all. Sorry VMware, I was very disappointed. If this were a real design exam, I would actually encourage people to bring all the PDFs and books they could find; it would (should!) not help them. Questions like “how do you change a defective HBA inside an ESX node without downtime”? Sorry, nice for the Enterprise exam, but it has absolutely nothing to do with designing. Offload that kind of stuff to the Enterprise exam, please! And if you have to ask about things like this, then ask how to go about rezoning the fiber switches. That would at least prove some understanding of how to design an FC network. But that question was lacking.

There are numerous other examples of this, all about knowing that tiny little detail to get you a passing score. That is not designing! I had been hoping for questions like how many spindles to design in under VMware, and why. When to use SATA and when not to. Customers having blades with only two uplinks. Things that actually happen in reality, things that bring out the designer in you (or should bring out the designer in you)! Designing is not knowing which action will force you to shut down a VM and which will not. <Sigh>.

I know, then the answers could not be A, B, C or D. This exam should have open questions. More work for VMware, but hey, that is life. Have people pick up their pen and write it down! Give them room for creativity, for avoiding the pitfalls sneaked into the scenario. These are the qualities of a designer anyway, and that is the way to test for them.

I’ll just keep hoping the final part of the VCDX certification (defending a design before a panel) will finally bring that out. If it doesn’t, we’ll end up with just another “VCP++” exam, which anyone can pass after studying for a day or two. I hope VCDX will not become “that kind of a certification”!

I hope VMware will take comments like these in a positive manner, and create an exam that can actually be called a DESIGN exam. VMware: please, PLEASE put all the little knowledge tidbits into the Enterprise exam, and create a design exam that actually forces people to DESIGN! Until that time, I’ll keep hoping the final stage of VCDX restores my faith that this certification will really make a difference.

VMware HA, slotsizes and constraints – by Erik Zandboer

There is always a lot of talk about VMware HA, how it works with slotsizes, and how it determines host failover capacity using those slotsizes. But nobody seems to know exactly how this is done inside VMware. By running a few tests (on VMware ESX 3.5U3 build 143128), I hoped to unravel how it works.


Why do we need slotsizes?

In order to estimate how many VMs can run on how many hosts, VMware HA does a series of simple calculations to “guesstimate” the possible workload for a given number of ESX hosts in a cluster. Using these slotsizes, HA determines whether more VMs can be started on a host in case of failure of one (or more) other hosts in the cluster (only applicable if HA runs in “Prevent VMs from being powered on if they violate constraints” mode), and how big the failover capacity is. Failover capacity is basically how many ESX hosts in a cluster can fail while maintaining performance on all running and restarted VMs.


Slotsizes and how they are calculated

The slotsize is basically the size of a default VM, in terms of memory and CPU used. You could think of a thousand smart algorithms to determine a worst-case, best-case or somewhere-in-between slotsize for any given environment. VMware HA, however, does some pretty basic calculations. As far as I have figured it out, here is how it goes:

Looking through all running (!!) VMs, HA finds the VM with the highest CPU reservation, and the VM with the highest memory reservation (actually reservation plus the memory overhead of the VM). These worst-case numbers are used as the HA slotsize.

A special case in this calculation is when reservations are set to zero for some or all VMs. In that case, HA uses its default settings for such a VM: 256MB of memory and 256MHz of processing power. You can actually change these defaults by specifying these variables in the HA advanced settings:

          das.vmCpuMinMHz
          das.vmMemoryMinMB

In my testing environment, I had no reservations set, had not specified these variables, and had no host failover capacity (failover capacity = 0). As soon as I introduced both variables, set to 128, my failover level instantly increased to 1. When I started adding reservations, I was done quite quickly: adding a reservation of 200MB to a single running VM changed my failover level back to 0. So yes, my environment proves to be a little “well filled” 😉
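The slotsize rule described above can be sketched as follows. This is my interpretation of the observed behavior, not VMware's actual code; the dictionary field names are illustrative, and the defaults stand in for das.vmCpuMinMHz / das.vmMemoryMinMB:

```python
def slot_size(vms, default_cpu_mhz=256, default_mem_mb=256):
    """Worst-case slotsize over all running VMs: highest CPU reservation,
    highest memory reservation + overhead, with HA defaults for zero values."""
    cpu = max(vm["cpu_res_mhz"] or default_cpu_mhz for vm in vms)
    mem = max((vm["mem_res_mb"] or default_mem_mb) + vm["overhead_mb"]
              for vm in vms)
    return cpu, mem

vms = [{"cpu_res_mhz": 0, "mem_res_mb": 0, "overhead_mb": 96},
       {"cpu_res_mhz": 0, "mem_res_mb": 200, "overhead_mb": 96}]
# The 256MB default + 96MB overhead (352) still beats 200 + 96 (296)
print(slot_size(vms))  # -> (256, 352)
```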


Failover capacity

Now that we have determined the slotsize, the next question is: how does VMware calculate the current failover capacity? This calculation is also pretty basic (in contrast to the thousand interesting calculations you could think of). Basically, VMware HA takes the ESX host with the fewest resources, calculates the number of slots that fit into that particular host, and uses that as the number of slots per ESX host (which is also projected onto any larger hosts in the cluster!). What?? Exactly: using ESX hosts in an HA-enabled cluster that differ in memory and/or processing power impacts the failover level!

In effect, these calculations are done for both memory and CPU. Again, the worst-case value is used as the number of slots you can put on any ESX host in the cluster.

Once both values are known (the slotsize and the number of slots per ESX host), it is a simple task to calculate the failover level: take the sum of all resources and divide it by the slotsize resources; this gives you the number of slots available to the environment. Subtract the number of running VMs from the available slots, and presto, you have the number of slots left. Now divide this number by the number of slots per host, and you end up with the current failover level. Simple as that!



Let's say we have a testing environment (just like mine 😉 ) with two ESX nodes in an HA-enabled cluster, each configured with 512MB of Service Console memory. Each has a 2GHz dualcore CPU and 4GB of memory. On this cluster we have 12 VMs running, with no reservations set anywhere. All VMs run Windows 2003 Standard 32-bit, which gives a worst-case memory overhead of (in this case) 96MB.

At first, we have no reservations set and no variables set, so the slotsize is calculated as 256MHz / 256MB. As both hosts are equal, I can use either host for the slots-per-host calculation:

CPU –> 2000MHz x 2 (dualcore) = 4000 MHz / 256 MHz = 15.6 = 15 slots per host
MEM –> (4000-512) MB / (256+96) MB = 9.9 = 9 slots per host

So in this scenario 9 slots are available per host, which in my case means 9 slots x 2 hosts = 18 slots for the total environment. I am running 12 VMs, so 18 - 12 = 6 slots left. 6/9 = 0.6 hosts left for handling failovers. Shame; as you can see, I am 0.4 hosts short of having any failover capacity.

Now, in order to change the stakes, I put in the two variables, specifying CPU at 300MHz and memory at 155MB (of course I just “happened” to pick exactly these numbers so that both CPU and memory would “just pass” the HA test):

          das.vmCpuMinMHz = 300
          das.vmMemoryMinMB = 155

Since I have no reservations set on any VMs, these are also the highest values, and thus become the slotsize. Now we get another situation:

CPU –> 2000MHz x 2 (dualcore) = 4000 MHz / 300 MHz = 13.3 = 13 slots per host
MEM –> (4000-512) MB / (155+96) MB = 13.9 = 13 slots per host

So now 13 slots are available per host. You can imagine where this is going with 12 VMs running… In my case, 13 slots x 2 hosts = 26 slots for the total environment. I am running 12 VMs, so 26 - 12 = 14 slots left. 14/13 = 1.07 hosts left for handling failovers. Yes! I just upgraded my environment to a current failover level of 1!
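Both scenarios above follow the same recipe, so they can be captured in one function. A minimal sketch, assuming my lab's figures as defaults (2GHz dualcore hosts, 4GB RAM, 512MB Service Console, 96MB per-VM overhead) and flooring at each step as observed:

```python
def failover_level(hosts, running_vms, slot_cpu_mhz, slot_mem_mb,
                   host_cpu_mhz=4000, host_mem_mb=4000,
                   sc_mem_mb=512, overhead_mb=96):
    """Current failover level: worst case of CPU vs memory slots per host,
    times the host count, minus running VMs, divided by slots per host."""
    cpu_slots = host_cpu_mhz // slot_cpu_mhz
    mem_slots = (host_mem_mb - sc_mem_mb) // (slot_mem_mb + overhead_mb)
    slots_per_host = min(cpu_slots, mem_slots)
    slots_left = slots_per_host * hosts - running_vms
    return slots_left // slots_per_host

print(failover_level(2, 12, 256, 256))  # scenario 1 (defaults): 0
print(failover_level(2, 12, 300, 155))  # scenario 2 (variables set): 1
```

As expected, nudging the slotsize variables flips the result from 0 to 1, matching the numbers worked out by hand above.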

Finally, let's look at a situation where I upgrade one of the hosts to 8GB. Yep, you guessed right: the smaller host still forces its values into the calculations, so basically nothing changes. This is where the calculations go wrong, or even seriously wrong. Assume you have a cluster of 10 big, strong ESX nodes, and you add a single ESX host with only a single dualcore CPU and 4GB of memory. Indeed, that one host would impose a very small number of slots per host on the entire cluster. So there you have it: yet another reason to always keep all ESX hosts in a cluster equal in sizing!

Looking at these calculations, I actually expected the tipping point to be at 12 slots per host (because I have 12 VMs active), not 13. I might have left out some values of smaller influence somewhere, like used system memory on the host (host configuration… memory view). Also, the Service Console might count as a VM?!? Or maybe VMware just likes to keep one slot free before declaring that yet another host may fail… This is how far I got; maybe I'll be able to add more detail as I test more. The calculations shown here may therefore not be dead-on, but they are precise enough for any “real life” estimates.


So what is “Configured failover”?

This setting is related to what we have just calculated. You must specify this number, but what does it do?

As seen above, VMware HA calculates the current host failover capacity, which is a calculation based on resources in the ESX hosts, the number of running VMs and the resource settings on those VMs.

Now, the configured failover capacity determines how many host failures you are willing to tolerate. You can imagine that if you have a cluster of 10 ESX hosts and a failover capacity of one, the environment will basically come to a complete halt if five ESX nodes fail at the same time (assuming all downed VMs have to be restarted on the remaining five). To make sure this does not happen, you configure the maximum number of ESX hosts that may fail. If more hosts fail than the specified number, HA will not kick in. So it is basically a sort of self-preservation for the ESX cluster.
