hostd-hara-kiri – by Erik Zandboer

Today I got a question from a customer – his hosts appeared to reboot every few hours, or at least show up grey in vCenter. I found the issue – a clear case of hostd-hara-kiri…!

When I heard of this issue, the first and only thing that came to mind was hostd running out of memory. A quick look at /var/log/vmware/hostd.log showed the issue: "Memory checker: Current value 174936 exceeds soft limit 122880". I advised him to raise the service console memory, although I am not sure this resolves the issue, since the limits for hostd memory are not changed when you alter the SC memory… So as a "backup" I told him to make the changes stated below, in order to at least make sure the problem would not come back.

Anyway, I decided to check out my testing environment. My hostd.log, too, was filling up with these messages. The soft limit, which VMware sets at 122880, is broken almost constantly. The hard limit is set at 204800. Hard limit??? So what happens when the hard limit is reached? – Exactly, hostd-hara-kiri.

One of the ESX servers I looked at showed a value of 204660 – geez, it must be my "lucky" day! I exported the hostd.log, imported it into Excel, and managed to get this graph out of it:
 

Here you see the hostd memory usage climbing to its summit: hostd-hara-kiri.


(Not so) reassured by the outcome of the graph and its linear behaviour, I started to tail the hostd.log. Man, this is more exciting than watching a horror movie 😉 ! After a short while, the inevitable happened: "Current value 204828 exceeds hard limit 204800. Shutting down process." KA-BOOOM! Hostd was gone, the host went grey for about 30 seconds in vCenter, then came back up as if nothing had happened. And they say there is no such thing as reincarnation! I think a lot of people must have witnessed this, thought it to be "odd", and went on with their lives.
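
By the way, you do not need Excel (or a horror movie marathon) for this. A small Python sketch like the one below can pull the memory-checker values straight out of hostd.log and extrapolate when the hard limit will be hit. Treat it as a rough sketch: it assumes the log lines look exactly like the ones quoted above, that a warning appears every 30 seconds, and that the growth is roughly linear.

#!/usr/bin/env python
# Rough sketch: extract hostd memory-checker values and extrapolate
# linearly to the hard limit. Assumes one warning line per 30 seconds.
import re, sys
HARD_LIMIT = 204800            # hostd hard limit, same unit as the log values
SECONDS_PER_SAMPLE = 30        # one "Memory checker" warning every 30 seconds
pattern = re.compile(r"Memory checker: Current value (\d+) exceeds")
values = []
for line in open(sys.argv[1] if len(sys.argv) > 1 else "hostd.log"):
    m = pattern.search(line)
    if m:
        values.append(int(m.group(1)))
if len(values) >= 2 and values[-1] > values[0]:
    growth = (values[-1] - values[0]) / float(len(values) - 1)   # per sample
    seconds_left = (HARD_LIMIT - values[-1]) / growth * SECONDS_PER_SAMPLE
    print("current %d, growth %.1f/sample, hara-kiri in ~%.1f days"
          % (values[-1], growth, seconds_left / 86400.0))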

In fact, after looking at one of my ESX test hosts through all hostd logging I could lay my hands on (they rotate quite fast because of these once-every-30-second events), I put together this graph. Lucky me, I managed to capture a controlled reboot and a hostd-hara-kiri event:

Hostd memory usage climbing, going down because of a host reboot, then climbing again followed by a plummet = hostd-hara-kiri

As shown in the graph, a full circle from controlled reboot to hara-kiri appears to take somewhere around 6000 samples for this particular host. A warning appears every 30 seconds, and I have removed every sample except each 10th one. So this puts the hara-kiri frequency at about 6000*10*30 = 1,800,000 seconds, or 20.8 days. Not being very happy with these results, I decided to try and avoid this repeating "reincarnation event". And I soon found a workaround (not sure if this is the solution), by editing /etc/vmware/hostd/config.xml. I added these lines right below <config>:

<hostdWarnMemInMB>200</hostdWarnMemInMB>
<hostdStopMemInMB>250</hostdStopMemInMB>

This basically sets the limits to higher values. The warnings will now appear where it used to be hostd-hara-kiri time, and the true hara-kiri threshold is raised from 200MB to 250MB. This at least delays the problem of hostd reincarnation, but I am unsure about the true cause at this time. It appears to have something to do with what is installed inside the service console of ESX: servers with for example HP agents installed appear to use more hostd memory than "clean" service consoles, and there the reincarnation events can occur within hours instead of days. That, and the linear climb of used memory, points to one thing… a memory leak. I expect VMware has a bug to fix. It might be a nasty one too, as I believe it has been inside ESX for a long time (maybe even 3.5U2 or before).
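
If you want to quickly check whether a host already carries these custom limits (for example after rolling the change out to a number of hosts), a little Python along these lines will do. It is only a sketch: it assumes the two elements sit directly below <config> as shown above, and editing the file itself is still best done by hand:

#!/usr/bin/env python
# Sketch: report whether custom hostd memory limits are present in config.xml.
from xml.etree import ElementTree
CONFIG = "/etc/vmware/hostd/config.xml"
root = ElementTree.parse(CONFIG).getroot()        # the <config> element
for name in ("hostdWarnMemInMB", "hostdStopMemInMB"):
    node = root.find(name)
    if node is None:
        print("%s: not set, hostd built-in default applies" % name)
    else:
        print("%s = %s MB" % (name, node.text.strip()))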

So: If you have intermittent reboots, or at least intermittent disconnects from vCenter, check the hostd log for these limit warnings.

Ye Olde Snapshot – by Erik Zandboer

A lot of people have had more or less unpleasant experiences with forgotten snapshots. You log in in the morning, and a VM is down. "Strange", you think. After some investigation, you find out the VMFS volume on which the VM was running is full. Completely full. And to your horror you find out why – a forgotten snapshot is in place, which has by now grown until it filled up the VMFS volume.

 

What exactly does a snapshot do?

The first thing to understand is how a snapshot actually works. When you add a snapshot, the original virtual disk is no longer written to. Each block that would be written to that file is redirected to a snapshot file. So basically this snapshot file holds all changes made to the virtual disk after the snapshot was taken. The more changes you make to blocks not changed before, the larger the snapshot file grows (in steps of 16MB). Each changed block is stored inside the snapshot file only once. This means that a snapshot file can reach a sometimes staggering size, equal or almost equal to the size of the original virtual disk (defragmentation inside a VM is my personal favorite 😉 ).
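
To make that copy-on-write behaviour a bit more tangible, here is a toy Python sketch. The 64KB block size and the workload are made-up numbers for illustration only; the point is that a block is charged to the snapshot just once, no matter how often it is rewritten, and that the file grows in 16MB steps:

# Toy model of snapshot growth: first write to a block adds it, rewrites are free.
BLOCK = 64 * 1024              # assumed block granularity (illustration only)
GROW_STEP = 16 * 1024 ** 2     # snapshot files grow in 16MB increments
changed = set()
def write(offset):
    changed.add(offset // BLOCK)
def snapshot_size_mb():
    used = len(changed) * BLOCK
    rounded = ((used + GROW_STEP - 1) // GROW_STEP) * GROW_STEP
    return rounded // (1024 ** 2)
for _ in range(2):                      # rewrite the same 1GB region twice
    for off in range(0, 1024 ** 3, BLOCK):
        write(off)
print("snapshot size: %d MB" % snapshot_size_mb())   # -> 1024 MB, not 2048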

 

Monstrous snapshot – now what?

If you "forget" about a snapshot, chances are you will never notice it, right until it is too late. Especially if you snapshotted a very large virtual disk and have plenty of room left on the VMFS, snapshots can grow to immense sizes. Cleaning them up can be very time consuming indeed.

If you have found a very old snapshot file which has grown very large (e.g. 10-40GB), you can actually delete the snapshot without problems, thereby committing all changes recorded in the snapshot file back to the original disk. You end up with the virtual disk exactly as it appeared while the snapshot was in place, only without the snapshot there. But beware – if you delete the snapshot from vCenter (got to get used to that name instead of VirtualCenter), you might very well get a timeout. This has given some people really sweaty fingers. Don't panic: log in to the ESX node itself, and you'll probably see that the snapshot is still being removed. It might take an hour, it might take four hours, but in time the snapshot should remove itself.

 

VMFS full – How to get the VM running again

If a forgotten snapshot fills up the entire VMFS, chances are that your snapshotted VM stops. This is because the VM is trying to write to its disk, and the snapshot needs to grow but cannot. There are two ways to resolve this: 1) make room on the VMFS, or 2) delete the snapshot while the VM remains off. In a production environment, option 2) might not work for you: deletion of large snapshots can take hours. So we're back to making room on the VMFS. Maybe you can move another VM off the VMFS. Maybe you have some ISOs lying around on the VMFS that you can delete. Then you can start your troubled VM again, and remove the snapshot while the VM is running. A last resort might even be to give the VM less memory, or to put its swapfile in another location (possible in ESX 3.5u3). Then start deleting the snapshot right away, before it manages to fill up the VMFS again.

I have even heard of people who put a 2GB dummy file on each VMFS volume, so that when it comes to these issues they just delete the file – and gain 2 Gbytes of space. If forgetting snapshots is your habit, you might consider this as a “best practice” for your environment… 
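
Creating such a dummy file is trivial; a sketch in Python is shown below (the datastore name is of course an example, and the file is filled with real zeros so the space is genuinely taken on the VMFS):

# Sketch: drop a 2GB "emergency" dummy file on a datastore.
CHUNK = 1024 * 1024                                       # write in 1MB chunks
path = "/vmfs/volumes/MyDatastore/emergency-2GB.dummy"    # example path
f = open(path, "wb")
for _ in range(2048):                                     # 2048 x 1MB = 2GB
    f.write(b"\0" * CHUNK)
f.close()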

 

50GB+ snapshot – Delete or…?

What if you have a really big snapshot (and I mean 50+ GB), or even multiple huge snapshots in place? Or snapshots that appear to be garbled in their linkage (horrors like "cannot delete snapshot because the base disk was modified after the snapshot was taken")? You might not want to risk deleting these snapshot(s). There is another way to recover safely, especially if you run Windows 2003 or later, which should be advertised much more: VMware Converter! It really is a magical tool. Not only for P2V, but also in cases exactly like this. While you keep your VM running, just point Converter at the VM and tell Converter it is a physical machine. Converter will install its agent inside the VM, and start duplicating your VM to another LUN. After the conversion, the target VM will be free of any snapshots!

This option also works great if you have issues with your SAN. I have seen environments with LUNs you could not even browse any more (not from the datastore browser, not via ssh) – but the VMs placed there were still running OK. It shows the stability and enterprise-readiness of ESX for sure, but how do you recover? Even restarting the VMs or rescanning the LUNs is risky here. The simple answer: use Converter! To make a short story even shorter: Converter saved the day 🙂

So I guess as a final word I should say: for VM recovery from even the weirdest disk-related issues, consider using VMware Converter!

The temptation of "Quantum-Entangling" Virtual Machines – by Erik Zandboer

More and more vendors of SANs and NAS devices are starting to add synchronous replication to their storage – some are even able to deliver the same data locally at different sites using NFS. This sounds great, but more and more people use it to stretch VMware clusters across sites – and that is where it goes wrong: VMs run here, using storage there. It all becomes "quantum entangled", leaving you nowhere when disaster strikes.

These storage offerings cause people to translate all this into creating a single VMware HA cluster across sites. And really – I cannot blame them. It all sounds too good to be true: "If an ESX node at site A fails, the VMs are automagically started on an ESX server at the other site. Better yet, you can actually VMotion VMs from site A to site B and vice versa." Who would not want this?

VMware thinks differently – and with reason. They state that a VMware cluster is meant for failover/load balancing between LOCAL ESX nodes, and that site failover is a whole other ballgame (which is where Site Recovery Manager, or SRM, comes in). This position was not taken without reason, as I will try to explain.

 

How you should not do DR

If you have one big single storage array spanning both sites, you could run VMs on either side, using whatever storage is local to that VM. That way, you do not have disk access from VM to storage going over the WAN. But when DRS kicks in, the VMs will start to migrate between ESX nodes – and between sites! And that is where it goes wrong: the VMs and their respective storage get "entangled". I like to call that "quantum-entanglement of VMs", because it is kind of alike, and of course, because I can 🙂

Even without DRS, but with manual VMotions, in time you will definitely lose track of which VM runs where, and more importantly, which site it uses its storage from. In the end 50% of your VMs might be using storage at the other site, loading the WAN with disk I/O and introducing the WAN's latency into the disk I/O of the VMs that have become "stretched".

All this is pretty bad, but let's say something really bad happens: your datacenter at one location is flooded, and management decides you have to perform a failover to the other site. Now panic strikes: there is probably no Disaster Recovery plan, and even if there is, it is probably far from actually usable. VMs have VMotioned to the other site, storage has been added from either side. VMs have been created somewhere, using storage somewhere and possibly everywhere. In other words: you have no idea where to begin, let alone be able to automate or test a failover.

 

VMware’s way of doing DR

In order to overcome the problems of this "entanglement", VMware defines a few clear design limitations as to how you should set up DR failover, with SRM helping out if you choose to use it. But even without SRM, it is still a very good way of designing DR.

VMware states that you should keep a VMware cluster within a single site. DRS and HA will then take care of the "smaller disasters" such as NICs going down or ESX nodes failing – basically all events that are not to be seen as a total disaster. These failovers are automatic; they correct things without any human intervention.

The other site should be totally separated (from a storage point of view). The only connection between the storages on both sides should be a replication connection. So both sites are completely holding their own as far as storage is concerned. Out of scope of this blog, yet VERY important: When you decide on using asynchronous replication, make sure your storage devices can guarantee data integrity across both sites! A lot of vendors “just copy blocks” from one site to the other. Failure of one site during this block copy can (and will) lead to data corruption. For example, EMC storage creates a snapshot just before an asynchronous replication starts, and can revert to that snapshot in case of real problems. Once again, make sure your SAN supports this (or use synchronous replication). 

Now let's say disaster strikes. One site is flooded. HA and DRS are not able to keep up, servers go down. This is beyond what the environment should be allowed to "fix" by itself – so management decides to go for a failover. Using SRM, it should take only the press of a button, some patience (and coffee); but even without SRM you will know exactly what to do: make the replicated data visible (read/write) at the other site, browse the datastores for the VMs, register them, and start them. Even without any DR plan in place, it is still doable!

 

Where to leave your DR capacity: 50-50 or 100-0?

So let's assume you went for the "right" solution. Next to decide is what you are going to run where. Having a DR site, it would make sense to run all VMs (or at least almost all VMs) at the primary site, and leave the DR site dormant. Even better, if your company structure allows it, run test and development at the DR site. In case of a major disaster you can fail over production to the DR site, losing only test and development (if that is acceptable).

The problem often is your manager: he paid a lot of money for the second SAN and the DR ESX nodes. Now you will have to explain that these will do absolutely nothing as long as no disaster takes place. Technically there is no difference: you either run both sites at 50%, or one at 100% and the other dormant at 0%. Politically it is much more difficult to sell.

If you use SRM, there is a clear business case: if you run at 50-50, SRM needs double the licenses. And SRM is not cheap. Without SRM, it takes more explanation, but in my opinion running at 100-0 is still the way to go. As an added bonus, you might need fewer ESX nodes at the DR site if you do not have to fail over the full production environment (which reduces cost without SRM as well).

 

Conclusion

–> Don’t ever be tempted to quantum-entangle your VMs and their storage!

Performance tab "Numbers" demystified – by Erik Zandboer

"CPU ready time? Always use esxtop! The performance tab stinks!" is what I hear all the time. But in reality, the performance tabs don't stink, they're just misunderstood. This edition of my blog will try to clarify this, using the famous CPU ready times as an example.

A lot of people have questions about items like CPU ready times. The usual advice is to run esxtop. I have always been a fan of the performance tabs instead. However, the numbers presented there are very often unclear to people. The same goes for disk IOps. In fact, any measured value with a "number" as its unit appears to suffer from this. No need – the numbers are valid, you just have to know how to interpret them. The added bonus is of course that you get insight over time into your precious CPU ready times, instead of a quick glance in esxtop. If you tune the VirtualCenter settings right, you can even see the CPU ready times of your VM from yesterday or the day before!

 

Comparing esxtop to performance in VI client

When you compare the values in esxtop and the performance tab, you will notice a clear difference. While a VM has a %ready in esxtop of about 1.0%, in the performance graph (real-time view) it is all of a sudden a number around 200 ms?!? As strange as this sounds – the number is correct. The secret lies in the percentage versus the number of milliseconds. It is the same thing, just shown differently.

 

The magic word: sampletime

When you start to think about it: the number presented in the performance tab is in milliseconds. So basically it is the number of milliseconds the CPU has been "ready", as opposed to the percentage ready in esxtop. But how many milliseconds out of how much time, you might wonder? That is where the magic word comes in – sampletime!

Sampletime is basically the time between two samples. As you might know, the sampletime in the real-time view of the performance tab is 20 seconds (check upper left of the window):

Sampletime - yes! it is 20 seconds


So VMware could have taken the percentage ready from esxtop and displayed those instant values to form this graph. In reality however it is even nicer: VMware measures the number of milliseconds of CPU ready time in between these two samples! Once you grasp this, it all becomes clear – so if you see a value of 200 ms in the real-time view, in esxtop-like representation you would have 200/20 = 10 milliseconds per second, or 0.01 seconds per second, which is dead-on 1% (how DO I manage 😉 )
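
The same conversion as a tiny Python helper, so you can reuse it for any sample rate (the 30-minute case comes back further down in this post):

# Convert a CPU ready value in milliseconds to a percentage of the sample interval.
def ready_percent(ready_ms, sample_seconds):
    return (ready_ms / 1000.0) / sample_seconds * 100.0
print(ready_percent(200, 20))        # real-time view, 20s samples     -> 1.0 %
print(ready_percent(50000, 1800))    # weekly view, 30-minute samples  -> ~2.8 %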

 

Changing the statistics level in VirtualCenter
 

So now we have proven that the numbers can actually be matched – time for the nice bonus. If you log in to VirtualCenter via the VI client, you can edit the settings of VirtualCenter by clicking the "Administration" drop-down menu and selecting "VirtualCenter Management Server Configuration" (whoever came up with that monstrous name!). In this screen you'll see an item called "Statistics". There you can tune it to see more statistics over more time:

Altering statistics levels in VirtualCenter


In this example I have increased the level of statistics, and I have also changed the first interval duration to two minutes. In this setup, I can see CPU ready times of all my VMs for one entire week now:

CPU ready times for an entire week


Take care: your database will grow much larger, and VMware does not encourage you to keep this level of statistics on for an extended period of time – although I have been running these levels for months now without issues in a small (2 ESX server) environment. My database size is now around 1.4GB – not alarming, although the size grows rapidly with the number of ESX hosts you have.

When you look closely at the graph (an almost idle webserver which runs a virus scan at 5AM), you'll notice that CPU ready times shoot up to around 50,000 milliseconds. Sounds alarming, right? But do not forget, in this case I am looking at a weekly graph, which samples at a 30-minute rate. So I should do my calculation once again: 30 minutes = 30*60 = 1800 seconds (= sampletime). So 50000 / 1800 = 27.8 milliseconds per second, or 0.0278 seconds per second. So CPU ready is peaking at just below 3%, which is acceptable (usually below 10% is considered ok). Not bad at all! It would have been hard to find using esxtop though (I hate getting up early).

 

Any number

The calculations I have done are valid for all measured values that have a “numbers” unit within the Performance tab. So it also works for disk operations (like “Disk read requests”) and networking (like “Network packets received”).

 

So if you are looking for esxtop-like output over time in a graph – take a second look at your Performance tabs!

Scaling VMware hot-backups (using esXpress) – by Erik Zandboer

There are a lot of ways of making backups – and when using VMware Infrastructure there are even more. In this blog I will focus on so-called "hot backups" – backups made by snapshotting the VM in question (at the ESX level), and then copying the (now temporarily read-only) virtual disk files off to the backup location. And especially, on how to scale these backups to larger environments.

 

Say CHEESE!

Hot backups are created by first taking a snapshot. A snapshot is quite a nasty thing. First of all, every virtual disk that makes up a single VM has to be snapped at exactly the same time. Secondly, if at all possible, the VM should flush all pending writes to disk just before the snapshot is taken. Quiescing is supported in the VMware Tools (which should be inside your VM). Quiescing will flush all write buffers to disk. Effective to some extent, however not enough for database applications like SQL, Exchange or Active Directory. For those cases Microsoft came up with VSS. VSS tells VSS-enabled applications to flush every buffer to disk and hold any new writes for a while. Then the snapshot is made.

There is a lot of discussion about making these snapshots, and about quiescing or using VSS. I will not get into that discussion; it is too application-related for my taste. For the sake of this blog, you just need to know it is there 🙂

 

Snapshot made – now what?

After a snapshot is made, the virtual disk files of the VM become read-only. It is time to make a copy. And this is where different backup vendors really start to do things differently. As far as I have seen, there are several ways of getting these files out:

Using VCB
VCB is VMware's enabler for making backups, primarily through a fibre-based SAN. VCB provides a "view" into the internals of a virtual disk (for NTFS), or a view of the entire virtual disk file. From that point, any backup software can make a backup of these files. It is fast for a single backup stream, but requires a lot of local storage on the backup proxy, and does not easily scale up to a larger environment. The variation that uses the network as carrier is clearly suboptimal compared to other network-based solutions.

Using the service console
This option installs a backup agent inside the service console, which takes care of the backup. Do not forget, an FTP server is also an agent in this situation. It is not a very fast option, especially since the service console network is “crippled” in performance.  This scenario does not scale very well to larger environments.

Using VBAs
And here things get interesting – please welcome esXpress. I like to call esXpress the "software version" of VCB. Basically, what VCB does – make a snapshot and present a view to a backup proxy – is what esXpress does as well. The backup proxy is not a hardware server though, nor a single VM, but numerous tiny appliances, all running together on each and every ESX host in your cluster! You guessed it – see it and love it. I did anyway.

 

esXpress – What the h*ll?

The first time you see esXpress in action, you might think it is a pretty strange thing – First you install it inside the service console (you guessed right – There is no ESXi support yet). Second it creates and (re)configures tiny Virtual Appliances all by itself.

When you look closer and get used to these facts, it is an awesome solution. It scales very well: each ESX server you add to your environment starts acting as a multiheaded dragon, opening 2-8 (even up to 16) parallel backup streams out of your environment.

esXpress is also the only solution I have seen which does not have a SPOF (Single Point of Failure). ESX host failure is countered by VMware HA, and the VMs restarted on other ESX servers are backed up from those remaining hosts. Failing backup targets are handled as well: esXpress automatically fails over to other targets.

Setting up esXpress can seem a little complex – there are numerous options you can configure. You can get it to do virtually anything. Delta backups (changed blocks only), skipping of virtual disks, different backup schedules per VM, compression and encryption of the backups, and SO many more. Excellent!

Finally, esXpress has the ability to perform what is called "mass restores" or "simple replication". This function will automatically restore any new backups found on the backup target(s) to other locations. YES! You can actually create a low-cost Disaster Recovery solution with esXpress – the RPO (Recovery Point Objective) is not too small (about 4-24 hours), but the RTO (Recovery Time Objective) can be small: 5-30 minutes is easily accomplished.

 

The real stuff: Scaling esXpress backups

Being able to create backups is a nice feature for a backup product. But what about scaling to a larger environment? esXpress, unlike most other solutions, scales VERY well. Although esXpress is able to back up to VMFS, in this blog I will focus on backing up to the network, in this case to FTP servers. Why? Because it scales easily! The following makes the scaling so great (a back-of-the-envelope throughput sketch follows the list):

  • For every ESX host you add, you add more Virtual Backup Appliances (VBAs), which increases the total backup bandwidth out of your ESX environment;
  • Backups use CPU from the ESX hosts. Because CPU is hardly ever the bottleneck nowadays, it usually scales without extra costs for source/proxy hardware;
  • Backups are made through regular VM networks (not the service console network), so you can easily add more bandwidth out of the ESX hosts, and that bandwidth is not crippled by ESX;
  • Because each ESX server runs multiple VBAs at the same time, you can balance the network load very well across physical uplinks, even when you use PORT-ID load balancing;
  • More backup target bandwidth can be realized by adding network interfaces to the FTP server(s), when they (and your switches) support load-balancing through multiple NICs (etherchannel/port aggregation);
  • More backup target bandwidth can also be realized by adding more FTP targets (esXpress can load-balance VM backups across these FTP targets).
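
To put some numbers on the bullets above, here is the promised back-of-the-envelope sketch. All figures (hosts, streams, per-stream throughput, total backup size) are assumptions you should replace with your own:

# Rough backup-window estimate: aggregate stream bandwidth vs. amount of data.
def backup_window_hours(hosts, streams_per_host, mb_per_sec_per_stream, total_backup_gb):
    aggregate_mb_s = hosts * streams_per_host * mb_per_sec_per_stream
    return (total_backup_gb * 1024.0) / aggregate_mb_s / 3600.0
# e.g. 8 ESX hosts, 4 VBAs each, ~25 MB/s per FTP stream, 6TB to move per night
print("%.1f hours" % backup_window_hours(8, 4, 25, 6144))   # -> 2.2 hours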

Even better stuff: Scaling backup targets using VMware ESXi server(s)

Although ESXi is not supported by esXpress, it can very well be leveraged as a "multiple FTP server" target. If you do not want to fiddle with load-balancing NICs on physical FTP targets, why not use the free ESXi to run several FTP servers as VMs on one or more ESXi hosts! By adding NICs to the ESXi server it is very easy to establish load-balancing. Especially since each ESX host delivers 2-16 data streams, IP-hash load-balancing works very well in this scenario, and it is readily available in ESXi.
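
A toy illustration of why several FTP target VMs (each with its own IP) spread so nicely over the ESXi uplinks with IP-hash load-balancing: the uplink choice is driven by a hash of the source and destination IP, so more distinct source/target pairs means a better spread. The hash below is a simplified stand-in, not necessarily VMware's exact implementation, and all addresses are made up:

# Simplified IP-hash illustration: more distinct IP pairs -> better uplink spread.
def pick_uplink(src_ip, dst_ip, uplinks):
    src_last = int(src_ip.split(".")[-1])
    dst_last = int(dst_ip.split(".")[-1])
    return (src_last ^ dst_last) % uplinks
vbas = ["10.0.1.%d" % i for i in range(21, 29)]       # eight VBA source IPs
targets = ["10.0.2.%d" % i for i in range(11, 14)]    # three FTP target VMs
for vba in vbas:
    for tgt in targets:
        print("%s -> %s via uplink %d" % (vba, tgt, pick_uplink(vba, tgt, 2)))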

Conclusion

If you want to make high performance full-image backups of your ESX environment, you should definitely consider the use of esXpress. In my opinion, the best way to go would be to:

  • Use esXpress Pro or better (more VBAs, more bandwidth, delta backups, customizing per VM);
  • Reserve some extra CPU power inside your ESX hosts (only applicable if you burn a lot of CPU cycles during your backup window);
  • Reserve bandwidth in physical uplinks (use a backup-specific network);
  • Backup to FTP targets for optimum speed (faster than SMB/NFS/SSH);
  • Place multiple FTP targets as VMs on (free) ESXi hosts;
  • Use multiple uplinks from these ESXi hosts using the load-balancing mechanisms inside ESXi;
  • Configure each VM to use a specific FTP target as its primary target. This may seem complex, but it guarantees that backups of a single VM always land on the same FTP target (better than selecting a different primary FTP target per ESX host);
  • And finally… Use non-blocking switches in your backup LAN, which preferably support etherchannel/port aggregation.
     

If you design your backup environment like this, you are sure to get a very nice throughput! Any comments or inquiries for support on this item are most welcome.

VMware Infrastructure Resource Settings Sagas – by Erik Zandboer

One of the most abused items in VMware ESX server is probably the resource settings. I have heard the wildest fairytale stories and sagas about what to do and what not to do, and I have seen numerous issues around bad performance in conjunction with wild resource settings.

 

Resource Settings abuse

I have heard of VMware "experts" (even official trainers) who state that "one should put both memory and CPU limits and reservations on EVERY Virtual Machine. Anyone who claims it is better to leave things at default does not understand virtualization". Technicians who followed that way of working got themselves into the weirdest issues: HA not functioning, VMs unwilling to start, VMs freezing up while others continue to run OK, etc. etc. No surprise in my opinion.

What I have always seen, learned and done is “don’t touch resource settings until you really need them”. That has proven to both work and scale. Don’t touch resource settings unless you know exactly what you are doing.

 

“There’s two sides to every Schwartz”

Resource settings basically fall into two categories: there are shares on one side, and limits/reservations on the other. Limits and reservations are relatively easy to understand. A VM (or, when using resource pools, a group of VMs) gets resources reserved, and can also be limited in its resources. Shares are more complex to understand: shares cut in ONLY if a physical ESX server runs out of resources. Who gets the remaining resources is then determined by the shares mechanism.

 

Limits and reservations

Be very careful with these. There are few real use cases for either at the VM level. Be sure to know what you are doing before you start configuring them. These are the four cases:

Limit VM memory 
In my opinion this should be done by setting the correct amount of memory on the VM itself. Not having to reboot a machine to change the setting is, in my opinion, no excuse. Memory ballooning will occur by default if this limit lies below the assigned memory. Use cases: political/requirements-fooling ("the client thinks he gets 2GB, but in fact he gets no more than 1GB"). Some claim certain OSes run better when they see 4GB, even if they get only 1GB in real life (never got that one confirmed though). More useful on resource pools (financial: "you get what you pay for").

Reserve VM memory
In this situation ESX will reserve a certain amount of memory for the VM, whether it is used or not. This means that you basically render physical memory unused, which is kind of against the idea of virtualisation. Use cases: administrative: guaranteeing memory is available for the VM (although I think that should not be done at this level). In my opinion it is better to rely on shares and ballooning to get more memory if needed (dynamic versus static).

Limit VM CPU  
Cutting down the performance of a VM. This one can actually be effective: sometimes a VM can use up 100% CPU all of a sudden (scheduled tasks for example), where run time is of little concern. If you have 10 of these VMs, they can put a serious drain on your ESX resources. Limiting those VMs in cycles can be effective (because you simply cannot assign half a vCPU, so to speak). Another great example is a DOS-based server: DOS has no idle loop, so these VMs draw 100% CPU cycles 24/7. Limiting the CPU back to, let's say, 150MHz helps here.

Reserve VM CPU
For this one I have never ever seen a use case. You might decide to reserve CPU cycles, but then again, ESX can give or take CPU cycles as it wishes, so I would always rely on the shares mechanism here. Some reserve CPU cycles by default for every VM, but WHY? I never got a good reason. In my opinion: Don’t ever.

 

“Expandable” = “Expendable” ?

On top of the limits and reservations, you get an "expandable" checkbox for free in resource pools. Even worse, it is "on" by default. Let me get this straight: first I limit a resource, and then I allow the boundary to be crossed? OK, this can be valid in some cases, but I would not do it by default. The reason for this default setting might be that VMware does not want all the support calls from people who set reservations (and do not really know what they are doing), and then end up not being able to start or add their VMs anymore…

 

The shares mechanism

It is most important to understand how shares work, whether you use them or not. This is because they are always set to some value, even if you leave them all at the default. And that is where things go wrong: VMware's understanding of what the default settings should be has changed over time.

The question around defaults is: "how should defaults be determined?" Initially, VMware got it right. CPU shares of a VM grew as the number of vCPUs grew, and memory shares grew as the VM's memory grew. Later on (somewhere in ESX 3.0), VMware seemed to have forgotten this, and simply created VMs with equal shares, not taking into account the number of vCPUs or the memory assigned to a VM.

The reason for adjusting the shares to the size of the VM is quite obvious: think of two VMs, one an old Windows 2000 server with 384MB of memory, the other a 64-bit Windows 2003 VM with 8GB of memory. Now let's say these VMs coexist on a single physical ESX server, and the ESX server runs out of physical memory while both VMs need more memory. The shares mechanism cuts in. Assume there is 200MB of physical memory left to give. When shares are set equally, both machines will get 100MB. This is pretty significant for the W2K server, but not very much for the W2K3 server. This pleads for the strategy "more memory = more shares". The same goes for the number of vCPUs: "more vCPUs = more shares". Add priorities into the mix ("production" versus "test and dev") and you have a nice shares-value-cocktail (we could name this SVC 😉 ).
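
The same example as a small Python sketch – a toy model of the shares mechanism, not the actual ESX scheduler. The second call uses shares sized to the VMs' memory (10 shares per MB, the later ESX default mentioned below):

# Toy model: hand out the remaining memory proportionally to the shares values.
def distribute(free_mb, shares):
    total = float(sum(shares.values()))
    return dict((vm, round(free_mb * s / total, 1)) for vm, s in shares.items())
# equal shares: both VMs get 100MB of the remaining 200MB
print(distribute(200, {"w2k-384MB": 1000, "w2k3-8GB": 1000}))
# shares sized to VM memory: the 8GB VM gets nearly all of the remaining 200MB
print(distribute(200, {"w2k-384MB": 3840, "w2k3-8GB": 81920}))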

 

Guess the correct default and win!

Under ESX 2.5, shares were created "correctly" (sized to the VM's size). With the coming of ESX 3, at some point VMware "forgot" how to configure resource settings by default. Later on (somewhere along 3.5), things were fixed again and VMware once again created correct defaults for resource settings. This can result in faulty resource settings, especially if you upgraded your environment from ESX 3.0 to ESX 3.5. I have seen production environments where some VMs had 1000 memory shares (which once was the default for a "normal" shares value), while other VMs that were created on a later version of ESX got a memory share value of 10240 (10 times the number of MBs given to the VM, where 10 stands for a "normal" shares level). The result: not much, right until the ESX host runs out of physical memory. As soon as that happens, the 10240-shares VM runs normally, while the other (1000-shares) VM simply comes to an almost complete standstill. Ever seen this issue? Then go to your resource pool, and check the "Resource Allocation" tab. The "percentage shares" column just might show you something obviously wrong: one VM gets 99%, the rest 0% (in real life they share the one remaining percent, which rounds down to 0%). OUCH!

Given any VMware Infrastructure environment where the administrators have never looked into these settings, I would say there is about a 25 percent chance that resource settings are mismatched to some degree.

 

Moral of this blog?

Check your resource settings. Even out these settings across your VMs, and change resources to match the sizing of the VM where necessary. Most importantly: make sure the shares values match. Two VMs with 100 shares each is OK, two VMs with 10000 shares each is OK. Avoid the situation where one has 100 shares and the other has 10000 (unless you are comparing a DOS machine to a Windows 2008 SQL server, maybe 😉 ). Never use resource settings just "because you can"; make sure you have a solid idea of WHY whenever you change a resource setting away from the (correct) default.

Most important: If you have no issues and/or have no clue, do not mess with custom resource settings. Most ESX environments run best when ESX gets to decide! If you must, stick to resource pool settings, and stay away from custom individual VM resource settings.
