vCenter 4.1 and CPU usage

I have a very tiny testing environment, which I just upgraded to vSphere 4.1. I chose to reinstall the vCenter server on a 64-bit Windows 2003 VM with a local SQL Express installation, having only 1GB of memory and a single vCPU. I know this is against best practice, but I like to follow the much older best practice to “start out small”, which VMware has been (and still is) preaching. So I started out small. Although 1GB of memory is not even really small by my standards :)

I quickly noticed that the VM was running fine at first, but soon started hogging CPU resources and was hardly responsive. vCenter 4.1 could not have grown into such a resource eater, I figured. So I checked the settings of the Windows server, and I changed the swapfile size to a fixed value (a system-managed size can have an impact when it grows).

To have a look at the SQL settings, I decided to download and install the Microsoft SQL Server Management Studio Express tool, which can be obtained here.

Looking into the settings, I noticed that SQL is set to take all the memory it can get (the default maximum is 2,147,483,647MB, which is effectively unlimited). I changed that setting to 768MB, meaning SQL can use 768MB as a maximum, leaving 256MB for “the rest” (read: vCenter and VUM). If needed you might fiddle with this setting.
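The same cap can also be set from a query window instead of the GUI. Below is a minimal Python sketch that only builds the T-SQL batch as text; it does not connect to anything. The function name `max_memory_batch` is made up for illustration, and the batch assumes the standard `sp_configure` option `max server memory (MB)`:

```python
def max_memory_batch(max_mb: int) -> str:
    """Return a T-SQL batch that caps SQL Server memory at max_mb megabytes."""
    if max_mb <= 0:
        raise ValueError("max_mb must be a positive number of megabytes")
    statements = [
        # 'max server memory (MB)' is an advanced option, so expose those first.
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'max server memory (MB)', {max_mb};",
        "RECONFIGURE;",
    ]
    return "\n".join(statements)


total_mb = 1024  # memory of the vCenter VM in this post
sql_mb = 768     # cap chosen for SQL Express
print(max_memory_batch(sql_mb))
print(f"-- leaves {total_mb - sql_mb} MB for the rest")
```

Paste the printed batch into Management Studio Express and run it against the local instance; adjust `sql_mb` if vCenter turns out to need more room.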

After changing these values, vCenter began to respond properly. Sometimes the VM gets really busy, but it quickly returns to “normal” behaviour; no more “stampede”.

Take care: This is in no way meant for a full-blown production environment. Always follow VMware’s best practices… But if you have a test environment which is really REALLY small, consider these changes to extend the life of your single socket, dual core ESX nodes by a year (or possibly two) :)

No COS NICs have been added by the user – solved

While setting up UDA 2.0 (beta14) for a customer, so that they can reinstall their 50+ VMware servers, I stumbled upon this message. The install would hang briefly, then proceed to a “press any key to reboot” prompt. Not too promising…

After searching the internet I found a lot of blog entries on exactly this error, but no useful hints or tips that would solve my problem. I checked the disk layout over and over again, to make sure no mistakes were made there. I was starting to pull my hair out, because it did work previously.

Then I started thinking; the customer in question has multiple PXE servers in the same network, and special DHCP entries were created for all vmnic0 MAC addresses, so that options 66 and 67 could be set to point to the UDA appliance. I think their DHCP server denies DHCP to any MAC address unknown to it, because right before the “press any key to reboot” I saw something pass by along the lines of “unable to obtain a dynamic address”. I figured that in the initial setup, the kickstart part tries to get a DHCP address using the Service Console virtual NIC (with a different MAC address each time you reinstall). So I tried to alter the “Kernel option command-line” from this:

ks=http://[UDA_IPADDR]/kickstart/[TEMPLATE]/[SUBTEMPLATE].cfg initrd=initrd.[OS].[FLAVOR] mem=512M



to include static IP data:



ks=http://[UDA_IPADDR]/kickstart/[TEMPLATE]/[SUBTEMPLATE].cfg initrd=initrd.[OS].[FLAVOR] mem=512M ksdevice=vmnic0 ip=[IPADDR] netmask=255.255.255.0 gateway=10.11.12.254 dns=10.11.12.13



This appears to have done the trick: the install finally runs to completion. The “No COS NICs have been added by the user” message turned out not to be the actual issue: the warning is still there, but the install continues now. I am still unsure what this warning actually means…
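With 50+ hosts to reinstall, the static part of the option line has to be filled in per host. A minimal sketch of how that line could be assembled in Python; `kernel_options` is a hypothetical helper (not part of UDA), and the bracketed placeholders are kept exactly as UDA shows them:

```python
def kernel_options(base: str, device: str, ip: str, netmask: str,
                   gateway: str, dns: str) -> str:
    """Append static network settings to a kickstart kernel option line."""
    static = {
        "ksdevice": device,
        "ip": ip,
        "netmask": netmask,
        "gateway": gateway,
        "dns": dns,
    }
    # Dictionaries keep insertion order, so the options appear as listed above.
    extra = " ".join(f"{key}={value}" for key, value in static.items())
    return f"{base} {extra}"


BASE = ("ks=http://[UDA_IPADDR]/kickstart/[TEMPLATE]/[SUBTEMPLATE].cfg "
        "initrd=initrd.[OS].[FLAVOR] mem=512M")

print(kernel_options(BASE, "vmnic0", "[IPADDR]", "255.255.255.0",
                     "10.11.12.254", "10.11.12.13"))
```

The printed line matches the altered command line shown above; only the network values need to change per environment.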

PHDVirtual releases Virtual Backup 4.0-4 with vSphere 4.1 support

PHDVirtual has released an updated version of their famous Virtual Backup solution (formerly esXpress). This version fully supports VMware vSphere 4.1, and is one of the first (if not THE first) of the 3rd party “high tech virtual backup only” products to support vSphere 4.1.

I was very quick to upgrade my test environment to vSphere 4.1 (right after the general release), breaking the PHDvirtual backup in the process. For days the environment failed to back up, because vSphere 4.1 introduced a snapshot issue with esXpress. PHDvirtual worked hard to get vSphere 4.1 supported, and on 9/17/2010 they released version 4.0-4 which did just that.

So I upgraded my test environment to PHDvirtual 4.0-4. Right after the upgrade I forced a reinstall on the ESX nodes to 4.0-4 from the 4.0-4 GUI appliance, and I kicked off an initial backup by renaming a VM from the VI client to include [xPHDD] in the VM name. PHDvirtual Backup picked it up, renamed the VM back and started the backup. It just worked straight away. Even CBT was still functional, and my first Windows VM backed up again with only 2.2[GB] in changed blocks. Awesome!

From the initial tests it shows that both speed and stability are just fine, not very different from the previous release. Still fast and definitely rock solid. Highly recommended!

VMware releases vSphere 4.0 update 2

VMware just released vSphere 4.0 update 2. Not much new stuff, except improvements in guest operating system customization:

Guest Operating System Customization Improvements: vCenter Server now supports customization of the following guest operating systems:

  • Windows XP Professional SP2 (x64) serviced by Windows Server 2003 SP2
  • SLES 11 (x32 and x64)
  • SLES 10 SP3 (x32 and x64)
  • RHEL 5.5 Server Platform (x32 and x64)
  • RHEL 5.4 Server Platform (x32 and x64)
  • RHEL 4.8 Server Platform (x32 and x64)
  • Debian 5.0 (x32 and x64)
  • Debian 5.0 R1 (x32 and x64)
  • Debian 5.0 R2 (x32 and x64)

Also a lot of resolved issues; which is always nice to have!

Throughput part 4: A day at the races (Hotspotting case)

The fourth part of this triptych ( ;) ) is a customer case of hotspotting on storage. The graphs speak for themselves! Some storage design decisions they made caused them a lot of trouble…



Birth of the storage design

The customer in question was going to run a large VDI (virtual desktop) deployment in several pods. The first pod was designed with two low-cost FC SANs, each having 48 SATA disks. A single SAN should deliver full-clone desktops for 500 users. Running on a “conventional” FC-SAN (no ZFS filesystem or large caches) 48 SATA disks for 500 vDesktops alone is what I’d call a challenge already!

Apart from that they started out right, by choosing a RAID10 configuration. They reserved two SATA disks to function as hot spares. So far so good. But then what? You have 46 disks left, and you must put them in a RAID10. They decided to create ONE single RAID10 volume, consisting of 46 disks, thinking that for each I/O performed all disks would be used, boosting performance. On top of that, they decided to use 512KByte as a segment size, because VMware uses a blocksize of at least 1MByte anyway (neither of which is true, of course). The setup on a disk-level looks something like this:


Figure 1: RAID10 array consisting of 23 stripe members, showing 10 full-cloned vDesktops laid out on the disks.



Those of you who have read the other parts of my throughput blogs might already have spotted where things go wrong. In fact things went horribly wrong, as I’ll demonstrate in the following section.



What’s happening here?

As described in Throughput part 1: the basics and Throughput part 2: RAID types and segment sizes, in a random I/O environment you optimally want only one member of a stripe to perform a seek for a single I/O. That is covered when using 512KB segments. During a completely random I/O pattern, things really aren’t that bad: the randomness makes sure all mirror pairs will be active, no matter how big the segment size might be. The large number of members does not impact rebuild times in a RAID10 configuration either.

The very large number of stripe members (23 mirror pairs) in combination with the rather large segment size is what really caused the downfall, though. As soon as the environment was running a larger number of vDesktops, and new vDesktops were cloned, things got bad fast. Full cloning technology was used, which means that each vDesktop has a full image on disk (about 16GBytes in size). The VDI solution used was only able to limit the deployment of vDesktops to a number per ESX host. To make a short story even shorter, during a deployment they ran 10 full-cloning actions in parallel against a single SAN. Watch and be amazed what happened!



A day at the races

So why is this blog called “a day at the races” anyway? Well, it simply reminded me of horse racing (and Queen rocks ;) ). Time for some theory before we prove it also applies in real life. Let’s assume we have vDesktops already running (let’s say about 250 of them; the number is not really relevant here). They perform random I/Os on the SAN, loading all disks to some degree (performance wise).

Now we start a single cloning thread (the VDI broker calls for a cloning action to VMware). Sequential reads and writes start to occur (from the template into the new vDesktop virtual disks). Assuming this clone runs at just about 60[MB/sec] (which is a realistic yet theoretical number), and the segment size is 0.5MBytes over 23 stripe members, the clone moves on to the next stripe member about 60 / 0.5 = 120 times every second. No bells ringing yet…

Now think about not one clone, but 10 of these full cloning actions running simultaneously. Remember each cloning action accesses only one stripe member at a time (as they progress through all of the stripe members over and over again). Basically all cloning actions race each other over the stripe members, each a few full-stripes below the other (see figure 1). Assuming they never run at exactly the same speed, it is to be expected two full clones will meet on the same stripe member, slowing things down for these two full clone actions, sticking them to that single stripe member for the time.
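To make “one stripe member at a time” concrete, here is a small sketch of how a sequential writer walks over the stripe members. It assumes plain round-robin striping that starts at member 0 (a real array may add its own offsets), and the name `stripe_member` is just for illustration:

```python
SEGMENT_BYTES = 512 * 1024   # segment size from the post
MEMBERS = 23                 # mirror pairs in the RAID10 set


def stripe_member(offset_bytes: int) -> int:
    """Index (0-based) of the stripe member that holds this byte offset."""
    return (offset_bytes // SEGMENT_BYTES) % MEMBERS


# A sequential writer moves to the next member every 512 KB and wraps
# around after one full stripe of 23 * 512 KB = 11.5 MB.
for mb in (0, 0.5, 1, 11, 11.5):
    offset = int(mb * 1024 * 1024)
    print(f"offset {mb:>4} MB -> member {stripe_member(offset)}")
```

Ten clones each follow this same walk at their own offset, which is why two of them landing on the same member is only a matter of time.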

As soon as they slow down, the other cloning actions which still run faster “crash” into the rear of this stripe member as well. In the end, all full cloning actions are hammering on the same stripe member, while all other disks are not being accessed by the cloning actions at all. Hopefully you’ll get the idea when looking at figure 2:


Figure 2: Ten cloning actions racing each other. All are writing on the impacted Stripe Member 2. Clone10 (purple) is about to escape, while clone3 (green) is about to crash into the rear of the impacted Stripe member again.



Each cloning action runs along one of the coloured lines, visiting all stripe members over and over again. Multiple writes being performed to a single stripe member will cause all those writes to slow down (the stripe member gets busier). This in turn causes the other sequential writes which did not slow down yet to “crash into the rear” of the impacted stripe member, causing an even bigger impact. This finally results in all cloning actions hammering on the same single stripe member, forcing the entire SAN to its knees.

As soon as one full clone “escapes the group”, it finds the other stripe members which do not suffer from the hammering. So it picks up speed, races through the non-impacted stripe members, and simply crashes into the rear of the stripe member it just managed to escape from. Basically, the system will keep hammering on a single stripe member!

In the end, the 10 parallel full cloning actions effectively use one single stripe member, giving the performance of one single SATA disk (the RAID 1 write penalty is 2, meaning a stripe member (= a mirror pair of disks) performs like a single SATA disk for writes). The overall cloning performance was measured, and went down to about 5[MB/sec] effectively. Running vDesktops came to a near-freeze.
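The write penalty argument can be written down as a tiny calculation. This is only a sketch that counts in “single-disk equivalents” rather than real IOPS figures (no absolute IOPS numbers were measured here), and `write_disk_equivalents` is an illustrative name:

```python
def write_disk_equivalents(active_mirror_pairs: int, write_penalty: int = 2) -> float:
    """Write capability of the active mirror pairs, in single-disk equivalents."""
    physical_disks = 2 * active_mirror_pairs
    # Every write lands on both disks of a pair, hence the penalty of 2.
    return physical_disks / write_penalty


spread = write_disk_equivalents(23)   # writes spread over all 23 pairs
hotspot = write_disk_equivalents(1)   # all clones stuck on one pair
print(spread, hotspot, f"{hotspot / spread:.1%}")
```

So while the clones are stuck together, the sequential writes get roughly one disk’s worth of the 23 the array could offer.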

When you calculate the frequency at which the stripe members are “visited” now, you’ll find that the cloning moves on to the next stripe member about 5 / 0.5 = 10 times every second. This is a frequency of 10[Hz], very visible to the human eye! So you could actually see this happening on the array (the disk activity LEDs step across the array ten times a second). Too bad I don’t have a video on that one :(
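Both frequencies in this post come from the same division: throughput divided by segment size gives how often a sequential stream crosses a segment boundary. A minimal sketch (the function name is mine):

```python
def segment_switches_per_second(throughput_mb_s: float, segment_mb: float) -> float:
    """How many segment boundaries a sequential stream crosses per second."""
    return throughput_mb_s / segment_mb


healthy = segment_switches_per_second(60, 0.5)  # single clone at 60 MB/sec
hotspot = segment_switches_per_second(5, 0.5)   # all ten clones at 5 MB/sec
print(healthy, hotspot)
```

This prints 120.0 and 10.0, the two numbers used above.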

Here are some latency graphs on the array during the parallel deployment of 10 full clones:

Figure 3: Abnormal read latency during 10 parallel full-clone actions



Figure 3 clearly shows that performance suffers. Even though the heavy writes are thought to be the guilty ones, all reads that have to be performed on the impacted stripe member suffer as well, kicking up read latency well over 150 [ms]. The graph keeps touching its lower regions (low-latency reads), which is probably the effect of read cache (when disks are not required to service a read request).


Figure 4: Abnormal write latency during 10 parallel full-clone actions



Write latency in figure 4 is really showing the infamous “A bridge too far”. Especially on the left side of the graph, latencies run up dramatically. The LUNs that draw a thin line along the 10[ms] boundary do not appear to be impacted as much as the other LUNs; this is probably because these LUNs are not being written to by a full clone action, so only the random writes performed by the already running vDesktops are registered there. Nonetheless they also see the impact of the cloning (note the starting situation where all write latencies are well below 3 [ms] ).

All other vDesktops running are still performing their random I/O. As long as they do not hit the “impacted stripe member” they just go about their business. But as soon as they hit that stripe member (and they will), they start crawling. In effect, the entire SAN performance appears to crumble, and the vDesktops freeze almost completely.



How to fix things

So how do you fix these issues? The answer is relatively simple: The customer upgraded their disks to 15K SAS drives (being a more realistic configuration for running 500 vDesktops), and they divided the available disks into four separate RAID10 groups instead of just one. Also, they decreased the segment size to 64KBytes, which appears to be a much saner design.

The smaller segment size will cause cloning actions to stick to a particular stripe member for a much shorter period of time. More disk volumes with a smaller number of members in the stripe will help to isolate performance impact. Together with faster disks, this boosted performance effectively (a 15K SAS drive delivers about three times the IOPS a single 7K2 SATA disk can handle).
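How much shorter? A rough sketch, assuming the same theoretical 60[MB/sec] clone rate used earlier and ignoring everything else the array is doing; `dwell_ms` is an illustrative helper:

```python
def dwell_ms(segment_kb: float, throughput_mb_s: float) -> float:
    """Milliseconds a sequential stream spends on one segment."""
    segment_mb = segment_kb / 1024
    return 1000 * segment_mb / throughput_mb_s


CLONE_MB_S = 60  # theoretical single-clone rate from earlier in the post
old = dwell_ms(512, CLONE_MB_S)
new = dwell_ms(64, CLONE_MB_S)
print(f"512 KB segments: {old:.2f} ms per segment")
print(f" 64 KB segments: {new:.2f} ms per segment")
print(f"ratio: {old / new:.0f}x")
```

Whatever the actual clone rate is, the ratio stays the same: with 64KByte segments a clone leaves a stripe member eight times sooner, which gives other clones far less time to pile up behind it.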
