hostd-hara-kiri – by Erik Zandboer

February 24th, 2009

Today I got a question from a customer: his hosts appeared to reboot every few hours, or at least show up grey in vCenter. I found the issue: a clear case of hostd-hara-kiri…!

When I heard of this issue, the first and only thing that came to mind was hostd running out of memory. A quick look at /var/log/vmware/hostd.log confirmed it: “Memory checker: Current value 174936 exceeds soft limit 122880”. I advised him to raise the service console memory, although I am not sure this resolves the issue, since the limits for hostd memory do not change when you alter the SC memory… So as a “backup” I told him to make the changes stated below, to at least make sure the problem would not come back.

Anyway, I decided to check out my testing environment. My hostd.log was being filled up with these messages too. The soft limit, which VMware sets at 122880 (120MB), is exceeded almost constantly. The hard limit is set at 204800 (200MB). Hard limit??? So what happens when the hard limit is reached? Exactly: hostd-hara-kiri.

One of the ESX servers I looked at showed a value of 204660. Geez, it must be my “lucky” day! I exported the hostd.log, imported it into Excel, and managed to get this graph out of it:

Here you see the hostd memory usage climbing to its summit: hostd-hara-kiri.
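If Excel is not at hand, the extraction step can be sketched in a few lines of Python. This is only a rough sketch, and not how the graph above was made: it assumes nothing about the log line beyond the message fragment quoted earlier, and the names memory_samples and samples_from_file, as well as the unrelated sample line, are made up for illustration.

```python
import re

# Built only from the message fragment quoted in this post; the rest of the
# log line (timestamp, thread info, ...) is deliberately not assumed.
MEMCHECK = re.compile(r"Current value (\d+) exceeds (?:soft|hard) limit \d+")

def memory_samples(lines, keep_every=1):
    """Return the 'Current value' numbers found in lines, keeping every Nth match."""
    values = []
    for line in lines:
        match = MEMCHECK.search(line)
        if match:
            values.append(int(match.group(1)))
    return values[::keep_every]

def samples_from_file(path="/var/log/vmware/hostd.log", keep_every=1):
    """Convenience wrapper (hypothetical) for reading a hostd.log from disk."""
    with open(path, errors="replace") as log:
        return memory_samples(log, keep_every)

# The two values below come from this post; the middle line is a made-up filler.
sample = [
    "Memory checker: Current value 174936 exceeds soft limit 122880",
    "some unrelated hostd line",
    "Memory checker: Current value 204660 exceeds soft limit 122880",
]
print(memory_samples(sample))  # [174936, 204660]
```

Passing keep_every=10 thins the series out to every 10th sample, which keeps a graph readable when a warning is logged every 30 seconds.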

(Not so) reassured by the outcome of the graph and its linear behaviour, I started to tail the hostd.log. Man, this is more exciting than watching a horror movie 😉 ! After a short while, the inevitable happened: “Current value 204828 exceeds hard limit 204800. Shutting down process.” KA-BOOOM! Hostd was gone, the host went grey in vCenter for about 30 seconds, then came back up as if nothing had happened. And they say there is no such thing as reincarnation! I think a lot of people must have witnessed this, thought it to be “odd”, and went on with their lives.

In fact, after going through all the hostd logging I could lay my hands on for one of my ESX test hosts (the logs rotate quite fast because of these once-every-30-seconds events), I put together this graph. Lucky me, I managed to capture both a controlled reboot and a hostd-hara-kiri event:

Hostd memory climbing, dropping because of a host reboot, then climbing again followed by a plummet = hostd-hara-kiri

As shown in the graph, a full cycle from controlled reboot to hara-kiri appears to take somewhere around 6000 samples for this particular host. A warning appears every 30 seconds, and I kept only every 10th sample. So this puts the hara-kiri frequency at about 6000 × 10 × 30 = 1,800,000 seconds, or 20.8 days. Not being very happy with these results, I decided to try and avoid this repeating “reincarnation event”. And I soon found a workaround (not sure if this is the solution), by editing /etc/vmware/hostd/config.xml. I added these lines right below <config>:

This basically sets the limits to a higher value. The warnings will now appear where it used to be hostd-hara-kiri time, and the true hara-kiri threshold is raised from 200MB to 250MB. This at least delays the problem of hostd reincarnation, but I am unsure about the true cause at this time. It appears to have something to do with stuff installed inside the service console of ESX: servers with, for example, HP agents installed appear to use more hostd memory than “clean” service consoles, and on those the reincarnation events can occur in hours instead of days. That, and the linear climb of used memory, points to… a memory leak. I expect VMware has a bug to fix. Might be a nasty one too; I believe it has been inside ESX for a long time (maybe since 3.5U2 or before).
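The back-of-the-envelope numbers are easy to redo for another host. The sketch below repeats the cycle calculation from this post (6000 samples, every 10th kept, one warning per 30 seconds) and adds a simple linear extrapolation towards the hard limit. The function names and the growth rate of 2 units per warning are made up for illustration; measure the real growth rate from your own log.

```python
SECONDS_PER_DAY = 86_400

def cycle_days(samples_per_cycle, thinning_factor, seconds_per_warning):
    """Length of one reboot-to-hara-kiri cycle in days."""
    return samples_per_cycle * thinning_factor * seconds_per_warning / SECONDS_PER_DAY

def seconds_until_limit(current, growth_per_warning, limit, seconds_per_warning=30):
    """Linear extrapolation: seconds left before 'current' reaches 'limit'.

    Assumes memory grows by a constant growth_per_warning (in the log's own
    units) between two consecutive warnings, which is only an approximation.
    """
    if growth_per_warning <= 0:
        raise ValueError("growth_per_warning must be positive")
    remaining = max(0, limit - current)
    return remaining / growth_per_warning * seconds_per_warning

# Figures from this post: 6000 samples, every 10th kept, a warning every 30 s.
print(round(cycle_days(6000, 10, 30), 1))  # 20.8

# Values 174936 and 204800 are from this post; the growth rate of 2 is invented.
left = seconds_until_limit(174936, 2, 204800)
print(round(left / SECONDS_PER_DAY, 1))  # 5.2
```

If the leak really is linear, the second number is roughly how long you have before the next hostd restart; with raised limits, only the limit argument changes.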

So: if you have intermittent reboots, or at least disconnects from vCenter, check the hostd log for these limit warnings.
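A quick way to do that check is to count the soft and hard limit messages and note the highest value seen. The sketch below is one possible take on it, matching only the message text quoted in this post; the function name summarise_limit_warnings is hypothetical, and the exact layout of a real hostd.log line may differ.

```python
import re
from collections import Counter

# Matches only the message fragments quoted in this post.
LIMIT_MSG = re.compile(r"Current value (\d+) exceeds (soft|hard) limit (\d+)")

def summarise_limit_warnings(lines):
    """Count soft/hard limit messages and track the highest value reported."""
    counts = Counter()
    peak = 0
    for line in lines:
        match = LIMIT_MSG.search(line)
        if match:
            counts[match.group(2)] += 1
            peak = max(peak, int(match.group(1)))
    return {"soft": counts["soft"], "hard": counts["hard"], "peak": peak}

# Both messages are the ones quoted earlier in this post.
sample = [
    "Memory checker: Current value 174936 exceeds soft limit 122880",
    "Current value 204828 exceeds hard limit 204800. Shutting down process.",
]
print(summarise_limit_warnings(sample))
# {'soft': 1, 'hard': 1, 'peak': 204828}
```

Any non-zero "hard" count means hostd has shut itself down at least once within the log lines you fed in; to check a real host, pass an open file object for /var/log/vmware/hostd.log instead of the sample list.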