For the last few weeks I have been very busy solving HA issues at a client site. As you may have read, I solved the problems by swapping out the USB sticks and troubleshooting BIOS settings. Now my colleagues have asked me to write down all the checks I performed (together with VMware support) to track down these HA issues.

So here are the things I checked/performed:

  • For pre U2 ESX(i) 3.5 installations:
  1. As we all know, HA is very dependent on DNS, so I checked whether all DNS entries (forward & reverse) were OK;
  2. To rule out DNS, I created hosts files on all ESXi hosts containing both the FQDNs and the short names;
  • For post U2 ESX(i) 3.5 installations:
  1. Because we use ESXi 3.5U3 (so post U2) we had to check the FT_HOSTS file, in which VMware stores all HA-enabled hosts. With the release of ESX(i) 3.5U2 and VC 2.5U2, DNS resolution is no longer a requirement to enable HA. That means that ESX(i) servers do not rely on any DNS setting, or even on the hosts file (/etc/hosts), to resolve the names of the other cluster members. When a new host joins an HA cluster, the VirtualCenter server provides it with the IP address of one of the HA cluster members, which gives the host the network information it needs to contact that member. That one cluster member then supplies the network information for all nodes in the cluster. This information is stored in the FT_HOSTS file, located in:
    • /etc/opt/vmware/aam/FT_HOSTS (on standard ESX);
    • /var/run/vmware/aam/FT_HOSTS (on ESXi).
  2. Make sure HA uses the correct network for communication. When enabling HA, ESX(i) looks at all VMkernel portgroups and skips VMotion-enabled portgroups by default, unless the only VMkernel portgroup available is configured for VMotion. To let HA use VMotion-enabled VMkernel portgroups, add an advanced HA parameter called ‘das.allowVmotionNetworks’ and set it to ‘true’.
  3. In our case HA communication kept using the IP Storage VMkernel portgroup. To force HA to use the Management Network for communication we had to add another advanced HA parameter called ‘das.allowNetwork[n]’. The value of this parameter is a character string which needs to exactly match the name of the Service Console (ESX) or Management Network (ESXi) portgroup on all cluster nodes. This is case sensitive. So the value we had to set ‘das.allowNetwork[1]’ to was the label of our Management/VMotion network, which was ‘Management Network’ (how strange ;-).
  • If enabling HA still fails, check the setting for the USB 2.0 controller in the server's BIOS. On one of our servers the USB 2.0 controller was disabled, which reduced the speed of the USB port holding the ESXi USB stick and therefore caused installing/enabling HA to fail.
  • When everything else fails and you cannot solve your HA problems, check the various ESX(i) log files and watch the console screen, especially when booting the host. This is how we discovered the problems with our first series of faulty green HP USB sticks.
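
The forward & reverse DNS check from the pre-U2 steps above can be scripted instead of done by hand. What follows is only a minimal sketch in Python, meant to run from an admin workstation (not on the ESX(i) hosts themselves). The function name `check_dns` and the host names in the usage note are my own placeholders, not anything VMware ships. The two lookups are passed in as parameters so the logic can be tried out without a live DNS server.

```python
import socket


def check_dns(fqdn, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems found for one host name (empty list = OK).

    forward: callable mapping a name to an IPv4 address string.
    reverse: callable mapping an address to (name, aliases, addresses).
    Both default to the standard library resolver calls.
    """
    wanted = fqdn.rstrip(".").lower()

    # Forward lookup: name -> IP address.
    try:
        ip = forward(fqdn)
    except (socket.gaierror, socket.herror, KeyError) as exc:
        return ["forward lookup failed for %s: %s" % (fqdn, exc)]

    # Reverse lookup: IP address -> name(s).
    try:
        name, aliases, _addresses = reverse(ip)
    except (socket.gaierror, socket.herror, KeyError) as exc:
        return ["reverse lookup failed for %s (%s): %s" % (ip, fqdn, exc)]

    # DNS names are case insensitive, so compare in lower case.
    found = {n.rstrip(".").lower() for n in [name] + list(aliases)}
    if wanted not in found:
        return ["%s resolves to %s, but %s resolves back to %s"
                % (fqdn, ip, ip, name)]
    return []
```

To use it, loop over the FQDNs of all cluster members (for example the hypothetical `esx01.example.local`) and print whatever `check_dns` returns; an empty list means forward and reverse records agree for that host.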

Read the ‘VMware High Availability (HA) Implementations Release notes’ for more information on implementing and troubleshooting HA.
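
Because the value of ‘das.allowNetwork[n]’ has to match the portgroup label exactly (including case) on every cluster node, it can help to compare the labels before setting the parameter. The snippet below is just a sketch under my own assumptions: the function name `find_label_mismatches` is made up, and the labels per host are typed in by hand from what each host shows in its network configuration. It does not talk to VirtualCenter or to the hosts.

```python
def find_label_mismatches(expected, labels_by_host):
    """Report hosts that lack an exact, case-sensitive match for `expected`.

    labels_by_host: dict mapping a host name to the list of portgroup
    labels noted down for that host (hand-entered, hypothetical data).
    Returns a dict mapping each failing host to the labels that would
    match if case and surrounding spaces were ignored (possibly empty).
    """
    mismatches = {}
    loose = expected.strip().lower()
    for host, labels in labels_by_host.items():
        if expected in labels:
            continue  # exact match, nothing to report for this host
        # Collect near misses so a typo or case difference is easy to spot.
        mismatches[host] = [l for l in labels if l.strip().lower() == loose]
    return mismatches


if __name__ == "__main__":
    # Hypothetical host names and labels, for illustration only.
    labels = {
        "esx01": ["Management Network", "IP Storage"],
        "esx02": ["management network", "IP Storage"],
    }
    for host, near in find_label_mismatches("Management Network", labels).items():
        print(host, "has no exact match; near misses:", near)
```

A host that shows up with a near miss most likely just needs its portgroup renamed; a host with an empty list does not have the network at all.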