HA problem checklist

For the last few weeks I have been very busy solving HA issues at a client site. As you may have read, I solved the problems by swapping out the USB sticks and troubleshooting BIOS settings. Now my colleagues have asked me to write down all the checks I performed (together with VMware support) to track down these HA issues.

So here are the things I checked/performed:

  1. As we all know HA is very dependent on DNS, so I checked if all DNS entries (forward & reverse) were OK;
  2. To rule out DNS, I created hosts files on all ESXi hosts containing the FQDNs and short names;
  3. Because we use ESXi 3.5U3 (so post U2), we had to check the FT_HOSTS file, in which VMware stores all HA-enabled hosts. With the release of ESX(i) 3.5U2 and VC 2.5U2, DNS resolution is no longer a requirement to enable HA. That means that ESX(i) servers do not rely on any DNS setting, or even the hosts file (/etc/hosts), to resolve the names of the other cluster members. When a new host joins an HA cluster, the VirtualCenter server provides it with the IP address of one of the HA cluster members, which gives the host the network information it needs to contact that member. That cluster member then supplies the network information for all nodes in the cluster. This information is stored in the FT_HOSTS file, located in:
    • /etc/opt/vmware/aam/FT_HOSTS (on standard ESX);
    • /var/run/vmware/aam/FT_HOSTS (on ESXi).
  4. Make sure HA uses the correct network for communication. When enabling HA, ESX(i) looks at all VMkernel portgroups and skips VMotion-enabled portgroups by default, unless the only VMkernel portgroup available is configured for VMotion. To let HA use VMotion-enabled VMkernel portgroups, enter an advanced HA parameter called ‘das.allowVmotionNetworks’ and set it to ‘true’.
  5. In our case HA communication kept using the IP Storage VMkernel portgroup. To force HA to use the Management Network for communication, we had to enter another advanced HA parameter called ‘das.allowNetwork[n]’. The value of this parameter is a character string which needs to match exactly the name of the Service Console (ESX) or Management Network (ESXi) on all cluster nodes. It is case sensitive. So the value we had to set ‘das.allowNetwork[1]’ to was the label of our Management/VMotion network, which was ‘Management Network’ (how strange ;-).
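If you have more than a handful of hosts, checks 1, 2 and 5 are easy to script from an admin workstation. Below is a minimal Python sketch, not what we actually ran on site. The function names, host names and IP addresses are made up for illustration, and the portgroup labels per host are assumed to have been collected by hand, since the sketch does not talk to VirtualCenter or the hosts themselves.

```python
import socket


def check_dns(fqdn, expected_ip,
              forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Check forward and reverse DNS for one host.

    Returns a list of problems; an empty list means both lookups are OK.
    The resolver functions can be swapped out, e.g. for testing.
    """
    problems = []
    try:
        ip = forward(fqdn)
        if ip != expected_ip:
            problems.append(f"forward: {fqdn} -> {ip}, expected {expected_ip}")
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        problems.append(f"forward: {fqdn} failed ({exc})")
    try:
        name = reverse(expected_ip)[0]
        if name.lower() != fqdn.lower():  # DNS names are case insensitive
            problems.append(f"reverse: {expected_ip} -> {name}, expected {fqdn}")
    except OSError as exc:  # socket.herror is a subclass of OSError
        problems.append(f"reverse: {expected_ip} failed ({exc})")
    return problems


def hosts_line(fqdn, ip):
    """Build a hosts-file entry containing the FQDN and the short name."""
    short = fqdn.split(".", 1)[0]
    return f"{ip}\t{fqdn}\t{short}"


def label_mismatches(allowed_label, labels_by_host):
    """Return the hosts that have no portgroup whose label exactly matches
    the das.allowNetwork value. The comparison is case sensitive on purpose."""
    return sorted(host for host, labels in labels_by_host.items()
                  if allowed_label not in labels)


if __name__ == "__main__":
    # Hypothetical hosts; replace with your own names and addresses.
    hosts = {"esx01.example.local": "10.0.0.11",
             "esx02.example.local": "10.0.0.12"}
    for fqdn, ip in hosts.items():
        print(hosts_line(fqdn, ip))
```

Calling check_dns(fqdn, ip) for every host gives you a list of forward and reverse lookup problems per host, and label_mismatches('Management Network', labels) lists the hosts where the label is missing or spelled differently, for example ‘management network’.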

Read the ‘VMware High Availability (HA) Implementations Release Notes’ for more information on implementing and troubleshooting HA.
