A new issue has been discovered in vSphere 5.5 Update 1 that causes NFS-based datastores to lose their connection. The result is an All Paths Down (APD) condition on the NFS datastores, meaning that vSphere loses its connection to the virtual machine files and/or the virtual machines cannot perform any I/O to the datastore. This results in a BSOD for Windows virtual machines and read-only filesystems for Linux guests.

Last Friday we encountered this ourselves, and VMware Support confirmed that it was indeed a bug, but the information was confidential at the time. Now that it’s out in the open, I can share.

According to VMware KB 2076392 the symptoms are as follows:

  • Intermittent APDs for NFS datastores are reported, with consequent potential blue screen errors for Windows virtual machine guests and read-only filesystems in Linux virtual machines.
    Note: NFS volumes include VSA datastores.
  • For the duration of the APD condition and after, the array still responds to ping, netcat tests are also successful, and there is no evidence to indicate a physical network or an NFS storage array issue.
  • The NFS storage array logs and traces also do not indicate any evident issue. Other hosts not running ESXi 5.5 U1 continue to work and can read and write to the NFS share without issue.
  • You see entries in the vobd logs similar to the following.
    Note: These log entries use the 12345678-abcdefg0 volume as an example.
    2014-04-01T14:35:08.074Z: [APDCorrelator] 9413898746us: [vob.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
    2014-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
    2014-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
    2014-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
    2014-04-01T14:37:28.081Z: [APDCorrelator] 9553899639us: [vob.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
    2014-04-01T14:37:28.081Z: [APDCorrelator] 9554275221us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
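To check a host for these symptoms, the vobd log can be scanned for the three esx.problem events shown in the sample entries. The following is a minimal Python sketch, not a VMware tool. It assumes the log lines follow the layout of the KB sample, and the function name find_apd_events and the SAMPLE text are my own illustration. On an ESXi host the log is normally found at /var/log/vobd.log, but verify the path on your own system.

```python
import re

# Layout assumed from the KB sample:
# <timestamp>: [<correlator>] <uptime>us: [<event>] <detail>
LINE_RE = re.compile(
    r"^\s*(?P<ts>\S+Z): \[(?P<correlator>\w+)\] \d+us: "
    r"\[(?P<event>esx\.problem\.[\w.]+)\] (?P<detail>.*)$"
)

# Event names copied from the sample log entries in the KB article.
WATCHED = {
    "esx.problem.storage.apd.start",
    "esx.problem.storage.apd.timeout",
    "esx.problem.vmfs.nfs.server.disconnect",
}


def find_apd_events(lines):
    """Return (timestamp, event, detail) tuples for APD-related entries."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("event") in WATCHED:
            events.append(
                (match.group("ts"), match.group("event"), match.group("detail"))
            )
    return events


# Shortened copy of the KB sample, used only to demonstrate the function.
SAMPLE = """\
2014-04-01T14:35:08.074Z: [APDCorrelator] 9413898746us: [vob.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:35:08.075Z: [APDCorrelator] 9414268686us: [esx.problem.storage.apd.start] Device or filesystem with identifier [12345678-abcdefg0] has entered the All Paths Down state.
2014-04-01T14:36:55.274Z: No correlator for vob.vmfs.nfs.server.disconnect
2014-04-01T14:36:55.274Z: [vmfsCorrelator] 9521467867us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.1.1/NFS-DS1 12345678-abcdefg0-0000-000000000000 NFS-DS1
"""

if __name__ == "__main__":
    for ts, event, detail in find_apd_events(SAMPLE.splitlines()):
        print(ts, event, detail)
```

To run it against a real log, pass the open file instead of the sample, for example find_apd_events(open("/var/log/vobd.log")). The vob.* entries are skipped on purpose, because each one is mirrored by an esx.problem.* entry and would otherwise be counted twice.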

 

This is a known issue affecting ESXi 5.5 Update 1 hosts with connected NFS storage. VMware is working towards providing a resolution to customers. To work around this issue, VMware recommends using ESXi 5.5 GA. Monitor the KB article to be informed as soon as a fix is released.

For NetApp customers, the issue seems to occur when using Data ONTAP 8.0 7-Mode. NetApp is also investigating whether this issue occurs on cDOT.
According to Nick Howell (NetApp – DatacenterDude.com), vSphere 5.5 Update 1 has not yet been added to the NetApp Interoperability Matrix Tool, a web-based utility.

Update:

VMware released a patch today, VMware ESXi 5.5, Patch ESXi550-201406401-SG: Updates esx-base (2077360), which resolves this issue.

The typical way to apply patches to ESXi hosts is through the VMware Update Manager. For details, see the Installing and Administering VMware vSphere Update Manager guide.

ESXi hosts can be updated by manually downloading the patch ZIP file from the VMware download page and installing the VIB by using the esxcli software vib command. Additionally, the system can be updated using the image profile and the esxcli software profile command. For details, see the vSphere Command-Line Interface Concepts and Examples and the vSphere Upgrade Guide.
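As a rough illustration of the two manual routes, the Python sketch below only assembles the esxcli command lines so they can be reviewed before being run in an ESXi shell. It does not execute anything. The depot path and the image profile name are hypothetical placeholders, not the real file or profile names for this patch, and the helper function names are my own. I use the update subcommand here. An install subcommand also exists, so check the documents mentioned above for which one suits your case.

```python
import shlex


def vib_update_cmd(depot_zip):
    """Command line that updates installed VIBs from an offline depot ZIP."""
    return ["esxcli", "software", "vib", "update", "-d", depot_zip]


def profile_update_cmd(depot_zip, profile):
    """Command line that updates the host to an image profile from the depot."""
    return ["esxcli", "software", "profile", "update", "-d", depot_zip, "-p", profile]


if __name__ == "__main__":
    # Hypothetical placeholders: replace with the ZIP you downloaded and a
    # profile name that the depot actually contains.
    depot = "/vmfs/volumes/datastore1/patch-bundle.zip"
    profile = "EXAMPLE-PROFILE-NAME"

    print(shlex.join(vib_update_cmd(depot)))
    print(shlex.join(profile_update_cmd(depot, profile)))
```

Put the host in maintenance mode before patching, and expect a reboot afterwards, since this patch updates esx-base.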