vSphere 6 experiencing high packet loss
Since I updated my lab environment to vSphere 6, I regularly get ‘Virtual machine is experiencing high number of received packets dropped’ messages in vRealize Operations for all virtual machines in my environment.
I canceled these alerts multiple times but the high packet loss errors always return. So, time to fix this issue, but the outcome is as strange as it is positive.
Increase buffers in VM guest OS to fix high packet loss
Searching for answers, I quickly found articles describing high packet loss at the guest OS level on VMXNET3 adapters in vSphere 4.x through 6.x.
According to the articles, the issue occurs when packets are dropped during high traffic bursts because of a lack of receive and transmit buffer space, or when receive traffic is speed-constrained, for example by a traffic filter.
To resolve this issue, gradually increase the number of buffers in the guest operating system.
To reduce burst traffic drops, adjust the buffer settings in Windows:
- Click Start > Control Panel > Device Manager.
- Right-click vmxnet3 and click Properties.
- Click the Advanced tab.
- Click Small Rx Buffers and increase the value. The default value is 512 and the maximum is 8192.
- Click Rx Ring #1 Size and increase the value. The default value is 1024 and the maximum is 4096.
Source: VMware KB article 2039495.
Although the symptoms described look similar to what I’m experiencing, the articles also mention that the packet loss occurs during periods of very high traffic bursts. My environment is a lab environment which does not experience huge workloads or high traffic bursts.
Besides that, the articles only mention trouble with VMXNET3 adapters, but my lab environment is a mix of VMXNET3 and E1000 adapters on both Windows and Linux virtual machines.
Esxtop shows dropped receive packets (%DRPRX)
Investigating further, I found an article on solving issues when ‘esxtop’ shows dropped receive packets (%DRPRX) at the virtual switch. This article mentions not only VMXNET3 but also E1000, and not only Windows but also Linux and Solaris.
Because ESXi treats packets on a FIFO (first in, first out) basis, the virtual machine’s network driver can run out of receive (Rx) buffer memory when receiving large amounts of network traffic. This degrades the virtual machine’s network performance. ‘esxtop’ might show that the receive packets are dropped at the virtual switch, but they are actually dropped between the virtual switch and the guest operating system driver.
The number of dropped packets can be reduced by increasing the Rx buffers for the virtual network driver.
E1000 Virtual Network Driver
For the E1000 virtual network driver in a Linux guest operating system, Rx buffers can be modified from the guest operating system in exactly the same way as on a physical machine. The default value is 256, and the maximum value that can be manually configured is 4096. Determine an appropriate setting by experimenting with different buffer sizes.
To change the setting, run the command:
ethtool -G ethX rx value
For the Intel PRO driver in Windows, Receive Buffers can be modified from the guest operating system in exactly the same way as on a physical machine. The default value for the inbox driver on Windows 2008 R2 is 256, but this may vary depending on the driver used. To find an appropriate setting, load the Intel PRO driver in the guest operating system, modify Receive Buffers in the driver’s properties, and experiment with different buffer sizes.
VMXNET3 Virtual Network Driver
The default Rx ring size is 256 and the maximum is 4096. The default setting can be modified from within the guest operating system. Example for a Linux guest operating system:
ethtool -G ethX rx value
Where X refers to the Ethernet interface ID in the guest operating system, and value refers to the new value for the Rx ring size.
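Since the advice is to experiment with the ring size rather than jump straight to the maximum, here is a minimal sketch of how that could look on a Linux guest. It reads the current and maximum Rx values from `ethtool -g` output and suggests doubling the ring, capped at the maximum. The function names and the doubling step are my own illustration, not something the KB article prescribes, and the interface name eth0 in the comments is an assumption.

```shell
#!/bin/sh
# Sketch: derive the next Rx ring size to try from `ethtool -g` output.
# The parsing functions read text on stdin, so they can be tried without a NIC.

# Print the pre-set maximum RX ring size.
rx_ring_max() {
    awk '/^Pre-set maximums:/          { section = "max" }
         /^Current hardware settings:/ { section = "cur" }
         section == "max" && $1 == "RX:" { print $2; exit }'
}

# Print the current RX ring size.
rx_ring_current() {
    awk '/^Current hardware settings:/ { section = "cur" }
         section == "cur" && $1 == "RX:" { print $2; exit }'
}

# Double the current size, capped at the maximum.
next_rx_ring() {
    cur=$1
    max=$2
    next=$((cur * 2))
    if [ "$next" -gt "$max" ]; then
        next=$max
    fi
    echo "$next"
}

# Possible use on a guest (eth0 is an assumed interface name):
#   out=$(ethtool -g eth0)
#   cur=$(printf '%s\n' "$out" | rx_ring_current)
#   max=$(printf '%s\n' "$out" | rx_ring_max)
#   ethtool -G eth0 rx "$(next_rx_ring "$cur" "$max")"
```

Stepping up one doubling at a time and watching the drop counters in between keeps the buffer memory no larger than the workload actually needs.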
Additionally, a Linux virtual machine enabled with Large Receive Offload (LRO) functionality on a VMXNET3 device might experience packet drops on the receiver side when the Rx ring #2 runs out of memory. This occurs when the virtual machine is handling packets generated by LRO.
As of ESXi 5.1 Update 3, the Rx ring #2 can be configured through the rx-jumbo parameter in ethtool. The maximum ring size for this parameter is 2048.
ethtool -G ethX rx-jumbo value
Where X refers to the Ethernet interface ID in the guest operating system, and value refers to the new value for the Rx ring #2.
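Because the Rx ring #2 only matters when LRO is in play, it can help to check the offload state first. The sketch below reads `ethtool -k` output to see whether large-receive-offload is on, and caps a requested rx-jumbo value at the 2048 maximum quoted above. The helper names are hypothetical and eth0 is again an assumed interface name.

```shell
#!/bin/sh
# Sketch: only touch Rx ring #2 when LRO is actually enabled.

# Succeed (exit 0) if `ethtool -k` output on stdin reports LRO as on.
lro_enabled() {
    awk '$1 == "large-receive-offload:" { state = $2 }
         END { if (state == "on") exit 0; exit 1 }'
}

# Cap a requested Rx ring #2 size at the 2048 maximum from the article.
clamp_rx_jumbo() {
    want=$1
    if [ "$want" -gt 2048 ]; then
        want=2048
    fi
    echo "$want"
}

# Possible use on a guest (eth0 is an assumed interface name):
#   if ethtool -k eth0 | lro_enabled; then
#       ethtool -G eth0 rx-jumbo "$(clamp_rx_jumbo 4096)"
#   fi
```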
In ESXi/ESX 3.x.x, you cannot configure the ring size of the VMXNET3 network interface card in Windows guest operating systems. In ESXi/ESX 4.0 Update 2 and later, you can configure the following parameters from the Device Manager (a Control Panel dialog box) in Windows guest operating systems: Rx Ring #1 Size, Rx Ring #2 Size, Tx Ring Size, Small Rx Buffers, and Large Rx Buffers.
The default value of the size of the first Rx ring, Rx Ring #1 Size, is 512. You can modify the number of Rx buffers separately using the Small Rx Buffers parameter. The default value is 1024.
For some workloads (for example, traffic that arrives in bursts), you might need to increase the size of the ring, while for others (for example, applications that are slow in processing receive traffic) you might increase the number of receive buffers. When jumbo frames are enabled, you might use a second ring, Rx Ring #2 Size. The default value of Rx Ring #2 Size is 32.
The number of large buffers that are used in both RX Ring #1 and #2 Sizes when jumbo frames are enabled is controlled by Large Rx Buffers. The default value of Large Rx Buffers is 768.
Note: This is true for Windows 2008, but for Windows Server 2012 and Windows 10 the defaults have been increased to 1024 for Rx Ring #1 and 64 for Rx Ring #2.
Source: VMware KB article 1010071.
So it is time to unleash the power of ‘esxtop’. But check out the output of ‘esxtop’ below.
The dropped packets metrics, %DRPTX and %DRPRX, are all 0, zero, NULL. So why is vRealize Operations reporting high packet loss? Besides that, the article only applies to VMware ESX(i) 3.x through 5.5.
I get high packet loss errors which might be caused by high traffic bursts, but my lab environment is almost in hibernation. ‘esxtop’ should report dropped receive packets (%DRPRX), but the counters are all 0, zero, NULL. What’s happening here? Let’s get to the bottom of this and get ‘vsish’ out of the VMware toolbox.
‘vsish’ (VMkernel Sys Info Shell) is a command like ‘esxtop‘ which runs in the ESXi shell and allows you to check advanced performance counters of the ESXi host and virtual machines running on it.
To start we need to know the name of the port group and the portID that the virtual machine is connected to. Launch ‘esxtop’ and switch to the network display (n) to get this information.
In this case we will use the Windows 8 virtual machine, which has portID 50331658 on port group DvSPortset-0.
Now exit ‘esxtop‘ (q) and start ‘vsish’.
Using ‘vsish’ is like navigating through a Unix filesystem tree: use cd to change to different folders, ls to list the content and cat to display the content. We select the port of the Windows 8 virtual machine by typing:
cd /net/portsets/DvSPortset-0/ports/50331658/
Use ‘cat status’ to show some details about the port’s configuration. This virtual machine is using an E1000 adapter on this port.
We now use ‘cat stats‘ to display the port statistics.
This shows that the virtual machine is experiencing high packet loss. Let’s get some more detail.
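To put a number on those counters, the port statistics can also be fetched non-interactively with `vsish -e get` and piped through a small filter. The sketch below is a rough illustration under a few assumptions: that the port node sits under /net/portsets/<portset>/ports/<portID>, that the stats output contains lines of the form `pktsRx:<n>` and `droppedRx:<n>`, and that the drop percentage is taken as dropped packets over received plus dropped. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: compute a receive drop percentage from vsish port stats on stdin.
# Assumes counters appear as `pktsRx:<n>` and `droppedRx:<n>` lines.
rx_drop_percent() {
    awk -F: '
        { gsub(/[ \t]/, "", $1) }          # strip indentation from the key
        $1 == "pktsRx"    { rx = $2 }
        $1 == "droppedRx" { dropped = $2 }
        END {
            total = rx + dropped
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * dropped / total
        }'
}

# Possible use in the ESXi shell, with the portset and portID from above:
#   vsish -e get /net/portsets/DvSPortset-0/ports/50331658/stats | rx_drop_percent
```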
For E1000 adapters:
cd e1000
cat rxQueueStats
And for VMXNET3 adapters:
cd vmxnet3
cat rxSummary
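With a mix of E1000 and VMXNET3 machines, it gets tedious to cd around for every port. A small helper can build the right node for each adapter type so it can be read in one go with `vsish -e get`. This is only a sketch: the function name is hypothetical, and it assumes the port nodes live under /net/portsets/<portset>/ports/<portID>.

```shell
#!/bin/sh
# Sketch: build the vsish node that holds Rx details for an adapter type.
rx_detail_node() {
    portset=$1
    port_id=$2
    adapter=$3
    case "$adapter" in
        e1000)   leaf="e1000/rxQueueStats" ;;
        vmxnet3) leaf="vmxnet3/rxSummary" ;;
        *)
            echo "unsupported adapter: $adapter" >&2
            return 1
            ;;
    esac
    echo "/net/portsets/$portset/ports/$port_id/$leaf"
}

# Possible use in the ESXi shell:
#   vsish -e get "$(rx_detail_node DvSPortset-0 50331658 e1000)"
```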
The strange thing is, why aren’t these values showing that the E1000 and VMXNET3 adapters are running out of buffers?
What is causing the high packet loss? Well, actually nothing is.
I found VMware KB article 2052917, ‘vCenter Server 5.1/5.5/6.0 performance charts report dropped network packets’. The symptoms described: the Network Performance Charts in vCenter Server 5.1, 5.5 and 6.0 show dropped packets under the Receive/Transmit Packets Dropped counter.
According to the article, this issue occurs when packets filtered by the I/O chain are incorrectly recorded as dropped packets. This is a reporting issue: the packets are not actually dropped, so the drops cannot be seen using ‘esxtop’ or other network monitoring tools. Because vRealize Operations gets its information from vCenter, it repeats this false positive.
This is a cosmetic issue and does not indicate an actual network problem. It is a known issue affecting ESXi/vCenter Server 5.1, 5.5 and 6.0. There is a patch for VMware ESXi 5.1 and ESXi 5.5, but currently there is no resolution for ESXi/vCenter Server 6.0.