vSphere 6 experiencing high packet loss
Since I have updated my lab environment to vSphere 6, I regularly get ‘Virtual machine is experiencing high number of received packets dropped’ messages in vRealize Operations for all virtual machines in my environment.
I canceled these alerts multiple times but the high packet loss errors always return. So, time to fix this issue, but the outcome is as strange as it is positive.
Increase buffers in VM guest OS to fix high packet loss
Searching for answers, I quickly found articles indicating high packet loss at the guest OS level on VMXNET3 adapters from vSphere 4.x through 6.x.
According to the articles the issue occurs when packets are dropped during high traffic bursts because of a lack of receive and transmit buffer space or when receive traffic is speed-constrained, as, for example, with a traffic filter.
To resolve this issue, slowly increase the number of buffers in the guest operating system.
To reduce burst traffic drops in Windows, adjust the buffer settings:
- Click Start > Control Panel > Device Manager.
- Right-click vmxnet3 and click Properties.
- Click the Advanced tab.
- Click Small Rx Buffers and increase the value. The default value is 512 and the maximum is 8192.
- Click Rx Ring #1 Size and increase the value. The default value is 1024 and the maximum is 4096.
Source: VMware KB article 2039495.
Objections
Although the symptoms described look similar to what I’m experiencing, the articles also mention that the packet loss occurs during periods of very high traffic bursts. My lab environment does not experience huge workloads or high traffic bursts.
Besides that, the articles mention trouble when using VMXNET3 adapters, but my lab environment is a mix of VMXNET and E1000 adapters on both Windows and Linux virtual machines.
Esxtop shows dropped receive packets (%DRPRX)
Investigating further, I found an article about solving issues when ‘esxtop’ shows dropped receive packets (%DRPRX) at the virtual switch. This article covers not only VMXNET3 but also E1000, and not only Windows but also Linux and Solaris.
Because ESXi handles packets on a FIFO (first in, first out) basis, the virtual machine’s network driver can run out of receive (Rx) buffer memory when receiving large amounts of network traffic. This degrades virtual machine network performance. ‘Esxtop’ might show that the receive packets are dropped at the virtual switch, but they are actually dropped between the virtual switch and the guest operating system driver.
The number of dropped packets can be reduced by increasing the Rx buffers for the virtual network driver.
E1000 Virtual Network Driver
Linux
For the E1000 virtual network driver in a Linux guest operating system, Rx buffers can be modified from the guest operating system in exactly the same way as on a physical machine. The default value is 256, and the maximum value that can be manually configured is 4096. Determine an appropriate setting by experimenting with different buffer sizes.
To change the buffer size, run the command:
ethtool -G ethX rx value
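For instance, assuming a hypothetical interface eth0, you could first check the current and maximum ring sizes and then step the value up gradually while watching whether the drops disappear:
# Show the current and maximum supported ring parameters
ethtool -g eth0
# Double the Rx ring from the default of 256 as a first step (the E1000 maximum is 4096)
ethtool -G eth0 rx 512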
Windows
For the Intel PRO driver in Windows, Receive Buffers can be modified from the guest operating system in exactly the same way as on a physical machine. The default value for the inbox driver on Windows 2008 R2 is 256, but this may vary depending on the driver used. To determine the appropriate setting, experiment with different buffer sizes: load the Intel PRO driver in the guest operating system and modify the Receive Buffers value in the driver’s properties.
VMXNET3 Virtual Network Driver
Linux
The default Rx ring size is 256 and the maximum is 4096. The default setting can be modified from within the guest operating system. Example for a Linux guest operating system:
ethtool -G ethX rx value
Where X refers to the Ethernet interface ID in the guest operating system, and value refers to the new value for the Rx ring size.
Additionally, a Linux virtual machine enabled with Large Receive Offload (LRO) functionality on a VMXNET3 device might experience packet drops on the receiver side when the Rx ring #2 runs out of memory. This occurs when the virtual machine is handling packets generated by LRO.
As of ESXi 5.1 Update 3, the Rx ring #2 can be configured through the rx-jumbo parameter in ethtool. The maximum ring size for this parameter is 2048.
ethtool -G ethX rx-jumbo value
Where X refers to the Ethernet interface ID in the guest operating system, and value refers to the new value for the Rx ring #2.
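As an illustrative sketch (the interface name and values are placeholders, not recommendations), both rings of a VMXNET3 interface eth0 could be raised like this:
# Increase Rx ring #1 (default 256, maximum 4096)
ethtool -G eth0 rx 1024
# Increase Rx ring #2, used for LRO traffic (maximum 2048; requires ESXi 5.1 Update 3 or later)
ethtool -G eth0 rx-jumbo 2048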
Windows
In ESXi/ESX 3.x.x, you cannot configure the ring size of the VMXNET3 network interface card in Windows guest operating systems. In ESXi/ESX 4.0 Update 2 and later, you can configure the following parameters from the Device Manager (a Control Panel dialog box) in Windows guest operating systems: Rx Ring #1 Size, Rx Ring #2 Size, Tx Ring Size, Small Rx Buffers, and Large Rx Buffers.
The default value of the size of the first Rx ring, Rx Ring #1 Size, is 512. You can modify the number of Rx buffers separately using the Small Rx Buffers parameter. The default value is 1024.
For some workloads (for example, traffic that arrives in bursts), you might need to increase the size of the ring, while for others (for example, applications that are slow in processing receive traffic) you might increase the number of receive buffers. When jumbo frames are enabled, a second ring, Rx Ring #2 Size, is used. The default value of Rx Ring #2 Size is 32.
The number of large buffers that are used in both RX Ring #1 and #2 Sizes when jumbo frames are enabled is controlled by Large Rx Buffers. The default value of Large Rx Buffers is 768.
Note: These defaults apply to Windows 2008; for Windows Server 2012 and Windows 10 they have been increased to 1024 for Rx Ring #1 and 64 for Rx Ring #2.
Source: VMware KB article 1010071.
Objections
So it is time to unleash the power of ‘esxtop’. But check out the output of ‘esxtop’ below.
The dropped packets metrics, %DRPTX and %DRPRX, are all 0, zero, NULL. So why is vRealize Operations reporting high packet loss? Besides that, the article only applies to VMware ESX(i) 3.x through 5.5.
What’s next?
I get high packet loss errors which might be caused by high traffic bursts, but my lab environment is almost in hibernation. ‘Esxtop’ should report dropped receive packets (%DRPRX), but the counters are all 0, zero, NULL. What’s happening here? Let’s get to the bottom of this and get ‘vsish’ out of the VMware toolbox.
‘vsish’ (VMkernel Sys Info Shell) is a command like ‘esxtop’ which runs in the ESXi shell and allows you to check advanced performance counters of the ESXi host and the virtual machines running on it.
If you’re not familiar with ‘vsish’, check out these articles from William Lam: ‘What is VMware vsish?’ and ‘What’s new in VMware vsish for ESXi 5?’.
To start we need to know the name of the port group and the portID that the virtual machine is connected to. Launch ‘esxtop’ and switch to the network display (n) to get this information.
In this case we will use the Windows 8 virtual machine, which has portID 50331658 on port group DvsPortset-0.
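As a side note, on recent ESXi builds the ‘net-stats’ utility in the ESXi shell can list the same port IDs together with the portset and client (VM) names, which saves reading the interactive ‘esxtop’ screen:
# List all virtual switch ports with their port ID, switch name and client name
net-stats -l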
Now exit ‘esxtop’ (q) and start ‘vsish’.
Using ‘vsish’ is like navigating through a Unix filesystem tree: use cd to change to different folders, ls to list the contents and cat to display the content. We select the Windows 8 virtual machine’s port by typing:
cd /net/portsets/DvsPortset-0/ports/50331658/
Use ‘cat status’ to show some details about the port’s configuration. This virtual machine is using an E1000 adapter on this port.
We now use ‘cat stats’ to display the port statistics.
cat stats
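As a side note, the same statistics can be pulled non-interactively with ‘vsish -e get’, which is handy for scripting; the path below reuses the portset and port ID we looked up in ‘esxtop’:
vsish -e get /net/portsets/DvsPortset-0/ports/50331658/stats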
This shows that the virtual machine is experiencing high packet loss. Let’s get some more detail. For the E1000 adapter:
cd e1000
cat rxQueueStats
For VMXNET3 adapters:
cd vmxnet3
cat rxSummary
The strange thing is, why aren’t these values showing that the E1000 and VMXNET3 adapters are running out of buffers?
What is causing the high packet loss? Well, actually nothing is.
I found VMware KB article 2052917, ‘vCenter Server 5.1/5.5/6.0 performance charts report dropped network packets’. The symptoms described, ‘The Network Performance Charts in vCenter Server 5.1, 5.5 and 6.0 show dropped packets under the Receive/Transmit Packets Dropped counter’, match my situation exactly.
According to the article, this issue occurs when packets filtered by the I/O chain are incorrectly recorded as dropped packets. This is a reporting issue, the packets are not dropped, therefore they cannot be seen using ‘esxtop’ or other network monitoring tools. Because vRealize Operations is getting the information from vCenter it is repeating this false positive.
This is a cosmetic issue and does not indicate an actual network problem. This is a known issue affecting ESXi/vCenter Server 5.1, 5.5 and 6.0. There is a patch for VMware ESXi 5.1 and ESXi 5.5, but currently there is no resolution for ESXi/vCenter Server 6.0.
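If you want to convince yourself in your own environment that the drops are purely a reporting artifact, a small sketch like the one below (reusing the example portset and port ID from earlier) samples the droppedRx counter twice; if it keeps climbing while the adapter’s buffer counters stay at zero, you are looking at this cosmetic issue:
# Sample the port's droppedRx counter twice, one minute apart (ESXi shell)
PORT=/net/portsets/DvsPortset-0/ports/50331658
vsish -e get $PORT/stats | grep droppedRx
sleep 60
vsish -e get $PORT/stats | grep droppedRx
# For a VMXNET3 adapter, confirm the guest never ran out of buffers
vsish -e get $PORT/vmxnet3/rxSummary | grep "running out of buffers"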
Comments
Excellent post Erik. Thanks for sharing the details of how to prove the issue is cosmetic only. The VMware KB doesn’t give any details, so it’s nice to see how to break it down.
You’re welcome James. It took me a while to figure it out. But with the help of the vROPS admin guide, some VMware KB articles and William Lam’s info on vsish I managed to get it figured out. Always nice to write it down for future reference and to help others.
Great post! I was looking for this exact issue. Thank you!
Thank you! I just upgraded to 6.0U2 and this was driving me nuts.
My only worry is that the “Running out of buffers” count is 555, but the “# of times 1st and 2nd ring is full” counts are still 0.
I hate it when vendors call their monitoring stat bugs cosmetic. It’s not cosmetic if a key indicator that you need to proactively monitor your system is totally unusable.
I encourage anyone who sees a vendor labeling a broken monitoring stat as cosmetic (you see this with Cisco a lot too) to let them know that it’s not cosmetic if you can’t monitor something.
A cosmetic issue in my mind is not being able to change colors on your graphs or something like that; it’s not cosmetic if you can no longer monitor whether you are dropping packets because of microbursts of traffic.
This easily understandable write-up gets us to clarity on %DRPRX.
However, I would like to know: what is the best number between the default and the maximum?
Does this depend on our environment/application or on the VM? Suggestions appreciated (as it helps us). Thanks.
A patch has now been released by VMware for vSphere 6.0 as well.
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2145667
I am a bit confused. My hosts are running latest 6.0 Express Patch 8 build 5224934 which should include this fix, but using your excellent write-up I still show tons of dropped Rx packets:
/net/portsets/DvsPortset-0/ports/67108874/> cat stats
packet stats {
pktsTx:25565332769
pktsTxMulticast:74209
pktsTxBroadcast:166556
pktsRx:55042947626
pktsRxMulticast:880461
pktsRxBroadcast:12299925
droppedTx:0
droppedRx:19865596
}
yet the vmxnet3 shows no issues with rings or buffers:
/net/portsets/DvsPortset-0/ports/67108874/vmxnet3/> cat rxSummary
stats of a vmxnet3 vNIC rx queue {
LRO pkts rx ok:0
LRO bytes rx ok:0
pkts rx ok:275048337
bytes rx ok:393482490492
unicast pkts rx ok:274994023
unicast bytes rx ok:393478716583
multicast pkts rx ok:1666
multicast bytes rx ok:112462
broadcast pkts rx ok:52648
broadcast bytes rx ok:3661447
running out of buffers:0
pkts receive error:0
# of times the 1st ring is full:0
# of times the 2nd ring is full:0
fail to map a rx buffer:0
request to page in a buffer:0
# of times rx queue is stopped:0
failed when copying into the guest buffer:0
# of pkts dropped due to large hdrs:0
# of pkts dropped due to max number of SG limits:0
}
Any thoughts as to the discrepancy?
Hi Erik,
I know this post is a bit old, but have you seen this at all in ESXi 6.5? I started using netdata, which was showing me dropped packets within my VMs. I followed the same path you did and increased the buffer settings on the guest NICs, but would continue to get alerts.
Next I checked esxtop, and I saw the same things you did; vsish also shows minimal drops.
For instance, for one of my VMs:
pktsTx:21057400
pktsTxMulticast:23292
pktsTxBroadcast:10938
pktsRx:30430174
pktsRxMulticast:45445
pktsRxBroadcast:73678
droppedTx:0
droppedRx:5
/net/portsets/vSwitch1/ports/50331653/vmxnet3/> cat rxSummary
stats of a vmxnet3 vNIC rx queue {
LRO pkts rx ok:6066366
LRO bytes rx ok:63931520866
pkts rx ok:30506444
bytes rx ok:90319505634
unicast pkts rx ok:30387273
unicast bytes rx ok:90300398338
multicast pkts rx ok:45459
multicast bytes rx ok:14152432
broadcast pkts rx ok:73712
broadcast bytes rx ok:4954864
running out of buffers:0
pkts receive error:0
1st ring size:4096
2nd ring size:128
# of times the 1st ring is full:0
# of times the 2nd ring is full:0
fail to map a rx buffer:0
request to page in a buffer:0
# of times rx queue is stopped:0
failed when copying into the guest buffer:0
# of pkts dropped due to large hdrs:0
# of pkts dropped due to max number of SG limits:0
pkts rx via data ring ok:0
bytes rx via data ring ok:0
Whether rx burst queuing is enabled:0
current backend burst queue length:0
maximum backend burst queue length so far:0
aggregate number of times packets are requeued:0
aggregate number of times packets are dropped by PktAgingList:0
}
So it looks like the same thing, correct?
The bigger question I have is: is it because of the false reporting that the guest OSes then believe they are seeing dropped packets when in reality they are not?
Thanks!
Hi Marcus, looks like the same issue indeed.
“Is it because of the false reporting that the guest OSes then believe they are seeing dropped packets when in reality they are not?” Correct. If it is the exact same issue, then it is a reporting issue; the packets are not dropped.