vSphere network troubleshooting

During the last month I have been very busy building a new infrastructure at a client site. I’m responsible for the overall technical solution and its basis: a VMware vSphere infrastructure built on five Dell PowerEdge R805s, Dell EqualLogic PS5000 and PS6000 storage, and Cisco switches for LAN, DMZ and IP storage networking.

Just before the customer started their functional test period, we discovered that overall Windows network performance was slow. We ran several tests, such as copying an 8 GB file from local vmdk to local vmdk and from VM to VM, and found that storage performance was not the issue but network performance was very slow.

In the years I have been working with virtualization I have always been a fan of a static network configuration. Meaning: when I configure ESX networking, I like my network interfaces and physical switch ports to be fixed at 1000 Mb/s full duplex if the switch/network interface combination allows it. The idea is that if you purchase gigabit network interfaces and switches, you know the maximum speed, so you configure everything to run at its maximum capacity, eliminating negotiation overhead and using as much bandwidth as possible purely for data transfer.
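For reference, this is roughly how I pin speed and duplex on both ends. This is a sketch, not a recipe: vmnic0 and GigabitEthernet0/1 are example names, so adjust them to your own hosts and switch ports.

```
# On the ESX host service console: list the physical NICs,
# then force vmnic0 to 1000 Mb/s full duplex
esxcfg-nics -l
esxcfg-nics -s 1000 -d full vmnic0

! On the Cisco switch port the NIC is patched into
interface GigabitEthernet0/1
 speed 1000
 duplex full
```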

So when we experienced slow network performance, I had a colleague check the Cisco LAN switches for errors, drops, packet loss or any other flaw that might indicate a speed or duplex mismatch. None were found, so I assumed that the network configuration was not the issue. But as we know by now: ‘Assumption is the mother of all fuck-ups!’
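For anyone who wants to run the same check: these are the kind of Cisco IOS commands we used to look for trouble on the ports (the interface name is an example):

```
show interface GigabitEthernet0/1   ! negotiated speed/duplex, drops, CRC and input/output errors
show interfaces counters errors     ! error counters for all ports at a glance
```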

To rule out Windows SMB issues I tested the network performance with IOMeter against a remote share, and indeed, network performance was still very slow. With the IOMeter settings I used to stress the IP storage network (95-105 MB/s), I only managed to achieve 15-20% load on my gigabit connection, which equals 18-25 MB/s. Why couldn’t I achieve numbers equal to the IP storage network? OK, the IP storage network uses jumbo frames and has some storm control and flow control settings, but that didn’t justify the huge difference.
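To put those percentages into MB/s: a gigabit link tops out at 125 MB/s, so the back-of-the-envelope math looks like this:

```shell
# Line rate of a gigabit link in MB/s (1000 Mbit/s divided by 8 bits per byte)
line_rate=$((1000 / 8))            # 125 MB/s
# 15-20% utilisation on that link works out to:
low=$((line_rate * 15 / 100))      # 18 MB/s
high=$((line_rate * 20 / 100))     # 25 MB/s
echo "${low}-${high} MB/s"
```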

Then I found that DRS had moved one of my test servers to a different ESX host. To rule out the physical network and other load on the ESX host, I moved both servers to a free ESX host and was amazed by the network performance…

When VMotion moved my test servers to the free ESX host, network performance went from 15% to 65%. This meant that the problem was in the physical network. To substantiate my findings I moved the test server around from one ESX host to another, which resulted in the network graph on the left.

So I had pinned it on the physical network, but as mentioned above, the switch didn’t show any errors, drops, retransmits or packet loss whatsoever. I immediately checked the settings on the physical switch ports because I suspected they were set to auto/auto, but this wasn’t the case; the switch ports were fixed at 1000 Mb/s full duplex.

Just for argument’s sake I decided to configure the ESX physical network interfaces with the Autonegotiate setting, and again I was amazed. Network performance improved to more than 60%, as shown in the network performance graph on the right.
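For completeness: switching a physical NIC back to autonegotiation from the service console is a one-liner (vmnic0 is again an example name; the same change can be made in the vSphere Client under the vSwitch NIC properties):

```
esxcfg-nics -a vmnic0
esxcfg-nics -l        # verify the negotiated speed/duplex afterwards
```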

To check the effect of this setting I VMotion-ed the servers around for a while, which consistently gave me the same results, as shown in the network graph below: 60-65% load on the network interface when the test servers were on different ESX hosts and 70-80% load when they were on the same ESX host.

So, bottom line: when network performance is slow, check and change your network interface speed and duplex settings, and do NOT rely on the switch-port statistics alone.

Now we know that the vSphere infrastructure is capable of network speeds up to 100 MB/s.

Next up is solving the 33 MB/s bottleneck in Windows. Although we can achieve speeds up to 100 MB/s with IOMeter, a simple Windows file copy never exceeds 33 MB/s. We tried changing the TCP auto-tuning level, Chimney offload and receive-side scaling settings, but with no result. When we start multiple file copies, the combined speed goes up to 70 MB/s, so it looks like Windows is limited to roughly 33 MB/s per copy stream.
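For reference, these are the netsh knobs we toggled (Windows Server 2008 syntax; the values shown are examples of what we tried, not a recommendation, and none of them made a difference for us):

```
netsh interface tcp set global autotuninglevel=normal
netsh interface tcp set global chimney=disabled
netsh interface tcp set global rss=enabled
netsh interface tcp show global
```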

We haven’t been able to fix this yet, so any help, comments, hints and tips are welcome!