Bad network performance on new ESX host
At a client site we came across a problem with Windows 2003 VMs. They suffered from poor network performance when we moved them to a newly formed ESX cluster consisting of HP 460c G6 blades. In some cases, logging on to a server through a remote session took about 20 minutes.
As mentioned, this only occurred when we moved a VM to the new cluster, but newly installed VMs showed the same problem when running on the new cluster. We use Altiris to install and configure new VMs, so a colleague decided to install a new VM by manually going through the steps that Altiris would normally perform. He found that performance dropped significantly after the activation of a security template.
He also found that resetting the IP stack with the command netsh int ip reset c:\resetlog.txt made the problem disappear. At that point we concluded that the activation of the security template had changed something in the IP stack, which resulted in low network performance.
When I checked the security template I found several keys that it added to the registry, all of them related to the IP stack. With these keys I started a trial-and-error process (long live snapshots :) ), removing them one by one to see which key or combination of keys was causing the problem.
With just a few tries I found that the following key was the cause of our problem:
EnablePMTUDiscovery = 0
Removing this key or changing its value to “1” allowed Windows to use an MTU (Maximum Transmission Unit) of 1500 bytes again. Normally the key is not present, so Windows defaults to the 1500-byte MTU. However, the security template used by the client dated from the time when Windows 2000 was the standard, and according to this article from Microsoft it was recommended to set the key to “0” for Windows 2000, which limits the MTU to 576 bytes for connections to hosts outside the local subnet.
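To check a VM for this setting without opening the registry editor, something like the following Python sketch could be used. It assumes Python 3 is available on the guest (not part of our setup) and that the value lives under the standard Tcpip\Parameters key; the function names are made up for this illustration.

```python
import sys

# Standard registry location of the Windows TCP/IP parameters (under HKLM).
TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "EnablePMTUDiscovery"


def describe_pmtu_setting(value):
    """Interpret the EnablePMTUDiscovery value; None means 'not present'."""
    if value is None:
        return "not set: path MTU discovery enabled (Windows default)"
    if value == 0:
        return "disabled: MTU of 576 bytes for hosts off the local subnet"
    return "enabled: path MTU discovery active"


def read_pmtu_setting():
    """Read the value from the registry; returns None if it is absent."""
    import winreg  # Windows-only standard library module

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
        except FileNotFoundError:
            # The value does not exist, which is the normal situation.
            return None


# Only touch the registry when actually running on Windows.
if __name__ == "__main__" and sys.platform == "win32":
    print(describe_pmtu_setting(read_pmtu_setting()))
```

The interpretation is kept separate from the registry access so it can be checked on any machine; only read_pmtu_setting needs Windows.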
The article describes setting this key to “0” for the following reason:
‘Setting this value to 1 (the default) forces TCP to discover the maximum transmission unit or largest packet size over the path to a remote host. An attacker can force packet fragmentation, which overworks the stack. Specifying 0 forces the MTU of 576 bytes for connections from hosts not on the local subnet.’
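To get a feel for what a 576-byte MTU means compared to 1500 bytes, here is a small back-of-the-envelope calculation in Python. It assumes plain IPv4 and TCP headers of 20 bytes each without options, and the 1 MiB payload is an arbitrary example. It only illustrates the general per-packet overhead and is not a model of the problem we saw.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)


def segments_needed(payload_bytes, mtu):
    """Number of TCP segments needed to carry payload_bytes at a given MTU."""
    mss = mtu - IP_HEADER - TCP_HEADER  # maximum segment size
    return math.ceil(payload_bytes / mss)


payload = 1024 * 1024  # 1 MiB, an arbitrary example size
for mtu in (576, 1500):
    print(mtu, segments_needed(payload, mtu))
# prints:
# 576 1957
# 1500 719
```

Under these assumptions the smaller MTU needs roughly 2.7 times as many packets for the same amount of data.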
With Windows 2003, Microsoft has changed its recommendation: according to this article, it now recommends setting the key to “1”.
So now we knew what was causing the problem, but that still did not explain why it did not occur on the “older” ESX hosts (HP BL480c, BL680c and BL45p blades) but did on the new hosts. Since we had already logged a support call with VMware about this problem, we sent them our latest discovery, and shortly afterwards we got a reply (part of it is posted below):
‘A bug exists that relates to the same issue – it occurs on HP hardware with Broadcom network cards. It is due to be released as part of P23 – which should be out this month or next month I believe.’
The client decided that they are prepared to wait for the patch to come out and not change the registry key on all the current VMs.
Hopefully the patch will be released soon so we can install it and see if it resolves the problem. I will keep you posted.