Support Alert: Issue with IBM hardware running ESX/ESXi 4.1 (update)
On November 12th, VMware issued a Support Alert regarding an issue that affects users running ESX/ESXi 4.1 on IBM hardware.
The symptoms are mentioned below.
When using IBM x3650 M3 or BladeCenter HS22V servers, you may experience these symptoms:
- HBAs stop responding
- Other PCI devices may also stop responding
- You see an illegal vector shortly before an HBA stops responding to the driver. For example:
vmkernel: 6:01:34:46.970 cpu0:4120)ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40
- The HBA stops responding to commands. For example:
vmkernel: 6:01:42:36.189 cpu15:4274)<6>qla2xxx 0000:1a:00.0: qla2x00_abort_isp: **** FAILED ****
vmkernel: 6:01:47:36.383 cpu14:4274)<4>qla2xxx 0000:1a:00.0: Failed mailbox send register test
- The HBA card gets marked offline. For example:
vmkernel: 6:01:47:36.383 cpu14:4274)<4>qla2xxx 0000:1a:00.0: ISP error recovery failed - board disabled
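To check whether a host has already logged any of these symptoms, the messages quoted above can be searched for in the vmkernel log. The following is only a minimal sketch: the function name scan_vmkernel_log is made up for illustration, and the default log path /var/log/vmkernel is an assumption that may not match your host, so pass the correct log file as the first argument.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that match the symptom messages
# quoted in this article (APIC alert, failed ISP abort, failed mailbox
# test, board disabled).
# The default path below is an assumption; pass the real log file as $1.
scan_vmkernel_log() {
    log_file="${1:-/var/log/vmkernel}"
    grep -E 'ALERT: APIC: .*ESR = 0x|qla2x00_abort_isp: .*FAILED|Failed mailbox send register test|ISP error recovery failed' "$log_file"
}
```

For example, after sourcing the function, run scan_vmkernel_log /path/to/your/vmkernel.log. No output means none of the quoted messages were found in that file; it does not prove the host is unaffected.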
The issue is currently under investigation by VMware engineering. At this time, downgrading to ESX/ESXi 4.0 by performing a fresh install is the only resolution.
VMware has created Knowledge Base article 1030265, "HBAs and other PCI devices may stop responding in ESX 4.1 when using IBM servers". The KB article may be updated as new information becomes available. Bookmark the KB, or subscribe to its RSS feed to receive updates.
Update 17-11-2010: VMware found a workaround for this problem.
Affected servers:
- IBM Server x3650 M3
- IBM BladeCenter HS22V
To disable interrupt remapping, use either of these methods:
- Run the commands:
# esxcfg-advcfg -k TRUE iovDisableIR
# reboot
To check if interrupt mapping is set after the reboot, run the command:
# esxcfg-info -c
- In vSphere Client:
  - Click Configuration > (Software) Advanced Settings > VMkernel.
  - Select VMkernel.Boot.iovDisableIR and click OK.
  - Reboot the ESX host.
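The check command above prints a lot of configuration data, so it can help to narrow the output to the setting of interest. This is a small sketch only: the function name filter_iov_disable_ir is hypothetical, and it assumes the option name iovDisableIR appears somewhere in the output, without assuming anything else about the output format.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that mention iovDisableIR.
# Reads from standard input, so it can sit at the end of a pipe.
# Assumption: the option name appears in the piped output.
filter_iov_disable_ir() {
    grep -i 'iovDisableIR'
}
```

On the host this would be used as: esxcfg-info -c | filter_iov_disable_ir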
VMware has updated the KB with a resolution: disable interrupt remapping. (See the KB for detailed info and a procedure.)
Thanks for the heads up Rob. I updated the article with the new info.
It’s not really a resolution. A resolution would involve making interrupt remapping work on this hardware, not disabling it.
thank you for this!!
By now you probably know that this is not specific to IBM servers. The issue was with VMware’s idiosyncratic implementation of Intel VT-d technology, and it affected many servers other than IBM. To their credit, it appears that IBM were first to identify the issue. You should revise this article accordingly; it is misleading as written.