Hello,
After two weeks of testing and firmware patching, I think we have found a major bug in the ESX 5.1 OFED 1.8.1.0 IPoIB driver. We are currently running on a Fujitsu RX300 S6 (dual Xeon X5670) with a Mellanox ConnectX-2 MHRH2A (firmware 2.9.1200). The storage server runs Ubuntu 12.04 LTS with an older ConnectX (PCIe Gen2) card and Linux kernel 3.5. In between sits a 24-port DDR Flextronics IB CX4 switch, so our maximum MTU is limited to 2K, but that is not a problem for us.
On the ESX host, the InfiniBand card serves as a VMkernel interface and as a VM port group at the same time. A running VM has its "local" disks mounted over the VMkernel interface via IPoIB. Inside the VM we have mounted an NFS filesystem from the NFS server. So it looks like this:
vm:~ # df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 61927388 3577888 55203784 7% / (mounted by ESX)
10.10.30.253:/var/nas/backup 11007961088 6360753152 4647207936 58% /backup (mounted inside VM)
To reproduce the error, we copy data into the VM using SCP with /backup as the target. After some gigabytes have been copied, the InfiniBand card stops working and the ESX kernel logs the following error message. The situation cannot be resolved without rebooting the ESX host.
WARNING: LinDMA: Linux_DMACheckContraints:149:Cannot
map machine address = 0x15ffff37b0, length = 65160
for device 0000:02:00.0; reason = buffer straddles
device dma boundary (0xffffffff)
<3>vmnic_ib1:ipoib_send:504: found skb where it does not belong
tx_head = 323830, tx_tail =323830
<3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0
Backtrace for current CPU #20, worldID=8212, ebp=0x41220051b028
ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c4524aa, 0x4f0f5000000d
ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c44bca8, 0x41000fe5d6c0
ipoib_start_xmit@<None>#<None>+0x53 stack: 0x41220051b238, 0x41800c4
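As a sanity check on the warning above, here is a minimal Python sketch of the arithmetic. It assumes that the "dma boundary (0xffffffff)" is a segment boundary mask, meaning a single DMA buffer must not cross a 4 GiB-aligned address. That reading is my interpretation of the message, not something confirmed from the driver source, and the function names are made up for illustration.

```python
def straddles_boundary(addr: int, length: int, boundary_mask: int) -> bool:
    """True if the buffer [addr, addr + length) crosses a boundary
    aligned to (boundary_mask + 1) bytes."""
    first_segment = addr & ~boundary_mask
    last_segment = (addr + length - 1) & ~boundary_mask
    return first_segment != last_segment


def bytes_until_boundary(addr: int, boundary_mask: int) -> int:
    """Bytes from addr up to the next aligned boundary."""
    return (boundary_mask + 1) - (addr & boundary_mask)


# Values copied from the LinDMA warning in the log.
ADDR = 0x15ffff37b0
LENGTH = 65160
MASK = 0xffffffff  # assumed: buffers may not cross a 4 GiB boundary

print(straddles_boundary(ADDR, LENGTH, MASK))  # True
print(bytes_until_boundary(ADDR, MASK))        # 51280
```

Under that assumption the numbers are consistent with the log: only 51280 bytes remain before the next 4 GiB boundary at 0x1600000000, but the buffer is 65160 bytes long, so it crosses the boundary and the mapping is refused.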
While trying to eliminate the error, we tried the following (without success):
1) Updated the server's firmware to the latest version
2) Switched from ConnectX to ConnectX-2 card
3) Switched from firmware 2.9.1000 to 2.9.1200
Everything works fine if we use the InfiniBand card only as a VMkernel interface. More details in my first post:
Any help is appreciated.