ESX 5.1 IPoIB driver crash

Hello,

 

After two weeks of testing and firmware patching, I think we have found a major bug in the ESX 5.1 OFED 1.8.1.0 IPoIB driver. We are currently running a Fujitsu RX300 S6 (dual Xeon X5670) with a Mellanox ConnectX-2 MHRH2A (firmware 2.9.1200). The storage server runs Ubuntu 12.04 LTS with an older ConnectX (PCIe Gen2) card and Linux kernel 3.5. In between sits a 24-port DDR Flextronics IB CX4 switch, which limits our maximum MTU to 2K, but that is not a problem for us.

 

On the ESX host, the InfiniBand card serves as a VMkernel interface and as a VM port group at the same time. A running VM has its "local" disks mounted over the VMkernel interface via IPoIB. Inside the VM we have mounted an NFS filesystem from the NFS server. So it looks like this:

 

vm:~ # df

Filesystem           1K-blocks      Used Available Use% Mounted on

/dev/sda1             61927388   3577888  55203784   7% /  (mounted by ESX)

10.10.30.253:/var/nas/backup 11007961088 6360753152 4647207936  58% /backup (mounted inside VM)

 

To reproduce the error, we copy data into the VM using SCP with /backup as the target. After some gigabytes of data have been copied, the InfiniBand card stops working and the ESX kernel logs the following error message. The situation cannot be resolved without rebooting the ESX host.

 

WARNING: LinDMA: Linux_DMACheckContraints:149:Cannot

         map machine address = 0x15ffff37b0, length = 65160

         for device 0000:02:00.0; reason = buffer straddles

         device dma boundary (0xffffffff)

<3>vmnic_ib1:ipoib_send:504: found skb where it does not belong

                             tx_head = 323830, tx_tail =323830

<3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0

Backtrace for current CPU #20, worldID=8212, ebp=0x41220051b028

ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c4524aa, 0x4f0f5000000d

ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c44bca8, 0x41000fe5d6c0

ipoib_start_xmit@<None>#<None>+0x53 stack: 0x41220051b238, 0x41800c4
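
For readers following the numbers in the first warning, here is a small sketch of the arithmetic behind "buffer straddles device dma boundary". It assumes the boundary value 0xffffffff is a mask meaning that a single mapping must not cross a 4 GiB-aligned address; that interpretation and the helper name `straddles_boundary` are my assumptions and are not taken from the driver source.

```python
def straddles_boundary(addr: int, length: int, boundary_mask: int) -> bool:
    """Return True if the buffer [addr, addr + length) crosses an
    address that is aligned to (boundary_mask + 1)."""
    last = addr + length - 1
    # Clearing the bits covered by the mask leaves the "window" each
    # address falls into; different windows means a boundary was crossed.
    return (addr & ~boundary_mask) != (last & ~boundary_mask)


# Values copied from the warning in the log above.
ADDR = 0x15ffff37b0
LENGTH = 65160
BOUNDARY = 0xffffffff

last = ADDR + LENGTH - 1
print(hex(ADDR >> 32), hex(last >> 32))   # upper 32 bits of first/last byte
print(hex(last))
print(straddles_boundary(ADDR, LENGTH, BOUNDARY))
```

Under that assumption, the buffer begins just below a 4 GiB-aligned address and ends just above it (the upper 32 bits change from 0x15 to 0x16), which matches the reason given in the warning.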

 

While trying to eliminate the error, we tried the following (without success):

 

1) Updated the server's firmware to the latest version

2) Switched from a ConnectX to a ConnectX-2 card

3) Switched the card firmware from 2.9.1000 to 2.9.1200

 

Everything works fine if we use the InfiniBand card only as a VMkernel interface. More details are in my first post: http://community.mellanox.com/message/2270

 

Any help is appreciated.
