Hello,
Hopefully this is the right place to report a VMware IPoIB datastore freeze. We are testing a new VMware ESX 5.1.0 setup. Unfortunately, we have only three ConnectX (gen 1) cards left; the other ConnectX-2 cards are in our production ESX 4.1 environment. We know this configuration is not officially supported yet, but I want to make sure the error will not occur when we upgrade the production machines.
The hardware is:
Server: Fujitsu RX300 S6 (dual Intel X5670)
HCA: ConnectX MT25418, firmware 2.9.1000
Hypervisor: ESX 5.1.0 build 1117900
Driver: Mellanox driver 1.8.1
When we copy data between VMs, the adapter suddenly freezes and the datastore is "lost". The vmkernel log then fills with endless repetitions of the following:
2013-07-22T17:22:48.775Z cpu10:8202)<3>vmnic_ib1:ipoib_send:504: found skb where it does not belongtx_head = 3827020, tx_tail =3827020
2013-07-22T17:22:48.775Z cpu10:8202)<3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0
2013-07-22T17:22:48.775Z cpu10:8202)Backtrace for current CPU #10, worldID=8202, ebp=0x41220029af68
2013-07-22T17:22:48.776Z cpu10:8202)0x41220029af68:[0x41802a310d59]ipoib_send@<None>#<None>+0x5d4 stack: 0xffffff, 0x0, 0x412410d4c948,
2013-07-22T17:22:48.777Z cpu10:8202)0x41220029b018:[0x41802a310d59]ipoib_send@<None>#<None>+0x5d4 stack: 0x41220029b088, 0x418029e0a55b
2013-07-22T17:22:48.777Z cpu10:8202)0x41220029b148:[0x41802a317160]ipoib_mcast_send@<None>#<None>+0xf7 stack: 0x41220029b188, 0x418029d
2013-07-22T17:22:48.778Z cpu10:8202)0x41220029b238:[0x41802a31dabf]ipoib_start_xmit@<None>#<None>+0x396 stack: 0x41220029b598, 0x412200
2013-07-22T17:22:48.778Z cpu10:8202)0x41220029b398:[0x41802a31ac3b]vmipoib_start_xmit@<None>#<None>+0x49a stack: 0x41000be0b880, 0x839e
2013-07-22T17:22:48.779Z cpu10:8202)0x41220029b468:[0x41802a16d8f0]DevStartTxImmediate@com.vmware.driverAPI#9.2+0x137 stack: 0x41220029
2013-07-22T17:22:48.779Z cpu10:8202)0x41220029b4d8:[0x418029d3470e]UplinkDevTransmit@vmkernel#nover+0x295 stack: 0x10787a40, 0x41220029
2013-07-22T17:22:48.780Z cpu10:8202)0x41220029b558:[0x418029dabbaa]NetSchedFIFORunLocked@vmkernel#nover+0x1a5 stack: 0xc0bd95300, 0x0,
2013-07-22T17:22:48.781Z cpu10:8202)0x41220029b5e8:[0x418029dabf57]NetSchedFIFOInput@vmkernel#nover+0x24e stack: 0x41220029b638, 0x4180
2013-07-22T17:22:48.781Z cpu10:8202)0x41220029b698:[0x418029dab0b2]NetSchedInput@vmkernel#nover+0x191 stack: 0x41220029b748, 0x41000bd9
2013-07-22T17:22:48.782Z cpu10:8202)0x41220029b738:[0x418029d3ced0]IOChain_Resume@vmkernel#nover+0x247 stack: 0x41220029b798, 0x418029d
2013-07-22T17:22:48.782Z cpu10:8202)0x41220029b788:[0x418029d2c0e4]PortOutput@vmkernel#nover+0xe3 stack: 0x41220029b808, 0x41802a216a2a
2013-07-22T17:22:48.783Z cpu10:8202)0x41220029b808:[0x41802a2254c8]TeamES_Output@<None>#<None>+0x16b stack: 0x0, 0x418029cc3879, 0x4122
2013-07-22T17:22:48.784Z cpu10:8202)0x41220029ba08:[0x41802a218047]EtherswitchPortDispatch@<None>#<None>+0x142a stack: 0xffffffff000000
2013-07-22T17:22:48.784Z cpu10:8202)0x41220029ba78:[0x418029d2b2c7]Port_InputResume@vmkernel#nover+0x146 stack: 0x410001553540, 0x41220
2013-07-22T17:22:48.785Z cpu10:8202)0x41220029baa8:[0x41802a3b95cb]TcpipTxDispatch@<None>#<None>+0x9a stack: 0x7c1f45, 0x41220029bad8,
2013-07-22T17:22:48.785Z cpu10:8202)0x41220029bb28:[0x41802a3ba118]TcpipDispatch@<None>#<None>+0x1c7 stack: 0x246, 0x41220029bb70, 0x41
2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bca8:[0x418029d0b245]WorldletProcessQueue@vmkernel#nover+0x4b0 stack: 0x41220029bd58, 0xb
2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bce8:[0x418029d0b895]WorldletBHHandler@vmkernel#nover+0x60 stack: 0x100000000000001, 0x41
2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bd68:[0x418029c2083a]BH_Check@vmkernel#nover+0x185 stack: 0x41220029be68, 0x41220029be08,
2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be68:[0x418029dbc9bc]CpuSchedIdleLoopInt@vmkernel#nover+0x13b stack: 0x41220029be98, 0x41
2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be78:[0x418029dc66de]CpuSched_IdleLoop@vmkernel#nover+0x15 stack: 0xa, 0x14, 0x41220029bf
2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be98:[0x418029c4f71e]Init_SlaveIdle@vmkernel#nover+0x49 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2013-07-22T17:22:48.788Z cpu10:8202)0x41220029bfe8:[0x418029ee26a6]SMPSlaveIdle@vmkernel#nover+0x31d stack: 0x0, 0x0, 0x0, 0x0, 0x0
Any help is appreciated.
Best regards.
Markus