Hi,
we are experiencing intermittent kernel crashes with our storage fabric. All servers are running up-to-date CentOS 6.5 with ConnectX-3 HCAs, OFED 2.3-1.0.1 with firmware 2.32.5100, and a QDR link through an IB switch. The fileserver serves data from ZFS datapools with NFS over RDMA; speeds are acceptable and the mounts are stable. My guess is that the fileserver crashes (and reboots), which in turn disrupts the IB link and triggers kernel oopses on the clients, which then also crash and reboot. The servers are brand-new Dell PE R420 boxes.
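For completeness, the mounts are set up roughly like this (options from memory; hostname and export path below are placeholders, not our real ones):

  # on the fileserver: have nfsd listen on the RDMA port (20049 is the IANA-assigned one)
  echo rdma 20049 > /proc/fs/nfsd/portlist

  # on the clients: mount with the RDMA transport
  mount -t nfs -o vers=4,proto=rdma,port=20049 fileserver:/tank/export /mnt/export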
This has happened twice so far, with a different dmesg log each time. Here are the relevant bits from the fileserver (full text and client crash log attached):
<2>kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
<1>BUG: unable to handle kernel paging request at ffff881e53d16758
<1>IP: [<ffff881e53d16758>] 0xffff881e53d16758
<4>PGD 1a86063 PUD 8000001e400001e3
<4>Oops: 0011 [#1] SMP
<4>last sysfs file: /sys/kernel/mm/ksm/run
<4>CPU 11
<4>Pid: 46610, comm: stat Tainted: P --------------- 2.6.32-431.29.2.el6.x86_64 #1 Dell Inc. PowerEdge R420
--- snip ---
<4>Call Trace:
<4> <IRQ>
<4> [<ffffffffa0785a80>] ? rpcrdma_run_tasklet+0x60/0x90 [xprtrdma]
<4> [<ffffffff8107ab05>] tasklet_action+0xe5/0x120
<4> [<ffffffff8107a5f1>] __do_softirq+0xc1/0x1e0
<4> [<ffffffff810e6c60>] ? handle_IRQ_event+0x60/0x170
<4> [<ffffffff8100c30c>] call_softirq+0x1c/0x30
<4> [<ffffffff8100fa75>] do_softirq+0x65/0xa0
<4> [<ffffffff8107a4a5>] irq_exit+0x85/0x90
<4> [<ffffffff81532525>] do_IRQ+0x75/0xf0
<4> [<ffffffff8100b9d3>] ret_from_intr+0x0/0x11
<4> <EOI>
<4> [<ffffffffa06671a0>] ? rpc_release_client+0x0/0xa0 [sunrpc]
<4> [<ffffffffa0667295>] ? rpc_task_release_client+0x55/0x70 [sunrpc]
<4> [<ffffffffa066fc84>] rpc_release_resources_task+0x34/0x40 [sunrpc]
<4> [<ffffffffa0670774>] __rpc_execute+0x174/0x350 [sunrpc]
<4> [<ffffffff8109ae27>] ? bit_waitqueue+0x17/0xd0
<4> [<ffffffffa06709b1>] rpc_execute+0x61/0xa0 [sunrpc]
<4> [<ffffffffa06673a5>] rpc_run_task+0x75/0x90 [sunrpc]
<4> [<ffffffffa06674c2>] rpc_call_sync+0x42/0x70 [sunrpc]
<4> [<ffffffffa0728aae>] _nfs4_call_sync+0x3e/0x40 [nfs]
<4> [<ffffffffa0721065>] _nfs4_proc_statfs+0xa5/0xc0 [nfs]
<4> [<ffffffffa0724376>] nfs4_proc_statfs+0x56/0x80 [nfs]
<4> [<ffffffffa070d276>] nfs_statfs+0x66/0x1a0 [nfs]
<4> [<ffffffff811bd744>] statfs_by_dentry+0x74/0xa0
--- snip ---
And the other one:
<1>BUG: unable to handle kernel NULL pointer dereference at 00000000000001cc
<1>IP: [<ffffffff8152b95b>] _spin_lock_bh+0x1b/0x40
<4>PGD 0
<4>Oops: 0002 [#1] SMP
<4>last sysfs file: /sys/devices/virtual/block/dm-0/dm/uuid
<4>CPU 3
<4>Pid: 4198, comm: ib_cm/3 Tainted: P --------------- 2.6.32-431.29.2.el6.x86_64 #1 Dell Inc. PowerEdge R420
<4>RIP: 0010:[<ffffffff8152b95b>] [<ffffffff8152b95b>] _spin_lock_bh+0x1b/0x40
--- snip ---
<4>Process ib_cm/3 (pid: 4198, threadinfo ffff881027640000, task ffff8810257ce040)
<4>Stack:
<4> ffff881b2a2f6e98 00000000000001cc ffff881027641ce0 ffffffffa0661f0f
<4><d> 0000000000000000 ffff881a31bc67c0 ffff881027641cd0 ffff881f261c8800
<4><d> ffff881e49ae5c00 ffff881e39e78918 ffff8819b0a31470 ffffe8ffffa2ff88
<4>Call Trace:
<4> [<ffffffffa0661f0f>] svc_xprt_enqueue+0x6f/0x220 [sunrpc]
<4> [<ffffffffa07d76d7>] rdma_cma_handler+0xd7/0x150 [svcrdma]
<4> [<ffffffffa04111c9>] cma_ib_handler+0x119/0x290 [rdma_cm]
<4> [<ffffffffa0522e37>] cm_process_work+0x27/0x110 [ib_cm]
<4> [<ffffffffa0524805>] cm_work_handler+0x215/0x138e [ib_cm]
<4> [<ffffffffa05245f0>] ? cm_work_handler+0x0/0x138e [ib_cm]
<4> [<ffffffff81094a20>] worker_thread+0x170/0x2a0
<4> [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff810948b0>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
--- snip ---
By the looks of it, RDMA is to blame to some degree: the first trace dies in the xprtrdma reply tasklet with the instruction pointer sitting in a data page (hence the NX-protected page message, which smells like a stale or corrupted function pointer), and the second one takes a NULL pointer dereference in svc_xprt_enqueue() via the svcrdma connection manager callback. After some googling I see there has been significant work on sunrpc, svcrdma and xprtrdma in more recent kernels, but I doubt many of those patches have been backported into the el6 kernel 2.6.32-431.29.2.el6.x86_64. The likelihood of a crash seems related to throughput: the more IO, the better the chances of a crash. The P taint is from the ZFS kernel module, in case anyone is wondering. Other than that it's a pretty straightforward standard setup; SELinux is off.
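If anyone wants to double-check what Red Hat did backport, the package changelog is probably the quickest place to look; something along these lines (the grep pattern is just my guess at the relevant keywords):

  # list changelog entries mentioning the RDMA transports in the running kernel
  rpm -q --changelog kernel-2.6.32-431.29.2.el6 | grep -iE 'xprtrdma|svcrdma|rdma'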
Unfortunately this is a production environment, so I would like to avoid any drastic measures while troubleshooting. Also, if I remember correctly, OFED was upgraded from 2.2-1.0.1 to the latest version between the two fileserver crashes.
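I can confirm the exact versions on both sides next time I'm on the boxes; for the record, this is what I'd run (standard OFED/OFA tools):

  ofed_info -s                                      # installed OFED release
  ibstat                                            # HCA state, firmware, link rate
  ibv_devinfo | grep -E 'fw_ver|vendor_part_id'     # firmware and device as the verbs layer sees them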
Any thoughts or ideas? Does this look more like a kernel issue than a driver issue to you? Any feedback would be greatly appreciated.
Cheers,
Lasse