Hi,
we are experiencing intermittent kernel crashes with our storage fabric. All servers are running up-to-date CentOS 6.5 with ConnectX-3 HCAs, OFED 2.3-1.0.1 with firmware 2.32.5100, and a QDR link through an IB switch. The fileserver serves data from ZFS datapools with NFS over RDMA; speeds are acceptable and the mounts are stable. My guess is that the fileserver crashes (and reboots), which in turn disrupts the IB link and triggers kernel oopses on the clients, which then also crash and reboot. The servers are brand-new Dell PE R420 boxes.
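For completeness, the mounts are set up roughly like this (options from memory; hostname and export path below are placeholders, not our real ones):

  # on the fileserver: have nfsd listen on the RDMA port (20049 is the IANA-assigned one)
  echo rdma 20049 > /proc/fs/nfsd/portlist

  # on the clients: mount with the RDMA transport
  mount -t nfs -o vers=4,proto=rdma,port=20049 fileserver:/tank/export /mnt/export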
This has happened twice so far, with a different dmesg log each time. Here are the relevant bits from the fileserver (full text and client crash log attached):
<2>kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
<1>BUG: unable to handle kernel paging request at ffff881e53d16758
<1>IP: [<ffff881e53d16758>] 0xffff881e53d16758
<4>PGD 1a86063 PUD 8000001e400001e3
<4>Oops: 0011 [#1] SMP
<4>last sysfs file: /sys/kernel/mm/ksm/run
<4>CPU 11
<4>Pid: 46610, comm: stat Tainted: P --------------- 2.6.32-431.29.2.el6.x86_64 #1 Dell Inc. PowerEdge R420
--- snip ---
<4>Call Trace:
<4> <IRQ>
<4> [<ffffffffa0785a80>] ? rpcrdma_run_tasklet+0x60/0x90 [xprtrdma]
<4> [<ffffffff8107ab05>] tasklet_action+0xe5/0x120
<4> [<ffffffff8107a5f1>] __do_softirq+0xc1/0x1e0
<4> [<ffffffff810e6c60>] ? handle_IRQ_event+0x60/0x170
<4> [<ffffffff8100c30c>] call_softirq+0x1c/0x30
<4> [<ffffffff8100fa75>] do_softirq+0x65/0xa0
<4> [<ffffffff8107a4a5>] irq_exit+0x85/0x90
<4> [<ffffffff81532525>] do_IRQ+0x75/0xf0
<4> [<ffffffff8100b9d3>] ret_from_intr+0x0/0x11
<4> <EOI>
<4> [<ffffffffa06671a0>] ? rpc_release_client+0x0/0xa0 [sunrpc]
<4> [<ffffffffa0667295>] ? rpc_task_release_client+0x55/0x70 [sunrpc]
<4> [<ffffffffa066fc84>] rpc_release_resources_task+0x34/0x40 [sunrpc]
<4> [<ffffffffa0670774>] __rpc_execute+0x174/0x350 [sunrpc]
<4> [<ffffffff8109ae27>] ? bit_waitqueue+0x17/0xd0
<4> [<ffffffffa06709b1>] rpc_execute+0x61/0xa0 [sunrpc]
<4> [<ffffffffa06673a5>] rpc_run_task+0x75/0x90 [sunrpc]
<4> [<ffffffffa06674c2>] rpc_call_sync+0x42/0x70 [sunrpc]
<4> [<ffffffffa0728aae>] _nfs4_call_sync+0x3e/0x40 [nfs]
<4> [<ffffffffa0721065>] _nfs4_proc_statfs+0xa5/0xc0 [nfs]
<4> [<ffffffffa0724376>] nfs4_proc_statfs+0x56/0x80 [nfs]
<4> [<ffffffffa070d276>] nfs_statfs+0x66/0x1a0 [nfs]
<4> [<ffffffff811bd744>] statfs_by_dentry+0x74/0xa0
--- snip ---
And the other one:
<1>BUG: unable to handle kernel NULL pointer dereference at 00000000000001cc
<1>IP: [<ffffffff8152b95b>] _spin_lock_bh+0x1b/0x40
<4>PGD 0
<4>Oops: 0002 [#1] SMP
<4>last sysfs file: /sys/devices/virtual/block/dm-0/dm/uuid
<4>CPU 3
<4>Pid: 4198, comm: ib_cm/3 Tainted: P --------------- 2.6.32-431.29.2.el6.x86_64 #1 Dell Inc. PowerEdge R420
<4>RIP: 0010:[<ffffffff8152b95b>] [<ffffffff8152b95b>] _spin_lock_bh+0x1b/0x40
--- snip ---
<4>Process ib_cm/3 (pid: 4198, threadinfo ffff881027640000, task ffff8810257ce040)
<4>Stack:
<4> ffff881b2a2f6e98 00000000000001cc ffff881027641ce0 ffffffffa0661f0f
<4><d> 0000000000000000 ffff881a31bc67c0 ffff881027641cd0 ffff881f261c8800
<4><d> ffff881e49ae5c00 ffff881e39e78918 ffff8819b0a31470 ffffe8ffffa2ff88
<4>Call Trace:
<4> [<ffffffffa0661f0f>] svc_xprt_enqueue+0x6f/0x220 [sunrpc]
<4> [<ffffffffa07d76d7>] rdma_cma_handler+0xd7/0x150 [svcrdma]
<4> [<ffffffffa04111c9>] cma_ib_handler+0x119/0x290 [rdma_cm]
<4> [<ffffffffa0522e37>] cm_process_work+0x27/0x110 [ib_cm]
<4> [<ffffffffa0524805>] cm_work_handler+0x215/0x138e [ib_cm]
<4> [<ffffffffa05245f0>] ? cm_work_handler+0x0/0x138e [ib_cm]
<4> [<ffffffff81094a20>] worker_thread+0x170/0x2a0
<4> [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff810948b0>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
--- snip ---
By the looks of it, RDMA is to blame to some degree: the first trace dies in the xprtrdma reply tasklet with the instruction pointer sitting in a data page (hence the NX-protected page message, which smells like a stale or corrupted function pointer), and the second one takes a NULL pointer dereference in svc_xprt_enqueue() via the svcrdma connection manager callback. After some googling I see there has been significant work on sunrpc, svcrdma and xprtrdma in more recent kernels, but I doubt many of those patches have been backported into the el6 kernel 2.6.32-431.29.2.el6.x86_64. The likelihood of a crash seems related to throughput: the more IO, the better the chances of a crash. The P taint is from the ZFS kernel module, in case anyone is wondering. Other than that it's a pretty straightforward standard setup; SELinux is off.
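If anyone wants to double-check what Red Hat did backport, the package changelog is probably the quickest place to look; something along these lines (the grep pattern is just my guess at the relevant keywords):

  # list changelog entries mentioning the RDMA transports in the running kernel
  rpm -q --changelog kernel-2.6.32-431.29.2.el6 | grep -iE 'xprtrdma|svcrdma|rdma'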
Unfortunately this is a production environment, so I would like to avoid any drastic measures while troubleshooting. Also, if I remember correctly, OFED was upgraded from 2.2-1.0.1 to the latest version between the two fileserver crashes.
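I can confirm the exact versions on both sides next time I'm on the boxes; for the record, this is what I'd run (standard OFED/OFA tools):

  ofed_info -s                                      # installed OFED release
  ibstat                                            # HCA state, firmware, link rate
  ibv_devinfo | grep -E 'fw_ver|vendor_part_id'     # firmware and device as the verbs layer sees them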
Any thoughts or ideas? Does this look more like a kernel issue than a driver issue to you? Any feedback would be greatly appreciated.
Cheers,
Lasse