Hello. We are stuck on the following problem. Four nodes are connected to a storage server via NFS over RDMA. The hardware is:
Intel 2312WPQJR as the compute nodes
Intel R2312GL4GS as the storage server, with a dual-port Intel InfiniBand controller
Mellanox SwitchX IS5023 InfiniBand switch for the interconnect.
The nodes and the storage server run CentOS 6.5 with the built-in InfiniBand stack (Linux 2.6.32-431.el6.x86_64).
On the storage server an array is created, which appears in the system as /storage/s01. It is then exported via NFS, and the nodes mount it with:
/bin/mount -t nfs -o rdma,port=20049,rw,hard,timeo=600,retrans=5,async,nfsvers=3,intr 192.168.1.1:/storage/s01 /home/storage/sata/01
mount shows:
192.168.1.1:/storage/s01 on /home/storage/sata/01 type nfs
(rw,rdma,port=20049,hard,timeo=600,retrans=5,nfsvers=3,intr,addr=192.168.1.1)
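The server-side export is not shown above; for completeness, it is set up roughly like this on the storage (the /etc/exports options are an assumption for illustration, only the exported path comes from the mount command; the portlist write is the standard way to enable the NFS/RDMA listener on port 20049 on this kernel):

# /etc/exports on the storage server (options assumed)
/storage/s01    192.168.1.0/24(rw,async,no_root_squash)

# enable the server-side NFS/RDMA listener (requires the svcrdma module)
modprobe svcrdma
echo rdma 20049 > /proc/fs/nfsd/portlist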
Then we create a virtual machine with virsh, using the virtio disk bus. Everything is fine until we start a Windows guest on KVM. It may run for 2 hours or for 2 days, but under heavy load it hangs the mount: /sata/02 and /sata/03 stay accessible, while any access to /sata/01 hangs the console completely. The only way to recover is a hardware reset of the node. If we mount without rdma, everything is fine, and all Linux VMs work fine with no problems.
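For reference, a Windows guest is created roughly as follows (a sketch only: the guest name, sizes, ISO path, bridge and cache mode are made-up examples; only the virtio bus and the image location on the NFS mount reflect the setup described above):

qemu-img create -f qcow2 /home/storage/sata/01/win-guest.qcow2 100G
virt-install --name win-guest --ram 8192 --vcpus 4 \
    --disk path=/home/storage/sata/01/win-guest.qcow2,bus=virtio,format=qcow2,cache=none \
    --cdrom /home/storage/iso/windows.iso \
    --network bridge=br0,model=virtio --graphics vnc --os-type windows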
NFS tuning has been done; the client logs at the time of the problem show:
Mar 20 09:42:22 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
Mar 20 09:42:42 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
Mar 20 09:42:49 v0004 kernel: ------------[ cut here ]------------
Mar 20 09:42:49 v0004 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
Mar 20 09:42:49 v0004 kernel: Hardware name: S2600WP
Mar 20 09:42:49 v0004 kernel: Modules linked in: act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb ebt_arp ebt_ip ebtable_nat ebtables xprtrdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 openvswitch(U) vhost_net macvtap macvlan tun kvm_intel kvm iTCO_wdt iTCO_vendor_support sr_mod cdrom sb_edac edac_core lpc_ich mfd_core igb i2c_algo_bit ptp pps_core sg i2c_i801 i2c_core ioatdma dca mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core ext4 jbd2 mbcache usb_storage sd_mod crc_t10dif ahci isci libsas scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Mar 20 09:42:49 v0004 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.5.1.el6.x86_64 #1
Mar 20 09:42:49 v0004 kernel: Call Trace:
Mar 20 09:42:49 v0004 kernel: <IRQ> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
Mar 20 09:42:49 v0004 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
Mar 20 09:42:49 v0004 kernel: [<ffffffff8152a7fb>] ? _spin_unlock_bh+0x1b/0x20
Mar 20 09:42:49 v0004 kernel: [<ffffffffa04554f0>] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa044e79c>] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa05322fc>] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0535450>] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa01c11cb>] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0166c54>] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0154057>] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0455396>] ? rpc_wake_up_task_queue_locked+0x186/0x270 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa01547c4>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [<ffffffff810e6eb0>] ? handle_IRQ_event+0x60/0x170
Mar 20 09:42:49 v0004 kernel: [<ffffffff810e980e>] ? handle_edge_irq+0xde/0x180
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0153362>] ? mlx4_cq_completion+0x42/0x90 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
Mar 20 09:42:49 v0004 kernel: [<ffffffff815312ec>] ? do_IRQ+0x6c/0xf0
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a893>] ? __do_softirq+0x73/0x1e0
Mar 20 09:42:49 v0004 kernel: [<ffffffff810e6eb0>] ? handle_IRQ_event+0x60/0x170
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a795>] ? irq_exit+0x85/0x90
Mar 20 09:42:49 v0004 kernel: [<ffffffff815312f5>] ? do_IRQ+0x75/0xf0
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Mar 20 09:42:49 v0004 kernel: <EOI> [<ffffffff812e09ae>] ? intel_idle+0xde/0x170
Mar 20 09:42:49 v0004 kernel: [<ffffffff812e0991>] ? intel_idle+0xc1/0x170
Mar 20 09:42:49 v0004 kernel: [<ffffffff814268f7>] ? cpuidle_idle_call+0xa7/0x140
Mar 20 09:42:49 v0004 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Mar 20 09:42:49 v0004 kernel: [<ffffffff8150cf1a>] ? rest_init+0x7a/0x80
Mar 20 09:42:49 v0004 kernel: [<ffffffff81c26f8f>] ? start_kernel+0x424/0x430
Mar 20 09:42:49 v0004 kernel: [<ffffffff81c2633a>] ? x86_64_start_reservations+0x125/0x129
Mar 20 09:42:49 v0004 kernel: [<ffffffff81c26453>] ? x86_64_start_kernel+0x115/0x124
Mar 20 09:42:49 v0004 kernel: ---[ end trace ddc1b92aa1d57ab7 ]---
Mar 20 09:42:49 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
Mar 20 09:43:19 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
Nothing is logged on the storage side at the time of the hang. The CentOS virt mailing list could not help, so this community is the last place to ask.