Channel: Mellanox Interconnect Community: Message List

Windows VMs hang on NFSoRDMA on CentOS 6.5


Hello. We are stuck on the following problem. Four nodes are connected to a storage server via NFS over RDMA. The hardware is:

Intel 2312WPQJR as each node

Intel R2312GL4GS as the storage server, with a 2-port Intel InfiniBand controller

Mellanox SwitchX IS5023 InfiniBand switch for interconnect.

 

The nodes and the storage server run CentOS 6.5 with the built-in InfiniBand packages (Linux 2.6.32-431.el6.x86_64).

 

An array is created on the storage server and appears in the system as /storage/s01. It is then exported via NFS. The nodes mount the export with:

/bin/mount -t nfs -o rdma,port=20049,rw,hard,timeo=600,retrans=5,async,nfsvers=3,intr 192.168.1.1:/storage/s01 /home/storage/sata/01

mount shows:

192.168.1.1:/storage/s01 on /home/storage/sata/01 type nfs (rw,rdma,port=20049,hard,timeo=600,retrans=5,nfsvers=3,intr,addr=192.168.1.1)
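To compare what was requested with what the kernel reports, the two option strings above can be parsed and diffed. This is only a minimal sketch, not part of our setup: the function name parse_mount_options is made up for illustration, and the strings are copied from the command and the mount output in this post.

```python
def parse_mount_options(option_string):
    """Turn 'rw,port=20049,...' (optionally wrapped in parentheses) into a dict.

    Flags without a value (rw, rdma, hard, ...) map to True.
    """
    opts = {}
    for item in option_string.strip("()").split(","):
        key, sep, value = item.partition("=")
        opts[key] = value if sep else True
    return opts


# Options passed to /bin/mount -o ...
requested = parse_mount_options(
    "rdma,port=20049,rw,hard,timeo=600,retrans=5,async,nfsvers=3,intr"
)
# Options printed by `mount` for the same filesystem
reported = parse_mount_options(
    "(rw,rdma,port=20049,hard,timeo=600,retrans=5,nfsvers=3,intr,addr=192.168.1.1)"
)

# Requested options that do not show up in the mount output
missing = sorted(set(requested) - set(reported))
# timeo is given in tenths of a second, so 600 means 60 seconds
timeout_seconds = int(reported["timeo"]) / 10

print("missing from mount output:", missing)
print("RPC timeout in seconds:", timeout_seconds)
```

With the strings from this post, the only requested option not echoed back is async, and the rdma flag and port 20049 are both present in the reported options.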

 

Then we create a virtual machine with virsh, using the virtio disk bus. Everything is OK until we start Windows on KVM. It may work for 2 hours or 2 days, but under heavy load the mount hangs (i.e. /sata/02 and 03 remain accessible, but any access to 01 results in a total hang of the console). The only way to recover is a hardware reset of the node. If we mount without rdma, everything is fine. All Linux VMs work fine, with no problems.
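Because touching the hung mount from a shell freezes the console, one way to probe it safely is to do the access in a child process and give up after a timeout. The following is a rough sketch under that assumption, not something we run in production; the helper name mount_responds and the 5-second timeout are arbitrary choices for illustration.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=5.0):
    """Return True if a stat() of `path` completes within timeout_s seconds.

    The stat runs in a separate Python process so that the caller never
    blocks on the filesystem itself.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait for the child: a process blocked on a
        # hard NFS mount may sit in uninterruptible sleep and never exit.
        proc.kill()
        return False


if __name__ == "__main__":
    for mount_point in sys.argv[1:]:
        state = "ok" if mount_responds(mount_point) else "NOT RESPONDING"
        print(mount_point, state)
```

Run with the mount points as arguments, for example /home/storage/sata/01. A path that does not exist is also reported as not responding, since the stat fails.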

 

NFS tuning has been done. The logs from the time of the problem show:

Mar 20 09:42:22 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
Mar 20 09:42:42 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
Mar 20 09:42:49 v0004 kernel: ------------[ cut here ]------------
Mar 20 09:42:49 v0004 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
Mar 20 09:42:49 v0004 kernel: Hardware name: S2600WP
Mar 20 09:42:49 v0004 kernel: Modules linked in: act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb ebt_arp ebt_ip ebtable_nat ebtables xprtrdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 openvswitch(U) vhost_net macvtap macvlan tun kvm_intel kvm iTCO_wdt iTCO_vendor_support sr_mod cdrom sb_edac edac_core lpc_ich mfd_core igb i2c_algo_bit ptp pps_core sg i2c_i801 i2c_core ioatdma dca mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core ext4 jbd2 mbcache usb_storage sd_mod crc_t10dif ahci isci libsas scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Mar 20 09:42:49 v0004 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.5.1.el6.x86_64 #1
Mar 20 09:42:49 v0004 kernel: Call Trace:
Mar 20 09:42:49 v0004 kernel: <IRQ> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
Mar 20 09:42:49 v0004 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
Mar 20 09:42:49 v0004 kernel: [<ffffffff8152a7fb>] ? _spin_unlock_bh+0x1b/0x20
Mar 20 09:42:49 v0004 kernel: [<ffffffffa04554f0>] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa044e79c>] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa05322fc>] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0535450>] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa01c11cb>] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0166c54>] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0154057>] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0455396>] ? rpc_wake_up_task_queue_locked+0x186/0x270 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [<ffffffffa01547c4>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [<ffffffff810e6eb0>] ? handle_IRQ_event+0x60/0x170
Mar 20 09:42:49 v0004 kernel: [<ffffffff810e980e>] ? handle_edge_irq+0xde/0x180
Mar 20 09:42:49 v0004 kernel: [<ffffffffa0153362>] ? mlx4_cq_completion+0x42/0x90 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
Mar 20 09:42:49 v0004 kernel: [<ffffffff815312ec>] ? do_IRQ+0x6c/0xf0
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a893>] ? __do_softirq+0x73/0x1e0
Mar 20 09:42:49 v0004 kernel: [<ffffffff810e6eb0>] ? handle_IRQ_event+0x60/0x170
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a795>] ? irq_exit+0x85/0x90
Mar 20 09:42:49 v0004 kernel: [<ffffffff815312f5>] ? do_IRQ+0x75/0xf0
Mar 20 09:42:49 v0004 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Mar 20 09:42:49 v0004 kernel: <EOI> [<ffffffff812e09ae>] ? intel_idle+0xde/0x170
Mar 20 09:42:49 v0004 kernel: [<ffffffff812e0991>] ? intel_idle+0xc1/0x170
Mar 20 09:42:49 v0004 kernel: [<ffffffff814268f7>] ? cpuidle_idle_call+0xa7/0x140
Mar 20 09:42:49 v0004 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Mar 20 09:42:49 v0004 kernel: [<ffffffff8150cf1a>] ? rest_init+0x7a/0x80
Mar 20 09:42:49 v0004 kernel: [<ffffffff81c26f8f>] ? start_kernel+0x424/0x430
Mar 20 09:42:49 v0004 kernel: [<ffffffff81c2633a>] ? x86_64_start_reservations+0x125/0x129
Mar 20 09:42:49 v0004 kernel: [<ffffffff81c26453>] ? x86_64_start_kernel+0x115/0x124
Mar 20 09:42:49 v0004 kernel: ---[ end trace ddc1b92aa1d57ab7 ]---
Mar 20 09:42:49 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
Mar 20 09:43:19 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
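
The rpcrdma lines in this log can be summarised mechanically. The sketch below is only an illustration: the names parse_events and reconnect_gaps are invented here, the errno table is a small hand-copied subset of the Linux values (103 is ECONNABORTED on Linux), and it assumes all events fall on the same day. It extracts the close and reconnect events and the time between each close and the following reconnect.

```python
import re

# Small subset of Linux errno values, hardcoded so the result does not
# depend on the platform the script runs on.
LINUX_ERRNO = {
    103: "ECONNABORTED",
    104: "ECONNRESET",
    110: "ETIMEDOUT",
    111: "ECONNREFUSED",
}

# Matches both "closed (-103)" and "on mlx4_0, memreg ..." messages; \s+
# tolerates a line wrap between the peer address and the rest.
EVENT_RE = re.compile(
    r"(?P<time>\d\d:\d\d:\d\d) \S+ kernel: rpcrdma: connection to (?P<peer>\S+)\s+"
    r"(?:closed \((?P<code>-?\d+)\)|on (?P<dev>\w+),)"
)


def to_seconds(hms):
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_events(log_text):
    """Return a list of (kind, time, detail) tuples in log order."""
    events = []
    for match in EVENT_RE.finditer(log_text):
        if match.group("code") is not None:
            code = abs(int(match.group("code")))
            events.append(("closed", match.group("time"), LINUX_ERRNO.get(code, str(code))))
        else:
            events.append(("connected", match.group("time"), match.group("dev")))
    return events


def reconnect_gaps(events):
    """Seconds between each close and the next successful connect."""
    gaps, last_close = [], None
    for kind, time, _detail in events:
        if kind == "closed":
            last_close = to_seconds(time)
        elif last_close is not None:
            gaps.append(to_seconds(time) - last_close)
            last_close = None
    return gaps


SAMPLE = """\
Mar 20 09:42:22 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
Mar 20 09:42:42 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
Mar 20 09:42:49 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
Mar 20 09:43:19 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
"""

events = parse_events(SAMPLE)
gaps = reconnect_gaps(events)
for event in events:
    print(event)
print("reconnect gaps (s):", gaps)
```

On the four rpcrdma lines above this reports two ECONNABORTED closes, each followed by a reconnect on mlx4_0, after 20 and 30 seconds respectively; the second close carries the same timestamp as the softirq warning.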

 

Nothing is logged on the storage server. The CentOS virt-list couldn't help, so this community is the last place to ask.

