Hi,
I have an issue I've been struggling with on a new setup I've been working on.
When the network is kept reasonably busy for an extended period, IB interfaces drop off the network. On the IB layer everything seems fine, but on the IPoIB layer all neighbouring interfaces become unreachable, and I have to run "service networking restart" to bring the offending interface up again.
What I have noticed is that ibqueryerrors reports a lot of errors, which I'm not used to seeing on other installs I have done. See the output below.
root@jhb-tc-pve-b:/dump/dump# ibqueryerrors
Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2c90300f4d481 port 1: [PortXmitWait == 27830451]
Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2c903000dd41f port 1: [PortXmitWait == 9425245]
Errors for 0x2c9020047d1f0 "Infiniscale-IV Mellanox Technologies"
   GUID 0x2c9020047d1f0 port ALL: [LinkErrorRecoveryCounter == 4] [PortRcvErrors == 5] [PortRcvSwitchRelayErrors == 1319] [PortXmitDiscards == 209] [PortXmitWait == 31409740]
   GUID 0x2c9020047d1f0 port 1: [PortRcvSwitchRelayErrors == 41] [PortXmitDiscards == 8] [PortXmitWait == 5823314]
   GUID 0x2c9020047d1f0 port 2: [LinkErrorRecoveryCounter == 4] [PortRcvErrors == 5] [PortRcvSwitchRelayErrors == 502] [PortXmitDiscards == 21] [PortXmitWait == 13257853]
   GUID 0x2c9020047d1f0 port 3: [PortRcvSwitchRelayErrors == 776] [PortXmitDiscards == 180] [PortXmitWait == 12328573]
Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2c90300fda4f1 port 1: [PortXmitWait == 35374998]

## Summary: 4 nodes checked, 4 bad nodes found
##          11 ports checked, 6 ports have errors beyond threshold
##
## Suppressed:
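Since these counters are cumulative, a single snapshot does not show which ones are still climbing while the network is under load. A minimal sketch of one way to compare two snapshots follows. It assumes the output format shown above; the function names, the SAMPLE text (a subset of the output above) and the "later" values in the demo are hypothetical and only there for illustration.

```python
import re

# Matches one "GUID <guid> port <port>: [Name == value] ..." entry.
# Works on the normal multi-line output and on output pasted as one line.
PORT_RE = re.compile(
    r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\w+):((?:\s*\[\w+ == \d+\])+)"
)
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)\]")

# Subset of the ibqueryerrors output from this post.
SAMPLE = """Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2c90300f4d481 port 1: [PortXmitWait == 27830451]
Errors for 0x2c9020047d1f0 "Infiniscale-IV Mellanox Technologies"
   GUID 0x2c9020047d1f0 port 2: [LinkErrorRecoveryCounter == 4] [PortRcvErrors == 5] [PortRcvSwitchRelayErrors == 502] [PortXmitDiscards == 21] [PortXmitWait == 13257853]
"""


def parse_ibqueryerrors(text):
    """Return {(guid, port): {counter_name: value}} for every port entry."""
    result = {}
    for guid, port, counters in PORT_RE.findall(text):
        result[(guid, port)] = {
            name: int(value) for name, value in COUNTER_RE.findall(counters)
        }
    return result


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    deltas = {}
    for key, counters in after.items():
        old = before.get(key, {})
        changed = {
            name: value - old.get(name, 0)
            for name, value in counters.items()
            if value != old.get(name, 0)
        }
        if changed:
            deltas[key] = changed
    return deltas


if __name__ == "__main__":
    first = parse_ibqueryerrors(SAMPLE)
    # Hypothetical second snapshot, NOT real data from this fabric.
    second = {("0x2c9020047d1f0", "2"): {"PortRcvSwitchRelayErrors": 510}}
    for (guid, port), changed in counter_deltas(first, second).items():
        print(guid, "port", port, changed)
```

In practice the two snapshots would come from saving the ibqueryerrors output to a file before and after a load test and passing both files' contents to parse_ibqueryerrors.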
Here are the details of the three interface cards I'm using.
root@jhb-tc-pve-a:~# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000d:d41f
        base lid:        0x1
        sm lid:          0x3
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000d:d420
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            10 Gb/sec (4X)
        link_layer:      InfiniBand

root@jhb-tc-pve-a:~# lspci -v | grep -i mellanox
04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
        Subsystem: Mellanox Technologies Device 0048

root@jhb-tc-pve-a:~# mstflint -d 04:00.0 query
Image type:      FS2
FW Version:      2.9.1000
Device ID:       26428
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c903000dd41e 0002c903000dd41f 0002c903000dd420 0002c903000dd421
MACs:                                 000000000000     000000000001
VSD:
PSID:            MT_0D80110009
jhb-tc-pve-b:/dump/dump# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:00fd:a4f1
        base lid:        0x4
        sm lid:          0x3
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

root@jhb-tc-pve-b:/dump/dump# lspci -v | grep -i mellanox
81:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Subsystem: Mellanox Technologies Device 0017

root@jhb-tc-pve-b:/dump/dump# mstflint -d 81:00.0 query
Image type:      FS2
FW Version:      2.35.5100
FW Release Date: 6.9.2015
Product Version: 02.35.51.00
Rom Info:        type=PXE version=3.4.648 devid=4099 proto=0xff
Device ID:       4099
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c90300fda4f0 0002c90300fda4f1 0002c90300fda4f2 0002c90300fda4f3
MACs:                                 0002c9fda4f0     0002c9fda4f1
VSD:
PSID:            MT_1060110018
root@jhb-tc-pve-c:~# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:00f4:d481
        base lid:        0x3
        sm lid:          0x3
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

root@jhb-tc-pve-c:~# lspci -v | grep -i mellanox
81:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Subsystem: Mellanox Technologies Device 0017

root@jhb-tc-pve-c:~# mstflint -d 81:00.0 query
Image type:      FS2
FW Version:      2.35.5100
FW Release Date: 6.9.2015
Product Version: 02.35.51.00
Rom Info:        type=PXE version=3.4.648 devid=4099 proto=0xff
Device ID:       4099
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c90300f4d480 0002c90300f4d481 0002c90300f4d482 0002c90300f4d483
MACs:                                 0002c9f4d480     0002c9f4d481
VSD:
PSID:            MT_1060110018
I'm using Debian 8.2 with kernel 4.2.3-2.
Here are the settings in the sysctl.conf file:
#Infiniband Tuning
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=1524288
net.ipv4.tcp_sack=0
net.ipv4.tcp_timestamps=0
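One detail that may be worth double-checking in these settings: according to the kernel's ip-sysctl documentation, net.ipv4.tcp_mem is measured in memory pages, whereas tcp_rmem, tcp_wmem and the net.core buffer sizes are in bytes. The sketch below parses a sysctl.conf-style fragment and converts tcp_mem to bytes so the units can be compared. The 4096-byte page size is an assumption (the real value can be read with os.sysconf("SC_PAGE_SIZE")), and the function names are hypothetical.

```python
# Assumed page size in bytes; verify on the real host with
# os.sysconf("SC_PAGE_SIZE") rather than trusting this constant.
ASSUMED_PAGE_SIZE = 4096

# Part of the sysctl.conf fragment from this post.
SAMPLE_SYSCTL = """#Infiniband Tuning
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
"""


def parse_sysctl(text):
    """Parse 'key = v1 v2 ...' lines into {key: [tokens]}, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        settings[key.strip()] = value.split()
    return settings


def tcp_mem_bytes(settings, page_size=ASSUMED_PAGE_SIZE):
    """Convert the three tcp_mem thresholds (min, pressure, max) to bytes."""
    return [int(pages) * page_size for pages in settings["net.ipv4.tcp_mem"]]


if __name__ == "__main__":
    settings = parse_sysctl(SAMPLE_SYSCTL)
    for label, size in zip(("min", "pressure", "max"), tcp_mem_bytes(settings)):
        print(f"tcp_mem {label}: {size / 2**30:.2f} GiB")
    print("tcp_rmem max (bytes):", settings["net.ipv4.tcp_rmem"][2])
```

This only makes the units explicit; whether these values have anything to do with the interface drops is not something the sketch can tell.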
And here are the kernel modules I'm loading in /etc/modules:
mlx4_core
mlx4_ib
ib_umad
ib_ipoib
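To rule out a module having been unloaded when an interface drops, the list above can be checked against /proc/modules, where each line starts with the name of a loaded module. A minimal sketch, assuming the four modules are built as loadable modules (anything compiled into the kernel does not appear in /proc/modules); the function names and the sample text are hypothetical.

```python
# Modules listed in /etc/modules in this post.
REQUIRED_MODULES = ["mlx4_core", "mlx4_ib", "ib_umad", "ib_ipoib"]

# Hypothetical /proc/modules excerpt for illustration, NOT taken from the
# hosts in this post. The first whitespace-separated field is the name.
SAMPLE_PROC_MODULES = """ib_ipoib 102400 0 - Live 0x0000000000000000
mlx4_ib 163840 0 - Live 0x0000000000000000
mlx4_core 286720 1 mlx4_ib, Live 0x0000000000000000
"""


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(required, proc_modules_text):
    """Return the required modules that are not currently loaded, in order."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    # On a real host, read the text with open("/proc/modules").read() instead.
    print("missing:", missing_modules(REQUIRED_MODULES, SAMPLE_PROC_MODULES))
```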
This is the error that is pushed to the terminal after the IPoIB interface dies.
It would be great to get any suggestions from the community on this one, as I'm stuck at this stage...