Channel: Mellanox Interconnect Community: Message List

ibqueryerrors errors and IPoIB crash issue


Hi,

I have an issue I've been struggling with on a new setup I've been working on.

When the network is kept reasonably busy for a sustained period, IB interfaces drop out of the network. At the IB layer everything seems fine, but at the IPoIB layer all neighbouring interfaces become unreachable, and I have to run a service networking restart to bring the offending interface up again.

What I have noticed is that ibqueryerrors reports a lot of errors, which I'm not accustomed to seeing in other installs I have done. See the info below.

root@jhb-tc-pve-b:/dump/dump# ibqueryerrors
Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2c90300f4d481 port 1: [PortXmitWait == 27830451]
Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2c903000dd41f port 1: [PortXmitWait == 9425245]
Errors for 0x2c9020047d1f0 "Infiniscale-IV Mellanox Technologies"
   GUID 0x2c9020047d1f0 port ALL: [LinkErrorRecoveryCounter == 4] [PortRcvErrors == 5] [PortRcvSwitchRelayErrors == 1319] [PortXmitDiscards == 209] [PortXmitWait == 31409740]
   GUID 0x2c9020047d1f0 port 1: [PortRcvSwitchRelayErrors == 41] [PortXmitDiscards == 8] [PortXmitWait == 5823314]
   GUID 0x2c9020047d1f0 port 2: [LinkErrorRecoveryCounter == 4] [PortRcvErrors == 5] [PortRcvSwitchRelayErrors == 502] [PortXmitDiscards == 21] [PortXmitWait == 13257853]
   GUID 0x2c9020047d1f0 port 3: [PortRcvSwitchRelayErrors == 776] [PortXmitDiscards == 180] [PortXmitWait == 12328573]
Errors for "MT25408 ConnectX Mellanox Technologies"
   GUID 0x2c90300fda4f1 port 1: [PortXmitWait == 35374998]

## Summary: 4 nodes checked, 4 bad nodes found
##          11 ports checked, 6 ports have errors beyond threshold
##
## Suppressed:

 

Here are the details of the three interface cards I'm using.

 

root@jhb-tc-pve-a:~# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000d:d41f
        base lid:        0x1
        sm lid:          0x3
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000d:d420
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            10 Gb/sec (4X)
        link_layer:      InfiniBand

root@jhb-tc-pve-a:~# lspci -v | grep -i mellanox
04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
        Subsystem: Mellanox Technologies Device 0048
root@jhb-tc-pve-a:~# mstflint -d 04:00.0 query
Image type:      FS2
FW Version:      2.9.1000
Device ID:       26428
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c903000dd41e 0002c903000dd41f 0002c903000dd420 0002c903000dd421
MACs:                                 000000000000     000000000001
VSD:
PSID:            MT_0D80110009

 

 

jhb-tc-pve-b:/dump/dump# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:00fd:a4f1
        base lid:        0x4
        sm lid:          0x3
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

root@jhb-tc-pve-b:/dump/dump# lspci -v | grep -i mellanox
81:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Subsystem: Mellanox Technologies Device 0017
root@jhb-tc-pve-b:/dump/dump# mstflint -d 81:00.0 query
Image type:      FS2
FW Version:      2.35.5100
FW Release Date: 6.9.2015
Product Version: 02.35.51.00
Rom Info:        type=PXE version=3.4.648 devid=4099 proto=0xff
Device ID:       4099
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c90300fda4f0 0002c90300fda4f1 0002c90300fda4f2 0002c90300fda4f3
MACs:                                 0002c9fda4f0     0002c9fda4f1
VSD:
PSID:            MT_1060110018

 

 

root@jhb-tc-pve-c:~# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:00f4:d481
        base lid:        0x3
        sm lid:          0x3
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

root@jhb-tc-pve-c:~# lspci -v | grep -i mellanox
81:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Subsystem: Mellanox Technologies Device 0017
root@jhb-tc-pve-c:~# mstflint -d 81:00.0 query
Image type:      FS2
FW Version:      2.35.5100
FW Release Date: 6.9.2015
Product Version: 02.35.51.00
Rom Info:        type=PXE version=3.4.648 devid=4099 proto=0xff
Device ID:       4099
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c90300f4d480 0002c90300f4d481 0002c90300f4d482 0002c90300f4d483
MACs:                                 0002c9f4d480     0002c9f4d481
VSD:
PSID:            MT_1060110018

 

 

I'm using Debian 8.2 with kernel 4.2.3-2.

 

Here are the settings in my sysctl.conf file:

 

#Infiniband Tuning
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=1524288
net.ipv4.tcp_sack=0
net.ipv4.tcp_timestamps=0
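
A side note on units, in case it matters for anyone reviewing these values: net.ipv4.tcp_mem is counted in memory pages, while tcp_rmem, tcp_wmem and the net.core.* buffer limits are in bytes. The rough Python sketch below parses a fragment like the one above and converts tcp_mem to bytes. The 4096-byte page size is an assumption (typical for x86_64, check with `getconf PAGESIZE`), and the helper names are my own.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; verify with `getconf PAGESIZE`

# Subset of the sysctl.conf settings shown above.
SYSCTL_FRAGMENT = """\
#Infiniband Tuning
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
"""


def parse_sysctl(text):
    """Parse 'key = v1 v2 ...' lines into {key: [int, ...]}, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = [int(v) for v in value.split()]
    return settings


def tcp_mem_bytes(settings, page_size=PAGE_SIZE):
    """Convert the three tcp_mem thresholds (low, pressure, high) from pages to bytes."""
    return [pages * page_size for pages in settings["net.ipv4.tcp_mem"]]


if __name__ == "__main__":
    settings = parse_sysctl(SYSCTL_FRAGMENT)
    for label, size in zip(("low", "pressure", "high"), tcp_mem_bytes(settings)):
        print(f"tcp_mem {label}: {size / 2**30:.2f} GiB")
    print("tcp_rmem max (bytes):", settings["net.ipv4.tcp_rmem"][2])
    print("rmem_max (bytes):", settings["net.core.rmem_max"][0])
```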

 

 

And here are the kernel modules I'm loading in /etc/modules:

 

mlx4_core
mlx4_ib
ib_umad
ib_ipoib

 

 

This is the error that is printed to the terminal after the IPoIB interface dies:

 

[Attached screenshot: sad.PNG]

 

It would be great to get any suggestions from the community on this one, as I'm stuck at this stage.

