Hi all,
I'm really new to IB, and I'm having some issues connecting my existing IB network (SX6036G gateways and SX6018F switches) to a new HP enclosure that has a QLogic HP BLc 4X QDR IB Switch and a QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02) mezzanine adapter on each of the blades. Here's my topology:
# ibswitches
Switch : 0x0002c902004b0918 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 29 lmc 0 --> QLogic HP BLc 4X QDR IB Switch
Switch : 0xe41d2d030031e9c1 ports 37 "MF0;GWIB01:SX6036G/U1" enhanced port 0 lid 24 lmc 0
Switch : 0xf45214030073f500 ports 18 "MF0;SWIB02:SX6018/U1" enhanced port 0 lid 1 lmc 0
Switch : 0xe41d2d030031eb41 ports 37 "MF0;GWIB02:SX6036G/U1" enhanced port 0 lid 23 lmc 0
Switch : 0xe41d2d0300097630 ports 18 "MF0;SWIB01:SX6018/U1" enhanced port 0 lid 2 lmc 0
The SM is running on switch SWIB01 with priority 8.
The problem appears when I try to configure the blades. They run Ubuntu 14.04.3 LTS with the following modules loaded:
ib_ucm
ib_uverbs
ib_ipoib
ib_cm
ib_sa
ib_umad
ib_mthca
ib_qib
ib_mad
ib_core
ib_addr
dca
If I run "ibstat" on one of the blades, I get:
root@ubuntu:~# ibstat
CA 'qib0'
CA type: InfiniPath_QMH7342
Number of ports: 2
Firmware version:
Hardware version: 2
Node GUID: 0x0011750000791fec
System image GUID: 0x0011750000791fec
Port 1:
State: Down
Physical state: Polling
Rate: 40
Base lid: 30
LMC: 0
SM lid: 2
Capability mask: 0x0761086a
Port GUID: 0x0011750000791fec
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 40
Base lid: 65535
LMC: 0
SM lid: 65535
Capability mask: 0x0761086a
Port GUID: 0x0011750000791fed
Link layer: InfiniBand
Now, if I go to a host that's already inside the IB network and run the following commands, I can bring the port to 'Active', but only for a while:
# ibportstate -L 29 28 disable
# ibportstate -L 29 28 speed 4
# ibportstate -L 29 28 espeed 4
# ibportstate -L 29 28 smlid 2
# ibportstate -L 29 28 enable
# ibportstate -L 29 28
Switch PortInfo:
# Port info: Lid 29 port 28
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................75
SMLid:...........................2328
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 29 DR path slid 4; dlid 65535; 0,28 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................30
SMLid:...........................2
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............10.0 Gbps (IBA extension)
LinkSpeedEnabled:................10.0 Gbps (IBA extension)
LinkSpeedActive:.................10.0 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
On the Blade host:
root@ubuntu:~# ibstat
CA 'qib0'
CA type: InfiniPath_QMH7342
Number of ports: 2
Firmware version:
Hardware version: 2
Node GUID: 0x0011750000791fec
System image GUID: 0x0011750000791fec
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 30
LMC: 0
SM lid: 2
Capability mask: 0x0761086a
Port GUID: 0x0011750000791fec
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 40
Base lid: 65535
LMC: 0
SM lid: 65535
Capability mask: 0x0761086a
Port GUID: 0x0011750000791fed
Link layer: InfiniBand
But then, at some point, the port goes Down again and connectivity is lost.
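To pin down when the port flaps, the ibstat output can be reduced to a per-port state that is easy to log. This is only a minimal sketch: the function name `parse_ibstat_ports` and the shortened `SAMPLE` text are mine, and it assumes the "Port N:" / "Key: value" layout shown in the output above.

```python
import re

def parse_ibstat_ports(text):
    """Map each port number to its 'Key: value' fields from ibstat text."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Port (\d+):$", line)
        if header:
            # Start of a new "Port N:" section
            current = int(header.group(1))
            ports[current] = {}
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports

# Shortened copy of the ibstat output above (assumed layout)
SAMPLE = """CA 'qib0'
CA type: InfiniPath_QMH7342
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Base lid: 30
Port GUID: 0x0011750000791fec
Port 2:
State: Down
Physical state: Polling
Base lid: 65535
Port GUID: 0x0011750000791fed
"""

ports = parse_ibstat_ports(SAMPLE)
for number, info in sorted(ports.items()):
    print(number, info["State"], info["Physical state"])
```

On the blade, the text could come from `subprocess.run(["ibstat"], capture_output=True, text=True).stdout`, called in a loop with a timestamp, so the moment Port 1 leaves Active shows up in a log.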
If I run "ibqueryerrors" on the host that works fine, I get the following:
# ibqueryerrors
Errors for "Intel Infiniband HCA ubuntu"
GUID 0x11750000791fec port 1: [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 132] [PortRcvErrors == 8]
Errors for 0x2c902004b0918 "Infiniscale-IV Mellanox Technologies"
GUID 0x2c902004b0918 port ALL: [SymbolErrorCounter == 65535] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4] [PortXmitDiscards == 1]
GUID 0x2c902004b0918 port 1: [PortXmitDiscards == 1]
GUID 0x2c902004b0918 port 2: [LinkErrorRecoveryCounter == 1] [LinkDownedCounter == 1]
GUID 0x2c902004b0918 port 28: [SymbolErrorCounter == 65535] [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 255] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4]
Errors for 0xe41d2d030031e9c1 "MF0;GWIB01:SX6036G/U1"
GUID 0xe41d2d030031e9c1 port ALL: [LinkDownedCounter == 7] [PortRcvRemotePhysicalErrors == 1485] [PortXmitWait == 87808]
GUID 0xe41d2d030031e9c1 port 0: [PortXmitWait == 87808]
GUID 0xe41d2d030031e9c1 port 9: [SymbolErrorCounter == 1] [LinkDownedCounter == 2] [PortRcvRemotePhysicalErrors == 1485]
GUID 0xe41d2d030031e9c1 port 10: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
GUID 0xe41d2d030031e9c1 port 33: [LinkDownedCounter == 1]
GUID 0xe41d2d030031e9c1 port 34: [LinkDownedCounter == 1]
GUID 0xe41d2d030031e9c1 port 35: [LinkDownedCounter == 1]
GUID 0xe41d2d030031e9c1 port 36: [LinkDownedCounter == 1]
Errors for 0xf45214030073f500 "MF0;SWIB02:SX6018/U1"
GUID 0xf45214030073f500 port ALL: [LinkDownedCounter == 2] [PortXmitWait == 6380344]
GUID 0xf45214030073f500 port 0: [PortXmitWait == 14354]
GUID 0xf45214030073f500 port 4: [PortXmitWait == 1514987]
GUID 0xf45214030073f500 port 5: [PortXmitWait == 1569766]
GUID 0xf45214030073f500 port 6: [PortXmitWait == 1620863]
GUID 0xf45214030073f500 port 7: [PortXmitWait == 1660374]
GUID 0xf45214030073f500 port 16: [LinkDownedCounter == 1]
GUID 0xf45214030073f500 port 18: [LinkDownedCounter == 1]
Errors for 0xe41d2d030031eb41 "MF0;GWIB02:SX6036G/U1"
GUID 0xe41d2d030031eb41 port ALL: [LinkDownedCounter == 7] [PortRcvRemotePhysicalErrors == 2047] [PortXmitWait == 103260]
GUID 0xe41d2d030031eb41 port 0: [PortXmitWait == 103260]
GUID 0xe41d2d030031eb41 port 9: [LinkDownedCounter == 3] [PortRcvRemotePhysicalErrors == 2047]
GUID 0xe41d2d030031eb41 port 33: [LinkDownedCounter == 1]
GUID 0xe41d2d030031eb41 port 34: [LinkDownedCounter == 1]
GUID 0xe41d2d030031eb41 port 35: [LinkDownedCounter == 1]
GUID 0xe41d2d030031eb41 port 36: [LinkDownedCounter == 1]
Errors for "cibosd08 HCA-1"
GUID 0xe41d2d03007b77c1 port 1: [PortXmitWait == 3387]
GUID 0xe41d2d03007b77c2 port 2: [PortXmitWait == 3351]
Errors for "cibosd07 HCA-1"
GUID 0xe41d2d03007b67c1 port 1: [PortXmitWait == 3165]
GUID 0xe41d2d03007b67c2 port 2: [PortXmitWait == 3364]
Errors for "cibosd06 HCA-1"
GUID 0xe41d2d03007b77b1 port 1: [PortXmitWait == 2962]
GUID 0xe41d2d03007b77b2 port 2: [PortXmitWait == 3259]
Errors for "cibosd05 HCA-1"
GUID 0xe41d2d0300d95191 port 1: [PortXmitWait == 3213]
GUID 0xe41d2d0300d95192 port 2: [PortXmitWait == 4189]
Errors for "cibosd04 HCA-1"
GUID 0xf45214030095a6f1 port 1: [PortRcvRemotePhysicalErrors == 595] [PortXmitWait == 1861]
GUID 0xf45214030095a6f2 port 2: [PortXmitWait == 698289]
Errors for "cibosd03 HCA-1"
GUID 0xf45214030095ad91 port 1: [PortRcvRemotePhysicalErrors == 501] [PortXmitWait == 2317]
GUID 0xf45214030095ad92 port 2: [PortXmitWait == 734853]
Errors for "cibosd01 HCA-1"
GUID 0xf45214030095a701 port 1: [PortRcvRemotePhysicalErrors == 860] [PortXmitWait == 1975]
GUID 0xf45214030095a702 port 2: [PortXmitWait == 1459727]
Errors for "cibosd02 HCA-1"
GUID 0xf45214030095a6c1 port 1: [PortRcvRemotePhysicalErrors == 540] [PortXmitWait == 2282]
GUID 0xf45214030095a6c2 port 2: [PortXmitWait == 1080397]
Errors for "cibmon03 HCA-1"
GUID 0xe41d2d0300163631 port 1: [PortXmitWait == 219]
Errors for "cibmon02 HCA-1"
GUID 0xe41d2d0300163a61 port 1: [PortXmitWait == 24887]
Errors for 0xe41d2d0300097630 "MF0;SWIB01:SX6018/U1"
GUID 0xe41d2d0300097630 port ALL: [LinkDownedCounter == 2] [PortRcvRemotePhysicalErrors == 2912] [PortRcvSwitchRelayErrors == 248] [PortXmitWait == 62134]
GUID 0xe41d2d0300097630 port 0: [PortXmitWait == 27162]
GUID 0xe41d2d0300097630 port 1: [PortRcvSwitchRelayErrors == 16]
GUID 0xe41d2d0300097630 port 2: [PortRcvSwitchRelayErrors == 23]
GUID 0xe41d2d0300097630 port 3: [PortRcvSwitchRelayErrors == 21] [PortXmitWait == 34972]
GUID 0xe41d2d0300097630 port 4: [PortRcvSwitchRelayErrors == 53]
GUID 0xe41d2d0300097630 port 5: [PortRcvSwitchRelayErrors == 76]
GUID 0xe41d2d0300097630 port 6: [PortRcvSwitchRelayErrors == 30]
GUID 0xe41d2d0300097630 port 7: [PortRcvSwitchRelayErrors == 29]
GUID 0xe41d2d0300097630 port 16: [LinkDownedCounter == 1] [PortRcvRemotePhysicalErrors == 1673]
GUID 0xe41d2d0300097630 port 17: [LinkDownedCounter == 1]
GUID 0xe41d2d0300097630 port 18: [PortRcvRemotePhysicalErrors == 1239]
Errors for "cibmon01 HCA-1"
GUID 0xe41d2d0300163651 port 1: [PortXmitWait == 4071]
## Summary: 19 nodes checked, 17 bad nodes found
## 171 ports checked, 54 ports have errors beyond threshold
##
## Suppressed:
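Since the dump above is long, here is a minimal Python sketch that picks out the counters that look pegged at their maximum. It assumes that 255 and 65535 are the ceilings of the 8-bit and 16-bit error counters, so treat it as a heuristic (a 16-bit counter could happen to read 255). The name `saturated_counters` and the `SAMPLE` lines, copied from the output above, are only for illustration.

```python
import re

# Assumption: these are the maximum values of 8-bit and 16-bit counters
SATURATED = {255, 65535}

PORT_RE = re.compile(r"GUID (0x[0-9a-fA-F]+) port (\w+):")
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)\]")

def saturated_counters(text):
    """Return (guid, port, counter, value) for counters that look pegged."""
    hits = []
    for line in text.splitlines():
        port = PORT_RE.search(line)
        # Skip non-port lines and the per-node "port ALL" summary line
        if not port or port.group(2) == "ALL":
            continue
        for name, value in COUNTER_RE.findall(line):
            if int(value) in SATURATED:
                hits.append((port.group(1), int(port.group(2)), name, int(value)))
    return hits

# Lines copied from the ibqueryerrors output above
SAMPLE = """Errors for "Intel Infiniband HCA ubuntu"
GUID 0x11750000791fec port 1: [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 132] [PortRcvErrors == 8]
GUID 0x2c902004b0918 port ALL: [SymbolErrorCounter == 65535] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4] [PortXmitDiscards == 1]
GUID 0x2c902004b0918 port 28: [SymbolErrorCounter == 65535] [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 255] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4]
GUID 0xe41d2d030031e9c1 port 10: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
"""

for guid, port, name, value in saturated_counters(SAMPLE):
    print(guid, port, name, value)
```

On these sample lines it singles out the blade HCA port 1, port 28 of the enclosure switch, and port 10 of GWIB01, which are the links I would look at first.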
Any ideas? I've already tried setting the port speed to "7", but with no luck; with that setting the link does not come up at all. It only comes up with speed "4".
Thanks in advance,
Cheers,
German