Channel: Mellanox Interconnect Community: Message List

Infiniband SX6036G/SX6018F and QLogic HP BLc 4X QDR IB Switch


Hi all,

 

I'm really new to IB, and I'm having some issues connecting my existing IB network (SX6036G gateways and SX6018F switches) to a new HP enclosure with a QLogic HP BLc 4X QDR IB Switch and QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02) mezzanine adapters on each of the blades. Here's my topology:

 

# ibswitches

Switch    : 0x0002c902004b0918 ports 32 "Infiniscale-IV Mellanox Technologies" base port 0 lid 29 lmc 0     --> QLogic HP BLc 4X QDR IB Switch

Switch    : 0xe41d2d030031e9c1 ports 37 "MF0;GWIB01:SX6036G/U1" enhanced port 0 lid 24 lmc 0

Switch    : 0xf45214030073f500 ports 18 "MF0;SWIB02:SX6018/U1" enhanced port 0 lid 1 lmc 0

Switch    : 0xe41d2d030031eb41 ports 37 "MF0;GWIB02:SX6036G/U1" enhanced port 0 lid 23 lmc 0

Switch    : 0xe41d2d0300097630 ports 18 "MF0;SWIB01:SX6018/U1" enhanced port 0 lid 2 lmc 0

 

The SM is running on switch SWIB01 with priority 8.

 

The problem appears when I try to configure the blades. They run Ubuntu 14.04.3 LTS with the following modules loaded:

 

ib_ucm

ib_uverbs

ib_ipoib

ib_cm

ib_sa

ib_umad

ib_mthca

ib_qib

ib_mad

ib_core

ib_addr

dca

 

If I run "ibstat" on one of the blades, I get:

 

root@ubuntu:~# ibstat

CA 'qib0'

    CA type: InfiniPath_QMH7342

    Number of ports: 2

    Firmware version:

    Hardware version: 2

    Node GUID: 0x0011750000791fec

    System image GUID: 0x0011750000791fec

    Port 1:

        State: Down

        Physical state: Polling

        Rate: 40

        Base lid: 30

        LMC: 0

        SM lid: 2

        Capability mask: 0x0761086a

        Port GUID: 0x0011750000791fec

        Link layer: InfiniBand

    Port 2:

        State: Down

        Physical state: Polling

        Rate: 40

        Base lid: 65535

        LMC: 0

        SM lid: 65535

        Capability mask: 0x0761086a

        Port GUID: 0x0011750000791fed

        Link layer: InfiniBand
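
For anyone who wants to check many blades at once, here is a minimal sketch that parses "ibstat" text like the output above and lists the ports that are not Active. The function names (parse_ibstat, ports_not_active) and the shortened SAMPLE text are my own illustration, not part of any IB tool, and the parser only assumes the "CA '...'" / "Port N:" / "Field: value" layout shown in this post.

```python
import re

def parse_ibstat(text):
    """Return {(ca_name, port_number): {field: value}} from ibstat text."""
    ports = {}
    ca = port = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"CA '(.+)'$", line)
        if m:                      # start of a new adapter section
            ca, port = m.group(1), None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:                      # start of a new port section
            port = int(m.group(1))
            ports[(ca, port)] = {}
            continue
        if port is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[(ca, port)][key.strip()] = value.strip()
    return ports

def ports_not_active(text):
    """List (ca, port, state, physical_state) for every port not Active."""
    return [(ca, port, f.get("State"), f.get("Physical state"))
            for (ca, port), f in sorted(parse_ibstat(text).items())
            if f.get("State") != "Active"]

# Shortened copy of the ibstat output from this post (illustrative only).
SAMPLE = """\
CA 'qib0'
    CA type: InfiniPath_QMH7342
    Number of ports: 2
    Port 1:
        State: Down
        Physical state: Polling
        Base lid: 30
        SM lid: 2
    Port 2:
        State: Down
        Physical state: Polling
        Base lid: 65535
        SM lid: 65535
"""

if __name__ == "__main__":
    for ca, port, state, phys in ports_not_active(SAMPLE):
        print(f"{ca} port {port}: {state} / {phys}")
```

In practice the text would come from running ibstat on each blade; how it is collected (ssh, a scheduler, and so on) is left open here.
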

 

Now, if I go to a host that's inside the IB network and run the following commands, I'm able to bring the port to 'Active', but only for a while:

 

# ibportstate -L 29 28 disable

# ibportstate -L 29 28 speed 4

# ibportstate -L 29 28 espeed 4

# ibportstate -L 29 28 smlid 2

# ibportstate -L 29 28 enable

 

# ibportstate -L 29 28

Switch PortInfo:

# Port info: Lid 29 port 28

LinkState:.......................Active

PhysLinkState:...................LinkUp

Lid:.............................75

SMLid:...........................2328

LMC:.............................0

LinkWidthSupported:..............1X or 4X

LinkWidthEnabled:................1X or 4X

LinkWidthActive:.................4X

LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps

LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps

LinkSpeedActive:.................10.0 Gbps

Peer PortInfo:

# Port info: Lid 29 DR path slid 4; dlid 65535; 0,28 port 1

LinkState:.......................Active

PhysLinkState:...................LinkUp

Lid:.............................30

SMLid:...........................2

LMC:.............................0

LinkWidthSupported:..............1X or 4X

LinkWidthEnabled:................1X or 4X

LinkWidthActive:.................4X

LinkSpeedSupported:..............10.0 Gbps (IBA extension)

LinkSpeedEnabled:................10.0 Gbps (IBA extension)

LinkSpeedActive:.................10.0 Gbps

Mkey:............................<not displayed>

MkeyLeasePeriod:.................0

ProtectBits:.....................0

 

 

On the Blade host:

 

root@ubuntu:~# ibstat

CA 'qib0'

    CA type: InfiniPath_QMH7342

    Number of ports: 2

    Firmware version:

    Hardware version: 2

    Node GUID: 0x0011750000791fec

    System image GUID: 0x0011750000791fec

    Port 1:

        State: Active

        Physical state: LinkUp

        Rate: 40

        Base lid: 30

        LMC: 0

        SM lid: 2

        Capability mask: 0x0761086a

        Port GUID: 0x0011750000791fec

        Link layer: InfiniBand

    Port 2:

        State: Down

        Physical state: Polling

        Rate: 40

        Base lid: 65535

        LMC: 0

        SM lid: 65535

        Capability mask: 0x0761086a

        Port GUID: 0x0011750000791fed

        Link layer: InfiniBand

 

 

But after a while the port goes Down again and connectivity is lost.

 

If I run "ibqueryerrors" on the host that works fine, I get the following:

 

# ibqueryerrors

Errors for "Intel Infiniband HCA ubuntu"

   GUID 0x11750000791fec port 1: [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 132] [PortRcvErrors == 8]

Errors for 0x2c902004b0918 "Infiniscale-IV Mellanox Technologies"

   GUID 0x2c902004b0918 port ALL: [SymbolErrorCounter == 65535] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4] [PortXmitDiscards == 1]

   GUID 0x2c902004b0918 port 1: [PortXmitDiscards == 1]

   GUID 0x2c902004b0918 port 2: [LinkErrorRecoveryCounter == 1] [LinkDownedCounter == 1]

   GUID 0x2c902004b0918 port 28: [SymbolErrorCounter == 65535] [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 255] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4]

Errors for 0xe41d2d030031e9c1 "MF0;GWIB01:SX6036G/U1"

   GUID 0xe41d2d030031e9c1 port ALL: [LinkDownedCounter == 7] [PortRcvRemotePhysicalErrors == 1485] [PortXmitWait == 87808]

   GUID 0xe41d2d030031e9c1 port 0: [PortXmitWait == 87808]

   GUID 0xe41d2d030031e9c1 port 9: [SymbolErrorCounter == 1] [LinkDownedCounter == 2] [PortRcvRemotePhysicalErrors == 1485]

   GUID 0xe41d2d030031e9c1 port 10: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]

   GUID 0xe41d2d030031e9c1 port 33: [LinkDownedCounter == 1]

   GUID 0xe41d2d030031e9c1 port 34: [LinkDownedCounter == 1]

   GUID 0xe41d2d030031e9c1 port 35: [LinkDownedCounter == 1]

   GUID 0xe41d2d030031e9c1 port 36: [LinkDownedCounter == 1]

Errors for 0xf45214030073f500 "MF0;SWIB02:SX6018/U1"

   GUID 0xf45214030073f500 port ALL: [LinkDownedCounter == 2] [PortXmitWait == 6380344]

   GUID 0xf45214030073f500 port 0: [PortXmitWait == 14354]

   GUID 0xf45214030073f500 port 4: [PortXmitWait == 1514987]

   GUID 0xf45214030073f500 port 5: [PortXmitWait == 1569766]

   GUID 0xf45214030073f500 port 6: [PortXmitWait == 1620863]

   GUID 0xf45214030073f500 port 7: [PortXmitWait == 1660374]

   GUID 0xf45214030073f500 port 16: [LinkDownedCounter == 1]

   GUID 0xf45214030073f500 port 18: [LinkDownedCounter == 1]

Errors for 0xe41d2d030031eb41 "MF0;GWIB02:SX6036G/U1"

   GUID 0xe41d2d030031eb41 port ALL: [LinkDownedCounter == 7] [PortRcvRemotePhysicalErrors == 2047] [PortXmitWait == 103260]

   GUID 0xe41d2d030031eb41 port 0: [PortXmitWait == 103260]

   GUID 0xe41d2d030031eb41 port 9: [LinkDownedCounter == 3] [PortRcvRemotePhysicalErrors == 2047]

   GUID 0xe41d2d030031eb41 port 33: [LinkDownedCounter == 1]

   GUID 0xe41d2d030031eb41 port 34: [LinkDownedCounter == 1]

   GUID 0xe41d2d030031eb41 port 35: [LinkDownedCounter == 1]

   GUID 0xe41d2d030031eb41 port 36: [LinkDownedCounter == 1]

Errors for "cibosd08 HCA-1"

   GUID 0xe41d2d03007b77c1 port 1: [PortXmitWait == 3387]

   GUID 0xe41d2d03007b77c2 port 2: [PortXmitWait == 3351]

Errors for "cibosd07 HCA-1"

   GUID 0xe41d2d03007b67c1 port 1: [PortXmitWait == 3165]

   GUID 0xe41d2d03007b67c2 port 2: [PortXmitWait == 3364]

Errors for "cibosd06 HCA-1"

   GUID 0xe41d2d03007b77b1 port 1: [PortXmitWait == 2962]

   GUID 0xe41d2d03007b77b2 port 2: [PortXmitWait == 3259]

Errors for "cibosd05 HCA-1"

   GUID 0xe41d2d0300d95191 port 1: [PortXmitWait == 3213]

   GUID 0xe41d2d0300d95192 port 2: [PortXmitWait == 4189]

Errors for "cibosd04 HCA-1"

   GUID 0xf45214030095a6f1 port 1: [PortRcvRemotePhysicalErrors == 595] [PortXmitWait == 1861]

   GUID 0xf45214030095a6f2 port 2: [PortXmitWait == 698289]

Errors for "cibosd03 HCA-1"

   GUID 0xf45214030095ad91 port 1: [PortRcvRemotePhysicalErrors == 501] [PortXmitWait == 2317]

   GUID 0xf45214030095ad92 port 2: [PortXmitWait == 734853]

Errors for "cibosd01 HCA-1"

   GUID 0xf45214030095a701 port 1: [PortRcvRemotePhysicalErrors == 860] [PortXmitWait == 1975]

   GUID 0xf45214030095a702 port 2: [PortXmitWait == 1459727]

Errors for "cibosd02 HCA-1"

   GUID 0xf45214030095a6c1 port 1: [PortRcvRemotePhysicalErrors == 540] [PortXmitWait == 2282]

   GUID 0xf45214030095a6c2 port 2: [PortXmitWait == 1080397]

Errors for "cibmon03 HCA-1"

   GUID 0xe41d2d0300163631 port 1: [PortXmitWait == 219]

Errors for "cibmon02 HCA-1"

   GUID 0xe41d2d0300163a61 port 1: [PortXmitWait == 24887]

Errors for 0xe41d2d0300097630 "MF0;SWIB01:SX6018/U1"

   GUID 0xe41d2d0300097630 port ALL: [LinkDownedCounter == 2] [PortRcvRemotePhysicalErrors == 2912] [PortRcvSwitchRelayErrors == 248] [PortXmitWait == 62134]

   GUID 0xe41d2d0300097630 port 0: [PortXmitWait == 27162]

   GUID 0xe41d2d0300097630 port 1: [PortRcvSwitchRelayErrors == 16]

   GUID 0xe41d2d0300097630 port 2: [PortRcvSwitchRelayErrors == 23]

   GUID 0xe41d2d0300097630 port 3: [PortRcvSwitchRelayErrors == 21] [PortXmitWait == 34972]

   GUID 0xe41d2d0300097630 port 4: [PortRcvSwitchRelayErrors == 53]

   GUID 0xe41d2d0300097630 port 5: [PortRcvSwitchRelayErrors == 76]

   GUID 0xe41d2d0300097630 port 6: [PortRcvSwitchRelayErrors == 30]

   GUID 0xe41d2d0300097630 port 7: [PortRcvSwitchRelayErrors == 29]

   GUID 0xe41d2d0300097630 port 16: [LinkDownedCounter == 1] [PortRcvRemotePhysicalErrors == 1673]

   GUID 0xe41d2d0300097630 port 17: [LinkDownedCounter == 1]

   GUID 0xe41d2d0300097630 port 18: [PortRcvRemotePhysicalErrors == 1239]

Errors for "cibmon01 HCA-1"

   GUID 0xe41d2d0300163651 port 1: [PortXmitWait == 4071]

 

## Summary: 19 nodes checked, 17 bad nodes found

##          171 ports checked, 54 ports have errors beyond threshold

##

## Suppressed:
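
Since the output above is long, a small sketch like the following can pull out the ports whose counters look pegged at their maximum. It assumes that 255 and 65535 mean a saturated 8-bit or 16-bit counter; that is a heuristic of mine, and a wider counter such as PortXmitWait could hit one of those values by coincidence. The names (saturated_counters, SAMPLE) are illustrative, and the sample lines are copied from the output in this post.

```python
import re

# Assumption: these are the maxima of 8-bit and 16-bit error counters.
SATURATED = {255, 65535}

LINE_RE = re.compile(r"GUID (0x[0-9a-f]+) port (\w+): (.*)")
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)\]")

def saturated_counters(text):
    """Return (guid, port, counter, value) for counters at a maximum value."""
    hits = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m or m.group(2) == "ALL":   # skip the per-node "ALL" summary
            continue
        guid, port, rest = m.groups()
        for name, value in COUNTER_RE.findall(rest):
            if int(value) in SATURATED:
                hits.append((guid, int(port), name, int(value)))
    return hits

# A few lines taken from the ibqueryerrors output in this post.
SAMPLE = """\
   GUID 0x11750000791fec port 1: [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 132] [PortRcvErrors == 8]
   GUID 0x2c902004b0918 port ALL: [SymbolErrorCounter == 65535] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4] [PortXmitDiscards == 1]
   GUID 0x2c902004b0918 port 1: [PortXmitDiscards == 1]
   GUID 0x2c902004b0918 port 28: [SymbolErrorCounter == 65535] [LinkErrorRecoveryCounter == 255] [LinkDownedCounter == 255] [PortRcvErrors == 65535] [PortRcvSwitchRelayErrors == 4]
   GUID 0xe41d2d030031e9c1 port 10: [SymbolErrorCounter == 65535] [LinkDownedCounter == 1]
"""

if __name__ == "__main__":
    for guid, port, name, value in saturated_counters(SAMPLE):
        print(f"{guid} port {port}: {name} = {value}")
```

On the sample lines this singles out port 28 of the QLogic blade switch, port 1 of the blade HCA on the other end of that link, and port 10 of GWIB01.
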

 

Any ideas? I've already tried setting the port speed to "7", but with no luck: the port does not come up at all with that value, only with speed "4".
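
To make the two speed values concrete, here is a tiny sketch that decodes them, assuming the number given to "ibportstate ... speed" is the PortInfo LinkSpeedEnabled bit mask (bit 1 = 2.5 Gbps, bit 2 = 5.0 Gbps, bit 4 = 10.0 Gbps). The function name decode_link_speed is just for illustration.

```python
# Assumption: value is the LinkSpeedEnabled bit mask used by ibportstate.
SPEED_BITS = {1: "2.5 Gbps", 2: "5.0 Gbps", 4: "10.0 Gbps"}

def decode_link_speed(mask):
    """Translate a speed bit mask into the wording ibportstate prints."""
    names = [label for bit, label in SPEED_BITS.items() if mask & bit]
    return " or ".join(names) if names else "none"

if __name__ == "__main__":
    for value in (4, 7):
        print(value, "->", decode_link_speed(value))
```

Under that assumption, "4" restricts the port to 10.0 Gbps only, while "7" allows 2.5, 5.0 or 10.0 Gbps and leaves the choice to link negotiation.
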

 

Thanks in advance,

 

Cheers,

 

German

