Channel: Mellanox Interconnect Community: Message List

Re: Re: New to infiniband, can't get a working connection.


No worries.  Which OS are you using?

 

Is there any chance you could do stuff on CentOS/RHEL 6.4?

 

Asking that because it's what I'm super familiar with.

 

If you're ok with that, please install the CentOS/RHEL provided IB software, and also pciutils:

 

$ sudo yum groupinstall "Infiniband Support"

$ sudo yum install mstflint pciutils

$ sudo chkconfig rdma on

$ sudo service rdma start

 

Then let's do some basic info gathering so we know what we're dealing with.

 

  • Run lspci -Qvvs on the ConnectX card and on at least one of the InfiniHost III cards, then post the results here
  • Also query the firmware of both using mstflint

 

Example from a ConnectX card here.  First I find out its PCI address in the box:

 

$ sudo lspci |grep Mell

01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
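If there are several cards in the box, the address lookup can be scripted. This is only a rough sketch: it assumes the PCI address is the first whitespace-separated field of each lspci line, as in the output above, and the function name mlx_pci_addrs is made up for illustration.

```shell
# Print the PCI address (first field) of every lspci line mentioning Mellanox.
# Reads lspci-style output on stdin, so it also works on saved output.
# NOTE: mlx_pci_addrs is a hypothetical helper name, not part of pciutils.
mlx_pci_addrs() {
    awk '/Mellanox/ { print $1 }'
}

# Usage on a live system (needs pciutils installed):
#   lspci | mlx_pci_addrs
#   for addr in $(lspci | mlx_pci_addrs); do sudo lspci -Qvvs "$addr"; done
```

Reading from stdin rather than calling lspci inside the function means it can be tried against pasted output from someone else's machine.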

 

Then use lspci -Qvvs on that address, to retrieve all of the potentially useful info:


$ sudo lspci -Qvvs 01:00.0

01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
    Subsystem: Mellanox Technologies Device 0006
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 16
    Region 0: Memory at f7c00000 (64-bit, non-prefetchable) [size=1M]
    Region 2: Memory at f0000000 (64-bit, prefetchable) [size=8M]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [48] Vital Product Data
        Product Name: Eagle DDR
        Read-only fields:
            [PN] Part number: 375-3549-01
            [EC] Engineering changes: 51
            [SN] Serial number: 1388FMH-0905400010
            [V0] Vendor specific: PCIe x8
            [RV] Reserved: checksum good, 0 byte(s) reserved
        Read/write fields:
            [V1] Vendor specific: N/A
            [YA] Asset tag: N/A
            [RW] Read-write area: 111 byte(s) free
        End
    Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
        Vector table: BAR=0 offset=0007c000
        PBA: BAR=0 offset=0007d000
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #8, Speed 2.5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap:    MFVC- ACS-, Next Function: 1
        ARICtl:    MFVC- ACS-, Function Group: 0
    Kernel driver in use: mlx4_core
    Kernel modules: mlx4_core

 

Note the bits highlighted in blue in the original post (the Vital Product Data fields and the LnkSta line).  For ConnectX cards this stuff is useful.  My card shows a Sun part number, because it was originally a Sun badged card (now reflashed to stock firmware).  The PCIe link is also running at x8 width, which is what we want (if it weren't, that would indicate a problem).
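The link-width check can also be done mechanically on saved lspci output. A minimal sketch, under these assumptions: "a problem" is taken to mean that the negotiated width (LnkSta) is narrower than what the card advertises (LnkCap), the output has been saved to a file, and the helper names link_width and check_link_width are invented for this example. A card sitting in a physically narrower slot would also trip the warning, so treat it as a hint, not a verdict.

```shell
# Extract the "Width xN" value from the first line whose first field is the
# given label ("LnkCap:" or "LnkSta:"). Reads `lspci -vv` output on stdin.
# NOTE: link_width and check_link_width are hypothetical helper names.
link_width() {
    awk -v label="$1" '
        $1 == label {
            for (i = 2; i <= NF; i++)
                if ($i == "Width") {
                    w = $(i + 1)
                    sub(/,.*/, "", w)   # drop the trailing comma
                    print w
                    exit
                }
        }'
}

# Compare advertised and negotiated width for one device.
# $1 = file holding the lspci -vv output for that device.
check_link_width() {
    cap=$(link_width "LnkCap:" < "$1")
    sta=$(link_width "LnkSta:" < "$1")
    if [ -n "$cap" ] && [ "$cap" = "$sta" ]; then
        echo "OK: link width $sta matches capability $cap"
    else
        echo "WARNING: link width '$sta' differs from capability '$cap'"
        return 1
    fi
}

# Usage:
#   sudo lspci -Qvvs 01:00.0 > connectx.txt
#   check_link_width connectx.txt
```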

 

And the mstflint output example:

 

$ sudo mstflint -d 01:00.0 q

Image type:      ConnectX
FW Version:      2.9.1000
Device ID:       25418
Description:     Node             Port1            Port2            Sys image
GUIDs:           0003ba000100edb8 0003ba000100edb9 0003ba000100edba 0003ba000100edbb
MACs:                                 0003ba00edb9     0003ba00edba
Board ID:         (MT_04A0120002)
VSD:
PSID:            MT_04A0120002

 

That tells us the firmware version on the card.  Useful to know, as it might need upgrading (very easy to do).
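If you want just the version string (say, to compare several cards), it can be pulled out of the query output. A small sketch, assuming the "FW Version:" line looks like the one above; fw_version is an invented helper name, not an mstflint feature.

```shell
# Print the value of the "FW Version:" line from `mstflint -d <addr> q`
# output supplied on stdin.
# NOTE: fw_version is a hypothetical helper name.
fw_version() {
    awk -F': *' '/^FW Version:/ { print $2; exit }'
}

# Usage:
#   sudo mstflint -d 01:00.0 q | fw_version
```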

 

After you've pasted that info here, we can start figuring out whether anything is wrong with the basics and fix that first.  Then we can move on to the next stuff.

 

(note - edited for typo fixes)

