Hi all!
I am having trouble getting IPoIB to work in a XenServer environment.
Hardware:
IBM BladeCenter H with VOLTAIRE 40 Gb (QDR) INFINIBAND SWITCH MODULE
IBM HS22 blades with Mellanox 40Gb/s QDR Infiniband Expansion Card (CFFh)
Software:
XenServer 6.1 with MLNX_OFED_LINUX-2.1-1.0.0-xenserver6.x-i686.iso installed.
Output from blade3:
[root@blade3 ~]# ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.9.1000
    Hardware version: b0
    Node GUID: 0x0002c9030028b394
    System image GUID: 0x0002c9030028b397
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 5
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x0002c9030028b395
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c9030028b396
        Link layer: InfiniBand
[root@blade3 ~]# ibdev2netdev
mlx4_0 port 1 ==> ib0 (Up)
mlx4_0 port 2 ==> ib1 (Down)
Output from blade4:
[root@blade4 ~]# ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.9.1000
    Hardware version: b0
    Node GUID: 0x0002c9030028b38c
    System image GUID: 0x0002c9030028b38f
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x0002c9030028b38d
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c9030028b38e
        Link layer: InfiniBand
[root@blade4 ~]# ibdev2netdev
mlx4_0 port 1 ==> ib0 (Up)
mlx4_0 port 2 ==> ib1 (Down)
The problem is that I can't get opensm to run: it hangs after starting. Here is the log:
[root@blade4 ~]# cat /var/log/opensm.log
Jan 17 09:37:50 890720 [B75958D0] 0x03 -> OpenSM 4.0.5.MLNX20131217.d8345a7
Jan 17 09:37:50 890775 [B75958D0] 0x80 -> OpenSM 4.0.5.MLNX20131217.d8345a7
Jan 17 09:37:50 891449 [B75958D0] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jan 17 09:37:50 905260 [B75958D0] 0x80 -> Entering DISCOVERING state
Jan 17 09:37:50 905359 [B75958D0] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x2c9030028b38d
Jan 17 09:37:50 939076 [B75958D0] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x2c9030028b38d
Jan 17 09:37:50 939124 [B75958D0] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x2c9030028b38d
Jan 17 09:37:50 939169 [B75958D0] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x2c9030028b38d
Jan 17 09:37:50 939212 [B75958D0] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0002c9030028b38d
Jan 17 09:37:50 946927 [B3D8DB90] 0x80 -> Entering MASTER state
Jan 17 09:37:50 946952 [B3D8DB90] 0x01 -> osm_prtn_make_partitions: Partition configuration /etc/opensm/partitions.conf is not accessible (No such file or directory)
Jan 17 09:37:50 947868 [B3D8DB90] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Jan 17 09:37:50 949698 [B3D8DB90] 0x02 -> SUBNET UP
Jan 17 09:37:51 621411 [B6D93B90] 0x01 -> log_trap_info: Received Generic Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed) Producer:1 (Channel Adapter) from LID:1 TID:0x0000000000000016
Jan 17 09:37:51 621445 [B6D93B90] 0x02 -> trap_rcv_process_request: Trap 144 Node description update
Jan 17 09:37:51 621463 [B6D93B90] 0x02 -> log_notice: Reporting Generic Notice type:4 num:144 (CapabilityMask, NodeDescription, Link [Width|Speed] Enabled, SM priority changed) from LID:1 GID:fe80::2:c903:28:b38d
Jan 17 09:37:52 940296 [B6592B90] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::ffff:ffff
Jan 17 09:37:52 940710 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12d4
Initial path: 0,1 Return path: 0,28
Jan 17 09:37:52 940985 [B6592B90] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff28:b38d
Jan 17 09:37:52 941240 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12d7
Initial path: 0,1 Return path: 0,28
Jan 17 09:37:52 941521 [B6D93B90] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::1
Jan 17 09:37:52 941752 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12d8
Initial path: 0,1 Return path: 0,28
Jan 17 09:37:52 942026 [B6592B90] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1
Jan 17 09:37:52 942249 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12d9
Initial path: 0,1 Return path: 0,28
Jan 17 09:37:52 949767 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12da
Initial path: 0,1 Return path: 0,28
Jan 17 09:37:52 950519 [B5590B90] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:601b:ffff::1:ff28:b395
Jan 17 09:37:52 950979 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12de
Initial path: 0,1 Return path: 0,28
Jan 17 09:37:52 952236 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12e2
Initial path: 0,1 Return path: 0,28
Jan 17 09:37:54 530472 [B75958D0] 0x80 -> Exiting SM
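Since the same error repeats many times in the log above, a small script can make the pattern easier to see. The following is only a rough sketch that counts the "Received MAD with error status" entries by error code, status, and the attribute named on the following line. The function name `summarize_mad_errors` and the embedded sample are made up for illustration; the regular expressions assume the line layout shown in this log and may need adjusting for other opensm versions.

```python
import re
from collections import Counter

# Matches opensm error lines such as:
#   "... log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C"
ERR_RE = re.compile(r"ERR (\d+): Received MAD with error status = (0x[0-9A-Fa-f]+)")
# Matches the continuation line that names the MAD method and attribute,
# e.g. "SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12d4"
ATTR_RE = re.compile(r"^\s*(\w+)\((\w+)\)")


def summarize_mad_errors(log_text):
    """Count MAD errors keyed by (error code, status, method(attribute)).

    Hypothetical helper: assumes the attribute line directly follows
    the error line, as it does in the log pasted in this post.
    """
    counts = Counter()
    pending = None  # (code, status) still waiting for its continuation line
    for line in log_text.splitlines():
        if pending is not None:
            attr = ATTR_RE.match(line)
            label = f"{attr.group(1)}({attr.group(2)})" if attr else "unknown"
            counts[pending + (label,)] += 1
            pending = None
            if attr:
                continue  # continuation line consumed, move on
        err = ERR_RE.search(line)
        if err:
            pending = (err.group(1), err.group(2))
    if pending is not None:
        counts[pending + ("unknown",)] += 1
    return counts


# Two entries copied from the log above, used as a small sample.
sample = """\
Jan 17 09:37:52 940710 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12d4
Initial path: 0,1 Return path: 0,28
Jan 17 09:37:52 941240 [B258AB90] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x12d7
Initial path: 0,1 Return path: 0,28
"""

for (code, status, attr), n in summarize_mad_errors(sample).items():
    print(f"ERR {code} status={status} {attr}: {n}")
```

To run it against the real file, the text of /var/log/opensm.log could be read and passed to the function instead of the sample string.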
What's wrong, and how can I fix it?