Hi,
This is my fourth day of fighting with SR-IOV and KVM.
I can ping from the VM to another IPoIB host, but when I try to run the ibnetdiscover command I get a SIGSEGV:
# ibnetdiscover
src/query_smp.c:98; send failed; -5
#
# Topology file: generated on Fri Jul 19 19:28:24 2013
#
Segmentation fault (core dumped)
Most of the ib* commands fail in a similar way; dmesg shows:
mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:3 in_param 0x29f3a000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1
mlx4_core 0000:04:00.0: vhcr command SET_PORT (0xc) slave:3 in_param 0x29f3a000 in_mod=0x1, op_mod=0x0 failed with error:0, status -22
mlx4_core 0000:04:00.0: slave 3 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting
It looks like the MAD_IFC firmware command is failing in the device for some reason, but I have no idea about the cause. This part of the code is possibly related:
+ if (slave != dev->caps.function &&
+ ((smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) ||
+ (smp->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED &&
+ smp->method == IB_MGMT_METHOD_SET))) {
+ mlx4_err(dev, "slave %d is trying to execute a Subnet MGMT MAD, "
+ "class 0x%x, method 0x%x for attr 0x%x. Rejecting\n",
+ slave, smp->method, smp->mgmt_class,
+ be16_to_cpu(smp->attr_id));
+ return -EPERM;
+ }
from
+static int mlx4_MAD_IFC_wrapper(struct mlx4_dev *dev, int slave,
+ struct mlx4_vhcr *vhcr,
+ struct mlx4_cmd_mailbox *inbox,
+ struct mlx4_cmd_mailbox *outbox,
+ struct mlx4_cmd_info *cmd)
Please find below some details about my build.
I would really appreciate it if anybody could point me in the right direction, or better yet, help me fix the issue.
Thanks in advance
Marcin
Host:
-----
Motherboard: Supermicro X9DRI-F
CPUs: 2x E5-2640
System: CentOS 6.3:2.6.32-279.el6.x86_64 and CentOS 6.4 2.6.32-358.el6.x86_64
Infiniband: Mellanox Technologies MT27500 Family [ConnectX-3], MCX354A-QCB
Mellanox OFED: MLNX_OFED_LINUX-2.0-2.0.5-rhel6.3-x86_64
qemu-kvm.x86_64 2:0.12.1.2-2.355.el6
#lspci | grep Mel
04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
04:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.2 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.3 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.4 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.5 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.6 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.7 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:01.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
#dmesg | grep mlx4
mlx4_core: Mellanox ConnectX core driver v1.1 (Apr 23 2013)
mlx4_core: Initializing 0000:04:00.0
mlx4_core 0000:04:00.0: PCI INT A -> GSI 32 (level, low) -> IRQ 32
mlx4_core 0000:04:00.0: setting latency timer to 64
mlx4_core 0000:04:00.0: Enabling SR-IOV with 5 VFs
mlx4_core 0000:04:00.0: Running in master mode
mlx4_core 0000:04:00.0: irq 109 for MSI/MSI-X
mlx4_core 0000:04:00.0: irq 110 for MSI/MSI-X
mlx4_core 0000:04:00.0: irq 111 for MSI/MSI-X
mlx4_core 0000:04:00.0: irq 112 for MSI/MSI-X
mlx4_core: Initializing 0000:04:00.1
mlx4_core 0000:04:00.1: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.1: setting latency timer to 64
mlx4_core 0000:04:00.1: Detected virtual function - running in slave mode
mlx4_core 0000:04:00.1: Sending reset
mlx4_core 0000:04:00.0: Received reset from slave:1
mlx4_core 0000:04:00.1: Sending vhcr0
mlx4_core 0000:04:00.1: HCA minimum page size:512
mlx4_core 0000:04:00.1: irq 113 for MSI/MSI-X
mlx4_core 0000:04:00.1: irq 114 for MSI/MSI-X
mlx4_core 0000:04:00.1: irq 115 for MSI/MSI-X
mlx4_core 0000:04:00.1: irq 116 for MSI/MSI-X
mlx4_core: Initializing 0000:04:00.2
mlx4_core 0000:04:00.2: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.2: setting latency timer to 64
mlx4_core 0000:04:00.2: Skipping virtual function:2
mlx4_core: Initializing 0000:04:00.3
mlx4_core 0000:04:00.3: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.3: setting latency timer to 64
mlx4_core 0000:04:00.3: Skipping virtual function:3
mlx4_core: Initializing 0000:04:00.4
mlx4_core 0000:04:00.4: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.4: setting latency timer to 64
mlx4_core 0000:04:00.4: Skipping virtual function:4
mlx4_core: Initializing 0000:04:00.5
mlx4_core 0000:04:00.5: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.5: setting latency timer to 64
mlx4_core 0000:04:00.5: Skipping virtual function:5
<mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (Apr 23 2013)
mlx4_core 0000:04:00.0: mlx4_ib: multi-function enabled
mlx4_core 0000:04:00.0: mlx4_ib: initializing demux service for 80 qp1 clients
mlx4_core 0000:04:00.1: mlx4_ib: multi-function enabled
mlx4_core 0000:04:00.1: mlx4_ib: operating in qp1 tunnel mode
mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.1 (Apr 23 2013)
mlx4_en 0000:04:00.0: Activating port:2
mlx4_en: eth2: Using 216 TX rings
mlx4_en: eth2: Using 4 RX rings
mlx4_en: eth2: Initializing port
mlx4_en 0000:04:00.1: Activating port:2
mlx4_en: eth3: Using 216 TX rings
mlx4_en: eth3: Using 4 RX rings
mlx4_en: eth3: Initializing port
mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is up
mlx4_core 0000:04:00.0: Received reset from slave:2
mlx4_core 0000:04:00.0: slave 2 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting
mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:2 in_param 0x106a10000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1
mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is up
mlx4_core 0000:04:00.0: slave 2 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting
mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:2 in_param 0x119079000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1
mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is down
mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is down
mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is up
mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is up
# ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.500
node_guid: 0002:c903:00a2:8fb0
sys_image_guid: 0002:c903:00a2:8fb3
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090110018
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
#cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core num_vfs=8 port_type_array=1,1 probe_vf=1
KVM Guest: CentOS 6.4 and CentOS 6.3
------------------------------------
Mellanox OFED: MLNX_OFED_LINUX-2.0-2.0.5-rhel6.3-x86_64
Kernel: 2.6.32-279.el6.x86_64
#lspci | grep Mel
00:07.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
#ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.500
node_guid: 0014:0500:c0bb:4473
sys_image_guid: 0002:c903:00a2:8fb3
vendor_id: 0x02c9
vendor_part_id: 4100
hw_ver: 0x0
board_id: MT_1090110018
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
# sminfo
ibwarn: [3673] _do_madrpc: send failed; Function not implemented
ibwarn: [3673] mad_rpc: _do_madrpc failed; dport (Lid 1)
sminfo: iberror: failed: query
OpenSM log:
Jul 19 09:57:54 001056 [C520D700] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jul 19 09:57:54 002074 [C520D700] 0x80 -> Entering DISCOVERING state
Using default GUID 0x14050000000002
Jul 19 09:57:54 191924 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x14050000000002
Jul 19 09:57:54 671075 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x14050000000002
Jul 19 09:57:54 671503 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x14050000000002
Jul 19 09:57:54 672363 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x14050000000002
Jul 19 09:57:54 672774 [C520D700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0014050000000002
Jul 19 09:57:54 673345 [C520D700] 0x01 -> osm_vendor_set_sm: ERR 5431: setting IS_SM capmask: cannot open file '/dev/infiniband/issm0': Invalid argument
Jul 19 09:57:54 674233 [C1605700] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f11b00008c0 of size 256 TID 0x1234 failed -5 (Invalid argument)
Jul 19 09:57:54 674278 [C1605700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_ERROR): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1234
Jul 19 09:57:54 674311 [C1605700] 0x01 -> vl15_send_mad: ERR 3E03: MAD send failed (IB_UNKNOWN_ERROR)
Jul 19 09:57:54 674336 [C0C04700] 0x01 -> state_mgr_is_sm_port_down: ERR 3308: SM port GUID unknown