Channel: Mellanox Interconnect Community: Message List

Help debugging Infiniband fabric

When running MPI jobs, we get the following message:

error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id c272de0 opcode 1  vendor error 129 qp_idx 3

whenever we include one specific host. According to various comments on the Open MPI mailing list, this points to a fabric problem. I have tried the obvious thing and swapped switch ports, but the error persists. Is there a better way to debug this than swapping parts until I prove it must be the board? When the error occurs, the system generally hangs completely: you can get a login prompt on the IPMI console, but you cannot type anything. The one time it didn't hang completely, I found that the InfiniBand link had changed from FDR to FDR10. Does anyone have any suggestions?
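
In case it helps anyone compare notes, here is a minimal Python sketch for pulling the fields out of messages like the one quoted above so they can be tallied across a job log. The regular expression is modeled only on that single line, so other Open MPI versions may word the message differently, and the names `parse_cq_error` and `tally_status` are made up for this illustration.

```python
import re
from collections import Counter

# Pattern modeled on the single error line quoted in this post (an assumption:
# other Open MPI releases may format the message differently).
CQ_ERROR = re.compile(
    r"error polling (?P<cq>\w+) CQ with status (?P<status>.+?) "
    r"status number (?P<status_number>\d+) for wr_id (?P<wr_id>[0-9a-fA-F]+)\s+"
    r"opcode (?P<opcode>\d+)\s+vendor error (?P<vendor_error>\d+)\s+"
    r"qp_idx (?P<qp_idx>\d+)"
)


def parse_cq_error(line):
    """Return the fields of a CQ polling error line as a dict, or None if no match."""
    match = CQ_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the decimal fields to int; wr_id stays a hex string.
    for key in ("status_number", "opcode", "vendor_error", "qp_idx"):
        fields[key] = int(fields[key])
    return fields


def tally_status(lines):
    """Count how often each status string appears in an iterable of log lines."""
    parsed = (parse_cq_error(line) for line in lines)
    return Counter(fields["status"] for fields in parsed if fields is not None)
```

Feeding it the stderr of a failing job, for example `tally_status(open("job.err"))` with a hypothetical log file name, gives a quick count per status, which makes it easier to see whether the failures are all retry-exceeded errors or a mix.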

