When running MPI, we get the message:
error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id c272de0 opcode 1 vendor error 129 qp_idx 3
whenever we include one specific host. From various comments on the Open MPI list, I understand that this points to a fabric problem. I tried the obvious thing and swapped switch ports, but the error persists. Is there a better way to debug this than swapping parts until I prove it must be the board?

When this happens, the system generally hangs completely: you get a login prompt on the IPMI console, but you cannot type at all. The one time it did not hang completely, I found that the InfiniBand link had changed from FDR to FDR10. Does anyone have any suggestions?