Hi David,
While your MPI test is running with the problematic host still in the fabric, check the health of the fabric by running:
#ibdiagnet -r -pc -P all=1 --pm_pause_time 600
Once it completes, review the fabric health in the log file /var/tmp/ibdiagnet2/ibdiagnet2.log
(i.e. port counters, link speed and width issues, etc.)
The ibdiagnet utility checks the health of the fabric and produces a full report.
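To get a quick overview of that log before reading it in full, something like the sketch below may help. It assumes that ibdiagnet marks error lines with a leading "-E-" and warning lines with "-W-"; please confirm this against your own log, since the format can differ between versions. The function name summarize_ibdiag_log is only an illustration, not part of the Mellanox tools.

```shell
# Hypothetical helper: count and list warning/error lines in an ibdiagnet log.
# Assumption: errors start with "-E-" and warnings with "-W-".
summarize_ibdiag_log() {
    log="${1:-/var/tmp/ibdiagnet2/ibdiagnet2.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # grep -c prints 0 when nothing matches, so the counts are always set.
    errors=$(grep -c -e '^-E-' "$log")
    warnings=$(grep -c -e '^-W-' "$log")
    echo "errors=$errors warnings=$warnings"
    # List the matching lines with their line numbers for follow-up.
    grep -n -e '^-E-' -e '^-W-' "$log"
    return 0
}
```

Calling summarize_ibdiag_log with no argument reads the default log path mentioned above; pass another path to check a saved copy.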
If this is the only problematic host, check that its OS, kernel, Mellanox OFED driver version and firmware are compatible.
If your MPI tests run without errors when this host is no longer in the fabric, try to determine the main differences between this host and the others.
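One possible way to compare hosts is to collect the same few version strings on each of them and diff the results. This is only a rough sketch: it assumes uname, ofed_info and ibv_devinfo are installed in the usual way, and the function name collect_host_info is made up for this example.

```shell
# Hypothetical helper: print kernel, OFED and firmware versions for one host.
collect_host_info() {
    echo "kernel=$(uname -r)"
    # ofed_info -s prints the installed Mellanox OFED version string.
    if command -v ofed_info >/dev/null 2>&1; then
        echo "ofed=$(ofed_info -s)"
    else
        echo "ofed=unavailable"
    fi
    # ibv_devinfo reports hca_id and fw_ver for each adapter.
    if command -v ibv_devinfo >/dev/null 2>&1; then
        ibv_devinfo 2>/dev/null | grep -e hca_id -e fw_ver \
            || echo "firmware=unavailable"
    else
        echo "firmware=unavailable"
    fi
}
```

Run it on the problematic host and on a known good one, save each output to a file, and compare the two files with diff.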
If the link downgrades from FDR to FDR10, try reseating the cable at both ends.
If the link remains at FDR10, swap the cable and test again.
Regards,
Sophie.