The issue appears to be with HP ConnectX cards. We were running two of them in our cluster with firmware 2.7. We downloaded 2.8 from the HP site and burned it after carefully checking that the PSID matched the documentation, but since the firmware update the HCA appears to be bricked.
lspci and mstflint both report the new firmware, but the card kernel-panics the host.
We are not using PCI passthrough; we only use the IB fabric for the storage network.