I have a pair of SuperMicro servers, each with a dual-port ConnectX-3 EN card. Each server is connected to an SX1012, and the two SX1012s are linked by a 40Gb trunk, all with Mellanox cabling. The OS and the cards report the link as up at 40Gb, so in theory there is a clean 40Gb path end to end, but netperf and iperf results are telling me anything but. Despite disabling PF, dialing up socket buffers, TCP window sizes, minimum segment sizes, and the stack send queue, and forcing interrupt CPU affinity, the best I can pull out of these cards at a 1500 MTU is about 3.5Gb/s. Only bumping to a 9000 MTU (and setting all the access and trunk interfaces on the SX1012s to match) gets me a jump in performance, and even then the consistent maximum is 14Gb/s.

For comparison, I have four other SuperMicro servers with Intel 82599 10Gb interfaces. Those are connected to the same SX1012 switches using the 10/40 fanout cables, and without nearly the same amount of goofing around or crazy network stack settings, each of them pushes near wire speed, 9.3Gb/s, at the stock 1500 MTU.
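One thing I noticed while staring at these numbers: if I convert the two throughput figures into rough packet rates (ignoring header overhead, so these are ballpark values), both MTUs land in the same few-hundred-thousand-packets-per-second range, which makes me wonder if I'm hitting a per-packet processing limit rather than a bandwidth limit:

```python
# Back-of-the-envelope: convert observed throughput into an approximate
# packet rate, assuming full-MTU frames and ignoring header overhead.
def pps(throughput_gbps, mtu_bytes):
    return throughput_gbps * 1e9 / (mtu_bytes * 8)

print(f"1500 MTU @ 3.5 Gb/s -> {pps(3.5, 1500):,.0f} packets/s")
print(f"9000 MTU @ 14 Gb/s  -> {pps(14, 9000):,.0f} packets/s")
# Both come out in the ~200-300k packets/s range, i.e. a 6x bigger frame
# only bought a 4x throughput gain, and the packet rate barely moved.
```

That pattern is consistent with the CPU or interrupt path being the bottleneck, not the 40Gb link itself, though I can't prove that's what is happening here.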
The ConnectX-3 cards have been flashed to the 2.1.5 firmware and are running the matching drivers. The SX1012 switches are running the latest 3.4.1120 MLNX-OS.
The throughput numbers are odd enough that they don't give me much of a clue about what is going on. The cards are certainly not secretly negotiating 10Gb while reporting 40. On the SX1012 side I don't see any errors or discards on the relevant interfaces.
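To rule out host-side drops that the switch wouldn't see, I've also been watching the per-interface counters on the servers themselves. A minimal sketch of what I'm doing, assuming a Linux host where these counters live in `/proc/net/dev` (the function name is my own):

```python
# Sketch: parse per-interface error/drop counters out of /proc/net/dev
# text (Linux) to check for host-side problems during a test run.
def parse_net_dev(text):
    """Return {iface: {rx_errs, rx_drop, tx_errs, tx_drop}}."""
    counters = {}
    for line in text.splitlines()[2:]:  # first two lines are column headers
        iface, _, data = line.partition(":")
        fields = data.split()
        if len(fields) < 12:
            continue
        # /proc/net/dev column order: rx bytes, packets, errs, drop, fifo,
        # frame, compressed, multicast, then the same eight fields for tx.
        counters[iface.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return counters

# Usage on a host:
#   parse_net_dev(open("/proc/net/dev").read())
```

So far those counters look as clean on the hosts as they do on the switches, which is part of why I'm stumped.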