Hi,
I'm testing out ConnectX-3 Pro with VXLAN offload in our lab. With a single-stream iperf test we see ~34 Gbit/s for plain (non-VXLAN) traffic, but only ~28 Gbit/s with VXLAN encapsulation (test setup sketched below).
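For reference, the VXLAN leg of the test is set up roughly as follows; the interface name, VNI and addresses here are placeholders rather than our exact configuration, and the non-VXLAN numbers come from the same iperf invocation pointed at the underlay addresses directly:

  # receiver
  ip link add vxlan0 type vxlan id 42 dev eth2 remote 192.168.0.1 dstport 4789
  ip addr add 10.0.0.2/24 dev vxlan0
  ip link set vxlan0 up
  iperf -s

  # sender
  ip link add vxlan0 type vxlan id 42 dev eth2 remote 192.168.0.2 dstport 4789
  ip addr add 10.0.0.1/24 dev vxlan0
  ip link set vxlan0 up
  iperf -c 10.0.0.2 -t 60    # single stream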
In both cases the bottleneck is the CPU on the receiving side. The top consumers in a perf profile of the receiver (capture commands sketched after the listings) are:
Without VXLAN:
+ 24.27% iperf [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 6.49% iperf [kernel.kallsyms] [k] mlx4_en_process_rx_cq
+ 5.34% iperf [kernel.kallsyms] [k] tcp_gro_receive
+ 3.43% iperf [kernel.kallsyms] [k] dev_gro_receive
+ 3.28% iperf [kernel.kallsyms] [k] mlx4_en_complete_rx_desc
+ 3.05% iperf [kernel.kallsyms] [k] memcpy
+ 2.88% iperf [kernel.kallsyms] [k] inet_gro_receive
With VXLAN:
+ 20.06% iperf [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 6.04% iperf [kernel.kallsyms] [k] mlx4_en_process_rx_cq
+ 5.43% iperf [kernel.kallsyms] [k] inet_gro_receive
+ 3.29% iperf [kernel.kallsyms] [k] dev_gro_receive
+ 3.24% iperf [kernel.kallsyms] [k] tcp_gro_receive
+ 3.08% iperf [kernel.kallsyms] [k] skb_gro_receive
+ 3.02% iperf [kernel.kallsyms] [k] memcpy
+ 2.85% iperf [kernel.kallsyms] [k] mlx4_en_complete_rx_desc
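(The profiles above were captured on the receiver roughly as below; the exact invocation may have differed slightly, and the duration is arbitrary, chosen to cover the steady-state part of the transfer.)

  perf record -a -- sleep 30    # while the iperf stream is running
  perf report --stdio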
This is CentOS 6.5, kernel 3.15.0, firmware 2.31.5050.
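In case it matters for tuning suggestions, this is roughly how we check that the tunnel offloads are active on the port (eth2 is a placeholder for the ConnectX-3 Pro interface):

  ethtool -k eth2 | egrep 'tx-udp_tnl-segmentation|generic-receive-offload|rx-checksumming'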
We're certainly happy with 28 Gbit/s, but are there plans to reduce the VXLAN overhead further, ideally to the point where it adds no extra CPU cost at all? Or is there any tuning I can do towards the same goal?
- Thorvald