This is exactly what I'm thinking, with the addition that host2 has been working fine (with that module loaded) for almost 2 months.
The OS and kernel are the same on both hosts (CentOS 6.6, 2.6.32-504.el6.x86_64). The NVIDIA drivers have never changed, so they are still identical, and so is the InfiniBand driver.
These are the only differences in the loaded kernel modules (from the lsmod command); everything else is identical:
HOST1
ib_core 125047 13 nv_peer_mem,rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_umad 14390 4
ipv6 334932 342 ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6,ib_ipoib,ib_core,ib_addr
nvidia 8594822 2 nvidia_uvm,nv_peer_mem
nv_peer_mem 4006 0 <<--- module loaded !
HOST2
ib_core 125047 12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_umad 14390 0
ipv6 334932 426 ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6,ib_ipoib,ib_core,ib_addr
nvidia 8594822 1 nvidia_uvm
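
In case it helps anyone reproduce the comparison above, here is a minimal sketch of how the two module lists could be diffed. It assumes the raw lsmod output from each host is available as plain text; the function names parse_lsmod and diff_lsmod are only illustrative, not part of any existing tool.

```python
def parse_lsmod(text):
    """Parse lsmod output into {module: (size, use_count, users)}."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if len(fields) < 3 or fields[0] == "Module":
            continue
        name, size, use_count = fields[0], int(fields[1]), int(fields[2])
        # The fourth column, when present, is a comma-separated list of users.
        users = ()
        if len(fields) > 3:
            users = tuple(sorted(u for u in fields[3].split(",") if u))
        modules[name] = (size, use_count, users)
    return modules


def diff_lsmod(text_a, text_b):
    """Return (only_in_a, only_in_b, changed) as sorted lists of module names."""
    a, b = parse_lsmod(text_a), parse_lsmod(text_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    # "changed" means the module is loaded on both hosts but its size,
    # use count, or list of users differs.
    changed = sorted(n for n in set(a) & set(b) if a[n] != b[n])
    return only_a, only_b, changed
```

With the lsmod output of each host saved to a file (for example host1.txt and host2.txt, hypothetical names), calling diff_lsmod(open("host1.txt").read(), open("host2.txt").read()) would list the modules present on only one host and those whose use counts differ.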
What could have caused the change, and how can I synchronize the two hosts again?
Thanks and bye,
Stefano