So we wanted to deploy Ceph on parts of our my-webspace.at infrastructure. Ceph is a highly available, distributed, infinitely scalable storage solution, and it's open source too! But there's one caveat - it needs really fast connections between nodes. Our main infrastructure runs on Gigabit Ethernet. That has been fine so far: we have extensive monitoring and the network is not causing any bottlenecks, but it is too slow for Ceph. Upgrading the core network to 10 Gigabit would be very costly, increase power consumption and only be marginally beneficial.
Luckily we got our hands on some used dual-port Mellanox X4242A Infiniband cards - they were basically free. They are not very useful without Infiniband storage, but they can easily be switched to Ethernet mode using two commands that we simply put into /etc/rc.local (the PCI address in the path needs to be adapted for each system):
echo eth > /sys/devices/pci0000:60/0000:60:03.1/0000:62:00.0/mlx4_port1
echo eth > /sys/devices/pci0000:60/0000:60:03.1/0000:62:00.0/mlx4_port2
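Since the PCI address differs per machine, the same step can also be written without a hardcoded path. The following is only a sketch, not what we run in production: it assumes the mlx4 driver exposes the mlx4_port files under /sys/bus/pci/devices, and the function name and the optional root argument (useful for a dry run against a scratch directory) are made up for illustration.

```shell
#!/bin/sh
# Sketch: switch every mlx4 port found under a PCI sysfs tree to Ethernet mode.
# set_mlx4_ports_eth and its optional root argument are hypothetical helpers.

set_mlx4_ports_eth() {
    # Default to the standard sysfs PCI device directory.
    root="${1:-/sys/bus/pci/devices}"
    for port in "$root"/*/mlx4_port[0-9]*; do
        # Skip the unexpanded glob pattern when no mlx4 device is present.
        [ -e "$port" ] || continue
        # Writing "eth" selects Ethernet mode, as in the two commands above.
        echo eth > "$port" && echo "switched $port"
    done
}
```

Calling set_mlx4_ports_eth as root from /etc/rc.local would then cover both ports of every card in the machine.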
This results in two 10 Gigabit ports with a weird QSFP+ interface. With three servers, a dual-port card can be put into each one, so that every server connects directly to both of the others. The wiring forms a triangle, with one cable between each pair of servers.
However, how do you configure the networking stack to utilize those extra peer-to-peer links? Traditionally, networks are not designed to work very well with setups like this. Our approach was to assign the same IP address to all interfaces - the main 1 Gigabit port as well as both Infiniband ports - and to configure the routing table to prioritize the Infiniband links for specific hosts.
Also, since the cards are used, we needed to make sure that no outage occurs when a card suddenly dies. So we set everything up on some test servers and configured the routing tables. Then we ran a continuous ping, pulled a cable ... and the connection dropped. Linux is not smart enough to skip over the downed interface. Meh :/ - Also, it wouldn't catch a card that appears to work but is not transmitting data.
Our solution was to develop some custom software which continuously checks the connectivity and reconfigures the routing table. When the fast link dies, it switches to the slower 1 Gigabit interface in less than two seconds and alerts us. So everything will continue working, just slower. This makes it very safe to use in a production environment.
We made this Open Source. It's available here: https://git.dkia.at/dkia-oss/bypasslink
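The real implementation lives in the repository above. To illustrate the idea only, here is a rough sketch that is not taken from bypasslink: it keeps one host route per peer and re-points it when a probe through the direct link fails. The function names, interface names and addresses are all hypothetical, and the one-second probe interval and timeout are assumptions picked to land in the same range as the switchover time mentioned above.

```shell
#!/bin/sh
# Sketch of link failover via per-peer host routes. Not the bypasslink code;
# every name, interface and address here is a made-up example.

# Print the route command for one peer.
# Args: peer IP, fast (direct) interface, slow (fallback) interface, state (up|down).
route_cmd() {
    peer="$1"; fast="$2"; slow="$3"; state="$4"
    if [ "$state" = "up" ]; then
        echo "ip route replace $peer/32 dev $fast"
    else
        echo "ip route replace $peer/32 dev $slow"
    fi
}

# Probe a peer through one specific interface. A reply has to come back,
# so a card that looks up but passes no traffic also counts as down.
# Args: peer IP, interface.
link_state() {
    if ping -c 1 -W 1 -I "$2" "$1" >/dev/null 2>&1; then
        echo up
    else
        echo down
    fi
}

# Check one peer about once per second and re-point its route on every change.
# Args: peer IP, fast interface, slow interface. Needs root for "ip route".
monitor_peer() {
    peer="$1"; fast="$2"; slow="$3"; last=""
    while :; do
        state=$(link_state "$peer" "$fast")
        if [ "$state" != "$last" ]; then
            cmd=$(route_cmd "$peer" "$fast" "$slow" "$state")
            # Word splitting of $cmd is intentional: it holds a plain command line.
            $cmd
            # Placeholder for a real alert (mail, monitoring hook, ...).
            echo "link to $peer via $fast is $state" >&2
            last="$state"
        fi
        sleep 1
    done
}
```

On a server with the address scheme described above, one such loop per neighbour could be started like this, again with example values: monitor_peer 10.0.0.2 enp98s0 eno1 &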
After extensive testing we were almost ready to deploy this setup. One final part was missing: the cards only came with half-height PCIe slot brackets, and two of our servers use full-height cards. Based on the PCI Bracket Generator on Thingiverse, I created a slot cover for the Mellanox cards. The ZIP file contains the OpenSCAD code as well as an STL. For printing I used my Renkforce RF100. The object needs to be placed diagonally on the small build plate, screw holes down, and it printed out nicely.
After some printing we were finally able to install the cards, and everything worked flawlessly.
GNU GPL 3