With Bitfusion, VMware and Mellanox, GPU accelerators can now be part of a common infrastructure resource pool, available over the network to any virtual machine in the data center in full or partial configurations. The solution works with any type of GPU server and any networking configuration, such as TCP, RoCE or InfiniBand. IT can now pool resources and offer elastic GPUs as a service, much like network-attached storage, enabling dynamic assignment of GPU resources based on an organization’s business needs and priorities.
Figure 1. Elastic AI Infrastructure with Bitfusion, VMware and Mellanox
Mellanox and Bitfusion set up the infrastructure shown in Figure 1 to emulate a real-life Elastic AI Infrastructure. The test bed included a cluster of Dell R740 GPU servers and Dell R640 CPU servers (no GPUs), a Mellanox SN2700 100GbE switch and Mellanox ConnectX-5 cards. On the clients, VMware vSphere ESXi 6.5 was set up, with Ubuntu 16.04 as the VM operating system, CUDA 9.1, cuDNN 7.3 and TensorFlow 1.9.
Bitfusion FlexDirect runs in user space and doesn’t require any changes to the OS, drivers, kernel modules or AI frameworks. FlexDirect can also support a heterogeneous cluster with mixed operating systems: for instance, a FlexDirect client running on Ubuntu can connect to a FlexDirect server running on CentOS, and vice versa.
Figure 2a compares the performance of GPUs attached remotely over the network with that of the same workload running locally on the GPU system, over 100Gbps RoCE. Figure 2b shows the same comparison for 10Gbps RoCE. Figure 3a compares the performance of multiple network-attached fractional (half) GPUs with that of the full physical GPU, over 100Gbps RoCE. Figure 3b shows the same comparison for 10Gbps RoCE.
Bitfusion FlexDirect with VMware and Mellanox demonstrates that network-attached full and fractional GPUs achieve near-native performance across the suite of benchmarks.
Figure 2a. Images/sec with FlexDirect remote GPUs on ESXi 6.5, Ubuntu 16.04 VMs with ConnectX-5 Ethernet cards (100Gbps RoCE)
Figure 2b. Images/sec with FlexDirect remote GPUs on ESXi 6.5, Ubuntu 16.04 VMs with ConnectX-5 Ethernet cards (10Gbps RoCE)
Figure 3a. Images/sec with FlexDirect partial GPUs on ESXi 6.5, Ubuntu 16.04 VMs with ConnectX-5 Ethernet cards (100Gbps RoCE)
Figure 3b. Images/sec with FlexDirect partial GPUs on ESXi 6.5, Ubuntu 16.04 VMs with ConnectX-5 Ethernet cards (10Gbps RoCE)
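One way to read the images/sec charts is as a ratio: remote throughput divided by local throughput, or the summed throughput of several fractional GPUs divided by that of one full GPU. The sketch below shows that arithmetic in Python. The function names are hypothetical, and the numbers are illustrative placeholders, not measurements from this test bed.

```python
def percent_of_native(remote_images_per_sec, native_images_per_sec):
    """Return remote throughput as a percentage of local (native) throughput."""
    if native_images_per_sec <= 0:
        raise ValueError("native throughput must be positive")
    return 100.0 * remote_images_per_sec / native_images_per_sec


def fractional_vs_full(fractional_images_per_sec, full_images_per_sec):
    """Compare the summed throughput of several fractional GPUs with one full GPU."""
    return percent_of_native(sum(fractional_images_per_sec), full_images_per_sec)


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT results from the figures.
    example_remote = 95.0
    example_native = 100.0
    print(f"{percent_of_native(example_remote, example_native):.1f}% of native")

    # Two hypothetical half-GPU clients compared with one full physical GPU.
    example_halves = [50.0, 50.0]
    print(f"{fractional_vs_full(example_halves, example_native):.1f}% of full GPU")
```

A value close to 100% for a given model and network speed corresponds to what the text calls near-native performance.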
Combining Bitfusion FlexDirect software with VMware vSphere and Mellanox’s portfolio of networking solutions allows customers to consolidate multiple siloed GPU clusters into a single shared platform, decreasing CapEx and OpEx while increasing productivity.
Visit https://www.bitfusion.io/product/flexdirect to get more information on Elastic Network Attached GPUs.