Authors: Bhavesh Patel – Dell EMC & Mazhar Memon – Bitfusion

Deep Learning (DL), a key technique driving artificial intelligence innovations such as image recognition, chatbots, and self-driving cars, requires that algorithms be ‘trained’ on large data sets. Initially, this can be done on a single node (server). However, as models and datasets grow larger and more complex, it becomes essential to scale out.

Within a single node, power, thermal, storage, and memory limits cap the scale of the training solution. When a training workload reaches or exceeds the capabilities of a single node, it becomes necessary to scale out. So far, deep learning work has focused on single-node runs. But as neural networks become more integrated into existing workloads like Hadoop and HPC, it is becoming necessary to look at multi-node scaling.

A single node (server) supporting one to eight GPUs or accelerators is relatively simple to deploy and operate, but a multi-node system can improve performance significantly. In multi-node systems, training operations can leverage parallelism and increased capacity to handle larger models and datasets. In addition, where the number of users on the system is high or growing, scale-out may be required.

This is one of the areas Dell EMC has been focusing on: how can you enable GPU scaling, whereby users add GPU nodes as their workload needs increase? Also, how can you give the end customer the ability to scale out at the rack level as the number of users accessing these resources increases? Based on these requirements, Dell EMC started evaluating Bitfusion, which provides software to enable elastic deep learning and makes development of deep learning applications faster and more economical. There are other benefits to using the Bitfusion stack on top of a Dell EMC hardware solution, as it addresses pain points such as:

Limitations in hardware infrastructure: In a standard rack, airflow and power requirements limit the number of scale-up servers you can fit. Scaling up within a node (e.g. supporting 8 or 16 GPUs) is also limited by how much host memory and storage is available to service these GPUs. Figure 1, below, shows some of the limitations when configuring a standard 42U rack with scale-up servers.

Figure 1: Current infrastructure limitations

Software complexity overload: In the deep learning space alone there are many solutions available for software management, model management, infrastructure management, data management, and workload management. This creates a nightmare scenario for end users, who have to choose among all the different tools available to implement deep learning. Figure 2, below, maps the software solutions currently available in each area.

Figure 2: Today’s complex software ecosystem

The need to simplify and scale solutions, as the usage for deep learning within an organization grows, is clear. Some of the benefits of a partnered Dell EMC and Bitfusion approach are:

Converged rack solution: No matter how far you scale up (i.e. fit more GPUs within a node), at some point you will need to scale out to address distributed training. Using dense 1U servers like the PowerEdge C4130, with the right balance between GPU, memory, and IO, we can build a better converged rack solution. It provides better intra-rack network bandwidth, minimizes inter-rack data movement, and improves TCO per GPU. It also addresses composability, because it allows you to add nodes as demand grows. Figure 3, below, shows some of the benefits of a converged rack solution using the PowerEdge C4130.

Figure 3: Dell EMC converged rack solution with Bitfusion remote GPU virtualization

Streamlined deep learning development: Using the Bitfusion software stack, users can develop on pre-installed containers, which have optimized drivers, frameworks, and libraries. This allows users to train multiple models in a multi-GPU environment.

Figure 4: Bitfusion Flex – streamlined AI development

To address some of the pain points mentioned above, Dell EMC started working with Bitfusion to compare performance with locally and remotely attached GPUs. We targeted the Dell EMC PowerEdge C4130 server, a 1U form factor that can support 4x double-wide, full-height GPUs. There are several advantages to using the PowerEdge C4130, including:

  • Support for 4x GPU/accelerators across the front of the chassis to obtain the best cooling and performance
  • Flexible GPU, CPU, and IO configurations based on workload needs
  • Support for high-bandwidth networking cards (e.g. InfiniBand EDR at 100 Gb/s)
  • Since it’s 1U, it allows better rack utilization, with headroom to support TOR switches and storage servers

Figure 5 shows the PowerEdge C4130 and its internal feature set.

Figure 5: Dell EMC C4130 internal configuration

Hardware Configuration Setup

To demonstrate a variety of usage scenarios, we assembled a pair of Dell EMC PowerEdge R730 CPU servers and four PowerEdge C4130 GPU-enabled servers, connected with a Mellanox FDR switch as shown in Figure 6. This is our baseline configuration, for which upgrades are available, including Pascal P100s, EDR networking, and a large variety of PCIe configurations to optimize data movement depending on the workload.

Figure 6: Test configuration

Server Configuration Details

The table below shows the internal configuration for R730 (Client) and C4130 (GPU nodes).

Dell EMC Hardware + Bitfusion Software Stack Setup

The software stack installed on the test configuration (Figure 6) is depicted in Figure 7. The ‘client server’, which may be a CPU-intensive server (PowerEdge R730) or a GPU-intensive server (PowerEdge C4130), runs the deep learning applications, such as TensorFlow and Caffe, as well as a Bitfusion Client Library in a fully containerized deployment. The Bitfusion client components are responsible for managing data transfers and providing virtualization features to the client CUDA or OpenCL application. Each of the servers runs the Bitfusion Service Daemon, which handles requests from one or more Bitfusion client endpoints and provides process isolation and resource guarantees based on the configured SLA. In the current test configuration, we focused mainly on the basic setup, allowing runtime attach of GPU resources, to evaluate performance and efficiency in the most common datacenter scenarios.

Figure 7: Software components

One important aspect of Bitfusion’s virtualization capabilities is the fine-grained control the user and administrator have over resource management. Not only can you scale up with disaggregated GPUs, but individual GPUs can also be partitioned into arbitrarily small virtual GPUs, as shown in Figure 8. Conventional device virtualization approaches are limited to common fractions (1/2, 1/4, 1/8, etc.) of a device and require a full system reboot. With Bitfusion’s virtualization layer, device partitions can be arbitrary fractions (e.g. 1/7, 5/20, 3/4, etc.) and can be assigned on a per-user or per-application basis, without any system in the cluster being rebooted. The implications are dramatically increased control over resources, higher utilization, and a significantly better user experience, resulting in accelerated AI development.
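To make the idea of arbitrary fractional partitions concrete, here is a minimal bookkeeping sketch in Python. It is not Bitfusion's API: the class `GpuPartitioner`, its methods, and the owner names are hypothetical, and the sketch only tracks shares of a single device as exact fractions, rejecting any request that would oversubscribe it.

```python
from fractions import Fraction


class GpuPartitioner:
    """Hypothetical share tracker for one physical GPU (not Bitfusion's API)."""

    def __init__(self):
        self.shares = {}  # owner name -> Fraction of the device held

    def free(self):
        """Return the fraction of the device that is still unassigned."""
        return Fraction(1) - sum(self.shares.values(), Fraction(0))

    def assign(self, owner, share):
        """Give `owner` an additional `share` of the device, if it fits."""
        share = Fraction(share)
        if share <= 0 or share > self.free():
            raise ValueError(
                f"cannot assign {share} to {owner}; only {self.free()} free"
            )
        self.shares[owner] = self.shares.get(owner, Fraction(0)) + share
        return share


gpu = GpuPartitioner()
gpu.assign("alice", Fraction(1, 7))
gpu.assign("bob", Fraction(5, 20))  # Fraction normalizes 5/20 to 1/4
remaining = gpu.free()              # 1 - 1/7 - 1/4 = 17/28
print(f"unassigned share: {remaining}")
```

Exact rational arithmetic avoids the rounding drift that floating-point shares would accumulate; a request for 3/4 at this point would be rejected, because only 17/28 of the device remains.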

Figure 8: Bitfusion partial GPUs


Transport level benchmarking

To evaluate the efficiency of remotely attached GPU resources, we examined the test setup from the ground up, focusing on data movement efficiency. First, we measured bandwidth and latency between all possible source and destination endpoints and created performance matrices as measured at the CUDA application level. NVIDIA includes bandwidth and latency tests with its CUDA SDK, which allowed us to quickly measure data movement overheads as seen by any CUDA application.

Table 1 shows the throughput between the host CPU and GPUs in the GPU server (C4130). As can be seen, host-to-GPU throughput is close to PCIe 3.0 x16 speeds, while GPU-to-GPU throughput is around 5-6 GB/s. Internal GPU bandwidth, as highlighted by the green diagonal, nears 100 GB/s.
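For reference, the theoretical ceiling that "PCIe 3.0 x16 speeds" refers to can be worked out from the link parameters. The short Python sketch below uses the PCIe 3.0 signaling rate (8 GT/s per lane) and its 128b/130b line encoding; the function name is illustrative only, and measured application-level throughput will sit below this figure because of protocol and copy overheads.

```python
# PCIe 3.0 link parameters
GT_PER_S = 8e9          # transfers per second per lane (one bit per transfer)
ENCODING = 128 / 130    # 128b/130b encoding: 128 payload bits per 130 sent
LANES = 16


def pcie3_bandwidth_gbs(lanes=LANES):
    """Theoretical one-direction bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bits_per_s = GT_PER_S * ENCODING * lanes
    return bits_per_s / 8 / 1e9


peak = pcie3_bandwidth_gbs()
print(f"PCIe 3.0 x16 theoretical peak: {peak:.2f} GB/s")
```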

Table 2 shows low latency within the GPU of ~7μs, and relatively higher latency between GPUs over PCIe of ~20-30μs, with all of the associated CUDA memory copy overheads.

1. Bandwidth and latency data – Intra-node

Table 1: Single-node bandwidth matrix showing bandwidth from the host CPU (H) and all GPUs (0-3), listed in columns, to all other GPUs in the 4-GPU C4130 server.

Table 2: Single node latency matrix

2. Bandwidth and latency data – Intra- and Inter-node

By combining four PowerEdge C4130 servers, we can effectively create a 16 GPU virtual server, capable of running a much larger workload.  The bandwidth and latency matrices below exhibit very good properties: minimal NUMA effects, uniform performance between all GPUs, and GPU-to-GPU latencies that are better than native.

How is this achieved? 

The Bitfusion virtualization layer has several runtime optimizations, which automatically select the best combination of transports: PCIe, InfiniBand, GPUDirect RDMA, as well as host CPU copies to achieve the best results.
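The source does not describe how Bitfusion chooses among transports, so the following Python sketch is only one plausible way to think about such a choice: model each path as a fixed latency plus size divided by bandwidth, and pick the path with the lowest estimated time for a given transfer. The `Transport` class, the path names, and all latency and bandwidth values are hypothetical placeholders, not measurements from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    """A data path described by a simple latency + bandwidth cost model."""
    name: str
    latency_s: float
    bandwidth_bytes_per_s: float

    def transfer_time(self, size_bytes):
        """Estimated seconds to move `size_bytes` over this path."""
        return self.latency_s + size_bytes / self.bandwidth_bytes_per_s


def pick_transport(transports, size_bytes):
    """Return the transport with the lowest estimated transfer time."""
    return min(transports, key=lambda t: t.transfer_time(size_bytes))


# Placeholder values chosen only to illustrate the trade-off.
candidates = [
    Transport("low-latency path", latency_s=5e-6, bandwidth_bytes_per_s=2e9),
    Transport("high-bandwidth path", latency_s=50e-6, bandwidth_bytes_per_s=10e9),
]

small = pick_transport(candidates, 4 * 1024)           # 4 KiB message
large = pick_transport(candidates, 64 * 1024 * 1024)   # 64 MiB buffer
print(small.name, "|", large.name)
```

Under this model, small messages favor the path with the lowest fixed latency, while large buffers favor the path with the highest bandwidth, which is why a per-transfer choice can outperform any single fixed transport.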

3. Relative performance of remote vs. native GPUs (intra- and inter-node) – with Caffe and TensorFlow

The next step in our evaluation is to assess application performance, first by measuring how efficiently remotely attached GPUs perform relative to native GPUs, as shown in Figure 9. We compare the training throughput, as measured by total training time, of four native GPUs (N4), four locally attached GPUs with Bitfusion’s virtualization layer (L4), two local and two remote GPUs (L2R2), and finally four remotely attached GPUs (R4).
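Because throughput here is derived from total training time on the same workload, relative performance is simply the ratio of the native time to the time of the configuration under test. A minimal Python sketch of that calculation follows; the function name and the example times are placeholders chosen to exercise the formula, not results from this study.

```python
def relative_performance(native_time_s, config_time_s):
    """Throughput of a configuration relative to native (1.0 means parity)."""
    if native_time_s <= 0 or config_time_s <= 0:
        raise ValueError("training times must be positive")
    return native_time_s / config_time_s


# Placeholder training times in seconds (not measured data).
example_times = {"baseline": 200.0, "faster": 160.0, "slower": 250.0}
ratios = {
    name: relative_performance(example_times["baseline"], t)
    for name, t in example_times.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x of baseline throughput")
```

A ratio above 1.0 means the configuration finished the same training run faster than the native baseline; a ratio below 1.0 means it was slower.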

The results are fairly impressive: even with the potential “virtualization” overhead, every scenario using virtualized or remote GPUs resulted in a slight increase in application performance relative to native performance. Bitfusion engineers explain that many runtime optimizations exist to make both data movement and the application’s interface to CUDA as efficient as possible.

Figure 9: Several ways to create a 4-GPU virtual machine

Extending testing across frameworks and batch sizes, we indeed see that the remotely attached GPUs achieve native performance, as shown in Figure 10.

Figure 10: Remote vs. Natively attached GPU performance

Finally, we ran Caffe and TensorFlow over all available GPUs in the system and measured total training time. It’s clear that Bitfusion offers a powerful new virtualization technology to elastically manipulate compute resources, while also enabling a highly streamlined AI development experience.  The results are shown below in Figure 11.

Figure 11.1: Demonstrated multi-node scaling

Figure 11.2: Demonstrated multi-node scaling
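Multi-node scaling results of this kind are commonly summarized as speedup and scaling efficiency relative to a single-GPU run. The Python sketch below shows that arithmetic; the function name and the example times are hypothetical placeholders, not values read from Figure 11.

```python
def scaling_efficiency(time_1gpu_s, time_ngpu_s, n_gpus):
    """Return (speedup, efficiency) for an n-GPU run versus a 1-GPU run."""
    if n_gpus < 1 or time_1gpu_s <= 0 or time_ngpu_s <= 0:
        raise ValueError("need n_gpus >= 1 and positive times")
    speedup = time_1gpu_s / time_ngpu_s
    return speedup, speedup / n_gpus


# Placeholder times in seconds: 1600 s on 1 GPU, 125 s on 16 GPUs.
speedup, efficiency = scaling_efficiency(1600.0, 125.0, 16)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```

An efficiency of 1.0 corresponds to perfectly linear scaling; values below that indicate time lost to communication and synchronization between GPUs.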


As can be seen from the performance data above, we are able to achieve the same or better performance using remotely attached GPUs. This is not limited to GPUs; it also applies to other accelerators that can be used for deep learning.

Some key takeaways:

  • The PowerEdge C4130 server, with a density of 4x accelerators in 1U, is well suited for GPU scaling based on its flexibility and thermal efficiency.
  • Using Bitfusion as a software stack with Dell EMC infrastructure, organizations can grow their deep learning capacity as the number of users accessing GPUs grows.

For more information, visit