Written by Sanidhya Gurudev
What is a GPU?
We all know what a GPU is: that costly component we demand even for running Notepad. Not quite. A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at computer graphics and image processing; they power image and video upscaling and give animations a realistic feel.
How does a GPU help in Deep Learning?
GPUs are optimized for training artificial intelligence and deep learning models because they can process many computations simultaneously. Additionally, computations in deep learning need to handle huge amounts of data, which makes a GPU's high memory bandwidth especially suitable.
Why do we need more hardware for deep learning?
For any neural network, the training phase of the deep learning model is the most resource-intensive task.
While training, a neural network takes in inputs, which are processed in hidden layers using weights that are adjusted during training, and the model then outputs a prediction. The weights are adjusted to find patterns that improve the predictions. Both of these operations are essentially matrix multiplications. A simple matrix multiplication can be represented by the image below.
In a neural network, the first array is the input to the neural network, while the second array forms its weights.
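As a minimal sketch of this idea (all names and sizes here are invented for illustration), a dense layer's forward pass is exactly one matrix multiplication of inputs by weights:

```python
import numpy as np

# Illustrative sketch: a single dense layer's forward pass is one
# matrix multiplication of an input batch by a weight matrix.
rng = np.random.default_rng(0)

inputs = rng.standard_normal((4, 3))   # batch of 4 samples, 3 features each
weights = rng.standard_normal((3, 2))  # 3 inputs -> 2 output units

# Each outputs[i, j] is the dot product of one input row with one
# weight column -- many small, independent multiply-adds.
outputs = inputs @ weights

print(outputs.shape)  # (4, 2): one 2-value prediction per sample
```

Every entry of `outputs` can be computed independently of the others, which is precisely what a GPU exploits.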
If your neural network has around 10, 100, or even 100,000 parameters, a computer can still handle this in a matter of minutes, or hours at most.
But what if your neural network has more than 10 billion parameters? It would take years to train this kind of system employing the traditional approach. Your computer would probably give up before you’re even one-tenth of the way.
“A neural network that takes search input and predicts from 100 million outputs, or products, will typically end up with about 2,000 parameters per product. So you multiply those, and the final layer of the neural network is now 200 billion parameters. And I have not done anything sophisticated. I’m talking about a very, very dead simple neural network model.” — a Ph.D. student at Rice University
Making Deep Learning Models Train Faster
Deep Learning models can be trained faster by simply running all operations at the same time instead of one after the other.
You can achieve this by using a GPU to train your model. A GPU is a specialized processor with dedicated memory that conventionally performs the floating-point operations required for rendering graphics. In other words, it is a single-chip processor for extensive graphical and mathematical computations, which frees up CPU cycles for other jobs. The main difference between GPUs and CPUs is that GPUs devote proportionally more transistors to arithmetic logic units and fewer to caches and flow control than CPUs do. While CPUs are mostly suited to problems that require parsing through or interpreting complex logic in code, GPUs were designed to be the dedicated graphical rendering workhorses of computer games and were later enhanced to accelerate other geometric calculations (for instance, transforming polygons or rotating vertices into different coordinate systems). A GPU core is smaller than a CPU core, but a GPU tends to have many more logical cores (arithmetic logic units or ALUs, control units, and memory caches) than a CPU.
GPUs, originally developed for accelerating graphics processing, can dramatically speed up computational processes for deep learning. They are an essential part of modern artificial intelligence infrastructure, and new GPUs have been developed and optimized specifically for deep learning. GPUs can perform multiple computations simultaneously. This enables the distribution of training processes and can significantly speed up machine learning operations. With GPUs, you can accumulate many cores that use fewer resources without sacrificing efficiency or power.
When designing the deep learning architecture, the decision to include GPUs relies on several factors:
Memory bandwidth—GPUs can provide the bandwidth needed to accommodate large datasets. This is because GPUs include dedicated video RAM (VRAM), enabling you to retain CPU memory for other tasks.
Dataset size—GPUs working in parallel scale more easily than CPUs, enabling you to process massive datasets faster. The larger your datasets, the greater the benefit you can gain from GPUs.
Optimization—a downside of GPUs is that optimization of long-running individual tasks is sometimes more difficult than with CPUs.
Monitoring GPU usage accelerates deep learning by helping data scientists optimize expensive compute resources and improve the quality of their models. GPU utilization metrics measure the percentage of time your GPU kernels are running. You can use these metrics to determine your GPU capacity requirements and to identify bottlenecks in your pipelines, and you can access them with NVIDIA's system management interface (nvidia-smi).
One of the most admired characteristics of a GPU is its ability to compute processes in parallel; this is where the concept of parallel computing kicks in. A CPU, in general, completes its tasks sequentially. A CPU can be divided into cores, and each core takes up one task at a time. Suppose a CPU has two cores: two different processes can then run on these two cores, thereby achieving multitasking.
But still, these processes execute in a serial fashion.
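A rough sketch of this multitasking model (using Python threads purely to illustrate scheduling two independent tasks on two workers; the task itself is invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: a 2-core CPU can run two independent tasks at once, but each
# "core" still works through its own task serially, step by step.
def task(name, n):
    total = 0
    for i in range(n):   # a sequential loop of steps
        total += i
    return name, total

with ThreadPoolExecutor(max_workers=2) as pool:  # one worker per "core"
    futures = [pool.submit(task, "A", 1000), pool.submit(task, "B", 2000)]
    results = dict(f.result() for f in futures)

print(results)  # {'A': 499500, 'B': 1999000}
```

Two tasks finish concurrently, yet inside each one the steps still execute in serial order, which is the point made above.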
This doesn’t mean that CPUs aren’t good enough. In fact, CPUs are really good at handling different tasks across different operations: running the operating system, handling spreadsheets, playing HD videos, extracting large zip files, all at the same time. These are things a GPU simply cannot do.
Where does the difference lie?
As discussed previously, a CPU is divided into multiple cores so that it can take on multiple tasks at the same time, whereas a GPU has hundreds or thousands of cores, all dedicated to a single task. GPU workloads are simple computations that are performed frequently and are independent of each other. Both processors store frequently required data in their respective cache memory, thereby following the principle of ‘locality of reference.’
Much software, and many games, can take advantage of GPUs for execution. The idea is to make some parts of the task or application code parallel, not the entire process, because most of a task's steps have to be executed sequentially. For example, logging into a system or application does not need to be parallel.
When part of the execution can be done in parallel, it is simply shifted to the GPU for processing while, at the same time, the sequential part executes on the CPU; afterwards, both parts of the task are combined again.
In the GPU market, there are two main players: AMD and Nvidia. Nvidia GPUs are widely used for deep learning because they have extensive support in the form of software, drivers, CUDA, and cuDNN. In AI and deep learning, Nvidia has been the pioneer for a long time.
Neural networks are said to be embarrassingly parallel, which means that their computations can easily be executed in parallel and are independent of each other.
Some computations, like the calculation of weights and activation functions of each layer and parts of backpropagation, can be carried out in parallel; many research papers cover this as well.
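To see why these computations are embarrassingly parallel, consider a toy layer (the sizes and the ReLU activation are chosen just for illustration): each neuron's output depends only on the layer's input, so computing the neurons one by one or all at once gives identical results.

```python
import numpy as np

# Sketch: within a layer, each neuron's activation depends only on the
# layer's input, so all neurons can be computed independently.
rng = np.random.default_rng(1)
x = rng.standard_normal(3)        # layer input
W = rng.standard_normal((4, 3))   # 4 neurons, 3 weights each

def relu(v):
    return np.maximum(v, 0.0)

# Serial version: one neuron at a time (one CPU core's approach).
serial = np.array([relu(W[i] @ x) for i in range(4)])

# Batched version: all neurons at once (what a GPU parallelizes).
parallel = relu(W @ x)

assert np.allclose(serial, parallel)
```

The two versions agree exactly; the only difference is whether the independent dot products run one after another or simultaneously.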
Nvidia GPUs come with specialized cores known as CUDA cores, which help accelerate deep learning.
A Big Leap with Tensor Cores
Back in 2018, Nvidia launched a new lineup of GPUs: the 2000 series. Also called RTX, these cards come with tensor cores dedicated to deep learning, a technology first introduced with the Volta architecture.
Tensor cores are specialized cores that multiply two 4 x 4 FP16 matrices in half precision and add the result to a 4 x 4 FP16 or FP32 matrix, producing a 4 x 4 FP16 or FP32 output with the accumulation performed at full precision.
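The operation can be sketched on a CPU with NumPy (this only mimics the arithmetic, not the actual tensor-core hardware): D = A @ B + C, with A and B in FP16 and the accumulation in FP32.

```python
import numpy as np

# CPU sketch of a tensor core's mixed-precision operation:
# D = A @ B + C, where A and B are 4x4 FP16 matrices and the
# products are accumulated in FP32.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# Inputs are half precision; accumulation happens in full precision.
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.dtype, D.shape)  # float32 (4, 4)
```

Keeping the inputs in FP16 halves the memory traffic, while the FP32 accumulator avoids the rounding error that would pile up if the sums were also done in half precision.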
As stated by Nvidia, the new-generation tensor cores of the Volta architecture are much faster than the CUDA cores of the Pascal architecture. This gave a huge boost to deep learning.
At the time of writing this blog, Nvidia has announced the latest 3000 series of its GPU lineup, built on the Ampere architecture. It doubles the performance of the tensor cores and brings new precision formats such as TF32 (TensorFloat-32) and FP64 (64-bit floating point). TF32 works like FP32 but with speedups of up to 20x; as a result, Nvidia claims that the training or inference time of models will be reduced from weeks to hours.
Best GPUs for Deep Learning Projects
While consumer GPUs are not suitable for large-scale deep learning projects, these processors can provide a good entry point for deep learning. Consumer GPUs can also be a cheaper supplement for less complex tasks, such as model planning or low-level testing.
In particular, the Titan V has been shown to provide performance similar to datacenter-grade GPUs when it comes to Word RNNs. Additionally, its performance for CNNs is only slightly below higher tier options.
Deep Learning GPUs for Small scale and consumer-based Projects:
NVIDIA Titan V The Titan V is a PC GPU that was designed for use by scientists and researchers. It is based on NVIDIA’s Volta technology and includes Tensor Cores. The Standard edition provides 12GB memory, 110 teraflops performance, a 4.5MB L2 cache, and a 3,072-bit memory bus. The CEO edition provides 32GB memory and 125 teraflops performance, 6MB cache, and 4,096-bit memory bus. The latter edition also uses the same 8-Hi HBM2 memory stacks that are used in the 32GB Tesla units.
NVIDIA Titan RTX The Titan RTX is a PC GPU based on NVIDIA’s Turing GPU architecture that is designed for creative and machine learning workloads. It includes Tensor Core and RT Core technologies to enable ray tracing and accelerated AI. Each Titan RTX provides 130 teraflops, 24GB GDDR6 memory, 6MB cache, and 11 GigaRays per second.
NVIDIA GeForce RTX 2080 Ti The GeForce RTX 2080 Ti is a PC GPU designed for enthusiasts. It is based on the TU102 graphics processor. Each GeForce RTX 2080 Ti provides 11GB of memory, a 352-bit memory bus, a 6MB cache, and roughly 120 teraflops of performance.
Deep Learning GPUs for Large-scale Projects:
NVIDIA Tesla A100 The A100 is a GPU with Tensor Cores that incorporates multi-instance GPU (MIG) technology. It was designed for machine learning and data analytics. The Tesla A100 is meant to be scaled up to thousands of units and can be partitioned into seven GPU instances for any size of workload. Each Tesla A100 provides up to 624 teraflops of performance, 40GB of memory, 1,555 GB/s of memory bandwidth, and a 600GB/s interconnect.
NVIDIA Tesla K80 The Tesla K80 is a GPU based on the NVIDIA Kepler architecture that is designed to accelerate scientific computing and data analytics. It includes 4,992 NVIDIA CUDA cores and GPU Boost technology. Each K80 provides up to 8.73 teraflops of performance, 24GB of GDDR5 memory, and 480GB of memory bandwidth.
Nvidia GeForce RTX 3090 One of Nvidia’s latest GPUs, suited both to consumers and to professional workloads. It supports ray tracing and provides 10,496 Nvidia CUDA cores, making it well suited to large-scale projects.
Now we all know how efficient and essential a GPU is in the work of deep learning, but what if I told you that we can analyze and predict medical problems with a GPU? Well, it's true! And I just learned how it can help medical science.
GPU CUDA Programming Model in Medical Image Analysis
With the technological development of the medical industry, the data to be processed is expanding rapidly, and computation time increases due to factors like 3D and 4D treatment planning, the increasing sophistication of MRI pulse sequences, and the growing complexity of algorithms. GPUs address these problems through features such as high computational throughput, high memory bandwidth, support for floating-point arithmetic, and low cost. Compute Unified Device Architecture (CUDA) is a popular GPU programming model introduced by NVIDIA for parallel computing. This section briefly discusses the need for GPU CUDA computing in medical image analysis.
Medical image processing and analysis are computationally expensive, while the dimensionality of medical imaging data keeps increasing. A conventional CPU with a limited number of cores is not sufficient to process such huge data. The GPU is a technology capable of solving computational problems across engineering and medical fields. In the medical industry, the GPU is well suited to processing higher-dimensional data, and GPU computation provides a huge edge over the CPU in computation speed.
GPUs are highly parallel, multi-threaded, many-core processors with the high memory bandwidth needed to solve these computational problems. The main driver of the evolution of powerful GPUs has been the constant demand for greater realism in computer games. Over the past few decades, the computational performance of GPUs has increased much more quickly than that of conventional CPUs; hence they play a major role in modern industrial research and development. GPUs have already achieved significant speedups (2x-1000x) over CPU implementations in various fields.
A large performance gap exists between GPUs and general-purpose multi-core CPUs. Architectural level comparison of CPU and GPU is given in. The design of a CPU is optimized for sequential programming: it uses sophisticated control logic to allow instructions from a single thread of execution to run in parallel, or even out of their sequential order, while maintaining the appearance of sequential execution. Modern CPU microprocessors typically have four large processor cores designed to deliver strong sequential code performance, which is not enough to process huge data. A basic GPU model has a large number of processor cores, ALUs, control units, and various types of memory. In general, heterogeneous CPU-GPU computation is preferable to a standalone CPU or GPU implementation: dependent processes are best kept on the CPU, while independent processes can be accelerated by the GPU. GPUs running high numbers of threads give better performance.
Overview of GPU computing model – CUDA
CUDA is a parallel programming model whose instruction set architecture uses the parallel compute engine of NVIDIA GPUs to solve large computational problems. CUDA is an extension of the C programming language. CUDA programs contain two phases, executed on either the host or the device. There is no data parallelism in the host code; the phases that exhibit rich data parallelism are implemented in the device code. A CUDA program uses the NVIDIA C compiler (NVCC), which separates the two phases during compilation. The host code is ANSI C, and the device code is ANSI C with extended keywords. Windows users can compile CUDA programs with Microsoft Visual Studio 2008 onwards using NVIDIA Nsight.
GPU - CUDA programming model
GPU-CUDA hardware is built from three main parts to utilize the full computational capability of the GPU effectively. CUDA can execute a large number of parallel threads. Threads are grouped into blocks, and blocks are grouped into a grid. In this three-level hierarchy, execution is independent among the entities of the same level. A grid is a set of thread blocks that may each be executed independently. A block is organized as a 3D array of threads, and each block has a unique block ID. Threads are executed by the kernel function, and each thread has a unique thread ID (threadIdx). The total size of a block is limited to 1024 threads.
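The index arithmetic behind this hierarchy can be sketched in a few lines (a Python rendering of the expression a 1D CUDA kernel uses; the block size of 256 is just an example):

```python
# Sketch of CUDA's thread hierarchy: a kernel launch creates a grid of
# blocks, each block holds up to 1024 threads, and every thread derives
# a unique global index from its block ID and thread ID.
THREADS_PER_BLOCK = 256  # blockDim.x in CUDA terms (example value)

def global_thread_id(block_idx, thread_idx):
    # Same arithmetic a 1D CUDA kernel performs:
    #   i = blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * THREADS_PER_BLOCK + thread_idx

# Thread 5 of block 2 would process element 517 of a 1D array.
print(global_thread_id(2, 5))  # 517
```

Each thread then uses its global index to pick its own data element, which is how one kernel launch covers an entire image or array with no coordination between threads.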
GPU computation for medical image analysis
Nowadays, the modern medical industry produces large quantities of data and processes them with complex algorithms. Generally, 2D, 3D, and 4D volumes are generated by medical imaging modalities for diagnosis and surgical planning. These factors drive the need for high-performance computing systems with huge computational power and suitable hardware configurations. Filter design and registration are well-known preprocessing areas. Segmentation simplifies and changes the representation of an image into something more meaningful and easier to analyze and diagnose. Visualization is the best-known post-processing method in medical image representation. The major techniques involved in medical image analysis are denoising, registration, segmentation, and visualization, as shown in the figure given below.
1. Image denoising
Medical images obtained from MRI are generally affected by random noise that arises during the image acquisition, measurement, and transmission processes. Solving this problem can lead to improved diagnosis and surgical procedures. Image denoising is an important task in medical imaging applications, performed to enhance and recover hidden details from the data. Image registration and segmentation algorithms achieve their expected accuracy when denoising algorithms are applied first. The most commonly used denoising algorithms in the medical domain are adaptive filtering, anisotropic diffusion, bilateral filtering, and the non-local means filter. All of these algorithms partially or fully support data parallelism using a pixel-per-thread or voxel-per-thread scheme.
2. Adaptive filtering
This denoising approach uses an adaptive filter, introduced by Knutsson et al. in 1983. An adaptive filter is a self-modifying digital filter that adjusts its coefficients in an attempt to minimize an error function. The error function measures the distance between the reference signal and the output of the adaptive filter.
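A minimal sketch of that idea, using the classic least-mean-squares (LMS) update on a toy 1D signal (this is a generic adaptive filter for illustration, not Knutsson's image filter; signal and step size are invented):

```python
import numpy as np

# LMS adaptive-filter sketch: the filter repeatedly nudges its
# coefficients to shrink the error between a reference signal and
# its own output.
rng = np.random.default_rng(3)
true_w = np.array([0.5, -0.3])   # unknown system the filter should match
x = rng.standard_normal(500)     # input signal
w = np.zeros(2)                  # adaptive coefficients, start at zero
mu = 0.05                        # step size

for n in range(2, len(x)):
    window = x[n - 2:n][::-1]    # the two most recent input samples
    reference = true_w @ window  # desired (reference) output
    output = w @ window          # current filter output
    error = reference - output   # the error function to minimize
    w += mu * error * window     # LMS coefficient update

print(np.round(w, 2))  # converges close to [0.5, -0.3]
```

The coefficients drift toward the values that reproduce the reference signal, which is exactly the self-modifying behaviour described above.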
3. Region growing
Region growing is a commonly used medical image segmentation technique. Region growing starts with the initial seed point from the object which is given either manually or automatically using prior knowledge. The operation starts from the seed point and connects the neighboring pixel which is similar to the seed point based on some criteria. The criteria can be, for example, pixel intensity, grayscale texture, or color.
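A small sketch of seeded region growing on a tiny made-up "image" (the array values, seed, and intensity threshold are all illustrative):

```python
from collections import deque

# Sketch of seeded region growing: start at a seed pixel and absorb
# 4-connected neighbours whose intensity is within a threshold of
# the seed's intensity.
image = [
    [10, 11, 50, 52],
    [12, 10, 49, 51],
    [11, 13, 48, 50],
]

def region_grow(img, seed, threshold):
    rows, cols = len(img), len(img[0])
    seed_val = img[seed[0]][seed[1]]
    region, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(img[nr][nc] - seed_val) <= threshold):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

# Growing from the dark top-left corner captures only the dark half.
segment = region_grow(image, (0, 0), threshold=5)
print(len(segment))  # 6 dark pixels; the bright right half is excluded
```

Here the criterion is plain pixel intensity; in practice it could equally be texture or color, as noted above.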
The proposed method's performance was compared with single-core and quad-core CPU implementations using OpenMP on lung and colon images. In the CUDA implementation, they refer to information from neighboring voxels using eight threads due to the limited number of available threads. As the segmented-region size increased, the single-core and quad-core CPU methods required considerably more computation time, whereas the CUDA implementation exhibited a near-constant computation time. Westhoff presented a parallel seeded region growing algorithm for medical images obtained from polarized light imaging (PLI). Due to the very high sub-millimeter resolution, an immense amount of image data has to be reconstructed three-dimensionally before it can be analyzed. They chose region growing for segmentation and accelerated the algorithm by a factor of about 20 using CUDA, achieving the highest gain when creating 448 threads per block.
4. Morphological image processing
Morphological image processing is a structure-based analysis method used in combination with some segmentation methods. These operations are based on set theory over binary images and were introduced by Serra. The fundamental morphological operations are dilation and erosion, which expand and shrink image regions, respectively. These operations use a small matrix mask called a structuring element, filled with ones and zeros and shaped variously as a diamond, disk, octagon, square, line, or arbitrary pattern. Morphological operations are fully pixel-based, independent operations and therefore well suited to parallel processing in CUDA.
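A sketch of binary dilation and erosion with a 3x3 square structuring element (the tiny 5x5 image is invented; every output pixel depends only on its own neighbourhood, which is why each pixel maps naturally onto an independent CUDA thread):

```python
import numpy as np

# Binary dilation/erosion sketch with a 3x3 square structuring element.
img = np.zeros((5, 5), dtype=bool)
img[2, 2] = True                 # a single foreground pixel

def dilate(a):
    # OR together the image shifted to every 3x3 neighbour position.
    out = np.zeros_like(a)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out |= np.roll(np.roll(a, dr, axis=0), dc, axis=1)
    return out

def erode(a):
    # Erosion is dilation of the background (morphological duality).
    return ~dilate(~a)

grown = dilate(img)              # the pixel expands into a 3x3 square
shrunk = erode(grown)            # erosion shrinks it back to one pixel

print(grown.sum(), shrunk.sum())  # 9 1
```

Dilation expands the foreground and erosion shrinks it, matching the expansion and shrinking properties described above.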
A sample MRI brain grayscale image, a Ridler-thresholded binary image, the dilated image, the eroded image, and the corresponding structuring element are shown. Various works have implemented morphological operations using CUDA. Kalaiselvi et al. implemented a parallel morphological operations technique using CUDA on general images, taking images of various sizes for their experimental study. They implemented the operations in C++ and MATLAB for the CPU and in CUDA for the GPU, and concluded that the GPU implementation improved performance as the image size increased.
5. Watershed segmentation
In watershed segmentation, the grayscale image is viewed as a topographic surface and treated as a three-dimensional object, where the third dimension is the intensity value of each pixel. High intensity is considered a hill (watershed ridge line) and low intensity a valley (catchment basin). The watershed transform searches for the ridge lines that divide neighboring catchment basins. The watershed algorithm is especially useful for segmenting objects that touch one another. One drawback is that the watershed leads to over-segmentation when given a noisy image.
Pan et al. implemented a few medical image segmentation algorithms using CUDA, applying multi-degree watershed segmentation to abdomen and brain images. Vitor et al. proposed two parallel algorithms for the watershed transform focused on fast image segmentation using off-the-shelf GPUs. These algorithms combined serial and parallel techniques in a heterogeneous implementation and showed optimal results, with a performance gain of up to 14% as image size increases.
6. Surface rendering
Surface rendering constructs a polygonal surface from the given medical dataset and renders those surfaces. Surface rendering techniques require contour extraction to define the surface of the structure to be visualized. A surface rendering model created by a contour extraction process on edges, using 3D-Doctor on the BRATS tumor dataset, is shown. An algorithm places surface patches or tiles at each contour point, and the surface is rendered after shading and hidden-surface removal. Standard computer graphics processes can be applied for object shading. The GPU accelerates the geometric transformation and rendering processes. GPUs were originally made to speed up the memory-intensive calculations of demanding 3D computer games and are now increasingly used to accelerate numerical computations such as texture mapping, polygon rendering, and coordinate transformation.
The marching cubes (MC) algorithm was introduced by Lorensen and Cline for creating a 3D surface of triangles from a volumetric dataset of scalars. The algorithm uses a parameter called the iso-value to classify points in the dataset as either inside or outside the surface. The dataset is divided into a grid so that a number of cubes are formed, with the corner of each cube represented by a data point in the dataset. Smistad et al. proposed a data-parallel marching cubes algorithm for surface rendering on the GPU, using both OpenCL and CUDA for GPU programming. In their experiments, OpenCL gave better performance than CUDA, whose largest dataset caused memory exhaustion.
7. Volume rendering
Volume rendering resolves the issue of accurately representing detected surfaces and is used to visualize three-dimensional data. It visualizes the three spatial dimensions via a 2D projection through a semi-transparent volume, and its major application area is medical imaging. One of the most popular volume rendering techniques is ray casting. Ray casting does not rely on any geometric structure and overcomes a major limitation of surface extraction, namely its failure to project a thin shell in the acquisition plane. It requires a random search through a three-dimensional dataset, which demands a large amount of computational power and bandwidth; CUDA provides solutions to these problems. A parallel ray-casting method for forward projection was proposed by Weinlich et al. using CUDA and OpenGL (Open Graphics Library). They used two GPUs (GeForce 8800 GTX and Quadro FX 5600) for the CUDA and OpenGL implementations. Their results show that OpenGL performed better than CUDA 1.1, while CUDA 2.0 proved three times faster than the older CUDA 1.1. The final outcome of this work was a time reduction of up to 148x over an unoptimized CPU implementation. Zhang et al. presented a new algorithm that synchronizes the phases of the dynamic heart to clinical ECG signals to calculate and compensate for latencies in the visualization pipeline using the GPU. They implemented 4D cardiac visualization using three algorithms, 3D texture mapping (3DTM), software-based ray casting (SOFTRC), and hardware-accelerated ray casting (HWRC), and used three GPUs to compare the performance gain.
In the end, it really doesn't matter which GPU you are using: even if your GPU is outdated or less powerful than the ones mentioned here, it is still capable of heavy workloads. This study also showed how a lavishly costly computer part widely used in the gaming industry can change the way we look at artificial intelligence and medical science. We discussed the most common areas of GPU computing in medical image analysis, investigated existing work, and discussed the performance gains achieved with CUDA programming. This investigation underlines the importance of GPU computing in the medical industry. Finally, a few optimization concepts were suggested for medical image algorithms, and some facts were discussed for calculating the speedup ratio between CPU and GPU.