Parallel Programming: GPGPU and CUDA | Scientific Programmer | Skillshare




Lessons in This Class

18 Lessons (3h 6m)
    • 1. Welcome

    • 2. CUDA - High Level Concepts

    • 3. CUDA - Programming Model

    • 4. CUDA - Parallelizing a For Loop

    • 5. CUDA - Indexing Threads within Grids and Blocks

    • 6. CUDA Memory Model

    • 7. CUDA Synchronization

    • 8. CUDA Install (Windows)

    • 9. CUDA Install (Linux)

    • 10. CUDA First Program

    • 11. CUDA Vector Addition

    • 12. CUDA Unified Memory Vector Addition

    • 13. CUDA Matrix Multiplication

    • 14. CUDA Matrix Multiplication Code

    • 15. CUDA Cache Tiled Matrix Multiplication

    • 16. CUDA Accounting (nvidia-smi)

    • 17. CUDA Thrust Library

    • 18. CUDA Thrust Library - Part 2






About This Class

This course intends to explain the basics of CUDA using the C/C++ programming language, which you can use as a stepping stone to machine learning, deep learning, and big data careers.

You will learn:

  • GPU Basics

  • CUDA Threads and Blocks in various combinations

  • CUDA Coding Examples

CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. When it was first introduced, the name was an acronym for Compute Unified Device Architecture, but now it's only called CUDA. CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs: scattered reads, unified virtual memory (CUDA 4.0 and above), unified memory (CUDA 6.0 and above), shared memory, faster downloads and readbacks to and from the GPU, and full support for integer and bitwise operations, including integer texture lookups.


The CUDA platform is designed to work with programming languages such as C, C++, and Fortran. It has wrappers for many other languages, including Python, Java, and so on. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming. CUDA also supports programming frameworks such as OpenACC and OpenCL. CUDA is freeware and works on Windows, Linux, and Mac OS X.

Meet Your Teacher


Scientific Programmer

Instructor - Scientific Computing Expert


Scientific Programmer helps you learn to use scientific programming languages such as CUDA, Julia, OpenMP, MPI, C++, Matlab, Octave, Bash, Python, Sed, and AWK, including RegEx, in processing scientific and real-world data. We help you solve large-scale scientific, biological, engineering, and humanities problems, gain adequate understanding through the analysis of mathematical models implemented on high-performance computers, and share the knowledge.

See full profile




1. Welcome: This course presents to you a masterclass on CUDA programming with C++, covering zero to advanced features of CUDA and GPGPU. Instructors here are ready to help you with a wealth of knowledge on Linux, DevOps, HPC, and data sciences. I hope you will enjoy this course, and see you in the next lesson. 2. CUDA - High Level Concepts: The primary reason we want to use GPUs is the extreme computational power that they offer. There are two major advantages that GPUs have over CPUs: tremendous computational throughput and extremely high memory bandwidth. This graph illustrates how the computational throughput of NVIDIA and Intel processors has increased during this decade, measured in units of billions of floating point operations per second. The green lines represent NVIDIA's flagship GPU at different points in time, whereas the blue lines represent Intel's flagship CPU at different points in time. Focusing our attention on the GeForce Titan, we can see that the amount of computational throughput the GPU offers us is almost an entire order of magnitude greater than Intel's corresponding CPU. The reason behind the discrepancy in floating-point capability between graphics processors and CPUs is that the GPU is specialized for compute-intensive, highly parallel computation, which is exactly what graphics rendering is all about, and is therefore designed such that more transistors are devoted to data processing rather than data caching and advanced control logic, as we will see in the next slide. CPUs are designed to minimize latency and work very fast on just a few tasks. To achieve this latency minimization, the majority of the area of a CPU is dedicated to various sophisticated control logic and on-chip cache. This design works extremely well for very general purpose programs, such as running an operating system. Whereas CPUs are designed to minimize latency, GPUs are designed specifically to maximize computational throughput.
This design is characterized by dedicating the majority of the silicon's area to a massive number of arithmetic logic units, which NVIDIA calls CUDA cores. More specifically, the GPU is well suited to address problems that can be expressed as data-parallel computations, where the same program is executed on many data elements in parallel. Because the same program is executed for each data element, there is a lower requirement for sophisticated control logic. And because it is executed on many data elements, the memory access latency can be hidden with calculations instead of big caches. In order to properly utilize a GPU, the parts of the program that can be parallelized must be decomposed into a large number of threads that can run concurrently. In CUDA, these threads are defined by special functions called kernels. A kernel is simply a function that is executed on the GPU. The execution of a kernel is called launching a kernel. When we launch a kernel, the kernel is executed as a set of threads. Each thread is mapped to a single CUDA core on the GPU. The two main pieces of hardware in the CUDA programming model are the CPU and the GPU, each with their own dedicated memory region. The CPU, together with the computer's RAM, is referred to as the host. Similarly, the GPU, together with its own dedicated DRAM, is called the device. The CPU is very good at running serial programs. The problem with CPUs is that if they run into a portion of code that is massively parallel, this part of the program will likely cause a bottleneck in the flow of the program, since CPUs are just not designed for massively parallel throughput. Since GPUs are very good at running the parts of a program that are very parallel, we want to offload these massively parallel portions of the code onto the GPU. Using the host and device together is termed heterogeneous parallel programming. This is where CUDA comes in.
CUDA is a heterogeneous parallel programming language designed specifically for NVIDIA GPUs. CUDA is simply C with a set of extensions that allows the host and device to work together. In the CUDA programming model, the host is in control of the program. For the most part, a CUDA program will run just like a traditional piece of software would on the CPU. Whenever we run into a portion of code that can be massively parallelized, we will pass the execution of that portion of the code to the device. The host and device talk to each other over the PCI Express bus. The PCI Express bus is very slow relative to the host and device, so the cost of exchanging data between the host and device is very expensive. Therefore, the only portions of the code that we want to execute on the device are the parts that are massively parallel. Let's take a closer look at the concept of a CUDA thread. To utilize the massive number of cores on the GPU, we want to execute CUDA kernels as a very large number of threads. The GPU is designed to execute each kernel as thousands or even millions of threads. CUDA threads execute in a single instruction, multiple data fashion, commonly termed SIMD in the computer architecture literature. That is, each thread executes the same instruction, but on a separate piece of data. The exact manner in which threads execute is slightly different from the precise definition of this SIMD model; NVIDIA calls their version of SIMD "SIMT", which is an acronym that stands for single instruction, multiple thread. The exact difference between SIMD and SIMT is not really important here, but these terms are thrown around in the CUDA literature, so just think of them as having the same meaning. Another thing to note is that threads do not all execute at the same rate, even though each thread is performing the same operation.
The set of data being computed on in each thread is generally different, which can give rise to threads from the same kernel launch executing at different rates. In order to organize how threads are mapped to cores on the GPU, CUDA has a hierarchy of threads. Now, this is one of the most important concepts in CUDA's programming model. There are three levels in the hierarchy, which we will discuss now. At the lowest level of the hierarchy, we have individual threads. We've already seen the thread, which is the execution of a kernel on a single piece of data. Each thread gets mapped to one CUDA core on the GPU when the kernel is launched. At the next level of the hierarchy, we have a structure called the block. Sets of threads are grouped into blocks. When the kernel is launched, the blocks get mapped onto corresponding sets of CUDA cores, which will be covered in the following lecture. At the highest level of the CUDA hierarchy, we have grids, which are collections of blocks. The entire kernel launch is executed as one grid, which is mapped onto the entire GPU and its memory. In summary, one kernel launch executes as one grid, which is mapped onto the entire device. To keep terms in order, it may be helpful to think of threads as elements of blocks and blocks as elements of grids. So each grid is composed of a set of blocks, which are organized into either a 1D, 2D, or 3D structure. In this example, the block structure is composed of a three-by-two grid. So there are three blocks in the x dimension and two blocks in the y dimension, giving us a total of six blocks in this grid. Analogously, blocks are composed of threads, which are also organized into either a 1D, 2D, or 3D structure. In this same example, the thread structure is composed of a four-by-three block. That is, there are four threads in the x dimension and three threads in the y dimension, giving us a total of 12 threads per block.
Since this example has six blocks per grid and 12 threads per block, we have a total of 72 threads in the grid for this specific example. Said another way, when we launch the kernel corresponding to this grid, there are a total of 72 threads that will execute on the GPU concurrently. 3. CUDA - Programming Model: Let's now take a look at how a CUDA program is organized in code. The biggest thing to remember is that the CPU is in control. That is, the host controls the entire flow of the program. The flow of the program begins just as any normal C program would, beginning in the main function. Sequential flow of the program occurs as normal until we reach a portion of the code that we want to offload onto the GPU. This is achieved through launching a kernel, which is executed on the device as a grid. Control of the program is immediately returned to the host. The main C function continues from the point directly after the kernel launch. The main function continues to execute any serial code until we launch another kernel. One thing to note here is that the main C function does not wait on the kernel's completion. So if we need to gather the results from a specific kernel launch, we need to create an explicit barrier in the host code to tell the main C function to wait on the kernel's completion before continuing. We will cover this asynchronous host and device execution in more detail soon. Just remember that the host code does not wait on the kernel's completion unless explicitly told to do so. Let's now consider the syntax of a kernel launch. A kernel launch is just like calling any normal C function. The launch begins with the name of the kernel that we're executing, and parameters passed into the kernel are placed inside the parentheses. The only difference between a kernel launch and a function call is that we must specify the kernel's grid and block dimensions, which are passed into the kernel launch inside the triple chevron brackets.
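As a minimal sketch of this launch syntax (the kernel name and arguments here are made up for illustration):

```cuda
// Hypothetical kernel launch: just like a C function call, except for the
// grid and block dimensions inside the triple chevron brackets.
myKernel<<<gridSize, blockSize>>>(d_arg1, d_arg2);
```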
So before we launch a kernel, we need to configure its launch parameters. These configuration parameters define the grid and block dimensions. The values of x, y, and z in the parentheses are integer values. dim3 is a CUDA data structure that is simply a set of integers corresponding to the sizes of the x, y, and z dimensions. In the code shown, grid size and block size are the names of the dim3 variables. Let's take another look at our previous example to demonstrate how to implement this kernel launch syntax. We first configure the grid and block dimensions; in this example, the grid dimension is of size three by two and the block dimension is of size four by three. We then launch the kernel with the specified configuration parameters, as seen in the example. Let's take a closer look at the flow of a CUDA program. One thing to remember is that the host and device have different memory regions. In order to operate on any data inside a kernel, we first need to allocate memory on the device. Next, we need to copy any relevant data into the device's memory region that we previously allocated. This copying of data between the host and device is one of the most important and limiting aspects that drives the flow of a CUDA program. We then launch the kernel with a specific grid and block dimension determined by the configuration parameters. When we need to retrieve the results obtained from a kernel's execution, we have to copy the data back from the device to the host. We'll discuss these memory management concepts in greater detail in the following two slides. Allocating device memory is analogous to allocating memory in C. Recall that in C, the function to allocate memory is malloc, and the function to deallocate memory is free. Analogously, to allocate memory on the device, we use the function cudaMalloc. cudaMalloc takes two arguments. The first argument is the memory location that we want to copy data to on the device.
And the second argument is the size of the memory region that we want to allocate. To deallocate memory on the device, we simply use the function cudaFree, similar to C's function free. To transfer data between the host and device, we use the function cudaMemcpy. cudaMemcpy takes four arguments. The first argument is a pointer to an address of the memory that we are copying into. The second argument is a pointer to an address of the memory that we are copying from. The third argument is the size of the data that we are transferring, in units of bytes. Finally, the fourth argument is the direction in which we are transferring the data. If we are transferring data from the host to the device, we set the direction to cudaMemcpyHostToDevice. If we are transferring the data from the device to the host, we set the direction to cudaMemcpyDeviceToHost. Let's now look at an example CUDA program that will tie together all of the concepts we've discussed up to this point. We begin the program by stepping into the main function. Next, we're going to define two variables that are pointers to ints. Now, since the host and device have separate memory regions, dereferencing a device pointer on the host would cause the program to crash. In order to differentiate between host and device variables, we are going to follow a naming convention that consists of preceding the name of any variable that lives on the host with an h, and preceding the name of any variable that lives on the device with a d. Next, we allocate device memory using the function cudaMalloc while passing in its two parameters. The first parameter is a pointer to the address of the memory that we are allocating on the device. The second parameter is simply the size of the memory region that we are allocating. In a real program, there would already be data stored in the host variable, so let's just assume that h_c is already initialized with data in this example.
We then use cudaMemcpy to copy this data from the host memory into the device's memory. Next, we set up the configuration parameters to specify the grid and block dimensions of the kernel launch. In this example, the grid and block dimensions are both of size one by one by one. The kernel is then launched with its configuration parameters inside of the triple chevron brackets. We pass into the kernel any arguments inside the kernel's parentheses. Note that the kernel launch in this example is executed as a single block containing a single thread. We then copy the results from the kernel launch back from the device onto the host using cudaMemcpy. And finally, we deallocate any memory that we previously allocated. The return 0 value concludes the main function. 4. CUDA - Parallelizing a For Loop: Let's now consider the syntax of a kernel definition. Defining a kernel is very similar to defining a normal C function. The main difference is the __global__ keyword preceding the void return type. Note that there are two underscores before and after the word global. This __global__ keyword is called the kernel's declaration specifier. The __global__ declaration specifier is a directive that tells the compiler that this function is supposed to be called from the host and run on the device. Kernels that are called from the host never return a value, so the return type must be set to void. Any results that the kernel computes are stored in the device's memory, so in order to manipulate data, we have to use the pass-by-reference technique. To perform pass by reference, we pass in a pointer argument that points to the memory address of the variable stored in the device's memory. So when a kernel is launched, the operations inside the kernel's body are performed by every thread that executes this kernel. In this example, every thread will simply copy the 0th element of the array d_in into the 0th element of the array d_out.
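A minimal sketch of the kernel definition just described (the kernel name is made up; d_in and d_out follow the naming convention above):

```cuda
// __global__ marks a function that is called from the host and
// executed on the device. Kernels must return void; results are
// written into device memory through pointer arguments.
__global__ void copyFirstElement(int *d_in, int *d_out)
{
    // Every thread that executes this kernel performs the same copy.
    d_out[0] = d_in[0];
}
```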
Clearly, this is not a very useful kernel. For us to index into a different element of an array for each instance of a thread, we need a method of distinguishing the threads from one another. This is achieved through determining each thread's index. In practice, we always want to launch a kernel as a large number of threads, so the threads need a way of keeping track of their position within their corresponding block. Threads are aware of their position within a block by using CUDA's intrinsic threadIdx. threadIdx has corresponding x, y, and z components, which can be accessed using the dot notation .x, .y, and .z. In this example on the screen, we're going to launch a kernel as a grid with a single block. Within that block are n threads in the x dimension. So for this example, the threadIdx values corresponding to this block span the range from threadIdx.x equals 0 to threadIdx.x equals n minus 1. Let's now take a look at an example that will demonstrate how to use this threadIdx intrinsic. In this example, we're going to convert a for loop into a set of parallel threads. Let's first take a look at the for loop in normal C code that would run on the CPU. We begin the program in the main function just as normal. We then declare a length-n array and initialize the array with data. Next, we pass into a function the array and its length n. Inside the body of the function, we simply increment the value of each element in the array by one, which is achieved with a for loop running from 0 to n minus 1. Since all iterations through this for loop are independent of each other, this problem can be easily decomposed into a parallel problem using n parallel CUDA threads. To perform the same operation using n parallel threads in a CUDA kernel, we begin the main function as in the CPU program, declaring and initializing an array of length n. We'll call this array h_a because its data will live on the host.
Next, we allocate device memory of the same size as the host's array. Then we copy the data from the array in host memory into the device's memory. Next, we set up the configuration parameters for the grid and block dimensions. In this example, we're going to execute a grid with one block, and inside that block there are going to be n threads. Next, we launch the kernel and pass in as arguments the array d_a and its length n. Now, in the body of the kernel, we first grab the index of the specific thread being executed from the threadIdx.x intrinsic, and we store that value in the variable i. Then we replace the for loop with a single increment operation, where each thread performs an increment on a different element of the array. As a safety precaution, we will perform the increment operation inside an if statement that is bounded by the length of the array. This if statement ensures that we don't step off the end of the array. In this example, since we're launching the kernel with only n threads inside a single block, we actually already know that we will not access any elements beyond the end of the array. However, it's good practice to implement this if statement, because a real CUDA program will usually execute more threads than there are elements in a specific array. The following code is not actually shown in order to display the code on a single screen, but after the kernel launch, we would of course copy the results back into the host's memory from the device memory using cudaMemcpy, and then we would free any memory we allocated. To finish up this lecture, let's take a look at a more practical yet still very simple example. Let's say that we want to add two length-n floating point vectors. Since the addition of vectors is a simple element-wise operation, each of these additions is independent of all other additions.
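The for-loop conversion walked through above might be sketched like this (the function names are made up for illustration):

```cuda
// CPU version: increment every element with a serial for loop.
void incrementCPU(int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + 1;
}

// GPU version: one thread per element replaces the loop body.
__global__ void incrementGPU(int *d_a, int n)
{
    int i = threadIdx.x;   // this thread's index within its block
    if (i < n)             // safety: don't step off the end of the array
        d_a[i] = d_a[i] + 1;
}

// Launched as a single block of n threads:
//   incrementGPU<<<1, n>>>(d_a, n);
```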
So vector addition can be achieved by adding each pair of corresponding elements of the two arrays together in a separate thread. The implementation of a kernel that will perform this vector addition will be left as a programming exercise. 5. CUDA - Indexing Threads within Grids and Blocks: So far we've seen how to get a thread's index within a block using threadIdx, which is sufficient for determining the thread's location within a single block, but that index is not unique within the entire grid. For example, let's assume that we launch a kernel with a grid that has two blocks in the x dimension, and within each one of those blocks we have four threads in their x dimension. threadIdx.x will span the range from 0 to 3 inside the first block, and threadIdx.x will also have the same values within the second block. In order to determine a thread's unique index within an entire grid, we need to introduce a few more indexing variables that CUDA offers us. There are three more sets of indexing variables that act similarly to threadIdx. Just like you can retrieve the index of a thread within a block using threadIdx, you can retrieve the index of a block within a grid using blockIdx. Also, using the keywords gridDim and blockDim, you can retrieve the dimensions of the grid and the block, which are the values that were set in the configuration parameters before the launch of the kernel. Note that each one of these built-in variables has a corresponding x, y, and z component, which is accessible by using the dot notation .x, .y, and .z. Let's look again at the previous example, where we have launched a kernel as a grid with two blocks in the x dimension and four threads within each block's x dimension. Combining the keywords from the previous slide together will give us the ability to determine the thread's unique index within this grid.
This combination of keywords is composed of adding the product of the block's index and the block's dimension to the thread's index in that same dimension. Almost every CUDA kernel will require determining a thread's unique index within a grid, so this line of code is placed in virtually every kernel definition. This line of code looks slightly confusing at first, but it's actually pretty straightforward. Let's continue analyzing our previous example of a grid with two blocks in the x dimension and four threads within each block to see what this line of code is actually doing. This table decomposes the code into one column displaying the product of each block's index and the block dimension, and another column displaying the thread's index. threadIdx.x simply counts from 0 to 3 within the first block, and again counts from 0 to 3 in the second block. The block's dimension doesn't change throughout the kernel's execution; it is four in this example, since there are four threads in the x dimension of each block. Since C is zero-based, the first block's index is 0, so the second column does not contribute anything to the line of code for the first four threads. The second block's x dimension has an index of one, so the product of the block's index and the dimension of the block will offset the thread index by a value of four for the following four threads. So now, looking at the grid in the graphic illustrating this example from left to right, the thread's unique index within this grid spans the range from 0 to 7, covering all eight threads within this grid. Let's now take a look at a few examples of how these indexing variables work inside of a kernel. Let's assume we launch a kernel with a grid size of three blocks in the x dimension and a block size of four threads within each block's x dimension. We're also going to pass an array named a into the kernel.
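The indexing line discussed above, written out for the two-block, four-threads-per-block case:

```cuda
// Unique index of a thread within the whole grid (x dimension only).
// blockIdx.x * blockDim.x offsets past all threads in earlier blocks.
int i = blockIdx.x * blockDim.x + threadIdx.x;

// With gridDim.x = 2 and blockDim.x = 4:
//   block 0: i = 0*4 + {0,1,2,3}  ->  0, 1, 2, 3
//   block 1: i = 1*4 + {0,1,2,3}  ->  4, 5, 6, 7
```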
Since this kernel is launched with three blocks and four threads within each block, there will be a total of 12 threads that this kernel executes. In the first example, we simply grab the value of the block's x dimension and copy that value into each element of the array a. So every element in the array will be assigned the value four. In the second example, we're going to copy the value of the current thread index into each element of the array. This example illustrates how the thread index, threadIdx, is unique only within its corresponding block. The result of this kernel will be 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3 stored in the array. In the next example, we will simply copy the value of the current block index into each element of the array. The result will be four zeros, four ones, and four twos stored in the array. And in the final example, we're going to grab the unique thread index within the grid and copy that value into each element of the array, which will result in the values from 0 to 11 stored contiguously in the array. 6. CUDA Memory Model: CUDA introduces the concept of a partitioned memory space to expose to the programmer the different levels of parallelism and performance that the GPU offers. This lecture is about this memory model, which is one of the major differences between a CUDA program and a normal C program, at least from a programmer's perspective. Now, the goal of all of these short lectures so far has been to introduce the CUDA programming model from a very high level and not really focus on very much syntax, or actually how to use any of these high-level concepts, since real programming can only be learned through actually writing code. But in this lecture, we're going to take an even higher-level view and try not to view more than maybe one or two slides of code, since this abstract notion of the memory model is such a fundamental concept in CUDA.
So far we've already discussed the first fundamental concept of the CUDA programming model, which is the thread hierarchy. The second fundamental concept of CUDA is its memory model, which has a direct one-to-one correspondence with the thread hierarchy. Just like there are three levels in the thread hierarchy, there are also three levels to this main memory model. At the lowest level, we have a memory space termed local memory, which corresponds to individual threads. Each thread has its own private local memory that cannot be accessed by any other thread. When a thread has completed its execution, any local memory related to that thread is automatically destroyed. Threads also have private registers that have the same scope and lifetime as local memory, but have drastically different performance characteristics, which we'll discuss soon. At the next level up the hierarchy, we have shared memory, which corresponds to blocks in the thread hierarchy. Each block has its own region of shared memory that is visible and accessible to all the threads within that block. When a block has completed its execution, the contents of its shared memory are automatically destroyed. Now, at the highest level of the hierarchy, we have global memory, which corresponds to all of the grids in the entire program. The contents of global memory are visible to every thread in the entire program. The lifetime of data stored in global memory lasts the duration of the entire program, or it can be manually destroyed using the cudaFree function in the host code. Let's take a look at how the memory is physically laid out on the device. As discussed in the first lecture, we have the host on the left, which is composed of the CPU and the computer's RAM, and on the right we have the device, which is composed of the GPU and its corresponding DRAM. The device's DRAM is where the global and local memory spaces are physically located.
One thing that's very important to note here is that the name local is not referring to its physical location; the term local refers to the scope and lifetime of this memory space. As discussed previously, each one of the green squares in the lower image of the GPU represents a physical CUDA core. These CUDA cores are grouped together into what are called streaming multiprocessors, or SMs for short. The SMs are represented in the bottom image as the yellow rectangles grouping together sets of CUDA cores. Understanding how blocks are mapped onto SMs is fundamental during the design of kernels to gain optimal computational performance from the GPU, so we'll discuss SMs in more detail in a following lecture. Shared memory and registers are termed on-chip device memory, since they're physically located on the GPU's streaming multiprocessors. The global and local memory regions are termed off-chip device memory, since they're not physically located on the GPU itself. Taking a closer look at the physical layout of an NVIDIA GPU might help to clarify why the DRAM is termed off-chip memory. The circuit board in this image is a GeForce Titan with the protective cover and heat sink taken off. The chip in the center is the actual GPU. The region surrounded by the blue lines is where the device's DRAM is physically located. As we'll see in a future lecture, we usually want to move frequently accessed data to the fastest memory available to us. This slide illustrates the relative speeds of the different memory spaces. We will crudely refer to the memory speed in this discussion as the combination of the approximate bandwidth and latency of each memory space. The different memory regions are designed to be used for different purposes, but because of their physical design, each memory space has drastically different memory bandwidths and latencies. Global memory is the region that we're most familiar with so far.
Anytime we allocate memory on the device by using cudaMalloc, this is the region where that allocation takes place. Since global memory is stored in off-chip DRAM, it exhibits very slow memory characteristics relative to the other memory spaces. The use of global memory can't be avoided, since we must transfer data between the host and device using this memory space. However, the goal is to minimize global memory traffic, since it is so slow. The upside to global memory is that it's very large, comparable to a computer's RAM, which is usually about eight or 16 gigabytes. The Titan and Tesla K40 both have six gigabytes of global DRAM. Now as we've discussed already, each thread has its own private local memory and registers that cannot be accessed by any other thread. Any variable that is declared inside of a kernel is stored in a register. If the size of the contents stored in this variable becomes too large to fit into the register space, then the contents will spill over into local memory. Register spilling into local memory is very undesirable, because although registers are the fastest form of memory in CUDA, local memory is one of the slowest regions. So the goal is simply to attempt not to place more information in local variables than can be stored in the register space. The register space is a scarce hardware resource, and we'll discuss how to better utilize it when we look at the GPU architecture more closely in a future lecture. Now, shared memory is an extremely special memory space, and its use is critical to computational performance and correctness. Shared memory is special for two primary reasons. The first reason is that it's extremely fast, since it's implemented on chip. The second reason shared memory is special is because it allows threads within a block to talk to each other. In a sense, shared memory can sort of be thought of as a user-defined L1 cache.
And as we'll see in a future lecture, shared memory and L1 cache actually have an intimate relationship. Now as promised, there's only going to be one slide in this entire lecture that covers any code whatsoever, and it's this slide here. And it's very little code actually. So let's go ahead and take a quick look at how we declare a variable in shared memory. To create a shared variable, you simply annotate the variable declaration with the keyword __shared__, and this is done inside the kernel's body. Note the double underscores before and after the keyword shared; this is similar to whenever we use the keyword __global__. Now as we know, this kernel will very likely be executed by thousands or even millions of threads. So the first thread that performs the actions described in the body of this kernel will see the __shared__ keyword preceding the variable's declaration, and this will alert the compiler to store this variable in shared memory. All subsequent threads that execute this kernel will simply overlook the line of code that declares the shared variable. In this very simple example, a pointer to an array named in, which happens to be stored in global memory, is passed into the kernel as an argument. In each thread that executes this kernel, a single element of the global device variable in is copied into a corresponding element of the array whose contents are stored in shared memory. Now there's one more portion of the memory model that we need to discuss that doesn't really have a corresponding level of the thread hierarchy, and it's called the constant memory space. The idea of constant memory is that since GPUs don't have a very big cache, as we'll see in a future lecture, we can implement a very simple kind of cache using constant memory. Constant memory is very large, since it's actually located in the device's DRAM. All threads have access to it, but it is read-only memory.
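A minimal sketch of the declarations just described might look like this. The kernel name, array length, and variable names are my own illustrative choices, not from the original slide; the `__shared__` and `__constant__` qualifiers themselves are standard CUDA:

```cuda
#define N 256  // illustrative block size

// Constant memory lives at file scope: read-only inside kernels,
// filled from the host with cudaMemcpyToSymbol.
__constant__ float coeffs[16];

// Hypothetical kernel: each thread copies one element of the global
// array "in" into a block-local shared array.
__global__ void copyToShared(const float *in)
{
    // One shared array per block, visible to every thread in the block.
    __shared__ float tile[N];

    int i = threadIdx.x;
    tile[i] = in[i];  // each thread copies its own element

    // ...a kernel would typically call __syncthreads() here
    // before any thread reads from tile...
}
```

Note that `tile` is declared once in the kernel body, but only one copy exists per block, exactly as the lecture describes.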
We want to use this memory space for data that is accessed frequently but does not change its contents throughout a kernel's execution. Although constant memory is implemented in hardware in off-chip DRAM, its contents are actually aggressively cached into on-chip memory. Therefore, using constant memory can substantially reduce global memory traffic throughout a kernel's execution. To summarize this lecture, let's take a real quick look at how all of these memory spaces appear to the programmer. At the lowest level, we have registers and local memory. Any normal variable declared inside of a kernel is stored in a register. If the contents are too large to fit inside the register file allocated to that thread, then the contents will spill over into local memory, which is undesirable, as we discussed, because local memory is located in off-chip DRAM. Then we have the extremely special memory space, shared memory. As we discussed, shared memory is very special for two reasons. The first is that it's extremely fast, because it's physically located on chip. The second reason shared memory is so special is because it allows threads within a block to communicate with each other. Then we have the special memory space called constant. In the constant memory space, we want to store any variables whose contents do not change throughout the kernel's execution. Although constant memory is physically located in off-chip DRAM, its contents are aggressively cached into on-chip memory, therefore reducing global memory traffic. And then finally, we have the familiar global memory space, which we've been using all along whenever we copy data back and forth between the host and the device. And that summarizes all of the basic memory spaces of the CUDA memory model. 7. CUDA Synchronization: Since threads are allowed to run in parallel, we can easily run into race conditions.
The most common form of race condition occurs when a thread reads from a memory location before the correct data has been stored into that memory location, or vice versa, when a thread writes its result to a specific memory location before another thread has read the data from that location. In order to avoid these race conditions, we need a way to allow our threads to synchronize and, in a sense, force portions of the device code to run in serial. To force threads within a block to synchronize with each other, we can implement a barrier. A barrier is a point in a program where all of the threads executing a kernel within a block will halt their execution when they reach this barrier point. For illustrative purposes, let's assume we have four threads within the block executing in parallel. At a point in the kernel where we need the threads to synchronize, we can implement a barrier, which will cause each thread within this block to pause at this point until all of the threads within this block have reached the barrier. Once all the threads within a specific block have reached this barrier point, we say that the threads have been synchronized, and the threads within this block are then allowed to continue their execution. We can explicitly implement a barrier by using the __syncthreads function. So let's take a look at an example of how to use thread synchronization. Let's assume that we want to shift the contents of an array to the left by one element. For this example, let's assume this array, named a, is stored in global memory and is of length 4. Now we can implement this process in parallel using a CUDA kernel, where a pointer that points to the memory address of the first element of the array in global memory is passed into the kernel as an argument. Now for each thread that executes this kernel, we will first grab the thread's unique index within its block.
And we will store that value in a local variable named i, which will actually be stored in a register. Then we can perform the operation to shift each element in the array to the left by one. Just to be safe and ensure that we don't step off the end of the array, we can implement this shifting operation inside of an if statement. Now this code won't actually work as we intend it to, because this shifting operation is actually a read and a write operation. So what we need to do is ensure that all of the elements have been read from before they are written to in this global array variable a. We can easily solve this problem by creating a variable named temp that will be stored in a register. The contents of the i plus one element of a are copied into temp. Now since threads execute in parallel, we want to implement a barrier at this point to ensure that all of the threads executing this kernel will have read their data from the i plus one element of a before any other thread writes into this element. Now that we are certain all of the data from the entire array has been stored into either a register (or has spilled over into local memory, if the array were very large), we can write the data into the global array variable a. So we will grab the contents stored in the temp variable and write those contents into the ith element of the global array variable a. Just to be safe and ensure that all of the writes have occurred before any other threads read from the array, we can implement another barrier synchronization. In addition to the explicit barriers used for thread synchronization we've discussed so far in this lecture, there is another major form of barrier, which is implemented implicitly between kernel launches. Before we discuss this, let's actually take a step back and look at the asynchronous behavior of the interaction between the host and device code.
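The shift-left example just walked through can be sketched as follows. The kernel name and the host-side launch code are my own illustrative additions; the barrier placement follows the description above:

```cuda
#define N 4  // length of the array, as in the example

// Shift each element of a one position to the left, in parallel.
__global__ void shiftLeft(int *a)
{
    int i = threadIdx.x;  // this thread's unique index within the block
    int temp = 0;

    // Read phase: every thread reads a[i + 1] before anyone writes.
    if (i < N - 1)
        temp = a[i + 1];
    __syncthreads();  // barrier: all reads complete before any write

    // Write phase.
    if (i < N - 1)
        a[i] = temp;
    __syncthreads();  // barrier: all writes complete before further reads
}

int main()
{
    int host[N] = {10, 20, 30, 40};
    int *dev;

    cudaMalloc(&dev, N * sizeof(int));
    cudaMemcpy(dev, host, N * sizeof(int), cudaMemcpyHostToDevice);

    shiftLeft<<<1, N>>>(dev);       // one block of N threads
    cudaDeviceSynchronize();        // make the host wait for the kernel

    cudaMemcpy(host, dev, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}
```

Note that `__syncthreads()` sits outside the if statements, so every thread in the block reaches it; placing a barrier inside divergent branches would be an error.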
Now, as we discussed in a previous lecture, the host does not wait on the completion of a kernel to continue its flow. In order to force the host code to wait on a kernel's completion, we can use the function cudaDeviceSynchronize. When we call cudaDeviceSynchronize, the host will pause at this point until the previous kernel has completed its execution. So as we've seen, the host and device operate asynchronously unless the host code is explicitly told to wait on the device. However, consecutive kernel launches do execute synchronously with respect to each other, which can be thought of as an implicit barrier between kernel launches. For example, if we launch two kernels consecutively, we are guaranteed that the grid from the second kernel launch will not be scheduled to execute on the device until the first kernel has completed its execution. As described in the CUDA programming guide, there are three key abstractions at the core of CUDA: a hierarchy of thread groups, a corresponding memory hierarchy, and finally, barrier synchronization primitives, as we've discussed in this lecture. So this lecture concludes the absolute basics of CUDA. In subsequent lectures, we will learn how to optimize CUDA programs to fully utilize all of the resources that the GPU offers us. 8. CUDA Install (Windows): Download the toolkit from the CUDA download page on the NVIDIA website. I'm going to grab the package corresponding to my operating system, which is 64-bit Windows. Follow the prompts and extract and install the package. A standard toolkit installation provides GPU-accelerated libraries, a compiler, development tools, and sample code. Now with the CUDA toolkit, you have everything you need to create, compile, optimize, and deploy your applications onto NVIDIA GPUs. Once the installation completes, we're ready to run an example.
Navigate to the CUDA samples folder, where you'll find code samples illustrating key CUDA concepts, as well as other samples focused on areas such as image processing, financial computing, particle physics, and fluid dynamics, among many others. We'll use the matrix multiplication example, so let's navigate to that folder. I have Visual Studio 2015 on my machine, and we'll open the corresponding solution file. Now the CUDA toolkit installer automatically installed the Nsight plug-in during installation, which I'm going to use here. Let's build this project and run the generated executable. The application runs successfully and prints its status to the prompt. And that's it. You're now ready to accelerate your applications on GPUs using CUDA. 9. CUDA Install (Linux): I'm going to show you an example of how to install the CUDA toolkit and driver using the Debian infrastructure on an Ubuntu machine. Once the file has finished downloading, we can install it with dpkg. Once installed, we need to tell apt to update its package list. We do this with apt-get update. From there, we can see a list of available packages with the following command. This is a list of all packages available in the Ubuntu repository, which is handy in case you want to install a specific package but don't know what it's called. You can install a package with the apt-get install command. For example, you could install just the cuda-core package, which contains the core binaries as well as other core functionality, with this command. But that's not our goal for this video, so I'll just cancel doing that. There are two special meta packages here, cuda and cuda-cross. The cuda meta package will install the driver, the native toolkit, and the samples; this package is intended for native development. The cuda-cross meta package will install the native toolkit, the cross-platform toolkit, and the samples, but not the driver. It is intended for cross-platform development.
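The Debian-based steps described so far can be summarized as follows. The .deb filename is a placeholder for whatever repository package you downloaded, since the exact name depends on your distribution and CUDA version:

```bash
# Install the downloaded repository package (filename is a placeholder)
sudo dpkg -i cuda-repo-<distro>-<version>.deb

# Refresh apt's package list so it can see the new repository
sudo apt-get update

# Install everything needed for native development:
# driver, toolkit, and samples (the "cuda" meta package)
sudo apt-get install cuda
```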
Let's go ahead and install the package for native development. We're going to do this with sudo apt-get install cuda. Before I hit Enter, there's a caveat I need to mention, and that is that the Debian driver package is not compatible with the runfile installer. If you've installed the driver using the runfile installer previously, you can uninstall it using the nvidia-uninstall binary located in /usr/bin, which I've already done. Now let's proceed. Here you can see the list of packages that are going to be installed. The packages that begin with cuda are the toolkit and samples, and the rest are for the driver. At this point, the CUDA toolkit, samples, and driver have been installed. If you look under the /usr/local/cuda-5.5 directory, you can see the toolkit, and under there the samples directory, where the samples are installed. These are the pristine samples, and I encourage you not to modify them directly. You can copy these samples using the convenient script located in the bin directory of the toolkit, called cuda-install-samples-5.5.sh. I'm going to edit my path to point to this bin directory so I can call the script from anywhere. Say I wanted to install the samples into my Downloads folder: I just call the script and give it my current path as the destination. And here you can see my own writable copy of the NVIDIA CUDA 5.5 samples. At this point, we should reboot our computer to ensure the NVIDIA kernel module is loaded properly. Now that we've rebooted, we can see that the NVIDIA kernel module has loaded successfully using modinfo. We can also see that the version is correct; in this case it's 319.21. The RPM and Debian packages also support side-by-side installation. If you look under /usr/local, you can see two CUDA directories: the first is a symbolic link, which points to the current CUDA directory, 5.5, and then the actual directory itself.
When a newer version of CUDA comes out and you install it, the symlink will be updated to point to the latest version. The older version of CUDA will remain on the file system right beside the new one. Finally, I will show you how to update your CUDA and NVIDIA driver packages. On Ubuntu, you can use this command, which will update all the specific CUDA and NVIDIA packages. However, since we already have the latest version of CUDA, there's nothing to update right now. This information and more, including how to get CUDA set up on other distributions, can be found inside the CUDA Getting Started Guide for Linux, available on docs dot... 10. CUDA First Program: I'm using Visual Studio with the Nsight plugin, which was installed with the CUDA toolkit. To start, we create a new project, which you can do by selecting the CUDA template from the Visual Studio template library. By default, a CUDA project contains a kernel file with sample code. We won't be using the sample code, so let's replace it with CPU-only code that adds two vectors. Let's look through this code now. The main function allocates three variables: a, b, and c. The variables are initialized in a for loop, and after initialization, the vector add function is called. In the vector add function, we add the elements of vector a to the elements of vector b and assign the result to vector c within a for loop. Back in the main function, the resulting vector c is printed and the memory is freed. Now let's compile and run this example. It prints out the first few elements of vector c. Note that even though this is CPU-only code, it can be run in a CUDA project. Now let's parallelize this function to execute on the GPU. This has three steps. First, make the variables accessible to the GPU as well as the CPU. Second, specify the launch configuration of the kernel function. Third, parallelize the kernel function to run on the GPU. So first, let's make sure that our variables are accessible to the GPU.
To do this, I allocate my variables using the function cudaMallocManaged. This command places our variables in a memory space called unified memory, which is available both to the GPUs and CPUs in the system. cudaMallocManaged returns a pointer that you can then access from host and device code. To free data in unified memory, pass the pointer to the function cudaFree. Next, let's update the function call to vector add to specify the launch configuration of the kernel. To do this, we will use triple angle brackets. The first parameter in the triple angle brackets is the number of thread blocks; in our case, for simplicity, we only use one block. The next parameter is the number of threads within each thread block. In our case, we will set that number to the number of elements in the vector, which is 1024. This way, each instance of the kernel will launch in an independent thread. To ensure that the CPU waits for the kernel to complete before continuing, use the call cudaDeviceSynchronize after the kernel launches. Finally, let's parallelize the vector add function. So add the keyword __global__ to the function definition. This tells the compiler that this function will be executed on the GPU and is callable from the host. This function will be run by many threads in parallel, so to identify its thread, each one can use a read-only variable called threadIdx, which is unique within each thread block. And since each element of the vector is now being independently processed, we do not need to loop through using the for loop anymore. However, we should use an if check to limit operations to the number of elements in the array. This way, even if we were to launch more threads than there are elements in the array, we should not have a problem. Now build and run the application, and we can see that the result is identical to the one from the CPU-only code. And that's it.
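Putting those three steps together, a minimal sketch of the program might look like this. The kernel name, variable names, and initialization values are my own illustrative choices; the structure follows the walkthrough above:

```cuda
#include <cstdio>

#define N 1024  // number of elements, as in the example

// Step 3: the parallelized kernel. Each thread adds one element.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = threadIdx.x;  // unique within the (single) block
    if (i < n)            // guard against launching extra threads
        c[i] = a[i] + b[i];
}

int main()
{
    float *a, *b, *c;

    // Step 1: unified memory, accessible from both CPU and GPU.
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));

    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Step 2: launch configuration - one block of N threads.
    vectorAdd<<<1, N>>>(a, b, c, N);
    cudaDeviceSynchronize();  // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```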
We've now written a function in the C programming language that can execute on the GPU using CUDA. 11. CUDA Vector Addition: In C++. So let's start with the SIMT model. Consider the classical execution first: if we just think about writing a C, C++, or Python program for vector addition, we want to add two vectors together. The way we normally do it is just successive additions. So we'd have, say, a for loop, and we'd add a[0] and b[0] and store the result in c[0], and we'd do that, in this case, four times. So we'd have four instructions that each calculate one addition. Now in the SIMT model — SIMT stands for single instruction, multiple threads — we actually get this naturally parallel vector addition. A single instruction says: I want to do a vector-based operation, in this case a vector addition of two arrays, and the output will also be an array. So it's a natural way of mapping very parallel data in a very clean way, and this is the model for GPUs. GPUs start out, like anything, with a thread. That's the lowest granularity of execution; threads are what execute instructions. But threads are actually composed into these things called warps. Remember, this is SIMT, so warps actually execute instructions in lockstep, meaning that at any given moment, all of these threads may be executing the same instruction together. Sometimes, depending on the way you size your problem, some of these threads could be masked off, meaning that they're not executing any instructions, but in general, they're all executing the same instruction at any given time. Moving on, we have these things called thread blocks, and this is what we actually program. All of these warps get collected into thread blocks. So you as the programmer say, I want 256 threads — that's all you need to say — and it will automatically translate that into individual warps. And thread blocks get assigned to single shader cores.
So these are the cores in the GPU, and thread blocks are three-dimensional: you can have threads in the x-direction, the y-direction, and the z-direction. This just kind of helps with mapping a problem to the actual threads themselves. And thread blocks themselves are composed into grids. Your grid represents how your entire problem gets mapped to the GPU, and this is something that we assign in our launch parameters. Our launch parameters are our grid size — how many thread blocks there are in each dimension, so how many thread blocks in the x, y, and z dimensions — and the other launch parameter is our thread block size, so how many threads are within each thread block. Our baseline GPU architecture generally looks something like this. This is from a Pascal GP100 GPU. We can see we've got this GigaThread engine up here, and then within each of these SMs, that's where we have our warp scheduler. We said that the warp is the lowest schedulable entity. So up here, we schedule our thread blocks to each of these SMs, and then within the SMs themselves, we have a warp scheduler that actually starts execution inside of each of these cores. And then within the cores themselves, we have things like texture caches and L1 caches, we have this shared memory, sometimes called scratchpad memory, and then we have, of course, all our ALUs, double-precision units, load/store units, and special function units. Okay, so that's going to do it for the slides part of it. Let's actually go through some code. Here we are in Visual Studio, and we'll go ahead and run through the specific GPU parts of the code. We'll start out by saying we're going to do vector addition: we're going to add two vectors together. The addition of every single element is completely independent — it doesn't depend on any other elements in those vectors. So it's very parallel.
Sometimes we call this an embarrassingly parallel task. So first we allocate size for our vectors, then we have to allocate memory for our device vectors — that is, for the GPU. This is because the GPU has its own physical memory, not connected to the CPU's physical memory. So when we call cudaMalloc, it's saying: I need to allocate memory on the GPU. Now, there is something called unified memory, where we can have just one set of memory that gets migrated between the GPU and the CPU and vice versa, but we'll leave that to another example. Then we'll just initialize the vectors with some random values; this is a function that I've written up top. And then, remember, it's different physical memory, so we need to copy our data from the CPU memory to the GPU memory. We do this with cudaMemcpy. Remember, when we called cudaMalloc, it filled our device pointer — this d_a, which is just an integer pointer — with an address into GPU memory. The cudaMemcpy call says: I need you to put the contents of h_a, our host vector, of size bytes, into GPU memory. And I specify which direction the transfer is — whether it's from the CPU to the GPU, from the GPU to the CPU, or even from one GPU to another GPU — with this last parameter, which is cudaMemcpyHostToDevice, saying go from the host, which is the CPU, to the device, which is the GPU. Now you may wonder how we get the thread block size and grid size — the number of threads I should put in each thread block, and the grid size for my problem. A lot of this comes back to tuning for your specific architecture, so we'll just do one that's convenient and fairly simple to program. It's important to know that there are maximum numbers of threads per thread block and maximum grid sizes, but those are things that you can easily look up in the white paper for your specific architecture.
So in this case, we'll just say I want 256 threads per thread block. It's generally good to make this a multiple of 32, because, remember, these get translated into warps, which are of size 32. Then we have our grid size. In this case, what we want is a single thread calculating each element of the vector addition. To get that, we'll divide the total number of elements by the number of threads per thread block. So we're going to launch num_blocks thread blocks, each of size num_threads. If you were to multiply num_threads times num_blocks, you would get the number of elements — so basically we're allocating as many threads as we have elements in this case. Now, I said these things were three-dimensional. In this case, if you just give it a single number, like an integer, it will treat that as one-dimensional — just the x dimension — which is really what we want for something as simple as vector addition. Now we call vector addition, our kernel (we'll go over the kernel code in a little bit), with this triple chevron notation — three less-than signs — then our number of blocks, or grid size, and our number of threads per block. There are two more parameters that we can leave at their defaults for now; we don't need to specify them, and we'll get to them in a later video. And then we pass, like any function, the parameters: in this case our three device memory pointers, and then the size of our vectors. After this, we should assume that we've calculated the vector addition — the sum of vector a and vector b, stored in vector c. So we copy vector c back into our host-allocated vector, and we do cudaMemcpy with cudaMemcpyDeviceToHost this time — from the device, the GPU, to the host, the CPU. We do an error check just to make sure we did it correctly, and then we just print "completed successfully". So let's actually look at the GPU code now.
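The host-side setup just walked through can be sketched like this. Names such as h_a and d_a are my own labels for the host and device pointers, and I use the standard ceiling-division idiom for the grid size so it also covers sizes that are not an exact multiple of the block size:

```cuda
#include <cstdlib>

int main()
{
    const int N = 1 << 16;               // 2^16 elements, as in the video
    const size_t bytes = N * sizeof(int);

    // Host vectors
    int *h_a = (int *)malloc(bytes);
    int *h_b = (int *)malloc(bytes);
    int *h_c = (int *)malloc(bytes);
    for (int i = 0; i < N; i++) { h_a[i] = rand() % 100; h_b[i] = rand() % 100; }

    // Device vectors: separate physical memory on the GPU
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy inputs host -> device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch configuration: 256 threads per block, enough blocks to cover N
    int numThreads = 256;
    int numBlocks = (N + numThreads - 1) / numThreads;  // ceiling division
    // vectorAdd<<<numBlocks, numThreads>>>(d_a, d_b, d_c, N);  // kernel covered next

    // Copy the result device -> host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```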
So we've gone through basically the entire setup, but what does the actual GPU code look like? GPU code has this __global__ keyword that comes before it, which says: I'm a CUDA kernel, I'm going to be executed on the GPU. And then it has parameters, just like any other function. As we said, we're having one thread per element that gets added: thread 0 does a[0] plus b[0] equals c[0], thread 1 does a[1] plus b[1] equals c[1], and so on. So the first thing each thread has to do is figure out, out of all of our threads, who am I? This code gets executed by every thread — all 2^16 threads that we're launching — so each one needs to figure out who it is. The way we do this is with blockIdx.x times blockDim.x plus threadIdx.x. blockIdx.x says which block am I in: for the 0th block, blockIdx.x will be 0. blockDim.x is our block size that we specified earlier, which was 256. And threadIdx.x says which thread am I within that block. All of these end with .x because, like I said, these are one-dimensional, since we only gave the launch an integer — it assumes we have nothing in the y and z directions. So blockIdx tells us which block out of our big line of blocks we're in, and we multiply that by the dimension of each block. So if I'm in the second block and the size of each block is 256, then the first thread of the second block will start at index 256, and my offset within that block will be the thread ID. threadIdx starts at 0 for every thread within a block, and the block numbering also starts at 0. So we just use this to calculate which thread in the system we are. Then we do a quick boundary check, just to make sure that this index is valid and we're not accessing something out of bounds.
Remember, we can have problems where the vector isn't a multiple of 32, so it won't line up perfectly, and we need this check to make sure we're not, like I said, accessing out-of-bounds memory. And then we just do simple vector addition: for this thread ID, we calculate a at this thread ID plus b at this thread ID equals c at this thread ID. Okay, so that will go ahead and do it. We can go ahead and build this project — all right, looks like it completed successfully — and we can go ahead and run it. And we're running, and we completed successfully. 12. CUDA Unified Memory Vector Addition: Managing memory can be messy sometimes, especially when writing very complex code. In this case, like I said, our kernel is exactly the same, so let's go to our main function. Last time, when we looked at our previous example of vector add, we had to declare two sets of pointers: we had to declare our pointers to host memory, or CPU memory, that we used malloc for, and then we had to get our pointers to device memory that we used cudaMalloc for. Now in the case of unified memory, we strip out this process entirely by having one set of pointers, and we call cudaMallocManaged. This says that we're not going to manage where this data lies at any given time. If it's on the CPU and should be on the GPU, that transfer will automatically happen, and if it's on the GPU and should be on the CPU, that transfer will automatically happen — either behind the scenes, or sometimes we can give some hints to say, you should be on the GPU now, or you should come to the CPU now. So we'll have the exact same thing: we can treat these still as regular pointers, and like I said, they can be accessed from either the CPU side or the GPU side.
So we'll just initialize them, and then we'll use the same dimensions for our grid and blocks: 256 threads per block, and then a grid size of the number of elements divided by our block size of 256. Now, even our CUDA launch looks exactly the same, and this just goes to show that all we're really trying to strip away is the idea that we have to manage our own memory. There is one new line, though, which is cudaDeviceSynchronize. Now why do we need this line? What does it do? cudaDeviceSynchronize says that at this point in execution — after the synchronization point — all previous events on the GPU have finished. Why is this important? This is important because in CUDA, kernel launches are asynchronous. We go ahead and launch things to the GPU, but by the time we return from that function call, the kernel is not guaranteed to have finished yet. Now, in the case of doing something like cudaMemcpy after a kernel launch, like we did in our regular example, cudaMemcpy is a synchronous call, and because we're issuing everything to the same stream — the default stream, since we didn't specify any — it will automatically serialize all of those CUDA calls. So in that case, our cudaMemcpy was actually also our synchronization barrier. But this time, because we don't have memcpys anymore when using unified memory, we have to explicitly say: I need to make sure that everything's done. Why do we need to do this? Well, if we do anything else during the time between the vector add launch and this device synchronize — if we try to access this data — that means we're concurrently accessing the data from the GPU and the CPU at the same time. So then we end up with a race condition over who should be accessing it at what time.
So for the case of vector add, only the GPU should have access to that data while it's doing the addition. After the synchronize, the code looks exactly like before: we know vector add is done, so we can call check answer again and print "completed successfully" if we pass; otherwise we'll throw an assert. So we can go ahead and build this project and see if it runs correctly. Same size of 2^16 elements in each vector, and we run, and we exit successfully, so we didn't have any problems. So functionally it works. Now, how do we improve performance? Because there can be a couple of problem spots. Think about what happens between calling cudaMallocManaged, initializing these vectors, and calling vector add. Previously, when we started GPU execution, we knew the data was on the GPU, because we explicitly called cudaMemcpy. In this case, we did no such thing. So what happens is the GPU starts up and says, "oh, I don't have any of this data," and it takes something called a page fault. It has to page memory in from the CPU to the GPU, which means it basically transfers the memory behind the scenes, page by page. Now, the page size can vary; I'm not quite sure what the page size is on GPUs, but on CPUs it's generally 4K, so it would transfer at a granularity of around four kilobytes. How do we speed this up? We can give hints about where data should be, and we do this by prefetching. Prefetching just says: behind the scenes, you can start transferring this data while we're doing other things. So in this case we call cudaMemPrefetchAsync, and we say: I want you to prefetch a, of size bytes, to this device id.
Now, this id comes from calling cudaGetDevice up here. In this case, it gets the device ID of the GPU. If you have multiple GPUs, there will be multiple device IDs you'd have to sort between, but in this case we have a system with one GPU, so the ID we get is automatically the identifier for that single GPU. And this just says: in the background, asynchronously, start transferring this data ahead of time. We can also do the other side: after our synchronization, we can start prefetching the device memory back to the CPU, because we know the GPU is done, its kernel has finished, and it doesn't need the data anymore. For this, there's a built-in cudaCpuDeviceId; unlike the GPU's ID, which we had to look up, this one is built in. So we just say: I need c, of size bytes, on the CPU now. Okay, and so we can go ahead and rebuild this. The rebuild was successful, so we can run it again. And there we have it, completed successfully again; no assert, nothing yelled at me. Okay, so that is the basics of unified memory on GPUs. As you can see, it's a very powerful thing when we don't have to explicitly manage memory. We get that with cudaMallocManaged, and then we can boost performance back to what we get with cudaMalloc using things like cudaMemPrefetchAsync. 13. CUDA Matrix Multiplication: Now, for every element of our result matrix C, we'll take a row of matrix A and do a sum of products with a column of matrix B. So what does that mean, sum of products? We'll take a11 times b11, plus a12 times b21, and that will equal c11. So as you can see, we'll march along a single row of A and down a column of B to generate a single output in the C result, and then we'll toggle up and down between rows and columns for the other elements.
So between c11 and c21, we toggle between the first and the second row of A; and between c11 and c12, we toggle between the first column and the second column of B, keeping A constant. So how is this going to look on GPUs? Using the same kind of diagram, our basic strategy will be: for every single element of our result matrix C, we assign one single thread. This single thread will traverse an entire row of A and an entire column of B in order to calculate that sum of products and get that single element of C. Every thread will do a unique pair of row and column: one thread will do this row and this column, the next thread will do the same row and the next column, another will do a different row and the first column again, and another will do that different row and the last column. And that will give us all four results. Now, how do we do 2D indexing? In the vector add examples we talked about previously, a vector is really just one big long line of elements, so a 1D thread block makes sense; the problem maps to it easily. But matrices map a lot better to two-dimensional shapes, like squares or rectangles (usually squares, just because they're very easy to lay out). So let's consider this case. We'll start with very tiny thread blocks, just for the example, but this is completely extensible to larger sizes. We'll have these tiny two-by-two thread blocks, denoted by the outlines and different colors. We have the same variables available, but you have to remember they can be in two dimensions. So we have blockIdx, which gives us which block number we are in the x and y dimensions.
And then we have threadIdx, which tells us which thread we are within a thread block, in the x and y dimensions. And then we have blockDim, which is the number of threads per thread block; it's constant for every single thread block, though the x and y dimensions can be different from each other. So let's start with a basic example of how we actually do these calculations. Let's take element (2,2), this first blue thread. Now, blockIdx.x tells us which block we're in in the x direction: if we consider this column block 0 in x, this will be block 1 in x. And how do we get to this block down here? That's blockIdx.y: this is blockIdx.y equal to 0, and this is blockIdx.y equal to 1. So blockIdx.x and blockIdx.y both equal to 1 gets us this pool of threads in the bottom right. Now, how do we get the individual thread? Within the block, the threads are 0 and 1 in the x direction and 0 and 1 in the y direction, so in this case our thread will be (0, 0). So what's the key idea? We can calculate the row and the column of a specific thread using these basic formulas. The row is equal to blockIdx.y times blockDim.y plus threadIdx.y. Let's see if that holds up here for the thread we want: blockIdx.y was 1, and blockDim.y is 2, remember, there are two threads per block. So that's 1 times 2, plus threadIdx.y, which we said was 0. So it's 2 plus 0, which is 2: our row is 2.
Then we do the column calculation, and it works out the same way: the column is blockIdx.x times blockDim.x plus threadIdx.x. blockIdx.x is going to be 1, remember, we go over one block. The block dimension in the x direction is two threads, so it's 1 times 2, which equals 2. And then our thread will be this first thread over here in this column, so threadIdx.x will be 0, and it's 2 plus 0 again, or 2. So we get the result (2, 2). This is extensible to any size matrix, so it's a very convenient mapping. And remember, in our CUDA kernel, the code is executed by every single thread in the system, so every single thread first has to know who it is; it has to do these thread-location calculations. So let's talk about some performance considerations, mainly coalescing. We can see that memory, even if it looks like it's laid out as a matrix, is actually laid out as a flat line in memory; that's what we call our address space. So if we look at the first load from, say, thread 0, it loads a[0][0], this first element, and if we look at the load from a different thread, say a[1][0], we see there's this offset between them. So if every single thread is loading from a different one of these rows, we get this staggered pattern: really wide accesses, with memory getting loaded from all over the place. Now let's look at a different kind of access pattern. If we access all of the elements in a single row at the same time, we see that they all get bunched together, and they can be done in maybe a single memory access instead of four memory accesses. This goes to the idea of locality.
All of those pieces of data are probably located together in the same block of memory, so when we load one, we get all of them: we wait for one block instead of four. So how does this play into matrix multiplication? We know we're going to traverse a row of A and a column of B, so think about every single access. Between the first iteration and the second: in the first iteration, we load all of the first elements of the rows and all of the first elements of the columns. Now, for the first iteration, all of the elements of the columns are in one straight line across the different threads, so those will be coalesced; but the row accesses all land in different rows of memory, just like before, so those will be divergent accesses. We'll see later how to do some preprocessing of the data so that both of these accesses are coalesced, but we'll leave it alone for now. There's also this idea of shared memory that we'll get into later: a user-managed L1 cache that we can preload some of this data into, so we don't have to go all the way to memory. Starting here, we have a very simple initialization, like we always do. This time we're doing 1024 by 1024 matrices; remember, this is a GPU doing big computations that we want to speed up, so the calculations will be very large. We allocate memory on our GPU, and we copy over our matrices after we initialize them. In this case, we'll be using 16 by 16 blocks, so 256 threads per block, but in two dimensions. And then we'll be using equal-sized grids as well: we divide the number of elements per dimension by 16, and we end up with 1024 by 1024 total threads, which is a very large number, 2^20.
So then we launch matrix multiply, copy back our result as per usual, and then we verify our results. Let's actually look at two things, starting with the serial implementation. The serial implementation is very simple: it's just a triply nested for loop. We say: I want to go through all the rows, I want to go through all the columns, and then for every row-column pair, I want to go over every single element in that row and column and accumulate the sum of the product terms. Now let's see how this translates to GPUs. Remember, we've got two dimensions of threads now, and what we actually end up doing is getting rid of two of the for loops. How? Because with this single-instruction, multiple-threads model, our rows and columns are effectively pre-calculated for us: we assigned a thread to each one of those positions. We're still doing the same sum-of-products calculation, but we only need to think about it for a single thread, because each thread has a unique row and column, so automatically we get our entire result matrix C. And then, of course, at the very end, each thread writes out the temp sum that it accumulated across the iterations to its corresponding position in C. So we can go ahead and build this. Rebuilt. This will take a little longer to run, because remember, we're doing the verification serially on the CPU, and that takes a long time; the GPU matrix multiply actually runs pretty quick, so we're mainly waiting on the CPU verification. And we run. As you can see, high memory usage while the CPU is working. And we're done. We didn't get any assert, which means our program passed the verification, so we did it correctly. All right.
That's going to do it for today. 14. CUDA Matrix Multiplication Code: In this video we're going to look at how to write a matrix multiplication kernel using CUDA. Last time we talked about the basics of CUDA and wrote a kernel that computes the sum of two vectors. This time we'll introduce some new concepts with a 2D grid of threads. With vector addition, we just needed one big long line of threads, which is basically one dimension, the x dimension. For matrix multiplication, what we're going to do is assign a thread to each element of the matrix, and a very natural way to do that is a 2D grid of threads. That way, you can index the matrix by row and column, which is just a very intuitive way of programming this. So let's go ahead and create a new file; we'll call it matrix_mul.cu. At the very top, we'll start out with our normal comments: "This program computes matrix multiplication on the GPU using CUDA. By Nick from CoffeeBeforeArch." And as always, we're going to need a main function, and then we can write at least the name of our kernel: __global__ void matrix_mul, and we'll figure out what arguments we need in a little bit. All right, so as per usual, we'll start out by setting our problem size. Now, we've got a couple of choices here: do we want to write a very general matrix multiplication, or one that gets to the point and is easily extensible? Here we'll go ahead and just assume that we have square matrices. If we didn't want square matrices, if we wanted arbitrary matrix multiplication with different dimensions, we would just need to declare three dimensions.
But in this case, we can just simplify things by leaving it as a square matrix so that the dimension of the x and y are the same. So we can just say int n is equal to. Now in this case, we can go ahead and set our problem size or default to be a 1024 by 1024 element matrix, right? So that's going to equal two to the 20 total elements, right? So we can do one shifted over ten times, right? And then problem, we can go ahead and say set our matrix dimension. And we can make sure that we call it a square matrix, right? That way we know when we're looking back what kind of matrix multiplication it, this is. And then it'll be two to the 20 or two to the 10 by two to the 10 default, right? So that's our default input. Okay? So just like our last one, we're going to allocate some memory so we can just go ahead and compute the number of bytes we're going to need ahead of time. All right, so we'll do size t. Bytes is equal to n times n times the size of an integer, right? Or whatever datatype are going to use in this case, it'll just be an integer. And it's n by n because n by n elements and an n-by-n matrix. Okay? So the next thing we do is we need to allocate some memory, allocate memory for our matrices, right? So we'll need some pointers first. So we'll have a matrix a, matrix B, and matrix C, right? And then we will need to allocate memory. So we'll use cuda malloc managed, right? So we'll use unified memory. That way we don't have to worry about mem copies. And we pass into here the address of a, right, so that pointer will get filled with a pointer to the actual memory write that is able to be transferred either to the GPU or CPU depending on where it's being accessed currently and how much space we need. Well, we need bytes of space, right? And so just like last time, right? The nice thing about this is that now when we change the size of our matrix, the only line of code that we need to change is going to be up here, right? This bytes, right? Oh, we lift into here. 
Okay. So now we've allocated space for our two input matrices, a and b, and our output matrix c, and like I said, it's going to be a square matrix. In the more general case, an m-by-n matrix times an n-by-k matrix would give an m-by-k result; we would just allocate and access memory based on those dimensions, so it's a fairly easy extension from this. The next thing we need to do is initialize our data. We just need some random numbers, because we don't have an actual problem we're trying to solve here; we're just implementing matrix multiplication. We'll do this with a function, just for convenience. So we can say: initialize our matrices, and we call init_matrix, passing in a pointer, say a first, and the dimension. And we do this twice, for a and b. So now we actually need to write the function that initializes the matrix. We can comment: initializes a matrix with random numbers between 0 and 100, and note that we're assuming a square matrix; if it were non-square, we'd need to change how we access it. We'll make this a void function called init_matrix; it takes one pointer, to the matrix, and one integer, the dimension of the square matrix, since it's an n-by-n matrix. To initialize it, we'll just do a single for loop: i less than n * n, remember we have n squared elements in a square matrix, and then iterate over it, saying m[i] = rand() % 100.
And this will give us random numbers between 0 and 99. Because we're using rand, we'll need to include cstdlib. (If you want better random numbers, go ahead and include <random> and use something like the Mersenne Twister.) Okay, so now we've initialized our matrices. The next thing we need to do is start talking about our grid: we're basically ready to send everything off to the GPU, so we should start talking about what we're going to send. Like I said, a square matrix times a square matrix gives us a square matrix of the exact same dimensions as a result. Now, a common strategy is to just assign one thread per output element, and that thread is responsible for calculating its output element. To simplify things, and make it intuitive how we index each element of the output matrix, we'll make a 2D grid of threads; that way, we can access each element with a row and column pair. To do that, we'll go ahead and set our block and grid dimensions. We can also say CTA, or Cooperative Thread Array; people often switch between CTA and thread block, they're basically interchangeable terms. So we'll set our CTA and grid dimensions. For matrix multiplication we typically use square thread blocks, especially in this case where we have square matrices, since it makes the code very simple to write; thinking about things in terms of squares instead of, say, rectangles is usually a little easier. So we can go ahead and set int threads; this will be our threads per thread block, per dimension, and we can use that same padding trick we did last time. If we want 256 threads in a thread block, that's 16 times 16 for a square block, so we'll set each dimension to 16.
And we'll leave that there for now. Then for our grid, the number of blocks we want, we'll use the padding trick: it'll be (n + threads - 1) / threads. But right now we still have a one-dimensional description: 16 threads per thread block, and (n + threads - 1) / threads total blocks. That isn't what we want. What we actually want is a 2D grid of threads. So how do we make one? What gets passed into the kernel launch, when we want a 2D grid of threads, is a struct with three fields: an x dimension, a y dimension, and a z dimension. This is a dim3. So we go ahead and set up our kernel launch parameters: we can specify a three-dimensional grid and a three-dimensional thread block. (There are two more launch parameters, the shared memory allocation and which stream we're assigning this kernel to, but here we're just doing a plain kernel launch: we use the default stream, and we're not allocating shared memory this time.) So we declare dim3 threads and make it a 16 by 16 block of threads. This is the same thing as declaring a dim3 called threads and then setting threads.x = 16 and threads.y = 16; you can access each of those fields individually like that, or you can just use the constructor. In this case, let's make it a little more general so we can change it if we want to: we'll construct it from our lowercase threads variable.
And then we need our 2D grid, the actual blocks that are going to be in it. We'll call this BLOCKS, uppercase, and it will be of dimension (blocks, blocks). So what we basically did here: we figured out the threads per thread block, then how many blocks it takes to go across the matrix in one dimension. And now we're saying: if you need that many for one dimension, you'll need the same in the second dimension, because this is a square matrix. So threads by threads, blocks by blocks; that's our calculation, and that's basically how we make our 2D grid of threads. Okay, so now we can actually get to launching our kernel. We call matrix_mul, which we've defined above, with our launch parameters, <<<blocks, threads>>>. And like I said, there are two other parameters here, the amount of shared memory and the stream, but we're using the default values because we're not using either of the two. Okay, so that'll be there, and then we have our arguments. Let's go up to our matrix_mul definition and start thinking about what we want to pass. Well, I know I want to pass the three matrices, so three pointers, a, b, and c, and then the value of the matrix dimension. Again, if this were a more general version, you would pass int m, int n, and int k, like if you were writing a GEMM, a general matrix multiply. But this time it's square matrices, so we can just pass in a single dimension. So we call it with a, b, c, and n.
And we have to remember: kernel launches are asynchronous, so we have to make sure we synchronize afterwards, either with cudaDeviceSynchronize or by directly calling cudaMemcpy, because cudaMemcpy is a synchronous call, but kernel launches are not. So if we're using unified memory and we don't want to directly do a memcpy, we have to make sure to at least synchronize in some way. Okay, so now we've launched our kernel. When we access these matrices on the device, unified memory will transfer the values over; they're not on the device by default. With this, just like we did for our vector addition kernel, we need to find a way to differentiate threads. In a kernel like this matrix_mul kernel, every single thread that we've launched executes the same code, so to differentiate the threads, you have to calculate some thread-specific information: basically, "who am I within this grid of blocks and threads that we launched?" For the 1D case, it was just a base and an offset, because we had one big long line of threads: we figured out the starting value for the threads in our block, and our offset within that block. Now we have a 2D grid of threads, so we need to calculate two things: a base and an offset for the row, and a base and an offset for the column as well. So we have two calculations now: we'll calculate the global row and column for each thread. It's the exact same calculation, but in two dimensions. To figure out what row we're on: rows are in the y dimension.
So this will be int row = blockIdx.y * blockDim.y + threadIdx.y. This just says: first, calculate the starting thread of this thread block in the y dimension, then add the offset in the y dimension where this thread is located. And it's exactly the same for the column, the only difference being the dimension: the column goes across the x dimension, so we just change all the y's to x's. So now we've figured out where we are within our grid. We've got a square matrix, and in this case a square grid of threads perfectly matched to our square matrix, because it's a power-of-two matrix. So what do we do now? We could actually start calculating matrix multiplication. But what if we didn't have a perfect power-of-two matrix to multiply? Well, in that case, some of our threads would fall outside the rows, and some would fall outside the columns as well. So we have to make sure we're still in bounds, which means we can't exceed the single dimension we have, n. To do that, it's just a quick if statement; we'll call it a boundary check for our matrix. The condition is that row is less than n, and column is also less than n. We need to make sure we're on a valid row and a valid column, so that we have a valid position to access. And then we can start our matrix multiplication. Because every single thread is going to compute one element of the output matrix, every single thread will go across one row and down one column.
So that's all we need to compute one element of the matrix: it's just a single for loop. We'll do for int i = 0; i < n; i++. All we've done here is go over the entire length of the matrix. Okay, so the next thing we do is accumulate some kind of result. We didn't initialize c to 0, so we need a temporary variable: up here, before the loop, int temp = 0. And then on every iteration of the loop, we'll accumulate our partial result into temp. So it'll be temp +=, and then we need to start accessing our a and b matrices. For a, we're going to go across a row of a, so we keep the row constant, which means it will be row * n; that selects one row in our flattened matrix, and then + i, which means we traverse every single element in that row. And what do we multiply it by on each iteration? The corresponding element of a single column of b. This time, the row is what changes, so it's i * n: we go down a row every single iteration, but the column stays constant: + col. All right, so now we've accumulated our temp result, and then we write back the result; here we can comment "accumulate partial result." And let's close out of this window so we've got a little better view. All right, so now, to write back the result: like I said at the very beginning, we launched as many threads as elements, or potentially more, but we've already masked the extras off up here with the boundary check so they don't touch any of our data.
Now we just need to make sure we write back to our corresponding row and column. Every thread computes one element, and that element gets stored at this (row, column) index. So here we can just say c[row * n + col] = temp. Everyone computes a single element. So now we've actually computed matrix multiplication, but it's always good to check that we did everything right, so we'll write a functional test: we'll compute matrix multiplication on the CPU and make sure we get the same answer as we got on the GPU. To do that, let's go ahead and write a verification function. So we can say: verify the result on the CPU. It'll just be a void function called verify_result, and it will take the exact same arguments: a, b, c, and the dimension of our square matrix. Now, in this case, a and b will be the exact same input matrices, but c is actually going to contain our solution from the GPU. What we'll do is accumulate our result into a temporary variable and then check it against the corresponding index in c. So in here we'll have int temp. And then, if you're unfamiliar with implementing matrix multiplication on the CPU, it's just going to be a triply nested for loop: for int i = 0; i < n; i++, then for int j = 0; j < n; j++, and finally for int k = 0; k < n; k++. This may seem confusing, but we can write in some comments that make it a lot clearer: the very outer loop says "for every row," the j loop says "for every column," and then this inner loop...
This k loop says "for every element in the row, column pair". So we iterate over every single row of our matrix and every single column of our matrix, and for every row, column pair we go over all the elements of each and compute one of our outputs. The cool thing about GPU programming is that it takes something like this triply nested for loop and basically unrolls the loops: instead of the two outer loops, we have two dimensions of threads, and that allows us to have only a single for loop inside our kernel, because those two outer for loops are transformed into the single row, column calculation up here. Okay, so let's actually compute it on the CPU now. For every row, column pair we set temp equal to 0, and then inside of here we compute matrix multiplication. It'll be temp plus-equals, and we'll do the same row-and-column indexing strategy that we used on the GPU. This time, for a, it's the outer loop: i corresponds to the row, so it will be i times n plus k, every element in that row. Then multiply this by b; this time the row is changing and the column remains constant, so it'll be k times n plus j, and the column stays fixed. Okay, and now we need to actually check the result. To do that we'll include a couple more things: we'll include cassert, so we can assert if any of our answers are wrong, and iostream just to print out whether we completed successfully. And then we'll pollute our namespace a little bit with using namespace std, so that when we print we don't have to type std::cout.
So back down to our verification function. Now we just need to verify that temp is correct, so check each result. Here we'll call assert, and we'll make sure that temp is exactly the same as the value stored in c, which holds the result we computed on the GPU, at this particular index. For this row, column pair it'll be i for the row and j for the column, so it'll be c of i times n plus j. Okay, so now we have our verification function; all we have to do is call it. Get rid of some of that whitespace. Right down here we can call verify_result, using a, b, c, and n. It will either pass, and just continue to the end of main, or it will fail, kill the program, and tell us where the assert failed. The last thing we'll do is print: if we get to this point, that means we've succeeded, so "program completed successfully", then a newline character, and then return 0. Okay, so that's our program. Let me compile it: nvcc matrix_mul.cu -o matrix_mul. And we can just make sure it works; we'll go ahead and run it. You see that it takes quite a while to run. That isn't because the GPU takes a long time; it's because the CPU version takes so long. As you saw, it's a triply nested for loop, and we didn't do any optimizations for matrix multiplication on the CPU. There are plenty of optimizations you could do there, but we only care about a functional test here, not CPU performance, in this video. So it completed successfully, and we can even profile it. Let's go ahead and disable this verify_result call.
Now that we've checked that it's correct, we don't want to wait so long for it to complete, so we'll call nvprof on matrix_mul. And you can see... oh, we didn't recompile it. Oops. All right, so we'll recompile and then we'll profile it. You can see here's our matrix multiply kernel: it only takes about 28 milliseconds to compute the whole thing, while it took seconds upon seconds to compute it, granted in a naive way, on the CPU. And this isn't even an optimized version of matrix multiplication; we'll do that another time. We can even extend this a little bit, or rather show off a little here. Right now we're assuming a perfect power-of-two matrix, 1024 by 1024. But because we have the if statement in there that handles non-power-of-two inputs, we can ask: what happens if I give it 1025 by 1025 elements? We did a good job, we made sure it works, right? So let's make sure we're verifying our result for this case. All right, and, whoops, we'll tab back into nvcc, recompile, and then run it. It'll take multiple seconds just to run on the CPU, and what we'll see is that it still completes successfully. So everything completed successfully. We can wrap up this episode talking about a couple of things. The basic idea here is that now we have a 2D grid of threads instead of a 1D grid of threads, and every thread corresponds to one output element, which is a very common strategy. We have a single if statement to check that row is less than n and col is less than n. How would we extend this to be a little more general?
Well, instead of having just int n is equal to 1 shifted left 10 times, we'd have three inputs: an m, an n, and a k dimension. We'd allocate for each of those, and we could leave the threads and the blocks exactly the same, but now we'd need to make sure the dimensions of our grid match up with what we're launching, based on those m, n, and k dimensions. Okay. 15. CUDA Cache Tiled Matrix Multiplication: So let's start with caching. What's the problem? The problem is that DRAM is slow. If we remember back to what we previously did, we would memcpy our two input matrices into device memory, into DRAM, and then we would have to load values in whenever we did our calculations. Now, we do get some locality within the L1 and L2 caches, because there is value reuse between different threads. However, there will still be times when there are conflicts, or things get evicted from the cache, and we have to go all the way back up to main memory. And if you're familiar with computer architecture at all, you know that main memory is pretty slow, so we don't want to be going out there if we can avoid it. What we'd like is a way to have fewer memory-related stalls, so that more of our time is spent on computation, on doing useful work. So the idea is: how can we guarantee that things are going to be in the cache? What's the solution? The solution on CUDA-enabled Nvidia GPUs is this idea of shared memory, sometimes called scratchpad memory. An easy way to think about scratchpad memory is as a user-managed L1 cache that's private per thread block. Now, what do we mean by user-managed?
When we say user-managed, this means that the programmer loads values directly into shared memory, and they stay there; that way we get that very nice 5-to-10-cycle access time to shared memory, versus the hundreds of cycles it would take to go all the way out to DRAM. So let's continue on. The other big thing that's really nice about using shared memory and cache tiling is that when we have very large inputs, the entire input won't fit in the cache. Our solution to this is: let's only put the pieces of the large input that we're using right now into the cache, and that way we always know that what's in the cache is useful. That's what we're going to do with shared memory. All right, so how does this look with matrix multiplication? We'll start with a familiar type of diagram: A times B is equal to a matrix C, where these are two square matrices. As before, with our original naive implementation of matrix multiply, each thread calculates a single element within the C matrix. Every thread is still going to calculate one element of C, but now, instead of sweeping across all of A at once, it's going to sweep across a subset of A, and that subset is our tile. So it ends up sweeping step by step by step: every loop iteration progresses the tile by the tile size. In our case we'll be doing an integer number of moves, and it lines up perfectly with the square matrix, because our matrix size is a power of two and our tile size is also a power of two, so the tile divides the matrix evenly. It's not much harder in case your matrices aren't sized perfectly; you just need to add a little bit of error-checking code. So how does this differ?
Previously, when we were calculating values, we didn't have to worry about exactly which values we were using at that moment, because we weren't managing latency: we were just going all the way out to DRAM, possibly, and we just wanted to make sure we grabbed the right element as we marched along an entire row of A and an entire column of B. But with shared memory, like we said, it's programmer-managed. So what do we need to do for each tile, on each iteration? We need to load the values of this block of the A matrix, and this block corresponding to the B matrix, into our scratchpad, or shared memory. An important thing to note is that when we're loading from the A matrix, we index it with a constant row: as we progress along, the row doesn't change, only the column changes, so it's a loop-varying column. And for the B matrix, because the tiles are moving downward along the y-axis of B, we have a constant column but a loop-varying row. So what does our actual calculation look like? It's actually pretty simple compared to our original matrix multiply. There, we had to think about which global value of A and B we were grabbing; but because we're loading everything into these tiles, we just index purely by our thread ID within the tile. The tiles, in this case, are a proxy for the thread block: our tile size will be the same as our thread block size, so a tile in our case is 16 threads by 16 threads. That's what we mean by tile size here. But like we said, we're taking a large input and reducing it to the size of our thread blocks at each loop iteration.
So we're only working on a small chunk of the entire problem at a time. And as we also imagined: for A, the row is loop-invariant. When we're doing the calculation for our temp values, on every tile iteration a thread goes across A and down B, but just within that tile. It's the same as we saw before, the column being loop-invariant or the row being loop-invariant, depending on which matrix we're selecting. So that's going to do it for a quick little background; let's get to the actual code. Okay, here we are in Visual Studio, and a lot of this will be exactly the same: our answer-checking function, our matrix initialization, even our problem size is still a 1024-by-1024 matrix. We're not going to go through all the allocations and memory copies, because we've done that in the previous videos. I have added the freeing of device memory and host memory; that's good practice. It doesn't make much of a difference in our case, because our program exits immediately afterward, so it would get freed anyway, but it's good practice. Let's get to our kernel: it's going to be a tiled matrix multiply. The first thing we need is shared memory. Shared memory can be allocated two ways: it can be dynamic or it can be static. In this case, for simplicity, we'll use static allocation with a constant size. Like I said previously, we're going to have thread blocks that are 16 threads by 16 threads, a total of 256 threads per thread block, and we'll set our shared memory size so that one element per thread gets loaded at a time. So if we have 256 threads in the thread block, the thread block will load 256 elements, and remember, shared memory is private per thread block, so every single thread block does this. In this case the size is in integers, 4 bytes each; I just wrote that out as a literal, times 16 times 16.
And then in here, if we use the __shared__ prefix ahead of this array, that says that it's going to be stored in shared memory. That's how we denote that we're going to use this scratchpad memory, this user-managed cache. Then some things we've seen before: for the sake of brevity we'll shorten threadIdx.x and threadIdx.y to just tx and ty, and blockIdx.x and blockIdx.y to bx and by, just so we don't have to keep writing the long form over and over. Then we'll calculate our global row and our global column as we've seen before. And then we see the new thing, the thing that sort of differentiates tiled matrix multiplication. Instead of iterating over the entire length of the matrix one element at a time, this time we iterate based on the total size of the matrix divided by the tile size, that is, the number of tile steps there are in the matrix. So here we'll say n, our matrix size, divided by tile size, which in this case is 16 because we have a 16-by-16 thread block. To clarify that a little: think about how many threads you need to completely line up in the x-direction and the y-direction. If the x-dimension of the matrix is 1024, you would need 1024 divided by 16 steps, the same as the number of blocks. So let's see how our indexing works for loading shared memory. Loading shared memory is like anything else, like a regular array, except on the back end it's going to be put into that user-managed L1 cache. To calculate the row for the A matrix, we start with row times n; for the A matrix the row is constant, because remember, for A we go across the rows. Then, to calculate the column, which is changing, first we need to index the set of columns we're working on.
Remember that a tile holds only a subset of the total columns, so this gives us which subset of the columns we're working on, and then the thread index in the x-direction says which column within that set we're going to work on. For the B matrix it's exactly the opposite: the rows are changing and the column is invariant. Here our col indexes the global column we're going to work on, and i times tile size times n indexes the new set of rows: whatever the iteration number is, times the tile size, times n. We have to think about how this is laid out linearly in memory. If we want to move down one row in memory, we have to step over all the elements associated with that entire row, so we multiply by the matrix size. Then, to get the specific row within that set of rows, we use our thread ID in y, which moves up and down, times n, because the number of elements between, say, row 1 and row 2 is n. That's why we multiply by n. And that's what we'll do in order to load things into these A and B tiles: every single thread loads a single element into each. And then we have to call __syncthreads(). This is a new thing. What __syncthreads() does is say that every single thread within this thread block has to reach this point before any can continue; it's a barrier. The reason we need it is because of what happens down here. After we go through this loop iteration and come all the way back up, we don't want to start stomping on memory. We're loading all this stuff into shared memory under the assumption that, even though only one thread loads each value, many threads are going to access it. We call __syncthreads() here to make sure everyone has finished loading their values before we proceed. Then we do this second loop; this is our actual calculation.
This just says that every thread is going to sweep across the rows within the tile and down the columns within the tile. And that's simply indexed: for A, it's just which row we're on, and the only thing that changes, the loop-varying part, is the column as we go across the row. It's the opposite for B: the row we're accessing changes with the loop, because we're going down a single column, so it's j times tile size, and then a loop-invariant column; that plus tx just says which column we're on within the tile. And then we call __syncthreads() again. The reason, as I already mentioned, is that if some threads hadn't finished their computation yet and we didn't have a barrier here, we would go back up to the top of the loop and start overwriting shared memory, potentially before some threads have read it. So we need two barriers: one to make sure everything's loaded, and one to make sure that everyone has finished accessing it. All right, and then finally, after we've done all the iterations, we write this temp value we've been accumulating out to DRAM, and that's it. That's going to do it for our kernel. So we'll go ahead and rebuild the solution; this should look just fine. Then we'll run it, and it will go through our error checker. Let's see if it gives us an error. No error, so it looks like it completed successfully. Now, I claimed that this was going to be faster than our naive implementation, so we can do a quick check to make sure it is. Let's open our previous project, our naive matrix multiplication. That's this simple loop over every single element, doing the row-times-column multiply-and-add, but accessing global memory.
So we can use Nsight and start a performance analysis. We'll profile the code, the application will launch, and then we can look at the kernel launches. Here in the duration column, and I can zoom in on it, we have a little over 18.7 thousand microseconds, so about 18 milliseconds, to do a 1024-by-1024 matrix multiplication. So let's see what happens when we do our tiled matrix multiplication. Okay, we don't need to save that. Let's open up the tiled version again: Nsight, start performance analysis, profile, launch. We'll go to launches and look down here. Now we're all the way down to about 8 thousand microseconds, down from 18 thousand. So we saved about 10 thousand microseconds between cache tiling and our naive implementation. That's a really huge win when we think about performance increases. In later videos in this series, we'll go into other opportunities to optimize this matrix multiplication even further, digging into the architecture a little bit, as well as how to optimize other types of applications, and when certain optimizations fail or break down. But that's going to do it for today. 16. CUDA Accounting (nvidia-smi): In this video, we're going to explore a new feature that was added to nvidia-smi in CUDA 5.5. One interest that is fairly unique to high-performance computing is the ability to track the resource usage of each job in a cluster, be it for the purpose of improving utilization, justifying new hardware purchases, or billing users for their time on the system. This is frequently described with the term accounting, and in CUDA 5.5 we've added some basic accounting stats to nvidia-smi to help address this. Before we begin, it should be noted that on Linux it is usually desirable to enable GPU persistence mode. This keeps the Nvidia driver initialized even when no clients are present; in environments without X, this can be a common situation.
By keeping the driver initialized between CUDA application runs or nvidia-smi queries, we preserve any non-persistent driver settings. This is important for GPU accounting, as the in-memory data associated with this feature is lost if the driver is unloaded. Persistence mode can be enabled like so. When trying to understand available features and command-line options, a good place to start is always the nvidia-smi man page, and also the nvidia-smi help output. In both of these you'll see descriptions of the relevant accounting commands and outputs, along with the type of data that can be associated with an accounted process. The basic idea is that the Nvidia driver will keep track of each GPU process, referenced by its CPU process ID, and record the following per-process properties. One: time, the lifetime of the process, defined as the duration of its compute context. Two: the high-water mark for memory usage; this is the maximum framebuffer memory requirement of the process over its lifetime. And finally, three: the average memory and GPU utilization for the process; these values are averaged across the lifetime of the process. One way to view current accounting data is to run nvidia-smi and query the accounting section. We see that accounting is currently disabled on this GPU, so the first step is to enable it, which requires root privileges: we run sudo nvidia-smi with the device index and enable accounting. Rerunning the previous command, we now see that accounting is enabled, but there are no processes recorded. This makes sense, as we have yet to run any CUDA applications on the GPU with accounting enabled. Before running a CUDA application, we'd like to ensure that it runs on the GPU we're interested in, which in my case is the K20c. One way to do this is to use the CUDA_VISIBLE_DEVICES environment variable, which can restrict the set of GPUs that CUDA apps can run on.
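The steps described so far can be summarized as a short command sequence. These are real nvidia-smi options, but treat the exact field names in the scriptable query as an assumption to verify on your system (`nvidia-smi --help-query-accounted-apps` lists them); the device index `0` is just an example.

```shell
# Enable persistence mode so accounting data survives between runs (Linux, root)
sudo nvidia-smi -pm 1

# Enable per-process accounting on GPU 0 (requires root)
sudo nvidia-smi -i 0 -am 1

# Human-readable summary of accounted processes
nvidia-smi -i 0 -q -d ACCOUNTING

# Restrict CUDA apps to GPU 0, then run a sample to accrue accounting data
CUDA_VISIBLE_DEVICES=0 ./nbody -benchmark

# Scriptable (CSV) output, for consumption by another process
nvidia-smi -i 0 --query-accounted-apps=pid,gpu_utilization,mem_utilization,max_memory_usage,time --format=csv

# Clear the collected accounting data
sudo nvidia-smi -i 0 -caa
```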
In this case, we'd like to restrict CUDA apps to run only on the K20c, the GPU that currently has accounting enabled. Since the enumeration of GPUs by CUDA can differ from the enumeration of devices by nvidia-smi, we should first find the CUDA index of the K20c. We can do this by running the deviceQuery sample included in the CUDA toolkit. Here we can see that the K20c is at index 0. With this in mind, we can restrict CUDA applications to run only on this GPU, like so. Running deviceQuery again, we see that only the K20c is available. Note that CUDA_VISIBLE_DEVICES only affects what CUDA applications can see; nvidia-smi, as a monitoring and management tool, is unaffected by these CUDA-specific requests. That is why we must still use the -i option to limit the scope of our actions. With this done, we can now run some CUDA applications to accrue accounting information. From the CUDA samples, let's run nbody; this is a compute-heavy app. Let's also run another app, like the bandwidth test. Now let's view the collected accounting data. An easy way to get a human-readable summary is via the query command we used earlier. Here we can see that two processes have now been recorded: the first is nbody, and the second is the bandwidth test. Another useful way to access accounting data is through scriptable output, which can be done using a different command, as shown here. This is useful in cases where the immediate consumer of this data is another process rather than a human. Accounting data can be reset by running the clear command; checking the accounting after doing this shows that all the previous data has been removed. That's it for now on the new accounting feature of nvidia-smi, available with CUDA 5.5. 17. CUDA Thrust Library: Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity.
Using Thrust, C++ developers can write just a few lines of code to perform GPU-accelerated sort, scan, transform, and reduction operations. The high-level interface also enables performance portability between GPUs and multi-core CPUs, and interoperability with established technologies such as CUDA, Threading Building Blocks, and OpenMP facilitates integration. Like the C++ Standard Template Library, Thrust provides a vector container; much like the STL's vector, there are host_vector and device_vector types. The first thing we're going to do is generate 500 million random numbers on the host CPU. To generate these numbers on the host, let's first allocate a large enough host_vector to hold the values. In this case we're sorting integers, so I used int as the template parameter to the host_vector constructor. Now let's generate these numbers using the CPU. This line says that we are generating all the numbers starting from h_vec.begin(), the first element of the vector, to the last element; h_vec.end() points to the non-existent element one past the end of our vector. And rand is the host function that generate will use to create a value for each element of the vector. Thrust knows to run this on the CPU because h_vec is a host_vector. Now that we have randomized our input data, let's transfer it over to the device. With Thrust, this is just one line of code, which contains nothing GPU-specific. This time we're creating a device_vector of ints and assigning it from the already-initialized host_vector; Thrust will handle copying the data from the host to the device. At this point we can read or set individual elements of the device_vector using standard array notation. For example, we can set element 31 to five, or element 100 to 10. But you should note that when running this code on a GPU with Thrust's CUDA backend, each one of these calls invokes a cudaMemcpy across the PCI Express bus.
So it's best to avoid this in the interest of program speed; let's leave these lines out, as they are not relevant to the program. Now that we have our initialized data on the device, our next step is to sort the data, and again, this is a single Thrust call. This version of thrust::sort sorts the data in place, but there are variations available that sort data out of place, as well as a selection of sorting algorithms. Our final step is to return the data to the host, although for this contrived example we won't be doing anything with the data after we copy it back. This time, I'll use the thrust::copy function to do this. And that's it: we've written our basic sort, and there was no GPU-specific code anywhere to be found. The file we'll be compiling is the same as we've just written, but with some timing code around the device_vector transfers and the sort. Now let's compile, which is very easy using Nvidia's C++ compiler, nvcc: set the optimization level, specify the GPU architecture we're compiling for, and name the output. Now let's run this. This took about 2.1 seconds to sort our 500 million numbers, and that includes the data transfer time across the PCI Express bus. In a real-world application, you would want to move as much of your parallel program's data over to the GPU as possible and leave it there as long as possible, to hide this data transfer overhead. Finally, another benefit of Thrust is its portability. While most users use Thrust to take advantage of the massively parallel compute capabilities of a GPU, it supports multiple backend implementations, such as OpenMP and Intel's Threading Building Blocks, and you can even implement your own backend if required. Now let's quickly show how easy it is to move from running on the GPU to using OpenMP and running on the CPU. In fact, we're not going to make any changes to our source code at all. All I have to do is add an option to the host compiler, which I do using the -Xcompiler option to nvcc.
And we tell it to use the OpenMP device backend instead of the default CUDA backend. Note that when using the OpenMP or Threading Building Blocks backends, you are not required to use the nvcc compiler. Now let's time the CPU version and see what we get. In addition to the built-in algorithms and operations, Thrust makes use of functors to extend its capabilities. A functor is a function object: an object that can be called as if it were an ordinary function. In C++, a functor is just a class or struct that defines the function call operator. Because they are objects, functors can be passed, along with their state, to other functions as a parameter. Thrust lets you apply custom operations inside its algorithms by passing a functor to the algorithm. Thrust comes with a handful of predefined functors, but I'll also show you how to write your own and use it in a Thrust algorithm. For our simple contrived program, we want to generate 500 million random floats on the CPU and then transfer them to the device. Since we did this in the previous video, we'll start our code from there. Now that we have our values on the device, let's next create a temporary vector which is the negative of each element of the input. I can do this with the thrust::transform function. However, since I don't want to overwrite the original vector, in my case I'm going to create a temporary vector to hold the result of the transform, and we set the size of the temporary vector to be the same as our input vector. The transform function call looks like this. The last parameter to this function is our functor. Since this functor is a template object, we have to tell it what type of values it will be expecting, in our case floats. We also have to add parentheses, as we're calling the constructor, which instantiates the function object. Alternatively, we could declare an object like this and pass it in as the last parameter to transform.
So now d_temp holds the negated values of our input vector. Next in our program, I want to do a min reduction on our d_temp vector. A reduction takes a number of elements and reduces them down to a single value; in a min reduction, that's the minimum value of all the elements. The third parameter is the initial element used in the reduction. For a minimum reduction, I can just use the first element of the vector, like so. The last argument is the built-in minimum functor. And finally, we'll print out our value. We compile with nvcc as in our previous video: set the optimization level, specify the architecture we're targeting, in my case sm_35, and run. Running it, we see it took about 0.8 seconds to transfer the data, do the transformation, and then the reduction. Now, one problem with the program as written is that we're using a large temporary vector on the device to hold the result of the transform before we do the reduction. In addition to the extra memory used, we're doing a lot of reads and writes to global memory, which we want to avoid if possible. Because this transform-then-reduce pattern is so common, Thrust has a built-in combined transform_reduce function, which eliminates the extra memory and the extra reads and writes; this is called kernel fusion. So let's modify our code and try that. First, I'm going to delete this reduce call. We'll move our init variable above the transform, and we will now use the transform_reduce function. We no longer need our temp vector. The third parameter is the functor that the transform will use, and the last two parameters are the reduction, as before. We can now delete our temporary vector, change our init value to use the original vector, and then finally modify our headers to include transform_reduce and eliminate the separate copies. We'll just recompile this and run again. And you can see that with just a simple change we've got the same answer, and with less code we got more speed.
Now, what if I wanted to change our transform to square the value of each element instead of returning the negative? Well, a quick look through the Thrust documentation shows there's no built-in square functor, so we'll have to write our own, and we do this with a struct. The __host__ __device__ line tells the nvcc compiler to create both a device and a host version of this function, to maintain portability between CPUs and GPUs. We're overloading the function call operator, which can take any number of input parameters and return any type of output parameter; it's the most versatile of the overloaded operators. The actual function just returns the element times itself. Now that we have our functor, let's use it in our transform_reduce call. And let's compile again and run this. And it runs successfully. You could also create your own functor for the reduce part of the algorithm; you would just create a functor that takes in two elements and returns a single value, or what's called a binary operation. The square operation we wrote was a unary operator: it only took one element as input. We have now seen some of the more useful algorithms available in Thrust, as well as the flexibility and power that comes from functors. 18. CUDA Thrust Library - Part 2: Okay, morning, and welcome to another CUDA video. We'll start looking at some of the GPU-accelerated libraries that come with CUDA. So CUDA is actually a pretty rich environment, and when you download CUDA, you actually get a bunch of really cool libraries to use as well, which can make GPU-accelerated algorithms, and CUDA programming in general, much, much easier. So we'll have a look at one of those in particular today. Here are just a few of the standard libraries that you get with CUDA. Thrust, which is what we'll look at a bit of today, is a library of algorithms and data structures for many different common things, like sorting and scanning.
And it's a CUDA implementation of the C++ STL, the Standard Template Library I should say. So anybody that's coded C++ for a little while is probably familiar with the STL. cuRAND, and this one is a classic, provides pseudorandom number generators for use by the host and the GPU. So this is really, really cool: if you've got, say, a simulation you're doing with the GPU, just simulating a bunch of objects, maybe some of your objects need some sort of randomness to their behavior. cuRAND is really, really good for that. Okay, cuBLAS. This is a CUDA implementation of the BLAS library, or Basic Linear Algebra Subprograms. We've also got cuFFT for the fast Fourier transform. And there's a bunch more. So there aren't actually that many more standard CUDA libraries that come with the CUDA download, but there are a lot of libraries out there written by other people. So today we're going to be looking at the Thrust library. We're just introducing it, I guess; it's pretty big, it's a big library, but we'll just sort of scratch the surface today. And Thrust, like I said a minute ago, is the Standard Template Library for C++, only CUDA, GPU-accelerated. It's also really, really easy to use. It's all based on vectors, or a lot of it is based on vectors. So it's just like the STL, except in Thrust we've got two different sorts of vectors: we've got host vectors and device vectors. Host vectors are stored in system memory; that's your CPU vector, pretty similar I guess to the regular STL vectors, which of course are host vectors. But we've also got device vectors in Thrust. So you want to include your headers; there's a couple of headers, and we'll have a look at setting up these projects at the end. But I have noticed that there's a bit of a problem with this. Yeah.
You might have to add the CUDA include path to your project's header paths. Yeah, you might not have to, though. So this is the include needed for the vectors. Alright, so the first thing that we should look at is just defining a vector. You want to use the fully qualified names here, so thrust::host_vector, just to save confusion with the STL vectors. Yeah, so a host vector is just defined with host_vector, and then you give it a type in the angle brackets, you give it a name, and then you give it the number of items that you want, a hundred in this example. And you can also define a device vector just as easily. So this is really cool: this is going to do a cudaMalloc in the background. Something like this: thrust::device_vector<float> is going to give me a device vector of floats; I'm going to call it dv, and it's going to have 25 items. Yeah, it's as easy as that. The other thing that you can do is set the items of your vector to some initial value, so you can pass an extra argument to that initial definition of your vector. Right here, for example, sets up a host vector of integers: there are 100 integers, that's the first parameter, and they're all set to 25. I don't know why I've put a float there for an integer vector, but I think you get the point; that should just be 25. The other really nice thing, a really convenient thing, is that memory management is automatic, so you don't have to delete these vectors. Yeah, it's automatic. Alrighty, moving along: accessing elements is really, really easy. You can use the square brackets, that is, the array indexing operator, to access the device vectors or host vectors. So just here I've got a bit of an example of accessing an element from a device vector: element number 15 from my device vector of integers. Once again, for some reason, I've put 23.4, a float, as my initial value for an integer vector. I don't know what I was thinking.
Obviously I was thinking about 23.4 for some reason. But what you should be aware of is that there is actually a cudaMemcpy, device to host, implicit there. So it's not going to be particularly fast. You want to kind of limit the number of element accesses and things like that with device vectors, because there is a cudaMemcpy in the background. Don't forget that. Also, this is really, really cool: you can actually copy from a device vector to a host vector, or vice versa, really quickly, just using the equals operator. Yeah, so there's a cudaMemcpy there as well; don't forget that it won't necessarily be as fast as just accessing host vector elements or something like that, but it's pretty convenient. Yeah, just the equals operator. Or you can use the thrust::copy algorithm with your iterators. Iterators are still available, with things like begin and end iterators. So you can use thrust::copy as well to copy elements from one vector to another. Very cool. Okay, adding and removing items is as simple as it was in regular STL for C++; we just use push_back and pop_back. So here's a bit of an example of that. I just push back to a host vector, which conveniently I've not set up; yes, this is an example. So I push back a couple of integers, and that's going to push those integers onto a host vector. You don't need to call yours hv and dv, obviously; I'm obviously not, for example. And we've also got some push_backs to a device vector, which is just as easy. And I happen to use a hex value up there, but you don't have to; it could just be an integer. And then I just pop them one after another, for no reason at all. Yes, so push_back and pop_back: you add and remove items from a vector the same as in the STL. Okay, but moving along, this is where Thrust really comes into its own. So we've also got the algorithms, things like sorting.
And you want to include the thrust sort header, thrust/sort.h, and it should be right as rain. So this is just an example of sorting a list of integers on the device. And what I've done here is used the generate algorithm to fill a host vector with random integers; it's just going to call this random function here, which returns a random integer from 0 to 99. Then I've copied that host vector to the device, and I've used thrust::sort to sort that device vector. That's going to be an unstable sort, by the way; there is also a stable_sort. Then I've copied it back to the host and printed the values out, just to make sure that the device is actually sorting like it should. So thrust::sort is going to sort those 100 items using a GPU-accelerated sort. Very, very cool. Yeah, and the other thing we'll look at tonight (there are a lot of algorithms that Thrust can actually do, but we're just looking at a few simple ones) is summing. So it's called a sum reduce. If you ever need to get a single value from an array of items, they call this a reduce algorithm, and a sum reduce sums all of the items in an array. Anyway, what you want to do is include the reduce header in your code. And here's a bit of an example of a sum reduce. So this example just here is pretty much the same as the last one, except instead of using random numbers, I've generated a sequence on the host, in the host vector. So what this sequence call does is generate a sequence of integers from 1 to 1000. And then I use thrust::reduce to sum all of those items. So in theory, that should give you 1000 divided by 2, multiplied by 1000 plus 1. Yeah, if it's working. And I print that sum out to make sure the GPU is actually summing, or sum-reducing. Okay, but here we've got a bit of a coding example. So I'll have a bit of a go at this, and hopefully you can have a go as well.
But what I might do first of all is just have a quick look at how to set up a project to use Thrust, because, yeah, it might just help. So we've got to get through setting up a basic CUDA project first, then create a Thrust program. So what I'll do first of all, over here on my new project in Visual Studio 2012, is include my build customization, CUDA 6.0 in my case; if you've not got it, check your build customizations. So that's my build customization. I might add a new item, call it kernel.cu, add iostream, the using namespace std, int main and return 0. So that's pretty much the standard little C++ project. Then the next thing I might do is make sure the linker is linking to cudart.lib; that's the CUDA runtime. What happened there? That disappeared; it goes in the same place. Now, that's better. And it used to be that you'd have to set up your CUDA library paths, but those seem to be fixed. But what is a problem on this computer, and I don't know if it'll be the same on yours, is that the include directory for CUDA doesn't seem to be working on my machine, so we'll have to set that up. You just go to VC++ Directories and then Edit, and you click the little dropdown and then the dot-dot-dot, and you find your CUDA include path. That's probably going to be the same on my machine as it is on yours, but I don't know. So, C drive (on mine that's the main system drive), and we go to NVIDIA GPU Computing Toolkit, then CUDA, version 6 or whatever version you're using, and include. So there's the thrust include folder just there; we don't actually need to go into the thrust include folder, we'll just stop at include, and Select Folder and Apply. The other thing that I want to do before we get coding is just change my compute capability, and Apply and OK. Alright, and now we should be ready to start coding a bit of Thrust. Let's have a look. So the first question: create a Thrust device vector and host vector, called dv and hv.
I guess the first thing we want is a device and host vector, so we want to include the Thrust header for device_vector; that's the one, and I'll just copy that and do the host_vector header as well. And I'd say you don't want to do the old "using namespace thrust", just in case it conflicts with the STL algorithms; a lot of them have the same name, so just be careful what namespaces you use. It's safest to use fully qualified names, but I mean, there are no rules; you can do whatever you need to do in the end. Anyway, the device vector: it's an int vector, and I'll just call it dv, and I'll set it to 0, so it's going to have 0 items in it initially. And the other one is a host vector, the same as that. And I might just hit play and make sure we're rolling so far. I guess it's taking a little while, probably linking like crazy to all of these really excellent Thrust libraries. Yeah, it seems to be working so far. Okay, so: add five ints to the host vector using rand. So rand is just the regular C++ function. Using push_back; push_back is a Thrust vector method. So what do we want? hv.push_back(rand() % 100), inside a for loop: for int i equals 0, and so on. Okay, so that should add five integers. They're not actually going to be in the range of 0 to 100, though; they'll be in the range from 0 to 99. So maybe I'll just change that; there we go, that's going to give us the possibility of selecting 100 as well, like I said. Our next question: copy the host vector to the device vector. Well, this is very easy; we can just say dv equals hv. And so there is a malloc going on in the background there, or not a cudaMalloc, sorry, a cudaMemcpy from the host to the device, so it wouldn't be particularly fast. But once we've got our vector on the device, some of these algorithms will absolutely fly.
So that question, I'd say, is easy. Next is: sort the device vector. I guess sorting is another algorithm, so you should include the thrust sort header. Some of these algorithms actually need a header, yes, like sort, and maybe reduce or something; I'm not sure, but it's best to include the headers anyway, I suppose. So, thrust::sort. And what do we want to sort? Okay, we want to sort from dv.begin() all the way to dv.end(). So you can just use the begin and end iterators there in the sort algorithm, which is lovely. That should sort the device vector. We might as well just print that out to make sure it's working. So: for int i equals 0, i less than five, and we print out dv[i]. Awesome. That will just make sure it's sorting the vector there. So this is going to be slow; never, never actually iterate over and print out the items of a device vector, you want to copy into a host vector first. But for this example, it should be fine. I'm hoping that the device sorts properly; fingers crossed. It's painful because it takes a while to build; I mean, it doesn't take that long. And if we step again: item 0 is 38, well, that's good, and then it gives you the rest of them, all in order. Yeah, so they've been sorted, which is really, really good. You will find that stepping through doesn't really work that well with Visual Studio; the device code is not actually sitting where it looks like it's sitting in your code file, so debugging is a little bit strange. But never mind, it sorted it anyway, which was the point of the example. So the next trick, the next trick in my amazing bag of tricks, is sum reduce. I guess sum reduce is another algorithm, and we want to include the right header for it, which is the reduce header.
If we ever get time, we should have a look at a sum reduction using plain old CUDA, because it's not as easy as a one-liner; it's an interesting topic. But anyway: int sum equals, and I might get rid of this loop up here that prints out the values, we don't care about that anymore. And what are we doing? thrust::reduce, with the same parameters as before; you want to give reduce the iterators, dv.begin() to dv.end(). And that should sum the five items in the vector on the device, so that's going to be pretty quick. And the final thing that we want to do is print out the average. The average is going to be that sum divided by the number of items, which was 5, written as 5.0 so the decimal part isn't lost. Okay, so that's the final question. Let's see what we come up with. Hopefully it should be something around 50; we picked five random items in the range from 0 to 100, so the average should be about 50. Taking a while again... there we go. So the average is 63.2. Sounds pretty good to me; I mean, it's not 50, but it's not too far off either. So that's not proof that the device actually performed the sum reduce on the items of the array, but it's a pretty good indication that it did. Alright, so that's just a general introduction to Thrust, the Thrust library, CUDA's version of the STL. Thanks for watching.