# Hello CUDA!

## The Problem

If you want to run some simple CUDA code on your GPU, matrix multiplication is a good starting point1. Here I'll show how to find the product of two matrices that is simple enough to verify by hand.

$$\begin{pmatrix}2 & 2 & 1 & 3\\ 2 & 2 & 3 & 3\\3 & 2 & 1 & 3\\2 & 1 & 3 & 2\end{pmatrix} \begin{pmatrix}1 \\ 4 \\ 9 \\ 16 \end{pmatrix}= ?$$

## A Solution

Technology moves quickly - most examples of CUDA code involve explicit copying of memory from CPU to GPU and back again with cudaMalloc, cudaMemcpy and cudaFree. Explicit copying is no longer necessary since unified memory was introduced2, but even new examples use the old ways. This example uses cudaMallocManaged.

The calculation code is fairly simple:

#include <iostream>

//Kernel function to find AB=C, the product of a square matrix A
//with dimensions n, and column vector B, storing the result in c.
__global__
void matrix_multiply(float* A, float* B, float* C, int n)
{
//Each of the four threads calculates a single entry in C
C[index] = 0.0;
for (int i = 0; i < n; ++i)
{
C[index] += A[i + index*n] * B[i];
}
}

int main()
{
constexpr int n = 4;
float* A;
float* B;
float* C;

// Allocate unified memory, accessible from CPU and GPU
cudaMallocManaged(&A, n * n * sizeof(float));
cudaMallocManaged(&B, n * sizeof(float));
cudaMallocManaged(&C, n * sizeof(float));

// initialize A and B arrays on the CPU
int i = 0;
for (float entry : {
2, 2, 1, 3,
2, 2, 3, 3,
3, 2, 1, 3,
2, 1, 3, 2})
{
A[i++] = entry;
}
B = 1;
B = 4;
B = 9;
B = 16;

//Run the kernel on the GPU
matrix_multiply<<<1, n>>>(A, B, C, n);
auto err = cudaGetLastError();
if (err != cudaSuccess)
{
std::cout << cudaGetErrorString(err) << "\n";
}

//Print C, the result of the matrix multiplication
std::cout << "Hello ";
for (int i = 0; i < n; ++i)
{
std::cout << (char)C[i];
}
std::cout << "!\n";

cudaFree(A);
cudaFree(B);
cudaFree(C);
}


In addition to the comments in the code, some things are worth explaining:

• The sizeof(float) is checked on the CPU, so you might wonder what would happen if the GPU has a different size. It can't - for all supported platforms, floats have 32 bits. You can also use 64-bit doubles or 16-bit half-precision floating point numbers.
• The example uses floating point numbers, when the data is clearly all integers. I think that it makes the code a little easier to follow - all the matrix data and only the matrix data is floats. Do experiment with changing both the values and the data types.

## The Output

Compile and execute it3 and the output should be pretty much what we expect from an introductory code listing:

Hello CUDA!


That is, treating the contents of the output matrix C as ascii printable characters, the code calculates:

$$\begin{pmatrix}2 & 2 & 1 & 3\\ 2 & 2 & 3 & 3\\3 & 2 & 1 & 3\\2 & 1 & 3 & 2\end{pmatrix} \begin{pmatrix}1 \\ 4 \\ 9 \\ 16 \end{pmatrix}= \begin{pmatrix}67 \\ 85 \\ 68 \\ 65 \end{pmatrix}= \begin{pmatrix}C \\ U \\ D \\ A \end{pmatrix}$$

## Profiling

Although it obviously doesn't take very long, it's worth checking out the profiling tool nvprof for this trivial task:

==8968== NVPROF is profiling process 8968, command: cuda_test.exe
Hello CUDA!
==8968== Profiling application: cuda_test.exe
==8968== Profiling result:
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  100.00%  2.8800us         1  2.8800us  2.8800us  2.8800us  matrix_multiply(float*, float*, float*, int)
API calls:   99.02%  143.40ms         3  47.799ms  16.950us  143.35ms  cudaMallocManaged
0.44%  631.23us        94  6.7150us       0ns  374.94us  cuDeviceGetAttribute
0.25%  368.22us         3  122.74us  22.210us  304.80us  cudaFree
0.16%  235.54us         1  235.54us  235.54us  235.54us  cuDeviceGetName
0.06%  88.840us         1  88.840us  88.840us  88.840us  cudaLaunch
0.06%  83.580us         1  83.580us  83.580us  83.580us  cudaDeviceSynchronize
0.00%  6.7220us         1  6.7220us  6.7220us  6.7220us  cuDeviceTotalMem
0.00%  2.9230us         1  2.9230us  2.9230us  2.9230us  cudaConfigureCall
0.00%  2.6300us         3     876ns     292ns  2.0460us  cuDeviceGetCount
0.00%  1.4610us         4     365ns     292ns     584ns  cudaSetupArgument
0.00%     876ns         2     438ns       0ns     876ns  cuDeviceGet
0.00%     292ns         1     292ns     292ns     292ns  cudaGetLastError

==8968== Unified Memory profiling result:
Device "GeForce GTX 1060 6GB (0)"
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
2  4.0000KB  4.0000KB  4.0000KB  8.000000KB  10.81400us  Host To Device
3  4.0000KB  4.0000KB  4.0000KB  12.00000KB  3.799000us  Device To Host


## Understanding

Scanning code on a website is all very well, but if your aim is understanding you'll do better if you experiment. Try changing the number of threads that the kernel runs with from n=4 to 0, 2, 3, 5, or 1280. Or remove an allocation, free, or synchronization call. Change B and C to square matrices. As a last resort read the documentation.

1. There is a very nice and simple walkthough here, and of course the documentation, but it's good to have alternatives. And this one has an easter egg. ↩︎

2. List of cards and supported CUDA versions here. Of course this too shall pass - if you notice it's outdated, let me know so I can warn others. ↩︎

3. Compilation details depend on your environment, but your CUDA installation will have instructions. ↩︎