Hello CUDA!
The Problem
If you want to run some simple CUDA code on your GPU, matrix multiplication is a good starting point^{1}. Here I'll show how to find the product of two matrices that is simple enough to verify by hand.
$$\begin{pmatrix}2 & 2 & 1 & 3\\ 2 & 2 & 3 & 3\\3 & 2 & 1 & 3\\2 & 1 & 3 & 2\end{pmatrix} \begin{pmatrix}1 \\ 4 \\ 9 \\ 16 \end{pmatrix}= ?$$
A Solution
Technology moves quickly  most examples of CUDA code involve explicit copying of memory from CPU to GPU and back again with cudaMalloc, cudaMemcpy and cudaFree. Explicit copying is no longer necessary since unified memory was introduced^{2}, but even new examples use the old ways. This example uses cudaMallocManaged.
The calculation code is fairly simple:
#include <iostream>
//Kernel function to find AB=C, the product of a square matrix A
//with dimensions n, and column vector B, storing the result in c.
__global__
void matrix_multiply(float* A, float* B, float* C, int n)
{
//Each of the four threads calculates a single entry in C
int index = threadIdx.x;
C[index] = 0.0;
for (int i = 0; i < n; ++i)
{
C[index] += A[i + index*n] * B[i];
}
}
int main()
{
constexpr int n = 4;
float* A;
float* B;
float* C;
// Allocate unified memory, accessible from CPU and GPU
cudaMallocManaged(&A, n * n * sizeof(float));
cudaMallocManaged(&B, n * sizeof(float));
cudaMallocManaged(&C, n * sizeof(float));
// initialize A and B arrays on the CPU
int i = 0;
for (float entry : {
2, 2, 1, 3,
2, 2, 3, 3,
3, 2, 1, 3,
2, 1, 3, 2})
{
A[i++] = entry;
}
B[0] = 1;
B[1] = 4;
B[2] = 9;
B[3] = 16;
//Run the kernel on the GPU
matrix_multiply<<<1, n>>>(A, B, C, n);
cudaDeviceSynchronize();
auto err = cudaGetLastError();
if (err != cudaSuccess)
{
std::cout << cudaGetErrorString(err) << "\n";
}
//Print C, the result of the matrix multiplication
std::cout << "Hello ";
for (int i = 0; i < n; ++i)
{
std::cout << (char)C[i];
}
std::cout << "!\n";
cudaFree(A);
cudaFree(B);
cudaFree(C);
}
In addition to the comments in the code, some things are worth explaining:
 The sizeof(float) is checked on the CPU, so you might wonder what would happen if the GPU has a different size. It can't  for all supported platforms, floats have 32 bits. You can also use 64bit doubles or 16bit halfprecision floating point numbers.
 The example uses floating point numbers, when the data is clearly all integers. I think that it makes the code a little easier to follow  all the matrix data and only the matrix data is floats. Do experiment with changing both the values and the data types.
The Output
Compile and execute it^{3} and the output should be pretty much what we expect from an introductory code listing:
Hello CUDA!
That is, treating the contents of the output matrix C as ascii printable characters, the code calculates:
$$\begin{pmatrix}2 & 2 & 1 & 3\\ 2 & 2 & 3 & 3\\3 & 2 & 1 & 3\\2 & 1 & 3 & 2\end{pmatrix} \begin{pmatrix}1 \\ 4 \\ 9 \\ 16 \end{pmatrix}= \begin{pmatrix}67 \\ 85 \\ 68 \\ 65 \end{pmatrix}= \begin{pmatrix}C \\ U \\ D \\ A \end{pmatrix}$$
Profiling
Although it obviously doesn't take very long, it's worth checking out the profiling tool nvprof for this trivial task:
==8968== NVPROF is profiling process 8968, command: cuda_test.exe
Hello CUDA!
==8968== Profiling application: cuda_test.exe
==8968== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.8800us 1 2.8800us 2.8800us 2.8800us matrix_multiply(float*, float*, float*, int)
API calls: 99.02% 143.40ms 3 47.799ms 16.950us 143.35ms cudaMallocManaged
0.44% 631.23us 94 6.7150us 0ns 374.94us cuDeviceGetAttribute
0.25% 368.22us 3 122.74us 22.210us 304.80us cudaFree
0.16% 235.54us 1 235.54us 235.54us 235.54us cuDeviceGetName
0.06% 88.840us 1 88.840us 88.840us 88.840us cudaLaunch
0.06% 83.580us 1 83.580us 83.580us 83.580us cudaDeviceSynchronize
0.00% 6.7220us 1 6.7220us 6.7220us 6.7220us cuDeviceTotalMem
0.00% 2.9230us 1 2.9230us 2.9230us 2.9230us cudaConfigureCall
0.00% 2.6300us 3 876ns 292ns 2.0460us cuDeviceGetCount
0.00% 1.4610us 4 365ns 292ns 584ns cudaSetupArgument
0.00% 876ns 2 438ns 0ns 876ns cuDeviceGet
0.00% 292ns 1 292ns 292ns 292ns cudaGetLastError
==8968== Unified Memory profiling result:
Device "GeForce GTX 1060 6GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 4.0000KB 4.0000KB 4.0000KB 8.000000KB 10.81400us Host To Device
3 4.0000KB 4.0000KB 4.0000KB 12.00000KB 3.799000us Device To Host
Understanding
Scanning code on a website is all very well, but if your aim is understanding you'll do better if you experiment. Try changing the number of threads that the kernel runs with from n=4 to 0, 2, 3, 5, or 1280. Or remove an allocation, free, or synchronization call. Change B and C to square matrices. As a last resort read the documentation.

There is a very nice and simple walkthough here, and of course the documentation, but it's good to have alternatives. And this one has an easter egg. ↩︎

List of cards and supported CUDA versions here. Of course this too shall pass  if you notice it's outdated, let me know so I can warn others. ↩︎

Compilation details depend on your environment, but your CUDA installation will have instructions. ↩︎
⁂