CUDA integration with C#

This article will focus on how to create an unmanaged dll with CUDA code and use it in a C# program. The example will show some differences between execution times of managed, unmanaged and new .NET 4 parallel versions of for() loops used to do computations on arrays.

I will show in brief how to configure CUDA environment and run example program. CUDA itself is out of scope for this article because of some great online documents, about which I have added links to. There will be a few words only regarding efficiency and execution speed of CUDA kernels and memory management.

CUDA enabled hardware and .NET 4 (Visual Studio 2010 IDE or C# Express 2010) is needed to successfully run the example code. Visual C++ Express 2008 has been used as a CUDA C editor (2010 version has changed custom build rules feature and cannot work with that provided by CUDA SDK for easy VS integration).

Part 1: Environment and tools configuration for CUDA

CUDA is a general purpose parallel computing architecture introduced by NVIDIA. CUDA programs (kernels) run on GPU instead of CPU for better performance (hundreds of cores that can collectively run thousands of computing threads). It comes with a software environment that allows developers to use C as a high-level programming language. This computation technology is used in mathematics, science, finance, modeling, image processing and so on.

Basic CUDA configuration for developing purpose:

  • Download and install CUDA toolkit for correct OS, use version 3.1 or latest 3.2RC or version 3.0 for device emulation (limited usage but works without CUDA enabled device, not supported after version 3.0) 
  • Download and install SDK for the same OS and toolkit version
  • Update display drivers if you encounter problems with SDK installation

Visual C++ Express 2008 (or VS2008) configuration:

1. Syntax coloring

  • Open Tools -> Options from main window, then Text Editor -> File Extension, add .cu and .cuh extensions with Microsoft Visual C++ Editor
  • Copy file usertype.dat from [sdk dir]\C\doc\syntax_highlighting\visual_studio_8\ to ProgramFiles\Microsoft Visual Studio 9.0\Common7\IDE\ folder
  • Restart Visual Studio

2. New project, 32bit Windows XP version

  • Default SDK location is c:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\
  • Default toolkit location is C:\CUDA\
  • Create empty Win32 console application and add source file with .cu extension
  • Select newly created project in the Solution Explorer window and then right mouse key  Custom Build Rules, use Find Existing button to locate Cuda.rules file in [sdk dir] \C\common\ folder, add it and mark on the list of available rule files
  • Select again project -> Properties, select Release Configuration and then from the tree view:
    - Configuration Properties -> Linker -> General -> Additional Library Directories:
    add this line (default folders) C:\CUDA\lib;"C:\Documents and Settings\All Users\Application
    Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\lib"
  • Configuration Properties -> Linker -> Input -> Additional Dependencies: cudart.lib
  • Use the same settings for Debug Configuration
  • For emulator configurations use Configuration Manager to add new configurations based on Release and Debug ones, name them as EMU-Release and EMU-Debug (or as you wish)
  • Select EMU-Release from Configuration combo-box:
    - Change cudart.lib to cudartemu.lib
    - From tree view select CUDA Build Rule v3.0.14 (or similar)  General and set Emulation Mode to Yes
  • Do the same for EMU-Debug

3. New project, 64bit Windows 7 version

  • Default SDK location is c:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\
  • Default toolkit location is C:\CUDA\
  • The rest is basically similar to win32 variant except that:
    - Use Configuration Manager to create new solution configurations AMD64_Release, AMD64_Debug and emulation versions if needed (copy settings from Release and Debug configurations)
    - Use Configuration Manager to add new solution platform x64 and copy setting from Win32
    - In Project contexts window choose platform x64 for all AMD64 configurations
    - Linker -> General -> Additional Library Directories: c:\CUDA\lib64; "c:\ProgramData\NVIDIACorporation\NVIDIA GPU Computing SDK\C\common\lib"
    - Linker -> Input - Additional Dependencies: cudart.lib

Important notes about example files:

You must do this basic configuration to be able to open attached example because of custom build rule entry in project solution file. VC++ 2008 will show errors if that rule won't be accessible due to wrong paths. You can also edit file cudalib.vcproj (*.vcproj) and fix path to cuda.rules file.

If this won't help then just create new project, manually add custom build rule as described before in configuration section and then copy source .cu file from the example to that new project.

You need also manually copy file cutil.h from \NVIDIA GPU Computing SDK\C\common\inc\ dir to C:\CUDA\include\ (it is easiest way to fix import paths) and add cutil32.lib (or cutil64.lib for 64bit OS) to linker input additional dependencies -> should be now like this: cudart.lib cutil32.lib (already done in example projects).

CUDA toolkit and SDK version 3.0 had been used to build dll part.

The main project part (written in C#) will work even without this steps because I added already built dll library to \bin\Debug and \bin\Release folders (you have to remember to replace it if you change anything in the dll part – this is not done automatically – at least in my example).

And you need VS 2010 version for C# project part because .NET 4 features has been used.

Part 2: CUDA DLL

The dll part code has been written in VC++ 2008 IDE. Assuming that IDE is configured properly create new Win32 Console Application, switch application type to "DLL" and mark "Empty project" (no need precompiled header and/or dllmain() function). Add new source file and save it with .cu extension. Syntax coloring should work. Remember to add CUDA custom build rule and linker dependencies.

To run CUDA kernel (function executed on the GPU device) we need some sort of wrapper function exposed outside the dll. Kernel setup and invocation will be inside that function. The nice feature is that we can pass kernel execution configuration parameters to it (grid size, block size and shared memory size) instead of set them to constants which gives us possibility to run benchmark on target machine to determine best values (and so on).

// cuda wrapper function
extern "C" int __declspec(dllexport) __stdcall SomeCalculationsCU
      float *a_h,  // pointer to input array
      const unsigned int N, // input array size
      const unsigned int M, // kernel M parameter
      const int cuBlockSize = 512, // kernel block size (max 512)
      const int showErrors = 1 // show CUDA errors in console window
      int tmp = PRINT_ERRORS;
      PRINT_ERRORS = showErrors;
      float *a_d; // pointer to device array
      size_t size = N * sizeof(float);
      int cuerr = 0; // no errors
      unsigned int timer = 0;
      cudaMalloc((void**)&a_d, size); // allocate array on device   
      cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
      int n_blocks = N / cuBlockSize + (N % cuBlockSize == 0 ? 0 : 1);
      cutCreateTimer(&timer); // from cutil.h
      some_calculations <<<n_blocks, cuBlockSize>>> (a_d, N, M);  // kernel invocation
      cudaThreadSynchronize(); // by default kernel runs in parallel with CPU code
      cuerr = checkCUDAError("cuda kernel");
      cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);  
      if(!cuerr) cuerr = checkCUDAError("cuda memcpy");
      sExecutionTime = cutGetTimerValue(timer);
      if(!cuerr) cuerr = checkCUDAError("cuda free");
      PRINT_ERRORS = tmp;
      return cuerr;

The most important in that function is extern "C" int _declspec(dllexport) _stdcall part which makes it visible outside dll. The calling convention (_stdcall here) must be specified because by default C functions are using _cdecl one and .NET platform invoke use CallingConvention.Winapi which is _stdcall. More info about Win32 calling conventions is here Generally important is to use the same convention for caller and callee functions and not use _stdcall with variadic ones.

This function could be also the main() function in console application project so you can modify it and add some print results section if you want or just add main() function like this one:

int main(void)
      float *a_h;
      const unsigned int N = 2000;
      const unsigned int M = 10;
      const int cublocks = 256;

      size_t size = N * sizeof(float);
      a_h = (float*)malloc(size);
      for(unsigned int i = 0; i < N; i++) a_h[i] = (float)i;
      SomeCalculationsCU(a_h, N, M, cublocks, 1);
      printf("exec time = %f ms\n", sExecutionTime);

To run this as an application (not build dll) change Configuration Type to Application (.exe) in project properties -> Configuration Properties -> General section. This also allows you to use CUDA Profiler (located in C:\CUDA\cudaprof\bin\ directory).

File has also example of variable available to access from the outside of dll:

// external variable example
extern "C" { float __declspec(dllexport) sExecutionTime = -1; }

// variable wrapper function
extern "C" float __declspec(dllexport) __stdcall GetExecutionTime()
      return sExecutionTime;

It is possible to access that kind of variable directly (will be shown in C# part) but this is much easier with wrapper functions approach (could works like get/set accessors).

Finally example kernel function:

// cuda kernel (internal)
__global__ void some_calculations(float *a, unsigned int N, unsigned int M)
      unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N)
            // note1: no need for shared memory here
            // note2: global memory access is coalesced
            //        (no structs, float only used)
            // do computations M times on each thread
            // to extend processor's time
            for(unsigned int i = 0; i < M; i++)
                  // some easy arithmetics           
                  a[idx] = a[idx] * a[idx] * 0.1 - a[idx] - 10;

And CPU version for execution time comparison:

extern "C" void __declspec(dllexport) __stdcall SomeCalculationsCPU
      float *a_h,
      const unsigned int N,
      const unsigned int M

      unsigned int timer = 0;
      for(unsigned int i = 0; i < N; i++)
            for(unsigned int j = 0; j < M; j++)
                  *(a_h + i) = *(a_h + i) * *(a_h + i) * 0.1 - *(a_h + i) - 10;
      sExecutionTime = cutGetTimerValue(timer);

Part 3: Using DLL in C# code

This is easy part with .NET platform invoke functionality however one thing is worth mention - accessing variables because we cannot use DllImport attribute on them. We have to find address of that variable and then marshal data.

using System.Runtime.InteropServices;

#region hard way to import variable from unmanaged dll
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
internal static extern IntPtr GetProcAddress(IntPtr hModule, string procName);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
internal static extern IntPtr LoadLibrary(string lpszLib);
static float ReadsExecutionTime()
    IntPtr hdl = LoadLibrary("cudalib.dll");
    if (hdl != IntPtr.Zero)
        IntPtr addr = GetProcAddress(hdl, "sExecutionTime");
        if (addr != IntPtr.Zero)
            //int value = Marshal.ReadInt32(addr);      // for integer types
            float[] managedArray = new float[1];        // single value
            Marshal.Copy(addr, managedArray, 0, 1);     // for other types
            return managedArray[0];
    return 0;
// easy way to import variable from unmanaged dll - make wrapper if you can
[DllImport("cudalib.dll", CharSet = CharSet.Ansi, SetLastError = true, CallingConvention = CallingConvention.StdCall)]
public static extern float GetExecutionTime();

Things are little bit easier with integer types because you may use Marshal class like in commented line above. For other types Marshal.Copy() method is a possible solution. I used one-element array of float values to retrieve my variable. Marshal.Copy() overloads gives you also possibility to copy data from managed source to unmanaged destination (the opposite direction).

As for other functions from dll, including CUDA calculations function:

[DllImport("cudalib.dll", CharSet = CharSet.Ansi, SetLastError = true, CallingConvention = CallingConvention.StdCall)]
public static extern int SomeCalculationsCU(float[] a_h, uint N, uint M, int cuBlockSize, int showErrors);
[DllImport("cudalib.dll", CharSet = CharSet.Ansi, SetLastError = true, CallingConvention = CallingConvention.StdCall)]
public static extern void SomeCalculationsCPU(float[] a_h, uint N, uint M);

Again, remember to set calling convention properly and that is all.

The example code shows also how to use Parallel.For() method (one of possible ways to do that). This is a .NET 4 feature. More information on parallel programming in the .NET framework here  and examples here

using System.Threading.Tasks;
using System.Diagnostics;
private static double ParallelForVersion(float[] farr3, uint N, uint M)
    Stopwatch stp = new Stopwatch();
    Parallel.For(0, N, i =>
        for (uint j = 0; j < M; j++)
            farr3[i] = farr3[i] * farr3[i] * 0.1f - farr3[i] - 10;
    return stp.Elapsed.TotalMilliseconds;

This code does exactly the same calculations as CUDA version however as you will see the calculation results are not exactly the same because of different error bounds for CPU and GPU arithmetic. There is also use_fast_math option for nvcc complier which force it to use less accurate but faster functions like __sinf(x) instead of sinf(x) not every function has its counterpart prefixed with __ (i.e. there is no fast double precision version of sin(x)).

Code example

The attached example shows execution times of easy computations on every array's element for CUDA (GPU), unmanaged native code (CPU), managed .NET code (Sequential) and .NET parallel versions of the same function.

The configuration parameters which you can change are size of the array, number of loops for each thread (more computations), CUDA block size and number of test rounds for average value.

Block size should be less than array size to utilize more multiprocessors. Good values are 256 or 512(max for most CUDA enabled devices). You can use CUDA profiler from CUDA SDK to check occupancy (should be 1 or close to 1). Values which are not multiply of 16 (half warp) will cause uncoalesced global memory operations and decrease in performance.

CUDA kernel here is very simple one without any thread synchronization and shared memory usage. In fact all threads run totally independent which produce great performance when there is a lot of computations for each array element.

You can check results with less loops – like 1 or 2 to see that even for big arrays (tasks) simple computations are faster on CPU because of time needed for copy operations. The usage of parallel for() is determined mostly by array size and is generally ineffective on small ones.

Here are results for my laptop with very old Nvidia GPU with compute capability 1.1:


Please see attachment with complete code for x86 and x64 platforms.

Similar Articles