Interoperability with CUDA

Although ArrayFire is quite extensive, there remain many cases in which you may want to write custom kernels in CUDA or OpenCL. For example, you may wish to add ArrayFire to an existing code base to increase your productivity, or you may need to supplement ArrayFire's functionality with your own custom implementation of specific algorithms.

ArrayFire manages its own memory, runs within its own CUDA stream, and assigns its own IDs to devices. As such, most of the interoperability functions focus on reducing potential synchronization conflicts between ArrayFire and CUDA.

Basics

It is fairly straightforward to interface ArrayFire with your own custom code. ArrayFire provides several functions to ease this process including:

Function                      Purpose
----------------------------  --------------------------------------------------------
Array::new_from_device_ptr    Construct an ArrayFire Array from device memory
Array::device_ptr             Obtain a pointer to the device memory (implies lock())
Array::lock                   Removes ArrayFire's control of a device memory pointer
Array::unlock                 Restores ArrayFire's control over a device memory pointer
get_device                    Gets the current ArrayFire device ID
set_device                    Switches ArrayFire to the specified device
get_device_native_id          Fetches the CUDA device ID for a given ArrayFire device ID
set_device_native_id          Switches the active device to the specified CUDA device ID
get_stream                    Gets the current CUDA stream used by ArrayFire
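To illustrate how the device-ID helpers relate, the sketch below maps the current ArrayFire device to its CUDA counterparts. It assumes the CUDA-specific helpers are in scope via the companion af-cuda-interop crate; the exact signatures follow the table above and the commented-out calls later in this chapter, and may differ in practice:

```rust
use arrayfire as af;
use af_cuda_interop as afcu; // assumed: companion crate providing the CUDA-specific helpers

fn main() {
    af::set_device(0); // select ArrayFire device 0

    let af_id = af::get_device(); // current ArrayFire device ID
    // Assumed signatures, mirroring the table above:
    let cuda_id = afcu::get_device_native_id(af_id); // CUDA device ID for this AF device
    let stream = afcu::get_stream(cuda_id); // CUDA stream ArrayFire uses on this device

    println!("AF id: {}, CUDA id: {}, stream: {:?}", af_id, cuda_id, stream);
}
```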

Using custom CUDA kernels in existing ArrayFire application

By default, ArrayFire manages its own memory and operates in its own CUDA stream. Thus there is a slight amount of bookkeeping that needs to be done in order to integrate your custom CUDA kernel.

Ideally, we recommend launching your custom kernels on ArrayFire's CUDA stream. However, this is currently not possible because RustaCUDA cannot wrap an existing cudaStream_t/CUstream_t object. The current workaround is to create a stream of your own and launch the kernel on it.

Note that since ArrayFire and your kernels do not share the same CUDA stream, you must synchronize explicitly before launching a kernel on your stream that depends on computation ArrayFire carried out earlier. This extra step will become unnecessary once the above-stated limitation of RustaCUDA's streams is removed.
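The explicit synchronization can be as simple as forcing evaluation and then blocking on ArrayFire's device queue. A minimal sketch (the array and setup here are placeholders; only `eval` and `sync` are the point):

```rust
use arrayfire as af;

fn main() {
    af::set_device(0);
    let a = af::constant(1.0f32, af::dim4!(10));
    let b = &a + &a; // queued on ArrayFire's internal CUDA stream

    b.eval();                   // make sure the computation is enqueued
    af::sync(af::get_device()); // block until ArrayFire's stream has finished

    // It is now safe to launch a custom kernel on a separate stream that
    // reads b's device memory (obtained via b.device_ptr()).
}
```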

This process is best illustrated with a fully worked example:

use arrayfire as af;
use rustacuda::prelude::*;
use rustacuda::*;

use std::ffi::CString;

fn main() {
    // MAKE SURE to do all RustaCUDA initialization before ArrayFire's first
    // API call. Some CUDA context state appears to get corrupted if CUDA
    // context setup (device, context, module, stream) is mixed with ArrayFire calls.
    match rustacuda::init(CudaFlags::empty()) {
        Ok(()) => {}
        Err(e) => panic!("rustacuda init failure: {:?}", e),
    }
    let device = match Device::get_device(0) {
        Ok(d) => d,
        Err(e) => panic!("Failed to get device: {:?}", e),
    };
    let _context =
        match Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device) {
            Ok(c) => c,
            Err(e) => panic!("Failed to create context: {:?}", e),
        };
    let ptx = CString::new(include_str!("./resources/add.ptx")).unwrap();
    let module = match Module::load_from_string(&ptx) {
        Ok(m) => m,
        Err(e) => panic!("Failed to load module from string: {:?}", e),
    };
    let stream = match Stream::new(StreamFlags::NON_BLOCKING, None) {
        Ok(s) => s,
        Err(e) => panic!("Failed to create stream: {:?}", e),
    };

    af::set_device(0);
    af::info();

    let num: i32 = 10;
    let x = af::constant(1f32, af::dim4!(10));
    let y = af::constant(2f32, af::dim4!(10));
    let out = af::constant(0f32, af::dim4!(10));

    af::af_print!("x", x);
    af::af_print!("y", y);
    af::af_print!("out(init)", out);

    //TODO Figure out how to use Stream returned by ArrayFire with Rustacuda
    // let af_id = get_device();
    // let cuda_id = get_device_native_id(af_id);
    // let af_cuda_stream = get_stream(cuda_id);

    //TODO Figure out how to use Stream returned by ArrayFire with Rustacuda
    // let stream = Stream {inner: mem::transmute(af_cuda_stream)};

    // Run the custom CUDA kernel on our own stream
    unsafe {
        // Obtain device pointers from ArrayFire using the Array::device_ptr() method
        let d_x: *mut f32 = x.device_ptr() as *mut f32;
        let d_y: *mut f32 = y.device_ptr() as *mut f32;
        let d_o: *mut f32 = out.device_ptr() as *mut f32;

        match launch!(module.sum<<<1, 1, 0, stream>>>(
            memory::DevicePointer::wrap(d_x),
            memory::DevicePointer::wrap(d_y),
            memory::DevicePointer::wrap(d_o),
            num
        )) {
            Ok(()) => {}
            Err(e) => panic!("Kernel Launch failure: {:?}", e),
        }

        // Wait for the kernel to finish, since kernel launches are asynchronous
        match stream.synchronize() {
            Ok(()) => {}
            Err(e) => panic!("Stream sync failure: {:?}", e),
        };

        // Return control of Array memory to ArrayFire using unlock
        x.unlock();
        y.unlock();
        out.unlock();
    }
    af::af_print!("sum after kernel launch", out);
}

Adding ArrayFire to existing CUDA Application

Adding ArrayFire to an existing application is slightly more involved and can be somewhat tricky due to several optimizations we implement. The most important are as follows:

  • ArrayFire assumes control of all memory provided to it.
  • ArrayFire does not (in general) support in-place memory transactions.

We will discuss the implications of these items below. To add ArrayFire to existing code you need to:

  1. Finish any pending CUDA operations (e.g. cudaDeviceSynchronize() or similar stream functions)
  2. Create ArrayFire arrays from existing CUDA pointers
  3. Perform operations on ArrayFire arrays
  4. Instruct ArrayFire to finish operations using eval and sync
  5. Obtain pointers to important memory
  6. Continue your CUDA application.
  7. Free non-managed memory

To create the Array from a device pointer, use one of the following approaches:

Using DeviceBuffer from RustaCUDA, or a Wrapper Object for CUDA device memory


# #![allow(unused_variables)]
#fn main() {
let mut buffer = memory::DeviceBuffer::from_slice(&v).unwrap();

let array_dptr = Array::new_from_device_ptr(
    buffer.as_device_ptr().as_raw_mut(), dim4!(10, 10));

array_dptr.lock(); // Needed to avoid free as arrayfire takes ownership
#}

Using raw pointer returned from cuda_malloc interface exposed by RustaCUDA


# #![allow(unused_variables)]
#fn main() {
let dptr: *mut f32 =
    unsafe { memory::cuda_malloc::<f32>(10 * 10).unwrap().as_raw_mut() };

let array_dptr = Array::new_from_device_ptr(dptr, dim4!(10, 10));
// After ArrayFire takes over ownership of the pointer, you can use other
// arrayfire functions as usual.
#}

ArrayFire's memory manager automatically assumes responsibility for any memory provided to it, so ArrayFire may free or reuse that memory at any later time. If this behavior is not desired, call Array::lock (as in the snippets above) and manage the memory yourself. If you do so, be careful not to free the memory while ArrayFire might still be using it!

The seven steps above are best illustrated using a fully-worked example:

use arrayfire::{af_print, dim4, info, set_device, Array};
use rustacuda::prelude::*;

fn main() {
    // MAKE SURE to do all RustaCUDA initialization before ArrayFire's first
    // API call. Some CUDA context state appears to get corrupted if CUDA
    // context setup (device, context, module, stream) is mixed with ArrayFire calls.
    match rustacuda::init(CudaFlags::empty()) {
        Ok(()) => {}
        Err(e) => panic!("rustacuda init failure: {:?}", e),
    }
    let device = match Device::get_device(0) {
        Ok(d) => d,
        Err(e) => panic!("Failed to get device: {:?}", e),
    };
    let _context =
        match Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device) {
            Ok(c) => c,
            Err(e) => panic!("Failed to create context: {:?}", e),
        };
    let stream = match Stream::new(StreamFlags::NON_BLOCKING, None) {
        Ok(s) => s,
        Err(e) => panic!("Failed to create stream: {:?}", e),
    };

    let mut in_x = DeviceBuffer::from_slice(&[1.0f32; 10]).unwrap();
    let mut in_y = DeviceBuffer::from_slice(&[2.0f32; 10]).unwrap();

    // wait for any prior kernels to finish before passing
    // the device pointers to ArrayFire
    match stream.synchronize() {
        Ok(()) => {}
        Err(e) => panic!("Stream sync failure: {:?}", e),
    };

    set_device(0);
    info();

    let x = Array::new_from_device_ptr(in_x.as_device_ptr().as_raw_mut(), dim4!(10));
    let y = Array::new_from_device_ptr(in_y.as_device_ptr().as_raw_mut(), dim4!(10));

    // Lock so that ArrayFire doesn't free pointers from RustaCUDA
    // But we have to make sure these pointers stay in valid scope
    // as long as the associated ArrayFire Array objects are valid
    x.lock();
    y.lock();

    af_print!("x", x);
    af_print!("y", y);

    let o = x + y;
    af_print!("out", o);

    let _o_dptr = unsafe { o.device_ptr() }; // Calls an implicit lock

    // Once the non-ArrayFire operations are done, the user must call unlock
    // to relinquish control of the memory back to ArrayFire.
    o.unlock(); // After this, there is no guarantee that value of o_dptr is valid
}
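If all you need afterwards is the result on the host, copying the data out with Array::host before unlocking sidesteps any device-pointer lifetime concerns. A minimal sketch, continuing from `o` in the example above:

```rust
// Sketch: copy the result of o = x + y back to host memory.
let mut host_out = vec![0.0f32; 10];
o.host(&mut host_out); // synchronous copy of o's device data into the host slice
assert_eq!(host_out[0], 3.0f32); // each element is 1.0 + 2.0
```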