ArrayFire is a high performance software library for parallel computing with an easy-to-use API. ArrayFire abstracts away much of the details of programming parallel architectures by providing a high-level container object, the Array, that represents data stored on a CPU, GPU, FPGA, or other type of accelerator. This abstraction permits developers to write massively parallel applications in a high-level language where they need not be concerned about low-level optimizations that are frequently required to achieve high throughput on most parallel architectures.

Supported data types

ArrayFire provides one generic container object, the Array, on which functions and mathematical operations are performed. The Array can represent one of many different basic data types:

  • F32 real single-precision (float)
  • C32 complex single-precision (cfloat)
  • F64 real double-precision (double)
  • C64 complex double-precision (cdouble)
  • B8 8-bit boolean values (bool)
  • S32 32-bit signed integer (int)
  • U32 32-bit unsigned integer (unsigned)
  • U8 8-bit unsigned values (unsigned char)
  • S64 64-bit signed integer (intl)
  • U64 64-bit unsigned integer (uintl)
  • S16 16-bit signed integer (short)
  • U16 16-bit unsigned integer (unsigned short)
  • F16 16-bit floating point number (half::f16)

Most of these data types are supported on all modern GPUs; however, some older devices may lack support for double precision arrays. In this case, a runtime error will be generated when the array is constructed.
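
If your application must run on such devices, you can guard double-precision allocations at runtime; a minimal sketch using is_double_available and get_device:

use arrayfire::{get_device, is_double_available, randu, Dim4};

// Only allocate an F64 array if the active device supports doubles.
if is_double_available(get_device()) {
    let _d = randu::<f64>(Dim4::new(&[3, 3, 1, 1]));
}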

If not specified, Arrays are created as single precision floating point numbers (F32).
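
For example, a small sketch showing that the element type follows the type parameter (get_type reports the runtime DType of an Array):

use arrayfire::{randu, DType, Dim4};

let a = randu::<f32>(Dim4::new(&[3, 3, 1, 1])); // F32 array
let b = randu::<f64>(Dim4::new(&[3, 3, 1, 1])); // F64 array
assert_eq!(a.get_type(), DType::F32);
assert_eq!(b.get_type(), DType::F64);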

Creating and populating an Array

ArrayFire Arrays represent memory stored on the device. As such, creating and populating an Array consumes memory on the device, which cannot be freed until the Array object goes out of scope. Since device memory allocation can be expensive, ArrayFire also includes a memory manager that reuses device memory whenever possible.
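
To observe the memory manager at work, you can query its allocation statistics; a minimal sketch using device_mem_info, which returns (allocated bytes, allocated buffers, locked bytes, locked buffers) for the active device:

use arrayfire::{device_mem_info, randu, Dim4};

let _a = randu::<f32>(Dim4::new(&[1024, 1024, 1, 1]));
let (abytes, abuffs, lbytes, lbuffs) = device_mem_info();
println!("{} bytes in {} buffers ({} bytes locked in {} buffers)",
         abytes, abuffs, lbytes, lbuffs);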

Arrays can be created using one of the array constructors. Below we show how to create an array with uninitialized values:

let garbage_vals = Array::<f32>::new_empty(Dim4::new(&[3, 1, 1, 1]));

However, uninitialized memory is likely not useful in your application. ArrayFire provides several convenience functions for creating arrays with pre-populated values, including constants, uniformly distributed random numbers, normally distributed random numbers, and the identity matrix:

// Create an array filled with the constant value 2.0 of type f32
// The type of the Array is inferred from the type of the constant argument
let cnst = constant(2.0f32, Dim4::new(&[5, 5, 1, 1]));
print(&cnst);
println!("Create a 5-by-3 matrix of random floats on the GPU");
let dims = Dim4::new(&[5, 3, 1, 1]);
let a = randu::<f32>(dims);
print(&a);

As stated above, the default data type for arrays is F32 (a 32-bit floating point number) unless specified otherwise.

ArrayFire Arrays may also be populated from data found on the host. For example:

let values: [u32; 3] = [1u32, 2, 3];
let indices = Array::new(&values, Dim4::new(&[3, 1, 1, 1]));
print(&indices);

Properties of an Array

ArrayFire provides several functions to inspect arrays. These include functions to print the contents, query the dimensions, and determine various other properties of arrays.

The print function can be used to print arrays that have already been generated or any expression involving arrays:

let values: [f32; 3] = [1.0, 2.0, 3.0];
let indices = Array::new(&values, Dim4::new(&[3, 1, 1, 1]));
print(&indices);

The dimensions of an array may be determined using either a Dim4 object or by accessing the dimensions directly using the Dim4::get and Dim4::ndims functions:

let values: [f32; 3] = [1.0, 2.0, 3.0];
let dims: Dim4 = Dim4::new(&[3, 1, 1, 1]);
let indices = Array::new(&values, dims);
println!("Dims {:?} with dimensions {}", dims.get(), dims.ndims());

In addition to dimensions, Array objects also provide methods to determine the underlying type and size (in bytes). You can even determine whether the array is empty, real/complex, a row/column, or a scalar/vector. For further information on these capabilities, we suggest you consult the full documentation on the Array.
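
For instance, a brief sketch exercising a few of these queries:

use arrayfire::{randu, Dim4};

let a = randu::<f32>(Dim4::new(&[3, 1, 1, 1]));
println!("dims: {}", a.dims());             // [3 1 1 1]
println!("elements: {}", a.elements());     // 3
println!("is vector: {}", a.is_vector());   // true
println!("is complex: {}", a.is_complex()); // false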

Writing math expressions using ArrayFire

ArrayFire features an intelligent Just-In-Time (JIT) compilation engine that converts expressions using arrays into the smallest number of CUDA/OpenCL kernels. For most operations on Arrays, ArrayFire functions like a vector library. That means that an element-wise operation, like c[i] = a[i] + b[i] in C, would be written more concisely without indexing, like c = a + b. When there are multiple expressions involving arrays, ArrayFire's JIT engine will merge them together. This "kernel fusion" technology not only decreases the number of kernel calls but, more importantly, avoids extraneous global memory operations.

Our JIT functionality extends across the C API boundary and ends only when a non-JIT function is encountered or a synchronization operation is explicitly called by the code.
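
As a sketch of how this plays out, the expression below stays un-evaluated until eval (or a function that needs the result) forces the fused kernel to run:

use arrayfire::{randu, Dim4};

let a = randu::<f32>(Dim4::new(&[1000, 1, 1, 1]));
let b = randu::<f32>(Dim4::new(&[1000, 1, 1, 1]));
// Both element-wise operations below are fused into a single kernel by the JIT.
let c = &a * &b + 10.0f32;
// Force evaluation of the fused expression tree now rather than lazily later.
c.eval();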

ArrayFire provides hundreds of functions for element-wise operations. All of the standard operators (e.g. +,-,*,/) are supported as are most transcendental functions (sin, cos, log, sqrt, etc.). Here are a few examples:

let num_rows: u64 = 5;
let num_cols: u64 = 3;
let dims = Dim4::new(&[num_rows, num_cols, 1, 1]);
let a = randu::<f32>(dims);
let b = randu::<f32>(dims);
print(&a);
print(&b);
let c = a + b;
print(&c);

//Example of *Assign traits
let mut d = randu::<f32>(dims);
let e     = constant(1f32, dims);
d += e;
print(&d);

Indexing

Like all functions in ArrayFire, indexing is also executed in parallel on the OpenCL/CUDA device. To index Arrays you may use one or a combination of the following functions:

  • index() and index_gen() to create views of an Array
  • assign_seq() and assign_gen() to modify sub-regions of an Array
  • row(), rows(), col(), cols() to extract particular rows/columns
  • set_row(), set_rows(), set_col(), set_cols() to set particular rows/columns
  • the view! and eval! macros as terser equivalents of the above

Please see the indexing page for several examples of how to use these functions.

Access to Array memory on the host

Memory in af::Array objects may be accessed using the Array::host() method. The host method copies data from the device into a caller-provided slice (or similar container) on the host. As such, it is up to the developer to manage any memory passed to host.
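
A minimal sketch of copying an Array back to the host (host_data is just an illustrative name):

use arrayfire::{constant, Dim4};

let a = constant(1.5f32, Dim4::new(&[4, 1, 1, 1]));
let mut host_data = vec![0.0f32; a.elements()];
a.host(&mut host_data); // copies device data into the host Vec
println!("{:?}", host_data);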

Bitwise operators

In addition to supporting standard mathematical functions, Arrays that contain integer data types also support bitwise operators, including AND, OR, and bit shifts. Operator traits for Array, as well as separate functions, are defined to support various use cases.

let dims = Dim4::new(&[5, 3, 1, 1]);
let a = randu::<bool>(dims);
let b = randu::<bool>(dims);

print(&a);
print(&b);

let c = &a | &b; //Borrowing to avoid move of a and b, a | b is also valid
let d = bitand(&a, &b, false);

print(&c);
print(&d);

Where to go for help?

If you run into issues or have questions, consult the ArrayFire documentation and the arrayfire-rust GitHub issue tracker.

Vectorization

Programmers and data scientists want to take advantage of fast, parallel computational devices. Writing vectorized code is necessary to get the best performance out of the current generation of parallel hardware and scientific computing software. However, writing vectorized code may not be immediately intuitive. ArrayFire provides many ways to vectorize a given code segment. In this chapter, we present several methods to vectorize code using ArrayFire and discuss the benefits and drawbacks associated with each method.

Generic/Default vectorization

By its very nature, ArrayFire is a vectorized library. Most functions operate on Arrays as a whole, i.e. on all elements in parallel. For example, consider the following code:

let mut a = range::<f32>(Dim4::new(&[10, 1, 1, 1]), 0); // [0,  9]
a = a + 1f32;                                           // [1, 10]

This code will result in a single kernel that operates on all 10 elements of a in parallel.

A small subset of such vectorized ArrayFire functions are given below for quick reference:

Operator Category                  Functions
Arithmetic operations              +, -, *, /, %, >>, <<
Logical operations                 &&, ||, <, >, ==, !=, etc.
Numeric functions                  abs, floor, round, min, max, etc.
Complex operations                 real, imag, conjg, etc.
Exponential and logarithmic fns    exp, log, expm1, log1p, etc.
Trigonometric functions            sin, cos, tan, etc.
Hyperbolic functions               sinh, cosh, tanh, etc.

In addition to element-wise operations, many other functions are also vectorized in ArrayFire.

Note that even functions that perform some form of aggregation (e.g. sum or min), signal processing (like convolve), and image processing (e.g. rotate) all support vectorization over different columns or images, as sketched below.
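
For instance, a reduction such as sum vectorizes naturally over the columns of a matrix; a minimal sketch:

use arrayfire::{print, randu, sum, Dim4};

let a = randu::<f32>(Dim4::new(&[4, 3, 1, 1]));
// One sum per column, all computed in parallel on the device.
let col_sums = sum(&a, 0);
print(&col_sums); // [1 3 1 1]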

For example, if we have NUM images of size WIDTH x HEIGHT, one could convolve each image in a vectorized fashion as follows:

// A 3x3 smoothing kernel; WIDTH, HEIGHT, and NUM are assumed to be
// defined elsewhere as u64 constants.
let g_coef: [f32; 9] = [1.0, 2.0, 1.0, 2.0, 4.0, 2.0, 1.0, 2.0, 1.0];

let f = Array::new(&g_coef, Dim4::new(&[3, 3, 1, 1]));
let filter = f * (1.0f32 / 16.0);

let signal = randu::<f32>(Dim4::new(&[WIDTH, HEIGHT, NUM, 1]));
let conv   = convolve2(&signal, &filter, ConvMode::DEFAULT, ConvDomain::AUTO);

Similarly, one can rotate 100 images by 45 degrees in a single call using code like the following:

// Construct an array of 100 WIDTH x HEIGHT images of random numbers
let imgs = randu::<f32>(Dim4::new(&[WIDTH, HEIGHT, 100, 1]));

// Rotate all of the images in a single command; note that rotate takes radians
let rot_imgs = rotate(&imgs, 45.0f64.to_radians(), false, InterpType::LINEAR);

Although most functions in ArrayFire support vectorization, some do not. The most notable exceptions are the linear algebra functions. Even though these operations are not vectorized, they still execute in parallel on your hardware.
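
For contrast, a dense matrix multiply runs as a single parallel GEMM on the device but is not vectorized over batches; a minimal sketch:

use arrayfire::{matmul, randu, Dim4, MatProp};

let a = randu::<f32>(Dim4::new(&[512, 512, 1, 1]));
let b = randu::<f32>(Dim4::new(&[512, 512, 1, 1]));
// Parallel on the device, but one matrix product per call.
let c = matmul(&a, &b, MatProp::NONE, MatProp::NONE);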

Using the built-in vectorized operations should be the first and preferred method of vectorizing any code written with ArrayFire.

GFOR

This construct is similar to the gfor loop from ArrayFire's C++ API. It has not yet been implemented in the Rust wrapper. This section will be updated once the feature has been added to the crate.

batch_func

This is another pending feature, similar to batchFunc() from ArrayFire's C++ API.

Array and Matrix Manipulation

ArrayFire provides several different methods for manipulating arrays and matrices. The functionality includes:

  • moddims() - change the dimensions of an array without changing the data
  • flat() - flatten an array to one dimension
  • flip() - flip an array along a dimension
  • join() - join up to 4 arrays
  • reorder() - changes the dimension order within the array
  • shift() - shifts data along a dimension
  • tile() - repeats an array along a dimension
  • transpose() - performs a matrix transpose

Below we provide several examples of these functions and their use.

moddims()

The moddims function changes the dimensions of an array without changing its data or order. Note that this function modifies only the metadata associated with the array; it does not modify the content of the array. Here is an example of moddims() converting an 8x1 array into a 2x4 array and then back to an 8x1:

a [8 1 1 1]
    1.0000
    2.0000
    1.0000
    2.0000
    1.0000
    2.0000
    1.0000
    2.0000

let new_dims = Dim4::new(&[2, 4, 1, 1]);
moddims(&a, new_dims)
[2 4 1 1]
    1.0000     1.0000     1.0000     1.0000
    2.0000     2.0000     2.0000     2.0000

let out = moddims(&a, Dim4::new(&[a.elements() as u64, 1, 1, 1]));
[8 1 1 1]
    1.0000
    2.0000
    1.0000
    2.0000
    1.0000
    2.0000
    1.0000
    2.0000
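
Putting it together, a minimal runnable sketch of the round trip shown above:

use arrayfire::{moddims, print, Array, Dim4};

let values: [f32; 8] = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0];
let a = Array::new(&values, Dim4::new(&[8, 1, 1, 1]));
let b = moddims(&a, Dim4::new(&[2, 4, 1, 1]));                   // 8x1 -> 2x4
let c = moddims(&b, Dim4::new(&[b.elements() as u64, 1, 1, 1])); // back to 8x1
print(&c);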

flat()

The flat function flattens an array to one dimension:

a [3 3 1 1]
    1.0000     4.0000     7.0000
    2.0000     5.0000     8.0000
    3.0000     6.0000     9.0000

flat(&a)
[9 1 1 1]
    1.0000
    2.0000
    3.0000
    4.0000
    5.0000
    6.0000
    7.0000
    8.0000
    9.0000

flip()

The flip function flips the contents of an array along a chosen dimension. In the example below, we show a 5x2 array flipped along the zeroth (i.e. within a column) and first (i.e. within a row) axes:

a [5 2 1 1]
    1.0000     6.0000
    2.0000     7.0000
    3.0000     8.0000
    4.0000     9.0000
    5.0000    10.0000

flip(&a, 0) [5 2 1 1]
    5.0000    10.0000
    4.0000     9.0000
    3.0000     8.0000
    2.0000     7.0000
    1.0000     6.0000

flip(&a, 1) [5 2 1 1]
    6.0000     1.0000
    7.0000     2.0000
    8.0000     3.0000
    9.0000     4.0000
   10.0000     5.0000

join()

The join and join_many functions can be used to join arrays along a specific dimension.

Here is an example of how to join an array to itself:

a [5 1 1 1]
    1.0000
    2.0000
    3.0000
    4.0000
    5.0000

join(0, &a, &a) [10 1 1 1]
    1.0000
    2.0000
    3.0000
    4.0000
    5.0000
    1.0000
    2.0000
    3.0000
    4.0000
    5.0000

join(1, &a, &a) [5 2 1 1]
    1.0000     1.0000
    2.0000     2.0000
    3.0000     3.0000
    4.0000     4.0000
    5.0000     5.0000
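
A minimal runnable sketch of the calls above:

use arrayfire::{join, print, Array, Dim4};

let values: [f32; 5] = [1.0, 2.0, 3.0, 4.0, 5.0];
let a = Array::new(&values, Dim4::new(&[5, 1, 1, 1]));
let vertical = join(0, &a, &a);   // [10 1 1 1]
let horizontal = join(1, &a, &a); // [5 2 1 1]
print(&vertical);
print(&horizontal);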

reorder()

The reorder function modifies the order of data within an array by exchanging data according to the change in dimensionality. The linear ordering of data within the array is preserved.

a [2 2 3 1]
    1.0000     3.0000
    2.0000     4.0000

    1.0000     3.0000
    2.0000     4.0000

    1.0000     3.0000
    2.0000     4.0000


reorder(&a, 1, 0, 2)
[2 2 3 1]  //equivalent to a transpose
    1.0000     2.0000
    3.0000     4.0000

    1.0000     2.0000
    3.0000     4.0000

    1.0000     2.0000
    3.0000     4.0000


reorder(&a, 2, 0, 1)
[3 2 2 1]
    1.0000     2.0000
    1.0000     2.0000
    1.0000     2.0000

    3.0000     4.0000
    3.0000     4.0000
    3.0000     4.0000

shift()

The shift function shifts data in a circular buffer fashion along a chosen dimension. Consider the following example:

a [3 5 1 1]
    0.0000     0.0000     0.0000     0.0000     0.0000
    3.0000     4.0000     5.0000     1.0000     2.0000
    3.0000     4.0000     5.0000     1.0000     2.0000

shift(&a, &[0, 2, 0, 0])
[3 5 1 1]
    0.0000     0.0000     0.0000     0.0000     0.0000
    1.0000     2.0000     3.0000     4.0000     5.0000
    1.0000     2.0000     3.0000     4.0000     5.0000

shift(&a, &[-1, 2, 0, 0])
[3 5 1 1]
    1.0000     2.0000     3.0000     4.0000     5.0000
    1.0000     2.0000     3.0000     4.0000     5.0000
    0.0000     0.0000     0.0000     0.0000     0.0000

tile()

The tile function repeats an array along the specified dimension(s). For example, below we show how to tile an array along the zeroth and first dimensions:

a [3 1 1 1]
    1.0000
    2.0000
    3.0000

// Repeat array a twice in the zeroth dimension
tile(&a, Dim4::new(&[2, 1, 1, 1]))
[6 1 1 1]
    1.0000
    2.0000
    3.0000
    1.0000
    2.0000
    3.0000

// Repeat array a twice along both the zeroth and first dimensions
tile(&a, Dim4::new(&[2, 2, 1, 1]))
[6 2 1 1]
    1.0000     1.0000
    2.0000     2.0000
    3.0000     3.0000
    1.0000     1.0000
    2.0000     2.0000
    3.0000     3.0000

// Repeat array a twice along the first and three times along the second
// dimension.
let tile_dims = Dim4::new(&[1, 2, 3, 1]);
tile(&a, tile_dims) [3 2 3 1]
    1.0000     1.0000
    2.0000     2.0000
    3.0000     3.0000

    1.0000     1.0000
    2.0000     2.0000
    3.0000     3.0000

    1.0000     1.0000
    2.0000     2.0000
    3.0000     3.0000

transpose()

The transpose function performs a standard matrix transpose. The input array must have the dimensions of a 2D-matrix.

a [3 3 1 1]
    1.0000     3.0000     3.0000
    2.0000     1.0000     3.0000
    2.0000     2.0000     1.0000

transpose(&a, false) // second parameter enables the conjugate transpose
[3 3 1 1]
    1.0000     2.0000     2.0000
    3.0000     1.0000     2.0000
    3.0000     3.0000     1.0000

Combining functions to enumerate grid coordinates

By using a combination of these functions, one can quickly code complex manipulation patterns with a few lines of code. For example, consider generating (x,y) coordinates for a grid where each axis runs from 0 to n-1. Instead of using several loops to populate our arrays, we can just use a small combination of the above functions.

let a      = iota::<u32>(Dim4::new(&[3, 1, 1, 1]),
                         Dim4::new(&[1, 3, 1, 1]));
let b      = transpose(&a, false);
let coords = join(1, &flat(&a), &flat(&b));
print(&coords);

The output for a [3 3 1 1] matrix will be the following.

[9 2 1 1]
         0          0
         1          0
         2          0
         0          1
         1          1
         2          1
         0          2
         1          2
         2          2

Indexing

Indexing in ArrayFire is a powerful but easy to abuse feature. This feature allows you to reference or copy subsections of a larger array and perform operations on only a subset of elements.

This chapter is split into the following sections:

  • Using Seq objects
  • Using Array and Seq combination
  • Extract or Set rows/columns of an Array
  • Negative Indices

The Indexer structure is the key element used in the Rust wrapper of ArrayFire for creating references to existing Arrays. The sections below illustrate how it can be used in conjunction with Seq and/or Array. Each section also showcases a macro-based equivalent (where one exists) that is terser in syntax but offers the same functionality.

Using Seq objects

Create a view of an existing Array

We will use Seq objects and the index function in this approach.

        let dims = Dim4::new(&[5, 5, 1, 1]);
        let a = randu::<f32>(dims);
        //af_print!("a", a);
        //a
        //[5 5 1 1]
        //    0.3990     0.5160     0.8831     0.9107     0.6688
        //    0.6720     0.3932     0.0621     0.9159     0.8434
        //    0.5339     0.2706     0.7089     0.0231     0.1328
        //    0.1386     0.9455     0.9434     0.2330     0.2657
        //    0.7353     0.1587     0.1227     0.2220     0.2299

        // Index array using sequences
        let seqs = &[Seq::new(1u32, 3, 1), Seq::default()];
        let _sub = index(&a, seqs);
        //af_print!("a(seq(1,3,1), span)", sub);
        // [3 5 1 1]
        //     0.6720     0.3932     0.0621     0.9159     0.8434
        //     0.5339     0.2706     0.7089     0.0231     0.1328
        //     0.1386     0.9455     0.9434     0.2330     0.2657

However, the same code can be condensed into a much terser syntax with the help of the view macro. Take a look at the following two approaches using the view macro.

        let dims = dim4!(5, 5, 1, 1);
        let a = randu::<f32>(dims);
        let first3 = seq!(1:3:1);
        let allindim2 = seq!();
        let _sub = view!(a[first3, allindim2]);
OR
        let a = randu::<f32>(dim4!(5, 5));
        let _sub = view!(a[1:3:1, 1:1:0]); // 1:1:0 means all elements along axis

Modify a sub region of an existing Array

Let us take a look at an example where a portion of an existing Array is set using another Array. We will use a constant-value Array and the assign_seq function in the example below.

        let mut a = constant(2.0 as f32, dim4!(5, 3));
        //print(&a);
        // 2.0 2.0 2.0
        // 2.0 2.0 2.0
        // 2.0 2.0 2.0
        // 2.0 2.0 2.0
        // 2.0 2.0 2.0

        let b = constant(1.0 as f32, dim4!(3, 3));
        let seqs = [seq!(1:3:1), seq!()];
        assign_seq(&mut a, &seqs, &b);
        //print(&a);
        // 2.0 2.0 2.0
        // 1.0 1.0 1.0
        // 1.0 1.0 1.0
        // 1.0 1.0 1.0
        // 2.0 2.0 2.0

A much terser way of doing the same using the eval! macro is shown below:

        let mut a = randu::<f32>(dim4!(5, 5));
        let b = randu::<f32>(dim4!(2, 2));
        eval!(a[1:2:1, 1:2:1] = b);

NOTE Normally you want to avoid accessing individual elements of the array like this for performance reasons.

Using Array and Seq combination

Create a view of an existing Array

To use a combination of Array and Seq objects to index an existing Array, we need the more generalized function index_gen.

        let values: [f32; 3] = [1.0, 2.0, 3.0];
        let indices = Array::new(&values, Dim4::new(&[3, 1, 1, 1]));
        let seq4gen = Seq::new(0.0, 2.0, 1.0);
        let a = randu::<f32>(Dim4::new(&[5, 3, 1, 1]));
        // [5 3 1 1]
        //     0.0000     0.2190     0.3835
        //     0.1315     0.0470     0.5194
        //     0.7556     0.6789     0.8310
        //     0.4587     0.6793     0.0346
        //     0.5328     0.9347     0.0535

        let mut idxrs = Indexer::default();
        idxrs.set_index(&indices, 0, None); // 2nd arg is indexing dimension
        idxrs.set_index(&seq4gen, 1, Some(false)); // 3rd arg indicates batch operation

        let _sub2 = index_gen(&a, idxrs);
        //println!("a(indices, seq(0, 2, 1))"); print(&sub2);
        // [3 3 1 1]
        //     0.1315     0.0470     0.5194
        //     0.7556     0.6789     0.8310
        //     0.4587     0.6793     0.0346

Similar to how the view macro helped abbreviate the syntax when indexing with just sequences, it can also help when using a combination of Seq and Array.

        let values: [f32; 3] = [1.0, 2.0, 3.0];
        let indices = Array::new(&values, Dim4::new(&[3, 1, 1, 1]));
        let seq4gen = seq!(0:2:1);
        let a = randu::<f32>(Dim4::new(&[5, 3, 1, 1]));
        let _sub2 = view!(a[indices, seq4gen]);

Modify a sub region of an existing Array

Set a portion of an existing Array using a combination of Seq and Array. We will use the assign_gen function to do it.

       let values: [f32; 3] = [1.0, 2.0, 3.0];
       let indices = Array::new(&values, dim4!(3, 1, 1, 1));
       let seq4gen = seq!(0:2:1);
       let mut a = randu::<f32>(dim4!(5, 3, 1, 1));
       // [5 3 1 1]
       //     0.0000     0.2190     0.3835
       //     0.1315     0.0470     0.5194
       //     0.7556     0.6789     0.8310
       //     0.4587     0.6793     0.0346
       //     0.5328     0.9347     0.0535

       let b = constant(2.0 as f32, dim4!(3, 3, 1, 1));

       let mut idxrs = Indexer::default();
       idxrs.set_index(&indices, 0, None); // 2nd arg is indexing dimension
       idxrs.set_index(&seq4gen, 1, Some(false)); // 3rd arg indicates batch operation

       let _sub2 = assign_gen(&mut a, &idxrs, &b);
       //println!("a(indices, seq(0, 2, 1))"); print(&sub2);
       // [5 3 1 1]
       //     0.0000     0.2190     0.3835
       //     2.0000     2.0000     2.0000
       //     2.0000     2.0000     2.0000
       //     2.0000     2.0000     2.0000
       //     0.5328     0.9347     0.0535
OR
       let values: [f32; 3] = [1.0, 2.0, 3.0];
       let indices = Array::new(&values, dim4!(3));
       let seq4gen = seq!(0:2:1);
       let mut a = randu::<f32>(dim4!(5, 3));

       let b = constant(2.0 as f32, dim4!(3, 3));

       eval!(a[indices, seq4gen] = b);

Extract or Set rows/columns of an Array

Extract a specific set of rows/columns from an existing Array.

        let a = randu::<f32>(dim4!(5, 5, 1, 1));
        //print(&a);
        // [5 5 1 1]
        //     0.6010     0.5497     0.1583     0.3636     0.6755
        //     0.0278     0.2864     0.3712     0.4165     0.6105
        //     0.9806     0.3410     0.3543     0.5814     0.5232
        //     0.2126     0.7509     0.6450     0.8962     0.5567
        //     0.0655     0.4105     0.9675     0.3712     0.7896
        let _r = row(&a, 4);
        // [1 5 1 1]
        //     0.0655     0.4105     0.9675     0.3712     0.7896
        let _c = col(&a, 4);
        // [5 1 1 1]
        //     0.6755
        //     0.6105
        //     0.5232
        //     0.5567
        //     0.7896

You can also use rows & cols to retrieve a subset of rows or columns respectively.

Similarly, set_row & set_rows can be used to change the values in a particular set of rows using another Array. set_col & set_cols have the same functionality, except that they operate on columns.
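
For example, a small sketch using rows and cols (first/last indices are inclusive):

use arrayfire::{cols, dim4, randu, rows};

let a = randu::<f32>(dim4!(5, 5));
let _top_two = rows(&a, 0, 1);  // rows 0 and 1     -> [2 5 1 1]
let _mid_cols = cols(&a, 1, 3); // columns 1 to 3   -> [5 3 1 1]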

Negative Indices

Negative indices can also be used to refer to elements from the end of a given axis. A negative value for a row/column/slice index fetches the corresponding row/column/slice counting backwards from the end. Given below are some examples that showcase getting row(s)/col(s) from an existing Array.

        let a = randu::<f32>(dim4!(5, 5));
        // [5 5 1 1]
        //     0.6010     0.5497     0.1583     0.3636     0.6755
        //     0.0278     0.2864     0.3712     0.4165     0.6105
        //     0.9806     0.3410     0.3543     0.5814     0.5232
        //     0.2126     0.7509     0.6450     0.8962     0.5567
        //     0.0655     0.4105     0.9675     0.3712     0.7896
        let _r = row(&a, -1);
        // [1 5 1 1]
        //     0.0655     0.4105     0.9675     0.3712     0.7896
        let _c = col(&a, -1);
        // [5 1 1 1]
        //     0.6755
        //     0.6105
        //     0.5232
        //     0.5567
        //     0.7896
        let a = randu::<f32>(dim4!(5, 5));
        // [5 5 1 1]
        //     0.6010     0.5497     0.1583     0.3636     0.6755
        //     0.0278     0.2864     0.3712     0.4165     0.6105
        //     0.9806     0.3410     0.3543     0.5814     0.5232
        //     0.2126     0.7509     0.6450     0.8962     0.5567
        //     0.0655     0.4105     0.9675     0.3712     0.7896
        let _r = rows(&a, -1, -2);
        // [2 5 1 1]
        //     0.2126     0.7509     0.6450     0.8962     0.5567
        //     0.0655     0.4105     0.9675     0.3712     0.7896
        let _c = cols(&a, -1, -3);
        // [5 3 1 1]
        //     0.1583     0.3636     0.6755
        //     0.3712     0.4165     0.6105
        //     0.3543     0.5814     0.5232
        //     0.6450     0.8962     0.5567
        //     0.9675     0.3712     0.7896

Configuring ArrayFire Environment

The following is a list of environment variables and runtime configurations that will help enhance your experience with ArrayFire.

AF_PATH

This is the path where ArrayFire is installed, i.e. the directory in which the include and lib folders are present. You can use this variable to add include paths and libraries to your projects.

AF_PRINT_ERRORS

When AF_PRINT_ERRORS is set to 1, the exceptions thrown are more verbose and detailed. This helps in locating the exact failure.

AF_PRINT_ERRORS=1 ./myprogram

AF_CUDA_DEFAULT_DEVICE

Use this variable to set the default CUDA device. Valid values for this variable are the device identifiers shown when info is run.

AF_CUDA_DEFAULT_DEVICE=1 ./myprogram

Note: a set_device call in the source code will take precedence over this variable.

AF_OPENCL_DEFAULT_DEVICE

Use this variable to set the default OpenCL device. Valid values for this variable are the device identifiers shown when info is run.

AF_OPENCL_DEFAULT_DEVICE=1 ./myprogram

Note: a set_device call in the source code will take precedence over this variable.

AF_OPENCL_DEFAULT_DEVICE_TYPE

Use this variable to set the default OpenCL device type. Valid values for this variable are: CPU, GPU, ACC (Accelerators). When set, the first device of the specified type is chosen as default device.

AF_OPENCL_DEFAULT_DEVICE_TYPE=CPU ./myprogram

Note: AF_OPENCL_DEFAULT_DEVICE and set_device take precedence over this variable.

AF_OPENCL_DEVICE_TYPE

Use this variable to only choose OpenCL devices of specified type. Valid values for this variable are:

  • ALL: All OpenCL devices. (Default behavior).
  • CPU: CPU devices only.
  • GPU: GPU devices only.
  • ACC: Accelerator devices only.

When set, the remaining OpenCL device types are ignored by the OpenCL backend.

AF_OPENCL_DEVICE_TYPE=CPU ./myprogram

AF_OPENCL_CPU_OFFLOAD

When ArrayFire runs on devices with memory unified with the host (i.e. CL_DEVICE_HOST_UNIFIED_MEMORY is true for the device), certain functions are offloaded to run on the CPU using mapped buffers.

ArrayFire takes advantage of fast libraries such as MKL while spending no time copying memory from device to host. The device memory is mapped to a host pointer which can be used in the offloaded functions.

This functionality can be disabled by using the environment variable AF_OPENCL_CPU_OFFLOAD=0.

The default behavior of this has changed in version 3.4. Prior to v3.4, CPU Offload functionality was used only when the user set AF_OPENCL_CPU_OFFLOAD=1 and was disabled otherwise. From v3.4 onwards, CPU Offload is enabled by default and is disabled only when AF_OPENCL_CPU_OFFLOAD=0 is set.

AF_OPENCL_SHOW_BUILD_INFO

This variable is useful when debugging OpenCL kernel compilation failures. When this variable is set to 1 and an error occurs during an OpenCL kernel compilation, the log and kernel are printed to screen.

AF_DISABLE_GRAPHICS

Setting this variable to 1 will disable window creation when graphics functions are being called. Disabling window creation will disable all other graphics calls at runtime. This is useful when running code on servers and systems without displays. When graphics calls are run on such machines, they will print a warning about window creation failing. To suppress those warnings, set this variable.

AF_SYNCHRONOUS_CALLS

When this environment variable is set to 1, ArrayFire will execute all functions synchronously.
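
For example:

AF_SYNCHRONOUS_CALLS=1 ./myprogram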

AF_SHOW_LOAD_PATH

When using the Unified backend, if this variable is set to 1, it will show the path where the ArrayFire backend libraries are loaded from.

If the libraries are loaded from system paths, such as PATH or LD_LIBRARY_PATH, it will print "system path". If the libraries are loaded from other paths, those paths are shown in full.

AF_MEM_DEBUG

When AF_MEM_DEBUG is set to 1 (or anything not equal to 0), the caching mechanism in the memory manager is disabled. The device buffers are allocated using native functions as needed and freed when going out of scope. When the environment variable is not set, it is treated as zero, i.e. caching remains enabled.

AF_MEM_DEBUG=1 ./myprogram

AF_MAX_BUFFERS

When AF_MAX_BUFFERS is set, this environment variable specifies the maximum number of buffers allocated before garbage collection kicks in. Please note that the total number of buffers that can exist simultaneously can be higher than this number; this variable tells the garbage collector that it should free any available buffers immediately when the threshold is reached. When not set, the default value is 1000.

AF_OPENCL_MAX_JIT_LEN

When set, this environment variable specifies the maximum height of the OpenCL JIT tree after which evaluation is forced. The default value, as of v3.4, is 50 on OSX and 100 everywhere else. This value was 20 for older versions.

AF_CUDA_MAX_JIT_LEN

When set, this environment variable specifies the maximum height of the CUDA JIT tree after which evaluation is forced. The default value, as of v3.4, is 100. This value was 20 for older versions.

AF_CPU_MAX_JIT_LEN

When set, this environment variable specifies the maximum height of the CPU JIT tree after which evaluation is forced. The default value, as of v3.4, is 100. This value was 20 for older versions.

Interoperability with CUDA

Although ArrayFire is quite extensive, there remain many cases in which you may want to write custom kernels in CUDA or OpenCL. For example, you may wish to add ArrayFire to an existing code base to increase your productivity, or you may need to supplement ArrayFire's functionality with your own custom implementation of specific algorithms.

ArrayFire manages its own memory, runs within its own CUDA stream, and creates custom IDs for devices. As such, most of the interoperability functions focus on reducing potential synchronization conflicts between ArrayFire and CUDA.

Basics

It is fairly straightforward to interface ArrayFire with your own custom code. ArrayFire provides several functions to ease this process including:

Function                     Purpose
Array::new_from_device_ptr   Construct an ArrayFire Array from device memory
Array::device_ptr            Obtain a pointer to the device memory (implies lock())
Array::lock                  Removes ArrayFire's control of a device memory pointer
Array::unlock                Restores ArrayFire's control over a device memory pointer
get_device                   Gets the current ArrayFire device ID
set_device                   Switches ArrayFire to the specified device
get_device_native_id         Fetches the CUDA device ID for a given ArrayFire device ID
set_device_native_id         Switches the active device to the specified CUDA device ID
get_stream                   Gets the current CUDA stream used by ArrayFire

Using custom CUDA kernels in existing ArrayFire application

By default, ArrayFire manages its own memory and operates in its own CUDA stream. Thus there is a slight amount of bookkeeping that needs to be done in order to integrate your custom CUDA kernel.

Ideally, we recommend using ArrayFire's CUDA stream to launch your custom kernels. However, this is currently not possible due to a limitation of RustaCUDA: it is not yet able to wrap an existing cudaStream_t/CUstream_t object. The current workaround is to create a stream of your own and launch the kernel on it.

Notice that since ArrayFire and your kernels do not share the same CUDA stream, you must synchronize explicitly before launching a kernel on your stream that depends on computation carried out earlier by ArrayFire. This extra step will become unnecessary once the above-stated limitation of RustaCUDA's stream is eliminated.

This process is best illustrated with a fully worked example:

use arrayfire as af;
use rustacuda::prelude::*;
use rustacuda::*;

use std::ffi::CString;

fn main() {
    // MAKE SURE to do all rustacuda initialization before ArrayFire's
    // first API call. It seems like some CUDA context state gets corrupted
    // if we mix CUDA context init (device, context, module, stream) with ArrayFire APIs
    match rustacuda::init(CudaFlags::empty()) {
        Ok(()) => {}
        Err(e) => panic!("rustacuda init failure: {:?}", e),
    }
    let device = match Device::get_device(0) {
        Ok(d) => d,
        Err(e) => panic!("Failed to get device: {:?}", e),
    };
    let _context =
        match Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device) {
            Ok(c) => c,
            Err(e) => panic!("Failed to create context: {:?}", e),
        };
    let ptx = CString::new(include_str!("./resources/add.ptx")).unwrap();
    let module = match Module::load_from_string(&ptx) {
        Ok(m) => m,
        Err(e) => panic!("Failed to load module from string: {:?}", e),
    };
    let stream = match Stream::new(StreamFlags::NON_BLOCKING, None) {
        Ok(s) => s,
        Err(e) => panic!("Failed to create stream: {:?}", e),
    };

    af::set_device(0);
    af::info();

    let num: i32 = 10;
    let x = af::constant(1f32, af::dim4!(10));
    let y = af::constant(2f32, af::dim4!(10));
    let out = af::constant(0f32, af::dim4!(10));

    af::af_print!("x", x);
    af::af_print!("y", y);
    af::af_print!("out(init)", out);

    //TODO Figure out how to use Stream returned by ArrayFire with Rustacuda
    // let af_id = get_device();
    // let cuda_id = get_device_native_id(af_id);
    // let af_cuda_stream = get_stream(cuda_id);

    //TODO Figure out how to use Stream returned by ArrayFire with Rustacuda
    // let stream = Stream {inner: mem::transmute(af_cuda_stream)};

    // Run a custom CUDA kernel in the ArrayFire CUDA stream
    unsafe {
        // Obtain device pointers from ArrayFire using the Array::device_ptr() method
        let d_x: *mut f32 = x.device_ptr() as *mut f32;
        let d_y: *mut f32 = y.device_ptr() as *mut f32;
        let d_o: *mut f32 = out.device_ptr() as *mut f32;

        match launch!(module.sum<<<1, 1, 0, stream>>>(
        memory::DevicePointer::wrap(d_x),
        memory::DevicePointer::wrap(d_y),
        memory::DevicePointer::wrap(d_o),
        num
        )) {
            Ok(()) => {}
            Err(e) => panic!("Kernel Launch failure: {:?}", e),
        }

        // wait for the kernel to finish as it is async call
        match stream.synchronize() {
            Ok(()) => {}
            Err(e) => panic!("Stream sync failure: {:?}", e),
        };

        // Return control of Array memory to ArrayFire using unlock
        x.unlock();
        y.unlock();
        out.unlock();
    }
    af::af_print!("sum after kernel launch", out);
}

Adding ArrayFire to existing CUDA Application

Adding ArrayFire to an existing application is slightly more involved and can be somewhat tricky due to several optimizations we implement. The most important are as follows:

  • ArrayFire assumes control of all memory provided to it.
  • ArrayFire does not (in general) support in-place memory transactions.

We will discuss the implications of these items below. To add ArrayFire to existing code you need to:

  1. Finish any pending CUDA operations (e.g. cudaDeviceSynchronize() or similar stream functions)
  2. Create ArrayFire arrays from existing CUDA pointers
  3. Perform operations on ArrayFire arrays
  4. Instruct ArrayFire to finish operations using eval and sync
  5. Obtain pointers to important memory
  6. Continue your CUDA application.
  7. Free non-managed memory

To create the Array from a device pointer, you should use one of the following approaches:

Using DeviceBuffer from RustaCUDA, or a Wrapper Object for CUDA device memory


let v = vec![0.0f32; 10 * 10]; // host data to be copied to the device
let mut buffer = memory::DeviceBuffer::from_slice(&v).unwrap();

let array_dptr = Array::new_from_device_ptr(
    buffer.as_device_ptr().as_raw_mut(), dim4!(10, 10));

array_dptr.lock(); // Needed to avoid free as arrayfire takes ownership

Using raw pointer returned from cuda_malloc interface exposed by RustaCUDA


let dptr: *mut f32 = unsafe {
    memory::cuda_malloc::<f32>(10 * 10).unwrap().as_raw_mut()
};

let array_dptr = Array::new_from_device_ptr(dptr, dim4!(10, 10));
// After ArrayFire takes over ownership of the pointer, you can use other
// arrayfire functions as usual.

ArrayFire's memory manager automatically assumes responsibility for any memory provided to it. Thus ArrayFire could free or reuse the memory at any later time. If this behavior is not desired, you may call Array::unlock and manage the memory yourself. However, if you do so, please be cautious not to free memory when ArrayFire might be using it!

The seven steps above are best illustrated using a fully-worked example:

use arrayfire::{af_print, dim4, info, set_device, Array};
use rustacuda::prelude::*;

fn main() {
    // MAKE SURE to do all rustacuda initialization before ArrayFire's
    // first API call. It seems like some CUDA context state gets corrupted
    // if we mix CUDA context init (device, context, module, stream) with ArrayFire APIs
    match rustacuda::init(CudaFlags::empty()) {
        Ok(()) => {}
        Err(e) => panic!("rustacuda init failure: {:?}", e),
    }
    let device = match Device::get_device(0) {
        Ok(d) => d,
        Err(e) => panic!("Failed to get device: {:?}", e),
    };
    let _context =
        match Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device) {
            Ok(c) => c,
            Err(e) => panic!("Failed to create context: {:?}", e),
        };
    let stream = match Stream::new(StreamFlags::NON_BLOCKING, None) {
        Ok(s) => s,
        Err(e) => panic!("Failed to create stream: {:?}", e),
    };

    let mut in_x = DeviceBuffer::from_slice(&[1.0f32; 10]).unwrap();
    let mut in_y = DeviceBuffer::from_slice(&[2.0f32; 10]).unwrap();

    // wait for any prior kernels to finish before passing
    // the device pointers to ArrayFire
    match stream.synchronize() {
        Ok(()) => {}
        Err(e) => panic!("Stream sync failure: {:?}", e),
    };

    set_device(0);
    info();

    let x = Array::new_from_device_ptr(in_x.as_device_ptr().as_raw_mut(), dim4!(10));
    let y = Array::new_from_device_ptr(in_y.as_device_ptr().as_raw_mut(), dim4!(10));

    // Lock so that ArrayFire doesn't free pointers from RustaCUDA
    // But we have to make sure these pointers stay in valid scope
    // as long as the associated ArrayFire Array objects are valid
    x.lock();
    y.lock();

    af_print!("x", x);
    af_print!("y", y);

    let o = x + y;
    af_print!("out", o);

    let _o_dptr = unsafe { o.device_ptr() }; // Calls an implicit lock

    // User has to call unlock if they want to relinquish control to ArrayFire

    // Once the non-arrayfire operations are done, call unlock.
    o.unlock(); // After this, there is no guarantee that value of o_dptr is valid
}

Interoperability with OpenCL

Although ArrayFire is quite extensive, there remain many cases in which you may want to write custom kernels in CUDA or OpenCL. For example, you may wish to add ArrayFire to an existing code base to increase your productivity, or you may need to supplement ArrayFire's functionality with your own custom implementation of specific algorithms.

ArrayFire manages its own context, queue, memory, and creates custom IDs for devices. As such, most of the interoperability functions focus on reducing potential synchronization conflicts between ArrayFire and OpenCL.

Basics

It is fairly straightforward to interface ArrayFire with your own custom code. ArrayFire provides several functions to ease this process including:

FunctionPurpose
Array::new_from_device_ptrConstruct an ArrayFire Array from cl_mem
Array::device_ptrObtain a pointer to the device memory (implies lock)
Array::lockRemoves ArrayFire's control of a device memory pointer
Array::unlockRestores ArrayFire's control over a device memory pointer
get_platformGet ArrayFire's current cl_platform
get_deviceGets the current ArrayFire device ID
get_device_idGet ArrayFire's current cl_device_id
set_device_idSet ArrayFire's device from a cl_device_id
set_deviceSwitches ArrayFire to the specified device
get_contextGet ArrayFire's current cl_context
get_queueGet ArrayFire's current cl_command_queue
get_device_typeGet the current DeviceType

Note that the pointer returned by Array::device_ptr should be cast to cl_mem before using it with OpenCL opaque types. The pointer is a cl_mem internally that ArrayFire force-casts to a generic pointer type before returning the value to the caller.

Additionally, the OpenCL backend permits the programmer to add and remove custom devices from the ArrayFire device manager. These permit you to attach ArrayFire directly to the OpenCL queue used by other portions of your application.

FunctionPurpose
add_device_contextAdd a new device to ArrayFire's device manager
set_device_contextSet ArrayFire's device from cl_device_id & cl_context
delete_device_contextRemove a device from ArrayFire's device manager

Below we provide two worked examples on how ArrayFire can be integrated into new and existing projects.

Adding custom OpenCL kernels to an existing ArrayFire application

By default, ArrayFire manages its own context, queue, memory, and creates custom IDs for devices. Thus there is some bookkeeping that needs to be done to integrate your custom OpenCL kernel.

If your kernels can operate in the same queue as ArrayFire, you should:

  1. Obtain the OpenCL context, device, and queue used by ArrayFire
  2. Obtain cl_mem references to Array objects
  3. Load, build, and use your kernels
  4. Return control of Array memory to ArrayFire

Note: ArrayFire uses an in-order queue; thus, when ArrayFire and your kernels are operating in the same queue, there is no need to perform any synchronization operations.

This process is best illustrated with a fully worked example:

//! A trivial example. Copied from ocl-core crate repository.
use af_opencl_interop as afcl;
use arrayfire as af;

use ocl_core::{ArgVal, Event};

use std::ffi::CString;

fn main() {
    af::info();
    let dims = af::dim4!(8);
    let af_buffer = af::constant(0f32, dims.clone());
    af::af_print!("af_buffer", af_buffer);

    let src = r#"
        __kernel void add(__global float* buffer, float scalar) {
            buffer[get_global_id(0)] += scalar;
        }
    "#;

    let af_did = afcl::get_device_id();
    let af_ctx = afcl::get_context(false);
    let af_que = afcl::get_queue(false);

    let _devid = unsafe { ocl_core::DeviceId::from_raw(af_did) };
    let contx = unsafe { ocl_core::Context::from_raw_copied_ptr(af_ctx) };
    let queue = unsafe { ocl_core::CommandQueue::from_raw_copied_ptr(af_que) };

    // Define which platform and device(s) to use. Create a context,
    // queue, and program then define some dims..
    let src_cstring = CString::new(src).unwrap();
    let program = ocl_core::create_program_with_source(&contx, &[src_cstring]).unwrap();
    ocl_core::build_program(
        &program,
        None::<&[()]>,
        &CString::new("").unwrap(),
        None,
        None,
    )
    .unwrap();

    // Fetch cl_mem from ArrayFire Array
    let ptr = unsafe { af_buffer.device_ptr() };
    let buffer = unsafe { ocl_core::Mem::from_raw_copied_ptr(ptr) };

    // Create a kernel with arguments matching those in the source above:
    let kernel = ocl_core::create_kernel(&program, "add").unwrap();
    ocl_core::set_kernel_arg(&kernel, 0, ArgVal::mem(&buffer)).unwrap();
    ocl_core::set_kernel_arg(&kernel, 1, ArgVal::scalar(&10.0f32)).unwrap();

    let ocl_dims: [usize; 3] = [dims[0] as usize, dims[1] as usize, dims[2] as usize];
    unsafe {
        ocl_core::enqueue_kernel(
            &queue,
            &kernel,
            1,
            None,
            &ocl_dims,
            None,
            None::<Event>,
            None::<&mut Event>,
        )
        .unwrap();
    }
    ocl_core::finish(&queue).unwrap();
    af_buffer.unlock(); //Give back control of cl_mem to ArrayFire memory manager

    af::af_print!("af_buffer after running Custom Kernel on it", af_buffer);
}

If your kernels need to operate in their own OpenCL queue, the process is essentially identical, except that you need to instruct ArrayFire to complete its computations using the sync function prior to launching your own kernels, and to ensure your kernels are complete using clFinish (or similar) commands prior to returning control of the memory to ArrayFire:

  1. Obtain the OpenCL context, device, and queue used by ArrayFire
  2. Obtain cl_mem references to Array objects
  3. Instruct ArrayFire to finish operations using sync
  4. Load, build, and use your kernels
  5. Instruct OpenCL to finish operations using clFinish() or similar commands.
  6. Return control of Array memory to ArrayFire

Adding ArrayFire to an existing OpenCL application

Adding ArrayFire to an existing application is slightly more involved and can be somewhat tricky due to several optimizations we implement. The most important are as follows:

  • ArrayFire assumes control of all memory provided to it.
  • ArrayFire does not (in general) support in-place memory transactions.

We will discuss the implications of these items below. To add ArrayFire to existing code you need to:

  1. Instruct OpenCL to complete its operations using clFinish (or similar)
  2. Instruct ArrayFire to use the user-created OpenCL Context
  3. Create ArrayFire arrays from OpenCL memory objects
  4. Perform ArrayFire operations on the Arrays
  5. Instruct ArrayFire to finish operations using sync
  6. Obtain cl_mem references for important memory
  7. Continue your OpenCL application

ArrayFire's memory manager automatically assumes responsibility for any memory provided to it. If you are creating an array from another RAII style object, you should retain it to ensure your memory is not deallocated if your RAII object were to go out of scope.

If you do not wish for ArrayFire to manage your memory, you may call the Array::unlock function and manage the memory yourself; however, if you do so, please be cautious not to call clReleaseMemObj on a cl_mem when ArrayFire might be using it!

Given below is a fully working example:

//! A trivial example. Copied from ocl-core crate repository.
use af_opencl_interop as afcl;
use arrayfire as af;

use ocl_core::{ContextProperties, Event};

fn main() {
    // Choose platform & device(s) to use. Create a context, queue,
    let platform_id = ocl_core::default_platform().unwrap();
    let device_ids = ocl_core::get_device_ids(&platform_id, None, None).unwrap();
    let device_id = device_ids[0];
    let context_properties = ContextProperties::new().platform(platform_id);
    let context =
        ocl_core::create_context(Some(&context_properties), &[device_id], None, None).unwrap();
    let queue = ocl_core::create_command_queue(&context, &device_id, None).unwrap();
    let dims = [8, 1, 1];

    // Create a `Buffer`:
    let mut vec = vec![0.0f32; dims[0]];
    let buffer = unsafe {
        ocl_core::create_buffer(
            &context,
            ocl_core::MEM_READ_WRITE | ocl_core::MEM_COPY_HOST_PTR,
            dims[0],
            Some(&vec),
        )
        .unwrap()
    };
    ocl_core::finish(&queue).unwrap(); //sync up before switching to arrayfire

    // Add custom device, context and associated queue to ArrayFire
    afcl::add_device_context(device_id.as_raw(), context.as_ptr(), queue.as_ptr());
    afcl::set_device_context(device_id.as_raw(), context.as_ptr());
    af::info();

    let mut af_buffer = af::Array::new_from_device_ptr(
        buffer.as_ptr() as *mut f32,
        af::Dim4::new(&[dims[0] as u64, 1, 1, 1]),
    );

    af::af_print!("GPU Buffer before modification:", af_buffer);

    af_buffer = af_buffer + 10f32;

    af::sync(af::get_device());
    unsafe {
        let ptr = af_buffer.device_ptr();
        let obuf = ocl_core::Mem::from_raw_copied_ptr(ptr);

        // Read results from the device into a vector:
        ocl_core::enqueue_read_buffer(
            &queue,
            &obuf,
            true,
            0,
            &mut vec,
            None::<Event>,
            None::<&mut Event>,
        )
        .unwrap();
    }
    println!("GPU buffer on host after ArrayFire operation: {:?}", vec);

    // Remove device from ArrayFire management towards Application Exit
    af::set_device(0); // Cannot pop when in Use, hence switch to another device
    afcl::delete_device_context(device_id.as_raw(), context.as_ptr());
}

ArrayFire in Threaded Applications

In this chapter, we will be looking at how to use ArrayFire in multi-threaded programs. We shall go over the details in the following order:

  • Move an Array to thread
  • Read Array from Multiple threads
  • Write to Array from Multiple threads
  • Write to single Array using Channel

Move an Array to thread

In this section, we are going to create an Array on the main thread, move it to a child thread, modify it, and then print it from the child thread.

        set_device(0);
        info();
        let mut a = constant(1, dim4!(3, 3));

        let handle = thread::spawn(move || {
            //set_device to appropriate device id is required in each thread
            set_device(0);

            println!("\nFrom thread {:?}", thread::current().id());

            a += constant(2, dim4!(3, 3));
            print(&a);
        });

        //Need to join other threads as main thread holds arrayfire context
        handle.join().unwrap();

Read Array from Multiple threads

Now, let's expand the earlier example to do a bunch of arithmetic operations in parallel on multiple threads using the same Array objects.

        let ops: Vec<_> = vec![Op::Add, Op::Sub, Op::Div, Op::Mul, Op::Add, Op::Div];

        // Set active GPU/device on main thread on which
        // subsequent Array objects are created
        set_device(0);

        // ArrayFire Array's are internally maintained via atomic reference counting
        // Thus, they need no Arc wrapping while moving to another thread.
        // Just call clone method on the object and share the resulting clone object
        let a = constant(1.0f32, dim4!(3, 3));
        let b = constant(2.0f32, dim4!(3, 3));

        let threads: Vec<_> = ops
            .into_iter()
            .map(|op| {
                let x = a.clone();
                let y = b.clone();
                thread::spawn(move || {
                    set_device(0); //Both of objects are created on device 0 earlier
                    match op {
                        Op::Add => {
                            let _c = x + y;
                        }
                        Op::Sub => {
                            let _c = x - y;
                        }
                        Op::Div => {
                            let _c = x / y;
                        }
                        Op::Mul => {
                            let _c = x * y;
                        }
                    }
                    sync(0);
                    thread::sleep(std::time::Duration::new(1, 0));
                })
            })
            .collect();
        for child in threads {
            let _ = child.join();
        }

Given below is the definition of the enum Op used in the examples above, shown here for completeness.

    #[derive(Debug, Copy, Clone)]
    enum Op {
        Add,
        Sub,
        Div,
        Mul,
    }

Write to Array from Multiple threads

Let us further expand the earlier example by accumulating the results of the arithmetic operations into a single Array object.

The code differs from the earlier section in a couple of places:

  • In the main thread, we wrap the accumulating Array in a read-write lock (std::sync::RwLock), which is in turn wrapped in an atomically reference counted pointer, std::sync::Arc.
  • In the child threads, we use the guarded objects returned by RwLock's write method to access the accumulator Array.
        let ops: Vec<_> = vec![Op::Add, Op::Sub, Op::Div, Op::Mul, Op::Add, Op::Div];

        // Set active GPU/device on main thread on which
        // subsequent Array objects are created
        set_device(0);

        let c = constant(0.0f32, dim4!(3, 3));
        let a = constant(1.0f32, dim4!(3, 3));
        let b = constant(2.0f32, dim4!(3, 3));

        // Move ownership to RwLock and wrap in Arc since same object is to be modified
        let c_lock = Arc::new(RwLock::new(c));

        // a and b are internally reference counted by ArrayFire. Unless there
        // is a prior known need that they may be modified, you can simply clone
        // the objects and pass them to threads

        let threads: Vec<_> = ops
            .into_iter()
            .map(|op| {
                let x = a.clone();
                let y = b.clone();

                let wlock = c_lock.clone();
                thread::spawn(move || {
                    //Both objects were created on device 0 in the main thread.
                    //Every thread needs to set the device that it is going to
                    //work on. Note that all Array objects must have been created
                    //on the same device, as of this writing.
                    set_device(0);
                    if let Ok(mut c_guard) = wlock.write() {
                        match op {
                            Op::Add => {
                                *c_guard += x + y;
                            }
                            Op::Sub => {
                                *c_guard += x - y;
                            }
                            Op::Div => {
                                *c_guard += x / y;
                            }
                            Op::Mul => {
                                *c_guard += x * y;
                            }
                        }
                    }
                })
            })
            .collect();

        for child in threads {
            let _ = child.join();
        }

        //let read_guard = c_lock.read().unwrap();
        //af_print!("C after threads joined", *read_guard);
        //C after threads joined
        //[3 3 1 1]
        //    8.0000     8.0000     8.0000
        //    8.0000     8.0000     8.0000
        //    8.0000     8.0000     8.0000

Write to single Array using Channel

In this section, we shall modify the example to use a channel instead of data sharing.

        let ops: Vec<_> = vec![Op::Add, Op::Sub, Op::Div, Op::Mul, Op::Add, Op::Div];
        let ops_len: usize = ops.len();

        // Set active GPU/device on main thread on which
        // subsequent Array objects are created
        set_device(0);

        let mut c = constant(0.0f32, dim4!(3, 3));
        let a = constant(1.0f32, dim4!(3, 3));
        let b = constant(2.0f32, dim4!(3, 3));

        let (tx, rx) = mpsc::channel();

        let threads: Vec<_> = ops
            .into_iter()
            .map(|op| {
                // a and b are internally reference counted by ArrayFire. Unless there
                // is a prior known need that they may be modified, you can simply clone
                // the objects and pass them to threads
                let x = a.clone();
                let y = b.clone();

                let tx_clone = tx.clone();

                thread::spawn(move || {
                    //Both objects were created on device 0 in the main thread.
                    //Every thread needs to set the device that it is going to
                    //work on. Note that all Array objects must have been created
                    //on the same device, as of this writing.
                    set_device(0);

                    let c = match op {
                        Op::Add => x + y,
                        Op::Sub => x - y,
                        Op::Div => x / y,
                        Op::Mul => x * y,
                    };
                    tx_clone.send(c).unwrap();
                })
            })
            .collect();

        for _i in 0..ops_len {
            c += rx.recv().unwrap();
        }

        //Need to join other threads as main thread holds arrayfire context
        for child in threads {
            let _ = child.join();
        }

        //af_print!("C after accumulating results", &c);
        //[3 3 1 1]
        //    8.0000     8.0000     8.0000
        //    8.0000     8.0000     8.0000
        //    8.0000     8.0000     8.0000