GFOR: Parallel For-Loops

Run many independent loops simultaneously on the GPU or device.

The gfor-loop construct may be used to simultaneously launch all of the iterations of a for-loop on the GPU or device, as long as the iterations are independent. While the standard for-loop performs each iteration sequentially, ArrayFire's gfor-loop performs each iteration at the same time (in parallel). ArrayFire does this by tiling out the values of all loop iterations and then performing computation on those tiles in one pass.

You can think of `gfor`

as performing auto-vectorization of your code, e.g. you write a gfor-loop that increments every element of a vector but behind the scenes ArrayFire rewrites it to operate on the entire vector in parallel.

Behind the scenes, ArrayFire rewrites your code into this equivalent and faster version:

A = A + 1;

It is best to vectorize computation as much as possible to avoid the overhead in both for-loops and gfor-loops.

To see another example, you could run an FFT on every 2D slice of a volume in a for-loop, or you could "vectorize" and simply do it all in one gfor-loop operation:

for (int i = 0; i < N; ++i)

A(span,span,i) = fft2(A(span,span,i)); // runs each FFT in sequence

gfor (seq i, N)

A(span,span,i) = fft2(A(span,span,i)); // runs N FFTs in parallel

There are three formats for instantiating gfor-loops.

- gfor(var,n) Creates a sequence _{0, 1, ..., n-1}_
- gfor(var,first,last) Creates a sequence _{first, first+1, ..., last}_
- gfor(var,first,last,incr) Creates a sequence _{first, first+inc, first+2*inc, ..., last}_

So all of the following represent the equivalent sequence: *0,1,2,3,4*

More examples:

array A = constant(1, n, n);

array B = constant(1, 1, n);

gfor (seq k, 0, n-1) {

B(span, k) = sum(A(span, k) * A(span,k)); // inner product

}

array A = randu(n,m);

array B = constant(0,n,m);

gfor (seq k, 0, m-1) {

B(span,k) = fft(A(span,k));

}

If you have defined a function that you want to call within a GFOR loop, then that function has to meet all the conditions described in this page in order to be able to work as expected.

Consider the (trivial) example below. The function compute() has to satisfy all requirements for GFOR Usage, so you cannot use if-else conditions inside it.

array compute(array A, array B, float ep)

{

array H;

if (ep > 0) H = (A * B) / ep; // BAD

else H = A * 0;

return H;

}

int m = 2, n = 3;

array A = randu(m, n);

array B = randu(m, n);

float ep = 2.35;

array H = constant(0,m,n);

gfor (seq ii, n)

H(span,ii) = compute(A(span,ii), B(span,ii), ep);

The iterator can be involved in expressions.

A = constant(1,n,n,m);

B = constant(1,n,n);

gfor (seq k, m)

A(span,span,k) = (k+1)*B + sin(k+1); // expressions

Iterator definitions can include arithmetic in expressions.

A = constant(1,n,n,m);

B = constant(1,n,n);

gfor (seq k, m/4, m-m/4)

A(span,span,k) = k*B + sin(k+1); // expressions

More complicated subscripting is supported.

A = constant(1,n,n,m);

B = constant(1,n,10);

gfor (seq k, m)

A(span,seq(10),k) = k*B; // subscripting, seq(10) generates index [0,9]

Iterators can be combined with arithmetic in subscripts.

array A = randu(n,m);

array B = constant(1,n,m);

gfor (seq k, 1, m-1)

B(span,k) = A(span,k-1);

A = randu(n,2*m);

B = constant(1,n,m);

gfor (seq k, m)

B(span,k) = A(span,2*(k+1)-1);

A = randu(n,2*m);

B = constant(1,n,m);

gfor (seq k, m)

B(span,k) = A(span,floor(k+.2));

Within the loop, you can use a result you just computed.

Although it is more efficient to store the value in a temporary variable:

In some cases, GFOR behaves differently than the typical sequential FOR-loop. For example, you can read and modify a result in place as long as the accesses are independent.

Subscripting behaviors `arrays`

also work with GFOR.

A = constant(1,n,n,m,k);

m = m * k; // precompute since cannot have expressions in iterator

gfor (seq k, m)

A(span,span,k) = 4 * A(span,span,k); // collapse last dimension

Random data should always be generated outside the GFOR loop. This is due to the fact that GFOR only passes over the body of the loop once. Therefore, any calls to randu() inside the body of the loop will result in the same random matrix being assigned to every iteration of the loop.

For example, in the following trivial code, all columns of `B`

are identical because `A`

is only evaluated once:

B = 0.1209 0.1209 0.1209 0.6432 0.6432 0.6432 0.8746 0.8746 0.8746

This can be rectified by bringing the random number generation outside the loop, as follows:

B = 0.0892 0.1655 0.7807 0.5626 0.5173 0.2932 0.5664 0.5898 0.1391

This is a trivial example, but demonstrates the principle that random numbers should be pre-allocated outside the loop in most cases.

This preliminary implementation of GFOR has the following restrictions.

The most important property of the loop body is that each iteration must be independent of the other iterations. Note that accessing the result of a separate iteration produces undefined behavior.

No conditional statements in the body of the loop, (i.e. no branching). However, you can often find ways to overcome this restriction. Consider the following two examples:

Example 1: Problem

However, you can do a few tricks to overcome this restriction by expressing the conditional statement as a multiplication by logical values. For instance, the block of code above can be converted to run as follows, without error:

Example 1: Solution

Another example of overcoming the conditional statement restriction in GFOR is as follows:

Example 2: Problem

array A = constant(1,n,n,m);

array B = randu(n,n);

gfor (seq k, 4) {

if ((k % 2) != 0)

A(span,span,k) = B + k;

else

A(span,span,k) = B * k;

}

Instead, you can make two passes over the same data, each pass performing one branch.

Example 2: Solution

A = constant(1,n,n,m);

B = randu(n);

gfor (seq k, 0, 2, 3)

A(span,span,k) = B + k;

gfor (seq k, 1, 2, 3)

A(span,span,k) = B * k;

Nesting GFOR-loops within GFOR-loops is unsupported. You may interleave FOR-loops as long as they are completely independent of the GFOR-loop iterator.

Nesting FOR-loops within GFOR-loops is supported, as long as the GFOR iterator is not used in the FOR loop iterator, as follows:

gfor (seq k, n) {

for (int j = 0; j < (m+k); j++) { // bad

// ...

}

}

gfor (seq k, n) {

for (int j = 0; j < m; j++) { // good

//...

}

}

Nesting GFOR-loops inside of FOR-loops is fully supported.

Logical indexing like the following is not supported:

The problem is that every GFOR tile has a different number of elements, something which GFOR cannot yet handle.

Similar to the workaround for conditional statements, it might work to use masked arithmetic:

Sub-assignment with scalars and logical masks is supported:

Since each computation is done in parallel for all iterator values, you need to have enough card memory available to do all iterations simultaneously. If the problem exceeds memory, it will trigger "out of memory" errors.

You can work around the memory limitations of your GPU or device by breaking the GFOR loop up into segments; however, you might want to consider using a larger memory GPU or device.

// BEFORE

gfor (seq k, 400) {

array B = A(span,k);

C(span,span,k) = matmulNT(B * B); // outer product expansion runs out of memory

}

// AFTER

for (int kk = 0; kk < 400; kk += 100) {

gfor (seq k, kk, kk+99) { // four batches of 100

array B = A(span,k);

C(span,span,k) = matmulNT(B, B); // now several smaller problems fit in card memory

}

}