v3.9.0

Improvements

Add oneAPI backend [#3296]
Add support to directly access arrays on other devices [#3447]
Add broadcast support [#2871]
Improve OpenCL CPU JIT performance [#3257] [#3392]
Optimize thread/block calculations of several kernels [#3144]
Add support for fast math compilation when building ArrayFire [#3334] [#3337]
Optimize performance of fftconvolve when using floats [#3338]
Add support for CUDA 12.1 and 12.2
Better handling of empty arrays [#3398]
Better handling of memory in linear algebra functions in OpenCL [#3423]
Better logging with JIT kernels [#3468]
Optimize memory manager/JIT interactions for small number of buffers [#3468]
Documentation improvements [#3485]
Optimize reorder function [#3488]
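
The broadcast entry above does not spell out the compatibility rule. As a minimal sketch, assuming the common convention that two dimension sizes are compatible when they are equal or one of them is 1, the result shape could be computed as follows. `Shape4` and `broadcastShape` are hypothetical names for illustration, not ArrayFire API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 4-dimensional shape; not ArrayFire's af::dim4.
using Shape4 = std::array<long long, 4>;

// Writes the broadcast result shape to `out` and returns true, or returns
// false if the shapes are incompatible. Assumed rule: per dimension, the
// sizes must match or one of them must be 1.
bool broadcastShape(const Shape4& a, const Shape4& b, Shape4& out) {
    for (std::size_t i = 0; i < out.size(); ++i) {
        assert(a[i] >= 0 && b[i] >= 0);
        if (a[i] == b[i] || b[i] == 1) {
            out[i] = a[i];
        } else if (a[i] == 1) {
            out[i] = b[i];
        } else {
            return false;  // e.g. 3 vs 4 cannot be broadcast
        }
    }
    return true;
}
```

Under this assumed rule, a 3x1 shape and a 1x4 shape combine to 3x4, while 3x1 and 4x1 are rejected.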

Fixes

Improve errors when creating OpenCL contexts from devices [#3257]
Improvements to vcpkg builds [#3376] [#3476]
Fix reduce by key when NaNs are present [#3261]
Fix error in convolve where the ndims parameter was forced to be equal to 2 [#3277]
Make constructors that accept dim_t explicit to avoid invalid conversions [#3259]
Fix error in randu when compiling against clang 14 [#3333]
Fix bug in OpenCL linear algebra functions [#3398]
Fix bug with thread local variables when device was changed [#3420] [#3421]
Fix bug in qr related to uninitialized memory [#3422]
Fix bug in shift where the array had an empty middle dimension [#3488]
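
The dim_t constructor fix above relies on the C++ `explicit` keyword. The following sketch shows the general effect using a hypothetical `Shape` type (not ArrayFire's actual class): once the constructor is explicit, an integer no longer converts silently where a shape is expected.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical shape type for illustration; not ArrayFire's actual class.
struct Shape {
    long long first;
    // 'explicit' blocks implicit conversions such as `Shape s = 5;`.
    explicit Shape(long long n) : first(n) {}
};

// Hypothetical function taking a Shape by value.
long long elements(Shape s) { return s.first; }

// With an explicit constructor, `elements(5)` fails to compile;
// callers must write `elements(Shape(5))` instead.
static_assert(!std::is_convertible<long long, Shape>::value,
              "integers must not implicitly convert to Shape");
static_assert(std::is_constructible<Shape, long long>::value,
              "explicit construction is still allowed");
```

The trade-off is slightly more verbose call sites in exchange for compile-time errors where a stray integer would otherwise have been turned into a shape.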

Contributions

Special thanks to our contributors: Willy Born, Mike Mullen

v3.8.3

Improvements

Add support for CUDA 12 [#3352]
Modernize documentation style and content [#3351]
memcpy performance improvements [#3144]
JIT performance improvements [#3144]
join performance improvements [#3144]
Improve support for Intel and newer Clang compilers [#3334]
CCache support on Windows [#3257]

Fixes

Fix issue with some locales with OpenCL kernel generation [#3294]
Internal improvements
Fix leak in clfft on exit.
Fix some cases where ndims was incorrectly used to calculate shape [#3277]
Fix issue when setDevice was not called in new threads [#3269]
Restrict initializer list to just fundamental types [#3264]

Contributions

Special thanks to our contributors: Carlo Cabrera, Guillaume Schmid, Willy Born, ktdq

v3.8.2

Improvements

Optimize JIT by removing some consecutive cast operations [#3031]
Add driver checks for CUDA 11.5 and 11.6 [#3203]
Improve the timing algorithm used for timeit [#3185]
Dynamically link against CUDA numeric libraries by default [#3205]
Add support for pruning CUDA binaries to reduce static binary sizes [#3234] [#3237]
Remove unused cuDNN libraries from installations [#3235]
Add support to statically link NVRTC libraries after CUDA 11.5 [#3236]
Add support for compiling with ccache when building the CUDA backend [#3241]
Make cuSparse an optional runtime dependency [#3240]

Fixes

Fix issue with consecutive moddims operations in the CPU backend [#3232]
Better floating point comparisons for tests [#3212]
Fix several warnings and inconsistencies with doxygen and documentation [#3226]
Fix issue when passing empty arrays into join [#3211]
Fix default value for the AF_COMPUTE_LIBRARY when not set [#3228]
Fix missing symbol issue when MKL is statically linked [#3244]
Remove linking of OpenCL's library to the unified backend [#3244]

Contributions

Special thanks to our contributors: Jacob Kahn, Willy Born

v3.8.1

Improvements

moddims now uses JIT approach for certain special cases - [#3177]
Embed Version Info in Windows DLLs - [#3025]
OpenCL device max parameter is now queried from device properties - [#3032]
JIT Performance Optimization: Unique funcName generation sped up - [#3040]
Improved readability of log traces - [#3050]
Use short function name in non-debug build error messages - [#3060]
SIFT/GLOH are now available as part of website binaries - [#3071]
Short-circuit zero elements case in detail::copyArray backend function - [#3059]
Speedup of kernel caching mechanism - [#3043]
Add short-circuit check for empty Arrays in JIT evalNodes - [#3072]
Performance optimization of indexing using dynamic thread block sizes - [#3111]
Starting with this release, ArrayFire uses the Intel MKL single dynamic library, which resolves many linking issues the unified library had when user applications used MKL themselves - [#3120]
Add shortcut check for zero elements in af_write_array - [#3130]
Speedup join by eliminating temp buffers for cascading joins - [#3145]
Added batch support for solve - [#1705]
Use pinned memory to copy device pointers in CUDA solve - [#1705]
Added package manager instructions to docs - [#3076]
CMake Build Improvements - [#3027] , [#3089] , [#3037] , [#3072] , [#3095] , [#3096] , [#3097] , [#3102] , [#3106] , [#3105] , [#3120] , [#3136] , [#3135] , [#3137] , [#3119] , [#3150] , [#3138] , [#3156] , [#3139] , [#1705] , [#3162]
CPU backend improvements - [#3010] , [#3138] , [#3161]
CUDA backend improvements - [#3066] , [#3091] , [#3093] , [#3125] , [#3143] , [#3161]
OpenCL backend improvements - [#3091] , [#3068] , [#3127] , [#3010] , [#3039] , [#3138] , [#3161]
General (including JIT) performance improvements across backends - [#3167]
Testing improvements - [#3072] , [#3131] , [#3151] , [#3141] , [#3153] , [#3152] , [#3157] , [#1705] , [#3170] , [#3167]
Update CLBlast to latest version - [#3135] , [#3179]
Improved Otsu threshold computation helper in canny algorithm - [#3169]
Modified default parameters for fftR2C and fftC2R C++ API from 0 to 1.0 - [#3178]
Use appropriate MKL getrs_batch_strided API based on MKL Versions - [#3181]

Fixes

Fixed a bug in JIT kernel disk caching - [#3182]
Fixed stream used by thrust (CUDA backend) functions - [#3029]
Added workaround for new cuSparse API that was added by CUDA amid fix releases - [#3057]
Fixed const array indexing inside gfor - [#3078]
Handle zero elements in copyData to host - [#3059]
Fixed double free regression in OpenCL backend - [#3091]
Fixed an infinite recursion bug in NaryNode JIT Node - [#3072]
Added missing input validation check in sparse-dense arithmetic operations - [#3129]
Fixed bug in getMappedPtr in OpenCL due to invalid lambda capture - [#3163]
Fixed bug in getMappedPtr on Arrays that are not ready - [#3163]
Fixed edgeTraceKernel for CPU devices on OpenCL backend - [#3164]
Fixed windows build issue(s) with VS2019 - [#3048]
API documentation fixes - [#3075] , [#3076] , [#3143] , [#3161]
CMake Build Fixes - [#3088]
Fixed the tutorial link in README - [#3033]
Fixed function name typo in timing tutorial - [#3028]
Fixed couple of bugs in CPU backend canny implementation - [#3169]
Fixed the reference count of arrays used in JIT operations. This relates to ArrayFire's internal memory bookkeeping; the behavior and accuracy of ArrayFire code were not broken before. The reference count now takes its optimal value in these scenarios, which may reduce memory usage in some narrow cases - [#3167]
Added assert that checks if topk is called with a negative value for k - [#3176]
Fixed an issue where countByKey would give incorrect results for any n > 128 - [#3175]

Contributions

Special thanks to our contributors: [HO-COOH](https://github.com/HO-COOH), [Willy Born](https://github.com/willyborn), [Gilad Avidov](https://github.com/avidov), [Pavan Yalamanchili](https://github.com/pavanky)

v3.8.0

Major Updates

Non-uniform(ragged) reductions [#2786]
Bit-wise not operator support for array and C API (af_bitnot) [#2865]
Initialization list constructor for array class [#2829] [#2987]

Improvements

New API for the following statistics functions: cov, var and stdev - [#2986]
allocV2 and freeV2 which return cl_mem on OpenCL backend [#2911]
Move constructor and move assignment operator for Dim4 class [#2946]
Support for CUDA 11.1 and Compute 8.6 [#3023]
Fix af::feature copy constructor for multi-threaded scenarios [#3022]
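
The statistics entry above does not describe the new signatures. Since the v3.7.3 notes mention the bias factor of variance, the sketch below illustrates the distinction such an API typically exposes: dividing by n - 1 (sample variance) versus n (population variance). The `VarianceBias` enum and `variance` function are hypothetical names, not taken from ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical enum for illustration; the names are not ArrayFire's.
enum class VarianceBias { Sample, Population };

// Two-pass variance: compute the mean, then the sum of squared deviations,
// and divide by n - 1 (sample) or n (population).
double variance(const std::vector<double>& x, VarianceBias bias) {
    assert(x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    return ss / (bias == VarianceBias::Sample ? n - 1.0 : n);
}
```

For the values 1, 2, 3, 4 the sum of squared deviations is 5, so the population variance is 5/4 and the sample variance is 5/3.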

v3.7.3

Improvements

Add f16 support for histogram - [#2984]
Update confidence connected components example with better illustration - [#2968]
Enable disk caching of OpenCL kernel binaries - [#2970]
Refactor extension of kernel binaries stored to disk .bin - [#2970]
Add minimum driver versions for CUDA toolkit 11 in internal map - [#2982]
Improve warnings messages from run-time kernel compilation functions - [#2996]

Fixes

Fix bias factor of variance in var_all and cov functions - [#2986]
Fix a race condition in confidence connected components function for OpenCL backend - [#2969]
Safely ignore disk cache failures in CUDA backend for compiled kernel binaries - [#2970]
Fix randn by passing in correct values to Box-Muller - [#2980]
Fix rounding issues in Box-Muller function used for RNG - [#2980]
Fix problems in RNG for older compute architectures with fp16 - [#2980] [#2996]
Fix performance regression of approx functions - [#2977]
Remove assert that checks that signal/filter types have to be the same - [#2993]
Fix checkAndSetDevMaxCompute when the device cc is greater than max - [#2996]
Fix documentation errors and warnings - [#2973] , [#2987]
Add missing opencl-arrayfire interoperability functions in unified backend - [#2981]
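
Several fixes above concern the Box-Muller function used by randn. For readers unfamiliar with it, here is a minimal sketch of the basic Box-Muller transform in plain C++. It is the textbook form, not ArrayFire's kernel, and `boxMuller` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Basic Box-Muller transform: maps two uniform samples to two independent
// standard normal samples. u1 must lie in (0, 1] so that log(u1) is finite;
// a uniform generator that can return exactly 0 would need special handling.
std::pair<double, double> boxMuller(double u1, double u2) {
    assert(u1 > 0.0 && u1 <= 1.0);
    const double twoPi = 6.283185307179586;
    const double r = std::sqrt(-2.0 * std::log(u1));
    return std::make_pair(r * std::cos(twoPi * u2), r * std::sin(twoPi * u2));
}
```

The precondition on `u1` shows why the range and rounding of the uniform inputs matter: the transform is only well defined when the logarithm's argument stays strictly positive.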

Contributions

Special thanks to our contributors: P. J. Reed

v3.7.2

Improvements

Cache CUDA kernels to disk to improve load times (thanks to @cschreib-ibex) [#2848]
Statically link against CUDA libraries [#2785]
Make cuDNN an optional build dependency [#2836]
Improve support for different compilers and OS [#2876] [#2945] [#2925] [#2942] [#2943] [#2945] [#2958]
Improve performance of join and transpose on CPU [#2849]
Improve documentation [#2816] [#2821] [#2846] [#2918] [#2928] [#2947]
Reduce binary size using NVRTC and by reducing template instantiations [#2849] [#2861] [#2890] [#2957]
reduceByKey performance improvements [#2851] [#2957]
Improve support for Intel OpenCL GPUs [#2855]
Allow statically linking against MKL [#2877] (Sponsored by SDL)
Better support for older CUDA toolkits [#2923]
Add support for CUDA 11 [#2939]
Add support for ccache for faster builds [#2931]
Add support for the conan package manager on linux [#2875]
Propagate build errors up the stack in AFError exceptions [#2948] [#2957]
Improve runtime dependency library loading [#2954]
Improved cuDNN runtime checks and warnings [#2960]
Document af_memory_manager_* native memory return values [#2911]

Fixes

Fix crash when allocating large arrays [#2827]
Fix various compiler warnings [#2827] [#2849] [#2872] [#2876]
Fix minor leaks in OpenCL functions [#2913]
Various continuous integration related fixes [#2819]
Fix zero padding with convolve2NN [#2820]
Fix af_get_memory_pressure_threshold return value [#2831]
Increased the max filter length for morph
Handle empty array inputs for LU, QR, and Rank functions [#2838]
Fix FindMKL.cmake script for sequential threading library [#2840] [#2952]
Various internal refactoring [#2839] [#2861] [#2864] [#2873] [#2890] [#2891] [#2913] [#2959]
Fix OpenCL 2.0 builtin function name conflict [#2851]
Fix error caused when releasing memory with multiple devices [#2867]
Fix missing set stacktrace symbol from unified API [#2915]
Fix zero padding issue in convolve2NN [#2820]
Fixed bugs in ReduceByKey [#2957]

Contributions

Special thanks to our contributors: Corentin Schreiber, Jacob Kahn, Paul Jurczak, Christoph Junghans

v3.7.1

Improvements

Improve mtx download for test data [#2742]
Documentation improvements [#2754] [#2792] [#2797]
Remove verbose messages in older CMake versions [#2773]
Reduce binary size with the use of nvrtc [#2790]
Use texture memory to load LUT in orb and fast [#2791]
Add missing print function for f16 [#2784]
Add checks for f16 support in the CUDA backend [#2784]
Create a thrust policy to intercept tmp buffer allocations [#2806]

Fixes

Fix segfault on exit when ArrayFire is not initialized in the main thread
Fix support for CMake 3.5.1 [#2771] [#2772] [#2760]
Fix evalMultiple if the input array sizes aren't the same [#2766]
Fix error when AF_BACKEND_DEFAULT is passed directly to backend [#2769]
Workaround name collision with AMD OpenCL implementation [#2802]
Fix on-exit errors with the unified backend [#2769]
Fix check for f16 compatibility in OpenCL [#2773]
Fix matmul on Intel OpenCL when passing same array as input [#2774]
Fix CPU OpenCL blas batching [#2774]
Fix memory pressure in the default memory manager [#2801]

Contributions

Special thanks to our contributors: padentomasello, glavaux2

v3.7.0

Major Updates

Added the ability to customize the memory manager (thanks to jacobkahn and flashlight) [#2461]
Added 16-bit floating point support for several functions [#2413] [#2587] [#2585] [#2587] [#2583]
Added sumByKey, productByKey, minByKey, maxByKey, allTrueByKey, anyTrueByKey, countByKey [#2254]
Added confidence connected components [#2748]
Added neural network based convolution and gradient functions [#2359]
Added a padding function [#2682]
Added pinverse for pseudo inverse [#2279]
Added support for uniform ranges in approx1 and approx2 functions. [#2297]
Added support to write to preallocated arrays for some functions [#2599] [#2481] [#2328] [#2327]
Added meanvar function [#2258]
Added support for sparse-sparse arithmetic
Added rsqrt function for reciprocal square root
Added a lower level af_gemm function for general matrix multiplication [#2481]
Added a function to set the cuBLAS math mode for the CUDA backend [#2584]
Separate debug symbols into separate files [#2535]
Print stacktraces on errors [#2632]
Support move constructor for af::array [#2595]
Expose events in the public API [#2461]
Add setAxesLabelFormat to format labels on graphs [#2495]
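
The by-key reductions listed above (sumByKey and its siblings) pair each value with a key and reduce values that share a key. The sketch below assumes the reduction runs over consecutive equal keys, so a key that reappears later starts a new group; that grouping rule is an assumption here and `sumByKeySketch` is a hypothetical helper, not the library function.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sums values over runs of consecutive equal keys and returns one
// (key, sum) pair per run. Assumption: keys are grouped by adjacency,
// so a key that reappears later starts a new run.
std::vector<std::pair<int, double> > sumByKeySketch(
    const std::vector<int>& keys, const std::vector<double>& vals) {
    assert(keys.size() == vals.size());
    std::vector<std::pair<int, double> > out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (!out.empty() && out.back().first == keys[i]) {
            out.back().second += vals[i];  // same run: accumulate
        } else {
            out.push_back(std::make_pair(keys[i], vals[i]));  // new run
        }
    }
    return out;
}
```

With keys 0, 0, 1, 1, 1, 0 and values 1 through 6, this yields three groups: (0, 3), (1, 12) and (0, 6).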

Improvements

Better error messages for systems with driver or device incompatibilities [#2678] [#2448]
Optimized unified backend function calls
Optimized anisotropic smoothing [#2713]
Optimized canny filter for CUDA and OpenCL
Better MKL search script
Better logging of different submodules in ArrayFire [#2670] [#2669]
Improve documentation [#2665] [#2620] [#2615] [#2639] [#2628] [#2633] [#2622] [#2617] [#2558] [#2326] [#2515]
Optimized af::array assignment [#2575]
Update the k-means example to display the result [#2521]

Fixes

Fix multi-config generators
Fix access errors in canny
Fix segfault in the unified backend if no backends are available
Fix access errors in scan-by-key
Fix sobel operator
Fix an issue with the random number generator and s16
Fix issue with boolean product reduction
Fix array_proxy move constructor
Fix convolve3 launch configuration
Fix an issue where the fft function modified the input array [#2520]

Contributions

Special thanks to our contributors: Jacob Kahn, William Tambellini, Alexey Kuleshevich, Richard Barnes, Gaika, ShalokShalom

v3.6.4

Bug Fixes

Address a JIT performance regression due to moving kernel arguments to shared memory [#2501]
Fix the default parameter for setAxisTitle [#2491]

v3.6.3

Improvements

Graphics are now a runtime dependency instead of a link time dependency [#2365]
Reduce the CUDA backend binary size using runtime compilation of kernels [#2437]
Improved batched matrix multiplication on the CPU backend by using Intel MKL's cblas_Xgemm_batched [#2206]
Print JIT kernels to disk or stream using the AF_JIT_KERNEL_TRACE environment variable [#2404]
void* pointers are now allowed as arguments to af::array::write() [#2367]
Slightly improve the efficiency of JITed tile operations [#2472]
Make random number generation on the CPU backend consistent with CUDA and OpenCL [#2435]
Handled very large JIT tree generations [#2484] [#2487]

Bug Fixes

Fixed af::array::array_proxy move assignment operator [#2479]
Fixed input array dimensions validation in svdInplace() [#2331]
Fixed the typedef declaration for window resource handle [#2357]
Increase compatibility with GCC 8 [#2379]
Fixed af::write tests [#2380]
Fixed a bug in broadcast step of 1D exclusive scan [#2366]
Fixed OpenGL related build errors on OSX [#2382]
Fixed multiple array evaluation. Performance improvement. [#2384]
Fixed buffer overflow and expected output of kNN SSD small test [#2445]
Fixed MKL linking order to enable threaded BLAS [#2444]
Added validations for forge module plugin availability before calling resource cleanup [#2443]
Improve compatibility on MSVC toolchain (_MSC_VER > 1914) with the CUDA backend [#2443]
Fixed BLAS gemm func generators for newest MSVC 19 on VS 2017 [#2464]
Fix errors on exits when using the cuda backend with unified [#2470]

Documentation

Updated svdInplace() documentation following a bugfix [#2331]
Fixed a typo in matrix multiplication documentation [#2358]
Fixed a code snippet demonstrating C-API use [#2406]
Updated hamming matcher implementation limitation [#2434]
Added illustration for the rotate function [#2453]

Misc

Use cudaMemcpyAsync instead of cudaMemcpy throughout the codebase [#2362]
Display a more informative error message if CUDA driver is incompatible [#2421] [#2448]
Changed forge resource management to use smart pointers [#2452]
Deprecated intl and uintl typedefs in API [#2360]
Enabled graphics by default for all builds starting with v3.6.3 [#2365]
Fixed several warnings [#2344] [#2356] [#2361]
Refactored initArray() calls to use createEmptyArray(). initArray() is for internal use only by Array class. [#2361]
Refactored void* memory allocations to use unsigned char type [#2459]
Replaced deprecated MKL API with in-house implementations for sparse to sparse/dense conversions [#2312]
Reorganized and fixed some internal backend API [#2356]
Updated compilation order of cuda files to speed up compile time [#2368]
Removed conditional graphics support builds after enabling runtime loading of graphics dependencies [#2365]
Marked graphics dependencies as optional in CPack RPM config [#2365]
Refactored a sparse arithmetic backend API [#2379]
Fixed const correctness of af_device_array API [#2396]
Update Forge to v1.0.4 [#2466]
Manage Forge resources from the DeviceManager class [#2381]
Fixed non-mkl & non-batch blas upstream call arguments [#2401]
Link MKL with OpenMP instead of TBB by default
Use clang-format to format source code

Contributions

Special thanks to our contributors: Alessandro Bessi, zhihaoy, Jacob Kahn, William Tambellini

v3.6.2

Features

Added support for batching on the cond argument in select() [#2243]
Added support for broadcasting batched matmul() [#2315]
Added support for multiple nearest neighbors in nearestNeighbour() [#2280]
Added support for clamp-to-edge padding as an af_border_type option [#2333]
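
Clamp-to-edge padding, added above as an af_border_type option, means that reads outside the array return the nearest edge value. A minimal 1-D sketch of that indexing rule, with `atClamped` as a hypothetical helper name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reads signal[i] with clamp-to-edge behaviour: indices before the start
// return the first element, indices past the end return the last one.
double atClamped(const std::vector<double>& signal, long long i) {
    assert(!signal.empty());
    const long long last = static_cast<long long>(signal.size()) - 1;
    const long long j = std::max<long long>(0, std::min(i, last));
    return signal[static_cast<std::size_t>(j)];
}
```

For the signal 10, 20, 30, index -2 reads 10 and index 7 reads 30, whereas zero padding would return 0 in both cases.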

Improvements

Improved performance of morphological operations [#2238]
Fixed linking errors when compiling without Freeimage/Graphics [#2248]
Improved the usage of ArrayFire as a CMake subproject [#2290]
Enabled configuration of custom library path for loading dynamic backend libraries [#2302]

Bug Fixes

Fixed LAPACK definitions and linking errors [#2239]
Fixed overflow in dim4::ndims() [#2289]
Fixed pow() precision for integral types [#2305]
Fixed issues with tile() with a large repeat dimension [#2307]
Fixed svd() sub-array output on OpenCL [#2279]
Fixed grid-based indexing calculation in histogram() [#2230]
Fixed bug in indexing when used after reorder [#2311]
Fixed errors when exiting on Windows when using CLBlast [#2222]
Fixed fallthrough error in medfilt1 [#2349]

Documentation

Improved unwrap() documentation [#2301]
Improved wrap() documentation [#2320]
Improved accum() documentation [#2298]
Improved tile() documentation [#2293]
Clarified approx1() and approx2() indexing in documentation [#2287]
Updated examples of select() in detailed documentation [#2277]
Updated lookup() examples [#2288]
Updated set operations' documentation [#2299]

Misc

af* libraries and dependencies directory changed to lib64 [#2186]
Added new arrayfire ASSERT utility functions [#2249] [#2256] [#2257] [#2263]
Improved error messages in JIT [#2309]

Contributions

Special thanks to our contributors: Jacob Kahn, Vardan Akopian

v3.6.1

Improvements

FreeImage is now a run-time dependency [#2164]
Reduced binary size by setting the symbol visibility to hidden [#2168]
Add memory manager logging using the AF_TRACE=mem environment variable [#2169]
Improved CPU Anisotropic Diffusion performance [#2174]
Perform normalization after FFT for improved accuracy [#2185] [#2192]
Updated CLBlast to v1.4.0 [#2178]
Added additional validation when using af::seq for indexing [#2153]
Perform checks for unsupported cards by the CUDA implementation [#2182]

Bug Fixes

Fixed region when all pixels were the foreground or background [#2152]
Fixed several memory leaks [#2202] [#2201] [#2180] [#2179] [#2177] [#2175]
Fixed bug in setDevice which didn't allow you to select the last device [#2189]
Fixed bug in min/max where the first element of the array was a NaN value [#2155]
Fixed window cell indexing for graphics [#2207]
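
The min/max fix above involves a leading NaN. The sketch below shows why that case is fragile in general: every ordered comparison with NaN is false, so a reduction seeded with the first element can get stuck on it. The NaN-skipping variant is one plausible convention, assumed here for illustration; the release note does not state which convention ArrayFire follows, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Naive minimum seeded with the first element. If that element is NaN,
// every `v < m` comparison is false and NaN is returned.
double naiveMin(const std::vector<double>& x) {
    assert(!x.empty());
    double m = x[0];
    for (double v : x) {
        if (v < m) m = v;
    }
    return m;
}

// Minimum seeded with +infinity that skips NaN inputs (assumed convention).
double nanSkippingMin(const std::vector<double>& x) {
    double m = std::numeric_limits<double>::infinity();
    for (double v : x) {
        if (!std::isnan(v) && v < m) m = v;
    }
    return m;
}
```

Seeding with a neutral value rather than the first element makes the result independent of where the NaN sits in the input.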

v3.6.0

The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.6.0.tar.bz2

Major Updates

Added the topk() function. Documentation. ¹
Added batched matrix multiply support. ² ³
Added anisotropic diffusion, anisotropicDiffusion(). Documentation. ⁴

Features

Added support for batched matrix multiply. ¹ ²
New anisotropic diffusion function, anisotropicDiffusion(). Documentation. ³
New topk() function, which returns the top k elements along a given dimension of the input. Documentation. ⁴
New gradient diffusion example.
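
As described above, topk() returns the top k elements along a given dimension. A minimal 1-D sketch of that idea in standard C++ follows; `topKSketch` is a hypothetical name, and it illustrates the operation rather than the library's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Returns the k largest values of a 1-D input in descending order.
// The input is taken by value so the caller's data is left untouched.
std::vector<double> topKSketch(std::vector<double> values, std::size_t k) {
    assert(k <= values.size());
    std::partial_sort(values.begin(),
                      values.begin() + static_cast<std::ptrdiff_t>(k),
                      values.end(), std::greater<double>());
    values.resize(k);
    return values;
}
```

`std::partial_sort` orders only the first k positions, which is cheaper than sorting the whole input when k is small.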

Improvements

JITted select() and shift() functions for CUDA and OpenCL backends. ¹
Significant CMake improvements. ² ³ ⁴
Improved the quality of the random number generator, thanks to Ralf Stubner. ⁵
Modified af_colormap struct to match forge's definition. ⁶
Improved Black Scholes example. ⁷
Using CPack to generate installers. ⁸
Refactored black_scholes_options example to use built-in af::erfc function for cumulative normal distribution. ⁹
Reduced the scope of mutexes in memory manager ¹⁰
Official installers do not require the CUDA toolkit to be installed
Significant CMake improvements have been made. Using CPack to generate installers. ¹¹ ¹² ¹³
Corrected assert function calls in select() tests. ¹⁴

Bug fixes

Fixed shfl_down() warnings with CUDA 9. ¹
Disabled CUDA JIT debug flags on ARM architecture. ²
Fixed CLBlast install lib dir for Linux platforms where the lib directory has an arch (64) suffix. ³
Fixed assert condition in 3d morph opencl kernel. ⁴
Fix JIT errors with large non-linear kernels. ⁵
Fix bug in CPU JIT after moddims was called. ⁵
Fixed deadlock caused by calls to from the worker thread ⁶

Documentation

Fixed variable name typo in vectorization.md. ¹
Fixed AF_API_VERSION value in Doxygen config file. ²

Known issues

Several OpenCL tests failing on OSX:
- `canny_opencl, fft_opencl, gen_assign_opencl, homography_opencl, reduce_opencl, scan_by_key_opencl, solve_dense_opencl, sparse_arith_opencl, sparse_convert_opencl, where_opencl`

Community contributions

Special thanks to our contributors: Adrien F. Vincent, Cedric Nugteren, Felix, Filip Matzner, HoneyPatouceul, Patrick Lavin, Ralf Stubner, William Tambellini

v3.5.1

The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.5.1.tar.bz2

Installer CUDA Version: 8.0 (Required)
Installer OpenCL Version: 1.2 (Minimum)

Improvements

Relaxed af::unwrap() function's arguments. ¹
Changed behavior of af::array::allocated() to specify memory allocated. ¹
Removed restriction on the number of bins for af::histogram() on CUDA and OpenCL kernels. ¹

Performance

Improved JIT performance. ¹
Improved CPU element-wise operation performance. ¹
Improved regions performance using texture objects. ¹

Bug fixes

Fixed overflow issues in mean. ¹
Fixed memory leak when chaining indexing operations. ¹
Fixed bug in array assignment when using an empty array to index. ¹
Fixed bug with af::matmul() which occurred when its RHS argument was an indexed vector. ¹
Fixed deadlock bug when sparse array was used with a JIT Array. ¹
Fixed pixel tests for FAST kernels. ¹
Fixed af::replace so that it is now copy-on-write. ¹
Fixed launch configuration issues in CUDA JIT. ¹
Fixed segfaults and "Pure Virtual Call" error warnings when exiting on Windows. ¹ ²
Workaround for clEnqueueReadBuffer bug on OSX. ¹

Build

Fixed issues when compiling with GCC 7.1. ¹ ²
Eliminated unnecessary Boost dependency from CPU and CUDA backends. ¹

Misc

Updated support links to point to Slack instead of Gitter. ¹

v3.5.0

Major Updates

ArrayFire now supports threaded applications. ¹
Added Canny edge detector. ¹
Added Sparse-Dense arithmetic operations. ¹

Features

ArrayFire Threading
- af::array can be read by multiple threads
- All ArrayFire functions can be executed concurrently by multiple threads
- Threads can operate on different devices to simplify multi-device workloads
New Canny edge detector function, af::canny(). ¹
- Can automatically calculate high threshold with AF_CANNY_THRESHOLD_AUTO_OTSU
- Supports both L1 and L2 Norms to calculate gradients
New tuned OpenCL BLAS backend, CLBlast.
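
The Canny entry above mentions automatic threshold selection via AF_CANNY_THRESHOLD_AUTO_OTSU. As background, here is a minimal sketch of Otsu's method on a histogram, which picks the split that maximises between-class variance. How ArrayFire derives the Canny thresholds from it is not described here; `otsuThreshold` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Otsu's method on a histogram: returns the bin index t that maximises the
// between-class variance when bins [0, t] form one class and the remaining
// bins form the other. Ties keep the smallest such t.
std::size_t otsuThreshold(const std::vector<unsigned long long>& hist) {
    assert(!hist.empty());
    double total = 0.0, weightedTotal = 0.0;
    for (std::size_t i = 0; i < hist.size(); ++i) {
        total += static_cast<double>(hist[i]);
        weightedTotal += static_cast<double>(i) * static_cast<double>(hist[i]);
    }
    double w0 = 0.0, sum0 = 0.0, bestVar = -1.0;
    std::size_t best = 0;
    for (std::size_t t = 0; t < hist.size(); ++t) {
        w0 += static_cast<double>(hist[t]);
        sum0 += static_cast<double>(t) * static_cast<double>(hist[t]);
        const double w1 = total - w0;
        if (w0 == 0.0 || w1 == 0.0) continue;  // one class is empty
        const double mu0 = sum0 / w0;
        const double mu1 = (weightedTotal - sum0) / w1;
        const double between = w0 * w1 * (mu0 - mu1) * (mu0 - mu1);
        if (between > bestVar) {
            bestVar = between;
            best = t;
        }
    }
    return best;
}
```

For a bimodal histogram such as 10, 10, 0, 0, 10, 10, the split lands between the two clusters (bin 1 with this tie-breaking rule).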

Improvements

Converted CUDA JIT to use NVRTC instead of NVVM.
Performance improvements in af::reorder(). ¹
Performance improvements in af::array::scalar<T>(). ¹
Improved unified backend performance. ¹
ArrayFire now depends on Forge v1.0. ¹
Can now specify the FFT plan cache size using the af::setFFTPlanCacheSize() function.
Get the number of physical bytes allocated by the memory manager af_get_allocated_bytes(). ¹
af::dot() can now return a scalar value to the host. ¹

Bug Fixes

Fixed improper release of default Mersenne random engine. ¹
Fixed af::randu() and af::randn() ranges for floating point types. ¹
Fixed assignment bug in CPU backend. ¹
Fixed complex (c32,c64) multiplication in OpenCL convolution kernels. ¹
Fixed inconsistent behavior with af::replace() and af_replace_scalar(). ¹
Fixed memory leak in af_fir(). ¹
Fixed memory leaks in af_cast for sparse arrays. ¹
Fixed correctness of af_pow for complex numbers by using Cartesian form. ¹
Corrected af::select() with indexing in CUDA and OpenCL backends. ¹
Workaround for VS2015 compiler ternary bug. ¹
Fixed memory corruption in cuda::findPlan(). ¹
Argument checks in af_create_sparse_array avoid inputs of type int64. ¹
Fixed issue with indexing an array with a step size != 1. ¹

Build fixes

On OSX, utilize new GLFW package from the brew package manager. ¹ ²
Fixed CUDA PTX names generated by CMake v3.7. ¹
Support gcc > 5.x for CUDA. ¹

Examples

New genetic algorithm example. ¹

Documentation

Updated README.md to improve readability and formatting. ¹
Updated README.md to mention Julia and Nim wrappers. ¹
Improved installation instructions - docs/pages/install.md. ¹

Miscellaneous

A few improvements for ROCm support. ¹
Removed CUDA 6.5 support. ¹

Known issues

Windows
- The Windows NVIDIA driver version 37x.xx contains a bug which causes fftconvolve_opencl to fail. Upgrade or downgrade to a different version of the driver to avoid this failure.
- The following tests fail on Windows with NVIDIA hardware: threading_cuda, qr_dense_opencl, solve_dense_opencl.
macOS
- The Accelerate framework, used by the CPU backend on macOS, leverages Intel graphics cards (Iris) when there are no discrete GPUs available. This OpenCL implementation is known to give incorrect results on the following tests: lu_dense_{cpu,opencl}, solve_dense_{cpu,opencl}, inverse_dense_{cpu,opencl}.
- Certain tests intermittently fail on macOS with NVIDIA GPUs apparently due to inconsistent driver behavior: fft_large_cuda and svd_dense_cuda.
- The following tests are currently failing on macOS with AMD GPUs: cholesky_dense_opencl and scan_by_key_opencl.

v3.4.2

Deprecation Announcement

This release supports CUDA 6.5 and higher. The next ArrayFire release will support CUDA 7.0 and higher, dropping support for CUDA 6.5. Reasons for no longer supporting CUDA 6.5 include:

CUDA 7.0 NVCC supports the C++11 standard (whereas CUDA 6.5 does not), which is used by ArrayFire's CPU and OpenCL backends.
Very few ArrayFire users still use CUDA 6.5.

As a result, the older Jetson TK1 / Tegra K1 will no longer be supported in the next ArrayFire release. The newer Jetson TX1 / Tegra X1 will continue to have full capability with ArrayFire.

Docker

ArrayFire has been Dockerized.

Improvements

Implemented sparse storage format conversions between AF_STORAGE_CSR and AF_STORAGE_COO. ¹
- Directly convert between AF_STORAGE_COO <--> AF_STORAGE_CSR using the af::sparseConvertTo() function.
- af::sparseConvertTo() now also supports converting to dense.
Added cast support for sparse arrays. ¹
- Casting only changes the values array and the type. The row and column index arrays are not changed.
Reintroduced automated computation of chart axes limits for graphics functions. ¹
- The axes limits will always be the minimum/maximum of the current and new limit.
- The user can still set limits from API calls. If the user sets a limit from the API call, then the automatic limit setting will be disabled.
Using boost::scoped_array instead of boost::scoped_ptr when managing array resources. ¹
Internal performance improvements to getInfo() by using const references to avoid unnecessary copying of ArrayInfo objects. ¹
Added support for scalar af::array inputs for af::convolve() and set functions. ¹ ² ³
Performance fixes in af::fftConvolve() kernels. ¹ ²
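
The storage-format conversion above goes through af::sparseConvertTo(). To illustrate what a COO to CSR conversion involves, here is a minimal stand-alone sketch using a counting sort over row indices. The `Coo` and `Csr` structs and `cooToCsr` are hypothetical and unrelated to ArrayFire's internal layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical containers for illustration only.
struct Coo {
    int rows;
    std::vector<int> row, col;
    std::vector<double> val;
};
struct Csr {
    std::vector<int> rowPtr, col;
    std::vector<double> val;
};

// Converts COO to CSR with a counting sort over row indices. Entries of the
// same row keep their original relative order.
Csr cooToCsr(const Coo& a) {
    assert(a.row.size() == a.col.size() && a.row.size() == a.val.size());
    const std::size_t nnz = a.val.size();
    Csr out;
    out.rowPtr.assign(static_cast<std::size_t>(a.rows) + 1, 0);
    out.col.resize(nnz);
    out.val.resize(nnz);
    // Count entries per row, then prefix-sum the counts into row offsets.
    for (std::size_t i = 0; i < nnz; ++i) ++out.rowPtr[a.row[i] + 1];
    for (int r = 0; r < a.rows; ++r) out.rowPtr[r + 1] += out.rowPtr[r];
    // Scatter each entry into the next free slot of its row.
    std::vector<int> next(out.rowPtr.begin(), out.rowPtr.end() - 1);
    for (std::size_t i = 0; i < nnz; ++i) {
        const int dst = next[a.row[i]]++;
        out.col[dst] = a.col[i];
        out.val[dst] = a.val[i];
    }
    return out;
}
```

CSR replaces the per-entry row index with one offset per row, which is why the conversion is essentially a sort by row followed by a prefix sum.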

Build

Support for Visual Studio 2015 compilation. ¹ ²
Fixed FindCBLAS.cmake when PkgConfig is used. ¹

Bug fixes

Fixes to JIT when tree is large. ¹ ²
Fixed indexing bug when converting dense to sparse af::array as AF_STORAGE_COO. ¹
Fixed af::bilateral() OpenCL kernel compilation on OS X. ¹
Fixed memory leak in af::regions() (CPU) and af::rgb2ycbcr(). ¹ ² ³

Installers

Major OS X installer fixes. ¹
- Fixed installation scripts.
- Fixed installation symlinks for libraries.
Windows installer now ships with more pre-built examples.

Examples

Added af::choleskyInPlace() calls to cholesky.cpp example. ¹

Documentation

Added u8 as supported data type in getting_started.md. ¹
Fixed typos. ¹

CUDA 8 on OSX

CUDA 8.0.55 supports Xcode 8. ¹

Known Issues

Known failures with CUDA 6.5. These include all functions that use sorting. As a result, sparse storage format conversion between AF_STORAGE_COO and AF_STORAGE_CSR has been disabled for CUDA 6.5.

v3.4.1

Installers

Installers for Linux, OS X and Windows
- CUDA backend now uses CUDA 8.0.
- Uses Intel MKL 2017.
- CUDA Compute 2.x (Fermi) is no longer compiled into the library.
Installer for OS X
- The libraries shipping in the OS X Installer are now compiled with Apple Clang v7.3.1 (previously v6.1.0).
- The OS X version used is 10.11.6 (previously 10.10.5).
Installer for Jetson TX1 / Tegra X1
- Requires JetPack for L4T 2.3 (containing Linux for Tegra r24.2 for TX1).
- CUDA backend now uses CUDA 8.0 64-bit.
- Using CUDA's cusolver instead of CPU fallback.
- Uses OpenBLAS for CPU BLAS.
- All ArrayFire libraries are now 64-bit.

Improvements

Add sparse array support to af::eval(). ¹
Add OpenCL-CPU fallback support for sparse af::matmul() when running on a unified memory device. Uses MKL Sparse BLAS.
When using CUDA libdevice, pick the correct compute version based on device. ¹
OpenCL FFT now also supports prime factors 7, 11 and 13. ¹ ²
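
The FFT entry above adds prime factors 7, 11 and 13. Assuming the previously supported factors were 2, 3 and 5 (an assumption; the note only lists the additions), a length would qualify when it factors completely into those six primes. `hasOnlySmallPrimeFactors` is a hypothetical helper for checking that.

```cpp
#include <cassert>

// Returns true if n factors completely into 2, 3, 5, 7, 11 and 13.
// The base set 2, 3, 5 is an assumption; 7, 11, 13 come from the note above.
bool hasOnlySmallPrimeFactors(unsigned long long n) {
    if (n == 0) return false;
    const unsigned primes[] = {2, 3, 5, 7, 11, 13};
    for (unsigned p : primes) {
        while (n % p == 0) n /= p;  // strip every occurrence of p
    }
    return n == 1;
}
```

For example, 8008 = 8 x 7 x 11 x 13 passes the check, while 34 = 2 x 17 does not.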

Bug Fixes

Allow CUDA libdevice to be detected from custom directory.
Fix aarch64 detection on Jetson TX1 64-bit OS. ¹
Add missing definition of af_set_fft_plan_cache_size in unified backend. ¹
Fix initial values for af::min() and af::max() operations. ¹ ²
Fix distance calculation in af::nearestNeighbour for CUDA and OpenCL backend. ¹ ²
Fix OpenCL bug where scalars were passed incorrectly to compile options. ¹
Fix bug in af::Window::surface() with respect to dimensions and ranges. ¹
Fix possible double free corruption in af_assign_seq(). ¹
Add missing eval for key in af::scanByKey in CPU backend. ¹
Fixed creation of sparse values array using AF_STORAGE_COO. ¹ ¹

Examples

Add a Conjugate Gradient solver example to demonstrate sparse and dense matrix operations. ¹

CUDA Backend

When using CUDA 8.0, compute 2.x are no longer in default compute list.
- This follows CUDA 8.0 deprecating computes 2.x.
- Default computes for CUDA 8.0 will be 30, 50, 60.
When using CUDA pre-8.0, the default selection remains 20, 30, 50.
CUDA backend now uses -arch=sm_30 for PTX compilation as default.
- Unless compute 2.0 is enabled.
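
The default selection described above can be summarised as a small lookup. This is only a sketch of the rule as stated in these notes; default_computes is a hypothetical name and does not correspond to a function in the ArrayFire build system.

```cpp
#include <vector>

// Hypothetical helper mirroring the defaults listed above:
// CUDA 8.0 and later -> 30, 50, 60; earlier toolkits -> 20, 30, 50.
std::vector<int> default_computes(int cuda_major_version) {
    if (cuda_major_version >= 8) return {30, 50, 60};
    return {20, 30, 50};
}
```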

Known Issues

af::lu() on CPU is known to give incorrect results when run on OS X 10.11 or 10.12 and compiled with Accelerate Framework.
- Since the OS X Installer libraries use MKL rather than Accelerate Framework, this issue does not affect those libraries.

v3.4.0

Major Updates

Sparse Matrix and BLAS.
Faster JIT for CUDA and OpenCL.
Support for random number generator engines.
Improvements to graphics.

Features

Sparse Matrix and BLAS
- Support for CSR and COO storage types.
- Sparse-Dense Matrix Multiplication and Matrix-Vector Multiplication as a part of af::matmul() using AF_STORAGE_CSR format for sparse.
- Conversion to and from dense matrix to CSR and COO storage types.
Faster JIT
- Performance improvements for CUDA and OpenCL JIT functions.
- Support for evaluating multiple outputs in a single kernel. See af::array::eval() for more.
Random Number Generation
- af::randomEngine(): A random engine class to handle setting the type and seed for random number generator engines.
- Supported engine types are (af_random_engine_type):
Graphics
- Using Forge v0.9.0
- Vector Field plotting functionality.
- Removed GLEW and replaced with glbinding.
  - Removed usage of GLEW after support for MX (multithreaded) was dropped in v2.0.
- Multiple overlays on the same window are now possible.
  - Overlays support for same type of object (2D/3D)
  - Supported by af::Window::plot, af::Window::hist, af::Window::surface, af::Window::vectorField.
- New API to set axes limits for graphs.
  - Draw calls do not automatically compute the limits. This is now under user control.
  - af::Window::setAxesLimits can be used to set axes limits automatically or manually.
  - af::Window::setAxesTitles can be used to set axes titles.
- New API for plot and scatter:
  - af::Window::plot() and af::Window::scatter() now can handle 2D and 3D and determine appropriate order.
  - af_draw_plot_nd()
  - af_draw_plot_2d()
  - af_draw_plot_3d()
  - af_draw_scatter_nd()
  - af_draw_scatter_2d()
  - af_draw_scatter_3d()
New interpolation methods
- Applies to
Support for complex mathematical functions
- Add complex support for Trigonometric functions, af::sqrt(), af::log().
af::medfilt1(): Median filter for 1-d signals
Generalized scan functions: scan and scanByKey
- Now supports inclusive or exclusive scans
- Supports binary operations defined by af_binary_op.
Image Moments functions
Add af::getSizeOf() function for af_dtype
Explicitly instantiate af::array::device() for void *
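
To make the difference between inclusive and exclusive scans with a user-chosen binary operation concrete, here is a sequential sketch in standard C++. It illustrates the semantics only and is not how ArrayFire implements scan or scanByKey; scan_sketch and its parameters are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Sequential scan over `in` with a binary operation `op`.
// `identity` is the neutral element of `op` (0 for addition, 1 for
// multiplication, and so on).
// inclusive: out[i] combines in[0..i]
// exclusive: out[i] combines in[0..i-1], so out[0] is the identity
template <typename T, typename Op>
std::vector<T> scan_sketch(const std::vector<T>& in, Op op, T identity,
                           bool inclusive) {
    std::vector<T> out(in.size());
    T acc = identity;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (inclusive) {
            acc = op(acc, in[i]);
            out[i] = acc;
        } else {
            out[i] = acc;
            acc = op(acc, in[i]);
        }
    }
    return out;
}
```

With addition over the input 1, 2, 3, 4, the inclusive scan yields 1, 3, 6, 10 and the exclusive scan yields 0, 1, 3, 6.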

Bug Fixes

Fixes to edge-cases in Morphological Operations.
Makes JIT tree size consistent between devices.
Delegate higher-dimension in Convolutions to correct dimensions.
Indexing fixes with C++11.
Handle empty arrays as inputs in various functions.
Fix bug when single element input to af::median.
Fix bug in calculation of time from af::timeit().
Fix bug in floating point numbers in af::seq.
Fixes for OpenCL graphics interop on NVIDIA devices.
Fix bug when compiling large kernels for AMD devices.
Fix bug in af::bilateral when shared memory is over the limit.
Fix bug in kernel header compilation tool bin2cpp.
Fix initial values for Morphological Operations functions.
Fix bugs in af::homography() CPU and OpenCL kernels.
Fix bug in CPU TNJ.

Improvements

CUDA 8 and compute 6.x (Pascal) support, current installer ships with CUDA 7.5.
User controlled FFT plan caching.
CUDA performance improvements for wrap, unwrap and Interpolation and approximation.
Fallback for CUDA-OpenGL interop when no device supports OpenGL.
Additional forms of batching with the transform functions. New behavior defined here.
Update to OpenCL2 headers.
Support for integration with external OpenCL contexts.
Performance improvements to internal copy in CPU Backend.
Performance improvements to af::select and af::replace CUDA kernels.
Enable OpenCL-CPU offload by default for devices with Unified Host Memory.
- To disable, use the environment variable AF_OPENCL_CPU_OFFLOAD=0.

Build

Compilation speedups.
Build fixes with MKL.
Error message when CMake CUDA Compute Detection fails.
Several CMake build issues with Xcode generator fixed.
Fix multiple OpenCL definitions at link time.
Fix lapacke detection in CMake.
Update build tags of
- clBLAS
- clFFT
- Boost.Compute
- Forge
- glbinding
Fix builds with GCC 6.1.1 and GCC 5.3.0.

Installers

All installers now ship with ArrayFire libraries built with MKL 2016.
All installers now ship with Forge development files and examples included.
CUDA Compute 2.0 has been removed from the installers. Please contact us directly if you have a special need.

Examples

Added example simulating gravity for demonstration of vector field.
Improvements to financial/black_scholes_options.cpp example.
Improvements to graphics/gravity_sim.cpp example.
Fix graphics examples to use af::Window::setAxesLimits and af::Window::setAxesTitles functions.

Documentation & Licensing

ArrayFire copyright and trademark policy
Fixed grammar in license.
Add license information for glbinding.
Remove license information for GLEW.
Random123 now applies to all backends.
Random number functions are now under Random Number Generation.

Deprecations

The following functions have been deprecated and may be modified or removed permanently from future versions of ArrayFire.

af::Window::plot3(): Use af::Window::plot instead.
af_draw_plot(): Use af_draw_plot_nd or af_draw_plot_2d instead.
af_draw_plot3(): Use af_draw_plot_nd or af_draw_plot_3d instead.
af::Window::scatter3(): Use af::Window::scatter instead.
af_draw_scatter(): Use af_draw_scatter_nd or af_draw_scatter_2d instead.
af_draw_scatter3(): Use af_draw_scatter_nd or af_draw_scatter_3d instead.

Known Issues

Certain CUDA functions are known to be broken on Tegra K1. The following ArrayFire tests are currently failing:

assign_cuda
harris_cuda
homography_cuda
median_cuda
orb_cuda
sort_cuda
sort_by_key_cuda
sort_index_cuda

v3.3.2

Improvements

Family of Sort functions now support higher order dimensions.
Improved performance of batched sort on dim 0 for all Sort functions.
Median now also supports higher order dimensions.

Bug Fixes

Fixes to error handling in C++ API for binary functions.
Fixes to external OpenCL context management.
Fixes to JPEG_GREYSCALE for FreeImage versions <= 3.154.
Fixes for non-float inputs to af::rgb2gray().

Build

Disable CPU Async when building with GCC < 4.8.4.
Add option to disable CPUID from CMake.
More verbose message when CUDA Compute Detection fails.
Print message to use CUDA library stub from CUDA Toolkit if CUDA Library is not found from default paths.
Build Fixes on Windows.
- For compiling tests out of source.
- For compiling ArrayFire with static MKL.
Exclude <sys/sysctl.h> when building on GNU Hurd.
Add manual CMake options to build DEB and RPM packages.

Documentation

Fixed documentation for af::replace().
Fixed images in Using on OSX page.

Installer

Linux x64 installers will now be compiled with GCC 4.9.2.
OSX installer gives better error messages on brew failures and now includes a link to Fixing OS X Installer Failures (https://github.com/arrayfire/arrayfire/wiki/Fixing-Common-OS-X-Installer-Failures) for brew installation failures.

v3.3.1

Bug Fixes

Fixes to af::array::device()
- CPU Backend: evaluate arrays before returning pointer with asynchronous calls in CPU backend.
- OpenCL Backend: fix segfaults when device pointers are requested on empty arrays.
Fixed af::operator%() to use mod instead of rem.
Fixed array destruction when backends are switched in Unified API.
Fixed indexing after af::moddims() is called.
Fixes FFT calls for CUDA and OpenCL backends when used on multiple devices.
Fixed unresolved external for some functions from af::array::array_proxy class.
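
The af::operator%() fix above switches between two notions of remainder. The sketch below contrasts them using the C++ standard library, under the assumption that rem behaves like std::remainder and mod like std::fmod. These notes do not state that mapping, so treat it as an assumption. mod_like and rem_like are hypothetical names.

```cpp
#include <cmath>

// Remainder with a truncated quotient; the result has the sign of x.
// Assumed here to correspond to "mod".
double mod_like(double x, double y) { return std::fmod(x, y); }

// Remainder with the quotient rounded to the nearest integer; the result
// can be negative even when both inputs are positive.
// Assumed here to correspond to "rem".
double rem_like(double x, double y) { return std::remainder(x, y); }
```

For 5 and 3 the two disagree: mod_like gives 2 (5 - 1 * 3), while rem_like gives -1 (5 - 2 * 3).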

Build

CMake compiles files in alphabetical order.
CMake fixes for BLAS and LAPACK on some Linux distributions.

Improvements

Fixed OpenCL FFT performance regression.
af::array::device() on OpenCL backend returns cl_mem instead of (void*)cl::Buffer*.
In Unified backend, load versioned libraries at runtime.

Documentation

Reorganized, cleaner README file.
Replaced non-free lena image in assets with free-to-distribute lena image.

v3.3.0

Major Updates

CPU backend supports asynchronous execution.
Performance improvements to OpenCL BLAS and FFT functions.
Improved performance of memory manager.
Improvements to visualization functions.
Improved sorted order for OpenCL devices.
Integration with external OpenCL projects.

Features

af::getActiveBackend(): Returns the current backend being used.
Scatter plot added to graphics.
af::transform() now supports perspective transformation matrices.
af::infoString(): Returns af::info() as a string.
af::printMemInfo(): Print a table showing information about buffer from the memory manager
- The AF_MEM_INFO macro prints numbers and total sizes of all buffers (requires including af/macros.h)
af::allocHost(): Allocates memory on host.
af::freeHost(): Frees host side memory allocated by arrayfire.
OpenCL functions can now use CPU implementation.
- Currently limited to Unified Memory devices (CPU and On-board Graphics).
- Functions: af::matmul() and all LAPACK functions.
- Takes advantage of optimized libraries such as MKL without doing memory copies.
- Use the environment variable AF_OPENCL_CPU_OFFLOAD=1 to take advantage of this feature.
Functions specific to OpenCL backend.
- afcl::addDevice(): Adds an external device and context to ArrayFire's device manager.
- afcl::deleteDevice(): Removes an external device and context from ArrayFire's device manager.
- afcl::setDevice(): Sets an external device and context from ArrayFire's device manager.
- afcl::getDeviceType(): Gets the device type of the current device.
- afcl::getPlatform(): Gets the platform of the current device.
af::createStridedArray() allows array creation with user-defined strides and device pointer.
Expose functions that provide information about memory layout of Arrays.
- af::getStrides(): Gets the strides for each dimension of the array.
- af::getOffset(): Gets the offsets for each dimension of the array.
- af::getRawPtr(): Gets raw pointer to the location of the array on device.
- af::isLinear(): Returns true if all elements in the array are contiguous.
- af::isOwner(): Returns true if the array owns the raw pointer, false if it is a sub-array.
af::getDeviceId(): Gets the device id on which the array resides.
af::isImageIOAvailable(): Returns true if ArrayFire was compiled with FreeImage enabled
af::isLAPACKAvailable(): Returns true if ArrayFire was compiled with LAPACK functions enabled
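
The memory layout functions listed above (strides, offset, linearity) can be pictured with a small model of a 4-dimensional, column-major array. This is a simplified sketch in standard C++, not ArrayFire code: Dim4, contiguous_strides, linear_index and is_contiguous are hypothetical names, and the contiguity test ignores special cases such as dimensions of size 1.

```cpp
#include <array>
#include <cstddef>

using Dim4 = std::array<std::size_t, 4>;

// Strides (in elements) of a densely packed array whose first dimension
// varies fastest, i.e. column-major order.
Dim4 contiguous_strides(const Dim4& dims) {
    Dim4 s{};
    s[0] = 1;
    for (std::size_t i = 1; i < 4; ++i) s[i] = s[i - 1] * dims[i - 1];
    return s;
}

// Position of element idx, in elements, relative to the raw pointer.
std::size_t linear_index(const Dim4& idx, const Dim4& strides,
                         std::size_t offset) {
    std::size_t pos = offset;
    for (std::size_t i = 0; i < 4; ++i) pos += idx[i] * strides[i];
    return pos;
}

// Simplified contiguity test: the strides match those of a dense array
// with the same dimensions.
bool is_contiguous(const Dim4& dims, const Dim4& strides) {
    return strides == contiguous_strides(dims);
}
```

A 3 x 4 array has strides 1, 3, 12, 12. A 2 x 4 sub-array that keeps those parent strides no longer passes the contiguity test, which is the situation user-defined strides are meant to describe.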

Bug Fixes

Fixed errors when using 3D / 4D arrays in select and replace
Fixed JIT errors on AMD devices for OpenCL backend.
Fixed imageio bugs for 16 bit images.
Fixed bugs when loading and storing images natively.
Fixed bug in FFT for NVIDIA GPUs when using OpenCL backend.
Fixed bug when using external context with OpenCL backend.
Fixed memory leak in af_median_all().
Fixed memory leaks and performance in graphics functions.
Fixed bugs when indexing followed by moddims.
af_get_revision() now returns actual commit rather than AF_REVISION.
Fixed releasing arrays when using different backends.
OS X OpenCL: LAPACK functions on CPU devices use OpenCL offload (previously threw errors).
Add support for 32-bit integer image types in Image IO.
Fixed set operations for row vectors
Fixed bugs in af::meanShift() and af::orb().

Improvements

Optionally offload BLAS and LAPACK functions to CPU implementations to improve performance.
Performance improvements to the memory manager.
Error messages are now more detailed.
Improved sorted order for OpenCL devices.
JIT heuristics can now be tweaked using environment variables. See Environment Variables tutorial.
Add BUILD_<BACKEND> options to examples and tests to toggle backends when compiling independently.

Examples

New visualization example simulating gravity.

Build

Support for Intel icc compiler
Support to compile with Intel MKL as a BLAS and LAPACK provider
Tests are now available for building as standalone (like examples)
Tests can now be built as a single file for each backend
Better handling of NONFREE build options
Searching for GLEW in CMake default paths
Fixes for compiling with MKL on OSX.

Installers

Improvements to OSX Installer
- CMake config files are now installed with libraries
- Independent options for installing examples and documentation components

Deprecations

af_lock_device_arr is now deprecated and will be removed in v4.0.0. Use af_lock_array() instead.
af_unlock_device_arr is now deprecated and will be removed in v4.0.0. Use af_unlock_array() instead.

Documentation

Fixes to documentation for af::matchTemplate().
Improved documentation for deviceInfo.
Fixes to documentation for af::exp().

Known Issues

Solve OpenCL fails on NVIDIA Maxwell devices for f32 and c32 when M > N and K % 4 is 1 or 2.

v3.2.2

Bug Fixes

Fixed memory leak in CUDA Random number generators
Fixed bug in af::select() and af::replace() tests
Fixed exception thrown when printing empty arrays with af::print()
Fixed bug in CPU random number generation. Changed the generator to mt19937
Fixed exception handling (internal)
- Exceptions now show function, short file name and line number
- Added AF_RETURN_ERROR macro to handle returning errors.
- Removed THROW macro, and renamed AF_THROW_MSG to AF_THROW_ERR.
Fixed bug in af::identity() that may have affected CUDA Compute 5.2 cards
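
As a rough illustration of an error message that carries the function, short file name and line number, consider the sketch below. It does not reproduce ArrayFire's internal macros or message format; short_file_name, format_error and SKETCH_ERROR_MSG are hypothetical names, and the message layout is invented for the example.

```cpp
#include <cstring>
#include <string>

// Strips any leading directories so that only the file name remains.
const char* short_file_name(const char* path) {
    const char* last = std::strrchr(path, '/');
    const char* backslash = std::strrchr(path, '\\');
    // Keep whichever separator occurs later in the path.
    if (backslash && (!last || backslash > last)) last = backslash;
    return last ? last + 1 : path;
}

// Builds a message containing function, short file name and line number.
std::string format_error(const char* func, const char* file, int line,
                         const std::string& msg) {
    return std::string("In function ") + func + "\nIn file " +
           short_file_name(file) + ":" + std::to_string(line) + "\n" + msg;
}

// Captures the call site automatically.
#define SKETCH_ERROR_MSG(msg) format_error(__func__, __FILE__, __LINE__, (msg))
```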

Build

Added a MIN_BUILD_TIME option to build with minimum optimization compiler flags resulting in faster compile times
Fixed issue in CBLAS detection by CMake
Fixed tests failing for builds without optional components FreeImage and LAPACK
Added a test for unified backend
Only info and backend tests are now built for unified backend
Sort tests execution alphabetically
Fixed compilation flags and errors in tests and examples
Moved AF_REVISION and AF_COMPILER_STR into src/backend. Previously, because the revision is updated with every commit, all of ArrayFire had to be rebuilt each time.
- v3.3 will add an af_get_revision() function to get the revision string.
Clean up examples
- Remove getchar for Windows (this will be handled by the installer)
- Other miscellaneous code cleanup
- Fixed bug in plot3.cpp example
Rename clBLAS/clFFT external project suffix from external -> ext
Add OpenBLAS as a lapack/lapacke alternative

Improvements

Added AF_MEM_INFO macro to print memory info from ArrayFire's memory manager (cross issue)
Added additional paths for searching for libaf* for Unified backend on unix-style OS.
- Note: This still requires dependencies such as forge, CUDA, NVVM etc to be in LD_LIBRARY_PATH as described in Unified Backend
Create streams for devices only when required in CUDA Backend

Documentation

Hide scrollbars appearing for pre and code styles
Fix documentation for af::replace
Add code sample for converting the output of af::getAvailableBackends() into bools
Minor fixes in documentation
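
The conversion mentioned above, from the integer returned by af::getAvailableBackends() into individual booleans, amounts to testing bits. The sketch below is not the sample from the documentation. It assumes that the return value is a bitwise OR of backend flags with CPU = 1, CUDA = 2 and OpenCL = 4, and it redefines those values locally so that it compiles without ArrayFire. Real code should use the named constants from the ArrayFire headers.

```cpp
// Assumed flag values; prefer the named backend constants from the
// ArrayFire headers in real code.
enum BackendBit { kCpu = 1, kCuda = 2, kOpenCL = 4 };

struct AvailableBackends {
    bool cpu;
    bool cuda;
    bool opencl;
};

// Splits the bitmask into one boolean per backend.
AvailableBackends decode_backends(int mask) {
    return {(mask & kCpu) != 0, (mask & kCuda) != 0, (mask & kOpenCL) != 0};
}
```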

v3.2.1

Bug Fixes

Fixed bug in homography()
Fixed bug in behavior of af::array::device()
Fixed bug when indexing with span along trailing dimension
Fixed bug when indexing in GFor
Fixed bug in CPU information fetching
Fixed compilation bug in unified backend caused by missing link library
Add missing symbol for af_draw_surface()

Build

Tests can now be used as a standalone project
- Tests can now be built using pre-compiled libraries
- Similar to how the examples are built
The install target now installs the examples source irrespective of the BUILD_EXAMPLES value
- Examples are not built if BUILD_EXAMPLES is off

Documentation

HTML documentation is now built and installed in docs/html
Added documentation for af::seq class
Updated Matrix Manipulation tutorial
Examples list is now generated by CMake
- Examples are now listed as dir/example.cpp
Removed dummy groups used for indexing documentation (affected doxygen < 1.8.9)

v3.2.0

Major Updates

Added Unified backend
- Allows switching backends at runtime
- Read Unified Backend for more.
Support for 16-bit integers (s16 and u16)
- All functions that support 32-bit integer types (s32, u32) now also support 16-bit integer types

Function Additions

Unified Backend
- af::setBackend() - Sets a backend as active
- af::getBackendCount() - Gets the number of backends available for use
- af::getAvailableBackends() - Returns information about available backends
- af::getBackendId() - Gets the backend enum for an array
Vision
- af::homography() - Homography estimation
- af::gloh() - GLOH Descriptor for SIFT
Image Processing
- af::loadImageNative() - Load an image as native data without modification
- af::saveImageNative() - Save an image without modifying data or type
Graphics
- af::Window::plot3() - 3-dimensional line plot
- af::Window::surface() - 3-dimensional curve plot
Indexing
CUDA Backend Specific
- afcu::setNativeId() - Set the CUDA device with given native id as active
  - ArrayFire uses a modified order for devices. The native id for a device can be retrieved using nvidia-smi
OpenCL Backend Specific
- afcl::setDeviceId() - Set the OpenCL device using the clDeviceId

Other Improvements

Added c32 and c64 support for af::isNaN(), af::isInf() and af::iszero()
Added CPU information for x86 and x86_64 architectures in CPU backend's af::info()
Batch support for af::approx1() and af::approx2()
- Now can be used with gfor as well
Added s64 and u64 support to:
- af::sort() (along with sort index and sort by key)
- af::setUnique(), af::setUnion(), af::setIntersect()
- af::convolve() and af::fftConvolve()
- af::histogram() and af::histEqual()
- af::lookup()
- af::mean()
Added AF_MSG macro

Build Improvements

Submodules update is now automatically called if not cloned recursively
Fixes for compilation on Visual Studio 2015
Option to use fallback to CPU LAPACK for linear algebra functions in case of CUDA 6.5 or older versions.

Bug Fixes

Fixed memory leak in af::susan()
Fixed failing test in af::lower() and af::upper() for CUDA compute 53
Fixed bug in CUDA for indexing out of bounds
Fixed dims check in af::iota()
Fixed out-of-bounds access in af::sift()
Fixed memory allocation in af::fast() OpenCL
Fixed memory leak in image I/O functions
af::dog() now returns floating-point type arrays

Documentation Updates

Improved tutorials documentation
- More detailed Using on Linux, OSX, Windows pages.
Added return type information for functions that return different type arrays

New Examples

Graphics
- Plot3
- Surface
Shallow Water Equation
Basic as a Unified backend example

Installers

All installers now include the Unified backend and corresponding CMake files
Visual Studio projects include Unified in the Platform Configurations
Added installer for Jetson TX1
SIFT and GLOH do not ship with the installers as SIFT is protected by patents that do not allow commercial distribution without licensing.

v3.1.3

Bug Fixes

Fixed bugs in various OpenCL kernels without offset additions
Remove ARCH_32 and ARCH_64 flags
Fix missing symbols when freeimage is not found
Use CUDA driver version for Windows
Improvements to SIFT
Fixed memory leak in median
Fixes for Windows compilation when not using MKL #1047
Fixes for building without LAPACK

Other

Documentation: Fixed documentation for select and replace
Documentation: Fixed documentation for af_isnan

v3.1.2

Bug Fixes

Fixed bug in assign that was causing test to fail
Fixed bug in convolve. Frequency condition now depends on kernel size only
Fixed bug in indexed reductions for complex type in OpenCL backend
Fixed bug in kernel name generation in ireduce for OpenCL backend
Fixed non-linear to linear indices in ireduce
Fixed bug in reductions for small arrays
Fixed bug in histogram for indexed arrays
Fixed compiler error CPUID for non-compliant devices
Fixed failing tests on i386 platforms
Add missing AFAPI

Other

Documentation: Added missing examples and other corrections
Documentation: Fixed warnings in documentation building
Installers: Send error messages to log file in OSX Installer

v3.1.1

Installers

CUDA backend now depends on CUDA 7.5 toolkit
OpenCL backend now requires OpenCL 1.2 or greater

Bug Fixes

Fixed bug in reductions after indexing
Fixed bug in indexing when using reverse indices

Build

cmake now includes PKG_CONFIG in the search path for CBLAS and LAPACKE libraries
heston_model.cpp example now builds with the default ArrayFire cmake files after installation

Other

Fixed bug in image_editing.cpp

v3.1.0

Function Additions

Computer Vision Functions
- af::nearestNeighbour() - Nearest Neighbour with SAD, SSD and SHD distances
- af::harris() - Harris Corner Detector
- af::susan() - Susan Corner Detector
- af::sift() - Scale Invariant Feature Transform (SIFT)
  - "Method and apparatus for identifying scale invariant features in an image and use of same for locating an object in an image," David G. Lowe, US Patent 6,711,293 (March 23, 2004). Provisional application filed March 8, 1999. Assignee: The University of British Columbia. For further details, contact David Lowe (lowe@cs.ubc.ca) or the University-Industry Liaison Office of the University of British Columbia.
  - SIFT is available for compiling but does not ship with ArrayFire hosted installers/pre-built libraries
- af::dog() - Difference of Gaussians
Image Processing Functions
- ycbcr2rgb() and rgb2ycbcr() - RGB <-> YCbCr color space conversion
- wrap() and unwrap() - Wrap and Unwrap
- sat() - Summed Area Tables
- loadImageMem() and saveImageMem() - Load and Save images to/from memory
- af_image_format - Added imageFormat (af_image_format) enum
Array & Data Handling
- copy() - Copy
- array::lock() and array::unlock() - Lock and Unlock
- select() and replace() - Select and Replace
- Get array reference count (af_get_data_ref_count)
Signal Processing
- fftInPlace() - 1D in place FFT
- fft2InPlace() - 2D in place FFT
- fft3InPlace() - 3D in place FFT
- ifftInPlace() - 1D in place Inverse FFT
- ifft2InPlace() - 2D in place Inverse FFT
- ifft3InPlace() - 3D in place Inverse FFT
- fftR2C() - Real to complex FFT
- fftC2R() - Complex to Real FFT
Linear Algebra
- svd() and svdInPlace() - Singular Value Decomposition
Other operations
- sigmoid() - Sigmoid
- Sum (with option to replace NaN values)
- Product (with option to replace NaN values)
Graphics
- Window::setSize() - Window resizing using Forge API
Utility
- Allow users to set print precision (print, af_print_array_gen)
- saveArray() and readArray() - Stream arrays to binary files
- toString() - toString function returns the array and data as a string
CUDA specific functionality
- getStream() - Returns default CUDA stream ArrayFire uses for the current device
- getNativeId() - Returns native id of the CUDA device

Improvements

dot
- Allow complex inputs with conjugate option
AF_INTERP_LOWER interpolation
- For resize, rotate and transform based functions
64-bit integer support
- For reductions, random, iota, range, diff1, diff2, accum, join, shift and tile
convolve
- Support for non-overlapping batched convolutions
Complex Arrays
- Fix binary ops on complex inputs of mixed types
- Complex type support for exp
tile
- Performance improvements by using JIT when possible.
Add AF_API_VERSION macro
- Allows disabling of API to maintain consistency with previous versions
Other Performance Improvements
- Use reference counting to reduce unnecessary copies
- CPU Backend
  - Device properties for CPU
  - Improved performance when all buffers are indexed linearly
- CUDA Backend
  - Use streams in CUDA (no longer using default stream)
  - Using async cudaMem ops
  - Add 64-bit integer support for JIT functions
  - Performance improvements for CUDA JIT for non-linear 3D and 4D arrays
- OpenCL Backend
  - Improve compilation times for OpenCL backend
  - Performance improvements for non-linear JIT kernels on OpenCL
  - Improved shared memory load/store in many OpenCL kernels (PR 933)
  - Using cl.hpp v1.2.7

Bug Fixes

Common
- Fix compatibility of c32/c64 arrays when operating with scalars
- Fix median for all values of an array
- Fix double free issue when indexing (30cbbc7)
- Fix bug in rank (https://github.com/arrayfire/arrayfire/issues/901)
- Fix default values for scale throwing exception
- Fix conjg raising exception on real input
- Fix bug when using conjugate transpose for vector input
- Fix issue with const input for array_proxy::get()
CPU Backend
- Fix randn generating same sequence for multiple calls
- Fix setSeed for randu
- Fix casting to and from complex
- Check NULL values when allocating memory
- Fix offset issue for CPU element-wise operations (https://github.com/arrayfire/arrayfire/issues/923)

New Examples

Match Template
Susan
Heston Model (contributed by Michael Nowotny)

Installer

Fixed bug in automatic detection of ArrayFire when using with CMake in Windows
The Linux libraries are now compiled with static version of FreeImage

Known Issues

OpenBlas can cause issues with QR factorization in CPU backend
FreeImage older than 3.10 can cause issues with loadImageMem and saveImageMem
OpenCL backend issues on OSX
- AMD GPUs not supported because of driver issues
- Intel CPUs not supported
- Linear algebra functions do not work on Intel GPUs.
Stability and correctness issues with open source OpenCL implementations such as Beignet, GalliumCompute.

v3.0.2

Bug Fixes

Added missing symbols from the compatible API
Fixed a bug affecting corner rows and elements in af::grad()
Fixed linear interpolation bugs affecting large images in the following:

Documentation

Added missing documentation for af::constant()
Added missing documentation for array::scalar()
Added supported input types for functions in arith.h

v3.0.1

Bug Fixes

Fixed header to work in Visual Studio 2015
Fixed a bug in batched mode for FFT based convolutions
Fixed graphics issues on OSX
Fixed various bugs in visualization functions

Other improvements

Improved fractal example
New OSX installer
Improved Windows installer
- Default install path has been changed
Fixed bug in machine learning examples

v3.0.0

Major Updates

ArrayFire is now open source
Major changes to the visualization library
Introducing handle based C API
New backend: CPU fallback available for systems without GPUs
Dense linear algebra functions available for all backends
Support for 64 bit integers

Function Additions

Data generation functions
- range()
- iota()
Computer Vision Algorithms
- features()
  - A data structure to hold features
- fast()
  - FAST feature detector
- orb()
  - ORB: a feature descriptor extractor
Image Processing
- convolve1(), convolve2(), convolve3()
  - Specialized versions of convolve() to enable better batch support
- fftconvolve1(), fftconvolve2(), fftconvolve3()
  - Convolutions in frequency domain to support larger kernel sizes
- dft(), idft()
  - Unified functions for calling multi dimensional ffts.
- matchTemplate()
  - Match a kernel in an image
- sobel()
  - Get sobel gradients of an image
- rgb2hsv(), hsv2rgb(), rgb2gray(), gray2rgb()
  - Explicit function calls to colorspace conversions
- erode3d(), dilate3d()
  - Explicit erode and dilate calls for image morphing
Linear Algebra
- matmulNT(), matmulTN(), matmulTT()
  - Specialized versions of matmul() for transposed inputs
- luInPlace(), choleskyInPlace(), qrInPlace()
  - In place factorizations to improve memory requirements
- solveLU()
  - Specialized solve routines to improve performance
- OpenCL backend now supports Linear Algebra functions
Other functions
- lookup() - lookup indices from a table
- batchFunc() - helper function to perform batch operations
Visualization functions
- Support for multiple windows
- window.hist()
  - Visualize the output of the histogram
C API
- Removed old pointer based C API
- Introducing handle based C API
- Just In Time compilation available in C API
- C API has feature parity with C++ API
- bessel functions removed
- cross product functions removed
- Kronecker product functions removed

Performance Improvements

Improvements across the board for OpenCL backend

API Changes

print is now af_print()
seq(): The step parameter is now the third input
- seq(start, step, end) changed to seq(start, end, step)
gfor(): The iterator now needs to be seq()
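
The effect of the seq() argument reordering can be seen by expanding a sequence by hand. The sketch below models seq(start, end, step) in standard C++ under the assumption that the end value is inclusive. expand_seq is a hypothetical helper and not part of ArrayFire.

```cpp
#include <cmath>
#include <vector>

// Expands (begin, end, step) into explicit values.
// Assumption: `end` is inclusive, so (0, 9, 2) yields 0, 2, 4, 6, 8.
std::vector<double> expand_seq(double begin, double end, double step) {
    std::vector<double> out;
    if (step == 0.0) return out;  // a zero step has no meaningful expansion
    // Number of steps that fit between begin and end, plus the first value.
    const double count = std::floor((end - begin) / step) + 1.0;
    for (double i = 0.0; i < count; i += 1.0) out.push_back(begin + i * step);
    return out;
}
```

Code written for the old order, such as seq(0, 2, 9) meaning "0 to 9 in steps of 2", is now read as "0 to 2 in steps of 9". In this model that expands to the single value 0, so call sites need their last two arguments swapped.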

Deprecated Function APIs

Deprecated APIs are in af/compatible.h

devicecount() changed to getDeviceCount()
deviceset() changed to setDevice()
deviceget() changed to getDevice()
loadimage() changed to loadImage()
saveimage() changed to saveImage()
gaussiankernel() changed to gaussianKernel()
alltrue() changed to allTrue()
anytrue() changed to anyTrue()
setunique() changed to setUnique()
setunion() changed to setUnion()
setintersect() changed to setIntersect()
histequal() changed to histEqual()
colorspace() changed to colorSpace()
filter() deprecated. Use convolve1() and convolve2()
mul() changed to product()
deviceprop() changed to deviceProp()

Known Issues

OpenCL backend issues on OSX
- AMD GPUs not supported because of driver issues
- Intel CPUs not supported
- Linear algebra functions do not work on Intel GPUs.
Stability and correctness issues with open source OpenCL implementations such as Beignet, GalliumCompute.