Releases: ROCm/rocBLAS
Releases · ROCm/rocBLAS
rocBLAS 4.0.0 for ROCm 6.0.0
Added
- Addition of beta API rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3
- Added input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched
- Added rocblas_status_excluded_from_build to be used when calling functions which require Tensile when using rocBLAS built without Tensile
- Added system for async kernel launches setting a failure rocblas_status based on hipPeekAtLastError discrepancy
Optimized
- Trsm performance for small sizes m < 32 && n < 32
Deprecated
- In a future release atomic operations will be disabled by default so results will be repeatable. Atomic operations can always be enabled or disabled using the function rocblas_set_atomics_mode. Enabling atomic operations can improve performance.
Removed
- rocblas_gemm_ext2 API function is removed
- in-place trmm API from Legacy BLAS is removed. It is replaced by an API that supports both in-place and out-of-place trmm
- int8x4 support is removed. int8 support is unchanged
- The #define STDC_WANT_IEC_60559_TYPES_EXT has been removed from rocblas-types.h. Users who want ISO/IEC TS 18661-3:2015 functionality must define STDC_WANT_IEC_60559_TYPES_EXT before including float.h, math.h, and rocblas.h
- The default build removes device code for gfx803 architecture from the fat binary
Fixed
- Make offset calculations for rocBLAS functions 64 bit safe. Fixes for very large leading dimension or increment potentially causing overflow:
- Level2: gbmv, gemv, hbmv, sbmv, spmv, tbmv, tpmv, tbsv, tpsv
- Lazy loading to support heterogeneous architecture setup and load appropriate tensile library files based on the device's architecture
- Guard against no-op kernel launches resulting in potential hipGetLastError
Changed
- Default verbosity of rocblas-test reduced. To see all tests set environment variable GTEST_LISTENER=PASS_LINE_IN_LOG
rocBLAS 3.1.0 for ROCm 5.7.1
rocBLAS code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.
rocBLAS 3.1.0 for ROCm 5.7.0
Added
- yaml lock step argument scanning for rocblas-bench and rocblas-test clients. See Programmers Guide for details.
- rocblas-gemm-tune is used to find the best performing GEMM kernel for each of a given set of GEMM problems.
Fixed
- make offset calculations for rocBLAS functions 64 bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
- Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
- Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
- Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
- General: set_vector, get_vector, set_matrix, get_matrix
- Related fixes: internal scalar loads with > 32bit offsets
- fix in-place functionality for all trtri sizes
Changed
- dot when using rocblas_pointer_mode_host is now synchronous to match legacy BLAS as it stores results in host memory
- enhanced reporting of installation issues caused by runtime libraries (Tensile)
- standardized internal rocblas C++ interface across most functions
Deprecated
- Removal of STDC_WANT_IEC_60559_TYPES_EXT define in future release
Dependencies
- optional use of AOCL BLIS 4.0 on Linux for clients
- optional build tool only dependency on python psutil
rocBLAS 3.0.0 for ROCm 5.6.1
rocBLAS code for ROCm 5.6.1 did not change. The library was rebuilt for the updated ROCm 5.6.1 stack.
rocBLAS 3.0.0 for ROCm 5.6.0
Optimizations
- Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.
- Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.
Added
- Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
Deprecated
- trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality
- rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release
- rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release
- rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()
- rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release
Removed
- is_complex helper was deprecated and now removed. Use rocblas_is_complex instead.
- The enum truncate_t and the value truncate was deprecated and now removed from. It was replaced by rocblas_truncate_t and rocblas_truncate, respectively.
- rocblas_set_int8_type_for_hipblas was deprecated and is now removed.
- rocblas_get_int8_type_for_hipblas was deprecated and is now removed.
Dependencies
- build only dependency on python joblib added as used by Tensile build
- fix for cmake install on some OS when performed by install.sh -d --cmake_install
Fixed
- make trsm offset calculations 64 bit safe
Changed
- refactor rotg test code
rocBLAS 2.47.0 for ROCm 5.5.1
rocBLAS code for ROCm 5.5.1 did not change. The library was rebuilt for the updated ROCm 5.5.1 stack.
rocBLAS 2.47.0 for ROCm 5.5.0
Added
- added functionality rocblas_geam_ex for matrix-matrix minimum operations
- added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3(pointer mode host) functions
- added beta features API. Exposed using compiler define ROCBLAS_BETA_FEATURES_API
- added support for vector initialization in the rocBLAS test framework with negative increments
- added windows build documentation for forthcoming support using ROCm HIP SDK
- added scripts to plot performance for multiple functions
Optimizations
- improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU.
- improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU.
- improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs.
Fixed
- fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench
- fixed deprecated API compatibility with Visual Studio compiler
- fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory
Changed
- install.sh internally runs rmake.py (also used on windows) and rmake.py may be used directly by developers on linux (use --help)
- rocblas client executables all now begin with rocblas- prefix
Removed
- install.sh removed options -o --cov as now Tensile will use the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default
rocBLAS 2.46.0 for ROCm 5.4.4
rocBLAS code for ROCm 5.4.4 did not change. The library was rebuilt for the updated ROCm 5.4.4 stack.
rocBLAS 2.46.0 for ROCm 5.4.3
rocBLAS code for ROCm 5.4.3 did not change. The library was rebuilt for the updated ROCm 5.4.3 stack.
rocBLAS 2.46.0 for ROCm 5.4.2
rocBLAS code for ROCm 5.4.2 did not change. The library was rebuilt for the updated ROCm 5.4.2 stack.