MAGMA Release Notes
-----------------------------------------------------
MAGMA is intended for CUDA enabled NVIDIA GPUs.
It supports Fermi and Kepler GPUs.
Included are routines for the following algorithms:
* LU, QR, and Cholesky factorizations
* Hessenberg, bidiagonal, and tridiagonal reductions
* Linear solvers based on LU, QR, and Cholesky
* Eigenvalue and singular value (SVD) problem solvers
* Generalized Hermitian-definite eigenproblem solver
* Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky
* MAGMA BLAS including gemm, gemv, symv, and trsm
* Batched MAGMA BLAS including gemm, gemv, herk, and trsm
* Batched MAGMA LAPACK including LU, inverse (getri), QR, and Cholesky factorizations
* MAGMA Sparse including CG, GMRES, BiCGSTAB, LOBPCG, iterative refinement,
preconditioners, sparse kernels (SpMV, SpMM), and support for CSR, ELL, and
SELL-P data formats
Most routines have all four precisions:
single (s), double (d), single-complex (c), double-complex (z).
2.2.0 - Nov 20, 2016
* Added variable size batched Cholesky factorization
magma_[sdcz]potrf_vbatched
* Added new fixed size batched BLAS routines
magmablas_[cz]{hemm, hemv, trmm}_batched
magmablas_[sd]{symm, symv, trmm}_batched
* Added new variable size batched BLAS routines
magmablas_[cz]{hemm, hemv, trmm, trsm}_vbatched
magmablas_[sd]{symm, symv, trmm, trsm}_vbatched
* Fixed memory leaks in {sy,he}evdx_2stage and getri_outofplace_batched.
* Fixed bug for small matrices in {symm, hemm}_mgpu and updated tester.
* Fixed libraries in make.inc examples for MKL with gcc.
* More robust error checking for Batched BLAS routines.
MAGMA-sparse
* Added Incomplete Sparse Approximate Inverse (ISAI) Preconditioner
for sparse triangular solves, including batched generation.
* Added Block-Jacobi triangular solves, including variable blocksize
(based on supervariable amalgamation).
* Added ParILUT, a parallel threshold ILU based on OpenMP.
* Added CSR5 format and CSR5 SpMV kernel, a sparse matrix vector product
often outperforming the cuSPARSE SpMV CSR and HYB.
2.1.0 - Aug 30, 2016
* Added variable size batched routines:
magmablas_[sdcz]{gemm, gemv, syrk, herk, syr2k, her2k}_vbatched
* Improved performance of SVD routines, and fixed workspace size bugs.
* More robust error checking for BLAS routines.
* Expanded and reorganized documentation.
* Improved install (added DESTDIR, LIB_SUFFIX to Makefile; added install to CMake).
MAGMA-sparse
* Added a preconditioned QMR iterative solver (PQMR) including a kernel-merged version.
* Updated the preconditioner structure to allow for a specific ILU triangular solver.
2.0.1 - Feb 26, 2016
* Fixed bug with 'make install'
2.0.0 - final: Feb 8, 2016
- beta 3: Jan 22, 2016
- beta 2: Jan 6, 2016
* See "README-v2.txt" for details about updating code.
* Removed support for CUDA arch 1.x, which NVIDIA has not supported since CUDA 6.
* Changed to non-recursive Makefile.
* Changed definition of magma_queue_t to opaque structure.
* Changed header from magma.h to magma_v2.h
* Changed magma_get_{getrf, geqp3, geqrf, geqlf, gelqf, gebrd, gesvd}_nb to take both m, n.
* Added queue argument to magmablas routines, and deprecated magmablas{Set,Get}KernelStream.
This resolves a thread safety issue with using global magmablas{Set,Get}KernelStream.
* Fixed bugs related to relying on CUDA NULL stream implicit synchronization.
* Fixed memory leaks (zunmqr_m, zheevdx_2stage, etc.). Add -DDEBUG_MEMORY option to catch leaks.
* Fixed geqrf*_gpu bugs for m == nb, n >> m (-N 64,10000); and m >> n, n == nb+i (-N 10000,129)
* Fixed zunmql2_gpu for rectangular sizes.
* Fixed zhegvdx_m itype 3.
* Added zunglq, zungbr, zgeadd2 (which takes both alpha and beta).
* Merged single & multi-GPU CPU interface testers (e.g., merged testing_dgeev_m into testing_dgeev).
* Deprecated magma_device_sync; use magma_queue_sync instead.
MAGMA-sparse
* Added QMR, TFQMR, preconditioned TFQMR
* Added CGS, preconditioned CGS
* Added kernel-fused versions for CGS/PCGS, QMR, TFQMR/PTFQMR
* Changed relative stopping criterion to be relative to RHS
* Fixed bug in complex version of CG
* Accelerated version of Jacobi-CG
* Added very efficient IDR
* Performance tuning for SELLP SpMV
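The 2.0.0 change that makes the MAGMA-sparse stopping criterion relative to the RHS can be pictured with the sketch below. It is only an illustration: the function names, the choice of the 2-norm, and the parameter rtol are assumptions, not taken from the MAGMA sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: plain dot product of two length-n vectors. */
double dot_sketch(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* Returns 1 when ||r||_2 <= rtol * ||b||_2, where r is the current
 * residual and b is the right-hand side; returns 0 otherwise.
 * Squared norms are compared so that no square root is needed. */
int converged_relative_to_rhs(size_t n, const double *r,
                              const double *b, double rtol)
{
    assert(rtol >= 0.0);
    return dot_sketch(n, r, r) <= rtol * rtol * dot_sketch(n, b, b);
}
```

An iterative solver loop would call converged_relative_to_rhs once per iteration. Measuring against the RHS means the test does not depend on the quality of the initial guess.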
1.7.0 - final: Sep 11, 2015
- beta 1: Aug 25, 2015
* Added results archive to compare historical performance.
* Added Fortran code to example directory.
* Added magmaf_wtime for consistency with other Fortran interfaces; deprecated magma_wtime_f.
* Added and templated batched MAGMA BLAS routines gemm, gemv, herk, trsv, and trsm
* Tuned batched MAGMA BLAS routines, in particular gemm, gemv, herk, and trsm
* Tuned batched MAGMA LAPACK routines, in particular Cholesky factorizations
* Tuned two-stage symmetric eigenvalue code, {sy|he}evdx_2stage, to improve performance.
* Tuned symmetric eigenvalue code, {sy|he}evd, to improve performance for N < 2000.
* Fixed NaN result with {sy|he}mv and {sy|he}mv_mgpu if GPU shared memory had NaN.
* Fixed Fortran constants (MagmaTrans, MagmaUpper, etc.).
* Fixed workspace requirements for the two stage symmetric eigenvalue problem
{sy|he}evdx_2stage and multi-GPU {sy|he}evdx_2stage_m.
* Fixed workspace requirements for Hessenberg (gehrd and gehrd_m) and multi-GPU geev_m.
* Fixed trtri for unit diagonal, and added tester.
* Fixed testing check for inverse (getri).
* Fixed multi-GPU {or|un}gqr_m for some k < n. (Currently only used in geev_m with m = n = k.)
* Fixed bug for batched routines
* Rename lapack_const to lapack_const_str, to avoid name conflict with PLASMA.
* Allow CMake build without Fortran (already existed for make).
MAGMA-sparse
* Added Induced Dimension Reduction Iterative solver (IDR).
* Added iterative sparse triangular solves for
incomplete factorization preconditioners.
1.6.2 - May 4, 2015
* Added magma_{s,d,c,z}sqrt for real and complex scalar square root.
* Added magma_ceildiv and magma_roundup.
* Fixed magmablas_zlaset and magmablas_zlacpy for large M or N > 4M.
* Fixed testers for geqrf_batched and trsm_batched to compile with CUDA 5.x.
MAGMA-sparse
* All allocation failures and other errors now return error codes.
* cuSPARSE error codes mapped to MAGMA error codes.
* LOBPCG sparse eigensolver enabled for preconditioning using Jacobi and
incomplete LU factorizations.
* Some name changes in MAGMA-sparse for consistency with dense MAGMA.
All functions working on matrices now start with the prefix magma_zm***;
previously some of them started with magma_z_m***.
* magma_zmvisu for printing a matrix is now called magma_zprint_matrix.
* Added a tester for the sparse level 1 BLAS.
* Renamed magma_z_sparse_matrix to magma_z_matrix.
* Redefine all vectors as dense matrices.
* Replace the vector functions with matrix functions.
* Bug fix in complex FGMRES.
* Added iterative incomplete factorization routines (iterative ILU/iterative IC).
* Enhance the ILU/IC with fill-in (level-ILU).
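As a rough picture of what integer helpers such as the magma_ceildiv and magma_roundup added in 1.6.2 typically compute, here is a minimal sketch. The function names below are hypothetical stand-ins, and the use of plain int rather than magma_int_t is an assumption made to keep the example self-contained.

```c
#include <assert.h>

/* Hypothetical stand-in, not the MAGMA implementation.
 * Smallest integer q such that q * y >= x, for x >= 0 and y > 0. */
int ceildiv_sketch(int x, int y)
{
    assert(x >= 0 && y > 0);
    return (x + y - 1) / y;
}

/* Hypothetical stand-in, not the MAGMA implementation.
 * Smallest multiple of y that is greater than or equal to x. */
int roundup_sketch(int x, int y)
{
    return ceildiv_sketch(x, y) * y;
}
```

A typical use is sizing a kernel launch or padding a leading dimension: ceildiv_sketch(n, 64) gives the number of 64-wide blocks that cover n entries, and roundup_sketch(n, 32) pads n to a multiple of 32.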
1.6.1 - January 30, 2015
* Building as both shared and static library is default now.
Comment out FPIC in make.inc to build only static library.
* Added max norm and one norm to [zcsd]lange.
* Extended {sy|he}mv and {sy|he}mv_mgpu implementation to upper triangular.
* Fixed memory access bug in {sy|he}mv_mgpu, used in {sy|he}trd_mgpu.
* Fixed errant argument check in laswp, affecting getrf_mgpu.
* Fixed tau in [cz]gelqf, which needed to be conjugated.
* Fixed workspace size in symmetric/Hermitian eigenvalue solvers.
* Made fast magmablas_zhemv default in symmetric/Hermitian eigenvalue solvers
(previously needed to define -DFAST_HEMV option).
* Added FGMRES for non-constant preconditioner operator.
* Added backward communication interfaces for SpMV and
preconditioner passing the vectors on the GPU.
* Added function to generate cuSPARSE ILU level-scheduling information
for a given matrix.
* Added the batched QR routine.
* Performance improvements for all batched routines.
* Fixed "nan" output for batched factorizations.
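For readers unfamiliar with the two norms added to [zcsd]lange in 1.6.1, the sketch below computes them on the CPU for a real column-major matrix, following the LAPACK definitions (max norm: largest absolute entry; one norm: largest absolute column sum). The function names are hypothetical and this is not the GPU implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_sketch(double v) { return v < 0.0 ? -v : v; }

/* Max norm: largest |A(i,j)| of an m-by-n column-major matrix. */
double lange_max_sketch(size_t m, size_t n, const double *A, size_t lda)
{
    assert(lda >= m);
    double result = 0.0;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            if (abs_sketch(A[i + j * lda]) > result)
                result = abs_sketch(A[i + j * lda]);
    return result;
}

/* One norm: largest column sum of absolute values. */
double lange_one_sketch(size_t m, size_t n, const double *A, size_t lda)
{
    assert(lda >= m);
    double result = 0.0;
    for (size_t j = 0; j < n; ++j) {
        double colsum = 0.0;
        for (size_t i = 0; i < m; ++i)
            colsum += abs_sketch(A[i + j * lda]);
        if (colsum > result)
            result = colsum;
    }
    return result;
}
```

Both functions return 0 for an empty matrix (m or n equal to zero).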
1.6.0 - November 16, 2014
* Added MAGMA batched linear algebra routines:
* Batched MAGMA BLAS including gemm, gemv, herk, and trsm
* Batched LU, GETRI, and Cholesky factorizations
* Added Bunch-Kaufman factorization and solver for symmetric
indefinite matrices: [zcsd]{he|sy}trf
* Added non-pivoted LDLt
* Added a Random Butterfly Transformation (RBT) and a new solver based
on RBT + LU without pivoting + iterative refinement
* Comprehensive release of sparse routines:
* All sparse routines equipped with a queue.
* Enhanced debugging routines.
* Interface to cuSPARSE functions.
* Added interface to pass data structures located in main/device memory.
* Added generic interface to call any solver/eigensolver.
* Added testscript checking correctness of routines.
* Added capability to iterate in block-wise fashion.
* Checks for memory leaks.
1.5.0 - final: Aug 30, 2014
- beta 3: July 18, 2014
- beta 2: May 30, 2014
- beta 1: April 25, 2014
* Added pre-release of sparse routines.
* Replaced character constants with symbolic constants (enums),
e.g., 'N' with MagmaNoTrans.
* Added SVD with Divide & Conquer, gesdd.
* Added unmbr/ormbr, unmlq/ormlq, used in gesdd.
* Improved performance of geev when computing eigenvectors by using
multi-threaded trevc.
* Added testing/run_tests.py script for more extensive testing.
* Changed laset interface to match LAPACK.
* Fixed memory access bug in transpose, and changed interface to match LAPACK.
* Fixed memory access bugs in lanhe/lansy, zlag2c, clag2z, dlag2s, slag2d,
zlat2c, dlat2s, trsm (trtri_diag).
* Added clat2z, slat2d.
* Added upper & lower cases in lacpy.
* Fixed unmql/ormql for rectangular matrices.
* Allow compiling without Fortran, but then testers have reduced functionality.
* Added wrappers for CPU BLAS asum, nrm2, dotu, dotc, dot. This isolates
the dependence on CBLAS to src/cblas*.cpp.
* Added queue/stream interfaces for many MAGMABLAS routines, using _q suffix.
These take magma_queue_t, which is a wrapper around CUDA stream.
* Updated documentation to doxygen format.
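The move from character constants to symbolic constants in 1.5.0 can be illustrated with the sketch below. The type and enumerator names are hypothetical, and the numeric values are borrowed from the CBLAS_TRANSPOSE numbering purely for illustration; consult the MAGMA headers for the real definitions.

```c
#include <assert.h>

/* Hypothetical enum standing in for a symbolic transpose constant. */
typedef enum {
    SketchNoTrans   = 111,
    SketchTrans     = 112,
    SketchConjTrans = 113
} sketch_trans_t;

/* Converts a LAPACK-style character ('N', 'T', 'C') to the enum.
 * Unknown characters trip the assertion in debug builds; with
 * NDEBUG defined the function falls back to SketchNoTrans. */
sketch_trans_t trans_from_char_sketch(char c)
{
    switch (c) {
        case 'N': case 'n': return SketchNoTrans;
        case 'T': case 't': return SketchTrans;
        case 'C': case 'c': return SketchConjTrans;
        default:
            assert(0 && "unknown transpose character");
            return SketchNoTrans;
    }
}
```

With an enum parameter, passing an unrelated character such as 'U' where a transpose option is expected no longer fits the parameter type, so the mistake is easier to spot.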
1.4.1 - final: December 17, 2013
- beta 2: December 9, 2013
- beta 1: November 23, 2013
* Improved performance of geev when computing eigenvectors by using blocked trevc.
* Added right-looking multi-GPU Cholesky factorization.
* Added new CMake installation for compiling on Windows.
* Updated magmablas to call appropriate version based on CUDA architecture
at runtime. GPU_TARGET now accepts multiple architectures together.
1.4.0 - final: Aug 14, 2013
- beta 2: June 28, 2013
- beta 1: June 19, 2013
* Use magma_init() and magma_finalize() to initialize and cleanup MAGMA.
* Merge libmagmablas into libmagma to eliminate circular dependencies.
Link with just -lmagma now.
* User can now #include <cublas_v2.h> before #include <magma.h>.
See testing_z_cublas_v2.cpp for an example.
* Can compile as shared library; see make.inc.mkl-shared and 'make shared'.
* Fix required workspace size in gels_gpu, gels3_gpu, geqrs_gpu, geqrs3_gpu.
* Fix required workspace size in [zcsd]geqrf.
* Fix required workspace size in [he|sy]evd*, [he|sy]gvd*.
* [zc|ds]geqrsv no longer segfaults when M > N.
* Fix gesv and posv in some situations when GPU memory is close to full.
* Fix synchronization in multi-GPU getrf_m and getrf2_mgpu.
* Fix multi-GPU geqrf_mgpu for M < N.
* Add MAGMA_ILP64 to compile with int being 64-bit. See make.inc.mkl-ilp64.
* Add panel factorizations for LU, QR, and Cholesky entirely on the GPU,
correspondingly in [zcsd]getf2_gpu, [zcsd]geqr2_gpu, and [zcsd]potf2_gpu.
* Add QR with pivoting in GPU interface (functions [zcsd]geqp3_gpu);
improve performance for both CPU and GPU interface QR with pivoting.
* Add multi-GPU Hessenberg and non-symmetric eigenvalue routines:
geev_m, gehrd_m, unghr_m, ungqr_m.
* Add multi-GPU symmetric eigenvalue routines (one-stage)
([zhe|che|ssy|dsy]trd_mgpu,
[zhe|che|ssy|dsy]evd_m, [zhe|che|ssy|dsy]evdx_m,
[zhe|che|ssy|dsy]gvd_m, [zhe|che|ssy|dsy]gvdx_m ).
* Add single and multi-GPU symmetric eigenvalue routines (two-stage)
([zhe|che|ssy|dsy]evdx_2stage, [zhe|che|ssy|dsy]gvdx_2stage,
[zhe|che|ssy|dsy]evdx_2stage_m, [zhe|che|ssy|dsy]gvdx_2stage_m ).
* Add magma_strerror to get error message.
* Revised most testers to use common framework and options.
* Use CUBLAS gemm in src files, since it has been optimized for Kepler.
* Determine block sizes at runtime based on current card's architecture.
* In-place transpose now works for arbitrary n-by-n square matrix.
This also reduces required memory in zgetrf_gpu.
* Update Fortran wrappers with automated script.
* Fix Makefile for Kepler (3.0 and 3.5).
1.3.0 - November 12, 2012
* Add MAGMA_VERSION constants and magma_version() in magma.h.
* Fix printing complex matrices.
* Fix documentation and query for heevd/syevd workspace sizes.
* Fix singularity check in trtri and trtri_gpu.
* Fixes for compiling on Windows (small, __attribute__, magma_free_cpu, etc.)
* Implement all 4 cases for zunmqr (QC, Q'C, CQ, CQ') and fix workspace size.
* Fix permuting rows for M > 32K.
* Check residual ||Ax-b||; faster and uses less memory than ||PA-LU|| check.
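The residual check mentioned for 1.3.0 needs only a matrix-vector product, which is why it is cheaper than forming PA-LU. A minimal CPU sketch follows. It assumes a real, square, column-major matrix and the infinity norm, and it omits whatever scaling the actual testers apply, since these notes do not state it.

```c
#include <assert.h>
#include <stddef.h>

/* Infinity norm of r = A*x - b for an n-by-n column-major matrix A
 * with leading dimension lda. Costs O(n^2) operations and needs no
 * extra matrix storage, unlike forming the product L*U. */
double residual_inf_norm_sketch(size_t n, const double *A, size_t lda,
                                const double *x, const double *b)
{
    assert(lda >= n);
    double norm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = -b[i];
        for (size_t j = 0; j < n; ++j)
            r += A[i + j * lda] * x[j];
        double a = r < 0.0 ? -r : r;
        if (a > norm)
            norm = a;
    }
    return norm;
}
```

A tester would typically divide this value by a product of norms of A and x (and possibly n) before comparing against a tolerance.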
1.2.1 - June 29, 2012
* Fix bug in [zcsd]getrf_gpu.cpp
* Fix workspace requirement for SVD in [zcsd]gesvd.cpp
* Fix a bug in freeing pinned memory (in interface_cuda/alloc.cpp)
* Fix a bug in [zcsd]geqrf_mgpu.cpp
* Fix zdotc to use cblas for portability
* Fix uppercase entries in blas/lapack headers
* Use magma_int_t in blas/lapack headers, and fix sources accordingly
* Fix magma_is_devptr error handling
* Add magma_malloc_cpu to allocate CPU memory aligned to 32-byte boundary
for performance and reproducibility
* Fix memory leaks in latrd* and zcgeqrsv_gpu
* Remove dependency on CUDA device driver
* Add QR with pivoting in CPU interface (functions [zcsd]geqp3)
* Add hegst/sygst Fortran interface
* Improve performance of gesv CPU interface by 30%
* Improve performance of ungqr/orgqr CPU and GPU interfaces by 30%;
more for small matrices
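The 32-byte aligned CPU allocation introduced with magma_malloc_cpu in 1.2.1 can be approximated in portable C11 as sketched below. The function name is hypothetical and the use of aligned_alloc is an assumption; the notes do not say how MAGMA implements it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns memory aligned to a 32-byte boundary,
 * or NULL on failure. Release the memory with free(). */
void *malloc_aligned32_sketch(size_t bytes)
{
    const size_t align = 32;
    /* aligned_alloc expects the size to be a multiple of the
     * alignment, so round the request up to the next multiple. */
    size_t padded = (bytes + align - 1) / align * align;
    if (padded == 0)
        padded = align;
    return aligned_alloc(align, padded);
}
```

A fixed alignment keeps vectorized loads on the same boundaries from run to run, which fits the performance and reproducibility motivation given in the entry above.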
1.2.0 - May 10, 2012
* Fix bugs in [zcsd]hegst[_gpu].cpp
* Fix a bug in [zcsd]latrd.cpp
* Fix a bug in [zcsd]gelqf_gpu.cpp
* Added application of a block reflector H or its transpose from the Right.
Routines changed -- [zcsd]larfb_gpu.cpp, [zc]unmqr2_gpu.cpp, and
[ds]ormqr2_gpu.cpp
* Fix *larfb_gpu for reflector vectors stored row-wise.
* Fix memory allocation bugs in [zc]unmqr2_gpu.cpp, [ds]ormqr2_gpu.cpp,
[zc]unmqr.cpp, and [ds]ormqr.cpp (thanks to Azzam Haidar).
* Fix bug in *lacpy that overwrote memory.
* Fix residual formula in testing_*gesv* and testing_*posv*.
* Fix sizeptr.cpp compile warning that caused make to fail.
* Fix warning in *getrf.cpp when nb0 is zero.
* Add reduction to band-diagonal for symmetric/Hermitian definite matrices
in [zc]hebbd.cpp and [ds]sybbd.cpp
* Updated eigensolvers for standard and generalized eigenproblems for
symmetric/Hermitian definite matrices
* Add wrappers around CUDA and CUBLAS functions,
for portability and error checking.
* Add tracing functions.
* Add two-stage reduction to tridiagonal form
* Add matrix print functions.
* Make info and return codes consistent.
* Change GPU_TARGET in make.inc to descriptive name (e.g., Fermi).
* Move magma_stream to -lmagmablas to eliminate dependency on -lmagma.
1.1.0 - 11-11-11
* Fix a bug in [zcsd]geqrf_gpu.cpp and [zcsd]geqrf3_gpu.cpp for n>m
* Fix a bug in [zcsd]laset - to call the kernel only when m!=0 && n!=0
* Fix a bug in [zcsd]gehrd for ilo > 1 or ihi < n.
* Added missing Fortran interfaces
* Add general matrix inverse, [zcds]getri GPU interface.
* Add [zcds]potri in CPU and GPU interfaces
[Hatem Ltaief et al.]
* Add [zcds]trtri in CPU and GPU interfaces
[Hatem Ltaief et al.]
* Add [zcds]lauum in CPU and GPU interfaces
[Hatem Ltaief et al.]
* Add zgemm for Fermi obtained using autotuning
* Add non-GPU-resident versions of [zcds]geqrf, [zcds]potrf, and [zcds]getrf
* Add multi-GPU LU, QR, and Cholesky factorizations
* Add tile algorithms for multicore and multi-GPUs using the StarPU
runtime system (in directory 'multi-gpu-dynamic')
* Add [zcds]gesv and [zcds]posv in CPU interface. GPU interface was already in 1.0
* Add LAPACK linear equation testing code (in 'testing/lin')
* Add experimental directory ('exp') with algorithms for:
(1) Multi-core QR, LU, Cholesky
(2) Single GPU, all available CPU cores QR
* Add eigenvalue solver driver routines for the standard and generalized
symmetric/Hermitian eigenvalue problems [Raffaele Solca et al.].
1.0.0 - August 25th, 2011
* Fix make.inc.mkl (Thanks to ar1309)
* Add gpu interfaces to [zcsd]hetrd, [zcsd]heevd
* Add all cases for [zcds]unmtr_gpu
[Raffaele Solca et al.]
* Add generalized Hermitian-definite eigenproblem solver ([zcds]hegvd)
[Raffaele Solca et al.]
1.0.0RC5 - April 6th, 2011
* Add fortran interface for lapack functions
* Add new QR version on GPU ([zcsd]geqrf3_gpu) and corresponding
LS solver ([zcds]geqrs3_gpu)
* Add [cz]unmtr, [sd]ormtr functions
* Add two functions in fortran to compute the offset on device pointers
magmaf_[sdcz]off1d( NewPtr, OldPtr, inc, i)
magmaf_[sdcz]off2d( NewPtr, OldPtr, lda, i, j)
indices are given in Fortran (1 to N)
* WARNING: add FOPTS variable to the make.inc to use preprocessing
in compilation of Fortran files
* WARNING: fix bug with Fortran compilers that don't change the name;
the Fortran prefix is now magmaf instead of magma
* Small documentation fixes
* Fix timing under windows, thanks to Evan Lazar
* Fix problem when __func__ is not present, thanks to Evan Lazar
* Fix bug with m==n==0 in LU, thanks to Evan Lazar
* Fix bug on [cz]unmqr, [sd]ormqr functions
* Fix bug in [zcsd]gebrd; fixes bug in SVD for n>m
* Fix bug in [zcsd]geqrs_gpu for multiple RHS
* Added functionality - zcgesv_gpu and dsgesv_gpu can now also solve
A' X = B using mixed-precision iterative refinement
* Fix error code in testings.h to compile with cuda 4.0
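The index arithmetic behind offset helpers like magmaf_[sdcz]off1d and magmaf_[sdcz]off2d from 1.0.0RC5 can be sketched as follows, given the 1-based Fortran indices stated above. The function names are hypothetical, and column-major storage with offsets counted in elements (not bytes) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Element offset of entry i (1-based) in a vector with stride inc. */
size_t off1d_sketch(size_t inc, size_t i)
{
    assert(i >= 1);
    return (i - 1) * inc;
}

/* Element offset of entry (i, j) (both 1-based) in a column-major
 * matrix with leading dimension lda. */
size_t off2d_sketch(size_t lda, size_t i, size_t j)
{
    assert(i >= 1 && j >= 1 && lda >= 1);
    return (j - 1) * lda + (i - 1);
}
```

Given a base pointer to the first element, base + off2d_sketch(lda, i, j) then addresses entry (i, j), which is the kind of pointer a Fortran caller cannot easily form for device memory on its own.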
1.0.0RC4 - March 8th, 2011
* Add control directory to group all non computational functions
* Integration of the eigenvalue solvers
* Clean some f2c code in eigenvalue solvers
* Arithmetic consistency: cuDoubleComplex and cuFloatComplex are
the only types used for complex now.
* Consistency of the interface of some functions.
* Clean most of the return values in lapack functions
* Fix multiple definitions of min and max
* Fix headers problem under windows, thanks to Willem Burger