v0.95

Released by @bvanessen on 12 Feb 01:31

============================= Release Notes: v0.95 ==============================

Support for new training algorithms:

  • Generative Adversarial Networks (GAN)

Support for new network structures:

  • Variational Autoencoders
  • GAN
  • CycleGAN
  • Combined Autoencoders with CycleGAN
  • Deep Recurrent Attention Model (DRAM), Ba et al. (2015)
  • Video Recurrent Attention Model (VRAM)

Support for new layers:

  • Optimized Top-K accuracy (CPU, GPU)
  • Crop (CPU, GPU)
  • Sort (CPU, GPU), supporting both ascending and descending order
  • Absolute value (CPU, GPU)
  • Mean squared error (CPU, GPU)
  • Top-K categorical accuracy (CPU, GPU); a sketch of the computation follows this list
  • Cross-entropy (CPU, GPU)
  • Stop gradient (CPU, GPU)
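
As a rough illustration of what the top-k accuracy layers compute, the following
self-contained C++ sketch counts how often the true class falls within the k
highest-scoring predictions. The top_k_accuracy function and its flat std::vector
layout are purely illustrative; the actual layers operate on distributed Hydrogen
matrices on CPU or GPU.

```cpp
// Minimal sketch of top-k categorical accuracy: the fraction of samples
// whose true class is among the k highest-scoring predictions.
// Illustrative only; not LBANN's layer implementation.
#include <cstdio>
#include <cstddef>
#include <vector>

double top_k_accuracy(const std::vector<std::vector<double>>& predictions,
                      const std::vector<std::size_t>& labels,
                      std::size_t k) {
  std::size_t hits = 0;
  for (std::size_t i = 0; i < predictions.size(); ++i) {
    const auto& p = predictions[i];
    // Count classes that score strictly higher than the true class.
    std::size_t rank = 0;
    for (double v : p) {
      if (v > p[labels[i]]) { ++rank; }
    }
    if (rank < k) { ++hits; }  // true class is within the top k
  }
  return static_cast<double>(hits) / predictions.size();
}

int main() {
  std::vector<std::vector<double>> preds = {{0.1, 0.7, 0.2}, {0.5, 0.3, 0.2}};
  std::vector<std::size_t> labels = {2, 0};  // sample 0 misses top-1, hits top-2
  std::printf("top-1: %.2f, top-2: %.2f\n",
              top_k_accuracy(preds, labels, 1),
              top_k_accuracy(preds, labels, 2));
  return 0;
}
```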

Performance optimizations:

  • Use pinned memory for CPU activation matrices (see the sketch after this list)
  • Non-blocking GPU computation of objective functions and metrics
  • Refactored weight matrices and weight initialization
  • Manage GPU workspace buffers with memory pool
  • Slice and concatenation layers emit matrix views when possible
  • Used more fine-grained asynchronous calls when using the Aluminum library
    • Minimized GPU stream synchronization events per call
  • Reduced synchronization events when using a single GPU
  • Fixed GPU workspace size
  • GPU implementation of Adagrad optimizer
  • GPU model-parallel softmax
  • Optimized local CUDA kernel implementations
  • Support for distributed matrices with arbitrary alignment
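
The pinned-memory and non-blocking items above rest on a standard CUDA runtime
pattern, sketched below: page-locked host buffers from cudaMallocHost allow
cudaMemcpyAsync calls on a stream to proceed without blocking, so synchronization
collapses into a single stream-level wait. This is a generic illustration of the
technique, not LBANN's implementation.

```cpp
// Generic illustration of pinned host memory + non-blocking transfers.
// Compile with: nvcc pinned.cu   (this is not LBANN code)
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const int n = 1 << 20;
  float *host = nullptr, *dev = nullptr;
  cudaMallocHost((void**)&host, n * sizeof(float));  // pinned (page-locked) host memory
  cudaMalloc((void**)&dev, n * sizeof(float));
  for (int i = 0; i < n; ++i) { host[i] = 1.0f; }

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  // Both copies are queued on the stream and return immediately; pinned
  // host memory is required for the copies to be truly asynchronous.
  cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice, stream);
  cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // single synchronization point

  std::printf("round trip ok: host[0] = %.1f\n", host[0]);
  cudaFreeHost(host);
  cudaFree(dev);
  cudaStreamDestroy(stream);
  return 0;
}
```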

Model portability & usability:

  • Keras to LBANN prototext conversion tool
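
For context, LBANN consumes model descriptions in prototext, so the tool maps a
Keras model onto a description of roughly the following shape. This snippet is
schematic; the layer and field names are approximations rather than the exact
schema emitted by the converter.

```
model {
  layer {
    name: "input"
    input { }
  }
  layer {
    name: "fc1"
    parents: "input"
    fully_connected { num_neurons: 500 }
  }
  layer {
    name: "relu1"
    parents: "fc1"
    relu { }
  }
}
```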

Internal features:

  • Support for multiple objective functions and metrics per network with arbitrary placement
    • Objective functions represented as layers
    • Metrics represented as layers
    • Introduced evaluation layer construct
  • Ability to freeze specific layers for pre-training / fine-tuning (sketched after this list)
  • Refactored tensor setup in setup, forward prop, and back prop
  • Layers store matrices in private smart pointers
  • Model automatically inserts evaluation layers where needed
  • Copy layer activations between models
  • Annotated GPU profiling output with training phases
  • Fixed initialization of Comm object and Grid objects when using multiple models
  • General code cleanup, refactoring, and various bug fixes
  • All layers overwrite error signal matrices
  • NCCL backend is now implemented via the Aluminum library
  • MPI calls are routed through the LBANN Comm object into Hydrogen or Aluminum
  • Provide a runtime statistics summary from every rank
  • Reworked LBANN to use Hydrogen to manage GPU memory
  • GPU allocations now go through a CUB memory pool
  • Fixed Spack build interaction with the Hydrogen library
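
As an illustration of the freezing mechanism mentioned above, the sketch below
shows the basic idea: a frozen layer still participates in forward and backward
propagation but skips its own weight update. The Layer struct and its members are
hypothetical stand-ins, not LBANN's classes.

```cpp
// Sketch of layer freezing for pre-training / fine-tuning: gradients still
// flow through a frozen layer, but its weights are not updated.
// Names are illustrative, not LBANN's API.
#include <cstdio>
#include <cstddef>
#include <vector>

struct Layer {
  std::vector<double> weights, grads;
  bool frozen;

  void freeze() { frozen = true; }

  // Apply a plain SGD step unless the layer is frozen.
  void update(double lr) {
    if (frozen) { return; }  // keep pre-trained weights fixed
    for (std::size_t i = 0; i < weights.size(); ++i) {
      weights[i] -= lr * grads[i];
    }
  }
};

int main() {
  Layer encoder{{1.0}, {0.5}, false};  // pre-trained part of the network
  Layer head{{1.0}, {0.5}, false};     // new part being fine-tuned
  encoder.freeze();
  encoder.update(0.1);
  head.update(0.1);
  std::printf("encoder: %f, head: %f\n", encoder.weights[0], head.weights[0]);
  // prints: encoder: 1.000000, head: 0.950000
  return 0;
}
```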

I/O & data readers:

  • Support for Conduit objects with HDF5 formatting
  • In-memory and locally offloaded data store
    • The data store can hold the entire training set in memory (or node-local storage)
    • The data store shuffles data samples between epochs and presents them to the input layer
  • Updated synthetic data reader
  • Modified data readers to handle bad samples in JAG Conduit data
  • Reworked the I/O layers (input and target) so that the input layer produces both the
    sample and the label / response when necessary
    • The target layer is deprecated
  • Updated the image data reader to use cv::imdecode to accelerate image load times
    (see the sketch after this list)
  • Allow users to specify an array of data sources for the independent/dependent
    variables via prototext
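
The cv::imdecode change mentioned above follows the usual OpenCV pattern of
decoding an image directly from an in-memory byte buffer; a minimal standalone
example (not LBANN's reader code) is sketched below.

```cpp
// Decode an image from raw bytes with cv::imdecode: the encoded file is
// read into memory once and decoded directly from the buffer.
#include <opencv2/imgcodecs.hpp>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s image-file\n", argv[0]);
    return 1;
  }
  // Read the raw encoded bytes (e.g. JPEG) into a memory buffer.
  std::ifstream f(argv[1], std::ios::binary);
  std::vector<unsigned char> buf((std::istreambuf_iterator<char>(f)),
                                 std::istreambuf_iterator<char>());
  // Decode straight from the buffer; no temporary file I/O is involved.
  cv::Mat img = cv::imdecode(buf, cv::IMREAD_COLOR);
  if (img.empty()) {
    std::fprintf(stderr, "decode failed\n");
    return 1;
  }
  std::printf("decoded %dx%d image\n", img.cols, img.rows);
  return 0;
}
```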