============================== Release Notes: v0.104 ==============================
C++ API:

Support for new training algorithms:

Support for new network structures:
 - Added GPT-3 transformers and training recipes

Support for new layers:
 - Select operator (set tensor value based on predicate)
 - Model parallelism for channel-wise fully-connected layers

Python front-end:
 - Support for PyTorch Module conversion to LBANN graphs (requires PyTorch 2.0 or newer, compiled with PyTorch Dynamo)

Performance optimizations:
 - Support in-place computations for capable layers as a memory optimization
 - Allow distconv-enabled convolution and batchnorm layers to reuse their input activations as error signals as a memory optimization if the parent layer does not need its activations in the backward pass. This optimization can be disabled by setting the environment variable DISTCONV_DISABLE_MEM_OPT=1 (see the example sketch after the I/O notes below).
 - Added support for selective weight sharding (also known as Fully-Sharded Data Parallelism, or FSDP). To enable, set sharded=true on weight objects (see the sketch after the I/O notes below).
 - Allow distconv to be disabled at runtime with LBANN_DISABLE_DISTCONV=1.
 - Activations are now deallocated when no longer needed via a reference counter; disable with LBANN_DISABLE_ACT_GC=1.
 - Added an option for LBANN to set the number of OMP threads to a modest default (4) if the environment does not specify anything.
 - Save memory in backpropagation by not replicating gradients between GradientManager and data_type_optimizer
 - Save more memory in FSDP by synchronizing previously outstanding async communication calls and freeing local gradient contributions
 - FSDP: release full weight views after backprop
 - Batch heads in multi-head attention into single operations instead of operating on a per-head basis
 - Stack the weights and biases for queries/keys/values in self-attention

Model portability & usability:
 - Added support for profiling with Caliper

Experiments & Applications:
 - Updated the CosmoFlow model to automatically scale the model architecture and parallelism with the input size
 - Added a PyTorch reference implementation of CosmoFlow

Internal features:
 - Removed the mini_batch_size parameter from the following functions in the layer class hierarchy: fp_setup_inputs, fp_setup_outputs, bp_setup_gradient_wrt_inputs; and in the distconv_adapter class: fp_setup, bp_setup
 - Support global and local gradient norm clipping with the clip_gradient_norm callback
 - Interactive progress bar with the progress_bar callback
 - Evaluate progress callback allows for periodic monitoring during training with an independent data set (intra-epoch evaluation)
 - Detailed memory usage profiling with the memory_profiler callback
 - Refactored subgraph parallelism

I/O & data readers:
 - Renamed percent_of_data_to_use, more accurately, to fraction_of_data_to_use
 - DataReaderMetaData, training_dr_linearized_data_size, and num_parallel_readers were removed from the model and layer API and now reside in the data ingestion pipeline
 - Fixed the implementation of background I/O to achieve better decoupling of background data fetch. Can be enabled / disabled with a runtime flag.
 - Set the default number of I/O threads to 4
 - Changed the I/O and transform pipeline to use a bank of RNGs that is indexed by the sample ID in the load sequence, rather than by the I/O thread ID. This eliminates variability when using different numbers of I/O threads.
 - Moved the state that tracks the current position in a data set from the data reader to the dataset class
 - Split the I/O RNGs into two banks: one for training and one for all other execution modes
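The runtime toggles listed under "Performance optimizations" are plain environment variables. The sketch below is only illustrative: the variable names come from the notes above, while the surrounding Python is an ordinary driver script and not part of LBANN itself.

    # Minimal sketch: falling back to the non-optimized code paths, e.g. when
    # debugging memory behavior. The environment variable names come from the
    # notes above; everything else is ordinary Python and only illustrative.
    import os

    os.environ["DISTCONV_DISABLE_MEM_OPT"] = "1"  # keep separate error-signal buffers in distconv layers
    os.environ["LBANN_DISABLE_DISTCONV"] = "1"    # disable distconv entirely at runtime
    os.environ["LBANN_DISABLE_ACT_GC"] = "1"      # keep activations instead of reference-counted freeing

    # Set these before launching the LBANN experiment (for example through the
    # Python front-end) so they are visible to the process that runs the model.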
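Weight sharding is requested per weight object. The following is a hypothetical Python front-end sketch: the notes only state that sharded=true must be set on weight objects, so exposing it as a 'sharded' keyword on lbann.Weights, and the particular layers used here, are assumptions for illustration.

    # Hypothetical sketch: requesting FSDP-style sharded weights from the
    # Python front-end. The 'sharded' keyword on lbann.Weights is assumed;
    # the release notes only say "set sharded=true on weight objects".
    import lbann

    x = lbann.Input(data_field='samples')
    w = lbann.Weights(initializer=lbann.HeNormalInitializer(), sharded=True)
    y = lbann.FullyConnected(x, num_neurons=1024, has_bias=False, weights=[w])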
Build system:
 - Updated the build script to use CachedCMakeProject mode, which should simplify the overall workflow
 - Set a default time limit for CI tests to avoid unnecessary stalls

Bug fixes:
 - Fixed a bug where in-place layers sometimes attached a locked view of a matrix to a mutable view
 - Fixed a bug when trying to use the legacy HDF5 data reader without the data store
 - Fixed concurrency bugs in the data store
 - Fixed a DistConv memory optimization bug

Retired features:
 - Removed support for the autoencoder strategy in the summarize images callback
 - Removed deprecated Layer protobuf fields: weight_data, num_neurons_from_data_reader
 - Removed support for calculating a global mini-batch across multiple models using the imcomm callback or multiple trainers. The mini-batch is now strictly contained to a single model in a single trainer. This deprecates an unused (and old) multi-model execution mode using the imcomm callback that predated LTFB.
 - Removed the notion of effective mini-batch size versus current mini-batch size
 - Removed the world-master mini-batch adjustment
 - Removed the model offset field; it is no longer necessary since data sets do not span models
 - Removed the cached value of the current mini-batch size from the SGD execution context. It is now cached only in the model.
 - Removed the imcomm "inter-model" callback
 - Removed the num-parallel-readers parameter from the I/O subsystem. This eliminates an older version of I/O parallelism that relied on a non-data-parallel I/O buffer and had different ranks fetching entire mini-batches. It is superseded by standard data-parallel I/O.