Merge branch 'release-v0.101'
============================== Release Notes: v0.101 ==============================

Support for new training algorithms:

Support for new network structures:
 - ATOM VAE model
 - Graph neural networks
 - Graph Convolutional Networks (GCN)
 - 3D U-Net Model

Support for new layers:
 - Implemented optimized GRU layer using cuDNN kernel
 - Graph Layers: GCN, GIN, Graph, GatedGraph

Python front-end:
 - Support for Graph and Graph Convolutional Networks
 - Added support for the OLCF data center (Summit)

Performance optimizations:
 - Optimized the CUDA kernel for tensor reordering in the GRU layer
 - Enabled TensorCore optimization for the GRU layer
 - GCN and Graph layers also have a faster Dense variant that uses only
   matrix multiplication (see the sketch after this list)
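
The Dense variant can be pictured with a short NumPy sketch (an illustration
only, not LBANN code; the symmetric normalization used here is an assumption):
a GCN layer computes H' = act(A_norm @ H @ W), so the whole propagation
reduces to dense matrix products that map directly onto GEMM.

    import numpy as np

    def dense_gcn_layer(A, H, W):
        """One GCN layer expressed purely with dense matrix multiplication.

        A: (N, N) adjacency matrix, H: (N, F_in) node features,
        W: (F_in, F_out) learned weights.
        """
        # Symmetric normalization with self-loops (Kipf & Welling style);
        # the normalization in LBANN's Dense variant may differ.
        A_hat = A + np.eye(A.shape[0])
        d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt
        # The propagation itself is just two matmuls.
        return np.maximum(A_norm @ H @ W, 0.0)  # ReLU activation

    # Toy example: 4 nodes, 3 input features, 2 output features.
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 1],
                  [0, 1, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    H = np.random.rand(4, 3)
    W = np.random.rand(3, 2)
    print(dense_gcn_layer(A, H, W).shape)  # (4, 2)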

Model portability & usability:
 - Added a Users Quickstart section to the documentation, including a
   PyTorch-to-LBANN mini-tutorial
 - Added a section on callbacks with detailed instructions for the
   summarize images callback

Internal features:
 - Support for double data type in distributed embedding layer
 - Support for large number of channels in GPU batchnorm layer
 - Modified LTFB so that NaNs lose tournaments
 - Improved numerical stability of reconstruction loss in ATOM VAE
   model
 - Skip bad gradients in Adam (illustrated in the sketch after this list)
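
As a rough sketch of the "skip bad gradients" idea (an illustration under the
assumption that "bad" means non-finite, not LBANN's optimizer code): before
applying an Adam update, check the gradient for NaN/Inf and make the step a
no-op if any entry is non-finite, leaving weights and moment estimates
untouched.

    import numpy as np

    def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        """One Adam update that skips non-finite ("bad") gradients."""
        if not np.all(np.isfinite(g)):
            # Bad gradient: skip the step entirely; weights and the
            # first/second moment estimates are left unchanged.
            return w, m, v, t
        t += 1
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v, t

    # A NaN in the gradient leaves the weights untouched:
    w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
    w, m, v, t = adam_step(w, np.array([0.1, np.nan, 0.2]), m, v, t=0)
    print(w)  # [0. 0. 0.]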

I/O & data readers:
 - Added support for the ImageNet data reader to use sample lists
 - Refactored the sample list code to be more flexible and to generalize
   beyond the JAG data reader
 - Added support for slab-based I/O in the HDF5 data reader, required by
   DistConv implementations of CosmoFlow 3D volumes (see the sketch after
   this list)
 - Extended slab-based HDF5 data reader to support labels and
   reconstruction modes for use with U-Net architecture
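
A minimal h5py sketch of the slab-based read pattern (an illustration only;
the file name, dataset name, and layout below are hypothetical, not the LBANN
reader's schema): each rank reads just its contiguous slab of a 3D volume
along the outermost dimension, which HDF5 serves as a hyperslab selection.

    import h5py

    def read_slab(path, dataset, rank, num_ranks):
        """Read this rank's contiguous slab of a 3D volume along axis 0."""
        with h5py.File(path, "r") as f:
            dset = f[dataset]             # e.g. shape (D, H, W)
            depth = dset.shape[0]
            slab = depth // num_ranks     # assume depth divides evenly
            start = rank * slab
            # h5py turns this slice into an HDF5 hyperslab selection, so
            # only the requested slab is actually read from disk.
            return dset[start:start + slab, :, :]

    # Hypothetical usage: rank 2 of 4 reads the third slab of "volume".
    # vol = read_slab("cosmoflow_sample.h5", "volume", rank=2, num_ranks=4)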

Datasets:
 - Added two graph datasets (MNIST and PROTEINS)

Build system and Dependent Libraries:
 - Hydrogen 1.4.0
 - Aluminum 0.4.0
 - Spack v0.15.4+ (Requires new format for environments)
 - cuDNN 8.0.2
 - Require C++14
 - Added Spack build support for the OLCF data center (Summit)

Bug fixes:
 - Properly reset data coordinator after each LTFB round
 - Fixed bug in weights proxy when weights buffer is reallocated
 - Fixed bounds checking in the SMILES data reader and simple LTFB data
   distribution
 - Eliminated a race condition observed in the ATOM VAE model with the
   SMILES data reader: added a barrier after each data store mini-batch
   exchange to avoid a race between non-blocking sends and receives and
   later GPU kernel communication (see the sketch after this list)
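
For context, a minimal mpi4py sketch of the barrier-after-exchange pattern (an
illustration, not LBANN's data store code): each rank completes its
non-blocking sends/receives and then synchronizes, so no rank moves on to
later work that touches the exchanged buffers while a peer is still
mid-exchange.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Hypothetical mini-batch exchange: send to the right neighbor,
    # receive from the left neighbor, both non-blocking.
    send_buf = bytearray(b"sample data from rank %d" % rank)
    recv_buf = bytearray(64)
    reqs = [comm.Isend(send_buf, dest=(rank + 1) % size, tag=0),
            comm.Irecv(recv_buf, source=(rank - 1) % size, tag=0)]
    MPI.Request.Waitall(reqs)

    # Barrier after the exchange: no rank proceeds to later communication
    # or GPU work until every rank has finished its exchange.
    comm.Barrier()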

Retired features:
bvanessen committed Sep 29, 2020
2 parents d0fbac3 + 6a0f8bf commit 13b5167
Showing 211 changed files with 9,989 additions and 1,268 deletions.
20 changes: 12 additions & 8 deletions CMakeLists.txt
@@ -26,9 +26,9 @@ if (NOT DEFINED BUILD_SHARED_LIBS)
   set(BUILD_SHARED_LIBS ON)
 endif ()
 
-# Build with at least C++11 standard; allow newer standards.
+# Build with at least C++14 standard; allow newer standards.
 if (NOT CMAKE_CXX_STANDARD OR CMAKE_CXX_STANDARD EQUAL 98)
-  set(CMAKE_CXX_STANDARD 11)
+  set(CMAKE_CXX_STANDARD 14)
   set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
 endif ()
 
@@ -48,7 +48,7 @@ endif ()
 #
 
 set(LBANN_VERSION_MAJOR 0)
-set(LBANN_VERSION_MINOR 100)
+set(LBANN_VERSION_MINOR 101)
 set(LBANN_VERSION_PATCH 0)
 
 set(LBANN_VERSION "${LBANN_VERSION_MAJOR}.${LBANN_VERSION_MINOR}.${LBANN_VERSION_PATCH}")
@@ -188,16 +188,20 @@ set(LBANN_HAS_CEREAL ${CEREAL_FOUND})
 # The imported target is just called "cereal". Super.
 
 # Setup the linear algebra library
-find_package(Hydrogen 1.3.3 NO_MODULE QUIET
+find_package(Hydrogen 1.4.0 NO_MODULE QUIET
   HINTS ${Hydrogen_DIR} ${HYDROGEN_DIR} $ENV{Hydrogen_DIR} $ENV{HYDROGEN_DIR}
   PATH_SUFFIXES lib/cmake/hydrogen
   NO_DEFAULT_PATH)
 if (NOT Hydrogen_FOUND)
-  find_package(Hydrogen 1.3.3 NO_MODULE QUIET REQUIRED)
+  find_package(Hydrogen 1.4.0 NO_MODULE QUIET REQUIRED)
 endif ()
 message(STATUS "Found Hydrogen: ${Hydrogen_DIR}")
 set(LBANN_HAS_HYDROGEN ${Hydrogen_FOUND})
 
+if (_HYDROGEN_HAVE_ROCM)
+  message(FATAL_ERROR "ROCm not yet supported in LBANN.")
+endif ()
+
 # DiHydrogen and Distconv
 if (LBANN_WITH_DISTCONV AND NOT LBANN_WITH_DIHYDROGEN)
   message(FATAL_ERROR "Distconv requires DiHydrogen. Enable DiHydrogen to use Distconv.")
@@ -260,7 +264,7 @@ if (LBANN_HAS_CUDA)
   enable_language(CUDA)
 
   if (NOT CMAKE_CUDA_STANDARD OR CMAKE_CUDA_STANDARD EQUAL 98)
-    set(CMAKE_CUDA_STANDARD 11)
+    set(CMAKE_CUDA_STANDARD 14)
   endif ()
 
   set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
@@ -271,13 +275,13 @@ if (LBANN_WITH_ALUMINUM)
   if (NOT Aluminum_FOUND)
     message(WARNING
       "Using Aluminum without Hydrogen support may not be well-supported.")
-    find_package(Aluminum 0.3.0 NO_MODULE QUIET
+    find_package(Aluminum 0.4.0 NO_MODULE QUIET
       HINTS ${Aluminum_DIR} ${ALUMINUM_DIR} ${AL_DIR}
         $ENV{Aluminum_DIR} $ENV{ALUMINUM_DIR} $ENV{AL_DIR}
       PATH_SUFFIXES lib64/cmake/aluminum lib/cmake/aluminum
      NO_DEFAULT_PATH)
    if (NOT Aluminum_FOUND)
-     find_package(Aluminum 0.3.0 NO_MODULE QUIET)
+     find_package(Aluminum 0.4.0 NO_MODULE QUIET)
    endif ()
  endif ()
  set(LBANN_HAS_ALUMINUM ${Aluminum_FOUND})
69 changes: 69 additions & 0 deletions ReleaseNotes.txt
@@ -21,6 +21,75 @@ Bug fixes:
 
 Retired features:
 
+============================== Release Notes: v0.101 ==============================
+
+Support for new training algorithms:
+
+Support for new network structures:
+ - ATOM VAE model
+ - Graph neural networks
+ - Graph Convolutional Networks (GCN)
+ - 3D U-Net Model
+
+Support for new layers:
+ - Implemented optimized GRU layer using cuDNN kernel
+ - Graph Layers: GCN, GIN, Graph, GatedGraph
+
+Python front-end:
+ - Support for Graph and Graph Convolutional Networks
+ - Added support for OCLF data center (Summit)
+
+Performance optimizations:
+ - Optimize CUDA kernel for tensor reordering in GRU layer
+ - Enabled TensorCore optimization for GRU layer
+ - GCN and Graph layers also have a faster Dense variant which only utilizes Matrix Multiplication
+
+Model portability & usability:
+ - Added Users Quickstart section to documentation including PyTorch
+   to LBANN mini-tutorial
+ - Added section on callbacks with detailed instructions on summarize
+   images callback
+
+Internal features:
+ - Support for double data type in distributed embedding layer
+ - Support for large number of channels in GPU batchnorm layer
+ - Modified LTFB so that NaNs lose tournaments
+ - Improved numerical stability of reconstruction loss in ATOM VAE
+   model
+ - Skip bad gradients in Adam
+
+I/O & data readers:
+ - Added support for ImageNet data reader to use sample lists
+ - Refactored sample list code to be more flexible and generalize
+   beyond JAG data reader
+ - Added support for slab-based I/O in HDF5 data reader required by
+   DistConv implementations of CosmoFlow 3D volumes
+ - Extended slab-based HDF5 data reader to support labels and
+   reconstruction modes for use with U-Net architecture
+
+Datasets:
+ - Added two graph datasets (MNIST, and PROTEINS)
+
+Build system and Dependent Libraries:
+ - Hydrogen 1.4.0
+ - Aluminum 0.4.0
+ - Spack v0.15.4+ (Requires new format for environments)
+ - cuDNN 8.0.2
+ - Require C++14
+ - Added Spack build support for OCLF data center (Summit)
+
+Bug fixes:
+ - Properly reset data coordinator after each LTFB round
+ - Fixed bug in weights proxy when weights buffer is reallocated
+ - Bugfix for smiles data reader bound checking and simple LTFB data
+   distribution
+ - Eliminated a race condition observed in VAE ATOM model with SMILES
+   data reader.  Added a barrier after each data store mini-batch
+   exchange -- avoid race between non-blocking sends and receives and
+   later GPU kernel communication.
+
+Retired features:
+
 ============================== Release Notes: v0.100 ==============================
 Support for new network structures:
  - 3D molecular generation models for Metal Organic Frameworks from the CoRE MOF Database.