CHANGES.txt

This file lists the major changes as they appear in the stable branch.  No
attempt is made to keep this list accurate for the master branch.

Version 24.12.0 (December 20, 2024)
  * Legion
    - Numerous bug fixes
  * Regent
    - Support for running without the CUDA hijack
    - Support for NVIDIA Hopper GPU architecture
  * Tools
    - Support for exporting profiles to NVTXW format
    - Simplifications that may improve performance by removing
       obsolete features
  * Realm
    - Remove the need for dynamic_cast in ExternalResource
    - Support for CUPTI profiling
    - Support for registering per-GPU reduction operations via CUfunction
    - Add a flag for ATS/HMM support and shared CPU memories, disabled by default
    - Support for scalable barrier via radix tree
    - Support for querying NUMA resources
    - Support for backtrace via cpptrace
    - More unit tests
    - CI coverage for HIP with NVIDIA backend

Version 24.09.0 (September 27, 2024)
  * Legion
    - Bug fixes for control replication and multi-node configurations
  * Regent
    - Fixes for ROCm 6.0 code generation
  * Tools
    - Legion Prof now uses subcommands (e.g., `legion_prof view`) to clarify
       which options apply to which actions
    - Legion Prof now tracks backtraces at the points where blocking wait
       calls are performed by the application
    - Legion Prof reports more detailed timing information for tasks
    - Legion Prof calculates clock skew between nodes and reports it when
       relevant
    - Commonly used features of Legion Prof are now enabled by default
    - The old Python Legion Prof implementation is no longer supported
  * Realm
    - `Point` fields `x`, `y`, `z` and `w` have been replaced by methods
    - Support for launching CUDA tasks onto a CUDA stream asynchronously via
       `cuCtxRecordEvent` without the need for the CUDA hijack
    - Support for CUDA fabric sharing
    - Support for host-to-host copies via CUDA DMA
    - Support for querying number of NUMA nodes from the `NumaModuleConfig`
    - Added reference counting for preimage operations
    - Make `std::atomic` the default atomic implementation
    - Remove `REALM_CXX_STANDARD`, and bump the minimal requirement to C++17
    - Implemented an ABI stable wrapper for GASNetEX
    - Additional unit tests including `CircularQueue`, `ReplicatedHeap`,
       `find_fastest_path`, `DynamicTableAllocator`, `generate_gather_paths`,
       `TransferIteratorIndexSpace`
    - Dead code cleanups and bug fixes

Version 24.06.0 (June 28, 2024)
  * Build
    - Minimum required C++ standard is now 17
    - Embedded GASNet build in CMake now automatically enables GPU memory kinds
  * Legion
    - Support for nonidempotent traces (where the postconditions do
       not imply the preconditions of the trace)
    - Deletions are now committed in program order, making it easier for
       users to reason about when their effects take place
    - All tasks (and other operations) are now committed in order (a
       prerequisite for anticipated, but not yet implemented, precise
       exception support)
    - Improvements to Legion's internal algorithm for virtual
       instances, fixing various correctness bugs in the implementation
    - Improvements to the `DefaultMapper` handling of task layout constraints
  * Regent
    - Improvements to make the compiler more deterministic
    - Improvements to auto-detect CUDA
    - Support for complex numbers in `std/format`
    - Static control replication (SCR) and RDIR have been completely
       removed. All SCR and RDIR related flags (`-fflow-*`) have been
       removed, except for `-fflow 0` which is permitted (but no
       longer does anything, and now issues a warning)
  * Tools
    - Restore profiler's ability to render dependent partitioning channels
    - Render mapper information on mapper calls in the profiler
    - Render user-provided profiling information in the profiler
  * Realm
    - UVM support for the HIP module
    - Error code support for command line parser
    - Support for querying MIG devices from NVML
    - Add indirection channel query
    - Additional unit tests and bug fixes

Version 24.03.0 (March 27, 2024)
  * Build
    - ROCm 6.0 is now supported, and support for ROCm 4.x has been removed
  * Legion
    - Support for control replication has been merged
    - Support for discarding region contents on task completion
    - Long-deprecated APIs, such as the old `HighLevel` namespace, have
       been removed
  * Mappers
    - Default mapper support for control replication
    - Default and null mappers now use the C++ `override` keyword
  * Regent
    - Support for pure projection functors that capture arguments
    - Static control replication (SCR) has been deprecated and will be
       removed in a future release
  * Tools
    - The profiler now correctly recognizes the logger format version
       and throws an error if it does not match
    - The profiler now reports when a profile was generated while debug
       mode (or another expensive setting) was enabled
    - Many profiler fixes for correctly rendering runtime and mapper calls
    - Profiler now renders GPU device and host execution separately
    - Optimizations to improve profiler memory usage and running time
    - Rust profiler now requires at least Rust 1.74
  * Realm
    - Support for registration of dynamically allocated buffers
    - Support for handling poisoned events for reservation
    - Refactor CUDA allocation and IPC paths
    - Support for querying CUDA device information (GPU UUID and ID),
       process information (process ID, hostname, host ID) and timer
       calibration error from the profiler
    - Remove address alignment from serializer and deserializer
    - Support for creating network shared peers using IPC mailbox
    - Support OMP thread binding and allow for multiple OMP parallel
       sections when enabling system OMP runtime
    - Add Realm unit tests
    - Fixes for Realm tests, sparsity map, MemoryQuery, dynamic framebuffer
       memory and memcpy channel

Version 23.12.0 (December 14, 2023)
  * Regent
    - Support for HIP multi-GPU per runtime
  * Realm
    - Improve scalability of startup by replacing point-to-point
       communication with allgatherv for machine model announcements
    - Support shared memory communication for system memory
    - Provide sanity check for GPU tasks to detect any leak of CUDA streams
    - Support for GPU transposes in CUDA-DMA
    - Bug fixes for CUDA-DMA

Version 23.09.0 (September 28, 2023)
  * Regent
    - Elide future maps in index launches
    - Improvements to Pygion interop
  * Realm
    - Add a machine configuration API that allows applications to configure
       the machine model without using the command line
    - Expose Realm managed CUDA/HIP stream to applications to launch GPU tasks
       without device-wise synchronization when hijack is disabled
    - Change timers to use rdtsc
    - Improve performance for getting highest priority task available in any
       task queue
    - Implement framebuffer memory with `cuMemMap`
    - Initial work for moving STL dependencies to header only

Version 23.06.0 (June 28, 2023)
  * Build
    - Fixes for CMake build on macOS
    - Fixes for HIP build when arch is specified
  * Realm
    - Support for better backtraces via libdw and libunwind
    - Improve scalability and performance in task spawning by caching
       the triggering operation of an event if one is provided
    - Fix a minor issue with affinity queries to properly clear the
       user-provided vector before populating it
    - Add more accurate GPU memory bandwidth affinity calculations if
       NVML is available
    - Refactor CPU core topology enumeration to serve systems without
       NUMA capabilities (like Jetson ARM systems)
    - Improve scalability and performance of task spawning by moving event
       reuse freelists to be per-processor, reducing lock contention
    - Add a microbenchmark for measuring task throughput more accurately
    - Add a series of Realm API tutorials
    - Replace `CU_EVENT_DEFAULT` with `CU_EVENT_DISABLE_TIMING` for better
       performance of CUDA events
    - Support Kokkos interop for the HIP module
    - Fixes for Realm tests on macOS
  * Tools
    - Legion Prof now supports search in the new profiler UI
    - Legion Prof now supports an HTTP client/server interface. Launch the
       server with `--serve` (on port 8080 by default) and attach a client to
       it with `--attach http://127.0.0.1:8080`
    - Legion Prof now supports a new archival mode via the `--archive`
       flag. Generate an offline profile and view it either via `--attach` or
       by uploading it to a server and navigating to
       `https://legion.stanford.edu/prof-viewer/?url=...`
    - Legion Prof modes (client/server/viewer) are now parallel by
       default, and perform heavy computations off the UI thread for
       better responsiveness
    - Add support for rendering indirect copies (i.e., gather/scatter)
    - Fix rendering of profiles over HTTP with old profiler UI
    - Fix profiling of copies with different numbers of hops between instances

Version 23.03.0 (March 27, 2023)
  * Build
    - Minimum supported CMake version is now 3.16. (Some optional features may
       continue to require even newer versions.)
    - Minimum supported GCC version is now 8.
    - Minimum supported CUDA version is now 10.
  * Legion
    - Added support for padded layout constraints to provide scratch space
       in instances for tasks to use (see examples/padded_instances).
    - Added support for tiled layout constraints to provide the ability to
       lay out instances by breaking down dimensions (see examples/tiling).
  * Realm
    - An experimental UCX network backend has been added.
    - Updated the Kokkos interop to support Kokkos 4.0.
  * Python
    - Support loading Legion as a library from a stock Python interpreter.
  * Regent
    - Fixes to avoid leaking futures.
    - Improvements to Regent's predicate optimization.
  * Tools
    - Legion Prof now supports a native viewer UI. Enable it with the `viewer`
       feature (e.g., `cargo run --features=viewer`) and use the flag
       `--view`.
    - Legion Prof now has better support for rendering a subset of available
       nodes. Pass all log files (from all nodes) into Legion Prof and add the
       `--subnodes` flag to specify which ones to render. This ensures all
       copies in/out of those nodes will be shown correctly.

Version 22.12.0 (December 30, 2022)
  * Regent
    - Support for nested predication of `if` and `while` statements
  * Realm
    - Support priorities for Copy operations
    - Support building with multiple network backends enabled, and use
       -ll:networks (gasnetex/gasnet1/mpi/none) to pick which one to use
       during runtime
    - Separate CUDA runtime from Realm by removing all references to CUDA
       runtime and relying only on driver API, which fixes an issue when
       mixing static and dynamic cudart across an application and improves
       Realm's compatibility across driver versions
  * Tools
    - Legion Prof supports visualization of channels for indirect copies, and
       of instances used by different operations including Task, Copy and
       Fill

Version 22.09.0 (September 30, 2022)
  * Python
     - Support for running packages via `legion_python -m`
     - Support for Jupyter Notebook on single node execution.
  * Regent
     - Deprecated support for LLVM versions less than 11 in
       `setup_env.py`. These versions will be removed in the next
       release. LLVM 13 is recommended, except on ARM where LLVM 11 is
       currently required
     - Added support for provenance for all launcher operations
     - Debug info is no longer generated by default in order to
       optimize compile times. To re-enable it, run with
       `-fdebuginfo 1`
  * Legion
     - Most Legion APIs now support passing a provenance string.
       This provenance information is passed through to tools like
       Legion Spy and Legion Prof so users can map what they are
       seeing back to their source code. In the future, provenance
        strings will also be used by all Legion error messages.
  * Realm
     - Support for fills of arbitrary instances (via multi-hop paths where
        needed)
     - Fixed crashes when using external instances and network-registered
        memory at the same time
     - Removed all direct references to CUDA runtime library in CUDA module
     - Caching of minimum-cost data transfer path for repeated copies
     - Dependent partitioning support for image and preimage using structured
        (~affine) transforms in addition to existing unstructured (field-based)
        images/preimages

Version 22.06.0 (June 29, 2022)
  * Regent
     - Support for cross-products in index launches, as well as
        multi-level projection functors.
     - Support for HIP on AMD GPUs has been added. All tasks marked with
        `__demand(__cuda)` are automatically eligible. Note that the name of
        the annotation may change in the future to something more general, but
        for now no change is being made. Some CUDA flags have migrated to more
        general names. See below.
     - The flag `-fcuda 1` is deprecated. Use `-fgpu cuda` instead.
     - The flag `-fcuda-offline` is deprecated. Use `-fgpu-offline` instead.
     - The flag `-fcuda-arch` is deprecated. Use `-fgpu-arch` instead.
     - Enable HIP support with `-fgpu hip` and use the `-fgpu-offline` and
        `-fgpu-arch` flags as necessary/appropriate.
     - Support for new flag `-ffast-math 1` which enables fast-math
        optimizations on CPU and GPU. By default, CPU code has this
        disabled, and GPU code uses only the `contract` flag in LLVM
        to generate FMA instructions. For compute-intensive
        applications, additional performance can sometimes be unlocked
        by enabling the full suite of optimizations with `-ffast-math 1`,
        at the cost of numerical accuracy.
     - Performance improvements for CUDA allow recent LLVM versions
        (e.g., 13) to match or exceed the performance of LLVM
        3.8. Previously, performance regressions made LLVM 3.8 the
        most performant version for use with CUDA. The recommended
        LLVM version moving forward is 13, and `setup_env.py` has been
        updated to set this on all platforms.
     - The versions of GASNet and Terra are now pinned by default in
        `setup_env.py`. You can choose versions explicitly with
        `GASNET_VERSION` (as before, though the previous default was
        unpinned) and `--terra-branch`, respectively.
  * Realm
     - Allow use of system OpenMP runtime (instead of Realm-provided one) with
        `-DLegion_OpenMP_SYSTEM_RUNTIME=ON`.  This allows inter-operation with
        libraries that have already been linked to the system runtime, but
        limits each process to a single OMP processor.

Version 22.03.0 (March 27, 2022)
  * Build
     - Minimum supported CMake version is now 3.7.  (Some optional features
        continue to require even newer versions.)
  * Realm
     - Numerous bug fixes in the `gasnetex` network layer
     - CUDA and HIP support allow direct specification of which GPUs to
         use via the `-ll:gpu_ids` command-line option
     - Added support for copy paths using CUDA IPC between GPUs on the same
         physical node
     - For applications using CUDA without the runtime API hijack AND only
        submitting work to the default CUDA stream, `-cuda:legacysync 1`
         reduces the overhead of detecting the completion of device-side work
        launched by a task
     - Realm reduction copies may now indicate exclusive access to the
        destination instance, improving performance by allowing simple
        load/store instead of atomic operations
     - Custom reduction operations (including Legion's built-in ones) can
        provide HIP implementations, permitting in-place reductions in
        HIP device memory
  * Regent
     - Support for custom serialization of types in task parameters and results
     - New experimental timing library under std/timing

Version 21.12.0 (December 31, 2021)
  * Realm
     - Performance improvements for multi-dimensional copies, especially
        inter-process transfers
     - Support for loading CUDA driver (if present) at runtime instead of
        link time, allowing same binary to be used on systems with and without
        CUDA-capable GPUs (enabled with -DLegion_CUDA_DYNAMIC_LOAD=ON in
        cmake build)
     - A separate `Memory` is now created per process for external (system)
        memory instances.  This memory has no capacity for creating instances
        and can confuse applications or Legion mappers that assume exactly
        one Memory of kind `SYSTEM_MEM` exists.  Old behavior can be obtained
        with `-ll:ext_sysmem 0`, but this can fail for configurations that
        register system memory with the network and/or GPUs
     - The `MemoryQuery` now supports a `has_capacity` predicate to restrict
        results to just memories with sufficient total (not current!) capacity
        to allocate an instance of a specified size
  * Build
     - CMake allows control of max nodes (-DLegion_MAX_NUM_NODES=...) and
        max processors/node (-DLegion_MAX_NUM_PROCS=...) supported by
        Legion build
     - Added dependency tracking to make-based builds

Version 21.09.0 (September 28, 2021)
  * Realm
     - Numerous bug fixes in the `gasnetex` network layer
     - Support for HIP memory type registration with GASNet (with
        GASNet version 2021.9.0+)
     - Arguments to spawned tasks may now be arbitrarily large (network-
        specific limits have been eliminated)
  * Regent
     - Improved support for dynamic checks on index launches with
         potential interference between different region arguments
     - Extensive fixes for separate compilation. This mode has now
         been verified to work with large-scale applications
     - Removed long-obsolete support for `__demand(__external)`
  * Pygion
     - Add support for layout constraints

Version 21.06.0 (June 24, 2021)
  * Build
    - Version information is now compiled into Realm and Legion.  This takes
        the form of a string (e.g. "legion-21.06.0") rather than anything
        that can be compared (i.e. no semantic versioning here).  Compile-time
        defines `REALM_VERSION` and `LEGION_VERSION` are available as well as
        run-time calls `Realm::Runtime::get_library_version` and
        `Legion::Runtime::get_library_version`.
  * Regent
    - Support for dynamic checks on projection functors, enabling a
        much larger class of loops to be supported as index launches
    - Support for local tasks (i.e., without going through the
      runtime) via `__demand(__local)`
  * Realm
    - Windows (MSVC) builds are now tested in CI and therefore more likely
        to work
    - Realm runtime can now be shut down and reinitialized in the same process.
        (Exception: GASNet-based network layers do not support this.)
    - Registration of host memory with CUDA driver is skipped for host
        memories larger than 1GB by default due to CUDA driver overhead.
        This threshold can be increased (or decreased) with `-cuda:hostreg`
  * Tools
    - New Rust implementation of Legion Prof is 5-15x faster than the
        original (even with PyPy). For more details, see:
        https://legion.stanford.edu/profiling/#rust-legion-prof

Version 21.03.0 (March 30, 2021)
  * Build
    - CMake can build an embedded copy of GASNet as part of the Legion build
        with `-DLegion_EMBED_GASNet=ON`
  * Regent
    - Contains three breaking changes to the Regent calling convention:
      - Reductions are now aggregated into region requirements and
          sorted by the index of the first field in the field space
          among the set of fields for each reduction.
      - Task arguments may be passed through either `args` or
          `local_args` for index launched tasks. (Previously Regent
          only used `local_args`.)
      - Region values passed via `args` to an index-launched task may
          be *bogus*. Instead the region requirement should be used to
          obtain the original region.
    - Support for constant time index launches. These are enabled
        automatically, but can be forced on or off with `__demand` or
        `__forbid` with `__constant_time_launches`. This should
        improve scalability at extreme node counts.
    - Support for `rescape` and `remit` to generate metaprogrammed
      code more easily.
    - Experimental support for separate compilation via `-fseparate 1`
        allows Regent programs to be compiled in parts (potentially in
        parallel). Note that separate compilation currently cannot be
        used with Bishop and requires one of either parallel or
        incremental compilation if `regentlib.start` is used (does not
        apply to `regentlib.saveobj` or `regentlib.save_tasks`).
  * Legion
    - In the control replication branch users will find a new implementation
      of Legion's physical analysis that uses heuristics to select which
      sub-trees should be used for performing the analysis. Disjoint and
      complete partitions are especially helpful in aiding the runtime.
    - There is a new implementation of the index space math inside of the
      runtime that now soundly and precisely detects congruences between
      index space math operations. This fixes a long-running class of bugs
      that would cause memory explosions in the physical analysis.
    - In the control replication branch users can now map future values into
      memories the same as they do with regions. This means that future
      payloads can be placed directly on devices like GPUs. Similarly, the
      runtime now accepts future data from tasks that also reside in any
      memory in the machine including device memories.
    - Both the master and control replication branches have support for
      index space attach operations.
    - Expensive transitive reductions on traces are now computed in the
      background, allowing trace replays to begin immediately
      with only partial optimizations.
  * Realm
    - Custom reduction operations (including Legion's built-in ones) can
        provide CUDA implementations, permitting in-place reductions in
        CUDA device memory
    - Support for CUDA managed memory (via `-ll:msize`) that is coherent for
        both host and device access.  Includes support for `__managed__`
        variables (only single-GPU if using CUDA runtime hijack mode)
    - `Event::wait` may be called outside of Realm tasks, having the same
        thread-blocking behavior as `Event::external_wait`
    - Experimental support for AMD HIP.  Note that testing coverage is
        incomplete, and breakages may occur in between releases.  For more
        details, see:
        https://github.com/StanfordLegion/legion/issues/1028

Version 20.12.0 (December 28, 2020)
  * Build
    - Legion and Realm now require a compiler with (at least) C++11 support
    - Python scripts (e.g. legion_prof and legion_spy) require Python 3.5
  * Realm
    - Improved performance of inter-node instance copies when data is not
        contiguous in source and/or destination
    - Improved responsiveness of utility processors by not using them for
        background work by default
    - Experimental support for building on Windows with MSVC
    - Improved performance (and correctness) when running CUDA tasks without
        the runtime hijack enabled
    - Added `gasnetex` network layer that uses GASNet-EX's native API (instead
        of the legacy GASNet-1 API support).  Requires GASNet version 2020.11.0
        or newer.  For more details, see:
        https://github.com/StanfordLegion/legion/issues/986
  * Legion
    - The mapping interface no longer requires the runtime to return valid
      instances for empty regions (e.g. regions with no points in their index space)
  * Tools
    - Legion Spy now has support for arbitrary number of dimensions
  * Examples
    - `examples/nccl` gives a simple example of using NCCL with Legion

Version 20.09.0 (September 28, 2020)
  * Legion
    - Support for mapper-controlled reuse of reduction instances.  See:
        https://github.com/StanfordLegion/legion/issues/545
    - Support for creating compact instances of sparse index spaces.  See:
        https://github.com/StanfordLegion/legion/issues/624
  * Realm
    - Switched from function-specific internal threads to generic "background
        workers" that are shared by all subsystems.  The number of workers is
        controlled by `-ll:bgwork` (default=2).  For further details, see:
        https://github.com/StanfordLegion/legion/issues/662
    - Numerous bug/performance/memory leak fixes
    - Support for OpenMP-enabled code running on a Python processor.  The
        total number of threads available to the processor is set with
        `-ll:pyomp` (default=1 - i.e. just the initial thread)
    - Support for C++ tasks on Python processors.  A C++ task does NOT take
        the Python GIL by default - the task body should call
        `PyGILState_{Ensure,Release}` as needed
    - Increased the maximum number of instances in a single memory from 64K
        to 4 million.
    - Improved performance of concurrent CUDA GPU->GPU copies with 3+ GPUs
  * Tools
    - An installed version of Legion now includes legion_spy, legion_prof
        scripts

Version 20.06.0 (June 29, 2020)
  * Regent
    - Support for `std/format` module for type-safe formatted printing
    - Support for documentation with LDoc
    - Support for `__future` operator to import a C API future
  * Legion
    - Support for inlining tasks into leaf contexts
    - Support for global registration callbacks inside of tasks
    - Added semantic tags for source file and line location
    - Support for multi-region accessors for region requirements with
        co-location constraints
    - Changes to semantics of deletion for index spaces, field spaces, and
        logical regions.  For details, see:
        https://github.com/StanfordLegion/legion/issues/812
    - Support for creating field spaces with initial fields
  * Realm
    - Subgraphs can be used to capture a template of Realm operations
        that will be executed repeatedly.  Subgraph definitions include
        support for "interpolating" values into individual operations'
        arguments on each instantiation of the subgraph template
    - `create_weighted_subspaces` supports `size_t` weights for precise
        control over the size of each subspace
    - Added support for `omp critical` constructs and dynamic loop
        schedules in OpenMP tasks
    - Added support for `cudaStreamLegacy` and `cudaStreamPerThread` in
        CUDA tasks
    - Realm logs now include a timestamp (relative to runtime init)
        by default.  This behavior can be disabled with `-logtime 0`
    - Performance improvements for copies/fills of 3D instances in
        GPU device memory
    - Added ability to compute a set of "covering rectangles" for sparse
        index spaces, allowing more compact representation in memory
    - Added `MultiAffineAccessor` for accessing compact instances
    - Added ability to delete a `ProcessorGroup`

Version 20.03.0 (March 31, 2020)
  * Regent
    - Behavior change: `__fields` and `__physical` now both require
        explicit field names, i.e., `__fields(r.{x, y})` rather than
        `__fields(r)`. This makes the behavior less ambiguous and
        helps to avoid bugs
    - Added `complete` and `incomplete` keywords that can be used to
        mark partitions as such
    - Added support for setting mapper ID and tag via
        `t:set_mapper_id()` and `t:set_mapping_tag_id()`
    - Initial support for predicated execution of `if` and `while`
        statements
    - Fixed several bugs and memory leaks, and improved compile times
  * Legion
    - Introduction of Fortran bindings for Legion
    - Support for creating deferred index spaces from future values
    - Support for construction of partitions from a map of domains or
      from a future map
    - Support for reducing a future map to a single future asynchronously
  * Realm
    - Support for Kokkos parallel launch constructs in Realm (and therefore
        Legion) tasks.  Currently supported Kokkos execution spaces
        are: Serial, OpenMP, CUDA.  Application data remains in logical
        regions, but accessors can be converted to Kokkos (unmanaged) Views
        if needed.  See the `kokkos_interop` example
    - Introduction of experimental MPI-based network layer, enabled with
        `REALM_NETWORKS=mpi` (make) or `-DRealm_NETWORKS=mpi` (cmake).
        Use `REALM_NETWORKS=gasnet1` (or USE_GASNET=1, which still works)
        for the GASNet-based network layer (which works with GASNet-1 or
        GASNet-EX)
    - CUDA Runtime API interposer (a.k.a. "hijack") can now be disabled with
        `USE_CUDART_HIJACK=0` (make) or `-DLegion_HIJACK_CUDART=OFF` (cmake).
        This can reduce effectiveness of task-parallelism for CUDA tasks, so
        use only if needed
    - More control over GPU selection via: `-cuda:skipgpus N` which leaves the
        first N GPUs available for other uses, `-cuda:skipbusy` which skips
        over busy GPUs, and `-cuda:minavailmem M` which skips GPUs with less
        than M device memory available
    - Reduction in memory usage of Realm internal data structures
  * Tools
    - There is now a generic launcher script for running Python code
        with Legion that will execute an arbitrary Python program in the
        top-level task of a Legion program. This script mirrors the interface
        to CPython as closely as possible.
    - Legion Spy now supports verification and rendering of indirection copies
    - Legion Prof supports Instance layout constraints related to dimension
        ordering and field alignment
    - Legion Prof contains a menu option for viewing ready state of operations

Version 19.12.0 (December 31, 2019)
  * Build
    - Both builds (Make and CMake) now generate `legion_defines.h` and
        `realm_defines.h`. By default these headers are generated in
        the source directory (Make) or build directory (CMake). This
        means that languages such as Regent and Python no longer
        require MAX_DIM to be specified explicitly
  * Regent
    - Support for CUDA 10
    - Support for field polymorphic tasks
    - Substantially improved the generality of the index launch
        optimization. Task arguments of the form p[i+k] may now be
        used, where k is a variable defined outside of the loop
    - Add flag `-foverride-demand-index-launch` which can be used to
        force loops to be index launched in cases where the compiler
        cannot prove the disjointness of read-write region
        arguments
    - Added reductions for complex64
    - The scripts `install.py` and `setup_env.py` now use CMake to
        build Terra by default, which should improve portability on
        most machines
    - The behavior of `-fcuda 1` has changed: this flag will now issue
        an error if CUDA cannot be enabled (e.g. because the build
        does not support CUDA, or because the machine has no
        GPUs). Omitting this flag will now enable CUDA if it is
        available (and will not error if it is not available).
        The behavior of `-fopenmp 1` has changed similarly.
    - The behavior of `__demand(__cuda)` has changed. This will now
        issue an error if a loop is not eligible for the CUDA
        transformation, regardless of whether CUDA is actually
        available on the current machine or not. The behavior of
        `__demand(__openmp)` has changed similarly.
    - The annotation `__allow(__cuda)` is now permitted, and permits
        (but does not require) tasks to be optimized with CUDA.
    - Experimental support for 2D kernel launch in the CUDA code generation
  * Python
    - Add support for copies
    - Copies and fills now support multiple fields
    - Tasks (including index launches) now support setting the mapper
        ID and tag
  * Legion
    - A major overhaul of the Legion physical analysis to use an
        approach based on bounding volume hierarchies. The change is
        not visible to users, but will likely impact performance. Most
        programs will get faster; programs that create many partitions
        frequently on the fly may get slower. The latter case will be fixed
        in an upcoming release.
    - Added support for indirect copy operations such as gather and
        scatter onto existing copy launchers
  * Realm
    - `Event::subscribe` allows polling via `Event::has_triggered` to
        (eventually) succeed
    - Addition of `CompletionQueue` objects that allow multiple unordered
        `Event` triggers to be efficiently handled by a single consumer
    - Support for `omp_get_level`, `omp_in_parallel`, and
        `omp_set_num_threads` in tasks running on OpenMP processors
    - Support for unstructured scatter and/or gather in copies.  (Handling
        structured cases as well as fills/reductions remains a work in
        progress.)
    - Removed all calls to `Event::wait` from inside other Realm API calls.
        Applications now must make sure that index spaces and instance
        metadata are valid before use.  For details, see:
        https://github.com/StanfordLegion/legion/issues/465

Version 19.09.1 (September 13, 2019)
  * Regent
    - Fix for correctness bug in task inlining.  See:
        https://github.com/StanfordLegion/legion/issues/582

Version 19.09.0 (September 9, 2019)
  * Regent
    - __demand(__index_launch) has been added as an alternative to
        __demand(__parallel) on for loops that avoids confusion with the
        auto-parallelizer. __demand(__parallel) on for loops is deprecated and
        now issues a warning; in a future release this warning will be
        upgraded to an error. For details, see:
        https://github.com/StanfordLegion/legion/issues/520
    - Multi-field expansion is deprecated and now issues an error. The error
        can be temporarily downgraded to a warning, but it is advised that
        users migrate codes away from this syntax as it will become a hard
        error in a future release. For details, see:
        https://github.com/StanfordLegion/legion/issues/501
  * Legion
    - Support for a built-in collection of reduction operators including
        sum, product, max, and min over a variety of types for CPUs and GPUs
  * Realm
    - assorted bug, performance, and memory leak fixes
    - fills to attached HDF5 instances are orders of magnitude faster
    - support for reusing HDF5 file handles with `-hdf5:openfiles` option
    - control which rank opens an HDF5 file with a `rank=nnn:` filename prefix
  * Build System
    - Makefile-based flow attempts to detect CUDA location and GASNet conduit
        if they are not specified
    - Makefile-based flow defaults to building CUDA fat binaries, but can still
        be overridden with the `GPU_ARCH` setting, which now accepts SM arch
        numbers (e.g. "70") as well as names (e.g. "volta")

Version 19.06.0 (June 27, 2019)
  * Legion
    - All tools (Legion Prof, Legion Spy, etc.) now support Python 2 and 3
    - The flag -lg:warn_backtrace prints a backtrace on each warning
        to allow easier pinpointing of problematic code
  * Realm
    - Support for building against debug versions of GASNet
    - Significantly reduced runtime overhead for small Realm tasks
    - External HDF5 instances work with datasets in groups
    - Scheduler locking allows spin-waiting for non-reentrant
        operations (e.g. Python module imports)
    - Memory size (e.g. "-ll:csize") arguments accept k/m/g/t
        size suffixes
    - Better error messages when Realm memory sizes are too large
  * Regent
    - The image, preimage and restrict partitioning operators now
        accept an optional disjoint or aliased keyword to specify the
        disjointness of the resulting partition
    - The address-of operator (&) is now supported
    - Support for explicit field maps for HDF5
  * Legion Prof
    - Menu option to select a subset of the profile information
        for viewing
    - Grouping of memory channels, utilization and additional details
        such as source and destination nodes/processors associated with
        the channels
    - Physical instances contain additional information about the regions
        they belong to
  * Python
    - Support for partitioning operators equal and restriction
    - Support for bool and complex types
    - Support for must epoch launches
    - Support for returning a future out of a fence
    - Fixes for macOS
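
The k/m/g/t size suffixes listed under Realm above can be pictured with the
following Python sketch. It is not Realm's parser: the helper name is
hypothetical, and it assumes that the suffixes are case-insensitive binary
multiples (powers of 1024). The unit of an unsuffixed value depends on the
flag, so it is passed in explicitly here.

```python
# Hypothetical helper, not Realm code. Multipliers are ASSUMED to be
# binary (powers of 1024); the real runtime may define them differently.
SUFFIXES = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(text, default_multiplier=1):
    """Convert a string such as '4g' or '512' into a number of bytes.

    default_multiplier is applied only when no suffix is present, since
    the base unit of a bare number is specific to each flag.
    """
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text) * default_multiplier
```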

Version 19.04.0 (April 30, 2019)
  * Legion
    - Support for dimensions > 3. Set MAX_DIM at build time
        (or -DLegion_MAX_DIM in CMake) to build with any number of
        dimensions up to 9.
    - Change VariantID to 32 bits to match AUTO_GENERATE_ID
    - Improved mapper interfaces for instance allocation and
      failed instance allocation due to layout constraint conflicts
  * Regent
    - Support for index fills
    - Support for disabling structure-slicing on structs by setting
        __no_field_slicing on the struct type
    - Substantial improvements to the auto-parallelizer, CUDA and
      OpenMP code generators
    - Substantial improvements in compile time for tasks with large
      numbers of fields
    - Build fixes for macOS
    - setup_env.py now works on macOS
  * Realm
    - support for #pragma omp single sections in OpenMP processors
    - Realm IDs use explicit bit packing instead of fragile C bit fields
    - numerous fixes for create_equal_subspace deppart operations
    - Support for CUDA 10
  * Legion Prof
    - Added support for recording GPU processor times


Version 18.12.0 (December 27, 2018)
  * Realm
    - More assorted bug fixes
    - Minor performance improvements in logging and accessor code
    - Handle signals on an alternate stack for better debugging/backtraces
  * Regent
    - Added a new built-in complex type
    - Experimental support for building with PUC Lua
    - Multiple fixes to CUDA code generation, vectorization,
        auto-parallelization, and mapping optimization
    - Better error messages for __demand(__leaf) and so on
  * Python
    - Use PyGILState for threading for compatibility with modules (e.g. numpy)
    - Support for calling tasks written in Regent

Version 18.09.0 (September 19, 2018)
  * Legion
    - Support for physical tracing, which can provide up to 7x improvement in
        loops with very small tasks. Can be enabled in the mappers that
        inherit from DefaultMapper using -dm:memoize 1
  * Realm
    - Assorted minor bug fixes
    - Support for development snapshots of GASNet-EX (using GASNet-1
        compatibility interfaces for now)
  * Regent
    - Changed precedence of logical operators (and, or) to match that of
        Lua and Terra (or is now lower-precedence than and)
    - Full support for accessing sparse multi-dimensional regions
    - Initial support for incremental compilation. Enable with
        REGENT_INCREMENTAL=1
    - Changes to make compilation entirely deterministic
    - Multiple compilation speed improvements
    - Support for CUDA scalar reductions
    - Experimental support for parallel prefix operators, including CUDA
  * Python
    - Support for defining methods as tasks
    - Support for passing futures to tasks and index tasks
    - Support for explicit return types on extern tasks
    - Improved support for Futures with encodings other than pickle
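
The change to the precedence of `and` and `or` listed under Regent above can
be illustrated in Python, whose `and` also binds more tightly than `or`. This
is only a sketch of the grouping rule, not Regent code, and the variable
names are hypothetical.

```python
# Illustration of "or is lower-precedence than and". Python is used here
# because it groups these two operators the same way.
a, b, c = True, False, False

# With `and` binding tighter, `a or b and c` groups as `a or (b and c)`.
implicit = a or b and c
and_first = a or (b and c)
or_first = (a or b) and c

# The implicit form agrees with and_first and differs from or_first.
print(implicit, and_first, or_first)
```

Code that relied on the opposite grouping needs explicit parentheses to keep
its old meaning.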

Version 18.05.0 (May 31, 2018)
  * Legion
    - Migrated all node-local Legion reservations to use Realm
      fast reservations and removed no longer necessary continuations
    - Added support for mapper attached data to all Mappable types
    - Added support for assigning a block of IDs to a library in a consistent
        way across nodes via generate_library_task_ids and friends
  * Realm
    - Added support for "fast" reservations that have better
      performance characteristics for reservations local to a node
  * C API
    - Updated projection functor API to match Legion C++ API
  * Regent
    - Regent now generates disjointness constraints for affine
        expressions in partition accesses. E.g. p[i] and p[i+1] are
        now known to be disjoint at compile time as long as p is a
        disjoint partition
    - Support for non-trivial projection functors in index space launches
        such as f(p[i+1])
    - Improvements to compile time spent in various optimization passes
    - Support for parallel compilation with the flag -fjobs N
    - Miscellaneous fixes
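
The disjointness reasoning for affine partition accesses listed under Regent
above can be sketched as follows. This is a simplified model in Python, not
the Regent compiler's actual analysis. It assumes that both accesses have the
form p[i + offset] with the same loop variable i, so that they name different
subregions exactly when their constant offsets differ. The function name is
hypothetical.

```python
def affine_accesses_disjoint(offset_a, offset_b, partition_is_disjoint):
    """Decide whether p[i + offset_a] and p[i + offset_b] are provably disjoint.

    Simplified model (an assumption, not Regent's implementation):
    - if p is an aliased partition, nothing can be proven;
    - if p is disjoint, distinct subregions never overlap, and
      i + offset_a != i + offset_b holds for every i exactly when
      the two offsets differ.
    """
    return partition_is_disjoint and offset_a != offset_b


# p[i] versus p[i+1] on a disjoint partition: provably disjoint.
print(affine_accesses_disjoint(0, 1, partition_is_disjoint=True))
```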

Version 18.02.0 (February 2, 2018)
  * Legion
    - Support for PowerPC vector intrinsics
    - FieldAccessors support "view" coordinates and equivalent bounds checks
    - Improved schedule priorities for Legion meta-tasks
  * Realm
    - Operation priority can now be adjusted after a task/copy is launched
    - Assorted bug/memory leak fixes
    - AffineAccessors support an optional translation from "view" coordinates
        to actual coordinates in the instance being accessed
  * Regent
    - Experimental support for calling Regent tasks from C/C++
    - Support for building with CMake
    - Support for running on PowerPC
  * Bindings
    - Obsolete Lua and Terra bindings have been removed. The remaining Terra
      bindings have been renamed to Regent and now produce libregent.so

Version 17.10.0 (October 27, 2017)
  * Legion
    - Introduction of new partitioning API based on dependent partitioning
    - Deprecation of old partitioning API, LegionRuntime::{Arrays,Accessors}
        namespaces
  * Realm
    - Dependent partitioning API, including dimension-aware IndexSpace
    - Point/Rect types moved to Realm namespace
    - Instance creation allows caller to choose precise memory layout
    - Accessors moved to Realm namespace, changed to match new instance layouts
  * C API
    - The C API is now accessed via the `legion.h` header file. Note that this
        is still a redirect back to the current `legion/legion_c.h` header
  * Legion Prof
    - Added support for minimally invasive dumping of intermediate
        profiling data while the application is still running for long runs
  * Python
    - New Python API bindings and native support for Python processors.
        Compile with USE_PYTHON=1 and run with -ll:py 1 to enable Python.
        Also see examples/python_interop for an example

Version 17.08.0 (August 24, 2017)
  * Build system
    - Added HDF_ROOT variable to customize HDF5 install location
  * Legion
    - New error message format and online reference at
        http://legion.stanford.edu/messages
  * Legion Prof
    - Added new compact binary format for profile logs
    - Added flag: -hl:prof_logfile prof_%.gz
  * Realm
    - Fixes to support big-endian systems
    - Several performance improvements to DMA subsystem
    - Added REALM_DEFAULT_ARGS environment variable
        containing flags to be inserted at front of command line
  * Regent
    - Removed new operator. Unstructured regions are now
        fully allocated by default
    - Added optimization to automatically skip empty tasks
    - Initial support for extern tasks that are defined elsewhere
    - Tasks that use __demand(__openmp) are now constrained
        to run on OpenMP processors by default
    - RDIR: Better support for deeper nested region trees
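
The effect of the REALM_DEFAULT_ARGS variable listed under Realm above can be
pictured with this Python sketch. It is not Realm's implementation: the
function name is hypothetical, and it assumes that the variable is split into
flags shell-style and that the flags are placed directly after the program
name.

```python
import shlex


def effective_args(argv, environ):
    """Return argv with the flags from REALM_DEFAULT_ARGS inserted at the front.

    Assumptions (not confirmed by the changelog): the variable is tokenized
    like a shell command line, and "front" means immediately after argv[0].
    """
    default_args = shlex.split(environ.get("REALM_DEFAULT_ARGS", ""))
    return argv[:1] + default_args + argv[1:]


# Flags given explicitly on the command line come after the defaults.
print(effective_args(
    ["./app", "-hl:prof_logfile", "prof_%.gz"],
    {"REALM_DEFAULT_ARGS": "-ll:csize 2048"},
))
```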

Version 17.05.0 (May 26, 2017)
  * Build system
    - Finally removed long-obsolete SHARED_LOWLEVEL flag
  * Legion
    - Added C++14 [[deprecated]] attribute to existing deprecated APIs.
        All examples should compile without deprecation warnings
    - Added Legion executor that enables support for interoperating
        with Agency inside of Legion tasks
  * Realm
    - Switched to new DMA engine
    - Initial support for OpenMP "processors". Compile with USE_OPENMP
        and run with flags -ll:ocpu and -ll:othr.
  * Regent
    - Added support for running normal tasks on I/O processors
    - Added support for OpenMP code generation via __demand(__openmp)
  * C API
    - Removed the following deprecated types:
          legion_task_result_t
            (obviated by the new task preamble/postamble)
    - Removed the following deprecated APIs:
          legion_physical_region_get_accessor_generic
          legion_physical_region_get_accessor_array
            (use legion_physical_region_get_field_accessor_* instead)
          legion_runtime_set_registration_callback
            (use legion_runtime_add_registration_callback instead)
          legion_runtime_register_task_void
          legion_runtime_register_task
          legion_runtime_register_task_uint32
          legion_runtime_register_task_uint64
            (use legion_runtime_preregister_task_variant_* instead)
          legion_future_from_buffer
          legion_future_from_uint32
          legion_future_from_uint64
          legion_future_from_bytes
            (use legion_future_from_untyped_pointer instead)
          legion_future_get_result
          legion_future_get_result_uint32
          legion_future_get_result_uint64
          legion_future_get_result_bytes
            (use legion_future_get_untyped_pointer instead)
          legion_future_get_result_size
            (use legion_future_get_untyped_size instead)
          legion_future_map_get_result
            (use legion_future_map_get_future instead)

Version 17.02.0 (February 14, 2017)
  * General
    - Bumped copyright dates
  * Legion
    - Merged versioning branch with support for a higher performance
        version numbering computation
    - More efficient analysis for index space task launches
    - Updated custom projection function API
    - Added support for speculative mapping of predicated operations
    - Added index space copy and fill operations
  * Legion Prof
    - Added a stats view of processors grouped by node and processor type
    - Added ability to collapse/expand each processor/channel/memory in
        a timeline. To collapse/expand a row, click the name. To
        collapse/expand the children of a row, click on the triangle
        next to the name.
    - Grouped the processor timelines to be child elements under the stats
        views
    - Added on-demand loading of each processor/stats in a timeline.
        Elements are only loaded when you expand them, saving bandwidth
  * CMake
    - Switched to separate flags for each of the Legion extras directories:
          -DLegion_BUILD_APPS (for ./apps)
          -DLegion_BUILD_EXAMPLES (for ./examples)
          -DLegion_BUILD_TUTORIAL (for ./tutorial)
          -DLegion_BUILD_TESTS (for ./test)

Version 16.10.0 (October 7, 2016)
  * Realm
    - HDF5 support: moved to Realm module, added DMA channels
    - PAPI support: basic profiling (instructions, caches, branches) added
  * Build flow
    - Fixes to support compilation in 32-bit mode
    - Numerous improvements to CMake build
  * Regent
    - Improvements to vectorization of structured codes
  * Apps
    - Removed bit-rotted applications - some have been replaced by examples
        or Regent applications
  * Tests
    - New test infrastructure and top-level test script `test.py`

Version 16.08.0 (August 30, 2016)
  * Realm
    - Critical-enough ("error" and "fatal" by default, controlled with
        -errlevel) logging messages are mirrored to stderr when -logfile is
        used
    - Command-line options for logging (-error and new -errlevel) support
        English names of logging levels (spew, debug, info, print,
        warn/warning, error, fatal, none) as well as integers
  * Legion
    - Rewrite of the Legion shutdown algorithm for improved scalability
      and avoiding O(N^2) behavior in the number of nodes
  * Regent
    - Installer now prompts for RDIR installation
  * Tools
    - Important Legion Spy performance improvements involving transitive
        reductions

Version 16.06.0 (June 15, 2016)
  * Legion
    - New mapper API:
        use ShimMapper for limited backwards compatibility
    - New task variant registration API
        supports specifying layout constraints for region requirements
        old interface is still available but deprecated
    - Several large bug fixes for internal version numbering computation
  * C API
    - The context parameter for many API calls has been removed
  * Tools
    - Total re-write of Legion Spy

Version 16.05.0 (May 2, 2016)
  * Lots of stuff - we weren't itemizing things before this point.