This file lists the major changes as they appear in the stable branch. No
attempt is made to keep this list accurate for the master branch.
Version 24.12.0 (December 20, 2024)
* Legion
- Numerous bug fixes
* Regent
- Support for running without the CUDA hijack
- Support for NVIDIA Hopper GPU architecture
* Tools
- Support for exporting profiles to NVTXW format
- Simplifications that may improve performance by removing
obsolete features
* Realm
- Remove the need for dynamic_cast in ExternalResource
- Support for CUPTI profiling
- Support for registering per-GPU reduction operations via CUfunction
- Add a flag for ATS/HMM support and shared CPU memories, and disable both by default
- Support for scalable barrier via radix tree
- Support for querying NUMA resources
- Support for backtrace via cpptrace
- More unit tests
- CI coverage for HIP with NVIDIA backend
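The "scalable barrier via radix tree" item above refers to combining arrivals along a tree instead of sending every arrival to a single node. A minimal sketch of that idea, assuming a complete k-ary tree laid out over ranks 0..n-1 (the layout, the function names, and the radix value are illustrative assumptions, not Realm's implementation):

```python
def tree_parent(rank, radix):
    """Parent of `rank` in a complete radix-ary tree rooted at rank 0."""
    return None if rank == 0 else (rank - 1) // radix


def tree_children(rank, radix, num_ranks):
    """Children of `rank`, clipped to the ranks that actually exist."""
    first = rank * radix + 1
    return [c for c in range(first, first + radix) if c < num_ranks]


def tree_depth(num_ranks, radix):
    """Number of hops from the last (deepest) rank up to the root."""
    depth, rank = 0, num_ranks - 1
    while rank > 0:
        rank = (rank - 1) // radix
        depth += 1
    return depth


if __name__ == "__main__":
    # Hypothetical job size: 1000 ranks combined with radix 4.
    print(tree_children(0, 4, 1000), tree_depth(1000, 4))
```

Under these assumptions each rank exchanges messages with at most `radix` children and one parent, so the per-rank fan-in stays constant while the hop count grows only logarithmically with the number of ranks.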
Version 24.09.0 (September 27, 2024)
* Legion
- Bug fixes for control replication and multi-node configurations
* Regent
- Fixes for ROCm 6.0 code generation
* Tools
- Legion Prof now uses subcommands (e.g., `legion_prof view`) to clarify
which options apply to which actions
- Legion Prof now tracks backtraces at the points where blocking wait
calls are performed by the application
- Legion Prof reports more detailed timing information for tasks
- Legion Prof calculates clock skew between nodes and reports it when
relevant
- Commonly used features of Legion Prof are now enabled by default
- The old Python Legion Prof implementation is no longer supported
* Realm
- `Point` fields `x`, `y`, `z` and `w` have been replaced by methods
- Support for launching CUDA tasks onto a CUDA stream asynchronously via
`cuCtxRecordEvent` without needing the CUDA hijack
- Support for CUDA fabric sharing
- Support for host-to-host copies via CUDA DMA
- Support for querying number of NUMA nodes from the `NumaModuleConfig`
- Added reference counting for preimage operations
- Make `std::atomic` the default atomic implementation
- Remove `REALM_CXX_STANDARD`, and bump the minimal requirement to C++17
- Implemented an ABI stable wrapper for GASNetEX
- Additional unit tests including `CircularQueue`, `ReplicatedHeap`,
`find_fastest_path`, `DynamicTableAllocator`, `generate_gather_paths`,
`TransferIteratorIndexSpace`
- Dead code cleanups and bug fixes
Version 24.06.0 (June 28, 2024)
* Build
- Minimum required C++ standard is now 17
- Embedded GASNet build in CMake now automatically enables GPU memory kinds
* Legion
- Support for nonidempotent traces (where the postconditions do
not imply the preconditions of the trace)
- Deletions are now committed in program order, making it easier for
users to reason about when their effects take place
- All tasks (and other operations) are now committed in order (a
prerequisite for anticipated, but not yet implemented, precise
exception support)
- Improvements to Legion's internal algorithm for virtual
instances, fixing various correctness bugs in the implementation
- Improvements to the `DefaultMapper` handling of task layout constraints
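To make the parenthetical on nonidempotent traces concrete: a trace is idempotent when the state it leaves behind already satisfies what it requires on entry, so it can be replayed back to back. A toy check, assuming conditions can be modeled as name-to-state facts (Legion's actual conditions track the validity of physical instances and are richer; `is_idempotent` and the state strings are hypothetical):

```python
def is_idempotent(preconditions, postconditions):
    """True when the state left by the trace satisfies its own preconditions.

    Both arguments map a name (say, a region field) to the state the trace
    needs on entry or leaves on exit. Names the trace leaves untouched are
    absent from `postconditions` and keep their entry state.
    """
    return all(
        postconditions.get(name, state) == state
        for name, state in preconditions.items()
    )


if __name__ == "__main__":
    needs = {"r.x": "valid@inst0"}
    # The trace moves r.x to another instance, so replaying it right away
    # would not find its preconditions satisfied: a nonidempotent trace.
    print(is_idempotent(needs, {"r.x": "valid@inst1"}))
```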
* Regent
- Improvements to make the compiler more deterministic
- Improvements to auto-detect CUDA
- Support for complex numbers in `std/format`
- Static control replication (SCR) and RDIR have been completely
removed. All SCR and RDIR related flags (`-fflow-*`) have been
removed, except for `-fflow 0` which is permitted (but no
longer does anything, and now issues a warning)
* Tools
- Restore profiler's ability to render dependent partitioning channels
- Render mapper information on mapper calls in the profiler
- Render user-provided profiling information in the profiler
* Realm
- UVM support for the HIP module
- Error code support for command line parser
- Support for querying MIG devices from NVML
- Add indirection channel query
- Additional unit tests and bug fixes
Version 24.03.0 (March 27, 2024)
* Build
- ROCm 6.0 is now supported, and support for ROCm 4.x has been removed
* Legion
- Support for control replication has been merged
- Support for discarding region contents on task completion
- Long-deprecated APIs, such as the old `HighLevel` namespace, have
been removed
* Mappers
- Default mapper support for control replication
- Default and null mapper now use C++ `override` keyword
* Regent
- Support for pure projection functors that capture arguments
- Static control replication (SCR) has been deprecated and will be
removed in a future release
* Tools
- The profiler now correctly recognizes the logger format version
and throws an error if it does not match
- The profiler now reports when a profile was generated with debug
mode (or another expensive setting) enabled
- Many profiler fixes for correctly rendering runtime and mapper calls
- Profiler now renders GPU device and host execution separately
- Optimizations to improve profiler memory usage and running time
- Rust profiler now requires at least Rust 1.74
* Realm
- Support for registration of dynamically allocated buffers
- Support for handling poisoned events for reservation
- Refactor CUDA allocation and IPC paths
- Support for querying CUDA device information (GPU UUID and ID),
process information (process ID, hostname, host ID) and timer
calibration error from the profiler
- Remove address alignment from serializer and deserializer
- Support for creating network shared peers using IPC mailbox
- Support OMP thread binding and allow for multiple OMP parallel
sections when the system OMP runtime is enabled
- Add Realm unit tests
- Fixes for Realm tests, sparsity map, MemoryQuery, dynamic framebuffer
memory and memcpy channel
Version 23.12.0 (December 14, 2023)
* Regent
- Support for HIP multi-GPU per runtime
* Realm
- Improve scalability of startup by replacing point-to-point
communication with allgatherv for machine model announcements
- Support shared memory communication for system memory
- Provide sanity check for GPU tasks to detect any leak of CUDA streams
- Support for GPU transposes in CUDA-DMA
- Bug fixes for CUDA-DMA
Version 23.09.0 (September 28, 2023)
* Regent
- Elide future maps in index launches
- Improvements to Pygion interop
* Realm
- Add a machine configuration API that allows applications to configure
the machine model without using the command line
- Expose Realm managed CUDA/HIP stream to applications to launch GPU tasks
without device-wise synchronization when hijack is disabled
- Change timers to use rdtsc
- Improve performance for getting highest priority task available in any
task queue
- Implement framebuffer memory with `cuMemMap`
- Initial work for moving STL dependencies to header only
Version 23.06.0 (June 28, 2023)
* Build
- Fixes for CMake build on macOS
- Fixes for HIP build when arch is specified
* Realm
- Support for better backtraces via libdw and libunwind
- Improve scalability and performance in task spawning by caching
the triggering operation of an event if one is provided
- Fix a minor issue with affinity queries to properly clear the
user-provided vector before populating it
- Add more accurate GPU memory bandwidth affinity calculations if
NVML is available
- Refactor CPU core topology enumeration to serve systems without
NUMA capabilities (like Jetson ARM systems)
- Improve scalability and performance of task spawning by moving event
reuse freelists to be per-processor, reducing lock contention
- Add a microbenchmark for measuring task throughput more accurately
- Add a series of Realm API tutorials
- Replace `CU_EVENT_DEFAULT` with `CU_EVENT_DISABLE_TIMING` for better
performance of CUDA events
- Support Kokkos interop for the HIP module
- Fixes for Realm tests on macOS
* Tools
- Legion Prof now supports search in the new profiler UI
- Legion Prof now supports an HTTP client/server interface. Launch the
server with `--serve` (on port 8080 by default) and attach a client to
it with `--attach http://127.0.0.1:8080`
- Legion Prof now supports a new archival mode via the `--archive`
flag. Generate an offline profile and view it either via `--attach` or
by uploading it to a server and navigating to
`https://legion.stanford.edu/prof-viewer/?url=...`
- Legion Prof modes (client/server/viewer) are now parallel by
default, and perform heavy computations off the UI thread for
better responsiveness
- Add support for rendering indirect copies (i.e., gather/scatter)
- Fix rendering of profiles over HTTP with old profiler UI
- Fix profiling of copies with different numbers of hops between instances
Version 23.03.0 (March 27, 2023)
* Build
- Minimum supported CMake version is now 3.16. (Some optional features may
continue to require even newer versions.)
- Minimum supported GCC version is now 8.
- Minimum supported CUDA version is now 10.
* Legion
- Added support for padded layout constraints to provide scratch space
in instances for tasks to use (see examples/padded_instances).
- Added support for tiled layout constraints to provide an ability to
layout instances by breaking down dimensions (see examples/tiling).
* Realm
- An experimental UCX network backend has been added.
- Updated the Kokkos interop to support Kokkos 4.0.
* Python
- Support loading Legion as a library from a stock Python interpreter.
* Regent
- Fixes to avoid leaking futures.
- Improvements to Regent's predicate optimization.
* Tools
- Legion Prof now supports a native viewer UI. Enable it with the `viewer`
feature (e.g., `cargo run --features=viewer`) and use the flag
`--view`.
- Legion Prof now has better support for rendering a subset of available
nodes. Pass all log files (from all nodes) into Legion Prof and add the
`--subnodes` flag to specify which ones to render. This ensures all
copies in/out of those nodes will be shown correctly.
Version 22.12.0 (December 30, 2022)
* Regent
- Support for nested predication of `if` and `while` statements
* Realm
- Support priorities for Copy operations
- Support building with multiple network backends enabled, and use
-ll:networks (gasnetex/gasnet1/mpi/none) to pick which one to use
during runtime
- Separate CUDA runtime from Realm by removing all references to CUDA
runtime and relying only on driver API, which fixes an issue when
mixing static and dynamic cudart across an application and improves
Realm’s compatibility across driver versions
* Tools
- Legion Prof supports visualization of channels for indirect copies, and
of instances used by different operations, including tasks, copies,
and fills
Version 22.09.0 (September 30, 2022)
* Python
- Support for running packages via `legion_python -m`
- Support for Jupyter Notebook in single-node execution
* Regent
- Deprecated support for LLVM versions less than 11 in
`setup_env.py`. These versions will be removed in the next
release. LLVM 13 is recommended, except on ARM where LLVM 11 is
currently required
- Added support for provenance for all launcher operations
- Debug info is no longer generated by default in order to
optimize compile times. To re-enable it, run with
`-fdebuginfo 1`
* Legion
- Most Legion APIs now support passing a provenance string.
This provenance information is passed through to tools like
Legion Spy and Legion Prof so users can map what they are
seeing back to their source code. In the future, provenance
strings will also be used by all Legion error messages.
* Realm
- Support for fills of arbitrary instances (via multi-hop paths where
needed)
- Fixed crashes when using external instances and network-registered
memory at the same time
- Removed all direct references to CUDA runtime library in CUDA module
- Caching of minimum-cost data transfer path for repeated copies
- Dependent partitioning support for image and preimage using structured
(~affine) transforms in addition to existing unstructured (field-based)
images/preimages
Version 22.06.0 (June 29, 2022)
* Regent
- Support for cross-products in index launches, as well as
multi-level projection functors.
- Support for HIP on AMD GPUs has been added. All tasks marked with
`__demand(__cuda)` are automatically eligible. Note that the name of
the annotation may change in the future to something more general, but
for now no change is being made. Some CUDA flags have migrated to more
general names. See below.
- The flag `-fcuda 1` is deprecated. Use `-fgpu cuda` instead.
- The flag `-fcuda-offline` is deprecated. Use `-fgpu-offline` instead.
- The flag `-fcuda-arch` is deprecated. Use `-fgpu-arch` instead.
- Enable HIP support with `-fgpu hip` and use the `-fgpu-offline` and
`-fgpu-arch` flags as necessary/appropriate.
- Support for new flag `-ffast-math 1` which enables fast-math
optimizations on CPU and GPU. By default, CPU code has this
disabled, and GPU code uses only the `contract` flag in LLVM
to generate FMA instructions. For compute-intensive
applications, additional performance can sometimes be unlocked
by enabling the full suite of optimizations with `-ffast-math 1`,
at the cost of numerical accuracy.
- Performance improvements for CUDA allow recent LLVM versions
(e.g., 13) to match or exceed the performance of LLVM
3.8. Previously, performance regressions made LLVM 3.8 the
most performant version for use with CUDA. The recommended
LLVM version moving forward is 13, and `setup_env.py` has been
updated to set this on all platforms.
- The versions of GASNet and Terra are now pinned by default in
`setup_env.py`. You can choose versions explicitly with
`GASNET_VERSION` (as before, though the previous default was
unpinned) and `--terra-branch`, respectively.
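The CUDA-to-GPU flag renames listed above can be applied mechanically to an existing command line. A small sketch of such a rewrite, covering only the forms these notes name (the helper `migrate_gpu_flags`, the file name `app.rg`, and the `volta` argument are hypothetical and not part of Regent):

```python
# Renames taken from the release notes above.
RENAMED_FLAGS = {
    "-fcuda-offline": "-fgpu-offline",
    "-fcuda-arch": "-fgpu-arch",
}


def migrate_gpu_flags(args):
    """Return a copy of `args` with deprecated CUDA flags rewritten."""
    out = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-fcuda" and i + 1 < len(args) and args[i + 1] == "1":
            # `-fcuda 1` becomes `-fgpu cuda`.
            out += ["-fgpu", "cuda"]
            i += 2
            continue
        # Other forms (for example `-fcuda 0`) are passed through untouched,
        # because the notes do not say what they map to.
        out.append(RENAMED_FLAGS.get(arg, arg))
        i += 1
    return out


if __name__ == "__main__":
    print(migrate_gpu_flags(["app.rg", "-fcuda", "1", "-fcuda-arch", "volta"]))
```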
* Realm
- Allow use of system OpenMP runtime (instead of Realm-provided one) with
`-DLegion_OpenMP_SYSTEM_RUNTIME=ON`. This allows inter-operation with
libraries that have already been linked to the system runtime, but
limits each process to a single OMP processor.
Version 22.03.0 (March 27, 2022)
* Build
- Minimum supported CMake version is now 3.7. (Some optional features
continue to require even newer versions.)
* Realm
- Numerous bug fixes in the `gasnetex` network layer
- CUDA and HIP support allow direct specification of which GPUs to
use via the `-ll:gpu_ids` command-line option
- Added support for copy paths using CUDA IPC between GPUs on the same
physical node
- For applications using CUDA without the runtime API hijack AND only
submitting work to the default CUDA stream, `-cuda:legacysync 1`
reduces the overhead of detecting the completion of device-side work
launched by a task
- Realm reduction copies may now indicate exclusive access to the
destination instance, improving performance by allowing simple
load/store instead of atomic operations
- Custom reduction operations (including Legion's built-in ones) can
provide HIP implementations, permitting in-place reductions in
HIP device memory
* Regent
- Support for custom serialization of types in task parameters and results
- New experimental timing library under std/timing
Version 21.12.0 (December 31, 2021)
* Realm
- Performance improvements for multi-dimensional copies, especially
inter-process transfers
- Support for loading CUDA driver (if present) at runtime instead of
link time, allowing same binary to be used on systems with and without
CUDA-capable GPUs (enabled with -DLegion_CUDA_DYNAMIC_LOAD=ON in the
CMake build)
- A separate `Memory` is now created per process for external (system)
memory instances. This memory has no capacity for creating instances
and can confuse applications or Legion mappers that assume exactly
one Memory of kind `SYSTEM_MEM` exists. Old behavior can be obtained
with `-ll:ext_sysmem 0`, but this can fail for configurations that
register system memory with the network and/or GPUs
- The `MemoryQuery` now supports a `has_capacity` predicate to restrict
results to just memories with sufficient total (not current!) capacity
to allocate an instance of a specified size
* Build
- CMake allows control of max nodes (-DLegion_MAX_NUM_NODES=...) and
max processors/node (-DLegion_MAX_NUM_PROCS=...) supported by
Legion build
- Added dependency tracking to make-based builds
Version 21.09.0 (September 28, 2021)
* Realm
- Numerous bug fixes in the `gasnetex` network layer
- Support for HIP memory type registration with GASNet (with
GASNet version 2021.9.0+)
- Arguments to spawned tasks may now be arbitrarily large (network-
specific limits have been eliminated)
* Regent
- Improved support for dynamic checks on index launches with
potential interference between different region arguments
- Extensive fixes for separate compilation. This mode has now
been verified to work with large-scale applications
- Removed long-obsolete support for `__demand(__external)`
* Pygion
- Add support for layout constraints
Version 21.06.0 (June 24, 2021)
* Build
- Version information is now compiled into Realm and Legion. This takes
the form of a string (e.g. "legion-21.06.0") rather than anything
that can be compared (i.e. no semantic versioning here). Compile-time
defines `REALM_VERSION` and `LEGION_VERSION` are available as well as
run-time calls `Realm::Runtime::get_library_version` and
`Legion::Runtime::get_library_version`.
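Because the version is a plain string, code that needs to order releases has to parse it itself. A minimal sketch, assuming the string keeps the `legion-YY.MM.P` shape of the example above (an assumption: the notes explicitly promise no comparable format, and other builds may report a different string, in which case this helper returns `None`; `parse_release_version` is a hypothetical name):

```python
import re

# Matches only the release-style shape shown in the notes, e.g. "legion-21.06.0".
_VERSION_RE = re.compile(r"^legion-(\d+)\.(\d+)\.(\d+)$")


def parse_release_version(version):
    """Return (year, month, patch) as integers, or None if the shape differs."""
    match = _VERSION_RE.match(version)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    # Tuples compare element by element, which gives release ordering.
    print(parse_release_version("legion-21.06.0"))
```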
* Regent
- Support for dynamic checks on projection functors, enabling a
much larger class of loops to be supported as index launches
- Support for local tasks (i.e., without going through the
runtime) via `__demand(__local)`
* Realm
- Windows (MSVC) builds are now tested in CI and therefore more likely
to work
- Realm runtime can now be shutdown and reinitialized in the same process.
(Exception: GASNet-based network layers do not support this.)
- Registration of host memory with CUDA driver is skipped for host
memories larger than 1GB by default due to CUDA driver overhead.
This threshold can be increased (or decreased) with `-cuda:hostreg`
* Tools
- New Rust implementation of Legion Prof is 5-15x faster than the
original (even with PyPy). For more details, see:
https://legion.stanford.edu/profiling/#rust-legion-prof
Version 21.03.0 (March 30, 2021)
* Build
- CMake can build an embedded copy of GASNet as part of the Legion build
with `-DLegion_EMBED_GASNet=ON`
* Regent
- Contains three breaking changes to the Regent calling convention:
- Reductions are now aggregated into region requirements and
sorted by the index of the first field in the field space
among the set of fields for each reduction.
- Task arguments may be passed through either `args` or
`local_args` for index launched tasks. (Previously Regent
only used `local_args`.)
- Region values passed via `args` to an index-launched task may
be *bogus*. Instead the region requirement should be used to
obtain the original region.
- Support for constant time index launches. These are enabled
automatically, but can be forced on or off with `__demand` or
`__forbid` with `__constant_time_launches`. This should
improve scalability at extreme node counts.
- Support for `rescape` and `remit` to generate metaprogrammed
code more easily.
- Experimental support for separate compilation via `-fseparate 1`
allows Regent programs to be compiled in parts (potentially in
parallel). Note that separate compilation currently cannot be
used with Bishop and requires one of either parallel or
incremental compilation if `regentlib.start` is used (does not
apply to `regentlib.saveobj` or `regentlib.save_tasks`).
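The first calling-convention change above (reductions aggregated into region requirements and ordered by their first field) is easier to see on a small example. The sketch below is one reading of that sentence, assuming "each reduction" means the set of fields reduced with the same operator; the function name and data shapes are hypothetical and this is not Regent's code:

```python
from collections import defaultdict


def order_reduction_requirements(field_space, reductions):
    """Group reduced fields by operator and order the groups.

    field_space: list of field names in field-space order.
    reductions:  list of (field_name, operator) pairs.
    Returns a list of (operator, [fields]) in requirement order.
    """
    index = {name: i for i, name in enumerate(field_space)}
    groups = defaultdict(list)
    for field, op in reductions:
        groups[op].append(field)
    for fields in groups.values():
        fields.sort(key=index.__getitem__)
    # Each group is keyed by the field-space index of its first field.
    return sorted(groups.items(), key=lambda item: index[item[1][0]])


if __name__ == "__main__":
    fields = ["a", "b", "c", "d"]
    print(order_reduction_requirements(fields, [("d", "+"), ("b", "*"), ("a", "+")]))
```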
* Legion
- In the control replication branch users will find a new implementation
of Legion's physical analysis that uses heuristics to select which
sub-trees should be used for performing the analysis. Disjoint and
complete partitions are especially helpful in aiding the runtime.
- There is a new implementation of the index space math inside of the
runtime that now soundly and precisely detects congruences between
index space math operations. This fixes a long-running class of bugs
that would cause memory explosions in the physical analysis.
- In the control replication branch users can now map future values into
memories the same as they do with regions. This means that future
payloads can be placed directly on devices like GPUs. Similarly, the
runtime now accepts future data from tasks that also reside in any
memory in the machine including device memories.
- Both the master and control replication branches have support for
index space attach operations.
- Expensive transitive reductions on traces are now computed in the
background, allowing trace replays to begin immediately
with only partial optimizations.
* Realm
- Custom reduction operations (including Legion's built-in ones) can
provide CUDA implementations, permitting in-place reductions in
CUDA device memory
- Support for CUDA managed memory (via `-ll:msize`) that is coherent for
both host and device access. Includes support for `__managed__`
variables (only single-GPU if using CUDA runtime hijack mode)
- `Event::wait` may be called outside of Realm tasks, having the same
thread-blocking behavior as `Event::external_wait`
- Experimental support for AMD HIP. Note that testing coverage is
incomplete, and breakages may occur in between releases. For more
details, see:
https://github.com/StanfordLegion/legion/issues/1028
Version 20.12.0 (December 28, 2020)
* Build
- Legion and Realm now require a compiler with (at least) C++11 support
- Python scripts (e.g. legion_prof and legion_spy) require Python 3.5
* Realm
- Improved performance of inter-node instance copies when data is not
contiguous in source and/or destination
- Improved responsiveness of utility processors by not using them for
background work by default
- Experimental support for building on Windows with MSVC
- Improved performance (and correctness) when running CUDA tasks without
the runtime hijack enabled
- Added `gasnetex` network layer that uses GASNet-EX's native API (instead
of the legacy GASNet-1 API support). Requires GASNet version 2020.11.0
or newer. For more details, see:
https://github.com/StanfordLegion/legion/issues/986
* Legion
- The mapping interface no longer requires the runtime to return valid
instances for empty regions (e.g. regions with no points in their index space)
* Tools
- Legion Spy now has support for arbitrary number of dimensions
* Examples
- `examples/nccl` gives a simple example of using NCCL with Legion
Version 20.09.0 (September 28, 2020)
* Legion
- Support for mapper-controlled reuse of reduction instances. See:
https://github.com/StanfordLegion/legion/issues/545
- Support for creating compact instances of sparse index spaces. See:
https://github.com/StanfordLegion/legion/issues/624
* Realm
- Switched from function-specific internal threads to generic "background
workers" that are shared by all subsystems. The number of workers is
controlled by `-ll:bgwork` (default=2). For further details, see:
https://github.com/StanfordLegion/legion/issues/662
- Numerous bug/performance/memory leak fixes
- Support for OpenMP-enabled code running on a Python processor. The
total number of threads available to the processor is set with
`-ll:pyomp` (default=1 - i.e. just the initial thread)
- Support for C++ tasks on Python processors. A C++ task does NOT take
the Python GIL by default - the task body should call
`PyGILState_{Ensure,Release}` as needed
- Increased the maximum number of instances in a single memory from 64K
to 4 million.
- Improved performance of concurrent CUDA GPU->GPU copies with 3+ GPUs
* Tools
- An installed version of Legion now includes legion_spy, legion_prof
scripts
Version 20.06.0 (June 29, 2020)
* Regent
- Support for `std/format` module for type-safe formatted printing
- Support for documentation with LDoc
- Support for `__future` operator to import a C API future
* Legion
- Support for inlining tasks into leaf contexts
- Support for global registration callbacks inside of tasks
- Added semantic tags for source file and line location
- Support for multi-region accessors for region requirements with
co-location constraints
- Changes to semantics of deletion for index spaces, field spaces, and
logical regions. For details, see:
https://github.com/StanfordLegion/legion/issues/812
- Support for creating field spaces with initial fields
* Realm
- Subgraphs can be used to capture a template of Realm operations
that will be executed repeatedly. Subgraph definitions include
support for "interpolating" values into individual operations'
arguments on each instantiation of the subgraph template
- `create_weighted_subspaces` supports `size_t` weights for precise
control over the size of each subspace
- Added support for `omp critical` constructs and dynamic loop
schedules in OpenMP tasks
- Added support for `cudaStreamLegacy` and `cudaStreamPerThread` in
CUDA tasks
- Realm logs now include a timestamp (relative to runtime init)
by default. This behavior can be disabled with `-logtime 0`
- Performance improvements for copies/fills of 3D instances in
GPU device memory
- Added ability to compute a set of "covering rectangles" for sparse
index spaces, allowing more compact representation in memory
- Added `MultiAffineAccessor` for accessing compact instances
- Added ability to delete a `ProcessorGroup`
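The `create_weighted_subspaces` item above mentions integer (`size_t`) weights giving precise control over subspace sizes. The sketch below illustrates why integer weights allow an exact split, using a 1D range for simplicity (`weighted_split` is a hypothetical helper and this is not Realm's algorithm):

```python
def weighted_split(num_points, weights):
    """Split range(num_points) into len(weights) contiguous half-open ranges.

    Sizes are proportional to the integer weights; rounding is done on the
    cumulative boundary so the pieces always cover the range exactly.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    pieces = []
    start = 0
    running = 0
    for w in weights:
        running += w
        end = num_points * running // total
        pieces.append((start, end))
        start = end
    return pieces


if __name__ == "__main__":
    print(weighted_split(10, [1, 1, 2]))
```

Working in integers throughout avoids the drift that floating-point fractions can introduce when the weights are large.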
Version 20.03.0 (March 31, 2020)
* Regent
- Behavior change: `__fields` and `__physical` now both require
explicit field names, i.e., `__fields(r.{x, y})` rather than
`__fields(r)`. This makes the behavior more unambiguous and
helps to avoid bugs
- Added `complete` and `incomplete` keywords that can be used to
mark partitions as such
- Added support for setting mapper ID and tag via
`t:set_mapper_id()` and `t:set_mapping_tag_id()`
- Initial support for predicated execution of `if` and `while`
statements
- Fixed several bugs and memory leaks, and improved compile times
* Legion
- Introduction of Fortran bindings for Legion
- Support for creating deferred index spaces from future values
- Support for construction of partitions from a map of domains or
from a future map
- Support for reducing a future map to a single future asynchronously
* Realm
- Support for Kokkos parallel launch constructs in Realm (and therefore
Legion) tasks. Currently supported Kokkos execution spaces
are: Serial, OpenMP, CUDA. Application data remains in logical
regions, but accessors can be converted to Kokkos (unmanaged) Views
if needed. See the `kokkos_interop` example
- Introduction of experimental MPI-based network layer, enabled with
`REALM_NETWORKS=mpi` (make) or `-DRealm_NETWORKS=mpi` (cmake).
Use `REALM_NETWORKS=gasnet1` (or USE_GASNET=1, which still works)
for the GASNet-based network layer (which works with GASNet-1 or
GASNet-EX)
- CUDA Runtime API interposer (a.k.a. "hijack") can now be disabled with
`USE_CUDART_HIJACK=0` (make) or `-DLegion_HIJACK_CUDART=OFF` (cmake).
This can reduce effectiveness of task-parallelism for CUDA tasks, so
use only if needed
- More control over GPU selection via: `-cuda:skipgpus N` which leaves the
first N GPUs available for other uses, `-cuda:skipbusy` which skips
over busy GPUs, and `-cuda:minavailmem M` which skips GPUs with less
than M device memory available
- Reduction in memory usage of Realm internal data structures
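The three GPU selection options above (`-cuda:skipgpus`, `-cuda:skipbusy`, `-cuda:minavailmem`) act as successive filters over the visible devices. A sketch of how they might combine, assuming a simple per-device record; the `Gpu` type, the megabyte unit, the notion of "busy", and the order in which filters apply are all assumptions for illustration rather than Realm's logic:

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    index: int
    busy: bool
    free_mem_mb: int


def select_gpus(gpus, skip_gpus=0, skip_busy=False, min_avail_mem_mb=0):
    """Apply the three filters in sequence and return the surviving GPUs."""
    chosen = []
    for gpu in gpus[skip_gpus:]:                # leave the first N for other uses
        if skip_busy and gpu.busy:              # skip GPUs already in use
            continue
        if gpu.free_mem_mb < min_avail_mem_mb:  # skip GPUs short on memory
            continue
        chosen.append(gpu)
    return chosen


if __name__ == "__main__":
    devices = [Gpu(0, False, 16000), Gpu(1, True, 16000),
               Gpu(2, False, 2000), Gpu(3, False, 16000)]
    picked = select_gpus(devices, skip_gpus=1, skip_busy=True, min_avail_mem_mb=4000)
    print([g.index for g in picked])
```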
* Tools
- There is now a generic launcher script for running Python code
with Legion that will execute an arbitrary Python program in the
top-level task of a Legion program. This script mirrors the interface
to CPython as closely as possible.
- Legion Spy now supports verification and rendering of indirection copies
- Legion Prof supports Instance layout constraints related to dimension
ordering and field alignment
- Legion Prof contains a menu option for viewing ready state of operations
Version 19.12.0 (December 31, 2019)
* Build
- Both builds (Make and CMake) now generate `legion_defines.h` and
`realm_defines.h`. By default these headers are generated in
the source directory (Make) or build directory (CMake). This
means that languages such as Regent and Python no longer
require MAX_DIM to be specified explicitly
* Regent
- Support for CUDA 10
- Support for field polymorphic tasks
- Substantially improved the generality of the index launch
optimization. Task arguments of the form p[i+k] may now be
used, where k is a variable defined outside of the loop
- Add flag `-foverride-demand-index-launch` which can be used to
force loops to be index launched in cases where the compiler
cannot prove the disjointness of read-write region
arguments
- Added reductions for complex64
- The scripts `install.py` and `setup_env.py` now use CMake to
build Terra by default, which should improve portability on
most machines
- The behavior of `-fcuda 1` has changed: this flag will now issue
an error if CUDA cannot be enabled (e.g. because the build
does not support CUDA, or because the machine has no
GPUs). Omitting this flag will now enable CUDA if it is
available (and will not error if it is not available).
The behavior of `-fopenmp 1` has changed similarly.
- The behavior of `__demand(__cuda)` has changed. This will now
issue an error if a loop is not eligible for the CUDA
transformation, regardless of whether CUDA is actually
available on the current machine or not. The behavior of
`__demand(__openmp)` has changed similarly.
- The annotation `__allow(__cuda)` is now permitted, and permits
(but does not require) tasks to be optimized with CUDA.
- Experimental support for 2D kernel launch in the CUDA code generation
* Python
- Add support for copies
- Copies and fills now support multiple fields
- Tasks (including index launches) now support setting the mapper
ID and tag
* Legion
- A major overhaul of the Legion physical analysis to use an
approach based on bounding volume hierarchies. The change is
not visible to users, but will likely impact performance. Most
programs will get faster; programs that create many partitions
frequently on the fly may get slower. The latter case will be fixed
in an upcoming release.
- Added support for indirect copy operations such as gather and
scatter onto existing copy launchers
* Realm
- `Event::subscribe` allows polling via `Event::has_triggered` to
(eventually) succeed
- Addition of `CompletionQueue` objects that allow multiple unordered
`Event` triggers to be efficiently handled by a single consumer
- Support for `omp_get_level`, `omp_in_parallel`, and
`omp_set_num_threads` in tasks running on OpenMP processors
- Support for unstructured scatter and/or gather in copies. (Handling
structured cases as well as fills/reductions remains a work in
progress.)
- Removed all calls to `Event::wait` from inside other Realm API calls.
Applications now must make sure that index spaces and instance
metadata are valid before use. For details, see:
https://github.com/StanfordLegion/legion/issues/465
Version 19.09.1 (September 13, 2019)
* Regent
- Fix for correctness bug in task inlining. See:
https://github.com/StanfordLegion/legion/issues/582
Version 19.09.0 (September 9, 2019)
* Regent
- __demand(__index_launch) has been added as an alternative to
__demand(__parallel) on for loops that avoids confusion with the
auto-parallelizer. __demand(__parallel) on for loops is deprecated and
now issues a warning; in a future release this warning will be
upgraded to an error. For details, see:
https://github.com/StanfordLegion/legion/issues/520
- Multi-field expansion is deprecated and now issues an error. The error
can be temporarily downgraded to a warning, but it is advised that
users migrate codes away from this syntax as it will become a hard
error in a future release. For details, see:
https://github.com/StanfordLegion/legion/issues/501
* Legion
- Support for a built-in collection of reduction operators including
sum, product, max, and min over a variety of types for CPUs and GPUs
* Realm
- assorted bug, performance, and memory leak fixes
- fills to attached HDF5 instances are orders of magnitude faster
- support for reusing HDF5 file handles with `-hdf5:openfiles` option
- control which rank opens an HDF5 file with a `rank=nnn:` filename prefix
* Build System
- Makefile-based flow attempts to detect CUDA location and GASNet conduit
if they are not specified
- Makefile-based flow defaults to building CUDA fat binaries, but can still
be overridden with the `GPU_ARCH` setting, which now accepts SM arch
numbers (e.g. "70") as well as names (e.g. "volta")
Version 19.06.0 (June 27, 2019)
* Legion
- All tools (Legion Prof, Legion Spy, etc.) now support Python 2 and 3
- The flag -lg:warn_backtrace prints a backtrace on each warning
to allow easier pinpointing of problematic code
* Realm
- Support for building against debug versions of GASNet
- Significantly reduced runtime overhead for small Realm tasks
- External HDF5 instances work with datasets in groups
- Scheduler locking allows spin-waiting for non-reentrant
operations (e.g. Python module imports)
- Memory size (e.g. "-ll:csize") arguments accept k/m/g/t
size suffixes
- Better error messages when Realm memory sizes are too large
* Regent
- The image, preimage and restrict partitioning operators now
accept an optional disjoint or aliased keyword to specify the
disjointness of the resulting partition
- The address-of operator (&) is now supported
- Support for explicit field maps for HDF5
* Legion Prof
- Menu option to select a subset of the profile information
for viewing
- Grouping of memory channels, utilization and additional details
such as source and destination nodes/processors associated with
the channels
- Physical instances contain additional information about the regions
they belong to
* Python
- Support for partitioning operators equal and restriction
- Support for bool and complex types
- Support for must epoch launches
- Support for returning a future out of a fence
- Fixes for macOS
Version 19.04.0 (April 30, 2019)
* Legion
- Support for dimensions > 3. Set MAX_DIM at build time
(or -DLegion_MAX_DIM in CMake) to build with any number of
dimensions up to 9.
- Change VariantID to 32 bits to match AUTO_GENERATE_ID
- Improved mapper interfaces for instance allocation and
failed instance allocation due to layout constraint conflicts
* Regent
- Support for index fills
- Support for disabling structure-slicing on structs by setting
__no_field_slicing on the struct type
- Substantial improvements to the auto-parallelizer, CUDA and
OpenMP code generators
- Substantial improvements in compile time for tasks with large
numbers of fields
- Build fixes for macOS
- setup_env.py now works on macOS
* Realm
- support for #pragma omp single sections in OpenMP processors
- Realm IDs use explicit bit packing instead of fragile C bit fields
- numerous fixes for create_equal_subspace deppart operations
- Support for CUDA 10
* Legion Prof
- Added support for recording GPU processor times
Version 18.12.0 (December 27, 2018)
* Realm
- More assorted bug fixes
- Minor performance improvements in logging and accessor code
- Handle signals on an alternate stack for better debugging/backtraces
* Regent
- Added a new built-in complex type
- Experimental support for building with PUC Lua
- Multiple fixes to CUDA code generation, vectorization,
auto-parallelization, and mapping optimization
- Better error messages for __demand(__leaf) and so on
* Python
- Use PyGILState for threading for compatibility with modules (e.g. numpy)
- Support for calling tasks written in Regent
Version 18.09.0 (September 19, 2018)
* Legion
- Support for physical tracing, which can provide up to 7x improvement in
loops with very small tasks. Can be enabled in the mappers that
inherit from DefaultMapper using -dm:memoize 1
* Realm
- Assorted minor bug fixes
- Support for development snapshots of GASNet-EX (using GASNet-1
compatibility interfaces for now)
* Regent
- Changed precedence of logical operators (and, or) to match that of
Lua and Terra (or is now lower-precedence than and)
- Full support for accessing sparse multi-dimensional regions
- Initial support for incremental compilation. Enable with
REGENT_INCREMENTAL=1
- Changes to make compilation entirely deterministic
- Multiple compilation speed improvements
- Support for CUDA scalar reductions
- Experimental support for parallel prefix operators, including CUDA
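The precedence change to `and`/`or` noted in this list can be illustrated with a minimal sketch. Python is used only as a stand-in here, because its `and` and `or` have the same relative precedence as in Lua; this is not Regent code, and the helper names are hypothetical.

```python
# Illustration only: Python, like Lua and Terra, gives `or` lower
# precedence than `and`, so `a or b and c` groups as `a or (b and c)`.

def or_binds_looser(a, b, c):
    """Grouping used when `or` has lower precedence than `and`."""
    return a or (b and c)

def other_grouping(a, b, c):
    """The other possible grouping, shown only for contrast."""
    return (a or b) and c

a, b, c = True, True, False
unparenthesized = a or b and c  # parsed with `and` binding tighter

# The unparenthesized form agrees with or_binds_looser, not other_grouping.
print(unparenthesized, or_binds_looser(a, b, c), other_grouping(a, b, c))
```

Expressions that mix `and` and `or` without parentheses may be worth re-checking after upgrading; explicit parentheses make the intended grouping independent of precedence.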
* Python
- Support for defining methods as tasks
- Support for passing futures to tasks and index tasks
- Support for explicit return types on extern tasks
- Improved support for Futures with encodings other than pickle
Version 18.05.0 (May 31, 2018)
* Legion
- Migrated all node-local Legion reservations to use Realm
fast reservations and removed no longer necessary continuations
- Added support for mapper attached data to all Mappable types
- Added support for assigning a block of IDs to a library in a consistent
way across nodes via generate_library_task_ids and friends
* Realm
- Added support for "fast" reservations that have better
performance characteristics for reservations local to a node
* C API
- Updated projection functor API to match Legion C++ API
* Regent
- Regent now generates disjointness constraints for affine
expressions in partition accesses. E.g. p[i] and p[i+1] are
now known to be disjoint at compile time as long as p is a
disjoint partition
- Support for non-trivial projection functors in index space launches
such as f(p[i+1])
- Improvements to compile time spent in various optimization passes
- Support for parallel compilation with the flag -fjobs N
- Miscellaneous fixes
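The disjointness reasoning for affine partition accesses described in this list can be sketched with a toy model. This is not the Regent compiler's implementation: the partition is modeled as a hypothetical Python dict from color to a set of elements, and `affine_accesses_disjoint` is an illustrative helper that assumes constant offsets.

```python
from itertools import combinations

def is_disjoint_partition(partition):
    """True if no element appears in more than one subregion."""
    return all(a.isdisjoint(b) for a, b in combinations(partition.values(), 2))

def affine_accesses_disjoint(partition, k1, k2):
    """Conservatively decide whether p[i + k1] and p[i + k2] are disjoint for every i.

    With constant offsets, the two colors differ for every i exactly when
    k1 != k2, and distinct colors of a disjoint partition never overlap.
    A False result means "cannot prove disjoint", not "definitely overlapping".
    """
    return k1 != k2 and is_disjoint_partition(partition)

# Hypothetical example data: one disjoint and one aliased partition.
p = {0: {0, 1}, 1: {2, 3}, 2: {4, 5}}
aliased = {0: {0, 1, 2}, 1: {2, 3}, 2: {4, 5}}

print(affine_accesses_disjoint(p, 0, 1))        # p[i] vs p[i+1]
print(affine_accesses_disjoint(p, 1, 1))        # p[i+1] vs p[i+1]
print(affine_accesses_disjoint(aliased, 0, 1))  # partition is not disjoint
```

The check is deliberately conservative: it only reports disjointness when both the offsets and the partition guarantee it, which mirrors the "as long as p is a disjoint partition" condition in the entry.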
Version 18.02.0 (February 2, 2018)
* Legion
- Support for PowerPC vector intrinsics
- FieldAccessors support "view" coordinates and equivalent bounds checks
- Improved schedule priorities for Legion meta-tasks
* Realm
- Operation priority can now be adjusted after a task/copy is launched
- Assorted bug/memory leak fixes
- AffineAccessors support an optional translation from "view" coordinates
to actual coordinates in the instance being accessed
* Regent
- Experimental support for calling Regent tasks from C/C++
- Support for building with CMake
- Support for running on PowerPC
* Bindings
- Obsolete Lua and Terra bindings have been removed. The remaining Terra
bindings have been renamed to Regent and now produce libregent.so
Version 17.10.0 (October 27, 2017)
* Legion
- Introduction of new partitioning API based on dependent partitioning
- Deprecation of old partitioning API, LegionRuntime::{Arrays,Accessors}
namespaces
* Realm
- Dependent partitioning API, including dimension-aware IndexSpace
- Point/Rect types moved to Realm namespace
- Instance creation allows caller to choose precise memory layout
- Accessors moved to Realm namespace, changed to match new instance layouts
* C API
- The C API is now accessed via the `legion.h` header file. Note that this
is still a redirect back to the current `legion/legion_c.h` header
* Legion Prof
- Added support for minimally invasive dumping of intermediate
profiling data while the application is still running for long runs
* Python
- New Python API bindings and native support for Python processors.
Compile with USE_PYTHON=1 and run with -ll:py 1 to enable Python.
Also see examples/python_interop for an example
Version 17.08.0 (August 24, 2017)
* Build system
- Added HDF_ROOT variable to customize HDF5 install location
* Legion
- New error message format and online reference at
http://legion.stanford.edu/messages
* Legion Prof
- Added new compact binary format for profile logs
- Added flag: -hl:prof_logfile prof_%.gz
* Realm
- Fixes to support big-endian systems
- Several performance improvements to DMA subsystem
- Added REALM_DEFAULT_ARGS environment variable
containing flags to be inserted at front of command line
* Regent
- Removed new operator. Unstructured regions are now
fully allocated by default
- Added optimization to automatically skip empty tasks
- Initial support for extern tasks that are defined elsewhere
- Tasks that use __demand(__openmp) are now constrained
to run on OpenMP processors by default
- RDIR: Better support for deeper nested region trees
Version 17.05.0 (May 26, 2017)
* Build system
- Finally removed long-obsolete SHARED_LOWLEVEL flag
* Legion
- Added C++14 [[deprecated]] attribute to existing deprecated APIs.
All examples should compile without deprecation warnings
- Added Legion executor that enables support for interoperating
with Agency inside of Legion tasks
* Realm
- Switched to new DMA engine
- Initial support for OpenMP "processors". Compile with USE_OPENMP
and run with flags -ll:ocpu and -ll:othr.
* Regent
- Added support for running normal tasks on I/O processors
- Added support for OpenMP code generation via __demand(__openmp)
* C API
- Removed the following deprecated types:
legion_task_result_t
(obviated by the new task preamble/postamble)
- Removed the following deprecated APIs:
legion_physical_region_get_accessor_generic
legion_physical_region_get_accessor_array
(use legion_physical_region_get_field_accessor_* instead)
legion_runtime_set_registration_callback
(use legion_runtime_add_registration_callback instead)
legion_runtime_register_task_void
legion_runtime_register_task
legion_runtime_register_task_uint32
legion_runtime_register_task_uint64
(use legion_runtime_preregister_task_variant_* instead)
legion_future_from_buffer
legion_future_from_uint32
legion_future_from_uint64
legion_future_from_bytes
(use legion_future_from_untyped_pointer instead)
legion_future_get_result
legion_future_get_result_uint32
legion_future_get_result_uint64
legion_future_get_result_bytes
(use legion_future_get_untyped_pointer instead)
legion_future_get_result_size
(use legion_future_get_untyped_size instead)
legion_future_map_get_result
(use legion_future_map_get_future instead)
Version 17.02.0 (February 14, 2017)
* General
- Bumped copyright dates
* Legion
- Merged versioning branch with support for a higher performance
version numbering computation
- More efficient analysis for index space task launches
- Updated custom projection function API
- Added support for speculative mapping of predicated operations
- Added index space copy and fill operations
* Legion Prof
- Added a stats view of processors grouped by node and processor type
- Added ability to collapse/expand each processor/channel/memory in
a timeline. To collapse/expand a row, click the name. To
collapse/expand the children of a row, click on the triangle
next to the name.
- Grouped the processor timelines to be child elements under the stats
views
- Added on-demand loading of each processor/stats in a timeline.
Elements are only loaded when you expand them, saving bandwidth
* CMake
- Switched to separate flags for each of the Legion extras directories:
-DLegion_BUILD_APPS (for ./apps)
-DLegion_BUILD_EXAMPLES (for ./examples)
-DLegion_BUILD_TUTORIAL (for ./tutorial)
-DLegion_BUILD_TESTS (for ./test)
Version 16.10.0 (October 7, 2016)
* Realm
- HDF5 support: moved to Realm module, added DMA channels
- PAPI support: basic profiling (instructions, caches, branches) added
* Build flow
- Fixes to support compilation in 32-bit mode
- Numerous improvements to CMake build
* Regent
- Improvements to vectorization of structured codes
* Apps
- Removed bit-rotted applications - some have been replaced by examples
or Regent applications
* Tests
- New test infrastructure and top-level test script `test.py`
Version 16.08.0 (August 30, 2016)
* Realm
- Critical-enough ("error" and "fatal" by default, controlled with
-errlevel) logging messages are mirrored to stderr when -logfile is
used
- Command-line options for logging (-error and new -errlevel) support
English names of logging levels (spew, debug, info, print,
warn/warning, error, fatal, none) as well as integers
* Legion
- Rewrite of the Legion shutdown algorithm for improved scalability
and avoiding O(N^2) behavior in the number of nodes
* Regent
- Installer now prompts for RDIR installation
* Tools
- Important Legion Spy performance improvements involving transitive
reductions
Version 16.06.0 (June 15, 2016)
* Legion
- New mapper API:
use ShimMapper for limited backwards compatibility
- New task variant registration API
supports specifying layout constraints for region requirements
old interface is still available but deprecated
- Several large bug fixes for internal version numbering computation
* C API
- The context parameter for many API calls has been removed
* Tools
- Total re-write of Legion Spy
Version 16.05.0 (May 2, 2016)
* Lots of stuff - we weren't itemizing things before this point.