All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Updated triton dependency [#418]
- Fixed strides for QKV gradients for cutlass attention [#535]
- Removed duplicated biases in the FusedMLP layers [#317]
- Rotary embeddings now respect input types [#326]
- Poolformer-style models no longer instantiate useless projection layers [#349]
- Fixed layer position not being properly tracked, which caused extra layernorms in programmatically built xformers [#348]
- Pass use_triton flag to LayerNorm module [#336]
- Four blocksparsity layouts from DeepSpeed [#320]
- Support for several initialization options [#312]
- Conv2DFeedforward, a convolutional feedforward block [#321]
- VisualAttention [#329]
- Automatic blocksparse for causal attention [#334]
- Better hierarchical transformer generation [#345]
- Fused operations with AOTAutograd/NVFuser, integration into MLP [#357]
- Refactored the LRA code to use PyTorch Lightning [#343]
- Fixed some torchscriptability issues [#246]
- Fixed FourierMix to be compatible with AMP [#258]
- Better asserts on QKV dimensions [#264]
- Better performance for FusedMLP and FusedLinearLayer [#283]
- Fixed DeepNorm init missing self-attention [#284]
- Simplicial Embeddings [#259]
- Memory-efficient attention, forward pass [#267]
- MHA benchmark
- MLP benchmark
- Moved all triton kernels to triton v2 [#272]
- Memory-efficient attention, backward pass [#281]
- MetaFormer support [#294]
- Exposed the bias flag for feedforwards, same default as Timm [#220]
- Updated the eps value for layernorm, same default as torch [#221]
- PreNorm bugfix, only one input was normalized [#233]
- Fixed a crash when embedding dimensions did not match the model dimension [#244]
- Added DeepNet (DeepNorm) residual path and init [#227]
- Compositional Attention [#41]
- Experimental Ragged attention [#189]
- Mixture of Experts [#181]
- BlockSparseTensor [#202]
- Nd-tensor support for triton softmax [#210]
- Bugfix for Favor with a single feature map [#183]
- Sanity check blocksparse settings [#207]
- Fixed some picklability issues [#204]
- Much faster fused dropout [#164]
- Fused dropout repeatability [#173]
- Embedding weight tying option [#172]
- Fixed the dropout setting not being properly passed in many attentions [#123]
- Fixed the self-attention optimization not being triggered and a broken residual path [#119]
- Improved speed by not using contiguous Tensors when not needed [#119]
- Attention mask wrapper [#113]
- ViT comparison benchmark [#117]
- Homogenized the masks, additive or boolean [#79][#85][#86]
- Fixed the causality flag not being respected [#103]
- Enabled FusedLayerNorm by default in the factory if Triton is available
- Fixed Favor with fp16
- Fixed Favor trainability
- Fused dropout/bias/activation layer [#58]
- Fused layernorm used by default in the factory [#92]
- Nystrom causal attention [#75]
- More robust blocksparse [#24]
- Rotary embeddings [#32]
- More flexible layernorm [#50]