Releases: DefTruth/CUDA-Learn-Notes
v2.4.1 Pack LayerNorm
v2.4 Pack Reduce LDST
What's Changed
Full Changelog: v2.3.1...v2.4
v2.3.1 f16x8 Pack Elementwise
What's Changed
- [FA2][Half] Add FA2 f16_mma_m16n8k16 kernel by @DefTruth in #35
- [Refactor][7/N] CUDA Learn Notes refactor Part-7 by @DefTruth in #36
- Clamped input range in Sigmoid kernel to prevent overflow by @Phoenix8215 in #37
- [Sigmoid][F16] Add f16x8_pack kernel, boost 1.5x ~ by @DefTruth in #39
- [Elementwise][Half] support f16x8_pack kernel, boost 1.1x by @DefTruth in #40
- [FlashAttention] replace FLOAT4 with LDST128BITS macro by @DefTruth in #41
- [RELU][FP16] Add f16x8_pack kernel, boost 2.1x by @DefTruth in #42
New Contributors
- @Phoenix8215 made their first contribution in #37
Full Changelog: v2.3...v2.3.1
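The Sigmoid entry in this release (#37) clamps the kernel input so the exponential cannot overflow. As a rough host-side illustration of that idea, and not the repository's actual kernel, the sketch below clamps the argument before calling `exp`. The names `clamped_sigmoid`, `kMinExpInput`, and `kMaxExpInput`, and the bound of 88, are assumptions chosen for single-precision floats. A half-precision kernel would need a much tighter bound, since `exp(x)` already exceeds the largest finite half value (65504) a little above x = 11.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical clamp bounds for float; not taken from the repository.
// exp(88) is still finite in single precision, exp(89) is not.
constexpr float kMaxExpInput = 88.0f;
constexpr float kMinExpInput = -88.0f;

// Sigmoid with the input clamped so that exp() stays finite.
inline float clamped_sigmoid(float x) {
    // Clamp first, then evaluate 1 / (1 + e^{-v}).
    float v = std::min(std::max(x, kMinExpInput), kMaxExpInput);
    return 1.0f / (1.0f + std::exp(-v));
}
```

In a device kernel the same clamp would typically be expressed with the device-side min/max functions and applied per element (or per packed pair) before the exponential.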
v2.3 Refactor 6/N
What's Changed
- [Refactor][6/N] CUDA Learn Notes refactor Part-6 by @DefTruth in #17
- [Refactor][5/N] CUDA Learn Notes refactor Part-6 by @DefTruth in #18
- [LayerNorm][Half] support fp16x8 packed LayerNorm by @DefTruth in #19
- [Reduce][Half] add HALF2 & BFLOAT2 macro by @DefTruth in #21
- [RMSNorm][Half] support fp16x8 packed RMSNorm by @DefTruth in #22
- [Bugfix][Kernel] fixed some kernel blocks calculate errors by @DefTruth in #23
- [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in #24
- [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in #25
- [RELU][Half] support fp16x8 RELU kernel by @DefTruth in #26
- [RMSNorm] support f16x8_f32 RMSNorm by @DefTruth in #28
- [RMSNorm][Kernel] Add FLOAT2/HALF2_VARIANCE macro by @DefTruth in #29
- [LayerNorm][Kernel] Add HALF2 SUM/SUB/VAR macro by @DefTruth in #30
- [HGEMM] Add slicked_k&t_8x8_sliced_k_f16x4 by @DefTruth in #31
- [HGEMV][Half] support hgemv k32/k128/f16 by @DefTruth in #32
- [FlashAttention] Refactor flash_attn_1_fwd_f32 kernel by @DefTruth in #33
- Bump up to v2.3 by @DefTruth in #34
Full Changelog: v2.2...v2.3
v2.2 Refactor 5/N
What's Changed
- [Refactor][5/N] CUDA Learn Notes refactor Part-5 by @DefTruth in #15
- Bump up to v2.2 by @DefTruth in #16
Full Changelog: 2.1...v2.2
v2.1 Refactor 4/N Part-4
What's Changed
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #10
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #11
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #12
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #13
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #14
Full Changelog: v2.0...2.1
v2.0 Refactor 4/N
Full Changelog: v0.8...v2.0
v0.8
CUDA Learn Note v0.7
CUDA Learn Notes v0.5
Full Changelog: v0.3...v0.5