Releases: DefTruth/CUDA-Learn-Notes

v2.4.1 Pack LayerNorm

25 Sep 06:07
4667308

What's Changed

  • [Nsight] Add nsys/ncu usage, ptx/sass by @DefTruth in #44
  • [DotProd][FP16] support f16x8_pack kernel by @DefTruth in #45
  • [LayerNorm][FP16] Add pack support for f16x8 LD/ST by @DefTruth in #46

Full Changelog: v2.4...v2.4.1

v2.4 Pack Reduce LDST

24 Sep 02:13
bf283f2

What's Changed

  • [Reduce][Kernel] Pack f16/bf16x8 & fp8/i8x16 LD/ST by @DefTruth in #43

Full Changelog: v2.3.1...v2.4

v2.3.1 f16x8 Pack Elementwise

23 Sep 03:44
d43c53d

What's Changed

  • [FA2][Half] Add FA2 f16_mma_m16n8k16 kernel by @DefTruth in #35
  • [Refactor][7/N] CUDA Learn Notes refactor Part-7 by @DefTruth in #36
  • Clamped input range in Sigmoid kernel to prevent overflow by @Phoenix8215 in #37
  • [Sigmoid][F16] Add f16x8_pack kernel, boost 1.5x ~ by @DefTruth in #39
  • [Elementwise][Half] support f16x8_pack kernel, boost 1.1x by @DefTruth in #40
  • [FlashAttention] replace FLOAT4 with LDST128BITS macro by @DefTruth in #41
  • [RELU][FP16] Add f16x8_pack kernel, boost 2.1x by @DefTruth in #42

New Contributors

  • @Phoenix8215 made their first contribution in #37

Full Changelog: v2.3...v2.3.1

v2.3 Refactor 6/N

17 Sep 07:57
f9001b9

What's Changed

  • [Refactor][6/N] CUDA Learn Notes refactor Part-6 by @DefTruth in #17
  • [Refactor][5/N] CUDA Learn Notes refactor Part-5 by @DefTruth in #18
  • [LayerNorm][Half] support fp16x8 packed LayerNorm by @DefTruth in #19
  • [Reduce][Half] add HALF2 & BFLOAT2 macro by @DefTruth in #21
  • [RMSNorm][Half] support fp16x8 packed RMSNorm by @DefTruth in #22
  • [Bugfix][Kernel] fixed some kernel blocks calculate errors by @DefTruth in #23
  • [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in #24
  • [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in #25
  • [RELU][Half] support fp16x8 RELU kernel by @DefTruth in #26
  • [RMSNorm] support f16x8_f32 RMSNorm by @DefTruth in #28
  • [RMSNorm][Kernel] Add FLOAT2/HALF2_VARIANCE macro by @DefTruth in #29
  • [LayerNorm][Kernel] Add HALF2 SUM/SUB/VAR macro by @DefTruth in #30
  • [HGEMM] Add sliced_k & t_8x8_sliced_k_f16x4 by @DefTruth in #31
  • [HGEMV][Half] support hgemv k32/k128/f16 by @DefTruth in #32
  • [FlashAttention] Refactor flash_attn_1_fwd_f32 kernel by @DefTruth in #33
  • Bump up to v2.3 by @DefTruth in #34

Full Changelog: v2.2...v2.3

v2.2 Refactor 5/N

12 Sep 01:36
86ab98e

Full Changelog: 2.1...v2.2

v2.1 Refactor 4/N Part-4

04 Sep 03:16
d616f29

What's Changed

  • [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #10
  • [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #11
  • [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #12
  • [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #13
  • [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in #14

Full Changelog: v2.0...2.1

v2.0 Refactor 4/N

01 Sep 13:09

Full Changelog: v0.8...v2.0

v0.8

21 Aug 02:22
e943e6a

Full Changelog: v0.7...v0.8

CUDA Learn Notes v0.7

24 Jul 02:09
b21d459

Full Changelog: v0.6...v0.7

CUDA Learn Notes v0.5

20 Jun 01:14
69fceb8