Fix wording
mikaylagawarecki committed Oct 4, 2024
1 parent 71c1bac · commit 111843b
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions intermediate_source/transformer_building_blocks.py
@@ -570,14 +570,16 @@ def forward(self, x):
print(f"Total sequence length in nested key/value {kv_len.sum().item()}, max sequence length {kv_len.max().item()}")
out = new_mha_layer(query, key, value, is_causal=False)

# TODO: anything else I can add here?

################################################################################
# Fully masked rows no longer cause NaNs
# --------------------------------------
#
-# There has been a long standing issue with ``nn.MultiheadAttention`` where if a row was
-# fully masked by the key_padding_mask, the output of the attention layer would be NaN
-# See `issue <https://github.com/pytorch/pytorch/issues/41508>`_. This is because
-# the softmax operation would divide by zero.
+# There has been a long-standing issue with ``nn.MultiheadAttention`` and
+# ``scaled_dot_product_attention`` where if a row was fully masked, the output
+# of the attention layer would be NaN. See `issue <https://github.com/pytorch/pytorch/issues/41508>`_.
+# This is because the softmax operation would divide by zero.
#
# Thanks to `this PR <https://github.com/pytorch/pytorch/pull/133882>`_,
# this is no longer the case. Instead, fully masked rows will be set to zero. More
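
The new behavior can be sanity-checked with a minimal sketch (an editor's illustration, not part of this commit): a boolean ``attn_mask`` whose first query row is fully masked should produce a zero output row rather than NaNs, assuming a PyTorch build that includes the fix.

import torch
import torch.nn.functional as F

# Shapes are (batch, num_heads, seq_len, head_dim); values are arbitrary.
query = torch.randn(1, 1, 4, 8)
key = torch.randn(1, 1, 4, 8)
value = torch.randn(1, 1, 4, 8)

# Boolean attn_mask: True means "may attend". Query row 0 attends to nothing,
# i.e. it is fully masked.
attn_mask = torch.ones(4, 4, dtype=torch.bool)
attn_mask[0, :] = False

out = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)

# Historically, row 0 would be NaN (softmax over a row of all -inf divides by
# zero); with the fix it is all zeros.
print(out[0, 0, 0])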
