fix(shard.py): fix isp unpack data indexes err in rotary emb #316
Bug fix 1
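Problem: In the current implementation, the unpack data path pops the indexes field, but the rotary_emb position-encoding computation needs it. When indexes is not passed, rotary_emb falls back to the default offsets=0, which is fine without sequence parallelism. Under the isp algorithm with sequence parallelism, however, the seq dimension is sharded, so without an indexes field that maps one-to-one onto the sharded seq dimension, the rotary_emb computation goes wrong. The mtp/msp/fsp algorithms are not affected, because their sequence parallelism all-gathers the seq dimension before the linear computation; rotary_emb therefore sees the full seq dimension, and the default offsets=0 works.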
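The mismatch described above can be illustrated with a small pure-Python sketch. It assumes that indexes holds each token's position inside its own packed sample and that the offsets=0 default behaves like positions 0..local_len-1 on every rank. The helper names `packed_indexes` and `contiguous_shard` are hypothetical and do not come from the repository.

```python
def packed_indexes(sample_lengths):
    """Position of every token inside its own sample, concatenated (assumed layout)."""
    return [pos for n in sample_lengths for pos in range(n)]


def contiguous_shard(values, rank, world_size):
    """Contiguous slice of the sequence dimension owned by `rank`."""
    chunk = len(values) // world_size
    return values[rank * chunk:(rank + 1) * chunk]


sample_lengths = [5, 3]                   # two samples packed into 8 tokens
indexes = packed_indexes(sample_lengths)  # [0, 1, 2, 3, 4, 0, 1, 2]
sp_size = 2                               # hypothetical sequence-parallel size

for rank in range(sp_size):
    true_positions = contiguous_shard(indexes, rank, sp_size)
    # What a local "start from 0" default would imply on this rank.
    default_positions = list(range(len(true_positions)))
    print(rank, true_positions, default_positions,
          true_positions == default_positions)
```

On rank 0 the two position lists happen to agree, while on rank 1 the shard starts in the middle of a sample and the defaulted positions no longer match the tokens. This is why the sharded indexes have to travel with the sharded seq dimension.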
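Fix: The change is mainly in core/parallel/shard.py. The unpack data path now keeps the indexes field and shards it according to the sp parallel configuration. For 2D seq parallel, indexes is also sharded in a load-balanced manner so that it stays aligned with the seq sharding.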
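As an illustration of keeping indexes aligned with a load-balanced seq split, here is a minimal sketch of one common scheme, the zigzag split used by ring-attention style context parallelism. Whether shard.py uses exactly this layout is an assumption, and `zigzag_shard` is a hypothetical helper, not a function from the repository. The point is only that the same slicing rule is applied to the tokens and to indexes.

```python
def zigzag_shard(values, rank, world_size):
    """Load-balanced split: cut into 2 * world_size chunks and give each rank
    one chunk from the front and the mirrored chunk from the back."""
    n_chunks = 2 * world_size
    if len(values) % n_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * world_size")
    size = len(values) // n_chunks
    chunks = [values[i * size:(i + 1) * size] for i in range(n_chunks)]
    return chunks[rank] + chunks[n_chunks - 1 - rank]


tokens = list("abcdefgh")            # stand-in for the packed seq dimension
indexes = [0, 1, 2, 3, 4, 0, 1, 2]   # per-token positions for samples of length 5 and 3

for rank in range(2):
    # The same rule is applied to both, so token i keeps its own position.
    print(rank, zigzag_shard(tokens, rank, 2), zigzag_shard(indexes, rank, 2))
```

With causal attention, early chunks attend to few keys and late chunks to many, so pairing a front chunk with a back chunk roughly evens out the work per rank.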
Bug fix 2
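Problem: With the dp4cp4 parallel configuration, training on real data runs into "Nan grad norm" after a little over 500 steps. Comparison experiments show that the dp4, dp4tp4, and dp4hp4 configurations do not have this problem.
Fix: After reviewing the open-source ring attn code, the operators used in the update_out_and_lse function were updated to improve numerical stability. After this fix, the loss curves were compared again (figure: loss curve comparison).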
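To show what a numerically safer merge of partial attention results can look like, here is a scalar sketch. It assumes the update is of the sigmoid/logsigmoid kind found in open-source ring attention code; the actual function works on tensors, and the exact operators used in this PR are not shown in the description. The `_naive` and `_stable` suffixes and the scalar helpers are illustrative names.

```python
import math


def sigmoid(x):
    """Logistic function written so that exp() never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def log_sigmoid(x):
    """Stable log(sigmoid(x)) = min(x, 0) - log1p(exp(-|x|))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def update_out_and_lse_naive(out, lse, block_out, block_lse):
    """Direct log(1 + exp(.)) form: exp() overflows when block_lse >> lse."""
    new_lse = lse + math.log(1.0 + math.exp(block_lse - lse))
    out = math.exp(lse - new_lse) * out + math.exp(block_lse - new_lse) * block_out
    return out, new_lse


def update_out_and_lse_stable(out, lse, block_out, block_lse):
    """Same merge expressed through sigmoid / logsigmoid."""
    out = out - sigmoid(block_lse - lse) * (out - block_out)
    lse = lse - log_sigmoid(lse - block_lse)
    return out, lse


print(update_out_and_lse_naive(1.0, 0.0, 3.0, 0.0))
print(update_out_and_lse_stable(1.0, 0.0, 3.0, 0.0))
print(update_out_and_lse_stable(1.0, -1000.0, 3.0, 1000.0))
```

Both forms compute the same softmax-weighted average and the same log-sum-exp for moderate inputs. For a large gap between block_lse and lse, the naive form calls exp() on a large positive number, which raises OverflowError in pure Python and yields inf (and then NaN) with floating-point tensors, while the sigmoid/logsigmoid form stays finite.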
New feature 1
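Testing showed that the loss computed by VocabSequenceParallelCrossEntropyLoss is problematic, so for now the head output goes through an all2all and the loss is still computed in the TP-parallel way. The parallel_output switch is reused: if True, FlashCrossEntropyLoss is used; if False, nn.CrossEntropyLoss is used. The loss curves are as follows (figure: loss curves).
New feature 2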
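Adapted the evaluate module to support isp sequence parallelism. Because valid data has no indexes field, one is added manually so that evaluation works under isp sequence parallelism.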
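A minimal sketch of adding the missing field, assuming a validation batch is a dict whose input_ids entry holds unpacked sequences, so each sequence's positions are simply 0..len-1. The helper `ensure_indexes` and the dict layout are hypothetical; the real evaluate module may build the field differently.

```python
def ensure_indexes(batch):
    """Add an `indexes` entry when the batch lacks one (hypothetical helper).

    Assumes batch["input_ids"] is a list of unpacked token sequences.
    """
    if "indexes" not in batch:
        batch["indexes"] = [list(range(len(seq))) for seq in batch["input_ids"]]
    return batch


valid_batch = {"input_ids": [[11, 12, 13], [21, 22, 23]]}
print(ensure_indexes(valid_batch)["indexes"])
```

A batch that already carries indexes is left unchanged, so the same code path can serve both training and validation data.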