Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Auto Parallel] fix bugs for split_batches_for_accumulation && fix bu… #9217

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

zhangyuqin1998
Copy link
Contributor

@zhangyuqin1998 zhangyuqin1998 commented Sep 29, 2024

…gs for enable_delay_scale_loss

PR types

Bug fixes

PR changes

Others

Description

A.修复动态图自动并行下,split_batches_for_accumulation与动手无法对齐的情况。如图
dae478d91bc7c7a10aa3bcc927793128

B.修复动态图自动并行下,enable_delay_scale_loss逻辑错误的问题。自动并行默认实现enable_delay_scale_loss,预期行为为:

  1. 每个micro batch计算出loss
  2. 反向传播
  3. 对mini batch内的loss进行求和
  4. 对loss进行scale,除以acc数

但当前动态图自动并行的行为为:

  1. 每个micro batch计算出loss
  2. 对该loss进行scale,除以acc数
  3. 反向传播
  4. 对mini batch内的loss进行求和

此外,由于enable_delay_scale_loss逻辑不完善,为自动并行增加开关

C. 修复静态图自动并行下,loss打印展示问题。loss预期的展示行为为:

  1. 对每个micro batch的loss求和
  2. 对求和结果进行scale,除以acc数

但当前静态图自动并行loss对打印展示行为为:

  1. 对每个micro batch的loss进行scale,除以acc数
  2. 进行求和

D. 修复函数名称错误,将traning修正为training

Copy link

paddle-bot bot commented Sep 29, 2024

Thanks for your contribution!

Copy link

codecov bot commented Sep 29, 2024

Codecov Report

Attention: Patch coverage is 0% with 35 lines in your changes missing coverage. Please review.

Project coverage is 53.08%. Comparing base (31e47a5) to head (de3e912).
Report is 31 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/trainer/auto_trainer.py 0.00% 34 Missing ⚠️
paddlenlp/trainer/trainer.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9217      +/-   ##
===========================================
+ Coverage    53.01%   53.08%   +0.06%     
===========================================
  Files          656      657       +1     
  Lines       106153   106543     +390     
===========================================
+ Hits         56278    56558     +280     
- Misses       49875    49985     +110     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

From00
From00 previously approved these changes Sep 29, 2024
Copy link
Collaborator

@From00 From00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhangyuqin1998 zhangyuqin1998 force-pushed the fix_split_batches_for_accumulation branch from 225b3a7 to 513086c Compare October 8, 2024 08:29
From00
From00 previously approved these changes Oct 12, 2024
Copy link
Contributor

@liym27 liym27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhangyuqin1998 zhangyuqin1998 force-pushed the fix_split_batches_for_accumulation branch from 8981571 to 02f6d7d Compare October 17, 2024 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants