feat: model init in parallel #84

Open
wants to merge 1 commit into
base: main

Conversation

stcy07
Collaborator

@stcy07 stcy07 commented Sep 14, 2024

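Changes model init from running serially across models to running in parallel across models. Correctness of the Megatron compilation step is left to Megatron's internal lock file. Tested on llama2-7b (single node, 8 GPUs) and 70b (4 nodes, 32 GPUs); both ran 2 episodes successfully. Before merging, though, I recommend repeating the runs several more times to verify correctness.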
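
For illustration only, here is a minimal sketch of the general pattern described above: several models initialize concurrently while a lock file serializes the compilation step. It is not the code in this PR, and it does not reproduce Megatron's actual locking mechanism. The model names, file paths, and the `compile_once`, `init_model`, and `init_all` helpers are all hypothetical, the `time.sleep` calls stand in for real work, and `fcntl.flock` is assumed to be available (Unix only).

```python
import fcntl
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical build directory and marker files, for illustration only.
BUILD_DIR = tempfile.mkdtemp()
LOCK_PATH = os.path.join(BUILD_DIR, "build.lock")
DONE_PATH = os.path.join(BUILD_DIR, "build.done")


def compile_once(log):
    """Serialized step: one caller builds, the others wait and then skip."""
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            if not os.path.exists(DONE_PATH):
                time.sleep(0.05)  # stand-in for compiling kernels
                open(DONE_PATH, "w").close()
                log.append("compiled")
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def init_model(name, log):
    """Per-model init: a short serialized compile, then concurrent work."""
    compile_once(log)
    time.sleep(0.2)  # stand-in for building the model and loading weights
    return name


def init_all(names):
    """Initialize every model in its own thread; return names, log, seconds."""
    log = []
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        ready = list(pool.map(lambda n: init_model(n, log), names))
    return ready, log, time.monotonic() - start


if __name__ == "__main__":
    ready, log, elapsed = init_all(["policy", "reference", "reward", "value"])
    print(ready, log, f"{elapsed:.2f}s")
```

In this sketch only the compile step is mutually exclusive; the rest of each model's init overlaps with the others, so any speedup depends on how much of the init time falls outside the locked region.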

@haolin-nju
Collaborator

So model init is actually still serial because of the lock file? Where does the parallel init across models come in?


2 participants