
feat/refactor partition strategy #13

Merged

Conversation

huangting4201
Collaborator

Motivation

Refactor the partition strategy for data (sequence), weights, gradients, and optimizer states (os).

1. size: int, the size of weight parallel.
2. overlap: bool, enable/disable all_gather/reduce_scatter communication overlap, defaults to False.
3. memory_pool: bool, enable/disable memory pool, defaults to False.
"""
parallel = dict(
    zero1=dict(size=8, fsdp=False),
Contributor

Since we are already using our own wp, should we hide the fsdp option and leave it out of the example config?

Collaborator Author

That works.

Collaborator Author

Updated in 62a665d.

    pipeline=dict(size=1, interleaved_overlap=True),
    sequence_parallel=False,
    weight=dict(size=1, overlap=True, memory_pool=True),
)
Contributor

Should we add an example with WP enabled?

Contributor

And a test case as well.

Collaborator Author

The current config 7B_sft.py already enables wp; we can add a test case for it.

Collaborator Author

Oh, Peng means adding an example where wp size is greater than 1.

Collaborator Author

Added in 62a665d.
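
For illustration, a config with weight parallel size greater than 1 might look like the sketch below; it reuses the keys shown in the hunk above (size, overlap, memory_pool) and drops the fsdp key per the discussion. The weight size of 4 is only an assumed example value, not taken from the PR.

# Illustrative sketch only; the weight size of 4 is an assumed example value.
parallel = dict(
    zero1=dict(size=8),  # fsdp key omitted, per the review discussion above
    pipeline=dict(size=1, interleaved_overlap=True),
    sequence_parallel=False,
    weight=dict(size=4, overlap=True, memory_pool=True),  # weight parallel across 4 ranks
)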

module_shapes: Dict[str, torch.Size] = None


class MemoryPool:
Contributor

@huangting4201 @mwiacx After upgrading to the latest PyTorch (assuming the version that uses the VMM API goes from an RC to an official release), could the memory pool be dropped?
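
For context, a weight-parallel memory pool of this kind typically pre-allocates all-gather buffers keyed by module shape and reuses them across layers and steps. The sketch below only illustrates that general idea under stated assumptions; it is not the MemoryPool implemented in this PR.

from typing import Dict

import torch


class MemoryPoolSketch:
    """Minimal sketch (assumption, not the PR's MemoryPool): keep one
    pre-allocated buffer per named module shape and hand it out on request,
    instead of allocating a fresh all-gather output tensor every step."""

    def __init__(self, module_shapes: Dict[str, torch.Size], dtype=torch.bfloat16, device="cpu"):
        # One buffer per named module shape; callers all-gather weights into it.
        self._buffers = {
            name: torch.empty(shape, dtype=dtype, device=device)
            for name, shape in module_shapes.items()
        }

    def get_buffer(self, name: str) -> torch.Tensor:
        # Reuse the cached buffer rather than allocating a new one.
        return self._buffers[name]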

sunpengsdu merged commit ae5a7ee into InternLM:develop on Feb 1, 2024
13 checks passed
    expert_parallel_size (int): Size of expert parallel.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.rank_num_per_group = self.tensor_parallel_size * self.pipeline_parallel_size
        self.num_group = self.world_size // self.rank_num_per_group
        self.num_tensor_parallel_group = self.world_size // self.tensor_parallel_size
Contributor

QiaolingChen00 · Feb 4, 2024

Why don't we need to account for pp here anymore?
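
For reference, the group-size arithmetic in the hunk above works out as follows; the concrete values here (world_size = 16, tp = 2, pp = 2) are assumed example numbers, not taken from the PR.

# Assumed example values, plugged into the __init__ arithmetic shown above.
world_size = 16
tensor_parallel_size = 2
pipeline_parallel_size = 2

# pp is still folded into rank_num_per_group via the tp * pp product.
rank_num_per_group = tensor_parallel_size * pipeline_parallel_size  # 4
num_group = world_size // rank_num_per_group                        # 4
num_tensor_parallel_group = world_size // tensor_parallel_size      # 8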


    def _get_expert_parallel_ranks(self):
        """
        Create expert and data parallel groups.
        Example: world_size = 8, model_parallel_size = 2, expert_parallel_size = 2
        Example: world_size = 8, tensor_parallel_size = 2, expert_parallel_size = 2
Contributor

What layout does EP follow here?
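
As a point of reference only: for the docstring's example (world_size = 8, tensor_parallel_size = 2, expert_parallel_size = 2), one common layout splits each data-parallel group into expert-parallel subgroups, as sketched below. This is an assumed, DeepSpeed-style convention for illustration; the exact grouping this PR implements may differ.

# Assumed layout for world_size = 8, tp = 2, ep = 2 (illustration only).
world_size, tp, ep = 8, 2, 2
dp = world_size // tp  # 4 data-parallel ranks per tensor-parallel index

# Ranks sharing a tensor-parallel index form one data-parallel group...
data_parallel_groups = [[i + j * tp for j in range(dp)] for i in range(tp)]
# -> [[0, 2, 4, 6], [1, 3, 5, 7]]

# ...which is then chunked into expert-parallel groups of size ep.
expert_parallel_groups = [
    group[k:k + ep] for group in data_parallel_groups for k in range(0, dp, ep)
]
# -> [[0, 2], [4, 6], [1, 3], [5, 7]]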

)


class ISPLinear(ColumnParallelLinear):
Contributor

Is the dedicated Linear for ISP there so that the communicator can be attached?
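
For context, the usual reason to give ISP its own Linear subclass is so that each layer can carry a handle to the weight-parallel communicator, which then all-gathers the sharded weight before forward and frees it afterwards. The snippet below is only a rough sketch of that idea (the class and method names are hypothetical), not the ISPLinear in this PR.

import torch.nn as nn


class ISPLinearSketch(nn.Linear):
    """Rough sketch (assumption): a linear layer that exposes a hook for a
    weight-parallel communicator, so the communicator can find these layers
    and schedule all_gather / reduce_scatter around their forward/backward."""

    def register_communicator(self, communicator) -> None:
        # Hypothetical hook; the real wiring in the PR may differ.
        self._isp_communicator = communicator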
