Questions about resuming training from ckpt #6414

Open
Jiawei-Guo opened this issue Dec 21, 2024 · 0 comments
Labels
pending This problem is yet to be addressed

Comments

@Jiawei-Guo

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.15
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H100 80GB HBM3
  • DeepSpeed version: 0.15.4

Reproduction

torchrun --nproc_per_node $GPUS_PER_NODE \
    --master_addr $MASTER_ADDR \
    --node_rank $NODE_RANK \
    --master_port $MASTER_PORT \
    --nnodes $NNODES \
    src/train.py \
    --deepspeed LLaMA-Factory/examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path hf_models/Qwen2-VL-7B \
    --dataset mammoth_vl_si \
    --buffer_size 128 \
    --preprocessing_batch_size 128 \
    --streaming \
    --dispatch_batches false \
    --max_steps 160000 \
    --template qwen2_vl \
    --finetuning_type full \
    --output_dir 1208_sft_qwen2vl_mammoth_si \
    --overwrite_cache \
    --overwrite_output_dir false \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --ddp_timeout 9000 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --cutoff_len 16384 \
    --save_steps 1000 \
    --report_to wandb \
    --run_name train_qwen2vl_1208_si \
    --plot_loss \
    --num_train_epochs 1 \
    --bf16
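
Note (editor's addition, not stated in the report): the command above is the original launch and does not show how the resume was requested. Presumably it was either picked up automatically from the existing checkpoints in output_dir (since overwrite_output_dir is false) or requested explicitly by appending the standard HF Trainer flag below to the same command; the checkpoint path here is a placeholder, not taken from the issue:

    --resume_from_checkpoint 1208_sft_qwen2vl_mammoth_si/checkpoint-50000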

Expected behavior

The lines below appeared in the training log about 12 hours ago, but the resumed training still does not appear to have actually started.

May I ask what is causing this?

log:
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 50000
Will skip the first 0 epochs then the first 100000 batches in the first epoch.
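
For context (editor's note, assuming the stock HF Trainer resume path rather than anything LLaMA-Factory-specific): 50000 global steps × gradient_accumulation_steps 2 = 100000 per-device batches, and because the dataset is loaded with --streaming, those 100000 batches are read and preprocessed again just to be discarded before the first new optimizer step, which can take many hours. A commonly used workaround, at the cost of not restoring the exact position in the data stream, is the Trainer's ignore_data_skip option, e.g. appending:

    --ignore_data_skip true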

Others

No response

github-actions bot added the pending label (This problem is yet to be addressed) on Dec 21, 2024