Questions about resuming training from ckpt #6414

Open
Jiawei-Guo opened this issue Dec 21, 2024 · 0 comments
Labels
pending This problem is yet to be addressed

Comments

@Jiawei-Guo

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.15
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H100 80GB HBM3
  • DeepSpeed version: 0.15.4

Reproduction

torchrun --nproc_per_node $GPUS_PER_NODE \
    --master_addr $MASTER_ADDR \
    --node_rank $NODE_RANK \
    --master_port $MASTER_PORT \
    --nnodes $NNODES \
    src/train.py \
    --deepspeed LLaMA-Factory/examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path hf_models/Qwen2-VL-7B \
    --dataset mammoth_vl_si \
    --buffer_size 128 \
    --preprocessing_batch_size 128 \
    --streaming \
    --dispatch_batches false \
    --max_steps 160000 \
    --template qwen2_vl \
    --finetuning_type full \
    --output_dir 1208_sft_qwen2vl_mammoth_si \
    --overwrite_cache \
    --overwrite_output_dir false \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --ddp_timeout 9000 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --cutoff_len 16384 \
    --save_steps 1000 \
    --report_to wandb \
    --run_name train_qwen2vl_1208_si \
    --plot_loss \
    --num_train_epochs 1 \
    --bf16
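
Note (editor's addition, not stated in the report): the command above is the original launch and does not show how the resume was requested. Presumably it was either picked up automatically from the existing checkpoints in output_dir (since overwrite_output_dir is false) or requested explicitly by appending the standard HF Trainer flag below to the same command; the checkpoint path here is a placeholder, not taken from the issue:

    --resume_from_checkpoint 1208_sft_qwen2vl_mammoth_si/checkpoint-50000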

Expected behavior

The lines below appeared in the training log about 12 hours ago, but the resumed training still does not appear to have actually started.

May I ask what is causing this?

log:
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 50000
Will skip the first 0 epochs then the first 100000 batches in the first epoch.
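
For context (editor's note, assuming the stock HF Trainer resume path rather than anything LLaMA-Factory-specific): 50000 global steps × gradient_accumulation_steps 2 = 100000 per-device batches, and because the dataset is loaded with --streaming, those 100000 batches are read and preprocessed again just to be discarded before the first new optimizer step, which can take many hours. A commonly used workaround, at the cost of not restoring the exact position in the data stream, is the Trainer's ignore_data_skip option, e.g. appending:

    --ignore_data_skip true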

Others

No response

github-actions bot added the pending label (This problem is yet to be addressed) on Dec 21, 2024