Highlights
We are releasing torchtune v0.5.0 with lots of exciting new features! This release includes Kaggle integration, a QAT + LoRA training recipe, improved integrations with Hugging Face and vLLM, Gemma 2 models, an early exit training recipe for LayerSkip-style finetuning, and support for Ascend NPU devices.
Kaggle integration (#2002)
torchtune is proud to announce our integration with Kaggle! You can now finetune all your favorite models in Kaggle notebooks using torchtune's Kaggle Model Hub integration. Download a model from the Kaggle Hub, finetune it on your dataset with any torchtune recipe, then upload your best checkpoint back to the Kaggle Hub to share with the community. Check out our example Kaggle notebook here to get started!
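As a rough sketch of the download step (the model handle and `--source` flag below are assumptions rather than a canonical command; the example notebook has the exact invocation):

```bash
# Assumed invocation: pull a model from the Kaggle Hub with the tune CLI
# (requires Kaggle API credentials to be configured)
tune download metaresearch/llama-3.2/pytorch/1b --source kaggle
```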
QAT + LoRA training recipe (#1931)
If you've seen the Llama 3.2 quantized models, you may know that they were trained using quantization-aware training with LoRA adapters. This is an effective way to maintain good model performance when you need to quantize for on-device inference. Now you can train your own quant-friendly LoRA models in torchtune with our QAT + LoRA recipe!
To finetune Llama 3.2 3B with QAT + LoRA, you can run:
```bash
# Download Llama 3.2 3B
tune download meta-llama/Llama-3.2-3B-Instruct --ignore-patterns "original/consolidated.00.pth"

# Finetune on two devices
tune run --nproc_per_node 2 qat_lora_finetune_distributed --config llama3_2/3B_qat_lora
```
Improved Hugging Face and vLLM integration (#2074)
We heard your feedback, and we're happy to say that it's now easier than ever to load your torchtune models into Hugging Face or vLLM! It's as simple as:
```python
from transformers import AutoModelForCausalLM

trained_model_path = "/path/to/my/torchtune/checkpoint"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=trained_model_path,
)
```
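Loading the same checkpoint into vLLM follows the same pattern (a minimal sketch; the prompt and sampling parameters are illustrative):

```python
from vllm import LLM, SamplingParams

trained_model_path = "/path/to/my/torchtune/checkpoint"

# vLLM reads the same Hugging Face-format checkpoint directory
llm = LLM(model=trained_model_path)
outputs = llm.generate(["Tell me a joke."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```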
See the full examples in our docs: Hugging Face, vLLM.
Gemma 2 models (#1835)
We now support models from the Gemma 2 family! This includes the 2B, 9B, and 27B sizes, with recipes for full, LoRA, and QLoRA finetuning on one or more devices. For example, you can finetune Gemma 2 27B with QLoRA by running:
```bash
# Download Gemma 2 27B
tune download google/gemma-2-27b --ignore-patterns "gemma-2-27b.gguf"

# Finetune on a single GPU
tune run lora_finetune_single_device --config gemma2/27B_qlora_single_device
```
A huge thanks to @Optimox for landing these models!
Early exit training recipe (#1076)
LayerSkip is an end-to-end solution to speed up LLM inference. By combining layer dropout with an appropriate dropout schedule and using an early exit loss during training, you can improve the accuracy of early exits at inference time. You can use our early exit config to reproduce experiments from LayerSkip, LayerDrop, and other papers.
You can try torchtune's early exit recipe by running the following:
```bash
# Download Llama 2 7B
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf

# Finetune with early exit on four devices
tune run --nnodes 1 --nproc_per_node 4 dev/early_exit_finetune_distributed --config recipes/dev/7B_full_early_exit.yaml
```
NPU support (#1826)
We are excited to share that torchtune can now be used on Ascend NPU devices! All your favorite single-device recipes can be run as-is, with support for distributed recipes coming later. A huge thanks to @noemotiovon for their work to enable this!
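As a quick sketch, running an existing single-device recipe on an NPU might look like the following (the `device=npu` override and the config name are assumptions; the recipe itself runs unchanged):

```bash
# Assumed invocation: run a single-device LoRA finetune on an Ascend NPU
# (requires torch_npu; overrides the config's device setting on the command line)
tune run lora_finetune_single_device --config llama3_2/1B_lora_single_device device=npu
```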
What's Changed
- nit: Correct compile_loss return type hint by @bradhilton in #1940
- Fix grad accum + FSDP CPU offload, pass None via CLI by @ebsmothers in #1941
- QAT tutorial nit by @SalmanMohammadi in #1945
- A more encompassing fix for offloading + ac by @janeyx99 in #1936
- Add Qwen2.5 to live docs by @RdoubleA in #1949
- [Bug] model_type argument as str for checkpoints classes by @smujjiga in #1946
- llama3.2 90b config updates + nits by @RdoubleA in #1950
- Add Ascend NPU as a backend by @noemotiovon in #1826
- fix missing key by @felipemello1 in #1952
- update memory optimization tutorial by @felipemello1 in #1948
- update configs by @felipemello1 in #1954
- add expandable segment to integration tests by @felipemello1 in #1963
- Fix check in `load_from_full_state_dict` for modified state dicts by @RylanC24 in #1967
- Update torchtune generation to be more flexible by @RylanC24 in #1970
- feat: add gemma2b variants by @Optimox in #1835
- typo by @felipemello1 in #1972
- Update QAT: add grad clipping, torch.compile, collate fn by @andrewor14 in #1854
- VQA Documentation by @calvinpelletier in #1974
- Convert all non-rgb images to rgb by @vancoykendall in #1976
- Early fusion multimodal models by @RdoubleA in #1904
- Refactor Recipe State Dict Code by @pbontrager in #1964
- Update KV Cache to use num_kv_heads instead of num_heads by @mirceamironenco in #1961
- Migrate to `epochs: 1` in all configs by @thomasjpfan in #1981
- Make sure CLIP resized pos_embed is contiguous by @gau-nernst in #1986
- Add `**quantization_kwargs` to `FrozenNF4Linear` and `LoRALinear` and `DoRALinear` by @joecummings in #1987
- Enables Python 3.13 for nightly builds by @thomasjpfan in #1988
- DOC Fixes custom message transform example by @thomasjpfan in #1983
- Pass quantization_kwargs to CLIP builders by @joecummings in #1994
- Adding MM eval tests / attention bugfixes by @SalmanMohammadi in #1989
- Update Qwen2.5 configs by @joecummings in #1999
- nit: Fix/add some type annotations by @bradhilton in #1982
- Fixing `special_tokens` arg in `Llama3VisionTransform` by @SalmanMohammadi in #2000
- Recent updates to the README by @joecummings in #1979
- Bump version to 0.5.0 by @joecummings in #2009
- gemma2 had wrong path to scheduler by @felipemello1 in #2013
- Create _export directory in torchtune by @Jack-Khuu in #2011
- torchrun defaults for concurrent distributed training jobs by @ebsmothers in #2015
- Remove unused FSDP components by @ebsmothers in #2016
- 2D RoPE + CLIP updates by @RdoubleA in #1973
- Some KD recipe cleanup by @ebsmothers in #2020
- Remove lr_scheduler requirement in lora_dpo_single_device by @thomasjpfan in #1991
- chore: remove PyTorch 2.5.0 checks by @JP-sDEV in #1877
- Make tokenize tests readable by @krammnic in #1868
- add flags to readme by @felipemello1 in #2003
- Support for unsharded parameters in state_dict APIs by @RdoubleA in #2023
- [WIP] Reducing eval vision tests runtime by @SalmanMohammadi in #2022
- log rank zero everywhere by @RdoubleA in #2030
- Add LR Scheduler to full finetune distributed by @parthsarthi03 in #2017
- Fix Qlora/lora for 3.2 vision by @felipemello1 in #2028
- CLIP Text Encoder by @calvinpelletier in #1969
- feat(cli): allow users to download models from Kaggle by @KeijiBranshi in #2002
- remove default to ignore safetensors by @felipemello1 in #2042
- Remove deprecated `TiedEmbeddingTransformerDecoder` by @EmilyIsCoding in #2047
- Use hf transfer as default by @felipemello1 in #2046
- Fix issue in loading mixed precision vocab pruned models during torchtune generation for evaluation by @ifed-ucsd in #2043
- [export] Add exportable attention and kv cache by @larryliu0820 in #2049
- Switch to PyTorch's built-in RMSNorm by @calvinpelletier in #2054
- [export] Add exportable position embedding by @larryliu0820 in #2068
- MM Docs nits by @SalmanMohammadi in #2067
- Add support for QAT + LoRA by @andrewor14 in #1931
- Add ability to shard custom layers for DPO and LoRA distributed by @joecummings in #2072
- [ez] remove stale pytorch version check by @ebsmothers in #2075
- Fail early with `packed=True` on MM datasets by @SalmanMohammadi in #2080
- Error message on `packed=True` for stack exchange dataset by @joecummings in #2079
- Fix nightly tests for qat_lora_fintune_distributed by @andrewor14 in #2085
- Update build_linux_wheels.yaml - Pass test-infra input params by @atalman in #2086
- DPO Activation Offloading by @SalmanMohammadi in #2087
- Deprecate `SimpoLoss` by @SalmanMohammadi in #2063
- DPO Recipe Doc by @SalmanMohammadi in #2091
- initial commit by @songhappy in #1953
- Vector Quantized Embeddings by @RdoubleA in #2040
- Fix bug in loading multimodal datasets and update tests accordingly by @joecummings in #2110
- Set gloo process group for FSDP with CPU offload by @ebsmothers in #2108
- Llama 3.3 70B by @pbontrager in #2124
- Llama 3.3 readme updates by @ebsmothers in #2125
- update configs by @felipemello1 in #2107
- Reduce logging output for distributed KD by @joecummings in #2120
- Support Early Exit Loss and/or Layer Dropout by @mostafaelhoushi in #1076
- Update checkpointing directory -> using vLLM and from_pretrained by @felipemello1 in #2074
- pass correct arg by @felipemello1 in #2127
- update configs by @felipemello1 in #2128
- fix qat_lora_test by @felipemello1 in #2131
- guard ckpt imports by @felipemello1 in #2133
- [bug fix] add parents=True by @felipemello1 in #2136
- [bug fix] re-add model by @felipemello1 in #2135
- Update save sizes into GiB by @joecummings in #2143
- [bug fix] remove config download when source is kaggle by @felipemello1 in #2144
- [fix] remove "with_suffix" by @felipemello1 in #2146
- DoRA fixes by @ebsmothers in #2139
- [Fix] Llama 3.2 Vision decoder_trainable flag fixed by @pbontrager in #2150
New Contributors
- @bradhilton made their first contribution in #1940
- @smujjiga made their first contribution in #1946
- @noemotiovon made their first contribution in #1826
- @RylanC24 made their first contribution in #1967
- @vancoykendall made their first contribution in #1976
- @Jack-Khuu made their first contribution in #2011
- @JP-sDEV made their first contribution in #1877
- @KeijiBranshi made their first contribution in #2002
- @EmilyIsCoding made their first contribution in #2047
- @ifed-ucsd made their first contribution in #2043
- @larryliu0820 made their first contribution in #2049
- @atalman made their first contribution in #2086
- @songhappy made their first contribution in #1953
- @mostafaelhoushi made their first contribution in #1076
Full Changelog: v0.4.0...v0.5.0