Highlights
We are releasing torchtune v0.5.0 with lots of exciting new features! This release includes Kaggle integration, a QAT + LoRA training recipe, improved integrations with Hugging Face and vLLM, Gemma 2 models, an early exit training recipe for LayerSkip-style finetuning, and support for Ascend NPU devices.
Kaggle integration (#2002)
torchtune is proud to announce our integration with Kaggle! You can now finetune all your favorite models in Kaggle notebooks using torchtune's Kaggle Model Hub integration. Download a model from the Kaggle Hub, finetune it on your dataset with any torchtune recipe, then upload your best checkpoint back to the Kaggle Hub to share with the community. Check out our example Kaggle notebook here to get started!
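As a rough sketch of the download step (the model handle and `--source` flag below are assumptions rather than a canonical command; the example notebook has the exact invocation):

```bash
# Assumed invocation: pull a model from the Kaggle Hub with the tune CLI
# (requires Kaggle API credentials to be configured)
tune download metaresearch/llama-3.2/pytorch/1b --source kaggle
```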
QAT + LoRA training recipe (#1931)
If you've seen the Llama 3.2 quantized models, you may know that they were trained using quantization-aware training with LoRA adapters. This is an effective way to maintain good model performance when you need to quantize for on-device inference. Now you can train your own quant-friendly LoRA models in torchtune with our QAT + LoRA recipe!
To finetune Llama 3.2 3B with QAT + LoRA, you can run:
```bash
# Download Llama 3.2 3B
tune download meta-llama/Llama-3.2-3B-Instruct --ignore-patterns "original/consolidated.00.pth"

# Finetune on two devices
tune run --nproc_per_node 2 qat_lora_finetune_distributed --config llama3_2/3B_qat_lora
```
Improved Hugging Face and vLLM integration (#2074)
We heard your feedback, and we're happy to say that it's now easier than ever to load your torchtune models into Hugging Face or vLLM! It's as simple as:
```python
from transformers import AutoModelForCausalLM

trained_model_path = "/path/to/my/torchtune/checkpoint"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=trained_model_path,
)
```
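Loading the same checkpoint into vLLM follows the same pattern (a minimal sketch; the prompt and sampling parameters are illustrative):

```python
from vllm import LLM, SamplingParams

trained_model_path = "/path/to/my/torchtune/checkpoint"

# vLLM reads the same Hugging Face-format checkpoint directory
llm = LLM(model=trained_model_path)
outputs = llm.generate(["Tell me a joke."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```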
See the full examples in our docs: Hugging Face, vLLM.
Gemma 2 models (#1835)
We now support models from the Gemma 2 family! This includes the 2B, 9B, and 27B sizes, with recipes for full, LoRA, and QLoRA finetuning on one or more devices. For example, you can finetune Gemma 2 27B with QLoRA by running:
```bash
# Download Gemma 2 27B
tune download google/gemma-2-27b --ignore-patterns "gemma-2-27b.gguf"

# Finetune on a single GPU
tune run lora_finetune_single_device --config gemma2/27B_qlora_single_device
```
A huge thanks to @Optimox for landing these models!
Early exit training recipe (#1076)
LayerSkip is an end-to-end solution to speed up LLM inference. By combining layer dropout with an appropriate dropout schedule and using an early exit loss during training, you can improve the accuracy of early exits at inference time. You can use our early exit config to reproduce experiments from LayerSkip, LayerDrop, and other papers.
You can try torchtune's early exit recipe by running the following:
```bash
# Download Llama 2 7B
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf

# Finetune with early exit on four devices
tune run --nnodes 1 --nproc_per_node 4 dev/early_exit_finetune_distributed --config recipes/dev/7B_full_early_exit.yaml
```
NPU support (#1826)
We are excited to share that torchtune can now be used on Ascend NPU devices! All your favorite single-device recipes can be run as-is, with support for distributed recipes coming later. A huge thanks to @noemotiovon for their work to enable this!
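As a quick sketch, running an existing single-device recipe on an NPU might look like the following (the `device=npu` override and the config name are assumptions; the recipe itself runs unchanged):

```bash
# Assumed invocation: run a single-device LoRA finetune on an Ascend NPU
# (requires torch_npu; overrides the config's device setting on the command line)
tune run lora_finetune_single_device --config llama3_2/1B_lora_single_device device=npu
```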
What's Changed
- nit: Correct compile_loss return type hint by @bradhilton in #1940
- Fix grad accum + FSDP CPU offload, pass None via CLI by @ebsmothers in #1941
- QAT tutorial nit by @SalmanMohammadi in #1945
- A more encompassing fix for offloading + ac by @janeyx99 in #1936
- Add Qwen2.5 to live docs by @RdoubleA in #1949
- [Bug] model_type argument as str for checkpoints classes by @smujjiga in #1946
- llama3.2 90b config updates + nits by @RdoubleA in #1950
- Add Ascend NPU as a backend by @noemotiovon in #1826
- fix missing key by @felipemello1 in #1952
- update memory optimization tutorial by @felipemello1 in #1948
- update configs by @felipemello1 in #1954
- add expandable segment to integration tests by @felipemello1 in #1963
- Fix check in `load_from_full_state_dict` for modified state dicts by @RylanC24 in #1967
- Update torchtune generation to be more flexible by @RylanC24 in #1970
- feat: add gemma2b variants by @Optimox in #1835
- typo by @felipemello1 in #1972
- Update QAT: add grad clipping, torch.compile, collate fn by @andrewor14 in #1854
- VQA Documentation by @calvinpelletier in #1974
- Convert all non-rgb images to rgb by @vancoykendall in #1976
- Early fusion multimodal models by @RdoubleA in #1904
- Refactor Recipe State Dict Code by @pbontrager in #1964
- Update KV Cache to use num_kv_heads instead of num_heads by @mirceamironenco in #1961
- Migrate to `epochs: 1` in all configs by @thomasjpfan in #1981
- Make sure CLIP resized pos_embed is contiguous by @gau-nernst in #1986
- Add `**quantization_kwargs` to `FrozenNF4Linear` and `LoRALinear` and `DoRALinear` by @joecummings in #1987
- Enables Python 3.13 for nightly builds by @thomasjpfan in #1988
- DOC Fixes custom message transform example by @thomasjpfan in #1983
- Pass quantization_kwargs to CLIP builders by @joecummings in #1994
- Adding MM eval tests / attention bugfixes by @SalmanMohammadi in #1989
- Update Qwen2.5 configs by @joecummings in #1999
- nit: Fix/add some type annotations by @bradhilton in #1982
- Fixing `special_tokens` arg in `Llama3VisionTransform` by @SalmanMohammadi in #2000
- Recent updates to the README by @joecummings in #1979
- Bump version to 0.5.0 by @joecummings in #2009
- gemma2 had wrong path to scheduler by @felipemello1 in #2013
- Create _export directory in torchtune by @Jack-Khuu in #2011
- torchrun defaults for concurrent distributed training jobs by @ebsmothers in #2015
- Remove unused FSDP components by @ebsmothers in #2016
- 2D RoPE + CLIP updates by @RdoubleA in #1973
- Some KD recipe cleanup by @ebsmothers in #2020
- Remove lr_scheduler requirement in lora_dpo_single_device by @thomasjpfan in #1991
- chore: remove PyTorch 2.5.0 checks by @JP-sDEV in #1877
- Make tokenize tests readable by @krammnic in #1868
- add flags to readme by @felipemello1 in #2003
- Support for unsharded parameters in state_dict APIs by @RdoubleA in #2023
- [WIP] Reducing eval vision tests runtime by @SalmanMohammadi in #2022
- log rank zero everywhere by @RdoubleA in #2030
- Add LR Scheduler to full finetune distributed by @parthsarthi03 in #2017
- Fix Qlora/lora for 3.2 vision by @felipemello1 in #2028
- CLIP Text Encoder by @calvinpelletier in #1969
- feat(cli): allow users to download models from Kaggle by @KeijiBranshi in #2002
- remove default to ignore safetensors by @felipemello1 in #2042
- Remove deprecated `TiedEmbeddingTransformerDecoder` by @EmilyIsCoding in #2047
- Use hf transfer as default by @felipemello1 in #2046
- Fix issue in loading mixed precision vocab pruned models during torchtune generation for evaluation by @ifed-ucsd in #2043
- [export] Add exportable attention and kv cache by @larryliu0820 in #2049
- Switch to PyTorch's built-in RMSNorm by @calvinpelletier in #2054
- [export] Add exportable position embedding by @larryliu0820 in #2068
- MM Docs nits by @SalmanMohammadi in #2067
- Add support for QAT + LoRA by @andrewor14 in #1931
- Add ability to shard custom layers for DPO and LoRA distributed by @joecummings in #2072
- [ez] remove stale pytorch version check by @ebsmothers in #2075
- Fail early with `packed=True` on MM datasets by @SalmanMohammadi in #2080
- Error message on `packed=True` for stack exchange dataset by @joecummings in #2079
- Fix nightly tests for qat_lora_fintune_distributed by @andrewor14 in #2085
- Update build_linux_wheels.yaml - Pass test-infra input params by @atalman in #2086
- DPO Activation Offloading by @SalmanMohammadi in #2087
- Deprecate `SimpoLoss` by @SalmanMohammadi in #2063
- DPO Recipe Doc by @SalmanMohammadi in #2091
- initial commit by @songhappy in #1953
- Vector Quantized Embeddings by @RdoubleA in #2040
- Fix bug in loading multimodal datasets and update tests accordingly by @joecummings in #2110
- Set gloo process group for FSDP with CPU offload by @ebsmothers in #2108
- Llama 3.3 70B by @pbontrager in #2124
- Llama 3.3 readme updates by @ebsmothers in #2125
- update configs by @felipemello1 in #2107
- Reduce logging output for distributed KD by @joecummings in #2120
- Support Early Exit Loss and/or Layer Dropout by @mostafaelhoushi in #1076
- Update checkpointing directory -> using vLLM and from_pretrained by @felipemello1 in #2074
- pass correct arg by @felipemello1 in #2127
- update configs by @felipemello1 in #2128
- fix qat_lora_test by @felipemello1 in #2131
- guard ckpt imports by @felipemello1 in #2133
- [bug fix] add parents=True by @felipemello1 in #2136
- [bug fix] re-add model by @felipemello1 in #2135
- Update save sizes into GiB by @joecummings in #2143
- [bug fix] remove config download when source is kaggle by @felipemello1 in #2144
- [fix] remove "with_suffix" by @felipemello1 in #2146
- DoRA fixes by @ebsmothers in #2139
- [Fix] Llama 3.2 Vision decoder_trainable flag fixed by @pbontrager in #2150
New Contributors
- @bradhilton made their first contribution in #1940
- @smujjiga made their first contribution in #1946
- @noemotiovon made their first contribution in #1826
- @RylanC24 made their first contribution in #1967
- @vancoykendall made their first contribution in #1976
- @Jack-Khuu made their first contribution in #2011
- @JP-sDEV made their first contribution in #1877
- @KeijiBranshi made their first contribution in #2002
- @EmilyIsCoding made their first contribution in #2047
- @ifed-ucsd made their first contribution in #2043
- @larryliu0820 made their first contribution in #2049
- @atalman made their first contribution in #2086
- @songhappy made their first contribution in #1953
- @mostafaelhoushi made their first contribution in #1076
Full Changelog: v0.4.0...v0.5.0