
v0.2.0

@pbontrager released this on 16 Jul 16:26

Overview

It’s been a while since we’ve done a release, and we have a ton of cool new features in the torchtune library, including distributed QLoRA support, new models, sample packing, and more! Check out #new-contributors for an exhaustive list of new contributors to the repo.

Enjoy the new release and happy tuning!

New Features

Here are some highlights of the new features in v0.2.0.

Recipes

  • We added support for QLoRA with FSDP2! This means users can now run 70B+ models on multiple GPUs. We provide example configs for Llama2 7B and 70B sizes. Note: this currently requires you to install PyTorch nightlies to access the FSDP2 methods. (#909)
  • Also by leveraging FSDP2, we see a 12% increase in tokens/sec and a 3.2x speedup in model init over FSDP1 with LoRA (#855)
  • We added support for other variants of the Meta-Llama3 recipes, including:
    • 70B with LoRA (#802)
    • 70B full finetune (#993)
    • 8B memory-efficient full finetune, which saves 46% peak memory over the previous version (#990)
  • We introduced a quantization-aware training (QAT) recipe. Training with QAT shows significant improvement in model quality if you plan on quantizing your model post-training (#980); a sketch of the underlying prepare/convert flow appears after this list.
  • We also made updates to the eval recipe, including:
    • Batched inference for faster eval (#947)
    • Support for free generation tasks in EleutherAI Eval Harness (#975)
    • Support for custom eval configs (#1055)
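
For context on what the QAT recipe does under the hood, below is a minimal sketch of a prepare/convert flow, assuming torchao’s prototype QAT quantizer as it existed around this release. The import path and API shown are assumptions; the recipe and config from #980 are the supported entry point.

```python
# Minimal QAT sketch (not the recipe itself). Assumes torchao's prototype QAT
# quantizer; the import path and API may differ across torchao versions.
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

from torchtune.models.llama3 import llama3_8b

model = llama3_8b()  # randomly initialized here; load checkpoint weights in practice

# Swap in fake-quantized linears so training observes quantization error
qat_quantizer = Int8DynActInt4WeightQATQuantizer()
model = qat_quantizer.prepare(model)

# ... run the fine-tuning loop on `model` ...

# After training, convert the fake-quantized modules into actually quantized ones
model = qat_quantizer.convert(model)
```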

Models

  • Phi-3 Mini-4K-Instruct from Microsoft (#876)
  • Gemma 7B from Google (#971)
  • Code Llama2: 7B, 13B, and 70B sizes from Meta (#847)
  • @salman designed and implemented reward modeling for Mistral models (#840, #991). A sketch instantiating the new model builders appears after this list.
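
For reference, the new architectures are exposed through torchtune’s usual builder functions. A minimal sketch is below; the builder names are taken from the linked PRs (treat `mistral_reward_7b` in particular as an assumption), and the modules come back randomly initialized, so you would load checkpoint weights in practice.

```python
# Instantiate the newly added architectures via their torchtune builders.
# Weights are randomly initialized here; load checkpoints for real use.
from torchtune.models.phi3 import phi3_mini
from torchtune.models.gemma import gemma_7b
from torchtune.models.code_llama2 import code_llama2_7b
from torchtune.models.mistral import mistral_reward_7b  # name assumed from #840/#991

phi3 = phi3_mini()                  # Phi-3 Mini-4K-Instruct
gemma = gemma_7b()                  # Gemma 7B
code_llama = code_llama2_7b()       # Code Llama2 7B
reward_model = mistral_reward_7b()  # Mistral 7B with a scalar reward head
```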

Perf, memory, and quantization

  • We made improvements to our FSDP + Llama3 recipe, resulting in a further 13% reduction in allocated memory for the 8B model. (#865)
  • Added int8 per-token dynamic activation + int4 per-axis grouped weight (8da4w) quantization (#884); a sketch appears after this list.
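
As a rough illustration of what 8da4w post-training quantization looks like in code, here is a sketch assuming torchao’s `Int8DynActInt4WeightQuantizer` API from this era; the import path and default settings are assumptions, and torchtune’s quantize recipe and config remain the supported path.

```python
# 8da4w post-training quantization sketch: int8 dynamic per-token activations,
# int4 grouped per-axis weights. Import path/API assumed from torchao of this era.
from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer

from torchtune.models.llama3 import llama3_8b

model = llama3_8b()  # load your fine-tuned weights in practice

quantizer = Int8DynActInt4WeightQuantizer()  # default group size
quantized_model = quantizer.quantize(model)
```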

Data/Datasets

  • We added support for a widely requested feature - sample packing! This feature drastically speeds up model training, e.g. 2x faster with the alpaca dataset (#875, #1109); see the sketch after this list.
  • In addition to instruct tuning, we now also support continued pretraining, with several example datasets like wikitext and CNN DailyMail. (#868)
  • Users can now train on multiple datasets at once using dataset concatenation (#889)
  • We now support OpenAI conversation-style data (#890)
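
As a minimal Python sketch of sample packing, assuming the `packed` flag exposed by the dataset builders in #875 (the tokenizer path and `max_seq_len` below are placeholders):

```python
# Build the alpaca dataset with sample packing enabled.
# The tokenizer path and max_seq_len are placeholders; `packed` is the flag
# added for sample packing in the dataset builders.
from torchtune.datasets import alpaca_dataset
from torchtune.models.llama2 import llama2_tokenizer

tokenizer = llama2_tokenizer("/path/to/tokenizer.model")  # placeholder path

ds = alpaca_dataset(
    tokenizer=tokenizer,
    max_seq_len=2048,
    packed=True,  # pack multiple samples into each sequence for higher throughput
)
```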

Miscellaneous

  • @jeromeku added a much more advanced profiler so users can understand the exact bottlenecks in their LLM training. (#1089)
  • We made several metric logging improvements:
    • Log tokens/sec, per-step logging, configurable memory logging (#831)
    • Better formatting for stdout memory logs (#817)
  • Users can now save models in the safetensors format. (#1096)
  • Updated activation checkpointing to support selective-layer and selective-op checkpointing (#785)
  • We worked with the Hugging Face team to provide support for loading adapter weights fine-tuned via torchtune directly into the PEFT library (#933); a loading sketch appears after this list.
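
For the PEFT integration, loading torchtune-trained adapter weights into Hugging Face PEFT looks roughly like the sketch below; the model ID and adapter directory are placeholders, and #933 plus the accompanying docs describe the supported workflow.

```python
# Attach adapter weights fine-tuned with torchtune to a Hugging Face base model.
# The model ID and adapter directory are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
peft_model = PeftModel.from_pretrained(base_model, "/path/to/torchtune/adapter_dir")
```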

Documentation

  • We wrote a new tutorial for fine-tuning Llama3 with chat data (#823) and revamped the datasets tutorial (#994)
  • Looooooooong overdue, but we added proper documentation for the tune CLI (#1052)
  • Improved contributing guide (#896)

Bug Fixes

  • @Optimox found and fixed a bug so that LoRA dropout is now correctly applied (#996)
  • Fixed a broken link for the Llama3 tutorial (#805)
  • Fixed Gemma model generation (#1016)
  • Bug workaround: to download CNN DailyMail, first launch a single-device recipe; once the dataset is downloaded, you can use it with distributed recipes.

New Contributors

Full Changelog: v0.1.1...v0.2.0