RFC: Code sharing for ET export, C++ runner and tokenizer, with ExecuTorch #1333

larryliu0820 · 2024-10-28T21:12:35Z

🚀 The feature, motivation and pitch

Currently torchchat is having its own implementation for these features:

Utils to optimize, quantize and export eager model to ExecuTorch.
A LLM runner for AOTI and ExecuTorch.
Tokenizers (sentencepiece and tiktoken) used by both AOTI runner and ET runner.

The problem for this design is that it is not bringing in new features checked-in into the export flow in ExecuTorch. What's worse is that the demo apps hosted in torchchat is expecting a .pte file from the export flow in ExecuTorch instead of the one from torchchat and that will easily break if changes happen to one or the other.

Similar story happens to the C++ implementations of the tokenizers. If we look at tokenizers in ExecuTorch it is a lot similar to what tokenizers in torchchat and the code should be unified to avoid duplication.

Alternatives

An alternative is do nothing. If we keep the status quo, DevX will deteriorate due to constant changes from ExecuTorch that we need to incorporate into torchchat.

Additional context

No response

RFC (Optional)

Proposal

On a high level we want to:

Reuse export flow in ExecuTorch's extension/llm directory.
Setup a new repo for tokenizers/samplers under pytorch-labs.
Let runner code depend on the new tokenizer repo.

Details

Export flow:

Currently torchchat uses export.py to export a model to ET's .pte file.
Proposal: fully migrate to ET’s extension/llm.
New dependency: ET nightly build in pip.

Runner:

Torchchat C++ runner needs to work for both AOTI and ET so it’s quite complicated.
Proposal 1 (preferred):

Setup a separate repo for runner and tokenizer code. Both ET and torchchat depend on it.
- Add a public repo under pytorch-labs organization, say pytorch-labs/tokenizers
- Split existing run.cpp into et_run.cpp and aoti_run.cpp
  - et_run.cpp depends on ExecuTorch as well as pytorch-labs/tokenizers,
  - aoti_run.cpp only depends on pytorch-labs/tokenizers.
- Pros: no code duplication, clear dependencies.
- Cons: maintenance cost for a new repo.

Proposal 2 (short term?):

Use runner building blocks and tokenizer from ET. Refactor existing run.cpp to reuse those components. Add ET as a git submodule.
- Pros: no code duplication.
- Cons: if a user only wants to build an AOTI runner, it’s weird to pull in tokenizer code from ET.

Model definition:

Torchchat depends on torchtune for model definition. All the source transformations will come from the ET extension/llm library. Modules that are modified to be torch.exportable will be hosted in ET extension/llm, torchchat should use those as well.

Example: torchtune’s MultiHeadAttention has an input dependent condition that needs to be rewritten into torch.cond so that it’s exportable. This lives in extension/llm/modules and should be used by torchchat. [Pending discussion] If torchtune is open to host these exportable modules, torchchat should depend on torchtune to get them.

Demo app:

For both Android and iOS, we want to build runner and tokenizer as libraries, package them into artifacts and distribute them to torchchat.
We are already doing this for Android demo app
iOS demo app code should live in torchchat as well and both demo apps should be removed from ET in the future.

The text was updated successfully, but these errors were encountered:

shoumikhin · 2024-11-14T18:31:40Z

I like this direction to modularize the components and sharing them for different use-cases beyond those backed by ET.

It seems like what we could think on further is how to actually modularize it even more, ie. not just reuse the Tokenizer and Sampler, but also most of the runner logic, if possible.

Runner now does some things that are still common for ET or AOTI while preparing the inputs and processing the outputs, eg. the run loop, applying chat pattern with special tokens, etc. So in theory we could have a common Runner pipeline and inject specific minimal components to do the actual forward() call implemented differently for ET or AOTI.

So overall we could have a collection of abstract interfaces for Tokenizer, Sampler, Processor (ie. image processor), Runner (to generate a single next token), Pipeline (to utilize the Runner for completion or multi-turn generation), etc. and also provide some concrete implementations for them, eg. TikTokenTokenizer, ArgMaxSampler, ImageProcessor, LLaMA3Runner, LLaMAChatPipeline, etc. That way we provide some working solution out-the-box, and also allow clients implement their own components by inheriting the interfaces and injecting them instead, eg. a custom SentencePieceTokenizer injected into a custom LLaMA2Runner injected into LLaMAChatPipeline instead of the default LLaMA3Runner.

Such interfaces and default implementations can live in pytorch-labs, if that's a good place for everyone to depend on.

gabe-l-hart · 2024-11-18T15:58:22Z

I also really like the idea of unifying these c++ layers! I've been working on extending the existing c++ tokenizer support to handle parts of the tokenizers library from huggingface. I have a draft PR up for discussion, but the code should be very self-contained and portable if the decision is to move this level of support elsewhere.

A couple of key elements of the PR to point out that may be seeds for this discussion:

Addition of PreTokenizer and TokenDecoder abstract interfaces
Basic config-based factory mechanisms for each (e.g. PreTokenizerConfig) with json parsing.
Unit test framework using gtest and ctest

I'm not at all tied to the specific shapes of these implementations, but wanted to point them out as the discussion touches on how to wrangle these abstractions.

One note I did want to make as I was prototyping (I'll comment on the PR as well): I opted not to use enclosing namespaces since they were not used in the existing torchchat c++ layers. I think for a more reusable abstract layer, we would certainly want to use namespace encapsulation.

larryliu0820 · 2024-11-26T21:34:21Z

@shoumikhin @gabe-l-hart thanks for chiming in. I spawn up https://github.com/pytorch-labs/tokenizers as our first step to enforcing code sharing on tokenizers. I'm still waiting for legal approval to open this repo up but would love to collaborate.

metascroy · 2024-11-26T21:59:13Z

In the short term I think this is a good direction, but long-term I do not like how ExecuTorch does not just "work" well with an LLM exported from torch.export.

ExecuTorch having its own export_llama_lib script and special source transformations is not a good user experience in my opinion. My one worry about the proposal here is that it further entrenches this flow.

larryliu0820 · 2024-11-26T22:13:30Z

In the short term I think this is a good direction, but long-term I do not like how ExecuTorch does not just "work" well with an LLM exported from torch.export.

ExecuTorch having its own export_llama_lib script and special source transformations is not a good user experience in my opinion. My one worry about the proposal here is that it further entrenches this flow.

I somewhat agree, I want to point out source transformation is "optional" meaning ET will also work without source transformation, just slower and consumes more memory. Having source transformation available gives ET user an easy way to optimize their LLM model.

It is a separate question on whether we should internalize these optimizations into the framework, e.g., find a smarter way to reduce unnecessary copies.

kimishpatel · 2024-12-02T00:44:28Z

In the short term I think this is a good direction, but long-term I do not like how ExecuTorch does not just "work" well with an LLM exported from torch.export.
ExecuTorch having its own export_llama_lib script and special source transformations is not a good user experience in my opinion. My one worry about the proposal here is that it further entrenches this flow.

I somewhat agree, I want to point out source transformation is "optional" meaning ET will also work without source transformation, just slower and consumes more memory. Having source transformation available gives ET user an easy way to optimize their LLM model.

It is a separate question on whether we should internalize these optimizations into the framework, e.g., find a smarter way to reduce unnecessary copies.

Actually I think it is fine to apply optimizing transformation be that module swap or graph rewrites. to_backend within ET pipeline is one such optimization and that is explicit opt-in. Scripts that wrap all the (export, to_edge, to_backend etc.) provide convenience wrappers. So in general it is better to do a) building composable components and b) have wrappers or boxing logic that puts together these components, thus allowing custom wrappers. It is better compared to monolithic design by say, having a giant optimized operator lib that works like a charm when it does, but is extremely hard to extend.

larryliu0820 self-assigned this Oct 28, 2024

Jack-Khuu added ExecuTorch Issues related to ExecuTorch installation, export, or build. Mobile uses separate tags RFC Request for Comment labels Oct 30, 2024

mergennachin mentioned this issue Nov 26, 2024

RFC: Improve developer experience by anchoring on multimodal use-case pytorch/executorch#7093

Open

Jack-Khuu added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Dec 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Code sharing for ET export, C++ runner and tokenizer, with ExecuTorch #1333

RFC: Code sharing for ET export, C++ runner and tokenizer, with ExecuTorch #1333

larryliu0820 commented Oct 28, 2024

shoumikhin commented Nov 14, 2024

gabe-l-hart commented Nov 18, 2024

larryliu0820 commented Nov 26, 2024

metascroy commented Nov 26, 2024

larryliu0820 commented Nov 26, 2024 •

edited

Loading

kimishpatel commented Dec 2, 2024

RFC: Code sharing for ET export, C++ runner and tokenizer, with ExecuTorch #1333

RFC: Code sharing for ET export, C++ runner and tokenizer, with ExecuTorch #1333

Comments

larryliu0820 commented Oct 28, 2024

🚀 The feature, motivation and pitch

Alternatives

Additional context

RFC (Optional)

Proposal

Details

Export flow:

Runner:

Model definition:

Demo app:

shoumikhin commented Nov 14, 2024

gabe-l-hart commented Nov 18, 2024

larryliu0820 commented Nov 26, 2024

metascroy commented Nov 26, 2024

larryliu0820 commented Nov 26, 2024 • edited Loading

kimishpatel commented Dec 2, 2024

larryliu0820 commented Nov 26, 2024 •

edited

Loading