Replace WeightOnlyInt8Linear with TorchAO int8_weight_only quantization #1328

vmpuri · 2024-10-24T22:51:01Z

Replace the WeightOnlyInt8Linear quantization code with TorchAO's int8_weight_only quantization.
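
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of what int8 weight-only quantization does numerically. It is an illustration only, not torchao's or torchchat's implementation: the helper names (`quantize_row`, `quantize_weight`, `linear_int8`) are hypothetical, and it assumes symmetric quantization with one scale per output channel, which is how `groupsize: 0` is read here.

```python
# Illustrative sketch only (hypothetical helpers, not torchao code):
# symmetric int8 weight-only quantization with one scale per output channel.

def quantize_row(row):
    """Quantize one output channel's weights to int8; return (values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in row]
    return q, scale

def quantize_weight(weight):
    """Quantize a 2-D weight (list of rows) into int8 rows plus per-row scales."""
    pairs = [quantize_row(row) for row in weight]
    return [q for q, _ in pairs], [s for _, s in pairs]

def linear_int8(x, q_weight, scales):
    """Weight-only linear: activations stay float, weights are rescaled int8."""
    return [
        s * sum(xi * qi for xi, qi in zip(x, q_row))
        for q_row, s in zip(q_weight, scales)
    ]

weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]

q_weight, scales = quantize_weight(weight)
y_ref = [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]
y_q = linear_int8(x, q_weight, scales)

print("float output:", y_ref)
print("int8 output: ", y_q)
```

The quantized output differs from the float output by a small rounding error, which is why the PR compares perplexity before and after the change.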

Note - this commit also contains lintrunner changes.

Testing:

python3 torchchat.py eval llama3.2-1b --quantize '{"linear:int8": {"groupsize": 0}, "executor":{"accelerator":"cuda"}}' --compile
Using device=cuda
Loading model...
Time to load model: 1.21 seconds
Quantizing the model with: {'linear:int8': {'groupsize': 0}, 'executor': {'accelerator': 'cuda'}}
quantizer is linear int8
Time to quantize model: 0.31 seconds
-----------------------------------------------------------
2024-10-24:15:55:20,261 INFO     [huggingface.py:162] Using device 'cuda'
2024-10-24:15:55:27,792 WARNING  [task.py:763] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-24:15:55:27,792 WARNING  [task.py:775] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-24:15:55:27,792 WARNING  [task.py:763] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-24:15:55:27,792 WARNING  [task.py:775] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-24:15:55:27,792 WARNING  [task.py:763] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
2024-10-24:15:55:27,792 WARNING  [task.py:775] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
Repo card metadata block was not found. Setting CardData to empty.
2024-10-24:15:55:28,687 WARNING  [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2024-10-24:15:55:28,760 INFO     [task.py:395] Building contexts for wikitext on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 501.80it/s]
2024-10-24:15:55:28,889 INFO     [evaluator.py:362] Running loglikelihood_rolling requests
100%|████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [01:10<00:00,  1.13s/it]
Time to run eval: 78.96s.
Time in model.forward: 62.57s, over 162 model evaluations
forward run time stats - Median: 0.00s Min: 0.00s Max: 41.80s
For model /home/puri/.torchchat/model-cache/meta-llama/Meta-Llama-3.2-1B-Instruct/model.pth
wikitext:
 word_perplexity,none: 19.2032
 byte_perplexity,none: 1.7378
 bits_per_byte,none: 0.7973
 alias: wikitext

From current master:

python3 torchchat.py eval llama3.2-1b --quantize '{"linear:int8": {"groupsize": 0}, "executor":{"accelerator":"cuda"}}' --compile
Using device=cuda
Loading model...
Time to load model: 1.20 seconds
Quantizing the model with: {'linear:int8': {'groupsize': 0}, 'executor': {'accelerator': 'cuda'}}
Time to quantize model: 0.19 seconds
-----------------------------------------------------------
2024-10-24:15:43:59,945 INFO     [huggingface.py:162] Using device 'cuda'
2024-10-24:15:44:07,664 WARNING  [task.py:763] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-24:15:44:07,664 WARNING  [task.py:775] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-24:15:44:07,664 WARNING  [task.py:763] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-24:15:44:07,664 WARNING  [task.py:775] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-24:15:44:07,664 WARNING  [task.py:763] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
2024-10-24:15:44:07,664 WARNING  [task.py:775] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
Repo card metadata block was not found. Setting CardData to empty.
2024-10-24:15:44:09,261 WARNING  [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2024-10-24:15:44:09,342 INFO     [task.py:395] Building contexts for wikitext on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 463.50it/s]
2024-10-24:15:44:09,482 INFO     [evaluator.py:362] Running loglikelihood_rolling requests
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [01:00<00:00,  1.03it/s]
Time to run eval: 70.16s.
Time in model.forward: 53.46s, over 162 model evaluations
forward run time stats - Median: 0.00s Min: 0.00s Max: 33.02s
For model /home/puri/.torchchat/model-cache/meta-llama/Meta-Llama-3.2-1B-Instruct/model.pth
wikitext:
 word_perplexity,none: 19.2432
 byte_perplexity,none: 1.7385
 bits_per_byte,none: 0.7978
 alias: wikitext
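
The two runs above can be compared directly. The sketch below copies the reported numbers from both logs, prints the relative change per metric, and checks that each run's bits_per_byte is consistent with log2(byte_perplexity). The last step subtracts the slowest forward call from the total forward time; treating that call as a one-time warm-up (for example compilation) is an assumption, not something the logs state.

```python
import math

# Values copied from the two wikitext eval logs above.
pr_run = {"word_perplexity": 19.2032, "byte_perplexity": 1.7378, "bits_per_byte": 0.7973}
master_run = {"word_perplexity": 19.2432, "byte_perplexity": 1.7385, "bits_per_byte": 0.7978}

def relative_change(new, old):
    """Fractional change of `new` relative to `old`."""
    return (new - old) / old

for name in pr_run:
    delta = relative_change(pr_run[name], master_run[name])
    print(f"{name}: {master_run[name]:.4f} -> {pr_run[name]:.4f} ({delta:+.3%})")

# Consistency check: bits_per_byte should match log2(byte_perplexity).
for label, run in (("PR", pr_run), ("master", master_run)):
    derived = math.log2(run["byte_perplexity"])
    print(f"{label}: log2(byte_perplexity) = {derived:.4f}, reported {run['bits_per_byte']:.4f}")

# Assumption: the slowest forward call is a one-time warm-up cost.
forward_excluding_max = {
    "PR": 62.57 - 41.80,
    "master": 53.46 - 33.02,
}
for label, seconds in forward_excluding_max.items():
    print(f"{label}: forward time excluding slowest call = {seconds:.2f}s")
```

Lower perplexity is better for all three metrics, so on this run the TorchAO path is marginally better than master, and the difference is well under one percent.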

Lint

pip install -r install/requirements-lintrunner.txt 
lintrunner -a

pytorch-bot · 2024-10-24T22:51:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1328

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1a42fb6 with merge base e30aaa0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jerryzh168 · 2024-10-24T23:04:34Z

Thanks! Can you add a generate.py speed benchmark result for before and after as well?

Jack-Khuu · 2024-10-24T23:01:01Z

torchchat/utils/quantize.py

            # Use tensor subclass API for int4 weight only.
            if device == "cuda" and quantizer == "linear:int4":
                quantize_(model, int4_weight_only(q_kwargs["groupsize"]))
+            elif quantizer == "linear:int8":
+                print("quantizer is linear int8")


Suggested change (remove this debug print):

print("quantizer is linear int8")

Jack-Khuu · 2024-10-24T23:04:10Z

torchchat/utils/quantize.py

    "precision": PrecisionHandler,
    "executor": ExecutorHandler,
    "linear:int4": Int4WeightOnlyQuantizer,
+    "linear:int8": int8_weight_only,


Do we need this?

We can probably use None for now and remove this later.

I think we check for int8_weight_only and finish that check before it looks at the table.

@vmpuri can you check?

Jack-Khuu · 2024-10-25T00:06:50Z

Can you ack that the numerics look good for MPS and CPU as well?

mikekgfb · 2024-10-25T07:38:58Z

torchchat/utils/quantize.py

            # Use tensor subclass API for int4 weight only.
            if device == "cuda" and quantizer == "linear:int4":
                quantize_(model, int4_weight_only(q_kwargs["groupsize"]))
+            elif quantizer == "linear:int8":
+                print("quantizer is linear int8")
+                quantize_(model, int8_weight_only())


Why not integrate it into a QuantHandler class dispatched through the handler dict at a single call site, rather than build a chain of if statements?

Hi @mikekgfb, I think we will refactor this part in the future, after all quant APIs are moved to torchao.

torchao already has a class-based API that is used for other quantizers. Why do these differently and then refactor them later? Why not do them all in a consistent way now, and refactor them together later if needed?

Yeah, the quantizer API is deprecated in favor of quantize_, which is why we are gradually refactoring the quantizer APIs to use quantize_. We do it one by one because there may be missing support or numerics alignment that we need to handle during the migration.
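
To make the two designs in this thread concrete, here is a minimal sketch of the single-call-site dispatch that the review comment asks for, as opposed to a chain of if/elif branches. All names (`HANDLERS`, `apply_int4`, `apply_int8`, `quantize_model`) are hypothetical and the "model" is a plain dict; this is not torchchat's or torchao's actual code.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for real quantization routines; they only tag the model.
def apply_int4(model: dict, groupsize: int = 128) -> dict:
    return {**model, "quant": f"int4(groupsize={groupsize})"}

def apply_int8(model: dict, groupsize: int = 0) -> dict:
    return {**model, "quant": f"int8(groupsize={groupsize})"}

# One table maps quantizer names to callables, so adding a quantizer is a
# one-line change and there is a single call site.
HANDLERS: Dict[str, Callable[..., dict]] = {
    "linear:int4": apply_int4,
    "linear:int8": apply_int8,
}

def quantize_model(model: dict, quantize_options: Dict[str, dict]) -> dict:
    """Apply each requested quantizer by looking it up in HANDLERS."""
    for name, kwargs in quantize_options.items():
        try:
            handler = HANDLERS[name]
        except KeyError:
            raise ValueError(f"unknown quantizer: {name}") from None
        model = handler(model, **kwargs)
    return model

result = quantize_model({"name": "demo"}, {"linear:int8": {"groupsize": 0}})
print(result)
```

With a table like this, every entry is actually consulted, which avoids the situation raised earlier in the review where a dict entry exists but an earlier branch handles the quantizer first.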

Jack-Khuu · 2024-11-12T22:32:35Z

torchchat/utils/quantize.py

-        return linear_int8_aoti(input, self.weight, self.scales)
-
-    def et_forward(self, input: torch.Tensor) -> torch.Tensor:
-        return linear_int8_et(input, self.weight, self.scales)


Int8 seems to be special-cased for ET; reminder to check that as well.

vmpuri requested review from jerryzh168, larryliu0820, Jack-Khuu and HDCharles October 24, 2024 22:51

facebook-github-bot added the CLA Signed label Oct 24, 2024

Replace WeightOnlyInt8Linear with TorchAO int8_weight_only quantization

92e0a9d

vmpuri force-pushed the torchao_int8_weight_only branch from d43d52e to 92e0a9d Compare October 24, 2024 22:52

vmpuri marked this pull request as ready for review October 24, 2024 22:57

Jack-Khuu reviewed Oct 24, 2024

Jack-Khuu approved these changes Oct 25, 2024

mikekgfb reviewed Oct 25, 2024

Merge branch 'main' into torchao_int8_weight_only

1a42fb6

Jack-Khuu reviewed Nov 12, 2024

Jack-Khuu mentioned this pull request Dec 18, 2024

INT8 has a poor performance with groupsize > 0 in Torchchat, compared with BF16 and INT8 groupsize == 0 #1427

Closed

Jack-Khuu added the Quantization label Dec 18, 2024

Jack-Khuu Oct 24, 2024

Jack-Khuu Oct 24, 2024

jerryzh168 Oct 25, 2024

Jack-Khuu Oct 25, 2024

Jack-Khuu commented Oct 25, 2024

mikekgfb Oct 25, 2024

jerryzh168 Oct 28, 2024

mikekgfb Oct 30, 2024

jerryzh168 Oct 31, 2024

Jack-Khuu Nov 12, 2024
