[not4land] repro dynamo performance accuracy problem #2519

Open
wants to merge 2 commits into main

Conversation

jerryzh168
Contributor

@jerryzh168 jerryzh168 commented Oct 18, 2024

Summary:

Test Plan:
Run the following shell script:

repro_arr=("resnet50")
for m in "${repro_arr[@]}"
do
    for i in {1..5}
    do
        python run_benchmark.py torchao --only $m --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune --output /home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
    done

    for i in {1..5}
    do
        python run_benchmark.py torchao --only $m --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune --output /home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
    done
done

I got:
repro_baseline.csv

dev,name,batch_size,speedup,abs_latency
cuda,resnet50,32,1.118715,3.674717
cuda,resnet50,32,1.125512,3.610448
cuda,resnet50,32,1.152917,3.708096
cuda,resnet50,32,1.131386,3.643471
cuda,resnet50,32,1.142843,3.772992

repro_autoquant_v1.csv

dev,name,batch_size,speedup,abs_latency
cuda,resnet50,32,0.994377,4.146444
cuda,resnet50,32,0.973631,4.268407
cuda,resnet50,32,1.006459,4.146043
cuda,resnet50,32,1.005920,4.158831
cuda,resnet50,32,1.160011,3.591534
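
To compare the two runs at a glance, the per-file averages can be computed with a throwaway Python snippet like the one below; the relative paths are assumed to point at the two CSVs written by the script above, and the snippet is illustration only, not part of the benchmark harness.

# Throwaway helper to average the speedup / abs_latency columns of the two
# result CSVs (assumed relative paths; not part of the benchmark harness).
import csv
from statistics import mean

def summarize(path):
    with open(path) as f:
        # skip any repeated header rows that the runner may append
        rows = [r for r in csv.DictReader(f) if r["name"] == "resnet50"]
    return (mean(float(r["speedup"]) for r in rows),
            mean(float(r["abs_latency"]) for r in rows))

for label, path in [("noquant", "repro_baseline.csv"),
                    ("autoquant", "repro_autoquant_v1.csv")]:
    speedup, latency = summarize(path)
    print(f"{label}: mean speedup {speedup:.3f}x, mean abs_latency {latency:.3f} ms")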

logs:

benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.31758056640625  milliseconds
1.119x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.4913150024414064  milliseconds
1.126x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.4281463623046875  milliseconds
1.153x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.334197692871094  milliseconds
1.131x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.3522384643554686  milliseconds
1.143x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.020ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.020ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.020ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.020ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.44037841796875  milliseconds
0.994x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.021ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.021ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.021ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.021ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.3911477661132814  milliseconds
0.974x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.020ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.020ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.020ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.020ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.2849554443359374  milliseconds
1.006x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.021ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.021ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.021ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.021ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.3616156005859374  milliseconds
1.006x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.020ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.020ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.020ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.020ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.493924865722656  milliseconds
1.160x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
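
For reference, the autoquant lines above amount to picking the candidate weight subclass with the smallest measured time for the final linear layer; since AQFloatLinearWeight (which appears to be the plain bfloat16 path) wins every time, autoquant effectively leaves that layer unquantized. A minimal sketch of that selection, using the timings from the first run (hypothetical dict, not torchao's actual implementation):

# Sketch of the selection the autoquant log reflects, using the measured
# times from the first run above (illustration only, not torchao internals).
candidate_times_ms = {
    "AQFloatLinearWeight": 0.020,
    "AQInt8WeightOnlyQuantizedLinearWeight": 0.026,
    "AQInt8WeightOnlyQuantizedLinearWeight2": 0.045,
    "AQInt8DynamicallyQuantizedLinearWeight": 0.024,
}
# The fastest candidate wins; here the float path is best, so this linear
# layer stays in bfloat16.
best_cls = min(candidate_times_ms, key=candidate_times_ms.get)
print(best_cls)  # AQFloatLinearWeight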

Note: the log above also contains the benchmark results from torchao.utils.benchmark_model:

noquant resnet50: 3.31758056640625 3.4913150024414064 3.4281463623046875 3.334197692871094 3.3522384643554686
autoquant resnet50: 3.44037841796875 3.3911477661132814 3.2849554443359374 3.3616156005859374 3.493924865722656

The autoquant times are broadly similar to the noquant times, rather than consistently slower as the speedup numbers in the .csv files suggest.
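
As a quick sanity check, averaging the per-run times listed above (a throwaway snippet, not part of the repro):

# Means of the per-run torchao.utils.benchmark_model times quoted above.
from statistics import mean

noquant = [3.31758056640625, 3.4913150024414064, 3.4281463623046875,
           3.334197692871094, 3.3522384643554686]
autoquant = [3.44037841796875, 3.3911477661132814, 3.2849554443359374,
             3.3616156005859374, 3.493924865722656]

print(f"noquant mean:   {mean(noquant):.3f} ms")   # ~3.385 ms
print(f"autoquant mean: {mean(autoquant):.3f} ms") # ~3.394 ms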

Reviewers:

Subscribers:

Tasks:

Tags:

@jerryzh168 jerryzh168 changed the title from "[not4land] repro dynamo error" to "[not4land] repro dynamo performance accuracy problem" on Oct 28, 2024