[not4land] repro dynamo performance accuracy problem #2519

Open
wants to merge 2 commits into main

Conversation

jerryzh168
Contributor

@jerryzh168 jerryzh168 commented Oct 18, 2024

Summary:

Test Plan:
Run the following shell script:

repro_arr=("resnet50")
for m in "${repro_arr[@]}"
do
    for i in {1..5}
    do
        python run_benchmark.py torchao --only $m --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune --output /home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
    done

    for i in {1..5}
    do
        python run_benchmark.py torchao --only $m --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune --output /home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
    done
done

I got:
repro_baseline.csv

dev,name,batch_size,speedup,abs_latency
cuda,resnet50,32,1.118715,3.674717
cuda,resnet50,32,1.125512,3.610448
cuda,resnet50,32,1.152917,3.708096
cuda,resnet50,32,1.131386,3.643471
cuda,resnet50,32,1.142843,3.772992

repro_autoquant_v1.csv

dev,name,batch_size,speedup,abs_latency
cuda,resnet50,32,0.994377,4.146444
cuda,resnet50,32,0.973631,4.268407
cuda,resnet50,32,1.006459,4.146043
cuda,resnet50,32,1.005920,4.158831
cuda,resnet50,32,1.160011,3.591534
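
To compare the two runs at a glance, the per-file averages can be computed with a throwaway Python snippet like the one below; the relative paths are assumed to point at the two CSVs written by the script above, and the snippet is illustration only, not part of the benchmark harness.

# Throwaway helper to average the speedup / abs_latency columns of the two
# result CSVs (assumed relative paths; not part of the benchmark harness).
import csv
from statistics import mean

def summarize(path):
    with open(path) as f:
        # skip any repeated header rows that the runner may append
        rows = [r for r in csv.DictReader(f) if r["name"] == "resnet50"]
    return (mean(float(r["speedup"]) for r in rows),
            mean(float(r["abs_latency"]) for r in rows))

for label, path in [("noquant", "repro_baseline.csv"),
                    ("autoquant", "repro_autoquant_v1.csv")]:
    speedup, latency = summarize(path)
    print(f"{label}: mean speedup {speedup:.3f}x, mean abs_latency {latency:.3f} ms")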

logs:

benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.31758056640625  milliseconds
1.119x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.4913150024414064  milliseconds
1.126x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.4281463623046875  milliseconds
1.153x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.334197692871094  milliseconds
1.131x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'noquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv'] ===================
cuda eval  resnet50                           
noquant run
elapsed_time:  3.3522384643554686  milliseconds
1.143x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_baseline.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.020ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.020ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.020ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.020ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.44037841796875  milliseconds
0.994x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.021ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.021ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.021ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.021ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.3911477661132814  milliseconds
0.974x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.020ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.020ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.020ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.020ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.2849554443359374  milliseconds
1.006x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.021ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.021ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.021ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.021ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.3616156005859374  milliseconds
1.006x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
benchmark args: [['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv']]
=================== [TORCHAO] Running PT2 Benchmark Runner with Args: ['--only', 'resnet50', '--quantization', 'autoquant', '--performance', '--inference', '--bfloat16', '--inductor-compile-mode', 'max-autotune', '--output', '/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv'] ===================
cuda eval  resnet50                           
activation_shapes: torch.Size([32, 2048]), times_seen: 2
weight_shape: torch.Size([1000, 2048]), dtype: torch.bfloat16, bias_shape: torch.Size([1000])
>>time: 0.020ms for <class 'torchao.quantization.autoquant.AQFloatLinearWeight'>, to_beat: infms 
>>time: 0.026ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight'>, to_beat: 0.020ms 
>>time: 0.045ms for <class 'torchao.quantization.autoquant.AQInt8WeightOnlyQuantizedLinearWeight2'>, to_beat: 0.020ms 
>>time: 0.024ms for <class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'> matmul, to_beat: 0.020ms
best_cls=<class 'torchao.quantization.autoquant.AQFloatLinearWeight'>
autoquant run
elapsed_time:  3.493924865722656  milliseconds
1.160x
/home/jerryzh/local/benchmark/.userbenchmark/torchao/repro_autoquant_v1.csv
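
For reference, the autoquant lines above amount to picking the candidate weight subclass with the smallest measured time for the final linear layer; since AQFloatLinearWeight (which appears to be the plain bfloat16 path) wins every time, autoquant effectively leaves that layer unquantized. A minimal sketch of that selection, using the timings from the first run (hypothetical dict, not torchao's actual implementation):

# Sketch of the selection the autoquant log reflects, using the measured
# times from the first run above (illustration only, not torchao internals).
candidate_times_ms = {
    "AQFloatLinearWeight": 0.020,
    "AQInt8WeightOnlyQuantizedLinearWeight": 0.026,
    "AQInt8WeightOnlyQuantizedLinearWeight2": 0.045,
    "AQInt8DynamicallyQuantizedLinearWeight": 0.024,
}
# The fastest candidate wins; here the float path is best, so this linear
# layer stays in bfloat16.
best_cls = min(candidate_times_ms, key=candidate_times_ms.get)
print(best_cls)  # AQFloatLinearWeight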

Note: the log above also contains the benchmark results from torchao.utils.benchmark_model:

noquant resnet50: 3.31758056640625 3.4913150024414064 3.4281463623046875 3.334197692871094 3.3522384643554686
autoquant resnet50: 3.44037841796875 3.3911477661132814 3.2849554443359374 3.3616156005859374 3.493924865722656

The autoquant times are broadly similar to the noquant times, rather than consistently slower as the speedup numbers in the .csv files suggest.
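
As a quick sanity check, averaging the per-run times listed above (a throwaway snippet, not part of the repro):

# Means of the per-run torchao.utils.benchmark_model times quoted above.
from statistics import mean

noquant = [3.31758056640625, 3.4913150024414064, 3.4281463623046875,
           3.334197692871094, 3.3522384643554686]
autoquant = [3.44037841796875, 3.3911477661132814, 3.2849554443359374,
             3.3616156005859374, 3.493924865722656]

print(f"noquant mean:   {mean(noquant):.3f} ms")   # ~3.385 ms
print(f"autoquant mean: {mean(autoquant):.3f} ms") # ~3.394 ms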

Reviewers:

Subscribers:

Tasks:

Tags:

@jerryzh168 jerryzh168 changed the title from "[not4land] repro dynamo error" to "[not4land] repro dynamo performance accuracy problem" on Oct 28, 2024