
Test if any features in the component model can be removed without impacting performance #3677

Closed
suhaibmujahid opened this issue Sep 30, 2023 · 14 comments · Fixed by #3758
Labels: good-first-bug (Good for newcomers)

@suhaibmujahid
Member

feature_extractors = [
bug_features.has_str(),
bug_features.severity(),
bug_features.keywords(),
bug_features.is_coverity_issue(),
bug_features.has_crash_signature(),
bug_features.has_url(),
bug_features.has_w3c_url(),
bug_features.has_github_url(),
bug_features.whiteboard(),
bug_features.patches(),
bug_features.landings(),
]
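The leave-one-out ablation the issue asks for can be sketched as follows. This is a hypothetical harness, not bugbug's actual trainer code: `train_and_eval` is a stand-in for retraining the component model with a given feature subset (e.g. via `scripts.trainer`) and returning a single score such as macro F1.

```python
# Hypothetical leave-one-out ablation harness for the component model.
FEATURES = [
    "has_str", "severity", "keywords", "is_coverity_issue",
    "has_crash_signature", "has_url", "has_w3c_url",
    "has_github_url", "whiteboard", "patches", "landings",
]

def ablation_deltas(features, train_and_eval):
    """Return, per feature, the score change caused by removing it.

    A delta close to zero (or positive) suggests the feature can be
    dropped without hurting performance.
    """
    baseline = train_and_eval(features)
    deltas = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        deltas[feature] = train_and_eval(reduced) - baseline
    return deltas
```

In practice each `train_and_eval` call is a full multi-hour training run, so the loop is usually driven by a shell script that edits the feature list and launches the trainer once per feature.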

@suhaibmujahid suhaibmujahid added the good-first-bug Good for newcomers label Sep 30, 2023
@Inyrkz

Inyrkz commented Oct 6, 2023

Hi @suhaibmujahid,

I followed the instructions in the README.md and successfully cloned the repository onto my system. To evaluate the current performance of the model, I attempted to train the component model using the following command:

python3 -m scripts.trainer component

[Screenshot: component model training in progress]

While the repository suggests that the model should take around 30 minutes to train, in my case, it's been running for over 6 hours. Unfortunately, my laptop's battery cannot sustain such a long training process. I'd like to inquire if there are any specific system hardware requirements I should be aware of. Additionally, do you have any recommendations for speeding up the model training process?

Thank you for your assistance.

@suhaibmujahid
Member Author

Welcome @Inyrkz -- Thank you for your interest in the project!

While the repository suggests that the model should take around 30 minutes to train, in my case, it's been running for over 6 hours.

The duration required to train a model varies based on the model, the data size, and the hardware used. The README warns that training will take more than 30 minutes ("warning this takes 30min+").

For testing purposes, you could limit the data size to speed up the process using the --limit flag. However, this will not be helpful in the context of this issue because we need the full data to have realistic performance values.

Currently, we have an issue on file to enable training on our infrastructure instead of locally (#3688), but I do not know when that will be ready.

@Inyrkz

Inyrkz commented Oct 7, 2023

Thank you for the clarification regarding the training issue. Yeah, using the --limit flag won't be helpful in this context.

I'll continue to monitor the training process and will be patient as the team works towards a solution. If I have any further questions or encounter any issues, I'll be sure to reach out.

@gothwalritu
Contributor

Hi @suhaibmujahid ,
I started looking into this bug: Test if any features in the component model can be removed without impacting performance #3677

It took me a while to set it up. Finally, when I ran the trainer for the component model, I got this error:

First Run:
File "C:\ritu\bugbug\bugbug\utils.py", line 299, in extract_file
zstd_decompress(inner_path)
File "C:\ritu\bugbug\bugbug\utils.py", line 255, in zstd_decompress
subprocess.run(["zstdmt", "-df", f"{path}.zst"], check=True)
File "C:\Program Files\Python311\Lib\subprocess.py", line 548, in run
with Popen(*popenargs, **kwargs) as process:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Program Files\Python311\Lib\subprocess.py", line 1538, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified

Second Time
(venv) PS C:\ritu\bugbug> python -m scripts.trainer component
2023-10-21 13:19:09,957:INFO:bugbug.db:Downloading https://community-tc.services.mozilla.com/api/index/v1/task/project.bugbug.data_bugs.latest/artifacts/public/bugs.json.zst to data/bugs.json.zst
2023-10-21 13:19:10,926:INFO:main:Training component model
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\ritu\bugbug\scripts\trainer.py", line 161, in <module>
main()
File "C:\ritu\bugbug\scripts\trainer.py", line 157, in main
retriever.go(args)
File "C:\ritu\bugbug\scripts\trainer.py", line 51, in go
metrics = model_obj.train(limit=args.limit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ritu\bugbug\bugbug\model.py", line 343, in train
classes, self.class_names = self.get_labels()
^^^^^^^^^^^^^^^^^
File "C:\ritu\bugbug\bugbug\models\component.py", line 152, in get_labels
self.meaningful_product_components = self.get_meaningful_product_components(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ritu\bugbug\bugbug\models\component.py", line 220, in get_meaningful_product_components
max_count = product_component_counts[0][1]

IndexError: list index out of range

I hope this is the right forum to ask. Any pointers please?

@suhaibmujahid
Member Author

@gothwalritu you need to have zstd installed. Once you are done with that, delete the data directory and try again.

@gothwalritu
Contributor

gothwalritu commented Oct 21, 2023

@gothwalritu you need to have zstd installed. Once you are done with that, delete the data directory and try again.

Thanks Suhaib, I followed the steps, but now I am getting this error:

(venv) PS C:\ritu\bugbug> python -m scripts.trainer component
2023-10-21 14:40:45,937:INFO:bugbug.db:Downloading https://community-tc.services.mozilla.com/api/index/v1/task/project.bugbug.data_bugs.latest/artifacts/public/bugs.json.zst to data/bugs.json.zst
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\ritu\bugbug\scripts\trainer.py", line 161, in <module>
main()
File "C:\ritu\bugbug\scripts\trainer.py", line 157, in main
retriever.go(args)
File "C:\ritu\bugbug\scripts\trainer.py", line 43, in go
assert db.download(required_db)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ritu\bugbug\bugbug\db.py", line 99, in download
utils.extract_file(zst_path)
File "C:\ritu\bugbug\bugbug\utils.py", line 299, in extract_file
zstd_decompress(inner_path)
File "C:\ritu\bugbug\bugbug\utils.py", line 255, in zstd_decompress
subprocess.run(["zstdmt", "-df", f"{path}.zst"], check=True)
File "C:\Program Files\Python311\Lib\subprocess.py", line 548, in run
with Popen(*popenargs, **kwargs) as process:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Program Files\Python311\Lib\subprocess.py", line 1538, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified
(venv) PS C:\ritu\bugbug> python --version
Python 3.11.6

I ran it a second time and got the same index out of range error:
File "C:\ritu\bugbug\bugbug\models\component.py", line 220, in get_meaningful_product_components
max_count = product_component_counts[0][1]
~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

PS C:\ritu\bugbug> zstd --version
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***

I installed popen as well.

@gothwalritu
Contributor

I ran the zstd manually and it is working now :).

PS C:\ritu\bugbug> zstd -df data\bugs.json.zst

data\bugs.json.zst : 2307799708 bytes

@gothwalritu
Contributor

@suhaibmujahid: The training for the component model ran for a couple of hours and then threw this error. This looks like a code issue; could you please advise?

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ritu\bugbug\scripts\trainer.py", line 161, in <module>
    main()
  File "C:\ritu\bugbug\scripts\trainer.py", line 157, in main
    retriever.go(args)
  File "C:\ritu\bugbug\scripts\trainer.py", line 56, in go
    json.dump(metrics, metric_file, cls=CustomJsonEncoder)
  File "C:\Program Files\Python311\Lib\json\__init__.py", line 179, in dump
    for chunk in iterable:
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 406, in _iterencode_dict
    yield from chunks
  [Previous line repeated 1 more time]
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "C:\ritu\bugbug\bugbug\utils.py", line 313, in default
    return super().default(obj)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable
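The `TypeError` above comes from numpy scalar types (such as `numpy.int64`), which Python's `json` module cannot serialize. One common workaround, shown here as a sketch and not necessarily how bugbug's `CustomJsonEncoder` is implemented, is an encoder subclass that converts numpy scalars to native Python types:

```python
import json

import numpy as np

class NumpySafeEncoder(json.JSONEncoder):
    """Convert numpy scalar/array types to plain Python before serializing."""

    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        # Fall through to the base class, which raises TypeError.
        return super().default(obj)
```

For example, `json.dumps({"count": np.int64(7)}, cls=NumpySafeEncoder)` succeeds where the plain encoder would raise.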

@gothwalritu
Contributor

gothwalritu commented Oct 22, 2023

Hi, it's me again and I again got the zstd issue:

2023-10-21 20:49:28,261:INFO:__main__:Training done
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ritu\bugbug\scripts\trainer.py", line 168, in <module>
    main()
  File "C:\ritu\bugbug\scripts\trainer.py", line 164, in main
    retriever.go(args)
  File "C:\ritu\bugbug\scripts\trainer.py", line 69, in go
    zstd_compress(model_file_name)
  File "C:\ritu\bugbug\bugbug\utils.py", line 248, in zstd_compress
    subprocess.run(["zstdmt", "-f", path], check=True)
  File "C:\Program Files\Python311\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files\Python311\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified

I noticed that the zstd executable is named zstd.exe, not zstdmt. So I started the training again after changing the code in my repo to this:

def zstd_compress(path: str) -> None:
    if not os.path.exists(path):
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)

    # subprocess.run(["zstdmt", "-f", path], check=True)
    subprocess.run(["zstd", "-f", path], check=True)  # gothwalritu: changed zstdmt to zstd

hopefully it works this time.
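A more portable alternative to hard-coding either name is to probe PATH for whichever binary exists (on Windows the official Zstandard release ships only zstd.exe, while Linux packages usually include the multithreaded `zstdmt` front-end). This is a hypothetical sketch, not bugbug's actual code:

```python
import shutil

def pick_zstd_binary() -> str:
    # Prefer the multithreaded front-end, then fall back to plain `zstd`.
    for name in ("zstdmt", "zstd"):
        if shutil.which(name):
            return name
    raise FileNotFoundError("no zstd executable found on PATH")
```

The chosen name can then be passed to `subprocess.run([...])` in place of the literal `"zstdmt"`.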

@gothwalritu
Contributor

Hi @suhaibmujahid, I am working on this task, and so far I have understood that it requires an ablation study. My question is about the metrics.json file, which is an output of the program: is there an online tool to upload it and compare the performance, or is the expectation that we need to write a script to parse the metrics.json file for each feature (removal) and then compute the performance?

@suhaibmujahid
Member Author

is there an online tool to upload it and compare the performance, or is the expectation that we need to write a script to parse the metrics.json file for each feature (removal) and then compute the performance?

@gothwalritu Currently there is not. Feel free to use any tools you find useful.
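A small script is usually enough for this kind of comparison. The sketch below assumes a hypothetical layout, not bugbug's actual metrics schema: one JSON file per ablation run, each containing a flat mapping of metric names to numbers.

```python
import json
from pathlib import Path

def rank_runs(metrics_dir, metric="f1"):
    """Rank ablation runs by a single metric, best first.

    Assumes each run wrote <run_name>.json holding a flat
    {metric_name: value} mapping (hypothetical layout).
    """
    runs = {
        path.stem: json.loads(path.read_text())
        for path in Path(metrics_dir).glob("*.json")
    }
    return sorted(runs.items(), key=lambda kv: kv[1].get(metric, 0.0), reverse=True)
```

Running it over a directory of per-feature metrics files immediately shows which feature removal hurt (or helped) the chosen metric the most.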

@gothwalritu
Contributor

gothwalritu commented Oct 25, 2023

@suhaibmujahid : The ablation study runs for all the features finished today. After analyzing the metrics.json for each run, I found that the model already outputs the averages of the metrics for each target component. Now I am working to design the comparison methodology and then write a report which will document the process, results and conclusions.

I have a query: Although no coding was required to run these models, should I still submit the report via GitHub, or is submitting here more appropriate?

Also please correct me if I am not on the right path.

@suhaibmujahid
Member Author

@gothwalritu you could submit a PR to apply the findings of your experiments (e.g., dropping a specific feature). A complete formal report is not required; a comparison between before and after should be sufficient.

gothwalritu added a commit to gothwalritu/bugbug that referenced this issue Oct 25, 2023
Test if any features in the component model can be removed without impacting performance (mozilla#3677): based on /docs/models/component_ablation_study.md, the is_coverity_issue feature can be removed from the component model.
@gothwalritu
Contributor

@suhaibmujahid: Thanks, Suhaib, I have created the pull request and also stored my findings in docs/models/component_feature_ablation.md.
Shall I try working on some other bug now?

gothwalritu added a commit to gothwalritu/bugbug that referenced this issue Oct 26, 2023
mozilla#3677

**Conclusion**

Based on the evaluation metrics, removing the _is_coverity_issue_ feature exhibits superior performance in terms of Precision, Recall, F1 Score, Geometric Mean, and IBA. Although it has a slightly lower specificity compared to other runs, its higher values on the other key metrics signify a better balance and predictive accuracy. On the other hand, removing the _severity_ feature registers the lowest performance across most metrics.