
Test if any features in the component model can be removed without impacting performance #3677

Closed
suhaibmujahid opened this issue Sep 30, 2023 · 14 comments · Fixed by #3758
Labels: good-first-bug (Good for newcomers)

@suhaibmujahid
Member

feature_extractors = [
bug_features.has_str(),
bug_features.severity(),
bug_features.keywords(),
bug_features.is_coverity_issue(),
bug_features.has_crash_signature(),
bug_features.has_url(),
bug_features.has_w3c_url(),
bug_features.has_github_url(),
bug_features.whiteboard(),
bug_features.patches(),
bug_features.landings(),
]
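The leave-one-out ablation the issue asks for can be sketched as follows. This is a hypothetical harness, not bugbug's actual trainer code: `train_and_eval` is a stand-in for retraining the component model with a given feature subset (e.g. via `scripts.trainer`) and returning a single score such as macro F1.

```python
# Hypothetical leave-one-out ablation harness for the component model.
FEATURES = [
    "has_str", "severity", "keywords", "is_coverity_issue",
    "has_crash_signature", "has_url", "has_w3c_url",
    "has_github_url", "whiteboard", "patches", "landings",
]

def ablation_deltas(features, train_and_eval):
    """Return, per feature, the score change caused by removing it.

    A delta close to zero (or positive) suggests the feature can be
    dropped without hurting performance.
    """
    baseline = train_and_eval(features)
    deltas = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        deltas[feature] = train_and_eval(reduced) - baseline
    return deltas
```

In practice each `train_and_eval` call is a full multi-hour training run, so the loop is usually driven by a shell script that edits the feature list and launches the trainer once per feature.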

@suhaibmujahid suhaibmujahid added the good-first-bug Good for newcomers label Sep 30, 2023
@Inyrkz

Inyrkz commented Oct 6, 2023

Hi @suhaibmujahid,

I followed the instructions in the README.md and successfully cloned the repository onto my system. To evaluate the current performance of the model, I attempted to train the component model using the following command:

python3 -m scripts.trainer component

[Screenshot: component model training in progress]

While the repository suggests that the model should take around 30 minutes to train, in my case, it's been running for over 6 hours. Unfortunately, my laptop's battery cannot sustain such a long training process. I'd like to inquire if there are any specific system hardware requirements I should be aware of. Additionally, do you have any recommendations for speeding up the model training process?

Thank you for your assistance.

@suhaibmujahid
Member Author

Welcome @Inyrkz -- Thank you for your interest in the project!

While the repository suggests that the model should take around 30 minutes to train, in my case, it's been running for over 6 hours.

The duration required to train a model varies based on the model, the data size, and the hardware used. The README warns that training will take more than 30 minutes ("warning this takes 30min+").

For testing purposes, you could limit the data size to speed up the process using the --limit flag. However, this will not be helpful in the context of this issue because we need the full data to have realistic performance values.

Currently, we have an issue on file to enable training on our infrastructure instead of locally (#3688), but I do not know when that will be ready.

@Inyrkz

Inyrkz commented Oct 7, 2023

Thank you for the clarification regarding the training issue. Yeah, using the --limit flag won't be helpful in this context.

I'll continue to monitor the training process and will be patient as the team works towards a solution. If I have any further questions or encounter any issues, I'll be sure to reach out.

@gothwalritu
Contributor

Hi @suhaibmujahid ,
I started looking into this bug: Test if any features in the component model can be removed without impacting performance #3677

It took me a while to set it up. Finally, when I ran the trainer for the component model, I got this error:

First Run:
File "C:\ritu\bugbug\bugbug\utils.py", line 299, in extract_file
zstd_decompress(inner_path)
File "C:\ritu\bugbug\bugbug\utils.py", line 255, in zstd_decompress
subprocess.run(["zstdmt", "-df", f"{path}.zst"], check=True)
File "C:\Program Files\Python311\Lib\subprocess.py", line 548, in run
with Popen(*popenargs, **kwargs) as process:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Program Files\Python311\Lib\subprocess.py", line 1538, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified

Second Time
(venv) PS C:\ritu\bugbug> python -m scripts.trainer component
2023-10-21 13:19:09,957:INFO:bugbug.db:Downloading https://community-tc.services.mozilla.com/api/index/v1/task/project.bugbug.data_bugs.latest/artifacts/public/bugs.json.zst to data/bugs.json.zst
2023-10-21 13:19:10,926:INFO:main:Training component model
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\ritu\bugbug\scripts\trainer.py", line 161, in <module>
main()
File "C:\ritu\bugbug\scripts\trainer.py", line 157, in main
retriever.go(args)
File "C:\ritu\bugbug\scripts\trainer.py", line 51, in go
metrics = model_obj.train(limit=args.limit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ritu\bugbug\bugbug\model.py", line 343, in train
classes, self.class_names = self.get_labels()
^^^^^^^^^^^^^^^^^
File "C:\ritu\bugbug\bugbug\models\component.py", line 152, in get_labels
self.meaningful_product_components = self.get_meaningful_product_components(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ritu\bugbug\bugbug\models\component.py", line 220, in get_meaningful_product_components
max_count = product_component_counts[0][1]

IndexError: list index out of range

I hope this is the right forum to ask. Any pointers please?

@suhaibmujahid
Member Author

@gothwalritu you need to have zstd installed. Once you are done with that, delete the data directory and try again.

@gothwalritu
Contributor

gothwalritu commented Oct 21, 2023

@gothwalritu you need to have zstd installed. Once you are done with that, delete the data directory and try again.

Thanks Suhaib, I followed the steps, but now I am getting this error:

(venv) PS C:\ritu\bugbug> python -m scripts.trainer component
2023-10-21 14:40:45,937:INFO:bugbug.db:Downloading https://community-tc.services.mozilla.com/api/index/v1/task/project.bugbug.data_bugs.latest/artifacts/public/bugs.json.zst to data/bugs.json.zst
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\ritu\bugbug\scripts\trainer.py", line 161, in <module>
main()
File "C:\ritu\bugbug\scripts\trainer.py", line 157, in main
retriever.go(args)
File "C:\ritu\bugbug\scripts\trainer.py", line 43, in go
assert db.download(required_db)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ritu\bugbug\bugbug\db.py", line 99, in download
utils.extract_file(zst_path)
File "C:\ritu\bugbug\bugbug\utils.py", line 299, in extract_file
zstd_decompress(inner_path)
File "C:\ritu\bugbug\bugbug\utils.py", line 255, in zstd_decompress
subprocess.run(["zstdmt", "-df", f"{path}.zst"], check=True)
File "C:\Program Files\Python311\Lib\subprocess.py", line 548, in run
with Popen(*popenargs, **kwargs) as process:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Program Files\Python311\Lib\subprocess.py", line 1538, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified
(venv) PS C:\ritu\bugbug> python --version
Python 3.11.6

I ran it a second time and got the same index out of range error:
File "C:\ritu\bugbug\bugbug\models\component.py", line 220, in get_meaningful_product_components
max_count = product_component_counts[0][1]
~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

PS C:\ritu\bugbug> zstd --version
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***

I installed popen as well.

@gothwalritu
Contributor

I ran the zstd manually and it is working now :).

PS C:\ritu\bugbug> zstd -df data\bugs.json.zst

data\bugs.json.zst : 2307799708 bytes

@gothwalritu
Contributor

@suhaibmujahid: The training for the component model ran for a couple of hours and then threw this error. This looks like a code issue; could you please advise?

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ritu\bugbug\scripts\trainer.py", line 161, in <module>
    main()
  File "C:\ritu\bugbug\scripts\trainer.py", line 157, in main
    retriever.go(args)
  File "C:\ritu\bugbug\scripts\trainer.py", line 56, in go
    json.dump(metrics, metric_file, cls=CustomJsonEncoder)
  File "C:\Program Files\Python311\Lib\json\__init__.py", line 179, in dump
    for chunk in iterable:
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 406, in _iterencode_dict
    yield from chunks
  [Previous line repeated 1 more time]
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "C:\ritu\bugbug\bugbug\utils.py", line 313, in default
    return super().default(obj)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\json\encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable
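The `TypeError` above comes from numpy scalar types (such as `numpy.int64`), which Python's `json` module cannot serialize. One common workaround, shown here as a sketch and not necessarily how bugbug's `CustomJsonEncoder` is implemented, is an encoder subclass that converts numpy scalars to native Python types:

```python
import json

import numpy as np

class NumpySafeEncoder(json.JSONEncoder):
    """Convert numpy scalar/array types to plain Python before serializing."""

    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        # Fall through to the base class, which raises TypeError.
        return super().default(obj)
```

For example, `json.dumps({"count": np.int64(7)}, cls=NumpySafeEncoder)` succeeds where the plain encoder would raise.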

@gothwalritu
Contributor

gothwalritu commented Oct 22, 2023

Hi, it's me again and I again got the zstd issue:

2023-10-21 20:49:28,261:INFO:__main__:Training done
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ritu\bugbug\scripts\trainer.py", line 168, in <module>
    main()
  File "C:\ritu\bugbug\scripts\trainer.py", line 164, in main
    retriever.go(args)
  File "C:\ritu\bugbug\scripts\trainer.py", line 69, in go
    zstd_compress(model_file_name)
  File "C:\ritu\bugbug\bugbug\utils.py", line 248, in zstd_compress
    subprocess.run(["zstdmt", "-f", path], check=True)
  File "C:\Program Files\Python311\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files\Python311\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified

I noticed that the zstd executable is named zstd.exe, not zstdmt. So I started the training again after changing the code in my repo to this:

def zstd_compress(path: str) -> None:
    if not os.path.exists(path):
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)

    # subprocess.run(["zstdmt", "-f", path], check=True)
    subprocess.run(["zstd", "-f", path], check=True)  # gothwalritu: changed zstdmt to zstd

hopefully it works this time.
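A more portable alternative to hard-coding either name is to probe PATH for whichever binary exists (on Windows the official Zstandard release ships only zstd.exe, while Linux packages usually include the multithreaded `zstdmt` front-end). This is a hypothetical sketch, not bugbug's actual code:

```python
import shutil

def pick_zstd_binary() -> str:
    # Prefer the multithreaded front-end, then fall back to plain `zstd`.
    for name in ("zstdmt", "zstd"):
        if shutil.which(name):
            return name
    raise FileNotFoundError("no zstd executable found on PATH")
```

The chosen name can then be passed to `subprocess.run([...])` in place of the literal `"zstdmt"`.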

@gothwalritu
Contributor

Hi @suhaibmujahid, I am working on this task, and so far I have understood that it requires an ablation study. My question is about the metrics.json file, which is an output of the program: is there an online tool to upload it and compare the performance, or is the expectation that we need to write a script to parse the metrics.json file for each feature (removal) and then compute the performance?

@suhaibmujahid
Member Author

is there an online tool to upload it and compare the performance, or is the expectation that we need to write a script to parse the metrics.json file for each feature (removal) and then compute the performance?

@gothwalritu Currently there is not. Feel free to use any tools you find useful.
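A small script is usually enough for this kind of comparison. The sketch below assumes a hypothetical layout, not bugbug's actual metrics schema: one JSON file per ablation run, each containing a flat mapping of metric names to numbers.

```python
import json
from pathlib import Path

def rank_runs(metrics_dir, metric="f1"):
    """Rank ablation runs by a single metric, best first.

    Assumes each run wrote <run_name>.json holding a flat
    {metric_name: value} mapping (hypothetical layout).
    """
    runs = {
        path.stem: json.loads(path.read_text())
        for path in Path(metrics_dir).glob("*.json")
    }
    return sorted(runs.items(), key=lambda kv: kv[1].get(metric, 0.0), reverse=True)
```

Running it over a directory of per-feature metrics files immediately shows which feature removal hurt (or helped) the chosen metric the most.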

@gothwalritu
Contributor

gothwalritu commented Oct 25, 2023

@suhaibmujahid : The ablation study runs for all the features finished today. After analyzing the metrics.json for each run, I found that the model already outputs the averages of the metrics for each target component. Now I am working to design the comparison methodology and then write a report which will document the process, results and conclusions.

I have a query: Although no coding was required to run these models, should I still submit the report via GitHub, or is submitting here more appropriate?

Also please correct me if I am not on the right path.

@suhaibmujahid
Member Author

@gothwalritu you could submit a PR to apply the findings of your experiments (e.g., dropping a specific feature). A complete formal report is not required; a comparison between before and after should be sufficient.

gothwalritu added a commit to gothwalritu/bugbug that referenced this issue Oct 25, 2023
Test if any features in the component model can be removed without impacting performance (mozilla#3677): based on /docs/models/component_ablation_study.md, the is_coverity_issue feature can be removed from the component model.
@gothwalritu
Contributor

@suhaibmujahid: Thanks, Suhaib, I have created the pull request and also stored my findings in docs/models/component_feature_ablation.md.
Shall I try working on some other bug now?

gothwalritu added a commit to gothwalritu/bugbug that referenced this issue Oct 26, 2023
mozilla#3677

**Conclusion**

Based on the evaluation metrics, removing the _is_coverity_issue_ feature exhibits superior performance in terms of Precision, Recall, F1 Score, Geometric Mean, and IBA. Although it has a slightly lower specificity compared to other runs, its higher values on the other key metrics signify a better balance and predictive accuracy. On the other hand, removing the _severity_ feature registers the lowest performance across most metrics.