
Completed Multi-objective NAS experiments without metrics and repeated parameter selections #1704

Closed
ekurtgl opened this issue Jul 6, 2023 · 13 comments
Labels: bug (Something isn't working), requires repro or more info

ekurtgl commented Jul 6, 2023

Hi,

I followed the multi-objective NAS tutorial, and I was able to implement the NAS framework for my own problem. I have two objectives, num_params and val_acc, and three parameters: n_points (a choice parameter), plus feature_size and num_points (range parameters). My trials run without any error and all appear completed in the resulting experiment data frame, but many of them have no resulting metrics, and the same parameter selections, for example (32, 32, 128), are repeated many times during the model search, as can be seen below:

[screenshot: experiment data frame with repeated parameterizations such as (32, 32, 128) and missing metric values]

When I try this parameter selection manually, the model runs just fine. What might be the reason for such behavior during the search process? Thank you.

ekurtgl changed the title from "Completed Mukti-objective NAS experiments without metrics" to "Completed Multi-objective NAS experiments without metrics and repeated parameter selections" on Jul 6, 2023
Balandat (Contributor) commented Jul 7, 2023

Hmm, the fact that the results don't show up for many of the trials is very odd indeed. Can you share the logs generated during the run? Ideally, can you also share code that reproduces the problem?

What system are you running this on (OS, Python version, Ax version, etc.)? I faintly recall that we've had some issues in the past with reading from the torchx-generated log dirs; that could potentially be an explanation here.

Can you check in the generated log dir (see below) whether these trial results are getting logged (but Ax somehow doesn't / can't read them), or whether they don't get logged in the first place? A first-order check is how many folders end up in that log dir.

This is where the log temp dir is created (log_dir is a str with the path):

import tempfile

# Make a temporary dir to log our results into
log_dir = tempfile.mkdtemp()
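
The first-order check could look like this (a minimal sketch; log_dir is the path created above, with one subdirectory per trial index as in the tutorial):

import os

# List the per-trial subdirectories torchx wrote under log_dir; a missing
# folder would point at logging, rather than reading, as the culprit
trial_dirs = sorted(os.listdir(log_dir))
print(len(trial_dirs), trial_dirs)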

As to the repeated configs: it looks like the ones being repeated are exactly the ones that don't show up with data. So my guess would be that the model really wants to explore those but doesn't get data back, and so it keeps trying. Let's first figure out why we don't get the data; my guess is that this issue then goes away.

ekurtgl (Author) commented Jul 10, 2023

Hi @Balandat,

So, I ran the experiment for 40 trials to test your suggestion, and we can see that trial 26 and some others fail to produce results:

[screenshot: experiment data frame over 40 trials; trial 26 and several others have empty metric cells]

Those empty cells are NaN values in the created data frame and exported as empty cells into the .xlsx file.

When I check out the log_dir, it looks like it has all the folders for each trial, and the folders are not empty even for the failing trials:

[screenshots: contents of log_dir showing a non-empty folder for every trial, including the failing ones]

So, I guess torchx is writing the files correctly, but I am not sure how I can read these files. If you have any suggestions, I can give them a try. Thanks!

OS: Ubuntu 20.04
Python: 3.10.11
Ax: 0.3.3

Balandat (Contributor) commented:

Interesting. The code that reads the tensorboard logs for consumption by Ax is here: https://github.com/facebook/Ax/blob/main/ax/metrics/tensorboard.py#L52-L90

The path input arg to get_tb_from_posix for a trial is going to be Path(log_dir).joinpath(str(trial.index)).as_posix() (as defined in MyTensorboardMetric in the tutorial).

Could you try running that code in a notebook for both a trial that returns data and one that doesn't? Make sure to change the logging level to something more verbose first (it is set to CRITICAL here: https://github.com/facebook/Ax/blob/main/ax/metrics/tensorboard.py#L27), e.g. logging.getLogger("tensorboard").setLevel(logging.DEBUG). That should hopefully provide some insight.
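
Something along these lines (a minimal sketch, assuming the Ax 0.3.x tensorboard metric module; the log_dir path and trial indices are placeholders):

import logging
from pathlib import Path

from ax.metrics.tensorboard import get_tb_from_posix

# Surface the debug output that the default CRITICAL level suppresses
logging.getLogger("tensorboard").setLevel(logging.DEBUG)

log_dir = "/tmp/tmpXXXXXXXX"  # placeholder: the tempfile.mkdtemp() dir

# Compare a trial that returned data (42) with one that did not (43)
for trial_index in (42, 43):
    path = Path(log_dir).joinpath(str(trial_index)).as_posix()
    print(trial_index, get_tb_from_posix(path))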

ekurtgl (Author) commented Jul 12, 2023

Hi @Balandat,

Here I have 50 trials, and trial 43 fails to produce results:

[screenshot: experiment data frame over 50 trials; trial 43 has empty metric cells]

Also, interestingly, the parameter selection in the previous trial 42 is the same, (32, 32, 70), even though that trial was successful. I got the following tensorboard logs for the successful and failing trials:

[screenshot: tensorboard debug logs for the successful and failing trials]

So it looks like it finds the evaluation metrics but somehow cannot export them appropriately. I just do a simple data frame creation:

[screenshot: the data frame creation code]

Thank you for your help.

Balandat (Contributor) commented:

> So it looks like it finds the evaluation metrics but somehow cannot export them appropriately. I just do a simple data frame creation:

Interesting. I think I have a hunch: before calling exp_to_df, can you call experiment.fetch_data()? exp_to_df only calls lookup_data (without fetching data for trials that don't have data attached), so it could be that, at the generation of the df, the data was never actually fetched in the first place.
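
That is, something like this (a minimal sketch; experiment is the Ax Experiment object from the tutorial):

from ax.service.utils.report_utils import exp_to_df

# Actively fetch metric data for all trials, rather than relying on
# exp_to_df's lookup of data that is already attached
experiment.fetch_data()
df = exp_to_df(experiment)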

ekurtgl (Author) commented Jul 13, 2023

Hi @Balandat,

I tried adding .fetch_data() both within report_utils and before calling the exp_to_df function, but neither solved the problem:

[screenshot: data frame still missing metrics after adding fetch_data]

Top script:

[screenshot: top-level script]

Is there any other debugging tip you can think of that I could try? Thank you.

lena-kashtelyan assigned Balandat and unassigned Balandat Jul 24, 2023
lena-kashtelyan added the bug (Something isn't working) label Jul 26, 2023
lena-kashtelyan (Contributor) commented:

Labeling this as a bug for now until we are able to investigate.

ga92xug commented Aug 31, 2023

I have the same problem. I found that it seems to arise once a trial has been generated that cannot be attached to a new client (with the same specifications as the old client) due to parameter constraints, "attached" in the sense described in #1558 (comment).
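
For concreteness, that attach pattern looks roughly like this (a minimal sketch; the parameter names, bounds, constraint, and objective are illustrative, not from the original code):

from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

client = AxClient()
client.create_experiment(
    name="reattach_demo",
    parameters=[
        {"name": "feature_size", "type": "range", "bounds": [16, 128]},
        {"name": "num_points", "type": "range", "bounds": [16, 128]},
    ],
    parameter_constraints=["feature_size <= num_points"],
    objectives={"val_acc": ObjectiveProperties(minimize=False)},
)

# Attaching an old trial's parameterization can raise here if it violates
# the new client's parameter constraints (64 > 32 breaks the constraint)
params, trial_index = client.attach_trial(
    parameters={"feature_size": 64, "num_points": 32}
)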

bernardbeckerman (Contributor) commented:

Hi there, and sorry you're running into this issue! I'm not totally sure what's going on here, and I have been unable to repro using the linked MOO NAS tutorial, but in the meantime I'm wondering if you might be able to cut the Gordian knot by specifying should_deduplicate=True in your call to choose_generation_strategy, so that these duplicate parameterizations are never suggested. @ekurtgl @ga92xug, would either of you be able to give that a try, and/or provide an example that might reproduce this issue locally for me?
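
That call looks roughly like this (a minimal sketch; it assumes the experiment's search space has already been built, as in the tutorial):

from ax.modelbridge.dispatch_utils import choose_generation_strategy

# Ask the dispatcher never to re-suggest a parameterization it has
# already generated
gs = choose_generation_strategy(
    search_space=experiment.search_space,
    should_deduplicate=True,
)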

ga92xug commented Sep 22, 2023

@bernardbeckerman thanks for the tip with should_deduplicate; that partially solved the problem for me. I don't get repeated trials anymore, but out-of-design trials are still suggested, as in this issue: #1568.

lena-kashtelyan (Contributor) commented:

Hi @ekurtgl, @ga92xug, could one or both of you paste full code snippets of your Ax optimizations? We're not able to reproduce this and thus cannot get started on solving it.

ekurtgl (Author) commented Oct 4, 2023

Hi @lena-kashtelyan ,

Here are the code snippets. I had to black out some confidential lines.

[screenshots: code snippets with confidential lines blacked out]

I no longer have access to the data used in this code, so I cannot try anything new, but I hope it helps in your debugging process.

mpolson64 (Contributor) commented:

Hi @ekurtgl -- unfortunately I was not able to reproduce this using Ax v0.3.4. This could be for any number of reasons, including that our TensorboardMetric code has undergone some changes in recent months that may have inadvertently fixed this bug. If you run into this issue in the future and are able to produce a minimal working reproduction, don't hesitate to reopen this task and we will resume the investigation.
