
Completed Multi-objective NAS experiments without metrics and repeated parameter selections #1704

Closed
ekurtgl opened this issue Jul 6, 2023 · 13 comments
Labels: bug (Something isn't working), requires repro or more info

ekurtgl commented Jul 6, 2023

Hi,

I followed the multi-objective NAS tutorial, and I was able to implement the NAS framework for my own problem. I have two objectives, num_params and val_acc, and three parameters: n_points (a choice parameter), plus feature_size and num_points (range parameters). My trials run without any error and all appear completed in the resulting experiment data frame, but many of them have no resulting metrics, and the same parameter selections, for example (32, 32, 128), are repeated many times during the model search, as can be seen below:

[screenshot: experiment data frame with repeated parameterizations such as (32, 32, 128) and missing metric values]

When I try this parameter selection manually, the model runs just fine. What might be the reason for such behavior during the search process? Thank you.

ekurtgl changed the title from "Completed Mukti-objective NAS experiments without metrics" to "Completed Multi-objective NAS experiments without metrics and repeated parameter selections" on Jul 6, 2023
Balandat (Contributor) commented Jul 7, 2023

Hmm, the fact that the results don't show up for many of the trials is very odd indeed. Can you share the logs generated during the run? Ideally, can you also share code that reproduces the problem?

What system are you running this on (OS, Python version, Ax version, etc.)? I faintly recall that we've had some issues in the past with reading from the torchx-generated log dirs; that could potentially be an explanation here.

Can you check in the generated log dir (see below) whether these trial results are getting logged (but Ax somehow doesn't / can't read them), or whether they don't get logged in the first place? A first-order check is how many folders end up in that log dir.

This is where the log temp dir is created (log_dir is a str with the path):

import tempfile

# Make a temporary dir to log our results into
log_dir = tempfile.mkdtemp()
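
The first-order check could look like this (a minimal sketch; log_dir is the path created above, with one subdirectory per trial index as in the tutorial):

import os

# List the per-trial subdirectories torchx wrote under log_dir; a missing
# folder would point at logging, rather than reading, as the culprit
trial_dirs = sorted(os.listdir(log_dir))
print(len(trial_dirs), trial_dirs)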

As to the repeated configs: it looks like the ones being repeated are exactly the ones that don't show up with data. So my guess would be that the model really wants to explore those but doesn't get data back, and so it keeps trying. Let's first figure out why we don't get the data; my guess is that this issue then goes away.

ekurtgl (Author) commented Jul 10, 2023

Hi @Balandat,

So, I ran the experiment for 40 trials to test your suggestion, and we can see that trial 26 and some others fail to produce results:

[screenshot: experiment data frame over 40 trials; trial 26 and several others have empty metric cells]

Those empty cells are NaN values in the created data frame and exported as empty cells into the .xlsx file.

When I check out the log_dir, it looks like it has all the folders for each trial, and the folders are not empty even for the failing trials:

[screenshots: contents of log_dir showing a non-empty folder for every trial, including the failing ones]

So, I guess torchx is writing the files correctly, but I am not sure how I can read these files. If you have any suggestions, I can give them a try. Thanks!

OS: Ubuntu 20.04
Python: 3.10.11
Ax: 0.3.3

Balandat (Contributor) commented:

Interesting. The code that reads the tensorboard logs for consumption by Ax is here: https://github.com/facebook/Ax/blob/main/ax/metrics/tensorboard.py#L52-L90

The path input arg to get_tb_from_posix for a trial is going to be Path(log_dir).joinpath(str(trial.index)).as_posix() (as defined in MyTensorboardMetric in the tutorial).

Could you try running that code in a notebook for both a trial that returns data and one that doesn't? Make sure to change the logging level to something more verbose first (it is set to CRITICAL here: https://github.com/facebook/Ax/blob/main/ax/metrics/tensorboard.py#L27), e.g. logging.getLogger("tensorboard").setLevel(logging.DEBUG). That should hopefully provide some insight.
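
Something along these lines (a minimal sketch, assuming the Ax 0.3.x tensorboard metric module; the log_dir path and trial indices are placeholders):

import logging
from pathlib import Path

from ax.metrics.tensorboard import get_tb_from_posix

# Surface the debug output that the default CRITICAL level suppresses
logging.getLogger("tensorboard").setLevel(logging.DEBUG)

log_dir = "/tmp/tmpXXXXXXXX"  # placeholder: the tempfile.mkdtemp() dir

# Compare a trial that returned data (42) with one that did not (43)
for trial_index in (42, 43):
    path = Path(log_dir).joinpath(str(trial_index)).as_posix()
    print(trial_index, get_tb_from_posix(path))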

ekurtgl (Author) commented Jul 12, 2023

Hi @Balandat,

Here I have 50 trials, and trial 43 fails to produce results:

[screenshot: experiment data frame over 50 trials; trial 43 has empty metric cells]

Also, interestingly, the parameter selection in the previous trial 42 is the same, (32, 32, 70), even though that trial was successful. I got the following tensorboard logs for the successful and failing trials:

[screenshot: tensorboard debug logs for the successful and failing trials]

So it looks like it finds the evaluation metrics but somehow cannot export them appropriately. I just do a simple data frame creation:

[screenshot: the data frame creation code]

Thank you for your help.

Balandat (Contributor) commented:

> So it looks like it finds the evaluation metrics but somehow cannot export them appropriately. I just do a simple data frame creation:

Interesting. I think I have a hunch: before calling exp_to_df, can you call experiment.fetch_data()? exp_to_df only calls lookup_data (without fetching data for trials that don't have data attached), so it could be that, at the generation of the df, the data was never actually fetched in the first place.
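
That is, something like this (a minimal sketch; experiment is the Ax Experiment object from the tutorial):

from ax.service.utils.report_utils import exp_to_df

# Actively fetch metric data for all trials, rather than relying on
# exp_to_df's lookup of data that is already attached
experiment.fetch_data()
df = exp_to_df(experiment)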

ekurtgl (Author) commented Jul 13, 2023

Hi @Balandat,

I tried adding .fetch_data() both within report_utils and before calling the exp_to_df function, but neither solved the problem:

[screenshot: data frame still missing metrics after adding fetch_data]

Top script:

[screenshot: top-level script]

Is there any other debugging tip you can think of that I could try? Thank you.

lena-kashtelyan assigned Balandat and unassigned Balandat Jul 24, 2023
lena-kashtelyan added the bug (Something isn't working) label Jul 26, 2023
lena-kashtelyan (Contributor) commented:

Labeling this as a bug for now until we are able to investigate.

ga92xug commented Aug 31, 2023

I have the same problem. I found that it seems to arise once a trial has been generated that cannot be attached to a new client (with the same specifications as the old client) due to parameter constraints, "attached" in the sense described in #1558 (comment).
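
For concreteness, that attach pattern looks roughly like this (a minimal sketch; the parameter names, bounds, constraint, and objective are illustrative, not from the original code):

from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

client = AxClient()
client.create_experiment(
    name="reattach_demo",
    parameters=[
        {"name": "feature_size", "type": "range", "bounds": [16, 128]},
        {"name": "num_points", "type": "range", "bounds": [16, 128]},
    ],
    parameter_constraints=["feature_size <= num_points"],
    objectives={"val_acc": ObjectiveProperties(minimize=False)},
)

# Attaching an old trial's parameterization can raise here if it violates
# the new client's parameter constraints (64 > 32 breaks the constraint)
params, trial_index = client.attach_trial(
    parameters={"feature_size": 64, "num_points": 32}
)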

bernardbeckerman (Contributor) commented:

Hi there, and sorry you're running into this issue! I'm not totally sure what's going on here, and I have been unable to repro using the linked MOO NAS tutorial, but in the meantime I'm wondering if you might be able to cut the Gordian knot by specifying should_deduplicate=True in your call to choose_generation_strategy, so that these duplicate parameterizations are never suggested. @ekurtgl @ga92xug, would either of you be able to give that a try, and/or provide an example that might reproduce this issue locally for me?
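
That call looks roughly like this (a minimal sketch; it assumes the experiment's search space has already been built, as in the tutorial):

from ax.modelbridge.dispatch_utils import choose_generation_strategy

# Ask the dispatcher never to re-suggest a parameterization it has
# already generated
gs = choose_generation_strategy(
    search_space=experiment.search_space,
    should_deduplicate=True,
)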

ga92xug commented Sep 22, 2023

@bernardbeckerman thanks for the tip with should_deduplicate; that partially solved the problem for me. I don't get repeated trials anymore, but out-of-design trials are still suggested, as in this issue: #1568.

lena-kashtelyan (Contributor) commented:

Hi @ekurtgl, @ga92xug, could one or both of you paste full code snippets of your Ax optimizations? We're not able to reproduce this and thus cannot get started on solving it.

ekurtgl (Author) commented Oct 4, 2023

Hi @lena-kashtelyan ,

Here are the code snippets. I had to black out some confidential lines.

[screenshots: code snippets with confidential lines blacked out]

I no longer have access to the data used in this code, so I cannot try anything new, but I hope it helps in your debugging process.

mpolson64 (Contributor) commented:

Hi @ekurtgl -- unfortunately I was not able to reproduce this using Ax v0.3.4. This could be for any number of reasons, including that our TensorboardMetric code has undergone some changes in recent months that may have inadvertently fixed this bug. If you run into this issue in the future and are able to produce a minimal working reproduction, don't hesitate to reopen this task and we will resume the investigation.
