Completed Multi-objective NAS experiments without metrics and repeated parameter selections #1704
Comments
Hmm, the fact that the results don't show up for many of the trials is very odd indeed. Can you share the logs generated during the run? Ideally you can also share the code that reproduces this example. What system are you running this on (OS, Python version, Ax version, etc.)? I faintly recall that we've had some issues in the past with reading from the torchx-generated log dirs, which could potentially be an explanation here. Can you check the generated log dir (see below) to see whether these trial results are getting logged (but Ax somehow doesn't / can't read them) or whether they don't get logged in the first place? A first-order approximation is to count how many folders are in that log directory. This is where the log temp dir is created (
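For that first-order check, something like this should do (a minimal sketch; the path is hypothetical and should be replaced with the temp dir printed in your run's logs):

```python
# Count the per-trial folders under the torchx-generated log directory.
# The path below is hypothetical; use the temp dir from your own run.
import os

log_dir = "/tmp/torchx_logs"
trial_dirs = [
    d for d in os.listdir(log_dir)
    if os.path.isdir(os.path.join(log_dir, d))
]
print(f"{len(trial_dirs)} trial folders: {sorted(trial_dirs)}")
```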
As to the repeated configs: it looks like the ones being repeated are the ones that don't show up with data. So my guess would be that the model really wants to explore those but doesn't get data and so keeps trying. Let's first figure out why we don't get the data; my guess is that the repetition issue then goes away.
Hi @Balandat, So, I ran the experiment 40 times to test your suggestion, and we can see that trial 26 and some others are failing to produce results. Those empty cells are NaN values in the created data frame, exported as empty cells into the .xlsx file. When I check the generated log directory, the per-trial folders and event files are there. So, I guess torchx is writing the files correctly, but I am not sure how I can read these files. If you have any suggestions, I can give them a try. Thanks! OS: Ubuntu 20.04
Interesting. The code that reads the tensorboard logs for consumption by Ax is here: https://github.com/facebook/Ax/blob/main/ax/metrics/tensorboard.py#L52-L90. Could you try running that code in a notebook for both a trial that returns data and one that doesn't? But make sure to change the logging level setting to something more verbose (this is set to
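If the linked code is awkward to run standalone, you can also inspect the raw event files directly with TensorBoard's event reader (a rough sketch, with hypothetical paths; this is not the Ax code linked above):

```python
# Inspect the scalars in a trial's TensorBoard log dir directly.
# Paths are hypothetical; point these at the per-trial folders from your run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def dump_scalars(trial_log_dir: str) -> None:
    acc = EventAccumulator(trial_log_dir)
    acc.Reload()  # read the event files from disk
    tags = acc.Tags().get("scalars", [])
    print(f"{trial_log_dir}: scalar tags = {tags}")
    for tag in tags:
        events = acc.Scalars(tag)
        print(f"  {tag}: {len(events)} points, last value = {events[-1].value}")

dump_scalars("/tmp/torchx_logs/trial_42")  # a trial that returned data
dump_scalars("/tmp/torchx_logs/trial_43")  # a trial that did not
```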
Hi @Balandat, Here I have 50 trials, and trial 43 is failing to produce results. Also, interestingly, the previous parameter selection in trial 42 is the same (32, 32, 70), even though that trial was successful. I got the following tensorboard logs for the successful and failing trials. So it looks like Ax finds the evaluation metrics but somehow cannot export them appropriately. I just do a simple data frame creation: Thank you for your help.
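Roughly along these lines (a sketch using the `exp_to_df` helper from the tutorial; `experiment` is assumed to be the Ax experiment object from the run):

```python
# Flatten the experiment results into a data frame and export to .xlsx.
# `experiment` is assumed to be the Ax experiment object from the NAS run.
from ax.service.utils.report_utils import exp_to_df

df = exp_to_df(experiment)
df.to_excel("nas_results.xlsx", index=False)  # NaN metrics become empty cells
```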
Interesting. I think I have a hunch - before calling
Hi @Balandat, I tried to add the suggested change. Top script: Is there any other debugging tip you can think of that I could try? Thank you.
Labeling this as a bug for now until we are able to investigate.
I have the same problem. I found that the problem seems to arise once a trial has been found that cannot be attached to a new client (with the same specifications as the old client) due to parameter constraints, "attached" in the sense described here (#1558 (comment)).
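For concreteness, here is a minimal sketch of what I mean by attaching; the search space, constraint, and values are all made up:

```python
# Attaching an existing parameterization to a fresh client. attach_trial
# validates the parameters against the new client's search space, so a
# parameterization that violates a constraint cannot be attached.
from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

client = AxClient()
client.create_experiment(
    name="reattach_example",
    parameters=[
        {"name": "feature_size", "type": "range", "bounds": [16, 128]},
        {"name": "num_points", "type": "range", "bounds": [32, 256]},
    ],
    parameter_constraints=["feature_size <= num_points"],
    objectives={"val_acc": ObjectiveProperties(minimize=False)},
)
# Satisfies the constraint -> attaches fine:
params, idx = client.attach_trial(parameters={"feature_size": 32, "num_points": 70})
# Violates the constraint -> raises an error instead of attaching:
# client.attach_trial(parameters={"feature_size": 100, "num_points": 50})
```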
Hi there, and sorry you're running into this issue! I'm not totally sure what's going on here, and I have been unable to repro using the linked MOO NAS tutorial, but in the meantime I'm wondering if you might be able to cut the Gordian knot by specifying `should_deduplicate` on your generation strategy.
@bernardbeckerman thanks for the tip with `should_deduplicate`, that partially solved the problem for me. I don't get repeated trials anymore, but there are still out-of-design trials suggested, like in this issue here: #1568.
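For anyone landing here, enabling deduplication looks roughly like this (a sketch; the models and trial counts are illustrative, not the tutorial's exact strategy):

```python
# Sketch of a generation strategy with deduplication enabled on the
# Bayesian optimization step. Models and trial counts are illustrative.
from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models

gs = GenerationStrategy(
    steps=[
        GenerationStep(model=Models.SOBOL, num_trials=5),
        GenerationStep(
            model=Models.BOTORCH_MODULAR,
            num_trials=-1,            # no limit on BO trials
            should_deduplicate=True,  # re-draw candidates that duplicate earlier arms
        ),
    ]
)
```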
Hi @lena-kashtelyan, Here are the code snippets. I had to black out some confidential lines. I no longer have access to the data used in this code, so I cannot try anything new, but I hope it helps with your debugging.
Hi @ekurtgl -- unfortunately I was not able to reproduce this using Ax v0.3.4. This could be for any number of reasons, including that our TensorboardMetric code has undergone some changes in recent months that may have inadvertently fixed this bug. If you run into this issue in the future and are able to produce a minimal working reproduction, don't hesitate to reopen this task and we will resume the investigation.
Hi,
I followed the multi-objective NAS tutorial, and I was able to implement the NAS framework for my own problem. In my problem, I have two objectives, `num_params` and `val_acc`, and three parameters: `n_points` (a choice parameter), and `feature_size` and `num_points` (range parameters). My trials run without any error and they all look completed in the resulting experiment data frame, but many of them are missing the resulting metrics, and the same parameter selections are repeated many times during the model search, for example (32, 32, 128), as can be seen below. When I try this parameter selection manually, the model runs just fine. What might be the reason for such behavior during the search process? Thank you.
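For context, the setup was along these lines, in Service-API form (a sketch: only the parameter and metric names come from my problem; the bounds, choice values, and objective directions here are placeholders):

```python
# Sketch of the experiment setup. Bounds, choice values, and objective
# directions are placeholders; only the names match the real problem.
from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="moo_nas_example",
    parameters=[
        {"name": "n_points", "type": "choice", "values": [32, 64, 128]},
        {"name": "feature_size", "type": "range", "bounds": [16, 128]},
        {"name": "num_points", "type": "range", "bounds": [32, 256]},
    ],
    objectives={
        "num_params": ObjectiveProperties(minimize=True),
        "val_acc": ObjectiveProperties(minimize=False),
    },
)
```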