[Question] GPEI when function values are very non-Gaussian, with extreme values #1946
Hi @andrejmuhic, my first instinct would be to do some kind of log transform of the observations to squish the distribution together and render the problem easier to model (note that this wouldn't change the location of the optimum due to the monotonicity of the transform). The issue is of course that you have negative values - do you have some kind of global lower bound? If you don't know this global value a priori, then one could in principle try to estimate it in each iteration from the model (e.g. as the minimum observed value minus some interquantile range, to avoid squishing things too much near zero), but afaik we don't have this hooked up in Ax at this point - so if you have some reasonable guess of a global lower bound, that would be the easiest way to go.
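For concreteness, here is a minimal stand-alone sketch of that shifted-log idea; the offset heuristic (minimum observed value minus an interquartile range) is the one described above, not an existing Ax feature:

```python
import numpy as np

def shifted_log(y: np.ndarray) -> np.ndarray:
    """Shift observations above zero using a data-driven offset, then take logs.
    Illustrative only: offset = min(y) - IQR(y), per the heuristic above."""
    q75, q25 = np.percentile(y, [75, 25])
    offset = y.min() - (q75 - q25)  # keeps y - offset strictly positive
    return np.log(y - offset)      # monotone, so the optimum's location is preserved
```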
Hi, thank you for your advice.
In general, yes, though you have to be careful to make sure that the transform is consistent for all the data. For instance, you can't just find the "best normalizing transform" for the data of trials 1-10, pass that transformed data to Ax, and then compute another "best normalizing transform" based on the data of trials 1-20 but only pass the transformed outcomes of trials 11-20 to Ax. The transformation has to be applied consistently across all data (and care has to be taken if there are things like outcome constraints or relative effects).

FWIW, we do some data transformation automatically in Ax (such as standardizing or winsorizing), but we don't automatically apply log transforms or other kinds of transforms (this is something we could do more of). If the form of your data transform doesn't change, you'll be safe. One option would be to collect a bunch of observations (or use those that you already have), compute the "best normalizing transform" based on those, create a new Ax experiment where you add these results as manual arms, and then use the same transform for all subsequent results obtained on any new trials suggested by Ax.
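A sketch of that "fit once, apply consistently" pattern, using sklearn's `PowerTransformer` as a stand-in for whatever normalizing transform is chosen (Yeo-Johnson handles negative values; none of this is Ax API):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Fit the normalizing transform ONCE, on an initial batch of observations.
initial_y = np.array([-120.0, -35.0, -2.0, 0.4, 1.1, 2.3])  # illustrative data
pt = PowerTransformer(method="yeo-johnson")
pt.fit(initial_y.reshape(-1, 1))

def transform_outcome(y: float) -> float:
    # Apply the same frozen transform to every subsequent observation before
    # passing it to Ax, so all trials see a consistent transform.
    return float(pt.transform(np.array([[y]]))[0, 0])
```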
Hi @Balandat,
Actually the package that I pasted finds the best transformation using cross-validation, which is kind of the right thing to do statistically.
Yes, I am aware that the same transformation needs to be applied to all the data.
Actually I had already coded this before I saw your post; I am just waiting until I have enough samples. We are using manual arms extensively already. Moreover, I am also adding diagnostics, quality of fit, variable importance, etc. I am glad that I asked this question, as I had started going into too much detail (like starting to read the botorch source). Thank you for chatting with me.
When it comes to Bayesian Optimization, we all do :)
Hi @Balandat, it took me some time.
I set things up along the lines you suggested, and I am also doing a final test on validation data. I am not sure how cross_validate works internally - I tried reading the source code - but couldn't this cause data peeking in some sense, if things are not refitted for every fold? I would just like to make sure that I am doing the right thing. The easiest thing for me would be to apply Winsorize (though one obviously cannot invert that one). I also started playing with https://research.facebook.com/blog/2021/7/high-dimensional-bayesian-optimization-with-sparsity-inducing-priors/ and https://twitter.com/SebastianAment/status/1720440669628514763. Thank you for your help, keep up the good work, I enjoyed reading the papers.
Yes, that is correct. The Ax paradigm generally is that it handles both transforming the data before passing it to the modeling & optimization layer and un-transforming the predictions returned by that layer. And yes, we usually construct a new model upon generating new points rather than updating an existing one (unless there is no new data, in which case we use the old one). I guess it would be possible to try to avoid that in order to cache some kernel matrices etc., but that would get quite messy really fast with all the complexities of transforms and so on. So our design choice here is to re-fit the transform every time there is new data, which is generally pretty low overhead relative to the actual GP fitting and acquisition function optimization.
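For reference, a sketch of the Ax cross-validation utilities mentioned above (tutorial-style usage; `model_bridge` is assumed to be a fitted Ax model bridge):

```python
from ax.modelbridge.cross_validation import cross_validate, compute_diagnostics

# Leave-one-out cross-validation of the fitted model bridge (folds=-1 means LOO);
# diagnostics include quality-of-fit measures (e.g. MAPE).
cv_results = cross_validate(model=model_bridge, folds=-1)
diagnostics = compute_diagnostics(cv_results)
```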
Yes, this should be fine - this setup gets you around shifting the data to positive values outside of Ax with a global offset (though you could still do that even with the above proposed transforms, if you remember to apply that shift at the end). Winsorization does work quite well in practice and we use it a lot ourselves, so this seems like a solid approach.
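As an illustration of what winsorization does to the raw outcomes (Ax has its own Winsorize transform that it can apply automatically; this stand-alone scipy version just shows the effect on toy data):

```python
import numpy as np
from scipy.stats.mstats import winsorize

y = np.array([-500.0, -40.0, -3.0, 0.5, 1.0, 2.0])
# Clip the lowest 20% of values upward and leave the top untouched, taming the
# fat left tail without moving values near the optimum.
y_w = np.asarray(winsorize(y, limits=[0.2, 0.0]))
print(y_w)  # the -500.0 outlier is pulled in to -40.0
```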
Great, happy to see you test this out - let us know how it goes! FWIW, there is quite a bit of work on robust GP(-like) regression, e.g. with Student-t processes or using ideas from generalized Bayesian inference. This is something that we hope to support in the future but don't have any concrete plans for at this point.
Hi @Balandat, one more important thing. Thank you for your help - I am getting better and better insight.
We take those trials into account as "pending points" when computing the acquisition value. Basically, we append the pending points to the candidate set and take the expectation of the acquisition value over the joint model posterior, so candidate generation accounts for trials that are still running and is discouraged from suggesting duplicates.
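At the BoTorch level this corresponds to the `X_pending` argument of the acquisition function; a minimal sketch with toy data (Ax wires this up automatically for running trials):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.acquisition import qNoisyExpectedImprovement

# Toy training data and a fitted single-task GP.
train_X = torch.rand(10, 2, dtype=torch.double)
train_Y = train_X.sum(dim=-1, keepdim=True)
model = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

# Inputs of trials that are still running; the acquisition value of new
# candidates is computed jointly with these, discouraging duplicates.
pending_X = torch.rand(3, 2, dtype=torch.double)
acqf = qNoisyExpectedImprovement(model=model, X_baseline=train_X, X_pending=pending_X)
value = acqf(torch.rand(1, 1, 2, dtype=torch.double))  # evaluate one q=1 candidate
```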
Yeah, if you're working with pending points then this should be expected. I guess we could consider reducing the logging of this in contexts where it is expected - but it's probably not a bad diagnostic to have (cc @saitcakmak)
Hi @Balandat,
This makes perfect sense.
Hi,
First, thank you for doing the hard work of maintaining a very helpful library.
I have one question, and I would be glad if someone could help me.
We are using standard GPEI for max_x f(x) with bounds on x, something like:
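(The original snippet was omitted here; the following is a hypothetical minimal Service API setup of the kind described, with illustrative parameter names and a toy objective:)

```python
from ax.service.ax_client import AxClient, ObjectiveProperties

def f(p):  # toy stand-in for the real objective being maximized
    return -((p["x1"] - 0.3) ** 2) - ((p["x2"] - 0.7) ** 2)

ax_client = AxClient()
ax_client.create_experiment(
    name="gpei_example",
    parameters=[
        {"name": "x1", "type": "range", "bounds": [0.0, 1.0]},
        {"name": "x2", "type": "range", "bounds": [0.0, 1.0]},
    ],
    objectives={"f": ObjectiveProperties(minimize=False)},
)

for _ in range(30):  # Sobol warmup followed by GP-based candidates, per the default strategy
    params, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data={"f": f(params)})
```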
The negative function values can be 50x bigger in absolute value than the positive ones.
So the distribution of outcome values f(x) is very fat-tailed and skewed.
It seems that our optimization gets stuck on this for a long time, as it appears to hurt the out-of-sample prediction quality of the process.
What is the recommended approach to this?
It is not obvious whether a different choice of kernel would help here; the Matern kernel seems pretty general already.
We tried truncating the negative values and restarting the optimization with some fresh Sobol steps, and that seems to help.
But it could be that we are missing something obvious.
Would adding a constraint on the outcome, f(x) > bound, be the more proper thing to do? The docs at https://botorch.org/docs/constraints seem to indicate so, and to my understanding this approach is already usable in Ax, since qNoisyExpectedImprovement is used under the hood. I am not sure how outcome constraints are handled in detail - I have just skimmed the ideas - but it seems like a promising approach. So is this the way to go, and is it more sample-efficient than truncating outcome values?
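For what it's worth, a hedged sketch of how an outcome constraint is declared in the Ax Service API (metric names here are illustrative; note that the constraint applies to a separately reported metric, not to the objective itself):

```python
from ax.service.ax_client import AxClient, ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="constrained_example",
    parameters=[{"name": "x1", "type": "range", "bounds": [0.0, 1.0]}],
    objectives={"f": ObjectiveProperties(minimize=False)},
    # Require the auxiliary metric g to stay above a bound; each completed
    # trial must then report raw data for both "f" and "g".
    outcome_constraints=["g >= -50.0"],
)
```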
Thank you for your help.