Train / evaluate multiple TF models in parallel #17

Open

bentsherman opened this issue May 29, 2019 · 3 comments

@bentsherman
The 3-layer MLP that we use generally does not utilize the entire GPU's bandwidth, which means that we might be able to run multiple models on the same GPU in parallel and get some speedup. I'm not sure if this is feasible with TensorFlow and its Graphs / Sessions, but I'm guessing that each MLP instance would probably need its own TF Graph and TF Session.
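A minimal sketch of what "each instance gets its own Graph and Session" could look like (TF 1.x style; the class, layer sizes, and method names here are hypothetical, not our actual MLP code). Because the session is bound to a private graph, two instances can coexist in one process without touching the global default graph:

```python
import tensorflow as tf

class IsolatedMLP:
    """Toy 3-layer MLP where each instance owns a private Graph and Session."""

    def __init__(self, n_inputs, n_hidden, n_classes):
        self.graph = tf.Graph()
        with self.graph.as_default():
            self.x = tf.placeholder(tf.float32, [None, n_inputs])
            self.y = tf.placeholder(tf.int64, [None])
            hidden = tf.layers.dense(self.x, n_hidden, activation=tf.nn.relu)
            logits = tf.layers.dense(hidden, n_classes)
            self.loss = tf.losses.sparse_softmax_cross_entropy(labels=self.y, logits=logits)
            self.train_op = tf.train.AdamOptimizer().minimize(self.loss)
            init = tf.global_variables_initializer()
        # The session is tied to this instance's graph, not the default graph.
        self.session = tf.Session(graph=self.graph)
        self.session.run(init)

    def partial_fit(self, x_batch, y_batch):
        # All ops resolve inside this instance's private graph/session.
        _, loss = self.session.run([self.train_op, self.loss],
                                   feed_dict={self.x: x_batch, self.y: y_batch})
        return loss
```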

Assuming that all works, in phase 1 we can add parallelism easily with the n_jobs parameter of cross_val_score(), and for phase 2 we'd probably have to do it ourselves with multiprocess.
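For phase 1 that would amount to something like the snippet below; `clf` stands in for a scikit-learn-compatible wrapper around our MLP, and `X`, `y` for the phase-1 data, so these names are placeholders:

```python
from sklearn.model_selection import cross_val_score

# Evaluate 5 folds, with up to 2 folds trained in parallel.
scores = cross_val_score(clf, X, y, cv=5, n_jobs=2)
print(scores.mean(), scores.std())
```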

@bentsherman

phase1-evaluate.py can now use multiple parallel jobs for cross validation. However, I don't think our MLP class is entirely thread-safe yet: I can only use up to 2 jobs, and with more than that I get errors. Even so, being able to run two MLPs in parallel will be a big improvement. On top of that, all of the sklearn classifiers can now use all CPU cores.

Since I'm not a TensorFlow expert, there's probably something wrong with my TensorFlow code. If we can't fix that, we might be able to use Keras, which hides all of the details about graphs and sessions. Alternatively, we might be able to specify a different parallel backend for cross_validate which uses multiprocess instead of multithreading.
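Switching the backend would look roughly like this (a sketch only; `clf`, `X`, `y` are placeholders as above, and joblib's built-in `'multiprocessing'` / `'loky'` backends both run workers in separate processes):

```python
from joblib import parallel_backend
from sklearn.model_selection import cross_validate

# Force a process-based backend instead of threads for the CV workers.
with parallel_backend('multiprocessing', n_jobs=4):
    results = cross_validate(clf, X, y, cv=5)
```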

@bentsherman

Using tf.keras didn't change anything; I get the same errors. For this feature to work we'll need to use multiprocess.

@bentsherman

As it turns out, the default parallel backend was already using process-based parallelism. So perhaps the issue is that the additional processes fail because they can't allocate GPU memory, since TensorFlow allocates the entire GPU by default.

So maybe we should actually use a multi-threading backend, and try to make all threads use the same context but different graphs? I don't know. Further investigation required.
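One thing worth trying before switching to a threading backend (a sketch, TF 1.x, and only an assumption about the cause): keep the process-based backend but stop each worker process from claiming the whole GPU, by enabling memory growth on its session:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing the whole device up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, hard-cap each process's share of the device:
# config.gpu_options.per_process_gpu_memory_fraction = 0.25
session = tf.Session(config=config)
```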
