ElasticDL features for large scale recommendation #2156

Open
backyes opened this issue Jul 10, 2020 · 2 comments
backyes commented Jul 10, 2020

Great work on improving TensorFlow on Kubernetes to make it easy to develop large-scale training systems. :-D

After reading some tutorials, we found that ElasticDL introduces a new parameter server (PS) architecture and distributed framework, and we would like the ElasticDL team to clarify a few design considerations.

A large-scale recommendation system requires several features from the training system:

  • efficiently handle large-scale embeddings under distributed training, which requires the parameter servers and the DL framework to support sparse SGD updates (what are ElasticDL's features for large-scale embedding training?);
  • stay compatible with the DL framework's API for handling large-scale embeddings, so that most models in the model zoo work well.

How does ElasticDL address these requirements?

wangkuiyi (Collaborator) commented

ElasticDL can handle very large models using its general-purpose parameter server written in Go. It is based on the design we explained at Google Developer Day 2019, with many performance improvements.

@QiJune I think @backyes's question is a very inspiring hint -- we should add a benchmark showing ElasticDL's capability to support large models.

QiJune (Collaborator) commented Jul 11, 2020

@backyes Thank you for your interest!

ElasticDL supports large embedding tables and also supports sparse SGD updates.

An embedding table is sharded across several PS instances. In the forward pass, workers pull embedding vectors from the PS. In the backward pass, workers push embedding gradients (as TensorFlow's IndexedSlices data structure) to the PS, and the sparse gradients are then applied to the embedding table on the PS.
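Here is a minimal, self-contained Python sketch of that flow. The modulo sharding and the `pull`/`push_sparse_grads` names are illustrative assumptions, not ElasticDL's actual API (the real implementation lives in the Go PS and its RPC interface):

```python
import numpy as np

NUM_PS = 4     # hypothetical number of PS instances
EMBED_DIM = 8  # embedding vector size

# Each PS instance holds one shard of the table: {embedding_id: vector}.
ps_shards = [{} for _ in range(NUM_PS)]

def shard_of(embedding_id):
    # Route an embedding ID to a PS instance, here by simple modulo hashing.
    return embedding_id % NUM_PS

def pull(ids):
    # Forward pass: workers pull only the embedding vectors they need,
    # lazily initializing rows the first time an ID is seen.
    return {
        i: ps_shards[shard_of(i)].setdefault(
            i, np.random.uniform(-0.05, 0.05, EMBED_DIM))
        for i in ids
    }

def push_sparse_grads(indices, grad_values, lr=0.1):
    # Backward pass: workers push (indices, values) pairs -- the same data
    # an IndexedSlices carries -- and each PS applies a sparse SGD update
    # that touches only the rows that actually received gradients.
    for i, grad in zip(indices, grad_values):
        ps_shards[shard_of(i)][i] -= lr * grad

# Example: pull two vectors, then push gradients back for the same IDs.
vecs = pull([3, 42])
push_sparse_grads([3, 42], [np.ones(EMBED_DIM), np.ones(EMBED_DIM)])
```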

For more design details, please refer to the parameter server and high-performance PS design documents.

For more implementation details, please refer to the Go PS code base and the RPC interface.

ElasticDL also integrates well with the TensorFlow API.

Users program their models with tf.keras.layers.Embedding directly; ElasticDL supports the native TensorFlow Keras API.

Before training, ElasticDL substitutes elasticdl.layers.embedding for the embedding layer. This is transparent to users.
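As an illustration, the model below is an ordinary Keras model; the architecture itself is a hypothetical example, not taken from ElasticDL. Per the substitution described above, ElasticDL would replace its Embedding layer before training without any change to this code:

```python
import tensorflow as tf

def custom_model():
    # A plain Keras model; no ElasticDL-specific code is needed.
    inputs = tf.keras.Input(shape=(10,), dtype=tf.int64)
    # ElasticDL replaces this layer with elasticdl.layers.embedding before
    # training, so the large embedding table can live on the PS instances.
    x = tf.keras.layers.Embedding(input_dim=1_000_000, output_dim=16)(inputs)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```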

@wangkuiyi Thank you for the advice. Yes, we could run an experiment with a recommendation model that has large embedding tables.
